Online Modeling and Tuning of Parallel Stream Processing Systems
Citation: Jonathan Curtis Beard (2015) Online Modeling and Tuning of Parallel Stream Processing Systems. Engineering and Applied Science Theses & Dissertations.
DOI (original publisher): 10.7936/K7W37TKJ
Download: https://openscholarship.wustl.edu/eng_etds/125/
Tagged: Computer Science, parallel systems
Summary
Introduction (Ch. 1)
- Power wall and memory wall drive developers towards parallel heterogeneous systems.
- Other kinds of parallelism (SIMD, ILP, automatic compiler parallelization, explicit threading) won't necessarily help on their own.
- Parallel programming in general is hard, but stream processing could be an accessible way to exploit parallelism.
- Need a framework for stream-processing that adapts to the workload (RaftLib).
Contributions
- RaftLib
- Apply gain/loss maximum flow to a Jackson queueing network.
- Show that a modified Lévy ("mLévy") distribution fits execution times well.
- Time kernels in isolation.
- Use ML techniques to switch between models.
- Dynamically optimize application at runtime.
Background and Related Work (Ch. 2)
- Lots of prior work on dataflow processors. Only tangentially relevant.
- Lots of prior work on dataflow frameworks (Apache Storm, Apache Beam), but they have a centralized data broker.
- ScalaPipe (Scala) and StreamJIT (Java) exist, but not for C++.
- Kendall notation (A/S/c: arrival process / service-time distribution / number of servers); e.g., M/M/1 means Poisson arrivals, exponential service, one server.
- RaftLib has online analysis, unlike perf.
- RaftLib has queueing-related metrics, unlike DTrace, Pin, Valgrind, Paradyn, or Scalasca.
- JITs take only local, not system-level, information into account.
RaftLib Streaming Library (Ch. 3)
- Write parallel programs without worrying about races.
- Running different kernels concurrently gives you task parallelism.
- Duplicating a kernel gives you data parallelism.
- A C++ library (rather than a new language) was chosen for wide applicability.
- Application is split into kernels, which read from and write to named event streams (see the sketch after this list).
- In general, mapping kernels to compute cores is NP-hard. Instead of optimizing outright, use heuristics.
- Local scheduling is round-robin. Future work: improve.
- Optimal queue size is determined either by branch-and-bound search or by analytic modeling.
- RaftLib measures: queue occupancy, service rate, and throughput
- Supports SIMD processing in kernels.
- Kernel doesn't care if stream is over TCP or in-memory.
- Performance compares favorably with GNU Parallel and Apache Spark.
- However, the comparison didn't use the same algorithm/implementation in each case, so it's unclear where the speed-up is coming from.
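To make the kernel/stream model concrete, here is a minimal sketch of two kernels joined by a bounded in-memory stream. This is generic C++ illustrating the concept, not RaftLib's actual API; the Stream class, its capacity, and the close/pop protocol are all illustrative assumptions.

```cpp
// Minimal sketch of the kernel/stream idea (illustrative only; NOT RaftLib's
// API). Two kernels run as threads and communicate through a bounded,
// thread-safe in-memory stream.
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <optional>
#include <queue>
#include <thread>

template <typename T>
class Stream {                       // bounded FIFO between two kernels
public:
    explicit Stream(std::size_t cap) : cap_(cap) {}
    void push(T v) {
        std::unique_lock<std::mutex> lk(m_);
        not_full_.wait(lk, [&]{ return q_.size() < cap_ || closed_; });
        q_.push(std::move(v));
        not_empty_.notify_one();
    }
    std::optional<T> pop() {         // empty optional == stream ended
        std::unique_lock<std::mutex> lk(m_);
        not_empty_.wait(lk, [&]{ return !q_.empty() || closed_; });
        if (q_.empty()) return std::nullopt;
        T v = std::move(q_.front());
        q_.pop();
        not_full_.notify_one();
        return v;
    }
    void close() {                   // producer signals end-of-stream
        std::lock_guard<std::mutex> lk(m_);
        closed_ = true;
        not_empty_.notify_all();
    }
private:
    std::queue<T> q_;
    std::size_t cap_;
    bool closed_ = false;
    std::mutex m_;
    std::condition_variable not_empty_, not_full_;
};

int main() {
    Stream<int> s(64);               // queue capacity is a tuning knob (Ch. 8)
    std::thread producer([&]{        // "source" kernel
        for (int i = 0; i < 100; ++i) s.push(i * i);
        s.close();
    });
    std::thread consumer([&]{        // "sink" kernel: task parallelism
        while (auto v = s.pop()) std::printf("%d\n", *v);
    });
    producer.join();
    consumer.join();
}
```

Because the stream is mutex-guarded, starting a second consumer thread on the same Stream is exactly the kernel-duplication (data-parallelism) knob described above.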
Modeling Streaming Applications (Ch. 4)
- Use an M/M/1 queueing model (Poisson arrivals, exponential service times, one server); the standard quantities are worked just after this list.
- Flow is conserved.
- Information gain/loss is conserved.
- Flow must not be greater than capacity.
- Volume in equals volume out.
- Queues cannot be infinite.
- Other assumptions
- Assume that two kernels scheduled on the same core take twice as long to execute.
- This may not be true if hyperthreading is enabled.
- Application is in equilibrium/steady-state.
- Data volume in/out is measurable.
- Kernels are non-blocking.
- Data-routing does not depend on the system's state.
- All kernels are work-conserving (a kernel never idles while it has input available).
- Validation:
- Synthetic applications: randomly generated DAGs
- Real applications: JPEG and triple-DES
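For concreteness, here are the textbook M/M/1 quantities behind this model, plus the gain/loss conservation constraint written out. The symbol \gamma_i for a kernel's output-to-input ratio is my notation for illustration, not necessarily the thesis's.

```latex
% Textbook M/M/1 quantities for one queue: arrival rate \lambda,
% service rate \mu, utilization \rho (stability requires \rho < 1).
\rho = \frac{\lambda}{\mu}
% Mean items in system, and mean time in system via Little's law:
L = \frac{\rho}{1 - \rho}, \qquad W = \frac{L}{\lambda} = \frac{1}{\mu - \lambda}
% Gain/loss flow conservation across kernel i: if kernel i emits
% \gamma_i output items per input item, then
\lambda_{\mathrm{out},i} = \gamma_i \, \lambda_{\mathrm{in},i}
```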
Best Case Execution Time Variation (Ch. 5)
- Distribution of best-case time:
- Prior work assumes a Gaussian or Gumbel distribution for the best case.
- A Lévy distribution seems to fit better, empirically; it is not perfect, but better than a Gaussian (its form is shown just after this sub-list).
- But the Lévy distribution has an infinite mean and variance, so truncate it.
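For reference, the standard (untruncated) Lévy density that gets truncated; the thesis's exact parameterization may differ.

```latex
% Levy density with location \mu and scale c, supported on x > \mu:
f(x;\mu,c) = \sqrt{\frac{c}{2\pi}} \,
             \frac{e^{-c/(2(x-\mu))}}{(x-\mu)^{3/2}}
% The right tail decays like x^{-3/2}, so the mean and variance diverge;
% truncating the support at a finite x_max restores finite moments.
```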
- How to time?
- OS wall-clock timers: too slow, since call overhead swamps short kernels.
- Why not just scale up workload?
- Timestamp counter: assume no thread migrations, and turn off DVFS (a timing sketch follows this list).
- Read the timestamp counter in a dedicated thread, off the critical path.
- "Frequency-scaling is turned off for the time update thread": how can DVFS be disabled for just that thread?
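A minimal sketch of the dedicated-timer-thread idea: one thread spins reading the x86 timestamp counter and publishes it through an atomic, so worker threads get timestamps with a plain load. The use of __rdtscp and the busy-wait update loop are assumptions about mechanism, not the thesis's exact implementation.

```cpp
// Sketch: a dedicated timer thread publishes TSC-based timestamps so that
// worker threads read the current "time" with one atomic load, keeping
// serializing timer instructions off the critical path. x86-only
// (__rdtscp); thread pinning and DVFS handling are assumed elsewhere.
#include <atomic>
#include <cstdint>
#include <cstdio>
#include <thread>
#include <x86intrin.h>

std::atomic<std::uint64_t> now_tsc{0};  // timestamp shared with workers
std::atomic<bool> running{true};

void timer_thread() {
    unsigned aux;                       // receives the core id from rdtscp
    while (running.load(std::memory_order_relaxed)) {
        now_tsc.store(__rdtscp(&aux), std::memory_order_release);
    }
}

int main() {
    std::thread t(timer_thread);
    // A worker "kernel" samples start/stop cheaply instead of calling the OS:
    const std::uint64_t start = now_tsc.load(std::memory_order_acquire);
    for (volatile int i = 0; i < 1000000; ++i) {}  // stand-in for real work
    const std::uint64_t stop = now_tsc.load(std::memory_order_acquire);
    std::printf("elapsed ticks: %llu\n",
                static_cast<unsigned long long>(stop - start));
    running.store(false, std::memory_order_relaxed);
    t.join();
}
```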
Dynamic Instrumentation (Ch. 6)
- How can the important quantities be measured without perturbing execution time?
- Their kernels are really tiny (~10 µs), so instrumentation has to be careful. Why not use bigger kernels?
- Measurements (a sampling sketch follows this list):
- Throughput: Sample queue reads/writes over time.
- Queue size: Sample periodically.
- Service rate
- Validated against real applications.
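A hedged sketch of the low-overhead sampling idea: kernels only increment atomic counters, and a monitor thread snapshots them periodically to derive throughput and occupancy. The counter names and the 100 ms period are illustrative assumptions, not the thesis's instrumentation.

```cpp
// Sketch: periodic sampling of queue counters by a monitor thread. Kernels
// only bump relaxed atomic counters; the monitor snapshots them to derive
// throughput and occupancy. Illustrative only.
#include <atomic>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <thread>

struct QueueStats {
    std::atomic<std::uint64_t> pushes{0};  // bumped on every stream write
    std::atomic<std::uint64_t> pops{0};    // bumped on every stream read
};

void monitor(QueueStats& s, std::atomic<bool>& run) {
    std::uint64_t last_pops = 0;
    while (run.load()) {
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
        const std::uint64_t po = s.pops.load();    // read pops first so the
        const std::uint64_t pu = s.pushes.load();  // pushes snapshot is >= po
        const double throughput = (po - last_pops) / 0.1;  // items per second
        last_pops = po;
        std::printf("occupancy=%llu throughput=%.0f items/s\n",
                    static_cast<unsigned long long>(pu - po), throughput);
    }
}

int main() {
    QueueStats stats;
    std::atomic<bool> run{true};
    std::thread m(monitor, std::ref(stats), std::ref(run));
    for (int i = 0; i < 1000000; ++i) {            // stand-in kernel activity
        stats.pushes.fetch_add(1, std::memory_order_relaxed);
        stats.pops.fetch_add(1, std::memory_order_relaxed);
    }
    std::this_thread::sleep_for(std::chrono::milliseconds(250));
    run.store(false);
    m.join();
}
```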
Model Selection (Ch. 7)
- Decide between the M/M/1 and M/D/1 models with a neural net or SVM; a simple heuristic stand-in is sketched below.
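The thesis selects models with a trained classifier; as intuition for why the two models are separable, here is a hedged stand-in using the coefficient of variation (CV) of sampled service times: exponential service (M/M/1) gives CV near 1, deterministic service (M/D/1) gives CV near 0. The 0.5 threshold is an arbitrary illustrative choice, not the thesis's method.

```cpp
// Sketch: pick a queueing model from sampled service times via the
// coefficient of variation (CV). Exponential service (M/M/1) has CV ~ 1;
// deterministic service (M/D/1) has CV ~ 0. A stand-in for the thesis's
// learned (SVM / neural-net) classifier, not a reproduction of it.
#include <cmath>
#include <cstdio>
#include <numeric>
#include <vector>

const char* pick_model(const std::vector<double>& service_times) {
    const double n    = static_cast<double>(service_times.size());
    const double mean = std::accumulate(service_times.begin(),
                                        service_times.end(), 0.0) / n;
    double var = 0.0;
    for (double t : service_times) var += (t - mean) * (t - mean);
    var /= n;
    const double cv = std::sqrt(var) / mean;
    return (cv > 0.5) ? "M/M/1" : "M/D/1";  // threshold chosen arbitrarily
}

int main() {
    std::vector<double> nearly_constant{9.9, 10.0, 10.1, 10.0, 9.8, 10.2};
    std::printf("%s\n", pick_model(nearly_constant));  // prints M/D/1
}
```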
Online Tuning: Putting It All Together (Ch. 8)
- Adaptation types: change buffer size, modify placement of compute kernels, duplicate kernels
- If a queue is over a threshold consistently, increase its size and duplicate the kernel (sketched after this list).
- If a queue is under a threshold consistently, decrease its size.
- Randomly move kernels.
- Initial allocation of kernels to hardware is done by Scotch partitioning framework, minimizing communication.
- These online adaptations get close to optimal performance when tested against exhaustive search; however, the exhaustive search used small bounds to ensure it terminated.
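A hedged sketch of the threshold-plus-hysteresis policy in the bullets above; the occupancy thresholds, the streak length, and the doubling/halving policy are illustrative assumptions:

```cpp
// Sketch: hysteresis-based buffer resizing. A queue must breach a threshold
// on several consecutive samples before we act, to avoid reacting to
// transient bursts. All constants are illustrative only.
#include <cstddef>
#include <cstdio>

struct QueueTuner {
    std::size_t capacity;
    int high_streak = 0, low_streak = 0;

    static constexpr double kHigh   = 0.9;  // occupancy fraction: grow
    static constexpr double kLow    = 0.1;  // occupancy fraction: shrink
    static constexpr int    kStreak = 5;    // consecutive samples required

    void sample(std::size_t occupancy) {
        const double frac = static_cast<double>(occupancy) / capacity;
        high_streak = (frac > kHigh) ? high_streak + 1 : 0;
        low_streak  = (frac < kLow)  ? low_streak + 1 : 0;
        if (high_streak >= kStreak) {
            capacity *= 2;                   // grow; could also clone kernel
            high_streak = 0;
            std::printf("grew queue to %zu\n", capacity);
        } else if (low_streak >= kStreak && capacity > 1) {
            capacity /= 2;                   // shrink to reclaim memory
            low_streak = 0;
            std::printf("shrank queue to %zu\n", capacity);
        }
    }
};

int main() {
    QueueTuner t{64};
    for (int i = 0; i < 10; ++i) t.sample(63);  // persistently nearly full
}
```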
Conclusions and Future Work (Ch. 9)
- More complex execution modeling (cache awareness)
- Compressed buffers
- Compute in memory