Online Modeling and Tuning of Parallel Stream Processing Systems

From AcaWiki
Jump to: navigation, search

Citation: Jonathan Curtis Beard (2015/15/08) Online Modeling and Tuning of Parallel Stream Processing Systems. Engineering and Applied Science Theses & Dissertations (RSS)
DOI (original publisher): 10.7936/K7W37TKJ
Semantic Scholar (metadata): 10.7936/K7W37TKJ
Sci-Hub (fulltext): 10.7936/K7W37TKJ
Internet Archive Scholar (search for fulltext): Online Modeling and Tuning of Parallel Stream Processing Systems
Download: etds/125/
Tagged: Computer Science (RSS) parallel systems (RSS)




Introduction (Ch. 1)

  • Power wall and memory wall drive developers towards parallel heterogeneous systems.
  • Other kinds of parallelism (SIMD, ILP, compiler automatic parallelization, explicit threading) won't necessarily help.
  • Parallel programming in general is hard, but stream-processing could be an easy way of using parallelism.
  • Need a framework for stream-processing that adapts to the workload (RaftLib).


  1. RaftLib
  2. Apply gain/loss maximum flow to a Jackson queuing network.
  3. Show that the mLevy distribution is good for execution times.
  4. Time kernels in isolation.
  5. Use ML techniques to switch between models.
  6. Dynamically optimize application at runtime.

Background and Related Work (Ch. 2)

  • Lots of prior work on dataflow processors. Only tangentially relevant.
  • Lots of prior work on dataflow frameworks (Apache Storm, Apache Beam), but they have a centralized data broker.
  • ScalaPipe and StreamJIT exist, but not for C++.
  • Kendal notation
  • RaftLib has online analysis, unlike perf.
  • RaftLib has queueing-related metrics unlike DTrace, Pin, Valgrind, paradyn, Scalasca.
  • JITs only take into account local not system-level information.

RaftLib Streaming Library (Ch. 3)

  • Write parallel programs without worrying about races.
    • Different kernels gives you task-parallelism.
    • Duplicating kernels gives you data-parallelism.
  • Chooses library-in-C++ to have wide applicability.
  • Application split into kernels, which can read and write to named event streams.
  • In general, mapping kernels to compute cores is NP-hard. Instead of optimizing outright, use heuristics.
  • Local scheduling is round-robin. Future work: improve.
  • Optimal queue size: branch-and-bound or analytic modeling?
  • RaftLib measures: queue occupancy, service rate, and throughput
  • Supports SIMD processing in kernels.
  • Kernel doesn't care if stream is over TCP or in-memory.
  • Performance is favorable compared to GNU parallel and Apache Spark
    • However, they didn't use the same algorithm/implementation in each case.
    • Where is the speed-up coming from?

Modeling Streaming Applications (Ch. 4)

  • Use an M/M/1 Queueing model.
    • Flow is conserved.
    • Information gain/loss is conserved.
    • Flow must not be greater than capacity.
    • Volume in equals volume out.
    • Queues cannot be infinite.
  • Other assumptions
    • Assume that two kernels scheduled on the same core take twice as long to execute.
      • This may not be true if hyperthreading is enabled.
    • Application is in equilibrium/steady-state.
    • Data volume in/out is measurable.
    • Kernels are non-blocking.
    • Data-routing does not depend on the system's state.
    • All kernels are work conserving (same amount of work on all processors).
  • Validation:
    • Synthetic applications: randomly generated DAGs
    • Real applications: JPEG and triple-DES

Best Case Execution Time Variation (Ch. 5)

  • Distribution of best-case time:
    • Prior work assumes Gaussian or Gumbel distribution for best-case
    • Levy distribution seems to fit better, empirically. It is not perfect, but better than Gaussian.
    • But Levy distribution has infinite moments, so truncate Levy distribution.
  • How to time?
    • OS wall-clock: too slow
      • Why not just scale up workload?
    • Timestamp counter: assume no migrations, turn off DVFS
      • Use timestamp counter inside a thread, not on the critical path.
      • "Frequency-scaling is turned off for the time update thread" How can they disable DVFS for just that thread?

Dynamic Instrumentation (Ch. 6)

  • Trying to measure important quantities without disrupting execution-time?
    • Their kernels are really tiny, ~10us, so have to be careful. Why not use bigger kernels?
  • Measurements:
    • Throughput: Sample queue reads/writes over time.
    • Queue size: Sample periodically.
    • Service rate
  • Validated against real applications.

Model Selection (Ch. 7)

  • Decide between M/M/1 and M/D/1 with a neural net or SVM.

Online Tuning: Putting It All Together (Ch. 8)

  • Adaptation types: change buffer size, modify placement of compute kernels, duplicate kernels
    • If queue is over threshold consistently, increase queue size, and duplicate kernel.
    • If queue is under threshold consistently, decrease.
    • Randomly move kernels.
    • Initial allocation of kernels to hardware is done by Scotch partitioning framework, minimizing communication.
  • These online adaptations get close to optimal performance, testing against exhaustive search. However, exhaustive search had small bounds to ensure its termination.

Conclusions and Future Work (Ch. 9)

  • More complex execution modeling (cache awareness)
  • Compressed buffers
  • Compute in memory