The potential for using thread-level data speculation to facilitate automatic parallelization

Citation: Gregory Steffan, Todd Mowry (1998). The potential for using thread-level data speculation to facilitate automatic parallelization. IEEE Symposium on High-Performance Computer Architecture (HPCA).
DOI (original publisher): 10.1109/HPCA.1998.650541
Download: https://ieeexplore.ieee.org/document/650541
Tagged: Computer Science, Computer Architecture

Summary

Applications need to exploit parallelism to unlock performance on multicore architectures. Automatically parallelizing compilers are ineffective because they can rarely prove that two iterations of a loop are statically independent. Instead, the authors present Thread-Level Data Speculation (TLDS), which exploits that parallelism for "free" with architectural and compiler support. The compiler needs only know that the two iterations are probably independent. The two iterations proceed in parallel, speculating that they are independent, and the architecture supports validating the speculation and squashing efficiently in the case where they are not.

Theoretical and Practical Relevance

Instruction-level parallelism is more often extracted through out-of-order superscalar execution than through TLDS. Perhaps TLDS could be used in concert with such techniques to extract even more ILP.


Problem

  1. In the future, chips will have more cores but not necessarily faster cores.
  2. Applications need to exploit parallelism to take advantage of futuristic chips.
    1. Rewriting old applications to support parallelism is tedious and error-prone; even new applications are often not written with scalable parallelism in mind.
    2. Prior methods of automatically extracting parallelism face diminishing returns:
      • Wider instruction issue only multiplies parallelism by a small factor (around 6×); making issue even wider is expensive, and data hazards limit the parallelism that can be extracted.
      • Automatically parallelizing compilers have to prove that two potentially parallel threads never access the same memory (i.e., are independent), so they only work on code with simple access patterns.
  3. So, how do we automatically extract parallelism from single-threaded codes with complex access patterns?

Solution

  • Thread-Level Data Speculation (TLDS): let loads return data from memory, speculating that other threads are independent, but reserve the right to squash-and-retry the thread AND all in-flight subsequent threads if they are not.
    • While compilers can rarely prove that two threads are certainly independent, it is far easier to show that they are likely independent.
    • Note: this concerns memory dependences (A doesn't store to any address that B loads), not register dependences, which are handled separately.
    • E.g., an application performs sequential writes to a big hash table. Usually the writes are to different keys, and hence independent, but this is not guaranteed. TLDS can parallelize this by speculatively assuming the keys are different while maintaining correctness when they are the same (see the sketch after this list).
  • Speculative region: some target to parallelize (like a loop).
  • Epoch: a sequential iteration within the speculative region, to be turned into a parallel thread.
  • This will decrease performance when:
    • there are read-after-write data dependencies through memory (within the speculative window). Empirically, one can select speculative regions where this is rare.
    • synchronization/communication latency is high
  • Some register dependences may still remain between epochs (e.g., iteration i+1 might depend on iteration i for some result). The architecture can forward data to resolve known dependences between epochs at either a fine grain or a coarse grain.
    • Coarse-grain is from the end of one epoch to the start of another
    • Fine-grain is from the store of one epoch to the load of another
    • The critical path through forwarded values determines an upper bound on speedup, tighter than the bound from Amdahl's Law (see the formula after this list).
      • It is computed by summing, over each epoch in the speculative region, the compute time until the forwarded value is ready plus the communication latency.
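
As a concrete illustration of the hash-table example above, here is a minimal C sketch of the kind of loop TLDS targets; the table layout, hash function, and all names are illustrative assumptions, not code from the paper:

#include <stddef.h>

#define TABLE_SIZE 4096

static int table[TABLE_SIZE];

/* Toy hash function; the particular hash is irrelevant to the argument. */
static size_t hash(int key) {
    return (size_t)key % TABLE_SIZE;
}

void update_all(const int *keys, const int *values, size_t n) {
    /* Each iteration would become one epoch under TLDS. A compiler cannot
     * statically prove hash(keys[i]) != hash(keys[j]) for i != j, so it
     * must assume the iterations conflict; TLDS runs them in parallel
     * speculatively and squashes only on an actual collision. */
    for (size_t i = 0; i < n; i++) {
        table[hash(keys[i])] += values[i];
    }
}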
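
One way to state the critical-path bound in symbols (the notation here is assumed, not the paper's): let epoch e spend c_e cycles of computation before its forwarded value is stored, and let forwarding take \ell cycles of communication latency. Then, over the E epochs of a speculative region,

\[
T_{\mathrm{critical}} = \sum_{e=1}^{E} \left( c_e + \ell \right),
\qquad
\text{speedup} \le \frac{T_{\mathrm{sequential}}}{T_{\mathrm{critical}}}.
\]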

Architectural Support

  • Communicate through shared L2.
  • Piggy-back cache coherence to detect epoch independence violations.
    • While speculative, invalidation requests carry the writer's epoch number. If the invalidation comes from a logically earlier epoch and the line has been speculatively loaded, a violation has occurred (see the sketch after the pseudo-code below).
    • If a speculatively loaded line is evicted, we can no longer detect violations on it; we must conservatively assume speculation failed.
    • Since this is tracked at line granularity, false sharing can cause spurious violations.
    • We need word-granularity "dirty" bits to avoid write-after-read and write-after-write false-sharing hazards.
  • Piggy-back cache coherence to do flushing.
    • Speculatively stored lines are discarded when speculation fails and flushed to memory when it succeeds.
    • What if the working-set of an epoch doesn't fit in the private cache? Even if it does, what about associativity mapping conflicts?
      • This probably requires at least some associativity (e.g., 2-way) and a victim cache.
  • Each thread executes the following pseudo-code for epoch i:
set epoch number to i
become speculative              // can be delayed until the first possibly-dependent access
execute epoch body for i
wait until this is the oldest epoch   // acquire-consistency
become non-speculative
if (speculation succeeded) {
    write back speculative stores
} else {
    squash child threads
    re-execute epoch body for i // now non-speculative, so guaranteed to succeed
}
commit speculatively forwarded values
make child the oldest epoch     // release-consistency
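
A minimal C sketch of the violation check described above, piggy-backed on coherence invalidations; the struct, field, and function names are assumptions for illustration, not the paper's hardware interface:

#include <stdbool.h>

/* Hypothetical per-cache-line speculative state. */
typedef struct {
    bool speculatively_loaded;  /* set when this epoch speculatively loads the line */
    /* ... tag, data, word-granularity dirty bits, etc. ... */
} line_state;

/* Called when a processor running speculatively receives an invalidation.
 * The invalidation carries the writing epoch's number; a read-after-write
 * violation has occurred only if a logically earlier epoch wrote a line
 * that this (logically later) epoch has already speculatively loaded. */
bool invalidation_causes_violation(const line_state *line,
                                   unsigned writer_epoch,
                                   unsigned my_epoch) {
    return line->speculatively_loaded && writer_epoch < my_epoch;
}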

Compiler support

  • Need to identify candidates for speculative regions.
    • Places that are expensive and can be broken up into epochs that are probably independent.
    • Epochs should be large enough that they amortize the cost of thread management, but small enough that their working set fits in the L1 cache + victim cache.
    • Done by hand in this paper; profiling could probably automate it.
  • Optimizations
    • Minimize dependences (e.g., calculate the loop-induction variable without referencing the previous N epochs, where N is the number of processors); see the sketch after this list.
    • Schedule instructions so that loads of forwarded values land as late as possible and the corresponding stores as early as possible.
      • Doesn't this seem contradictory? Within an epoch the chain is probably load, compute, store, so you can't move the load down without also moving the store down, and vice versa for moving them up.
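
A minimal C sketch of the induction-variable optimization mentioned above; the function and parameter names are illustrative assumptions:

/* Dependent form: each epoch needs the previous epoch's i, creating a
 * serial chain across epochs. */
int next_i_dependent(int prev_i, int stride) {
    return prev_i + stride;
}

/* Independent form: each epoch derives its own i from its epoch number,
 * eliminating the cross-epoch register dependence. */
int i_for_epoch(int base_i, int stride, int epoch_num) {
    return base_i + epoch_num * stride;
}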

Evaluation

  • SPEC92, SPEC95, and NAS Parallel benchmark suites. Details in technical report.
  • Trace-based simulator for the MIPS architecture under IRIX.
    • Perfectly-pipelined single-issue processor with single-cycle instructions.

Future Work

  • Could a simpler implementation of this use hardware transactional memory?
  • Need an automated way of identifying good candidates for speculative regions and epochs.

Critiques

  • They present TLDS as an alternative to superscalar processors, but never do a head-to-head comparison.
  • Their simulator might be too simplistic.
    • Data forwarding would put pressure on load reservation units.
    • Non-TLDS code would benefit from a previous iteration bringing data into the cache; their simulation of TLDS would have to lose that benefit in order to be fair.
    • Why did they not just use a micro-architectural simulation?
  • Can this idea be implemented cheaply on top of existing transactional memory systems?
  • Does it make sense for the compiler to mark those accesses in an epoch which _are_ independent? They don't have to be checked at runtime.
  • The compilation steps could be done on the fly using dynamic profiling data.