GPUs and the Future of Parallel Computing

From AcaWiki

Citation: Stephen W. Keckler, William J. Dally, Brucek Khailany, Michael Garland, David Glasco (2011/10/17) GPUs and the Future of Parallel Computing. IEEE Micro
DOI (original publisher): 10.1109/MM.2011.89
Tagged: Computer Science, computer architecture

  • Clock-rate scaling has hit diminishing returns; further performance gains have to come from multicore parallelism.
  • Lots of different kinds of multicore. GPUs present {a large number of simple cores, thousands of parallel fine-grained threads, and large memory bandwidth}.
  • But they have inherent difficulties: {energy efficiency is hard, memory bandwidth is increasing at a decreasing rate, and parallel programming is hard}.
    • Energy-efficiency:
      • Modern CPUs have branch prediction, out-of-order execution, and large caches because, when those techniques were conceived, power was plentiful and performance was the primary goal. Now those priorities are flipped: power is the scarce resource.
      • A lot of energy is spent reading from and writing to memory. Try to minimize this data movement with better cache use at the algorithm, compiler, and architectural levels.
      • Potential solutions: in-order cores, instruction registers, shallower pipelines, register-file caching
    • Memory bandwidth:
      • Bandwidth impacts performance on memory-bound applications.
      • Bandwidth impacts power.
      • Potential solutions: through-silicon vias, 3D stacking, NUMA, data compression, scratch pads.
    • Programmability:
      • Programming languages should make locality visible to the developer (essentially, make programs NUMA-aware).
      • Need relaxed memory models, possibly dynamic.
      • No good way to manage tens of thousands of threads.
      • Need to support heterogeneous architectures, dynamically.
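The energy-efficiency argument above can be sketched numerically. The per-operation costs in this sketch are assumed, illustrative values; only their ordering matters (an off-chip DRAM access costs orders of magnitude more energy than an arithmetic operation, which is the paper's point):

```python
# Illustrative (assumed) per-operation energy costs in picojoules.
# The exact figures are placeholders; the ordering (DRAM >> on-chip
# SRAM >> arithmetic) is what drives the argument.
ENERGY_PJ = {
    "fp_op": 50,         # one double-precision floating-point operation
    "sram_read": 25,     # 64-bit read from a small on-chip SRAM
    "dram_read": 10_000, # 64-bit read from off-chip DRAM
}

def kernel_energy(flops, sram_reads, dram_reads):
    """Total energy (pJ) for a kernel with the given operation mix."""
    return (flops * ENERGY_PJ["fp_op"]
            + sram_reads * ENERGY_PJ["sram_read"]
            + dram_reads * ENERGY_PJ["dram_read"])

# A kernel that streams every operand from DRAM...
naive = kernel_energy(flops=1_000, sram_reads=0, dram_reads=2_000)
# ...versus one that stages data on chip and reuses it there.
blocked = kernel_energy(flops=1_000, sram_reads=2_000, dram_reads=100)
print(naive, blocked)  # data movement, not arithmetic, dominates both budgets
```

Even with generous reuse, the remaining DRAM traffic still dwarfs the arithmetic energy, which is why the paper pushes movement-reducing techniques at every level.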
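The memory-bandwidth point can be made concrete with a roofline-style estimate: attainable performance is capped either by peak compute or by bandwidth times the kernel's arithmetic intensity (flops per byte moved). The machine numbers below are illustrative assumptions, not figures from the paper:

```python
def attainable_gflops(peak_gflops, bandwidth_gbs, flops_per_byte):
    """Roofline model: performance is capped by compute or by bandwidth."""
    return min(peak_gflops, bandwidth_gbs * flops_per_byte)

# Illustrative machine: 1000 GFLOP/s peak compute, 150 GB/s DRAM bandwidth.
PEAK, BW = 1000.0, 150.0

# A streaming kernel doing ~0.25 flops per byte moved is bandwidth-bound:
stream = attainable_gflops(PEAK, BW, 0.25)   # 37.5 GFLOP/s, far below peak
# A blocked matrix multiply at ~32 flops per byte is compute-bound:
gemm = attainable_gflops(PEAK, BW, 32.0)     # 1000.0 GFLOP/s, hits peak
print(stream, gemm)
```

Because peak compute grows faster than bandwidth, the crossover intensity keeps rising, so more and more applications land on the bandwidth-limited side of the roofline.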
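As a sketch of the locality-exposing, hierarchical model the programmability bullets call for, the two-level grid-of-blocks decomposition used by GPU programming models can be mimicked in plain Python (the names here are illustrative, not a real API):

```python
# Sketch of a two-level thread hierarchy (a grid of blocks of threads),
# the structure GPU models use to make locality explicit and to keep
# tens of thousands of threads manageable.
def launch(grid_dim, block_dim, kernel, *args):
    """Run `kernel` once per (block, thread) pair, like a GPU kernel launch."""
    for block in range(grid_dim):
        shared = {}                       # models a block's fast shared storage
        for thread in range(block_dim):
            kernel(block, thread, shared, *args)

def add_one(block, thread, shared, data, block_dim):
    i = block * block_dim + thread        # global index from the hierarchy
    if i < len(data):                     # guard the ragged final block
        data[i] += 1

data = list(range(10))
launch(3, 4, add_one, data, 4)
print(data)  # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```

The hierarchy is what makes locality programmable: threads within a block can share fast storage, while blocks stay independent and schedulable in any order.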


  • NVIDIA Echelon
  • CPU and GPU on the same die.
  • Unified virtual memory.
  • Have separate latency-optimized cores and throughput-optimized cores.
    • Throughput-optimized cores
      • 8-lane MIMD or SIMT (decided dynamically).
      • Long instruction words
      • Register-file caching
      • Multi-level scheduling (basically SMT)
    • Latency-optimized cores
      • Basically traditional CPUs
  • Memory system
    • Scratch-pads
    • Software-directed caching
  • Programming models
    • Unified memory
    • "Selective" coherence
    • Thread migration (how?)
    • Arbitrary thread synchronization (as opposed to traditional thread-block synchronization)
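The dynamic MIMD-or-SIMT choice in the throughput-optimized cores can be illustrated from the SIMT side: eight lanes share one instruction stream, and a mask disables the lanes that took the other side of a branch (in MIMD mode each lane would instead fetch its own stream). A purely illustrative sketch:

```python
# SIMT sketch: one instruction is issued for all 8 lanes at once, and a
# per-lane mask handles branch divergence. Illustrative only.
LANES = 8

def simt_abs(values):
    """Compute abs() across 8 lanes, handling divergence with a mask."""
    out = list(values)
    neg_mask = [v < 0 for v in values]   # lanes taking the 'then' path
    # Issue the 'then' side once, applied only to lanes whose mask bit is set:
    for lane in range(LANES):
        if neg_mask[lane]:
            out[lane] = -out[lane]
    # Lanes with the mask bit clear simply keep their value (the 'else' path).
    return out

print(simt_abs([3, -1, 4, -1, 5, -9, 2, -6]))  # [3, 1, 4, 1, 5, 9, 2, 6]
```

SIMT amortizes one fetch/decode over all lanes, which is cheap when lanes agree; switching to MIMD when they diverge heavily is the trade-off Echelon proposes to make dynamically.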
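The scratch-pad / software-directed-caching idea amounts to staging data movement explicitly in software rather than relying on a hardware cache to guess. A minimal sketch, with an assumed scratch-pad capacity:

```python
# Sketch of a software-managed scratch-pad: the program (or compiler)
# explicitly copies a tile into a small fast buffer, reuses it, and moves
# on. Capacity and names are illustrative assumptions.
SCRATCH_WORDS = 4   # capacity of the (assumed) on-chip scratch-pad

def tile_sums(global_mem):
    """Sum the array tile by tile, staging each tile through the scratch-pad."""
    sums = []
    for start in range(0, len(global_mem), SCRATCH_WORDS):
        scratch = global_mem[start:start + SCRATCH_WORDS]  # explicit copy-in
        sums.append(sum(scratch))                          # reuse on-chip data
    return sums

print(tile_sums(list(range(8))))  # [6, 22]
```

Unlike a cache, the scratch-pad never speculates: every byte moved was requested by software, which saves both tag-check energy and wasted fills.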
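The contrast between arbitrary and thread-block synchronization can be sketched with ordinary CPU threads: any chosen subset of threads rendezvous at a shared barrier object while the rest never block. `threading.Barrier` here merely stands in for whatever hardware mechanism Echelon would provide:

```python
import threading

# Sketch of "arbitrary" synchronization: an arbitrary subset of threads
# meets at a shared barrier, instead of an all-or-nothing block-wide
# barrier. Illustrative only.
results = []
barrier = threading.Barrier(2)   # only threads 1 and 3 participate

def worker(tid):
    if tid in (1, 3):            # the chosen subset synchronizes...
        barrier.wait()
    results.append(tid)          # ...while the other threads never block

threads = [threading.Thread(target=worker, args=(t,)) for t in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(sorted(results))  # [0, 1, 2, 3]
```

A block-wide barrier would force all four threads to the same rendezvous; letting any subset synchronize is what makes producer-consumer patterns expressible across blocks.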