GPUs and the Future of Parallel Computing
From AcaWiki
Citation: Stephen W. Keckler, William J. Dally, Brucek Khailany, Michael Garland, David Glasco (2011/10/17) GPUs and the Future of Parallel Computing. IEEE Micro
DOI (original publisher): 10.1109/MM.2011.89
Semantic Scholar (metadata): 10.1109/MM.2011.89
Sci-Hub (fulltext): 10.1109/MM.2011.89
Internet Archive Scholar (search for fulltext): GPUs and the Future of Parallel Computing
Download: https://ieeexplore.ieee.org/document/6045685
Tagged: Computer Science
computer architecture
Summary
Problem
- Clock-rate scaling has hit diminishing returns, so further performance gains must come from multicore parallelism.
- There are many kinds of multicore. GPUs offer {a large number of simple cores, thousands of parallel fine-grained threads, and large memory bandwidth}.
- But GPUs face inherent difficulties: {energy efficiency is hard, memory bandwidth is increasing at a decreasing rate, and parallel programming is hard}.
- Energy-efficiency:
- Modern CPUs have branch prediction, out-of-order execution, and large caches because, when those techniques were conceived, transistors and power were plentiful and performance was the primary goal. Now those priorities are flipped: energy is the scarce resource.
- Much of the energy budget goes to moving data to and from memory. Minimize this data movement by improving locality at the algorithm, compiler, and architectural levels.
- Potential solutions: in-order cores, instruction registers, shallower pipelines, register-file caching
- Memory bandwidth:
- Bandwidth impacts performance on memory-bound applications.
- Bandwidth impacts power.
- Potential solutions: through-silicon vias, 3D stacking, NUMA, data compression, scratch pads.
- Programmability:
- Programming languages should expose locality to the developer (essentially NUMA-awareness), rather than pretending memory is flat.
- Need relaxed memory models, possibly dynamic.
- No good way to manage tens of thousands of threads.
- Need to support heterogeneous architectures, dynamically.
Solution
- NVIDIA Echelon
- CPU and GPU on the same die.
- Unified virtual memory.
- Have separate latency-optimized cores and throughput-optimized cores.
- Throughput-optimized cores
- 8-lane MIMD or SIMT (decided dynamically).
- Long-instruction-word execution
- Register-file caching
- Multi-level scheduling (basically SMT)
- Latency-optimized cores
- Basically traditional CPUs
- Memory system
- Scratch-pads
- Software-directed caching
- Programming models
- Unified memory
- "Selective" coherence
- Thread migration (how?)
- Arbitrary thread synchronization (as opposed to traditional thread-block synchronization)