GPUs and the Future of Parallel Computing
From AcaWiki
Citation: Stephen W. Keckler, William J. Dally, Brucek Khailany, Michael Garland, David Glasco (2011/10/17) GPUs and the Future of Parallel Computing. IEEE Micro
DOI (original publisher): 10.1109/MM.2011.89
Semantic Scholar (metadata): 10.1109/MM.2011.89
Sci-Hub (fulltext): 10.1109/MM.2011.89
Internet Archive Scholar (search for fulltext): GPUs and the Future of Parallel Computing
Download: https://ieeexplore.ieee.org/document/6045685
Tagged: Computer Science
computer architecture
Summary
Problem
- Clock-rate scaling has hit diminishing returns, so further performance gains must come from multicore parallelism.
- There are many kinds of multicore. GPUs offer {a large number of simple cores, thousands of parallel fine-grained threads, and large memory bandwidth}.
- But GPUs face inherent difficulties: {energy efficiency is hard, memory bandwidth is increasing at a decreasing rate, and parallel programming is hard}.
- Energy-efficiency:
- Modern CPUs have branch prediction, out-of-order execution, and large caches because, when those techniques were conceived, transistors and power were plentiful and performance was the primary goal. Now those priorities are flipped: energy is the scarce resource.
- Much of the energy budget goes to moving data to and from memory. Minimize this data movement by improving locality at the algorithm, compiler, and architectural levels.
- Potential solutions: in-order cores, instruction registers, shallower pipelines, register-file caching
- Memory bandwidth:
- Bandwidth impacts performance on memory-bound applications.
- Bandwidth impacts power.
- Potential solutions: through-silicon vias, 3D stacking, NUMA, data compression, scratch pads.
- Programmability:
- Programming languages should expose locality to the developer (essentially NUMA-awareness), rather than pretending memory is flat.
- Need relaxed memory models, possibly dynamic.
- No good way to manage tens of thousands of threads.
- Need to support heterogeneous architectures, dynamically.
Solution
- NVIDIA Echelon
- CPU and GPU on the same die.
- Unified virtual memory.
- Have separate latency-optimized cores and throughput-optimized cores.
- Throughput-optimized cores
- 8-lane MIMD or SIMT (decided dynamically).
- Long-instruction-word execution
- Register-file caching
- Multi-level scheduling (basically SMT)
- Latency-optimized cores
- Basically traditional CPUs
- Memory system
- Scratch-pads
- Software-directed caching
- Programming models
- Unified memory
- "Selective" coherence
- Thread migration (how?)
- Arbitrary thread synchronization (as opposed to traditional thread-block synchronization)