EOLE: Paving the Way for an Effective Implementation of Value Prediction
Citation: Arthur Perais, André Seznec (2014/06/18). EOLE: Paving the Way for an Effective Implementation of Value Prediction. Annual International Symposium on Computer Architecture (ISCA).
DOI (original publisher): 10.1109/ISCA.2014.6853205
Download: https://ieeexplore.ieee.org/abstract/document/6853205
Tagged: Computer Science, Computer Architecture
Summary
The authors present value prediction (VP) as a way to extract more instruction-level parallelism (ILP) and thereby improve single-threaded performance. Conventional VP requires complex, power-hungry circuitry, so the authors propose a simpler design that piggy-backs on existing out-of-order (OoO) hardware.
Theoretical and Practical Relevance
Value prediction remains a research idea, not yet commercially exploited.
Problem
- Computers need higher single-threaded performance (even in multicore era).
- To do this, architectures need to extract more ILP.
- Traditional approaches (increasing the ROB size, speculative scheduling, speculative replay to recover from misspeculation, increasing issue width) have a deleterious impact on power consumption.
- An alternative approach is VP: predicting an instruction's result so that dependent operations can execute before their operands are actually computed (see the sketch after this list). Previously, VP has been too complex to be practical.
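To make the mechanism concrete for myself, here is a minimal sketch of a last-value predictor gated by saturating confidence counters. This illustrates the general VP concept only; it is not the predictor the paper uses, and every name and parameter here is my own invention.

```python
# Toy last-value predictor: remember each static instruction's last
# result and only predict once it has repeated a few times in a row.

class LastValuePredictor:
    def __init__(self, table_size=4096, threshold=3):
        self.table_size = table_size
        self.threshold = threshold   # confidence needed before predicting
        self.table = {}              # index -> (last_value, confidence)

    def predict(self, pc):
        """Return a predicted result, or None if confidence is too low."""
        value, confidence = self.table.get(pc % self.table_size, (0, 0))
        return value if confidence >= self.threshold else None

    def train(self, pc, actual):
        """Update the entry once the instruction's real result is known."""
        index = pc % self.table_size
        value, confidence = self.table.get(index, (0, 0))
        if value == actual:
            confidence = min(confidence + 1, self.threshold)  # saturate
        else:
            value, confidence = actual, 0                     # reset on change
        self.table[index] = (value, confidence)

vp = LastValuePredictor()
for _ in range(4):
    vp.train(pc=0x400, actual=7)  # a stable result builds confidence
print(vp.predict(0x400))          # 7, once the threshold is reached
```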
Solution
The {Early, Out-of-Order, Late} Execution (EOLE) architecture with VP can extract more ILP than conventional architectures (see Figure 1 of the paper; unfortunately, that figure is non-free).
- To make VP practical, EOLE piggy-backs on the existing instruction commit/retirement stage of OoO processors.
- This implies no need for selective replay when a value prediction is wrong; the architecture can simply squash the pipeline.
- But isn't squashing the pipeline more expensive? Presumably the cost is tolerable because predictions are only used at very high confidence, so squashes are rare.
- Early Execution (EE): single-cycle instructions whose operands are immediate or value-predicted can be executed in the front-end, bypassing the expensive OoO scheduler (see the routing sketch after this list).
- Multi-cycle instructions are too complex, so EE focuses on single-cycle instructions.
- This reduces pressure on the OoO instruction window and scheduler while requiring only simple hardware.
- Late Execution (LE): value predictions are validated in LE, just before retirement.
- LE executes single-cycle instructions whose value predictions have high confidence, as well as high-confidence branches.
- Value predictor scheme: it seemed to me that they associate each static instruction with a value, which becomes its prediction the next time around. That seems naive (many mispredictions); perhaps I misunderstood, since if I recall correctly the paper builds on the authors' earlier VTAGE predictor, which is more sophisticated than a plain last-value table.
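My mental model of how instructions get routed among EOLE's three execution paths, expressed as a toy sketch. The fields and routing rules are my own simplification of the paper's design; the real selection logic is hardware and more involved.

```python
# A hedged sketch of EOLE's three execution paths, as I understand them.

from dataclasses import dataclass

@dataclass
class Inst:
    pc: int
    single_cycle: bool
    operands_ready: bool   # sources are immediates or already predicted
    high_confidence: bool  # value prediction confidence is above threshold

def route(inst: Inst) -> str:
    """Pick an execution path for an instruction."""
    if inst.single_cycle and inst.operands_ready:
        return "early"  # executed in the front-end; never enters the IQ
    if inst.single_cycle and inst.high_confidence:
        return "late"   # executed and validated just before retirement
    return "ooo"        # conventional out-of-order execution

print(route(Inst(0x40, True, True, False)))    # early
print(route(Inst(0x44, True, False, True)))    # late
print(route(Inst(0x48, False, False, False)))  # ooo
```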
Evaluation
- How many instructions can bypass the OoO engine through EE and LE? Between 10% and 60%, depending on the SPEC CPU 2000/2006 benchmark.
- Used gem5 simulation; see the paper for parameters.
- Measure IPC and speedup.
- On some benchmarks, the instruction queue size mattered a lot; on others, it barely made a difference.
- Hardware complexity
- With EOLE, one can reduce the issue width from 6 to 4 while maintaining performance.
- EOLE implies adding an LE block and an EE block to the pipeline and increasing the number of ports on the physical register file (PRF).
- Could mitigate the extra PRF ports by banking the register file (see the toy cost model below).
- LE and EE could share PRF ports.
- They have no way of measuring silicon area or power consumption, so how can they be sure these are actually lower?
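A back-of-the-envelope toy model of why banking helps, using the common rule of thumb that register-file area and power grow roughly quadratically with the number of ports per bank. The port counts below are hypothetical, not the paper's figures.

```python
def prf_cost(ports_per_bank: int, banks: int) -> int:
    # Rule-of-thumb cost model: quadratic in ports per bank,
    # linear in the number of banks.
    return banks * ports_per_bank ** 2

monolithic = prf_cost(ports_per_bank=16, banks=1)  # 256 cost units
banked = prf_cost(ports_per_bank=6, banks=4)       # 144 cost units
print(f"banked / monolithic = {banked / monolithic:.2f}")  # ~0.56
```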