EOLE: Paving the Way for an Effective Implementation of Value Prediction


Citation: Arthur Perais, André Seznec (2014/06/18) EOLE: Paving the Way for an Effective Implementation of Value Prediction. Annual International Symposium on Computer Architecture, ISCA
DOI (original publisher): 10.1109/ISCA.2014.6853205
Download: https://ieeexplore.ieee.org/abstract/document/6853205
Tagged: Computer Science, Computer Architecture

Summary

The authors present value prediction (VP) as a way to extract more instruction-level parallelism (ILP) and thus improve single-threaded performance. Conventional VP involves complex and power-hungry circuits, so the authors present a simpler model of VP that piggy-backs on existing out-of-order (OoO) structures.

Theoretical and Practical Relevance

Value prediction remains a research idea, not yet commercially exploited.


Problem

  1. Computers need higher single-threaded performance (even in multicore era).
  2. To do this, architectures need to extract more ILP.
  3. Traditional approaches (increasing ROB, speculative scheduling, speculative replay (to recover from misspeculation), increasing issue width) have a deleterious impact on power consumption.
  4. An alternative approach is VP: predicting instruction results so that dependent operations can execute before their operands are actually computed. Previous VP proposals, however, have been too complex to be practical.

Solution

The {Early, Out-of-Order, Late} Execution (EOLE) architecture with VP can extract more ILP than conventional architectures (see Figure 1 of the paper (unfortunately, this figure is non-free)).

  • To make VP practical, the authors piggy-back on the existing instruction commit/retirement stage of OoO processors: predictions are validated in order at commit (see the sketch after this list).
    • This removes the need for selective replay when a value prediction is wrong; the architecture can just squash the pipeline.
      • But isn't squashing the pipeline more expensive? (The paper's answer: predictions are used only when confidence is very high, so squashes are rare.)
  • Early Execution (EE): single-cycle instructions whose operands are immediate or value-predicted can be executed in the front-end, in parallel with rename, bypassing the expensive OoO scheduler (see the steering sketch after this list).
    • Multi-cycle instructions would complicate the front-end, so EE focuses on single-cycle instructions.
    • This reduces pressure on the OoO instruction window (and, via a narrower issue width, on the register file), while requiring only simple hardware.
  • Late Execution (LE): instructions are executed, and their value predictions checked, just before commit, again bypassing the OoO engine.
    • LE focuses on single-cycle instructions whose results were predicted with high confidence, plus very-high-confidence branches.
  • Value predictor scheme: it seemed to me that they associate each static instruction with a value, which would be its prediction next time. That seems naive (many mispredicts); perhaps I misunderstood. (The paper in fact builds on the authors' earlier VTAGE predictor, which attaches confidence counters to predictions and uses them only when confidence is high; a simplified sketch follows.)
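
To make the prediction/validation flow concrete, here is a minimal Python sketch (illustrative, not taken from the paper): a confidence-gated last-value predictor standing in for the paper's more elaborate VTAGE, with predictions validated in order at commit and a pipeline squash, rather than selective replay, on a misprediction. All class and field names here are hypothetical.

  # Minimal sketch: a last-value predictor gated by a saturating
  # confidence counter (the paper uses VTAGE; this is a stand-in).
  CONF_MAX = 7              # 3-bit saturating confidence counter
  USE_THRESHOLD = CONF_MAX  # use a prediction only at full confidence

  class LastValuePredictor:
      def __init__(self):
          self.table = {}   # pc -> (last_value, confidence)

      def predict(self, pc):
          """Return a prediction, or None if confidence is too low."""
          value, conf = self.table.get(pc, (None, 0))
          return value if conf >= USE_THRESHOLD else None

      def train(self, pc, actual):
          """Update at commit with the architecturally correct result."""
          value, conf = self.table.get(pc, (None, 0))
          if value == actual:
              conf = min(conf + 1, CONF_MAX)
          else:
              value, conf = actual, 0   # wrong value: reset confidence
          self.table[pc] = (value, conf)

  def validate_at_commit(insn, predictor, pipeline):
      """EOLE-style validation at retirement: no selective replay; a
      used-but-wrong prediction squashes younger instructions."""
      predictor.train(insn.pc, insn.result)
      if insn.prediction is not None and insn.prediction != insn.result:
          pipeline.squash_after(insn)   # refetch everything younger

  # Toy usage: a repeating result becomes predictable once the
  # confidence counter saturates.
  p = LastValuePredictor()
  for _ in range(CONF_MAX + 1):
      p.train(pc=0x400, actual=42)
  assert p.predict(pc=0x400) == 42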
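The steering decision itself can be summarized in a few lines. The rules follow the bullets above; the field names are again hypothetical:

  # Illustrative steering of instructions around the OoO engine.
  # "EE"  = early execution in the front-end, in parallel with rename
  # "LE"  = late execution/validation just before commit
  # "OOO" = conventional dispatch to the out-of-order scheduler
  def steer(insn):
      if insn.is_single_cycle and all(
              op.is_immediate or op.is_predicted for op in insn.operands):
          return "EE"   # result computable in the front-end
      if insn.is_single_cycle and insn.result_predicted_high_conf:
          return "LE"   # execute and validate at commit
      if insn.is_branch and insn.branch_confidence_very_high:
          return "LE"   # very-high-confidence branches resolve late
      return "OOO"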

Evaluation

  • How many instructions can bypass the OoO engine through EE and LE? Between 10% and 60%, depending on the SPEC CPU 2000/2006 benchmark.
  • Used gem5 simulation; see the paper for parameters.
  • They measure IPC and speedup (see the sketch of the arithmetic after this list).
  • On some benchmarks, the instruction queue size mattered a lot; on others, it barely made a difference.
  • Hardware complexity
    • With EOLE, one can reduce the issue width from 6 to 4 while maintaining performance.
    • EOLE implies adding an LE block and an EE block to the pipeline and increasing the number of ports on the physical register file (PRF).
      • The extra port cost could be mitigated by banking the PRF and by letting LE and EE share PRF ports (see the banking sketch after this list).
  • They have no way of measuring silicon area or power consumption. How can they be sure they are lower?
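
On methodology, speedup here is the usual ratio of IPCs; across benchmarks it is conventional to aggregate with a geometric mean. A quick Python sketch of the arithmetic, with made-up numbers (not the paper's results):

  from math import prod

  def speedup(ipc_new, ipc_base):
      return ipc_new / ipc_base

  def geomean(xs):
      return prod(xs) ** (1.0 / len(xs))

  # Hypothetical per-benchmark IPCs, purely to show the arithmetic.
  base = {"gcc": 1.20, "mcf": 0.45, "namd": 2.10}
  eole = {"gcc": 1.26, "mcf": 0.47, "namd": 2.10}
  print(geomean([speedup(eole[b], base[b]) for b in base]))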
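On the PRF-port mitigation: banking splits the register file so that each bank needs only a few ports, at the cost of occasional bank conflicts. A toy model of the idea (bank and port counts are hypothetical, not the paper's):

  # Toy model of a banked physical register file: a read succeeds
  # only if its bank still has a free port this cycle.
  N_BANKS = 4
  PORTS_PER_BANK = 2         # fewer ports per bank is the whole point

  def bank_of(preg):
      return preg % N_BANKS  # simple interleaved bank mapping

  def schedule_reads(pregs):
      """Return (served, deferred) register reads for one cycle."""
      used = [0] * N_BANKS
      served, deferred = [], []
      for preg in pregs:
          b = bank_of(preg)
          if used[b] < PORTS_PER_BANK:
              used[b] += 1
              served.append(preg)
          else:
              deferred.append(preg)  # bank conflict: retry next cycle
      return served, deferred

  # Registers 0, 4, 8 all map to bank 0; only two can be read this cycle.
  print(schedule_reads([0, 4, 8, 3]))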

Future work

  • The implications of VP in EOLE for memory consistency.
  • Reduce hardware cost of EOLE.
  • Handle more kinds of instructions in EE and LE.
  • Combine with other OoO simplifications, such as multiclusters [8] and PRF optimizations [38].