Eyeriss: a spatial architecture for energy-efficient dataflow for convolutional neural networks


Citation: Yu-Hsin Chen, Joel S. Emer, Vivienne Sze (June 2016). "Eyeriss: a spatial architecture for energy-efficient dataflow for convolutional neural networks." ACM SIGARCH Computer Architecture News.
DOI: 10.1145/3007787.3001177
Download: https://dl.acm.org/doi/abs/10.1145/3007787.3001177
Tagged: Computer Science, Computer Architecture

Summary

Convolutional neural networks deliver increasingly strong performance, but at a high computational cost. While SIMT/SIMD architectures (such as in GPUs) exploit parallelism, they still suffer from suboptimal data-movement patterns. Eyeriss is a CNN accelerator with a novel data-movement pattern (dataflow) that yields important energy benefits over traditional GPUs and even other CNN accelerators.

Theoretical and Practical Relevance

Eyeriss has the potential to beat the TPU for CNNs because of its more efficient data movement, although I haven't seen head-to-head comparisons. Accelerators such as Eyeriss will only become more important given the trend toward dark silicon and the end of Moore's Law.


Problem

  1. CNNs are important but computationally expensive
  2. SIMT/SIMD architectures address compute, but not data movement
    • Data movement drives both energy cost and memory-bandwidth demand
  3. There is a lot of potential data reuse, but it is hard to exploit all of it at once.

Background

  • Each filter (a multi-channel set of weights) is slid across the input feature map (aka ifmap); at each position the overlapping values are multiplied, summed, and added to a bias to produce one pixel of the output feature map (aka ofmap).
  • CNN inference on a batch of inputs amounts to a 4D element-wise multiplication with a 3D sum (see the loop-nest sketch after this list).
  • Input reuse types:
    • Convolutional reuse: Since each filter is slid across the ifmap, every weight is reused once per sliding-window position, i.e., roughly height(ofmap)*width(ofmap) times.
    • Filter reuse: Each filter is reused across a batch of ifmaps.
    • Ifmap reuse: Each ifmap pixel is used in each of the filters.
  • An architecture cannot maximize input reuse if partial sums are reduced immediately, but deferring partial-sum reduction causes an explosion of intermediate values (and thus memory usage).
  • Techniques from image processing can't be applied to CNNs because:
    • In CNNs, the filters have to be loaded from memory, which is a significant cost.
    • In CNNs, high-dimensional (4D) convolutions matter, while image processing mostly uses 2D convolutions.
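
To make the computation and the reuse opportunities concrete, here is a minimal Python sketch of one convolutional layer as a naive loop nest. The toy shapes and the dimension names (N, M, C, H, R, E) are assumptions for illustration, not code or notation taken from the paper:

  import numpy as np

  # Toy shapes (assumed for illustration):
  # N = batch size, M = number of filters (ofmap channels), C = channels,
  # H = ifmap height/width, R = filter height/width, E = ofmap height/width.
  N, M, C, H, R = 2, 4, 3, 8, 3
  E = H - R + 1                          # stride 1, no padding

  ifmaps  = np.random.rand(N, C, H, H)   # inputs
  filters = np.random.rand(M, C, R, R)   # weights
  bias    = np.random.rand(M)
  ofmaps  = np.zeros((N, M, E, E))

  for n in range(N):          # batch: each filter is reused N times (filter reuse)
      for m in range(M):      # ofmap channel: each ifmap pixel feeds all M filters (ifmap reuse)
          for y in range(E):  # sliding window: each weight is reused E*E times (convolutional reuse)
              for x in range(E):
                  acc = 0.0   # 3D sum over (channel, filter row, filter column)
                  for c in range(C):
                      for i in range(R):
                          for j in range(R):
                              acc += ifmaps[n, c, y + i, x + j] * filters[m, c, i, j]
                  ofmaps[n, m, y, x] = acc + bias[m]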

Contribution

  • Taxonomy of CNN dataflows (two of these dataflows are contrasted in the sketch after this list)
    • Weight Stationary
    • Output Stationary (variants: multiple vs. single ofmap channels × multiple vs. single ofmap-plane pixels)
    • No local Reuse
    • Row Stationary (novel in this paper)
    • This textbook explains the taxonomy in detail.
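
As a rough illustration of how to read the taxonomy, the sketch below contrasts Weight Stationary and Output Stationary loop orderings on a 1D convolution: the "stationary" operand is the one held in a PE's local register while everything else streams past. This is a simplified Python sketch with assumed toy sizes, not the paper's actual mapping:

  # Toy 1D convolution (sizes assumed for illustration).
  W = 5                         # input length
  R = 3                         # filter length
  E = W - R + 1                 # output length

  inp = [float(i) for i in range(W)]
  wts = [0.5, 1.0, 0.5]

  # Weight Stationary: each weight is held in the PE's local register
  # while the inputs and partial sums that need it stream through.
  out_ws = [0.0] * E
  for i in range(R):            # outer loop over weights -> weight stays put
      w = wts[i]                # "stationary" value in the PE register
      for x in range(E):        # inputs and partial sums move
          out_ws[x] += w * inp[x + i]

  # Output Stationary: each partial sum is held locally until fully
  # accumulated, while weights and inputs stream through.
  out_os = [0.0] * E
  for x in range(E):            # outer loop over outputs -> psum stays put
      acc = 0.0                 # "stationary" accumulator in the PE register
      for i in range(R):        # weights and inputs move
          acc += wts[i] * inp[x + i]
      out_os[x] = acc

  assert out_ws == out_os       # same math, different data-movement pattern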

Solution

  • Spatial architecture
  • Row Stationary dataflow (sketched below)
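
A minimal sketch of the Row Stationary idea on a single-channel 2D convolution (toy sizes, assumed for illustration): the 2D convolution is decomposed into 1D row convolutions, each PE keeps one filter row stationary while an ifmap row slides past, and the per-row partial sums are accumulated across PEs to form one ofmap row. This is not the chip's exact PE mapping, just the decomposition:

  import numpy as np

  H, R = 6, 3                    # ifmap and filter are square (assumed)
  E = H - R + 1
  ifmap = np.random.rand(H, H)
  filt  = np.random.rand(R, R)

  def pe_1d_conv(filter_row, ifmap_row):
      """One PE: the filter row stays in the local register file ("row
      stationary") while the ifmap row slides past, producing a row of psums."""
      return np.array([np.dot(filter_row, ifmap_row[x:x + R]) for x in range(E)])

  ofmap = np.zeros((E, E))
  for y in range(E):                              # one ofmap row at a time
      # R PEs work in parallel, one per filter row; their psum rows are summed.
      psum_rows = [pe_1d_conv(filt[i], ifmap[y + i]) for i in range(R)]
      ofmap[y] = np.sum(psum_rows, axis=0)

  # Check against a direct 2D convolution.
  ref = np.array([[np.sum(ifmap[y:y + R, x:x + R] * filt) for x in range(E)]
                  for y in range(E)])
  assert np.allclose(ofmap, ref)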

Evaluation

  • Simulation
    • Energy analysis with a 4-level memory hierarchy (DRAM, global buffer, inter-PE array, PE-local registers); a simple access-count cost model is sketched after this list
  • A fabricated chip confirms the results
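
The energy analysis can be thought of as an access-count model: count how many times a given dataflow touches each level of the memory hierarchy and weight those counts by a per-access energy cost. The Python sketch below is only an illustration; the per-access costs and access counts are made-up placeholders, not the paper's numbers:

  # Illustrative normalized per-access energy costs (placeholders, not the paper's values).
  ENERGY_PER_ACCESS = {
      "register": 1.0,      # PE-local register file (cheapest)
      "array":    2.0,      # inter-PE (spatial array) communication
      "buffer":   6.0,      # global on-chip buffer
      "dram":   200.0,      # off-chip DRAM (most expensive)
  }
  ENERGY_PER_MAC = 1.0

  def total_energy(access_counts, num_macs):
      """access_counts: dict mapping hierarchy level -> number of accesses for a dataflow."""
      data_energy = sum(ENERGY_PER_ACCESS[lvl] * n for lvl, n in access_counts.items())
      return data_energy + ENERGY_PER_MAC * num_macs

  # Two hypothetical dataflows performing the same MACs: the one that keeps more
  # traffic in the registers/array and out of DRAM wins on energy.
  dataflow_a = {"register": 9e8, "array": 2e8, "buffer": 5e7, "dram": 4e6}
  dataflow_b = {"register": 3e8, "array": 1e8, "buffer": 2e8, "dram": 3e7}
  print(total_energy(dataflow_a, num_macs=1e8))
  print(total_energy(dataflow_b, num_macs=1e8))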
