In-Datacenter Performance Analysis of a Tensor Processing Unit

From AcaWiki

Citation: Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Thomas Borchers, Rick Boyle, Pierre Luc Cantin, Clifford Chao, Christopher M Clark, Jeremy Coriell, Mike Daley, Matt Dau, J. Dean, Ben Gelb, Tara Vazir Ghaemmaghami, Rajendra Gottipati, William Gulland, Robert Hagmann, C Richard Ho, Doug Hogberg, John Hu, Robert Hundt, Dan Hurt, Julian Ibarz, Aaron Jaffey, Alek Jaworski, Alexander Kaplan, Harshit Khaitan, Daniel Killebrew, Andy Koch, Naveen Kumar, Steve Lacy, James Laudon, James Law, Diemthu Le, Chris Leary, Zhuyuan Liu, Kyle Lucke, Alan Lundin, Gordon MacKean, Adriana Maggiore, Maire Mahony, Kieran Miller, Rahul Nagarajan, Ravi Narayanaswami, Ray Ni, Kathy Nix, Thomas Norrie, Mark Omernick, Narayana Penukonda, Andy Phelps, Jonathan Ross, Matt Ross, Amir Salek, Emad Samadiani, Chris Severn, Gregory Sizikov, Matthew Snelham, Jed Souter, Dan Steinberg, Andy Swing, Mercedes Tan, Gregory Thorson, Bo Tian, Horia Toma, Erick Tuttle, Vijay Vasudevan, Richard Walter, Walter Wang, Eric Wilcox, Doe Hyun Yoon (2017/06) In-Datacenter Performance Analysis of a Tensor Processing Unit. Annual International Symposium on Computer Architecture
DOI (original publisher): 10.1145/3079856.3080246
Download: https://dl.acm.org/doi/10.1145/3079856.3080246
Tagged: Computer Science, Computer Architecture

Summary

Deep neural networks are a crucial workload for Google, and GPUs are not sufficient for it: so much so that Google designed custom hardware, the Tensor Processing Unit (TPU), built around a 256 x 256 matrix multiply intrinsic. The TPU is far better than a comparable GPU in power efficiency and response time.

Theoretical and Practical Relevance

This paper is extremely well cited. Academia was already interested in building neural-network accelerators, but this paper offers a practical, deployed example: Google actually produced and ran the TPU in its datacenters, which yields insights other papers don't have. For example, response time is more important than throughput, LSTMs and MLPs are more important than CNNs in Google's workload mix, and memory-bandwidth utilization is not as important as often assumed.


Problem

  1. DNN inference is important to Google.
  2. GPUs are slow and power-hungry.
  3. Instead, let's use a specialized architecture.

Solution

  • Tensor Processing Unit (TPU)
    • Provides a 256 x 256 systolic array of 8-bit multiply-and-accumulate (MAC) compute elements.
      • Weights are preloaded into each of the 256 x 256 systolic-array compute elements and reused across many inputs.
      • Google's animated presentation illustrates the data flow.
      • Each compute element inputs a data item from the east, inputs a partial sum from the north, outputs weight * data + partial sum to the south, and passes the data item on to the west. When this completes, the southern-most compute elements hold the dot product of one row of input against each column of weights (a code sketch of this data flow follows this list).
    • Connects to CPU over PCIe
    • CISC-style instructions (e.g., Read_Weights, MatrixMultiply, Activate), because instruction fetch over PCIe is slow.
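
To make the data-flow bullet above concrete, here is a minimal, untimed sketch in Python/NumPy. It models only what each weight-stationary cell computes (weight * data added to the partial sum passed down its column), not the TPU's actual pipelining or the diagonal wavefront of a real systolic array; all names are illustrative.

    import numpy as np

    def systolic_matvec(weights: np.ndarray, x: np.ndarray) -> np.ndarray:
        """Push one input vector x through a weight-stationary array.

        Cell (i, j) holds weights[i, j]. The data item x[i] travels along
        row i, a partial sum travels down column j, and each cell emits
        weight * data + partial_sum southward. The values leaving the
        southern-most row are the dot products of x with each weight
        column, i.e. one row of x @ weights.
        """
        n_rows, n_cols = weights.shape
        out = np.zeros(n_cols, dtype=np.int32)
        for j in range(n_cols):            # one column of cells
            partial = 0                    # partial sum entering from the north
            for i in range(n_rows):        # walk the column north to south
                partial += int(weights[i, j]) * int(x[i])
            out[j] = partial               # value leaving the southern edge
        return out

    # Toy check with 8-bit operands and 32-bit accumulation, as on the TPU.
    rng = np.random.default_rng(0)
    W = rng.integers(-128, 128, size=(4, 4), dtype=np.int8)  # preloaded, reused weights
    x = rng.integers(-128, 128, size=4, dtype=np.int8)       # one row of activations
    assert np.array_equal(systolic_matvec(W, x),
                          x.astype(np.int32) @ W.astype(np.int32))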

Evaluation

  • Throughput plotted on a roofline model for various NNs (a minimal roofline calculation is sketched right after this list):
    • TPU: everything except the CNNs is memory-bound, but that is fine because most points sit near the roofline, so the memory-bound apps still get close to the attainable performance.
    • GPU and CPU: most points sit well below their rooflines; many applications are compute-bound.
  • Response time for Google workloads:
    • MLPs are used more than LSTMs, which in turn are used more than CNNs; prior work focused too much on CNNs.
    • Developers care more about response time (e.g., a 99th-percentile latency bound) than about throughput.
    • Some microarchitectural optimizations, such as speculation, improve the average case at the expense of the worst case (a misprediction); the TPU avoids these, which keeps its latency predictable.
    • TPU > GPU and CPU.
  • Can't report total cost of ownership for business reasons, but can report performance per Watt (ops/W): TPU > GPU > CPU.
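
The roofline bullet above boils down to one formula: attainable performance = min(peak compute, memory bandwidth x operational intensity). A minimal sketch in Python, with placeholder numbers rather than the paper's exact figures:

    def attainable_ops_per_sec(operational_intensity, peak_ops_per_sec, bytes_per_sec):
        """Roofline model: performance is capped by compute or by memory traffic.

        operational_intensity is operations performed per byte read from
        memory. Below the ridge point (peak / bandwidth) a workload is
        memory-bound; above it, compute-bound.
        """
        return min(peak_ops_per_sec, bytes_per_sec * operational_intensity)

    # Placeholder numbers for a hypothetical accelerator (not the paper's figures).
    PEAK = 90e12              # 8-bit ops/s
    BANDWIDTH = 30e9          # bytes/s from weight memory
    ridge = PEAK / BANDWIDTH  # ops/byte where the two ceilings meet

    for oi in (10, 100, ridge, 10 * ridge):
        tops = attainable_ops_per_sec(oi, PEAK, BANDWIDTH) / 1e12
        print(f"{oi:8.0f} ops/byte -> {tops:5.1f} TOPS")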

Takeaways

  • The discussion and conclusion are worth reading. Here are some salient observations:
    • Latency is more important than throughput for Google's workloads.
  • Software written in TensorFlow can be compiled down to the TPU, but more work is needed on tuning execution for the TPU (loop tiling, kernel fusion, etc.).
    • This is where compiler work such as TVM and XLA comes in.
    • Prior FPGA work had to be written in Verilog.
  • Making the matrix multiply unit bigger would actually hurt performance: a larger tile is harder to keep full, so more of the array goes unused (see the utilization sketch at the end of this section).
  • Surprisingly, the GPU is not that much better than the CPU, because GPUs are designed for throughput while these users care about response time.
  • Some old, almost-abandoned ideas turned out to be useful: systolic arrays, decoupled access/execute, and CISC.
  • The authors believe academia focuses too much on CNNs and on floating-point NNs.
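
To see the tiling argument in the matrix-unit bullet above, here is a rough utilization sketch. The layer shapes are made up rather than taken from the paper; it only illustrates that padding a fixed-size tile wastes more of a bigger array:

    import math

    def mmu_utilization(rows, cols, tile):
        """Fraction of a tile x tile matrix unit doing useful work when a
        rows x cols matrix is processed in tile-sized blocks (edges padded)."""
        padded = (math.ceil(rows / tile) * tile) * (math.ceil(cols / tile) * tile)
        return rows * cols / padded

    # Hypothetical layer shapes: the 512-wide unit idles more than the 256-wide one.
    shapes = [(600, 600), (1000, 200), (2000, 100)]
    for tile in (256, 512):
        utils = ", ".join(f"{mmu_utilization(r, c, tile):.2f}" for r, c in shapes)
        print(f"{tile}x{tile} unit utilization: {utils}")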