# Avoiding communication in sparse matrix computations

**Citation:** *James Demmel, Mark Hoemmen, Marghoob Mohiyuddin, Katherine Yelick (2008-05-14). Avoiding communication in sparse matrix computations. IEEE International Symposium on Parallel and Distributed Processing.*

**DOI (original publisher):** 10.1109/IPDPS.2008.4536305

**Download:** https://ieeexplore.ieee.org/document/4536305

**Tagged:** Computer Science, computer architecture

# Summary (Abstract)

The authors present algorithms for computing the "matrix powers kernel" [*x*, *Ax*, *A^2 x*, …, *A^k x*], trading increased (possibly redundant) computation for reduced communication. This is a good tradeoff on parallel machines with high latency (such as machines connected over the internet).

## Problem

- Many matrix methods (notably Krylov subspace methods) require generating the vectors *Ax*, *A^2 x*, …, *A^k x*, where *A* is sparse.
- The traditional method computes these vectors one at a time and has a lot of communication.
- Improvements in compute speed have far outpaced improvements in communication (bandwidth and especially latency).
- Why not use a new algorithm? It might do redundant computation, but with less communication.
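To make the target concrete, here is a minimal sketch (my own toy code, not from the paper) of the kernel: repeated sparse matrix–vector products. The dict-of-rows sparse format is an assumption chosen for clarity.

```python
# A is a dict mapping each row index to a {column: value} dict -- a toy
# sparse format; real implementations would use CSR or similar.

def matrix_powers(A, x, k):
    """Return the list [A x, A^2 x, ..., A^k x]."""
    n = len(x)
    vectors = []
    v = x
    for _ in range(k):
        w = [0.0] * n
        for i, row in A.items():
            # Only the nonzeros of row i contribute to entry i.
            w[i] = sum(val * v[j] for j, val in row.items())
        vectors.append(w)
        v = w
    return vectors

# Hypothetical tridiagonal 3x3 example.
A = {0: {0: 2.0, 1: 1.0}, 1: {0: 1.0, 1: 2.0, 2: 1.0}, 2: {1: 1.0, 2: 2.0}}
print(matrix_powers(A, [1.0, 0.0, 0.0], k=2))  # → [[2.0, 1.0, 0.0], [5.0, 4.0, 1.0]]
```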

## Solution

- I will reuse notation from the paper.
**PA0** (conventional):

Input: matrix *A* and vector *x*. Each coordinate *j* is owned by exactly one processor *q*. Processor *q* does the following. For *i* from 1 to *k*:

- For each processor *r*, skipping *q*: send every element of the current vector that *q* has and that *r* needs to compute the coordinates it owns. (Remember, *A* is sparse, so most of the coordinates are not needed.)
- For each processor *r*, skipping *q*: receive every element that *q* needs and *r* has.
- Compute the owned-by-*q* coordinates of *A^i x*. (Note that some computations will be stalled waiting for receives, but others can continue.)
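The per-step exchange in PA0 can be illustrated with a toy serial simulation (my sketch, not the paper's code; the block partition of coordinates and the word counting are my assumptions):

```python
# One SpMV step under PA0's pattern: coordinates are partitioned across p
# processors, and before computing, each processor must fetch the remote
# vector entries that its local rows reference.

def pa0_step(A, v, owner, p):
    """One SpMV step, counting vector entries (words) moved between processors."""
    n = len(v)
    w = [0.0] * n
    words_moved = 0
    for q in range(p):
        local_rows = [i for i in range(n) if owner[i] == q]
        # Entries of v that q's rows reference but q does not own: these
        # must be received from their owners. A is sparse, so only a few
        # remote entries are actually needed.
        remote = {j for i in local_rows for j in A.get(i, {}) if owner[j] != q}
        words_moved += len(remote)
        for i in local_rows:
            w[i] = sum(val * v[j] for j, val in A.get(i, {}).items())
    return w, words_moved

# Hypothetical 3x3 example: rows 0,1 on processor 0; row 2 on processor 1.
A = {0: {0: 2.0, 1: 1.0}, 1: {0: 1.0, 1: 2.0, 2: 1.0}, 2: {1: 1.0, 2: 2.0}}
v1, words = pa0_step(A, [1.0, 0.0, 0.0], owner=[0, 0, 1], p=2)
print(v1, words)  # PA0 repeats this exchange at every one of the k steps
```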

**PA1**: See paper for pseudocode.

**PA2**: See paper for pseudocode. This is a novel contribution.

**SA0** (conventional):

For *i* from 1 to *k*: for *j* from 1 to *p*: compute chunk *j* of *A^i x*. (Note: they assume that *x*, and thus each *A^i x*, fits in memory, but *A* does not. The matrix multiplication can be done in *p* chunks, with *c* rows of *A* in memory at a time. In this case, *A* is read *k* times.)
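SA0's access pattern can be sketched as follows (my reconstruction, not the paper's code; `read_chunk` is a hypothetical stand-in for loading rows of *A* from slow memory):

```python
# SA0: A does not fit in memory, so each SpMV streams A in p row chunks.
# Over k steps, every nonzero of A is read k times.

def sa0(read_chunk, n, p, x, k):
    """read_chunk(j) simulates loading chunk j of A (dict: row -> {col: value})."""
    reads = 0
    v = x
    for _ in range(k):             # one full pass over A per power
        w = [0.0] * n
        for j in range(p):         # stream A chunk by chunk
            chunk = read_chunk(j)
            reads += sum(len(row) for row in chunk.values())
            for i, row in chunk.items():
                w[i] = sum(val * v[c] for c, val in row.items())
        v = w
    return v, reads                # reads == k * nnz(A)

# Hypothetical example: 7 nonzeros split into p = 2 chunks.
A = {0: {0: 2.0, 1: 1.0}, 1: {0: 1.0, 1: 2.0, 2: 1.0}, 2: {1: 1.0, 2: 2.0}}
chunks = [{0: A[0], 1: A[1]}, {2: A[2]}]
v, reads = sa0(lambda j: chunks[j], n=3, p=2, x=[1.0, 0.0, 0.0], k=2)
print(v, reads)  # A's 7 nonzeros are read k = 2 times: 14 reads
```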

**SA1**:

(The chunk loop is *outside* the power loop, not inside.) For *j* from 1 to *p*: for *i* from 1 to *k*: compute chunk *j* of *A^i x*. This way *A* is read only once.
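The loop interchange only works if each chunk can advance *k* levels on its own, which requires loading the rows it transitively depends on. A sketch of that idea (my simplification; the paper's SA1 uses graph partitioning and accounts for the redundant "ghost" computation much more carefully):

```python
# For one chunk of target rows, take the k-hop dependency closure in the
# graph of A, load those rows once, and compute all k levels locally.
# A is read once overall, at the cost of redundant computation wherever
# the chunks' closures overlap.

def sa1_chunk(A, x, k, rows):
    """Compute entries `rows` of A x, ..., A^k x using one load of A's rows."""
    closure = set(rows)
    for _ in range(k):
        closure |= {j for i in closure for j in A.get(i, {})}
    local_A = {i: A[i] for i in closure if i in A}  # single read of these rows
    v = dict(enumerate(x))                          # level 0: the full vector x
    levels = []
    for _ in range(k):
        v = {i: sum(val * v.get(j, 0.0) for j, val in row.items())
             for i, row in local_A.items()}
        levels.append({i: v[i] for i in rows})      # keep only this chunk's entries
    return levels

# Hypothetical example: the chunk owning row 2 alone.
A = {0: {0: 2.0, 1: 1.0}, 1: {0: 1.0, 1: 2.0, 2: 1.0}, 2: {1: 1.0, 2: 2.0}}
print(sa1_chunk(A, [1.0, 0.0, 0.0], k=2, rows={2}))  # → [{2: 0.0}, {2: 1.0}]
```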

**SA2**: This is a novel contribution. It proceeds as in SA1, but treats *x* as sparse as well: SA2 loads only the coordinates of *x* that each chunk actually requires.
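Under this reading, the set of *x*-coordinates a chunk needs is its *k*-hop reachable set in the graph of *A*. A hypothetical helper (my notation, not the paper's):

```python
# Given a chunk's target rows, these are the only coordinates of x that
# must be read from slow memory to advance the chunk k levels; SA1, by
# contrast, keeps the whole of x available.

def x_entries_needed(A, rows, k):
    """Coordinates of x that chunk `rows` touches within k SpMV levels."""
    needed = set(rows)
    for _ in range(k):
        needed |= {j for i in needed for j in A.get(i, {})}
    return needed

A = {0: {0: 2.0, 1: 1.0}, 1: {0: 1.0, 1: 2.0, 2: 1.0}, 2: {1: 1.0, 2: 2.0}}
print(sorted(x_entries_needed(A, {0}, k=1)))  # → [0, 1]
```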

## Evaluation

- The authors analytically derive the communication cost and compute-time cost of these algorithms.
- According to simulations, PA2 and SA2 can have significant speedups on real-world workloads.
- They validate their results with an experimental implementation.