Avoiding communication in sparse matrix computations

From AcaWiki
Jump to: navigation, search

Citation: James Demmel, Mark Hoemmen, Marghoob Mohiyuddin, Katherine Yelick (2008/05/14) Avoiding communication in sparse matrix computations. IEEE International Symposium on Parallel and Distributed Processing (RSS)
DOI (original publisher): 10.1109/IPDPS.2008.4536305
Semantic Scholar (metadata): 10.1109/IPDPS.2008.4536305
Sci-Hub (fulltext): 10.1109/IPDPS.2008.4536305
Internet Archive Scholar (fulltext): Avoiding communication in sparse matrix computations
Download: https://ieeexplore.ieee.org/document/4536305
Tagged: Computer Science (RSS) computer architecture (RSS)

Summary (Abstract)

The authors present an algorithm for computing which trades off communication for increased computation. This is a good tradeoff for parallel machines with a high latency (such as those connected over the internet).


Problem

  1. Many matrix methods require generating the vectors where is sparse.
  2. The traditional method is serial and has a lot communication.
  3. Improvements in compute have far outpaced improvements in communication.
  4. Why not have new algorithm. It might have redundant compute, but less communication.

Solution

  • I will reuse notation from the paper.
  • PA0 (conventional):
Input matrix A and vector x.
Each coordinate, j, is owned by exactly one processor, q. This is denoted .
Processor q does:
  For i from 1 to k:
    For each processor r, skipping q:
      Send every element of  that I have that it needs to compute the coordinates it owns.
      (Remember, A is sparse, so most of the coordinates are not needed)
    For each processor r, skipping q:
      Receive every element of  I need that it has.
  Compute the owned-by-q coordinates of .
  (Note that some computations will be stalled waiting for receives, but others can continue)
  • PA1: See paper for pseudocode.
  • PA2: See paper for pseudocode. This is a novel contribution.
  • SA0 (conventional):
For i from 1 to k:
  For j from 1 to p:
    Compute 
  (Note they assume that x (and thus ) fit in memory but not A. The matrix-multiplication can be done in p chunks of size c rows of A in memory at a time. In this case, A is read k times.)
  • SA1:
(chunk outside the exponential loop, not inside)
for j from 1 to p
  for i from 1 to k
    Compute 
  • SA2: This is a novel contribution.
    • Proceeds as in SA1, but x is also sparse. SA2 only loads the chunks of coordinates of SA1 actually required.

Evaluation

  • The authors analytically derive the communication cost and compute-time cost of these algorithms.
  • According to simulations, PA2 and SA2 can have significant speedups on real-world workloads.
  • They validate their results with an experimental implementation.