# Avoiding communication in sparse matrix computations

Citation: James Demmel, Mark Hoemmen, Marghoob Mohiyuddin, Katherine Yelick (2008-05-14). Avoiding communication in sparse matrix computations. IEEE International Symposium on Parallel and Distributed Processing (IPDPS).
DOI: 10.1109/IPDPS.2008.4536305

Summary:

The authors present algorithms for computing $\{x, Ax, A^2x, \dots\}$ that trade increased computation for reduced communication. This is a good tradeoff for parallel machines with high latency (such as those connected over the internet).

## Problem

1. Many matrix methods require generating the vectors $\{x, Ax, A^2x, \dots\}$, where $A$ is sparse.
2. The traditional method is serial and involves a lot of communication.
3. Improvements in compute have far outpaced improvements in communication.
4. Why not use a new algorithm? It might do redundant computation, but with less communication.
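The kernel in question can be sketched in a few lines. This is a minimal serial baseline (plain Python; `spmv` and `krylov_basis` are hypothetical names, and CSR arrays stand in for whatever sparse format is actually used):

```python
def spmv(indptr, indices, data, x):
    """y = A @ x for a sparse A stored in CSR form."""
    y = [0.0] * (len(indptr) - 1)
    for row in range(len(indptr) - 1):
        for idx in range(indptr[row], indptr[row + 1]):
            y[row] += data[idx] * x[indices[idx]]
    return y

def krylov_basis(indptr, indices, data, x, k):
    """Return [x, Ax, ..., A^k x] by k repeated SpMVs (the serial baseline)."""
    vecs = [list(x)]
    for _ in range(k):
        vecs.append(spmv(indptr, indices, data, vecs[-1]))
    return vecs
```

Every call to `spmv` streams all of $A$ and, in the parallel setting, triggers a round of communication; the paper's algorithms restructure exactly this loop.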

## Solution

• I will reuse notation from the paper.
• PA0 (conventional):
  Input: matrix $A$ and vector $x$.
  Each coordinate $j$ is owned by exactly one processor $q$, denoted $j \in G_q$.
  Processor $q$ does:
    For $i$ from 1 to $k$:
      For each processor $r \neq q$:
        Send every element of $A^{i-1}x$ that $q$ has and that $r$ needs to compute the coordinates it owns.
        (Remember, $A$ is sparse, so most of the coordinates are not needed.)
      For each processor $r \neq q$:
        Receive every element of $A^{i-1}x$ that $q$ needs and that $r$ has.
      Compute the coordinates of $A \cdot A^{i-1}x$ owned by $q$.
      (Note that some computations will stall waiting for receives, but others can continue.)
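PA0 can be simulated sequentially to make the communication volume concrete. A sketch (hypothetical names; a dense list-of-lists stands in for the sparse format, with zeros marking absent entries), counting the vector entries exchanged each step:

```python
def pa0(A, x, k, p):
    """Sequential simulation of PA0: n coordinates block-partitioned
    over p 'processors'. Returns A^k x and the total number of
    vector entries communicated across all k steps."""
    n = len(x)
    owner = [j * p // n for j in range(n)]  # block partition of coordinates
    words = 0
    for _ in range(k):
        # Each processor q must receive the remote entries of A^{i-1}x
        # that its owned rows reference (sparsity keeps this set small).
        for q in range(p):
            needed = {j
                      for row in range(n) if owner[row] == q
                      for j in range(n) if A[row][j] != 0 and owner[j] != q}
            words += len(needed)
        # With those entries in hand, each processor updates its owned rows.
        x = [sum(A[r][j] * x[j] for j in range(n)) for r in range(n)]
    return x, words
```

Note that the communication count grows linearly in $k$: every power of $A$ costs a fresh round of messages, which is exactly what PA1/PA2 attack.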
• PA1: See paper for pseudocode.
• PA2: See paper for pseudocode. This is a novel contribution.
• SA0 (conventional):
  For $i$ from 1 to $k$:
    For $j$ from 1 to $p$:
      Compute $A^{i}_{cj:c(j+1)}x \gets A_{cj:c(j+1)} \cdot A^{i-1}x$
  (Note: they assume that $x$, and thus each $A^{i}x$, fits in memory, but $A$ does not. The matrix-vector multiplication is done in $p$ chunks, holding $c$ rows of $A$ in memory at a time. In this scheme, $A$ is read $k$ times.)
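A sketch of SA0 under the stated assumption ($x$ and the result vectors fit in memory, $A$ does not). Names are hypothetical; a list of row chunks stands in for disk reads, and the load counter shows that all of $A$ is re-read on every one of the $k$ steps:

```python
def sa0(chunks, x, k):
    """chunks[j] holds rows [j*c, (j+1)*c) of A (dense lists for brevity).
    Returns A^k x and the number of chunk loads from 'disk'."""
    loads = 0
    for _ in range(k):
        y = []
        for chunk in chunks:   # stream all of A once per step
            loads += 1
            for row in chunk:
                y.append(sum(a * xi for a, xi in zip(row, x)))
        x = y
    return x, loads
```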
• SA1:
  (Chunk outside the power loop, not inside.)
  For $j$ from 1 to $p$:
    For $i$ from 1 to $k$:
      Compute $A^{i}_{cj:c(j+1)}x \gets A_{cj:c(j+1)} \cdot A^{i-1}x$
  (Each chunk of $A$ is now read only once, at the cost of redundantly computing the entries of the intermediate vectors $A^{i-1}x$ that the chunk depends on.)
• SA2: This is a novel contribution.
  Proceeds as in SA1, but when $x$ is also sparse, SA2 loads only the entries of each chunk that are actually required.
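The SA1 loop swap can be sketched as follows (plain Python, hypothetical names; dense lists stand in for the sparse format). Each chunk of $A$ is loaded once and advanced through all $k$ steps locally; the "ghost" rows it must redundantly compute are found by expanding the chunk's column dependencies $k-1$ times:

```python
def sa1(A, x, k, c):
    """Dense list-of-lists A for brevity; c is the chunk height.
    Returns A^k x and the number of chunk loads (one pass over A)."""
    n = len(x)
    result = [0] * n
    loads = 0
    for start in range(0, n, c):
        rows = set(range(start, min(start + c, n)))
        # Ghost region: rows this chunk transitively depends on within
        # k-1 steps; their intermediate values are computed redundantly.
        need = set(rows)
        for _ in range(k - 1):
            need |= {j for r in need for j in range(n) if A[r][j] != 0}
        loads += 1  # this chunk (and its ghost rows) is read exactly once
        v = list(x)
        for _ in range(k):
            w = list(v)
            for r in need:
                w[r] = sum(A[r][j] * v[j] for j in range(n))
            v = w
        for r in rows:
            result[r] = v[r]
    return result, loads
```

The redundancy is visible in `need`: chunks near each other recompute overlapping ghost rows, which is the extra flops SA1 pays to read $A$ only once.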

## Evaluation

• The authors analytically derive the communication cost and compute-time cost of these algorithms.
• According to simulations, PA2 and SA2 can have significant speedups on real-world workloads.
• They validate their results with an experimental implementation.