# Avoiding communication in sparse matrix computations

Citation: James Demmel, Mark Hoemmen, Marghoob Mohiyuddin, Katherine Yelick (2008-05-14). Avoiding communication in sparse matrix computations. IEEE International Symposium on Parallel and Distributed Processing (IPDPS).
DOI (original publisher): 10.1109/IPDPS.2008.4536305

# Summary (Abstract)

The authors present algorithms for computing $\{x,Ax,A^{2}x,\dots \}$ that trade increased (possibly redundant) computation for reduced communication. This is a good tradeoff on parallel machines with high latency, such as clusters connected over the internet.

## Problem

1. Many matrix methods require generating the vectors $\{x,Ax,A^{2}x,\dots \}$ where $A$ is sparse.
2. The traditional method computes the vectors one at a time and requires a lot of communication.
3. Improvements in compute have far outpaced improvements in communication.
4. Why not a new algorithm? It might do redundant computation, but with less communication.
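To make the problem concrete, here is a minimal serial sketch of generating the vectors $\{x,Ax,A^{2}x,\dots \}$ by repeated sparse matrix-vector products. The dict-of-rows sparse representation and function names are my illustration, not the paper's code:

```python
# Assumed representation: a sparse matrix as a dict mapping
# row index -> {col index: value}. Illustrative only.

def spmv(A, x):
    """Sparse matrix-vector product y = A @ x."""
    y = [0.0] * len(x)
    for i, row in A.items():
        y[i] = sum(v * x[j] for j, v in row.items())
    return y

def krylov_basis(A, x, k):
    """Return [x, A x, A^2 x, ..., A^k x] by repeated SpMV."""
    vecs = [x]
    for _ in range(k):
        vecs.append(spmv(A, vecs[-1]))
    return vecs
```

Each step here depends on the previous vector, which is what makes the conventional method serial across the $k$ steps.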

## Solution

• I will reuse notation from the paper.
• PA0 (conventional):
  Input: matrix A and vector x.
  Each coordinate, j, is owned by exactly one processor, q. This is denoted $j\in G_{q}$.
  Processor q does:
    For i from 1 to k:
      For each processor r, skipping q:
        Send every element of $A^{i-1}x$ that q has and that r needs to compute the coordinates it owns.
        (Remember, A is sparse, so most of the coordinates are not needed.)
      For each processor r, skipping q:
        Receive every element of $A^{i-1}x$ that q needs and r has.
      Compute the owned-by-q coordinates of $A\cdot A^{i-1}x$.
      (Note that some computations will stall waiting for receives, but others can continue.)
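The PA0 pattern can be simulated on one machine to make the communication cost visible. This is a hedged sketch, not the paper's code: the ownership map, the inbox model, and the message counter are my illustration, and real implementations would use MPI rather than dicts.

```python
# Simulated PA0: per step, every processor q sends each other processor r
# the entries of A^{i-1}x that q owns and r's rows need. The message count
# grows with k, which is the communication PA1/PA2 aim to avoid.

def pa0(A, x, k, owners):
    """A: dict row -> {col: value}; owners[j] = processor owning coordinate j.
    Returns the gathered A^k x and the total point-to-point message count."""
    n = len(x)
    procs = set(owners)
    # local[q] holds the entries of the current vector that q owns
    local = {q: {j: x[j] for j in range(n) if owners[j] == q} for q in procs}
    messages = 0
    for _ in range(k):
        # communication phase: q -> r, entries r's owned rows depend on
        inbox = {q: {} for q in procs}
        for q in procs:
            for r in procs:
                if r == q:
                    continue
                needed = {j: local[q][j]
                          for i in range(n) if owners[i] == r
                          for j in A.get(i, {})
                          if owners[j] == q}
                if needed:
                    inbox[r].update(needed)
                    messages += 1
        # computation phase: each q updates only its owned coordinates
        new_local = {}
        for q in procs:
            vals = {**local[q], **inbox[q]}
            new_local[q] = {i: sum(v * vals[j]
                                   for j, v in A.get(i, {}).items())
                            for i in range(n) if owners[i] == q}
        local = new_local
    y = [0.0] * n
    for q in procs:
        for j, v in local[q].items():
            y[j] = v
    return y, messages
```

Because the exchange happens inside the loop over i, the number of messages is proportional to k times the number of neighboring processor pairs.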

• PA1: See paper for pseudocode.
• PA2: See paper for pseudocode. This is a novel contribution.
• SA0 (conventional):
  For i from 1 to k:
    For j from 1 to p:
      Compute $A_{cj:c(j+1)}^{i}x\gets A_{cj:c(j+1)}\cdot A^{i-1}x$
  (Note: they assume that x, and thus each $A^{i}x$, fits in memory, but A does not. The matrix multiplication can be done in p chunks, holding c rows of A in memory at a time. In this scheme, A is read k times.)
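A hedged sketch of the SA0 access pattern, with a counter standing in for "row of A fetched from slow memory" (the row-list representation and names are my illustration, not the paper's code):

```python
# SA0: the power loop is outermost, so the full sweep over A's rows
# repeats for each of the k powers -- every row of A is read k times.

def sa0(A_rows, x, k, c):
    """A_rows: list of (col, val) lists, one per row.
    Returns A^k x and the number of row-reads from slow memory."""
    n = len(x)
    reads = 0
    cur = list(x)
    for _ in range(k):                    # power loop (outer)
        nxt = [0.0] * n
        for start in range(0, n, c):      # chunk loop (inner): c rows at a time
            for i in range(start, min(start + c, n)):
                reads += 1                # model: fetch row i of A
                nxt[i] = sum(v * cur[j] for j, v in A_rows[i])
        cur = nxt
    return cur, reads
```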

• SA1:
  (chunk outside the exponent loop, not inside)
  For j from 1 to p:
    For i from 1 to k:
      Compute $A_{cj:c(j+1)}^{i}x\gets A_{cj:c(j+1)}\cdot A^{i-1}x$
• SA2: This is a novel contribution. It proceeds as in SA1, but x is also sparse; SA2 loads only the chunks of coordinates that SA1 actually requires.
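After reordering the loops as in SA1, computing chunk j of $A^{i}x$ depends on entries of $A^{i-1}x$ outside the chunk. One way to resolve this (consistent with the paper's idea, though the code and the dependency-closure helper are my sketch, not theirs) is to redundantly recompute each chunk's k-level neighborhood:

```python
# SA1-style sketch: each chunk of rows is processed once; cross-chunk
# dependencies are satisfied by redundantly computing everything within
# k hops of the chunk in A's sparsity pattern.

def neighborhood(A_rows, rows, k):
    """Indices reachable from `rows` in at most k hops through A's pattern."""
    reach = set(rows)
    frontier = set(rows)
    for _ in range(k):
        frontier = {j for i in frontier for j, _ in A_rows[i]} - reach
        reach |= frontier
    return reach

def sa1(A_rows, x, k, c):
    """Compute A^k x chunk by chunk, trading redundant flops for
    reading each needed part of A once per chunk instead of k times."""
    n = len(x)
    out = [0.0] * n
    for start in range(0, n, c):
        owned = range(start, min(start + c, n))
        reach = neighborhood(A_rows, owned, k)
        cur = {j: x[j] for j in reach}          # redundant working set
        for step in range(k):
            # rows whose dependencies are still available shrink each step
            active = neighborhood(A_rows, owned, k - step - 1)
            cur = {i: sum(v * cur[j] for j, v in A_rows[i]) for i in active}
        for i in owned:
            out[i] = cur[i]
    return out
```

For a well-partitioned sparse matrix the k-hop neighborhood is only slightly larger than the chunk itself, so the redundant work is small compared to the savings in reads of A.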

## Evaluation

• The authors analytically derive the communication cost and compute-time cost of these algorithms.
• According to simulations, PA2 and SA2 can have significant speedups on real-world workloads.
• They validate their results with an experimental implementation.