# Avoiding communication in sparse matrix computations

Citation: James Demmel, Mark Hoemmen, Marghoob Mohiyuddin, Katherine Yelick (2008-05-14). Avoiding communication in sparse matrix computations. IEEE International Symposium on Parallel and Distributed Processing (IPDPS).
DOI: 10.1109/IPDPS.2008.4536305

Summary:

The authors present algorithms for computing $\{x, Ax, A^2x, \dots\}$ that trade increased computation for reduced communication. This is a good tradeoff for parallel machines with high latency (such as those connected over the internet).

## Problem

1. Many matrix methods require generating the vectors $\{x, Ax, A^2x, \dots\}$, where $A$ is sparse.
2. The traditional method is serial and involves a lot of communication.
3. Improvements in compute have far outpaced improvements in communication.
4. Why not use a new algorithm? It might do redundant computation, but with less communication.
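The kernel in question can be sketched in a few lines. This is a minimal serial baseline (plain Python; `spmv` and `krylov_basis` are hypothetical names, and CSR arrays stand in for whatever sparse format is actually used):

```python
def spmv(indptr, indices, data, x):
    """y = A @ x for a sparse A stored in CSR form."""
    y = [0.0] * (len(indptr) - 1)
    for row in range(len(indptr) - 1):
        for idx in range(indptr[row], indptr[row + 1]):
            y[row] += data[idx] * x[indices[idx]]
    return y

def krylov_basis(indptr, indices, data, x, k):
    """Return [x, Ax, ..., A^k x] by k repeated SpMVs (the serial baseline)."""
    vecs = [list(x)]
    for _ in range(k):
        vecs.append(spmv(indptr, indices, data, vecs[-1]))
    return vecs
```

Every call to `spmv` streams all of $A$ and, in the parallel setting, triggers a round of communication; the paper's algorithms restructure exactly this loop.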

## Solution

• I will reuse notation from the paper.
• PA0 (conventional):
  Input: matrix $A$ and vector $x$.
  Each coordinate $j$ is owned by exactly one processor $q$, denoted $j \in G_q$.
  Processor $q$ does:
    For $i$ from 1 to $k$:
      For each processor $r \neq q$:
        Send every element of $A^{i-1}x$ that $q$ has and that $r$ needs to compute the coordinates it owns.
        (Remember, $A$ is sparse, so most of the coordinates are not needed.)
      For each processor $r \neq q$:
        Receive every element of $A^{i-1}x$ that $q$ needs and that $r$ has.
      Compute the coordinates of $A \cdot A^{i-1}x$ owned by $q$.
      (Note that some computations will stall waiting for receives, but others can continue.)
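PA0 can be simulated sequentially to make the communication volume concrete. A sketch (hypothetical names; a dense list-of-lists stands in for the sparse format, with zeros marking absent entries), counting the vector entries exchanged each step:

```python
def pa0(A, x, k, p):
    """Sequential simulation of PA0: n coordinates block-partitioned
    over p 'processors'. Returns A^k x and the total number of
    vector entries communicated across all k steps."""
    n = len(x)
    owner = [j * p // n for j in range(n)]  # block partition of coordinates
    words = 0
    for _ in range(k):
        # Each processor q must receive the remote entries of A^{i-1}x
        # that its owned rows reference (sparsity keeps this set small).
        for q in range(p):
            needed = {j
                      for row in range(n) if owner[row] == q
                      for j in range(n) if A[row][j] != 0 and owner[j] != q}
            words += len(needed)
        # With those entries in hand, each processor updates its owned rows.
        x = [sum(A[r][j] * x[j] for j in range(n)) for r in range(n)]
    return x, words
```

Note that the communication count grows linearly in $k$: every power of $A$ costs a fresh round of messages, which is exactly what PA1/PA2 attack.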
• PA1: See paper for pseudocode.
• PA2: See paper for pseudocode. This is a novel contribution.
• SA0 (conventional):
  For $i$ from 1 to $k$:
    For $j$ from 1 to $p$:
      Compute $A^{i}_{cj:c(j+1)}x \gets A_{cj:c(j+1)} \cdot A^{i-1}x$
  (Note: they assume that $x$, and thus each $A^{i}x$, fits in memory, but $A$ does not. The matrix-vector multiplication is done in $p$ chunks, holding $c$ rows of $A$ in memory at a time. In this scheme, $A$ is read $k$ times.)
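A sketch of SA0 under the stated assumption ($x$ and the result vectors fit in memory, $A$ does not). Names are hypothetical; a list of row chunks stands in for disk reads, and the load counter shows that all of $A$ is re-read on every one of the $k$ steps:

```python
def sa0(chunks, x, k):
    """chunks[j] holds rows [j*c, (j+1)*c) of A (dense lists for brevity).
    Returns A^k x and the number of chunk loads from 'disk'."""
    loads = 0
    for _ in range(k):
        y = []
        for chunk in chunks:   # stream all of A once per step
            loads += 1
            for row in chunk:
                y.append(sum(a * xi for a, xi in zip(row, x)))
        x = y
    return x, loads
```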
• SA1:
  (Chunk outside the power loop, not inside.)
  For $j$ from 1 to $p$:
    For $i$ from 1 to $k$:
      Compute $A^{i}_{cj:c(j+1)}x \gets A_{cj:c(j+1)} \cdot A^{i-1}x$
  (Each chunk of $A$ is now read only once, at the cost of redundantly computing the entries of the intermediate vectors $A^{i-1}x$ that the chunk depends on.)
• SA2: This is a novel contribution.
  Proceeds as in SA1, but when $x$ is also sparse, SA2 loads only the entries of each chunk that are actually required.
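The SA1 loop swap can be sketched as follows (plain Python, hypothetical names; dense lists stand in for the sparse format). Each chunk of $A$ is loaded once and advanced through all $k$ steps locally; the "ghost" rows it must redundantly compute are found by expanding the chunk's column dependencies $k-1$ times:

```python
def sa1(A, x, k, c):
    """Dense list-of-lists A for brevity; c is the chunk height.
    Returns A^k x and the number of chunk loads (one pass over A)."""
    n = len(x)
    result = [0] * n
    loads = 0
    for start in range(0, n, c):
        rows = set(range(start, min(start + c, n)))
        # Ghost region: rows this chunk transitively depends on within
        # k-1 steps; their intermediate values are computed redundantly.
        need = set(rows)
        for _ in range(k - 1):
            need |= {j for r in need for j in range(n) if A[r][j] != 0}
        loads += 1  # this chunk (and its ghost rows) is read exactly once
        v = list(x)
        for _ in range(k):
            w = list(v)
            for r in need:
                w[r] = sum(A[r][j] * v[j] for j in range(n))
            v = w
        for r in rows:
            result[r] = v[r]
    return result, loads
```

The redundancy is visible in `need`: chunks near each other recompute overlapping ghost rows, which is the extra flops SA1 pays to read $A$ only once.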

## Evaluation

• The authors analytically derive the communication cost and compute-time cost of these algorithms.
• According to simulations, PA2 and SA2 can have significant speedups on real-world workloads.
• They validate their results with an experimental implementation.