# Avoiding communication in sparse matrix computations

**Citation:** *James Demmel, Mark Hoemmen, Marghoob Mohiyuddin, Katherine Yelick (2008/05/14) Avoiding communication in sparse matrix computations. IEEE International Symposium on Parallel and Distributed Processing (RSS)*

**DOI (original publisher):** 10.1109/IPDPS.2008.4536305

**Semantic Scholar (metadata)**: 10.1109/IPDPS.2008.4536305

**Sci-Hub (fulltext)**: 10.1109/IPDPS.2008.4536305

**Internet Archive Scholar (fulltext)**: Avoiding communication in sparse matrix computations

**Download:** https://ieeexplore.ieee.org/document/4536305

**Tagged:** Computer Science
(RSS) computer architecture (RSS)

**Summary:**

The authors present an algorithm for computing which trades off communication for increased computation. This is a good tradeoff for parallel machines with a high latency (such as those connected over the internet).

## Problem

- Many matrix methods require generating the vectors where is sparse.
- The traditional method is serial and has a lot communication.
- Improvements in compute have far outpaced improvements in communication.
- Why not have new algorithm. It might have redundant compute, but less communication.

## Solution

- I will reuse notation from the paper.
**PA0**(conventional):

Input matrixAand vectorx. Each coordinate,j, is owned by exactly one processor,q. This is denoted . Processorqdoes: Forifrom 1 tok: For each processorr, skippingq: Send every element of that I have that it needs to compute the coordinates it owns. (Remember,Ais sparse, so most of the coordinates are not needed) For each processorr, skippingq: Receive every element of I need that it has. Compute the owned-by-qcoordinates of . (Note that some computations will be stalled waiting for receives, but others can continue)

**PA1**: See paper for pseudocode.**PA2**: See paper for pseudocode. This is a novel contribution.**SA0**(conventional):

For i from 1 tok: For j from 1 top: Compute (Note they assume thatx(and thus ) fit in memory but notA. The matrix-multiplication can be done inpchunks of sizecrows ofAin memory at a time. In this case,Ais readktimes.)

**SA1**:

(chunkoutsidethe exponential loop, not inside) forjfrom 1 topforifrom 1 tokCompute

**SA2**: This is a novel contribution.- Proceeds as in SA1, but
*x*is also sparse. SA2 only loads the chunks of coordinates of SA1 actually required.

- Proceeds as in SA1, but

## Evaluation

- The authors analytically derive the communication cost and compute-time cost of these algorithms.
- According to simulations, PA2 and SA2 can have significant speedups on real-world workloads.
- They validate their results with an experimental implementation.