# A data locality optimizing algorithm

From AcaWiki

**Citation:** *Micheal E. Wolf, Monica S. Lam (1991/06/26) A data locality optimizing algorithm. ACM SIGPLAN Conference on Programming Language Design and Implementation (RSS)*

**DOI:** 10.1145/113445.113449

**Semantic Scholar**: 10.1145/113445.113449

**Download:** https://dl.acm.org/doi/pdf/10.1145/113445.113449

**Tagged:** Computer Science
(RSS) compilers (RSS)

**Summary:**

This paper introduces the unimodular framework to maximize data-locality in nested loops.

## Elsewhere

## Problem

- Many loop-optimizations. Which to apply?
- How to determine
*legality*and*advantage*of optimizations? - Greedy solution won't work. One might need to do optimization loop-skewing (no benefit) in order to do loop-tiling (big benefit).
- Exhaustive solution is too slow. Search-space could be infinite.

- How to determine

## Background

*Advantage*in this paper, comes from exploiting temporal and spatial locality in the cache. If the processor can reuse data from the cache, it doesn't have to go back/forth to main memory.**Self-temporal reuse**: The same element is reused by this reference. E.g. Reference 1 in the example below; iteration (i, j) and iteration (i, j+1) use the same A[i].**Self-spatial reuse**: Nearby elements are used by this reference. Superset of self-temporal reuse. E.g. Reference 2: iteration (i, j) uses A[j], while the A[j+1] iteration uses A[j+1], which is nearby.**Group-temporal reuse**: Same element is reused, by other references. E.g. Reference 2 and 3: iteration (i, j) uses A[j+1] by reference 2, and iteration (i, j+1) uses A[j+1] by reference 3.**Group-spatial reuse**: Nearby elements are reused by other references. Superset of group-spatial reuse. E.g. Reference 3 and 4. Iteration (i, j) uses A[j+2] by reference 4, while iteration (i, j+1) uses A[j+1] by reference 3, which is nearby.

for i from 0 to max_i: for j from 0 to max_j: A[i] /*1*/ += A[j] /*2*/ + A[j+1] /*3*/ + A[j+2] /*4*/

- This paper only considers three kinds of loop transformations
**Tiling**: one loop into a pair of loops, where the size of the inner-loop is statically fixed.- Helps if the inner-loop accesses fit in the cache.

for i from 0 to max_i: f(i) for i from 0 to max_i by stride: # stride is fixed at compile-time # so this could be unrolled. for j from i to i + stride: f(j)

**Skewing**: Turn a horizontal-then-vertical iteration into a diagonal-then-diagonal iteration.- Iterating the same nodes in a different order can help remove a loop-dependency. However, reordering is not always valid; iterations must still be evaluated after their dependents.

for i from 0 to max_i: for j from 0 to max_j: # go left-to-right, then top-to-bottom f(i, j) for sum_ij from 0 to max_i + max_j: for dif_ij from ... to ...: # go southwest-to-northeast, then northwest-to-southest f(sum_ij - dif_ij, sum_ij + dif_ij)

**Loop interchange**: Switch the order of two nested loops.- This can help exploit locality.

for i from 0 to max_i: for j from 0 to max_j: # if A is row-major, interchanging these two loops leads to greater spatial locality. A[j, i] += 1

for j from 0 to max_j: for i from 0 to max_i: A[j, i] += 1

## Solution

**Node**: A single iteration of the loop, represented as a vector (specifies values for each loop-counter).**Dependence vector**: Node*a + v*depends on node*a*.**Lexicographic order**:- We assume that nodes are iterated in lexicographic order.
- Dependence vectors must be
**lexicographically greater**than the zero-vector (aka**lexicographically positive**). Otherwise*a + v*is evaluated before its dependency*a*is evaluated. For example, (i,j) can depend on (i-1,j+1) (previous row, next column), which has already been evaluated, but not (i+1,j-1), (next row, previous column).

**Loop transformation**: A matrix that maps nodes in the old loop to nodes in the new loop, . It must be where*n*is the number of loops, have integer entries (because the input and output must be integers), and have a determinant of (so it doesn't*stretch*the coordinates to skip numbers numbers, like would). This class of matrices is called*unimodular*.- A loop transformation is
**legal**if all dependency vectors remain*lexicographically positive*after transformation.- For the special case of loop-interchange, two loops
*i*and*j*are**interchangeable**(aka*permutable*) if all dependence vectors are lexicographically positive, before reaching coordinate . E.g. (0, 0, 1, 0, -1, 1, 0). Everything after the third loop is permutable, since the third coordinate makes the whole thing lexicographically positive.

- For the special case of loop-interchange, two loops
**Uniformly generated references**: a reference to array*A*is generated by matrix*H*if it can be written as , where*i*are the coordinates of the node and*c*is constant.- The authors If two references are generated by
*different*H-matrices, they might intersect, but it is rare. The authors are primarily concerned with reuse between references with the same H, aka**uniformly generated reference sets (UGRS)**

- The authors If two references are generated by
- Self-temporal reuse exists in a UGRS if there are two iterations such that . Then is the
**reuse vector**(like the dependence vector). This is exactly . - Self-spatial reuse is harder. The authors say that temporal reuse has to exist in the
*inner-most loop*, because by the time the outer loops recur, the cache has been blasted by the inner-loop's accesses. So they define as*H*with the last row zeroed out. Self-spatial reuse is the , asking if there are any accesses to the same row (ignoring the column). - Group-temporal reuse is like self-temporal, except we are considering two references with different vectors. Reuse exists if there are infinite solutions to (for self-temporal, these are equal).
*H*is invertible, so there is certainly one solution; There are infinite solutions if where*L*is the vector-space of all nodes. - Group-spatial reuse is the same as group-temporal but with .
- The authors find a formula to count the amount of reuse they detected with the previous methods.

- A loop transformation is

### Algorithm

- Separate the loop-nest into inner-loops which are all mutually interchangeable and outer-loops which are not.
- Some loops are not possible (I'm not sure why).

- Place the inner loops in an order that maximizes reuse, which can be counted according to a formula.
- This is grows with the number of loops, but the number of loops is generally small in practice.

## Evaluation

- The authors implement their algorithm in the Stanford SUIF compiler.
- The authors show speedup on LU decomposition and successive over-relaxation.