# A data locality optimizing algorithm

Citation: Michael E. Wolf, Monica S. Lam (1991/06/26). A data locality optimizing algorithm. ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI)
DOI: 10.1145/113445.113449
Semantic Scholar: 10.1145/113445.113449

Summary:

This paper introduces the unimodular-transformation framework for maximizing data locality in nested loops.

## Problem

• Many loop optimizations exist. Which should be applied?
• How do we determine the legality and the benefit of an optimization?
• A greedy solution won't work: one might need to apply loop skewing (no benefit on its own) in order to enable loop tiling (big benefit).
• An exhaustive search is too slow: the search space can be infinite.

## Background

• The advantage in this paper comes from exploiting temporal and spatial locality in the cache: if the processor can reuse data from the cache, it doesn't have to go back and forth to main memory.
• Self-temporal reuse: The same element is reused by this reference. E.g. Reference 1 in the example below; iteration (i, j) and iteration (i, j+1) use the same A[i].
• Self-spatial reuse: Nearby elements are used by this reference. Superset of self-temporal reuse. E.g. Reference 2: iteration (i, j) uses A[j], while iteration (i, j+1) uses A[j+1], which is nearby.
• Group-temporal reuse: Same element is reused, by other references. E.g. Reference 2 and 3: iteration (i, j) uses A[j+1] by reference 2, and iteration (i, j+1) uses A[j+1] by reference 3.
• Group-spatial reuse: Nearby elements are reused by other references. Superset of group-temporal reuse. E.g. Reference 3 and 4: iteration (i, j) uses A[j+2] by reference 4, while iteration (i, j+1) uses A[j+1] by reference 3, which is nearby.
```
for i from 0 to max_i:
    for j from 0 to max_j:
        A[i] /*1*/  +=  A[j] /*2*/  +  A[j+1] /*3*/  +  A[j+2] /*4*/
```

• This paper considers only three kinds of loop transformations.
• Tiling: split one loop into a pair of loops, where the trip count of the inner loop is statically fixed.
• Helps if the inner loop's accesses fit in the cache.
```
for i from 0 to max_i:
    f(i)
```

```
for i from 0 to max_i by stride:
    # stride is fixed at compile-time,
    # so this loop could be unrolled.
    for j from i to i + stride:
        f(j)
```

• Skewing: Turn a horizontal-then-vertical iteration into a diagonal-then-diagonal iteration.
• Iterating the same nodes in a different order can help remove a loop-dependency. However, reordering is not always valid; iterations must still be evaluated after their dependents.
```
for i from 0 to max_i:
    for j from 0 to max_j:
        # go left-to-right, then top-to-bottom
        f(i, j)
```

```
for sum_ij from 0 to max_i + max_j:
    for dif_ij from ... to ...:
        # go southwest-to-northeast, then northwest-to-southeast
        f(sum_ij - dif_ij, sum_ij + dif_ij)
```

• Loop interchange: Switch the order of two nested loops.
• This can help exploit locality.
```
for i from 0 to max_i:
    for j from 0 to max_j:
        # if A is row-major, interchanging these two loops leads to greater spatial locality
        A[j, i] += 1
```

```
for j from 0 to max_j:
    for i from 0 to max_i:
        A[j, i] += 1
```


## Solution

• Node: A single iteration of the loop nest, represented as a vector (one component per loop counter).
• Dependence vector: ${\displaystyle {\vec {v}}}$ is a dependence vector if node ${\displaystyle a+{\vec {v}}}$ depends on node ${\displaystyle a}$.
• Lexicographic order: ${\displaystyle (0,0,0)<(0,0,1)<(0,0,2)<\dots <(0,0,n-1)<(0,1,0)<(0,1,1)<(0,1,2)<\dots <(0,m-1,n-1)<(1,0,0)<\dots <(k-1,m-1,n-1)}$
• We assume that nodes are iterated in lexicographic order.
• Dependence vectors must be lexicographically greater than the zero vector (aka lexicographically positive); otherwise ${\displaystyle a+{\vec {v}}}$ is evaluated before its dependency ${\displaystyle a}$. For example, (i, j) can depend on (i-1, j+1) (previous row, next column), which has already been evaluated, but not on (i+1, j-1) (next row, previous column).
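The positivity test is easy to state in code. A minimal sketch in plain Python (the function name is my own):

```python
def is_lex_positive(v):
    """A vector is lexicographically positive iff its first nonzero
    component is positive; the zero vector does not qualify."""
    for x in v:
        if x != 0:
            return x > 0
    return False

print(is_lex_positive((1, -1)))   # True:  depending on (previous row, next column) is fine
print(is_lex_positive((-1, 1)))   # False: that node hasn't been evaluated yet
print(is_lex_positive((0, 0)))    # False: a node can't depend on itself
```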
• Loop transformation: A matrix that maps nodes in the old loop to nodes in the new loop, ${\displaystyle {\vec {v}}\mapsto A{\vec {v}}}$. It must be ${\displaystyle n\times n}$ where n is the number of loops, have integer entries (because the input and output must be integers), and have a determinant of ${\displaystyle \pm 1}$ (so it doesn't stretch the coordinates and skip integer points, as ${\displaystyle {\begin{pmatrix}2&0\\0&2\\\end{pmatrix}}}$ would). This class of matrices is called unimodular.
• A loop transformation is legal if all dependency vectors remain lexicographically positive after transformation.
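A sketch of that legality check in plain Python (function names are mine): transform every dependence vector and test that it stays lexicographically positive. The example also shows how skewing can repair a dependence that makes interchange illegal on its own.

```python
def lex_positive(v):
    # first nonzero component must be positive
    for x in v:
        if x != 0:
            return x > 0
    return False

def transform(T, v):
    # matrix-vector product: apply T to dependence vector v
    return tuple(sum(t * x for t, x in zip(row, v)) for row in T)

def is_legal(T, deps):
    """T is legal iff every transformed dependence vector stays
    lexicographically positive."""
    return all(lex_positive(transform(T, d)) for d in deps)

interchange = [[0, 1],
               [1, 0]]              # (i, j) -> (j, i)
skew        = [[1, 0],
               [1, 1]]              # (i, j) -> (i, i + j)

deps = [(1, -1)]                    # depend on (previous row, next column)
print(is_legal(interchange, deps))  # False: (1, -1) -> (-1, 1)
print(is_legal(skew, deps))         # True:  (1, -1) -> (1, 0)
```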
• For the special case of loop interchange, two loops i and j are interchangeable (aka permutable) if all dependence vectors are lexicographically positive before reaching coordinate ${\displaystyle \min(i,j)}$. E.g. for (0, 0, 1, 0, -1, 1, 0), everything after the third loop is permutable, since the third coordinate already makes the vector lexicographically positive.
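A sketch of that prefix test (my own formulation, assuming every dependence vector is already lexicographically positive): for each vector, find the coordinate whose prefix forces positivity; every loop after the latest such coordinate can be reordered freely.

```python
def permutable_from(deps):
    """Return the loop index from which the remaining loops can be freely
    reordered: for every dependence vector, the coordinates before that
    index already make the vector lexicographically positive."""
    start = 0
    for d in deps:
        k = next(i for i, x in enumerate(d) if x != 0)  # first nonzero entry
        start = max(start, k + 1)
    return start

print(permutable_from([(0, 0, 1, 0, -1, 1, 0)]))  # 3: loops 3..6 may be reordered
print(permutable_from([(1, -1)]))                 # 1: only the inner loop is free
```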
• Uniformly generated references: a reference to array A is generated by matrix H if it can be written as ${\displaystyle \mathrm {A} [H{\vec {i}}+{\vec {c}}]}$, where i are the coordinates of the node and c is constant.
• If two references are generated by different H-matrices, they might still touch the same data, but this is rare. The authors are primarily concerned with reuse between references with the same H, aka uniformly generated reference sets (UGRS).
• Self-temporal reuse exists in a UGRS if there are two iterations ${\displaystyle {\vec {i}},{\vec {i_{2}}}}$ such that ${\displaystyle H({\vec {i}}-{\vec {i_{2}}})=0}$. Then ${\displaystyle {\vec {r}}={\vec {i}}-{\vec {i_{2}}}}$ is the reuse vector (like the dependence vector). This is exactly ${\displaystyle \ker H}$.
• Self-spatial reuse is harder. The authors argue that temporal reuse has to be exploited in the innermost loop, because by the time the outer loops come back around, the cache has been overwritten by the inner loop's accesses. So they define ${\displaystyle H_{s}}$ as H with the last row zeroed out. Self-spatial reuse is ${\displaystyle \ker H_{s}}$: are there accesses to the same row (ignoring the column)?
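A small sketch of both kernel tests in plain Python (function names are mine): a candidate reuse vector r gives self-temporal reuse when H r = 0, and self-spatial reuse when H_s r = 0, with H_s being H minus its last row.

```python
def matvec(H, r):
    # matrix-vector product over plain lists
    return [sum(h * x for h, x in zip(row, r)) for row in H]

def self_temporal(H, r):
    """r is a self-temporal reuse vector iff r is in ker H."""
    return all(v == 0 for v in matvec(H, r))

def self_spatial(H, r):
    """Zero H's last row (the fastest-varying array index) and ask
    the same question: is r in ker H_s?"""
    Hs = H[:-1] + [[0] * len(H[0])]
    return all(v == 0 for v in matvec(Hs, r))

H_Aj  = [[0, 1]]             # reference A[j] in a loop nest over (i, j)
H_Aij = [[1, 0], [0, 1]]     # reference A[i, j]

print(self_temporal(H_Aj,  (1, 0)))   # True:  A[j] repeats along i
print(self_temporal(H_Aij, (0, 1)))   # False: A[i, j] never repeats
print(self_spatial(H_Aij,  (0, 1)))   # True:  same row, adjacent column
```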
• Group-temporal reuse is like self-temporal reuse, except that we consider two references with different ${\displaystyle {\vec {c}},{\vec {c_{2}}}}$ vectors. Reuse exists if there are infinitely many solutions to ${\displaystyle H{\vec {r}}={\vec {c}}-{\vec {c_{2}}}}$ (for self-temporal reuse, the two c-vectors are equal). If H is invertible there is exactly one solution; there are infinitely many solutions if ${\displaystyle (\mathrm {span} \,{\vec {r}}+\ker H)\cap L\neq \ker H\cap L}$, where L is the vector space of all nodes.
• Group-spatial reuse is the same as group-temporal but with ${\displaystyle H_{s}}$.
• The authors derive a formula to count the amount of reuse detected by the previous methods.

### Algorithm

1. Separate the loop nest into inner loops, which are all mutually interchangeable, and outer loops, which are not.
• Some loops cannot be included (I'm not sure why).
2. Place the inner loops in an order that maximizes reuse, counted according to the formula.
• This search grows with the number of loop orderings (factorial in the number of loops), but the number of loops is generally small in practice.
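Since the permutable inner loops are few, step 2 can afford brute force over orderings. A hedged sketch in Python, where `reuse_score` stands in for the paper's counting formula (not reproduced here):

```python
from itertools import permutations

def best_inner_order(loops, reuse_score):
    """Try every ordering of the (few) inner loops and keep the one
    with the highest estimated reuse."""
    return max(permutations(loops), key=reuse_score)

# Hypothetical score: putting j innermost (last) maximizes spatial reuse.
score = lambda order: 10 if order[-1] == "j" else 1
print(best_inner_order(["i", "j", "k"], score))  # ('i', 'k', 'j')
```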