Citation: Micheal E. Wolf, Monica S. Lam (1991/06/26) A data locality optimizing algorithm. ACM SIGPLAN Conference on Programming Language Design and Implementation (RSS)
DOI (original publisher): 10.1145/113445.113449
Semantic Scholar (metadata): 10.1145/113445.113449
Sci-Hub (fulltext): 10.1145/113445.113449
Internet Archive Scholar (search for fulltext): A data locality optimizing algorithm
Download: https://dl.acm.org/doi/pdf/10.1145/113445.113449
Tagged: Computer Science (RSS) compilers (RSS)

Summary

This paper introduces the unimodular framework to maximize data-locality in nested loops.

Elsewhere

Problem

Many loop-optimizations. Which to apply?
- How to determine legality and advantage of optimizations?
- Greedy solution won't work. One might need to do optimization loop-skewing (no benefit) in order to do loop-tiling (big benefit).
- Exhaustive solution is too slow. Search-space could be infinite.

Background

Advantage in this paper, comes from exploiting temporal and spatial locality in the cache. If the processor can reuse data from the cache, it doesn't have to go back/forth to main memory.
- Self-temporal reuse: The same element is reused by this reference. E.g. Reference 1 in the example below; iteration (i, j) and iteration (i, j+1) use the same A[i].
- Self-spatial reuse: Nearby elements are used by this reference. Superset of self-temporal reuse. E.g. Reference 2: iteration (i, j) uses A[j], while the A[j+1] iteration uses A[j+1], which is nearby.
- Group-temporal reuse: Same element is reused, by other references. E.g. Reference 2 and 3: iteration (i, j) uses A[j+1] by reference 2, and iteration (i, j+1) uses A[j+1] by reference 3.
- Group-spatial reuse: Nearby elements are reused by other references. Superset of group-spatial reuse. E.g. Reference 3 and 4. Iteration (i, j) uses A[j+2] by reference 4, while iteration (i, j+1) uses A[j+1] by reference 3, which is nearby.

for i from 0 to max_i:
  for j from 0 to max_j:
    A[i] /*1*/   +=    A[j] /*2*/   +   A[j+1] /*3*/   +   A[j+2] /*4*/

This paper only considers three kinds of loop transformations
- Tiling: one loop into a pair of loops, where the size of the inner-loop is statically fixed.
  - Helps if the inner-loop accesses fit in the cache.

for i from 0 to max_i:
  f(i)

for i from 0 to max_i by stride:
  # stride is fixed at compile-time
  # so this could be unrolled.
  for j from i to i + stride:
    f(j)

- Skewing: Turn a horizontal-then-vertical iteration into a diagonal-then-diagonal iteration.
  - Iterating the same nodes in a different order can help remove a loop-dependency. However, reordering is not always valid; iterations must still be evaluated after their dependents.

for i from 0 to max_i:
  for j from 0 to max_j:
    # go left-to-right, then top-to-bottom
    f(i, j)

for sum_ij from 0 to max_i + max_j:
  for dif_ij from ... to ...:
    # go southwest-to-northeast, then northwest-to-southest
    f(sum_ij - dif_ij, sum_ij + dif_ij)

- Loop interchange: Switch the order of two nested loops.
  - This can help exploit locality.

for i from 0 to max_i:
  for j from 0 to max_j:
    # if A is row-major, interchanging these two loops leads to greater spatial locality.
    A[j, i] += 1

for j from 0 to max_j:
  for i from 0 to max_i:
    A[j, i] += 1

Solution

Node: A single iteration of the loop, represented as a vector (specifies values for each loop-counter).
Dependence vector: Node a + v depends on node a.
Lexicographic order: $(0,0,0)<(0,0,1)<(0,0,2)<\dots (0,0,n-1)<(0,1,0)<(0,1,1)<(0,1,2)<\dots (0,m-1,n-1)<(1,0,0)<\dots <(k-1,m-1,n-1)$ $(0,0,0)<(0,0,1)<(0,0,2)<\dots (0,0,n-1)<(0,1,0)<(0,1,1)<(0,1,2)<\dots (0,m-1,n-1)<(1,0,0)<\dots <(k-1,m-1,n-1)$
- We assume that nodes are iterated in lexicographic order.
- Dependence vectors must be lexicographically greater than the zero-vector (aka lexicographically positive). Otherwise a + v is evaluated before its dependency a is evaluated. For example, (i,j) can depend on (i-1,j+1) (previous row, next column), which has already been evaluated, but not (i+1,j-1), (next row, previous column).
Loop transformation: A matrix that maps nodes in the old loop to nodes in the new loop, ${\vec {v}}\mapsto A{\vec {v}}$ ${\vec {v}}\mapsto A{\vec {v}}$ . It must be $n\times n$ $n\times n$ where n is the number of loops, have integer entries (because the input and output must be integers), and have a determinant of $\pm 1$ $\pm 1$ (so it doesn't stretch the coordinates to skip numbers numbers, like ${\begin{pmatrix}2&0\\0&2\\\end{pmatrix}}$ ${\begin{pmatrix}2&0\\0&2\\\end{pmatrix}}$ would). This class of matrices is called unimodular.
- A loop transformation is legal if all dependency vectors remain lexicographically positive after transformation.
  - For the special case of loop-interchange, two loops i and j are interchangeable (aka permutable) if all dependence vectors are lexicographically positive, before reaching coordinate $\min(i,j)$ . E.g. (0, 0, 1, 0, -1, 1, 0). Everything after the third loop is permutable, since the third coordinate makes the whole thing lexicographically positive.
- Uniformly generated references: a reference to array A is generated by matrix H if it can be written as $\mathrm {A} [H{\vec {i}}+{\vec {c}}]$ $\mathrm {A} [H{\vec {i}}+{\vec {c}}]$ , where i are the coordinates of the node and c is constant.
  - The authors If two references are generated by different H-matrices, they might intersect, but it is rare. The authors are primarily concerned with reuse between references with the same H, aka uniformly generated reference sets (UGRS)
- Self-temporal reuse exists in a UGRS if there are two iterations ${\vec {i}},{\vec {i_{2}}}$ such that $H({\vec {i}}-{\vec {i_{2}}})=0$ . Then ${\vec {r}}={\vec {i}}-{\vec {i_{2}}}$ is the reuse vector (like the dependence vector). This is exactly $\ker H$ .
- Self-spatial reuse is harder. The authors say that temporal reuse has to exist in the inner-most loop, because by the time the outer loops recur, the cache has been blasted by the inner-loop's accesses. So they define $H_{S}$ as H with the last row zeroed out. Self-spatial reuse is the $\ker H_{s}$ , asking if there are any accesses to the same row (ignoring the column).
- Group-temporal reuse is like self-temporal, except we are considering two references with different ${\vec {c}},{\vec {c_{2}}}$ vectors. Reuse exists if there are infinite solutions to $H{\vec {r}}={\vec {c}}-{\vec {c_{2}}}$ (for self-temporal, these are equal). H is invertible, so there is certainly one solution; There are infinite solutions if $(\mathrm {span} {\vec {r}}+\ker H)\cap L\neq \ker H\cap L$ where L is the vector-space of all nodes.
- Group-spatial reuse is the same as group-temporal but with $H_{s}$ .
- The authors find a formula to count the amount of reuse they detected with the previous methods.

Algorithm

Separate the loop-nest into inner-loops which are all mutually interchangeable and outer-loops which are not.
- Some loops are not possible (I'm not sure why).
Place the inner loops in an order that maximizes reuse, which can be counted according to a formula.
- This is grows with the number of loops, but the number of loops is generally small in practice.

Evaluation

The authors implement their algorithm in the Stanford SUIF compiler.
The authors show speedup on LU decomposition and successive over-relaxation.

A data locality optimizing algorithm

Summary

Elsewhere

Problem

Background

Solution

Algorithm

Evaluation

Navigation menu

Personal tools

Namespaces

Variants

Views

More

Search

Navigation

New

Discussion

Help

Tools