A data locality optimizing algorithm
From AcaWiki
Citation: Micheal E. Wolf, Monica S. Lam (1991/06/26) A data locality optimizing algorithm. ACM SIGPLAN Conference on Programming Language Design and Implementation (RSS)
DOI (original publisher): 10.1145/113445.113449
Semantic Scholar (metadata): 10.1145/113445.113449
Sci-Hub (fulltext): 10.1145/113445.113449
Internet Archive Scholar (search for fulltext): A data locality optimizing algorithm
Download: https://dl.acm.org/doi/pdf/10.1145/113445.113449
Tagged: Computer Science
(RSS) compilers (RSS)
Summary
This paper introduces the unimodular framework to maximize data-locality in nested loops.
Elsewhere
Problem
- Many loop-optimizations. Which to apply?
- How to determine legality and advantage of optimizations?
- Greedy solution won't work. One might need to do optimization loop-skewing (no benefit) in order to do loop-tiling (big benefit).
- Exhaustive solution is too slow. Search-space could be infinite.
Background
- Advantage in this paper, comes from exploiting temporal and spatial locality in the cache. If the processor can reuse data from the cache, it doesn't have to go back/forth to main memory.
- Self-temporal reuse: The same element is reused by this reference. E.g. Reference 1 in the example below; iteration (i, j) and iteration (i, j+1) use the same A[i].
- Self-spatial reuse: Nearby elements are used by this reference. Superset of self-temporal reuse. E.g. Reference 2: iteration (i, j) uses A[j], while the A[j+1] iteration uses A[j+1], which is nearby.
- Group-temporal reuse: Same element is reused, by other references. E.g. Reference 2 and 3: iteration (i, j) uses A[j+1] by reference 2, and iteration (i, j+1) uses A[j+1] by reference 3.
- Group-spatial reuse: Nearby elements are reused by other references. Superset of group-spatial reuse. E.g. Reference 3 and 4. Iteration (i, j) uses A[j+2] by reference 4, while iteration (i, j+1) uses A[j+1] by reference 3, which is nearby.
for i from 0 to max_i: for j from 0 to max_j: A[i] /*1*/ += A[j] /*2*/ + A[j+1] /*3*/ + A[j+2] /*4*/
- This paper only considers three kinds of loop transformations
- Tiling: one loop into a pair of loops, where the size of the inner-loop is statically fixed.
- Helps if the inner-loop accesses fit in the cache.
- Tiling: one loop into a pair of loops, where the size of the inner-loop is statically fixed.
for i from 0 to max_i: f(i) for i from 0 to max_i by stride: # stride is fixed at compile-time # so this could be unrolled. for j from i to i + stride: f(j)
- Skewing: Turn a horizontal-then-vertical iteration into a diagonal-then-diagonal iteration.
- Iterating the same nodes in a different order can help remove a loop-dependency. However, reordering is not always valid; iterations must still be evaluated after their dependents.
- Skewing: Turn a horizontal-then-vertical iteration into a diagonal-then-diagonal iteration.
for i from 0 to max_i: for j from 0 to max_j: # go left-to-right, then top-to-bottom f(i, j) for sum_ij from 0 to max_i + max_j: for dif_ij from ... to ...: # go southwest-to-northeast, then northwest-to-southest f(sum_ij - dif_ij, sum_ij + dif_ij)
- Loop interchange: Switch the order of two nested loops.
- This can help exploit locality.
- Loop interchange: Switch the order of two nested loops.
for i from 0 to max_i: for j from 0 to max_j: # if A is row-major, interchanging these two loops leads to greater spatial locality. A[j, i] += 1
for j from 0 to max_j: for i from 0 to max_i: A[j, i] += 1
Solution
- Node: A single iteration of the loop, represented as a vector (specifies values for each loop-counter).
- Dependence vector: Node a + v depends on node a.
- Lexicographic order:
- We assume that nodes are iterated in lexicographic order.
- Dependence vectors must be lexicographically greater than the zero-vector (aka lexicographically positive). Otherwise a + v is evaluated before its dependency a is evaluated. For example, (i,j) can depend on (i-1,j+1) (previous row, next column), which has already been evaluated, but not (i+1,j-1), (next row, previous column).
- Loop transformation: A matrix that maps nodes in the old loop to nodes in the new loop, . It must be where n is the number of loops, have integer entries (because the input and output must be integers), and have a determinant of (so it doesn't stretch the coordinates to skip numbers numbers, like would). This class of matrices is called unimodular.
- A loop transformation is legal if all dependency vectors remain lexicographically positive after transformation.
- For the special case of loop-interchange, two loops i and j are interchangeable (aka permutable) if all dependence vectors are lexicographically positive, before reaching coordinate . E.g. (0, 0, 1, 0, -1, 1, 0). Everything after the third loop is permutable, since the third coordinate makes the whole thing lexicographically positive.
- Uniformly generated references: a reference to array A is generated by matrix H if it can be written as , where i are the coordinates of the node and c is constant.
- The authors If two references are generated by different H-matrices, they might intersect, but it is rare. The authors are primarily concerned with reuse between references with the same H, aka uniformly generated reference sets (UGRS)
- Self-temporal reuse exists in a UGRS if there are two iterations such that . Then is the reuse vector (like the dependence vector). This is exactly .
- Self-spatial reuse is harder. The authors say that temporal reuse has to exist in the inner-most loop, because by the time the outer loops recur, the cache has been blasted by the inner-loop's accesses. So they define as H with the last row zeroed out. Self-spatial reuse is the , asking if there are any accesses to the same row (ignoring the column).
- Group-temporal reuse is like self-temporal, except we are considering two references with different vectors. Reuse exists if there are infinite solutions to (for self-temporal, these are equal). H is invertible, so there is certainly one solution; There are infinite solutions if where L is the vector-space of all nodes.
- Group-spatial reuse is the same as group-temporal but with .
- The authors find a formula to count the amount of reuse they detected with the previous methods.
- A loop transformation is legal if all dependency vectors remain lexicographically positive after transformation.
Algorithm
- Separate the loop-nest into inner-loops which are all mutually interchangeable and outer-loops which are not.
- Some loops are not possible (I'm not sure why).
- Place the inner loops in an order that maximizes reuse, which can be counted according to a formula.
- This is grows with the number of loops, but the number of loops is generally small in practice.
Evaluation
- The authors implement their algorithm in the Stanford SUIF compiler.
- The authors show speedup on LU decomposition and successive over-relaxation.