A data locality optimizing algorithm

From AcaWiki
Jump to: navigation, search


Citation: Micheal E. Wolf, Monica S. Lam (1991/06/26) A data locality optimizing algorithm. ACM SIGPLAN Conference on Programming Language Design and Implementation (RSS)
DOI (original publisher): 10.1145/113445.113449
Semantic Scholar (metadata): 10.1145/113445.113449
Sci-Hub (fulltext): 10.1145/113445.113449
Internet Archive Scholar (fulltext): A data locality optimizing algorithm
Download: https://dl.acm.org/doi/pdf/10.1145/113445.113449
Tagged: Computer Science (RSS) compilers (RSS)


Summary:

This paper introduces the unimodular framework to maximize data-locality in nested loops.




Elsewhere

Problem

  • Many loop-optimizations. Which to apply?
    • How to determine legality and advantage of optimizations?
    • Greedy solution won't work. One might need to do optimization loop-skewing (no benefit) in order to do loop-tiling (big benefit).
    • Exhaustive solution is too slow. Search-space could be infinite.

Background

  • Advantage in this paper, comes from exploiting temporal and spatial locality in the cache. If the processor can reuse data from the cache, it doesn't have to go back/forth to main memory.
    • Self-temporal reuse: The same element is reused by this reference. E.g. Reference 1 in the example below; iteration (i, j) and iteration (i, j+1) use the same A[i].
    • Self-spatial reuse: Nearby elements are used by this reference. Superset of self-temporal reuse. E.g. Reference 2: iteration (i, j) uses A[j], while the A[j+1] iteration uses A[j+1], which is nearby.
    • Group-temporal reuse: Same element is reused, by other references. E.g. Reference 2 and 3: iteration (i, j) uses A[j+1] by reference 2, and iteration (i, j+1) uses A[j+1] by reference 3.
    • Group-spatial reuse: Nearby elements are reused by other references. Superset of group-spatial reuse. E.g. Reference 3 and 4. Iteration (i, j) uses A[j+2] by reference 4, while iteration (i, j+1) uses A[j+1] by reference 3, which is nearby.
for i from 0 to max_i:
  for j from 0 to max_j:
    A[i] /*1*/   +=    A[j] /*2*/   +   A[j+1] /*3*/   +   A[j+2] /*4*/
  • This paper only considers three kinds of loop transformations
    • Tiling: one loop into a pair of loops, where the size of the inner-loop is statically fixed.
      • Helps if the inner-loop accesses fit in the cache.
for i from 0 to max_i:
  f(i)

for i from 0 to max_i by stride:
  # stride is fixed at compile-time
  # so this could be unrolled.
  for j from i to i + stride:
    f(j)
    • Skewing: Turn a horizontal-then-vertical iteration into a diagonal-then-diagonal iteration.
      • Iterating the same nodes in a different order can help remove a loop-dependency. However, reordering is not always valid; iterations must still be evaluated after their dependents.
for i from 0 to max_i:
  for j from 0 to max_j:
    # go left-to-right, then top-to-bottom
    f(i, j)

for sum_ij from 0 to max_i + max_j:
  for dif_ij from ... to ...:
    # go southwest-to-northeast, then northwest-to-southest
    f(sum_ij - dif_ij, sum_ij + dif_ij)
    • Loop interchange: Switch the order of two nested loops.
      • This can help exploit locality.
for i from 0 to max_i:
  for j from 0 to max_j:
    # if A is row-major, interchanging these two loops leads to greater spatial locality.
    A[j, i] += 1
for j from 0 to max_j:
  for i from 0 to max_i:
    A[j, i] += 1

Solution

  • Node: A single iteration of the loop, represented as a vector (specifies values for each loop-counter).
  • Dependence vector: Node a + v depends on node a.
  • Lexicographic order:
    • We assume that nodes are iterated in lexicographic order.
    • Dependence vectors must be lexicographically greater than the zero-vector (aka lexicographically positive). Otherwise a + v is evaluated before its dependency a is evaluated. For example, (i,j) can depend on (i-1,j+1) (previous row, next column), which has already been evaluated, but not (i+1,j-1), (next row, previous column).
  • Loop transformation: A matrix that maps nodes in the old loop to nodes in the new loop, . It must be where n is the number of loops, have integer entries (because the input and output must be integers), and have a determinant of (so it doesn't stretch the coordinates to skip numbers numbers, like would). This class of matrices is called unimodular.
    • A loop transformation is legal if all dependency vectors remain lexicographically positive after transformation.
      • For the special case of loop-interchange, two loops i and j are interchangeable (aka permutable) if all dependence vectors are lexicographically positive, before reaching coordinate . E.g. (0, 0, 1, 0, -1, 1, 0). Everything after the third loop is permutable, since the third coordinate makes the whole thing lexicographically positive.
    • Uniformly generated references: a reference to array A is generated by matrix H if it can be written as , where i are the coordinates of the node and c is constant.
      • The authors If two references are generated by different H-matrices, they might intersect, but it is rare. The authors are primarily concerned with reuse between references with the same H, aka uniformly generated reference sets (UGRS)
    • Self-temporal reuse exists in a UGRS if there are two iterations such that . Then is the reuse vector (like the dependence vector). This is exactly .
    • Self-spatial reuse is harder. The authors say that temporal reuse has to exist in the inner-most loop, because by the time the outer loops recur, the cache has been blasted by the inner-loop's accesses. So they define as H with the last row zeroed out. Self-spatial reuse is the , asking if there are any accesses to the same row (ignoring the column).
    • Group-temporal reuse is like self-temporal, except we are considering two references with different vectors. Reuse exists if there are infinite solutions to (for self-temporal, these are equal). H is invertible, so there is certainly one solution; There are infinite solutions if where L is the vector-space of all nodes.
    • Group-spatial reuse is the same as group-temporal but with .
    • The authors find a formula to count the amount of reuse they detected with the previous methods.

Algorithm

  1. Separate the loop-nest into inner-loops which are all mutually interchangeable and outer-loops which are not.
    • Some loops are not possible (I'm not sure why).
  2. Place the inner loops in an order that maximizes reuse, which can be counted according to a formula.
    • This is grows with the number of loops, but the number of loops is generally small in practice.

Evaluation