Using automatic persistent memoization to facilitate data analysis scripting


Citation: Philip J Guo, Dawson Engler (2011/07) Using automatic persistent memoization to facilitate data analysis scripting. ISSTA '11: Proceedings of the 2011 International Symposium on Software Testing and Analysis (RSS)
DOI (original publisher): 10.1145/2001420.2001455
Tagged: Computer Science, software engineering


  • Programmers across a wide range of disciplines write scripts to transform, process, or analyze data; these scripts are divided into stages with intermediate results, and development is iterative.
  • Data-analysis is often ad hoc and exploratory.
  • Data-analysts have a wide range of backgrounds; not strictly CS/engineering.
  • They primarily use Python, Perl, or Ruby.
  • Saving intermediate outputs by hand is time-consuming and error-prone; users forget to invalidate stale datasets.
  • Problem: how can these programs be sped up by saving intermediate outputs, without those saved outputs becoming stale or confusing?

Solution: IncPy

IncPy is a modified Python interpreter that automatically memoizes functions. It invalidates a function's cache if its source code changes, and invalidates an entry if a file it depends on changes.
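The mechanism can be sketched at user level as a decorator. Everything below (the `persistent_memo` name, the cache directory, hashing the bytecode) is an illustrative assumption; IncPy itself does this transparently inside the interpreter, with no decorator or code changes:

```python
import hashlib
import os
import pickle

CACHE_DIR = "incpy_cache"  # hypothetical cache directory for this sketch

def persistent_memo(file_deps=()):
    """Persistently cache a function's results, invalidating when the
    function's bytecode or any listed input file changes (a user-level
    approximation of what IncPy does automatically)."""
    def decorator(func):
        # Hash the bytecode so that editing the function invalidates its cache.
        code_hash = hashlib.sha1(func.__code__.co_code).hexdigest()
        def wrapper(*args):
            # Fold input-file modification times into the key so that a
            # changed file automatically misses the old entry.
            dep_mtimes = tuple(os.path.getmtime(f) for f in file_deps)
            key = hashlib.sha1(
                pickle.dumps((func.__name__, code_hash, dep_mtimes, args))
            ).hexdigest()
            path = os.path.join(CACHE_DIR, key + ".pkl")
            if os.path.exists(path):        # hit: reuse the stored result
                with open(path, "rb") as fh:
                    return pickle.load(fh)
            result = func(*args)            # miss: compute, then store
            os.makedirs(CACHE_DIR, exist_ok=True)
            with open(path, "wb") as fh:
                pickle.dump(result, fh)
            return result
        return wrapper
    return decorator
```

Because entries live on disk, results survive interpreter restarts, which is what makes iterative script development faster.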

  • Benefits:
    • Less code (no de/serialization or cache testing)
    • Automated data management
    • Faster iteration times
  • Source code
  • Goals:
    • No code changes (change interpreter instead)
    • Low run-time overhead
    • Compatible with 3rd party libraries.
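The "less code" benefit refers to hand-rolled caching like the following sketch (the `load_dataset` example and its staleness test are illustrative, not from the paper); all of this serialization and invalidation boilerplate disappears under IncPy:

```python
import os
import pickle

def load_dataset(path):
    """Hand-rolled persistent caching of one pipeline stage."""
    cache = path + ".cache.pkl"
    # Manual staleness test: the user must remember every input file and
    # compare modification times, or risk silently using stale results.
    if (os.path.exists(cache)
            and os.path.getmtime(cache) >= os.path.getmtime(path)):
        with open(cache, "rb") as fh:
            return pickle.load(fh)             # deserialize saved intermediate
    with open(path) as fh:
        data = [line.split() for line in fh]   # the actual (expensive) work
    with open(cache, "wb") as fh:
        pickle.dump(data, fh)                  # serialize intermediate by hand
    return data
```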


  • Each entry contains: function name, arguments, return, stdout, stderr, global vars dependencies, files read, files written, transitive code dependencies (name and bytecode).
  • The interpreter's file I/O calls are modified to record file accesses for all functions on the call stack.
  • Each entry is a file named with a hash of the inputs; entries are written atomically so that multiple threads/processes can share the same cache.
  • Entries are written eagerly; if interpreter crashes, still have entries.
  • Which calls to memoize: Only pure, time-consuming calls.
    • When an impurity is detected (mutation of a closure variable, global, or argument; or a source of non-determinism such as a PRNG, stdin, or the time), all functions on the call stack become unmemoizable.
      • Functions that are pure on some paths but impure on others can still be memoized for the pure paths, on which IncPy has dynamically proven purity.
      • File IO does not count as impurity since these are captured in the entry.
    • Heuristic: if a call takes less than 1 second, the overhead of memoization (disk reads/writes) is often not worthwhile; otherwise, memoize it.
      • The heuristic breaks down if the returned data is very large.
  • Reachability analysis determines whether a global is mutated (an impurity) or merely read (a dependency).
    • Each object gets two extra fields: global name and func start.
      • Global name records which global the object is reachable from.
      • Func start records the time at which the object was created.
      • On a read or write of an object whose func start predates the current function, the object is marked as a dependency or an impurity, respectively.


  • For dynamic languages, interpose on the interpreter's handlers for function calls, value accesses, and file I/O (as IncPy does).
  • For static languages, use a source-to-source translator or a binary rewriter that inserts callbacks.
    • This could be augmented with static purity, escape, and call-graph analyses.
  • For whole workflows, use ptrace to identify file reads and writes.
  • The technique could perhaps also be used to accelerate regression tests.
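In Python, the call-interposition point the first bullet refers to can be approximated at user level with the standard `sys.settrace` hook, which the interpreter invokes on every function-call event. This is a toy sketch of the idea, not how IncPy is built (IncPy modifies the interpreter itself, avoiding the heavy overhead of tracing):

```python
import sys

call_log = []  # records the name of each function as it is called

def tracer(frame, event, arg):
    # The interpreter fires a "call" event on entry to every frame; this is
    # the point where a memoizer could consult its cache or record
    # dependencies for the function being entered.
    if event == "call":
        call_log.append(frame.f_code.co_name)
    return None  # no per-line tracing needed

def analysis_step(x):
    return x + 1

sys.settrace(tracer)
result = analysis_step(41)
sys.settrace(None)  # stop interposing
```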

Related Work

  • Make, but Make requires explicit dependency information and only works at the granularity of whole processes.
  • Self-adjusting computation, but IncPy is more general.
  • JIT compilation, though it focuses on micro-optimizations rather than memoization, and its optimizations do not outlive the process.

Future Work

  • Provenance browsing
  • Network‐aware/database‐aware caching
  • Lightweight programmer annotations
  • Finer‐grained tracking within functions
