Using automatic persistent memoization to facilitate data analysis scripting
Citation: Philip J Guo, Dawson Engler (2011/07) Using automatic persistent memoization to facilitate data analysis scripting. ISSTA '11: Proceedings of the 2011 International Symposium on Software Testing and Analysis (RSS)
DOI (original publisher): 10.1145/2001420.2001455
Semantic Scholar (metadata): 10.1145/2001420.2001455
Sci-Hub (fulltext): 10.1145/2001420.2001455
Internet Archive Scholar (search for fulltext): Using automatic persistent memoization to facilitate data analysis scripting
Download: https://dl.acm.org/doi/abs/10.1145/2001420.2001455
Tagged: Computer Science
(RSS) software engineering (RSS)
Problem
- Programmers across a wide range of disciplines write scripts to transform, process, or analyze data in Python, their scripts are divided into stages with intermediate results, and they develop iteratively.
- Data-analysis is often ad hoc and exploratory.
- Data-analysts have a wide range of backgrounds; not strictly CS/engineering.
- They primarily use Python, Perl, or Ruby.
- Saving intermediate outputs by-hand is time-consuming and error-prone. Users forget to invalidate datasets.
- How to speed up their programs by saving intermediate outputs without being stale or confusing?
Solution: IncPy
IncPy is a modified Python interpreter that automatically memoizes functions. It invalidates a function's cache if its source code changes, and invalidates an entry if a file it depends on changes.
- Benefits:
- Less code (no de/serialization or cache testing)
- Automated data management
- Faster iteration times
- Source code
- Goals:
- No code changes (change interpreter instead)
- Low run-time overhead
- Compatible with 3rd party libraries.
Implementation
- Each entry contains: function name, arguments, return, stdout, stderr, global vars dependencies, files read, files written, transitive code dependencies (name and bytecode).
- Modify interpreter's file IO calls to record file accesses to all functions on call stack.
- Each entry is a file named with a hash of the inputs. This has to be written atomically, so multiple threads/processes can share the same cache.
- Entries are written eagerly; if interpreter crashes, still have entries.
- Which calls to memoize: Only pure, time-consuming calls.
- When an impurity is detected (mutation of {closure, global, or argument}, source of non-determinism {PRNG, stdin, time}) all functions on call stack.
- Functions that are pure on some paths but impure on others can still be memoized for those pure paths, on which we have IncPy has dynamically proven purity.
- File IO does not count as impurity since these are captured in the entry.
- Heuristic: If the call is shorter than 1 second, overhead of memoization (disk read/write) is often not worthwhile. Otherwise, memoize.
- Heuristic doesn't work if returned data is very large.
- When an impurity is detected (mutation of {closure, global, or argument}, source of non-determinism {PRNG, stdin, time}) all functions on call stack.
- Reachability analysis determines if global is mutated (impure) or if global is read (add dependency).
- Each object has two more fields: global name and func start.
- Global name is the global this object is reachable from.
- Func start is the time the variable was created.
- If there is a read or write to a variable whose func start predates the current function, it must be marked as a dependency or impurity, respectively.
- Each object has two more fields: global name and func start.
Generalizability
- For dynamic languages, interpose the interpreter's handlers for function calls, value accesses, and file I/O (as does IncPy)
- For static languages, source-to-source translator or binary rewriter that inserts callback.
- Could augment with static purity, escape, and callgraph analysis.
- For whole workflows, use ptrace to identify file read/writes.
- Perhaps could be used to accelerate regression tests.
Related Work
- Make, but Make requires explicit dependency information and only works between processes.
- Self-adjusting computation, but IncPy is more general.
- JIT, although it is focused on program micro-optimizations, not memoization, and these don't outlive the process.
Future Work
- Provenance browsing
- Network‐aware/database‐aware caching
- Lightweight programmer annotations
- Finer‐grained tracking within functions