Using automatic persistent memoization to facilitate data analysis scripting


Citation: Philip J Guo, Dawson Engler (2011/07) Using automatic persistent memoization to facilitate data analysis scripting. ISSTA '11: Proceedings of the 2011 International Symposium on Software Testing and Analysis (RSS)
DOI (original publisher): 10.1145/2001420.2001455
Tagged: Computer Science, software engineering


  • Programmers across a wide range of disciplines write scripts to transform, process, or analyze data; these scripts are divided into stages with intermediate results, and development is iterative.
  • Data-analysis is often ad hoc and exploratory.
  • Data-analysts have a wide range of backgrounds; not strictly CS/engineering.
  • They primarily use Python, Perl, or Ruby.
  • Saving intermediate outputs by hand is time-consuming and error-prone; users forget to invalidate stale datasets.
  • Problem: how can these programs be sped up by saving intermediate outputs, without those saved outputs becoming stale or confusing?

Solution: IncPy

IncPy is a modified Python interpreter that automatically memoizes functions. It invalidates a function's cache if its source code changes, and invalidates an entry if a file it depends on changes.
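The mechanism can be sketched at user level as a decorator. Everything below (the `persistent_memo` name, the cache directory, hashing the bytecode) is an illustrative assumption; IncPy itself does this transparently inside the interpreter, with no decorator or code changes:

```python
import hashlib
import os
import pickle

CACHE_DIR = "incpy_cache"  # hypothetical cache directory for this sketch

def persistent_memo(file_deps=()):
    """Persistently cache a function's results, invalidating when the
    function's bytecode or any listed input file changes (a user-level
    approximation of what IncPy does automatically)."""
    def decorator(func):
        # Hash the bytecode so that editing the function invalidates its cache.
        code_hash = hashlib.sha1(func.__code__.co_code).hexdigest()
        def wrapper(*args):
            # Fold input-file modification times into the key so that a
            # changed file automatically misses the old entry.
            dep_mtimes = tuple(os.path.getmtime(f) for f in file_deps)
            key = hashlib.sha1(
                pickle.dumps((func.__name__, code_hash, dep_mtimes, args))
            ).hexdigest()
            path = os.path.join(CACHE_DIR, key + ".pkl")
            if os.path.exists(path):        # hit: reuse the stored result
                with open(path, "rb") as fh:
                    return pickle.load(fh)
            result = func(*args)            # miss: compute, then store
            os.makedirs(CACHE_DIR, exist_ok=True)
            with open(path, "wb") as fh:
                pickle.dump(result, fh)
            return result
        return wrapper
    return decorator
```

Because entries live on disk, results survive interpreter restarts, which is what makes iterative script development faster.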

  • Benefits:
    • Less code (no de/serialization or cache testing)
    • Automated data management
    • Faster iteration times
  • Source code
  • Goals:
    • No code changes (change interpreter instead)
    • Low run-time overhead
    • Compatible with 3rd party libraries.
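The "less code" benefit refers to hand-rolled caching like the following sketch (the `load_dataset` example and its staleness test are illustrative, not from the paper); all of this serialization and invalidation boilerplate disappears under IncPy:

```python
import os
import pickle

def load_dataset(path):
    """Hand-rolled persistent caching of one pipeline stage."""
    cache = path + ".cache.pkl"
    # Manual staleness test: the user must remember every input file and
    # compare modification times, or risk silently using stale results.
    if (os.path.exists(cache)
            and os.path.getmtime(cache) >= os.path.getmtime(path)):
        with open(cache, "rb") as fh:
            return pickle.load(fh)             # deserialize saved intermediate
    with open(path) as fh:
        data = [line.split() for line in fh]   # the actual (expensive) work
    with open(cache, "wb") as fh:
        pickle.dump(data, fh)                  # serialize intermediate by hand
    return data
```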


  • Each entry contains: function name, arguments, return, stdout, stderr, global vars dependencies, files read, files written, transitive code dependencies (name and bytecode).
  • The interpreter's file I/O calls are modified to record file accesses for all functions on the call stack.
  • Each entry is a file named with a hash of the inputs; entries are written atomically so that multiple threads/processes can share the same cache.
  • Entries are written eagerly; if interpreter crashes, still have entries.
  • Which calls to memoize: Only pure, time-consuming calls.
    • When an impurity is detected (mutation of a closure variable, global, or argument; or a source of non-determinism such as a PRNG, stdin, or the time), all functions on the call stack become unmemoizable.
      • Functions that are pure on some paths but impure on others can still be memoized for the pure paths, on which IncPy has dynamically proven purity.
      • File IO does not count as impurity since these are captured in the entry.
    • Heuristic: if a call takes less than 1 second, the overhead of memoization (disk reads/writes) is often not worthwhile; otherwise, memoize it.
      • The heuristic breaks down if the returned data is very large.
  • Reachability analysis determines whether a global is mutated (an impurity) or merely read (a dependency).
    • Each object gets two extra fields: global name and func start.
      • Global name records which global the object is reachable from.
      • Func start records the time at which the object was created.
      • On a read or write of an object whose func start predates the current function, the object is marked as a dependency or an impurity, respectively.


  • For dynamic languages, interpose on the interpreter's handlers for function calls, value accesses, and file I/O (as IncPy does).
  • For static languages, use a source-to-source translator or a binary rewriter that inserts callbacks.
    • This could be augmented with static purity, escape, and call-graph analyses.
  • For whole workflows, use ptrace to identify file reads and writes.
  • The technique could perhaps also be used to accelerate regression tests.
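In Python, the call-interposition point the first bullet refers to can be approximated at user level with the standard `sys.settrace` hook, which the interpreter invokes on every function-call event. This is a toy sketch of the idea, not how IncPy is built (IncPy modifies the interpreter itself, avoiding the heavy overhead of tracing):

```python
import sys

call_log = []  # records the name of each function as it is called

def tracer(frame, event, arg):
    # The interpreter fires a "call" event on entry to every frame; this is
    # the point where a memoizer could consult its cache or record
    # dependencies for the function being entered.
    if event == "call":
        call_log.append(frame.f_code.co_name)
    return None  # no per-line tracing needed

def analysis_step(x):
    return x + 1

sys.settrace(tracer)
result = analysis_step(41)
sys.settrace(None)  # stop interposing
```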

Related Work

  • Make, but Make requires explicit dependency information and only works at the granularity of whole processes.
  • Self-adjusting computation, but IncPy is more general.
  • JIT compilation, though it focuses on micro-optimizations rather than memoization, and its optimizations do not outlive the process.

Future Work

  • Provenance browsing
  • Network‐aware/database‐aware caching
  • Lightweight programmer annotations
  • Finer‐grained tracking within functions
