Sciunits: Reusable Research Objects

From AcaWiki

Citation: Dai Hai Ton That, Gabriel Fils, Zhihao Yuan, Tanu Malik (2017/11/16) Sciunits: Reusable Research Objects. e-Science
DOI (original publisher): 10.1109/eScience.2017.51
arXiv (preprint): arXiv:1707.05731
Semantic Scholar (metadata): 10.1109/eScience.2017.51
Sci-Hub (fulltext): 10.1109/eScience.2017.51
Internet Archive Scholar (search for fulltext): Sciunits: Reusable Research Objects
Download: https://doi.org/10.1109/eScience.2017.51
Tagged: computational science

Summary

Background

  • Research objects := collections of digital artifacts (e.g., code, data, scripts, and temporary experiment results).

Contribution

  • Sciunit := a research object captured automatically through application virtualization.
  • Application virtualization := use strace to record the processes an application spawns and the files it opens.
    • The captured container can be modified and rerun manually after capture.
    • The trace yields a graph over processes, their spawn relationships, their input files, and their output files.
    • Naively, this graph is too fine-grained and generates too many dependencies.
      • Deduplicate by checking the rolling hash of each file against existing files. A rolling hash tolerates insertions and deletions, which shift byte offsets without changing most of the content.
    • Naively, the graph is also too large to visualize, so the authors develop a way of contracting it.
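
The rolling-hash deduplication mentioned above can be illustrated with content-defined chunking, one common way to make deduplication robust to insertions and deletions. The sketch below is not taken from the paper or from the Sciunit code: the Gear-style hash, the `ChunkStore` class, and the parameters `MASK_BITS`, `MIN_CHUNK`, and `MAX_CHUNK` are all assumptions chosen for illustration.

```python
import hashlib

# Hypothetical parameters, chosen only to keep the example small.
MASK_BITS = 6      # a boundary is declared when the low 6 hash bits are zero
MIN_CHUNK = 16     # never cut a chunk shorter than this many bytes
MAX_CHUNK = 256    # always cut once a chunk reaches this many bytes

# Deterministic per-byte table for a Gear-style rolling hash (an assumption,
# not the hash used by the authors).
GEAR = [int.from_bytes(hashlib.sha256(bytes([b])).digest()[:4], "big")
        for b in range(256)]


def chunk(data: bytes) -> list:
    """Split data at boundaries chosen by content rather than by offset."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        # Shifting pushes old bytes out, so the hash tracks only recent bytes.
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        length = i - start + 1
        if length < MIN_CHUNK:
            continue
        if (h & ((1 << MASK_BITS) - 1)) == 0 or length >= MAX_CHUNK:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks


class ChunkStore:
    """Stores each distinct chunk once, keyed by its SHA-256 digest."""

    def __init__(self):
        self.blobs = {}  # digest -> chunk bytes

    def put(self, data: bytes):
        """Store data; return (recipe of digests, number of new chunks)."""
        recipe, new = [], 0
        for piece in chunk(data):
            digest = hashlib.sha256(piece).hexdigest()
            if digest not in self.blobs:
                self.blobs[digest] = piece
                new += 1
            recipe.append(digest)
        return recipe, new

    def get(self, recipe) -> bytes:
        """Reconstruct the original bytes from a recipe."""
        return b"".join(self.blobs[d] for d in recipe)
```

Because boundaries depend on nearby bytes rather than on absolute offsets, an insertion changes the chunks around the edit while the chunks before it, and typically those well after it, keep their digests and are not stored again. Reconstruction is a lookup and concatenation, which is consistent with it being cheaper than the chunk-and-hash pass needed to store new data.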

Evaluation

  • Running in the container adds an overhead of 0 to 40%, depending on the application.
  • Deduplicating a new stream for storage takes a while (~60s), but reconstruction is fast (<5s).

See also