DataHub: Collaborative Data Science & Dataset Version Management at Scale
Citation: Anant Bhardwaj, Souvik Bhattacherjee, Amit Chavan, Amol Deshpande, Aaron J. Elmore, Samuel Madden, Aditya G. Parameswaran (2015) DataHub: Collaborative Data Science & Dataset Version Management at Scale. 7th Biennial Conference on Innovative Data Systems Research (CIDR)
Download: http://db.csail.mit.edu/pubs/datahubcidr.pdf
Summary
Managing data is a major challenge for researchers. Source code version control systems are inadequate for datasets: they ignore the data's structure, offer very limited querying, and perform poorly on huge datasets.
Proposes the concept of a Dataset Version Control System (DSVC) and a software/service called DataHub built on top of it.
Abstraction 1: a table contains records, each with a key and arbitrary associated data.
Abstraction 2: a dataset is a set of tables, plus any relations among those tables.
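A minimal sketch of the two abstractions, to make them concrete. The class names (`Table`, `Dataset`), the relation encoding as `(src_table, src_field, dst_table)` tuples, and the `dangling_references` helper are all hypothetical and chosen for illustration; none of them comes from the paper.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass
class Table:
    """Abstraction 1: each record key maps to arbitrary, schema-free data."""
    name: str
    records: Dict[str, Dict[str, Any]] = field(default_factory=dict)


@dataclass
class Dataset:
    """Abstraction 2: a set of tables plus relations among them."""
    tables: Dict[str, Table] = field(default_factory=dict)
    # Assumed encoding: (src_table, src_field, dst_table) means that
    # src_field in src_table holds a record key of dst_table.
    relations: List[Tuple[str, str, str]] = field(default_factory=list)

    def dangling_references(self) -> List[Tuple[str, str]]:
        """Return (table, record key) pairs whose reference has no target."""
        dangling = []
        for src, src_field, dst in self.relations:
            targets = self.tables[dst].records
            for key, data in self.tables[src].records.items():
                if src_field in data and data[src_field] not in targets:
                    dangling.append((src, key))
        return dangling


# Records in one table need not share the same fields.
people = Table("people", {"p1": {"name": "Ada"}, "p2": {"name": "Lin", "lab": "db"}})
visits = Table("visits", {"v1": {"person": "p1"}, "v2": {"person": "p9"}})
ds = Dataset({"people": people, "visits": visits}, [("visits", "person", "people")])
```

Here `ds.dangling_references()` reports the visit whose `person` field points at a key that does not exist in `people`.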
The version graph has datasets as nodes; edges capture relationships among versions, including provenance information.
A version API will be much like git's, including the possibility of hooks.
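A rough sketch of a version graph with a git-like commit/branch interface and post-commit hooks. This is an assumption-laden illustration, not DataHub's API: the class `VersionGraph`, its method names, the `v0`, `v1`, ... identifiers, and the use of a free-text provenance note on each edge are all invented here.

```python
from typing import Callable, Dict, Iterable, List, Set


class VersionGraph:
    """Hypothetical version graph: nodes are dataset versions, edges carry provenance."""

    def __init__(self) -> None:
        self.snapshots: Dict[str, dict] = {}            # version id -> dataset contents
        self.parents: Dict[str, Dict[str, str]] = {}    # version id -> {parent id: provenance}
        self.branches: Dict[str, str] = {}              # branch name -> head version id
        self.post_commit_hooks: List[Callable[[str], None]] = []

    def commit(self, branch: str, snapshot: dict, provenance: str,
               merged_from: Iterable[str] = ()) -> str:
        """Record a new version on `branch`; extra parents model a merge."""
        vid = f"v{len(self.snapshots)}"
        edges: Dict[str, str] = {}
        if branch in self.branches:
            edges[self.branches[branch]] = provenance
        for other in merged_from:
            edges[other] = provenance
        self.snapshots[vid] = dict(snapshot)
        self.parents[vid] = edges
        self.branches[branch] = vid
        for hook in self.post_commit_hooks:   # hooks run after every commit
            hook(vid)
        return vid

    def branch(self, new_branch: str, from_branch: str) -> None:
        """Start a new branch at the current head of an existing one."""
        self.branches[new_branch] = self.branches[from_branch]

    def ancestors(self, vid: str) -> Set[str]:
        """All versions reachable by following parent edges."""
        seen: Set[str] = set()
        stack = list(self.parents[vid])
        while stack:
            cur = stack.pop()
            if cur not in seen:
                seen.add(cur)
                stack.extend(self.parents[cur])
        return seen


graph = VersionGraph()
commit_log: List[str] = []
graph.post_commit_hooks.append(commit_log.append)

v0 = graph.commit("master", {"a": 1}, "initial import")
graph.branch("cleaning", "master")
v1 = graph.commit("cleaning", {"a": 2}, "normalized units")
v2 = graph.commit("master", {"a": 2, "b": 3}, "merged cleaning", merged_from=[v1])
```

The merge commit ends up with two parent edges, so walking `ancestors` from it reaches both the original import and the cleaning branch.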
Research question: how to handle conflicts.
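To illustrate why conflicts are an open question, here is one possible record-level policy, sketched as a three-way merge over keyed records. This is a swapped-in textbook technique, not the paper's answer: the function name `three_way_merge` and the rule "a key conflicts when both sides changed it differently from the common ancestor" are assumptions made for this example.

```python
from typing import Any, Dict, List, Tuple

_MISSING = object()  # sentinel meaning "this key is absent in that version"


def three_way_merge(base: Dict[str, Any], ours: Dict[str, Any],
                    theirs: Dict[str, Any]) -> Tuple[Dict[str, Any], List[str]]:
    """Merge two versions against their common ancestor, key by key."""
    merged: Dict[str, Any] = {}
    conflicts: List[str] = []
    for key in sorted(set(base) | set(ours) | set(theirs)):
        b = base.get(key, _MISSING)
        o = ours.get(key, _MISSING)
        t = theirs.get(key, _MISSING)
        if o == t:
            winner = o            # both sides agree (including both deleting)
        elif o == b:
            winner = t            # only theirs changed the record
        elif t == b:
            winner = o            # only ours changed the record
        else:
            conflicts.append(key)  # both changed it, in different ways
            continue
        if winner is not _MISSING:
            merged[key] = winner
    return merged, conflicts


base = {"a": 1, "b": 2, "c": 3, "d": 4}
ours = {"a": 10, "b": 2, "c": 31, "d": 4}
theirs = {"a": 1, "b": 20, "c": 30}
merged, conflicts = three_way_merge(base, ours, theirs)
```

Even this simple policy leaves open what to do with the conflicting keys, and it ignores relations among tables, which is part of what makes the question hard for datasets.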
Proposes VQL, a query language that extends SQL with the ability to query the version graph.
Research challenge: make VQL easier to use and more complete.
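The kind of question a version-aware query language has to answer can be approximated in plain SQL by treating the version as an ordinary column. The sketch below does not reproduce VQL syntax; it uses Python's standard `sqlite3` module, and the table layout (`records` with `version`, `record_key`, `age` columns) and the function name `versions_matching` are assumptions for illustration.

```python
import sqlite3
from typing import Iterable, List, Tuple


def versions_matching(rows: Iterable[Tuple[str, str, int]], min_age: int) -> List[str]:
    """Return the versions that contain at least one record with age > min_age."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE records (version TEXT, record_key TEXT, age INTEGER)"
        )
        conn.executemany("INSERT INTO records VALUES (?, ?, ?)", list(rows))
        cursor = conn.execute(
            "SELECT DISTINCT version FROM records WHERE age > ? ORDER BY version",
            (min_age,),
        )
        return [row[0] for row in cursor.fetchall()]
    finally:
        conn.close()


rows = [
    ("v1", "alice", 42),
    ("v2", "alice", 42),
    ("v2", "bob", 61),
    ("v3", "bob", 61),
]
matching = versions_matching(rows, 50)
```

Flattening versions into a column works for this one query but says nothing about branches, ancestry, or provenance, which is where a dedicated language would have to go further.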
Discuss version-first and record-first storage. Both representations may be needed to support all operations efficiently, but storage costs require archiving and cleanup.
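A toy sketch of the trade-off between the two layouts, under simplifying assumptions: every version is a full `{key: value}` snapshot, values are hashable, and no compression or deltas are used. The function names are hypothetical. In the version-first layout, records are grouped by version; in the record-first layout, each distinct record is stored once and annotated with the versions that contain it.

```python
from typing import Dict, Hashable, Set, Tuple

VersionFirst = Dict[str, Dict[str, Hashable]]             # version -> {key: value}
RecordFirst = Dict[Tuple[str, Hashable], Set[str]]        # (key, value) -> versions


def build_record_first(versions: VersionFirst) -> RecordFirst:
    """Store each distinct record once, tagged with the versions holding it."""
    index: RecordFirst = {}
    for vid, records in versions.items():
        for key, value in records.items():
            index.setdefault((key, value), set()).add(vid)
    return index


def checkout_record_first(index: RecordFirst, vid: str) -> Dict[str, Hashable]:
    """Rebuild one version; requires scanning every stored record."""
    return {key: value for (key, value), vids in index.items() if vid in vids}


def versions_with_key_version_first(versions: VersionFirst, key: str) -> Set[str]:
    """Find the versions containing a key; requires visiting every version."""
    return {vid for vid, records in versions.items() if key in records}


versions: VersionFirst = {
    "v1": {"a": 1, "b": 2},
    "v2": {"a": 1, "b": 3},
}
index = build_record_first(versions)
```

Checking out a version is a direct lookup in the version-first layout but a scan in the record-first one, while the unchanged record `("a", 1)` is stored once instead of twice in the record-first layout. That asymmetry is one way to see why both representations may be needed.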
Theoretical and Practical Relevance
Project at https://datahub.csail.mit.edu