DataHub: Collaborative Data Science & Dataset Version Management at Scale


Citation: Anant Bhardwaj, Souvik Bhattacherjee, Amit Chavan, Amol Deshpande, Aaron J. Elmore, Samuel Madden, Aditya G. Parameswaran (2015) DataHub: Collaborative Data Science & Dataset Version Management at Scale. 7th Biennial Conference on Innovative Data Systems Research (CIDR)
Download: http://db.csail.mit.edu/pubs/datahubcidr.pdf

Summary

Managing data is a major challenge for researchers. Source code version control systems are inadequate for datasets: they impose no structure on the data, offer very limited querying, and perform poorly on huge datasets.


The authors propose the concept of a dataset version control system (DSVC), and a software platform and service called DataHub built on top of it.

Abstraction 1: a table contains records, each with a key and arbitrary associated data.

Abstraction 2: a dataset is a set of tables, plus any relations among those tables.

The version graph has dataset versions as nodes. Its edges record the relationships among versions, including provenance information.
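
The two abstractions and the version graph can be pictured with a small in-memory model. This is only an illustrative sketch: the class names (`Table`, `Dataset`, `VersionGraph`), the field names, and the shape of the provenance dictionary are all assumptions made here, not DataHub's actual data structures.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Sequence, Tuple


@dataclass
class Table:
    # Abstraction 1: records addressed by key, each with arbitrary data.
    records: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Dataset:
    # Abstraction 2: a set of named tables plus relations among them.
    tables: Dict[str, Table] = field(default_factory=dict)
    # Each relation is (table_a, table_b, description); hypothetical format.
    relations: List[Tuple[str, str, str]] = field(default_factory=list)


@dataclass
class VersionGraph:
    # Nodes: version id -> dataset as of that version.
    nodes: Dict[str, Dataset] = field(default_factory=dict)
    # Edges: (parent id, child id) -> provenance info for that derivation.
    edges: Dict[Tuple[str, str], Dict[str, Any]] = field(default_factory=dict)

    def add_version(self, version_id: str, dataset: Dataset,
                    parents: Sequence[str] = (), provenance=None) -> None:
        # Refuse edges that point at versions the graph does not know.
        for parent in parents:
            if parent not in self.nodes:
                raise KeyError(f"unknown parent version {parent!r}")
        self.nodes[version_id] = dataset
        for parent in parents:
            self.edges[(parent, version_id)] = dict(provenance or {})

    def parents(self, version_id: str) -> List[str]:
        # All versions this one was derived from, in sorted order.
        return sorted(p for (p, c) in self.edges if c == version_id)
```

A version with two parents would represent a merge in this model; a version with none is a root.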

A version API will be much like git's, including the possibility of hooks.
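
To make the git analogy concrete, here is a toy in-memory version store with branch, checkout, commit, and a pre-commit hook. Everything in it is hypothetical (the `Repo` class, its method names, the `require_name` hook, and the choice to store full snapshots); it sketches the general idea of a git-style API with hooks and is not the API described in the paper.

```python
from typing import Callable, Dict, List, Optional

Snapshot = Dict[str, dict]  # record key -> record data


class Repo:
    """Toy version store; all names here are illustrative assumptions."""

    def __init__(self) -> None:
        self.versions: Dict[int, Snapshot] = {}      # version id -> snapshot
        self.parent: Dict[int, Optional[int]] = {}   # version id -> parent id
        self.branches: Dict[str, Optional[int]] = {"master": None}
        self.pre_commit_hooks: List[Callable[[Snapshot], None]] = []
        self._next_id = 1

    def branch(self, name: str, from_branch: str = "master") -> None:
        # A new branch starts at the head of an existing one.
        self.branches[name] = self.branches[from_branch]

    def checkout(self, branch: str) -> Snapshot:
        # Return a shallow copy of the snapshot at the branch head.
        head = self.branches[branch]
        return dict(self.versions[head]) if head is not None else {}

    def commit(self, branch: str, snapshot: Snapshot) -> int:
        # Hooks run first; a hook rejects the commit by raising.
        for hook in self.pre_commit_hooks:
            hook(snapshot)
        version_id = self._next_id
        self._next_id += 1
        self.versions[version_id] = dict(snapshot)
        self.parent[version_id] = self.branches[branch]
        self.branches[branch] = version_id
        return version_id


def require_name(snapshot: Snapshot) -> None:
    # Example hook: every record must carry a "name" field.
    for key, record in snapshot.items():
        if "name" not in record:
            raise ValueError(f"record {key!r} has no 'name' field")
```

Because hooks run before any state changes, a rejected commit leaves the branch head where it was.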

Open research question: how to detect and resolve conflicts between versions.

The authors also propose VQL, a query language that extends SQL with the ability to query the version graph.

Research challenge: make VQL easier to use and more complete.

The paper discusses two storage representations, version-first and record-first. Both may be needed to support all operations efficiently, but keeping both raises storage costs, which makes archiving and cleanup necessary.
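
The trade-off between the two layouts can be sketched in a few lines. The sketch assumes one plausible reading of the terms, which may differ in detail from the paper's design: version-first groups records under each version, while record-first stores each distinct record once together with the set of versions that contain it. The function names and sample data are made up for illustration.

```python
from typing import Dict, Set, Tuple

Record = Tuple[str, str]  # (key, value); a tuple so it can live in a set

# Version-first layout: version id -> the records in that version.
version_first: Dict[str, Set[Record]] = {
    "v1": {("k1", "a"), ("k2", "b")},
    "v2": {("k1", "a"), ("k2", "c"), ("k3", "d")},
}


def to_record_first(vf: Dict[str, Set[Record]]) -> Dict[Record, Set[str]]:
    # Record-first layout: record -> ids of the versions containing it.
    rf: Dict[Record, Set[str]] = {}
    for version_id, records in vf.items():
        for record in records:
            rf.setdefault(record, set()).add(version_id)
    return rf


def versions_containing(rf: Dict[Record, Set[str]], record: Record) -> Set[str]:
    # A single lookup in the record-first layout.
    return rf.get(record, set())


def checkout_from_record_first(rf: Dict[Record, Set[str]],
                               version_id: str) -> Set[Record]:
    # Rebuilding a whole version requires scanning every record.
    return {record for record, vids in rf.items() if version_id in vids}
```

In this toy model, checking out a version is one dictionary lookup in the version-first layout but a full scan in the record-first one, while asking which versions contain a given record is the reverse. Unchanged records such as `("k1", "a")` are stored once per version in the first layout and only once overall in the second.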

Theoretical and Practical Relevance

Project at https://datahub.csail.mit.edu