Public Git Archive: a Big Code dataset for all
Citation: Vadim Markovtsev, Waren Long (2018/03/20) Public Git Archive: a Big Code dataset for all.
DOI (original publisher): 10.1145/3196398.3196464
Semantic Scholar (metadata): 10.1145/3196398.3196464
Sci-Hub (fulltext): 10.1145/3196398.3196464
Internet Archive Scholar (search for fulltext): Public Git Archive: a Big Code dataset for all
Download: https://arxiv.org/abs/1803.10144
Summary
Describes the Public Git Archive, a dataset built by crawling public GitHub repositories with at least 50 "stargazers" (182k projects, 3.0 TB of storage). The repositories were identified using the GHTorrent dataset, but their metadata was collected with independent tools rather than the GitHub API, because the longer-term plan is to ingest repositories hosted elsewhere.
Also describes the data retrieval pipeline used to produce the dataset, which is designed to scale horizontally to process millions of repositories, and the Siva repository archival format, which is used to store forks efficiently.
Previous public source code datasets have been language-specific (Java) and much smaller, or have not yet enabled public access (Software Heritage). Related datasets such as GHTorrent collect repository activity data rather than the source code repositories themselves.
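
The selection rule mentioned in the summary (keep repositories with at least 50 stargazers) can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the `RepoRecord` type, the `select_repositories` function, and the sample URLs and star counts are all hypothetical, made up only to show the thresholding step on metadata such as that provided by GHTorrent.

```python
from typing import Iterable, List, NamedTuple


class RepoRecord(NamedTuple):
    """Hypothetical metadata record: a repository URL and its star count."""
    url: str
    stargazers: int


# Threshold stated in the summary: at least 50 stargazers.
MIN_STARGAZERS = 50


def select_repositories(records: Iterable[RepoRecord],
                        min_stargazers: int = MIN_STARGAZERS) -> List[str]:
    """Return the URLs of repositories that meet the stargazer threshold."""
    return [r.url for r in records if r.stargazers >= min_stargazers]


# Made-up sample data, for illustration only.
sample = [
    RepoRecord("https://github.com/example/popular", 1200),
    RepoRecord("https://github.com/example/borderline", 50),
    RepoRecord("https://github.com/example/obscure", 3),
]

selected = select_repositories(sample)
print(selected)
```

The comparison is inclusive (`>=`), so a repository with exactly 50 stargazers is kept, matching the "at least 50" wording.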