Software Heritage: Why and How to Preserve Software Source Code

Citation: Roberto Di Cosmo, Stefano Zacchiroli (2017) Software Heritage: Why and How to Preserve Software Source Code. iPRES 2017: 14th International Conference on Digital Preservation
Download: https://hal.archives-ouvertes.fr/hal-01590958

Summary

Overview and status of the Software Heritage project.

Reviews existing work and claims that "software archival in source code form has not been addressed in its own right before."

Source code is at risk: it is scattered (a "diaspora") across many platforms and institutional forges, those hosts are fragile and sometimes shut down, and no research instrument exists for analyzing software as a whole; a "very large telescope" of software is needed.

The mission of Software Heritage is to "collect, organize, preserve, and make easily accessible all publicly available source code", guided by the following principles:

  • transparency and free software
  • replication
  • multi-stakeholder and non-profit
  • no a priori selection (save all the code)
  • source code first (other projects, including some discussed as prior work, archive surrounding context such as development mailing lists and binaries built from source code)
  • intrinsic identifiers
  • provenance of facts
  • minimalism

Outlines applications in cultural heritage, scientific research, and industry (such as "part numbers" for free software).


Describes the technical design, which treats the massive duplication of published source code as both a challenge and an opportunity: "Software Heritage archive is conceptually a single (big) Merkle Direct Acyclic Graph".
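
A minimal sketch of why a Merkle DAG absorbs duplication: each node is identified by a hash of its own content, so identical files and identical directories collapse into a single stored node. The class `MerkleStore`, its method names, and the choice of SHA-256 over a type tag plus payload are illustrative assumptions, not the archive's actual data model or API.

```python
import hashlib
from typing import Dict


class MerkleStore:
    """Toy content-addressed store (hypothetical, for illustration only)."""

    def __init__(self) -> None:
        # Maps node identifier -> serialized node payload.
        self.nodes: Dict[str, bytes] = {}

    def _put(self, kind: str, payload: bytes) -> str:
        # The identifier depends only on the node's type and content.
        node_id = hashlib.sha256(kind.encode() + b"\0" + payload).hexdigest()
        # Storing the same content twice is a no-op: this is the deduplication.
        self.nodes.setdefault(node_id, payload)
        return node_id

    def add_file(self, data: bytes) -> str:
        return self._put("file", data)

    def add_directory(self, entries: Dict[str, str]) -> str:
        # A directory node lists (name, child identifier) pairs; sorting makes
        # the serialization, and therefore the identifier, deterministic.
        payload = "\n".join(
            f"{name}\t{child_id}" for name, child_id in sorted(entries.items())
        ).encode()
        return self._put("dir", payload)


store = MerkleStore()
license_a = store.add_file(b"MIT License\n")
license_b = store.add_file(b"MIT License\n")  # same bytes, e.g. found in a fork
main_id = store.add_file(b"print('hi')\n")
tree_a = store.add_directory({"LICENSE": license_a, "main.py": main_id})
tree_b = store.add_directory({"main.py": main_id, "LICENSE": license_b})
print(len(store.nodes), "nodes stored for two identical trees")
```

Because a directory's identifier is computed from its children's identifiers, a change to any file changes the identifier of every node above it, while unchanged subtrees keep their identifiers and are shared.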

This DAG is populated through workflows that include:

  • listing
  • loading
  • scheduling
  • archiving
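
The summary only names these steps, so the following is a hedged sketch of one way they could fit together, not the project's implementation. The in-memory `FORGE`, the `forge.example` URLs, every function name, and the role given to each step (in particular reading "archiving" as keeping several copies) are assumptions made for illustration.

```python
import hashlib
from collections import deque
from typing import Deque, Dict, List, Tuple

# Hypothetical hosting platform: origin URL -> files in its current state.
FORGE: Dict[str, Dict[str, bytes]] = {
    "https://forge.example/alice/hello": {
        "hello.py": b"print('hello')\n",
        "LICENSE": b"MIT\n",
    },
    "https://forge.example/bob/hello-fork": {
        "hello.py": b"print('hello')\n",
        "LICENSE": b"MIT\n",
        "README": b"fork\n",
    },
}


def list_origins(forge: Dict[str, Dict[str, bytes]]) -> List[str]:
    """Listing: enumerate the origins a hosting platform exposes."""
    return sorted(forge)


def schedule(origins: List[str]) -> Deque[str]:
    """Scheduling: queue one visit per origin (a real scheduler would recur)."""
    return deque(origins)


def load(origin: str, forge: Dict[str, Dict[str, bytes]],
         archive: Dict[str, bytes]) -> int:
    """Loading: ingest an origin's files, skipping content already archived."""
    new_blobs = 0
    for data in forge[origin].values():
        key = hashlib.sha256(data).hexdigest()
        if key not in archive:
            archive[key] = data
            new_blobs += 1
    return new_blobs


def replicate(archive: Dict[str, bytes], copies: int = 2) -> List[Dict[str, bytes]]:
    """Archiving (assumed meaning): keep several independent copies."""
    return [dict(archive) for _ in range(copies)]


def run(forge: Dict[str, Dict[str, bytes]]) -> Tuple[Dict[str, bytes], Dict[str, int]]:
    archive: Dict[str, bytes] = {}
    added: Dict[str, int] = {}
    queue = schedule(list_origins(forge))
    while queue:
        origin = queue.popleft()
        added[origin] = load(origin, forge, archive)
    return archive, added


archive, added = run(FORGE)
mirrors = replicate(archive)
```

In this toy run the fork contributes only the one file that its upstream lacks, which is the duplication effect the Merkle DAG design relies on.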

Briefly describes initial implementation progress on the above workflows and on access to the archive.