Anatomy of a Machine Learning Ecosystem: 2 Million Models on Hugging Face

From AcaWiki

Citation: Benjamin Laufer, Hamidah Oderinwale, Jon Kleinberg (2025/08/09) Anatomy of a Machine Learning Ecosystem: 2 Million Models on Hugging Face.
Internet Archive Scholar (search for fulltext): Anatomy of a Machine Learning Ecosystem: 2 Million Models on Hugging Face
Wikidata (metadata): Q135972441
Download: https://arxiv.org/abs/2508.06811

Summary

The paper empirically maps the open ML ecosystem on Hugging Face by analyzing ~1.86 million models and reconstructing “family trees” that link fine-tuned descendants to their base models. Using repository metadata and model-card text, the authors quantify how model traits drift along these lineages and find that (i) sibling models resemble each other more than parent–child pairs do, indicating fast, directed mutation; (ii) licenses tend to drift from restrictive or commercial terms toward permissive or copyleft ones, sometimes in conflict with upstream terms; (iii) models drift from multilingual toward English-only language support; and (iv) model cards grow shorter and more standardized, with increasing use of templated or auto-generated text.

Theoretical and Practical Relevance

The work contributes ecosystem-level evidence on how open-weight development evolves, adapting an evolutionary/phylogenetic lens to quantify lineage structure and mutation in model traits. Practically, the findings bear on governance and platform design: license drift highlights compliance and stewardship challenges for derivative models; language and documentation drift inform evaluation reliability and discoverability; and the observed lineage dynamics suggest that norms and instrumentation at the base-model level can shape downstream behavior across large families.