Anatomy of a Machine Learning Ecosystem: 2 Million Models on Hugging Face

From AcaWiki

Citation: Benjamin Laufer, Hamidah Oderinwale, Jon Kleinberg (2025/08/09) Anatomy of a Machine Learning Ecosystem: 2 Million Models on Hugging Face.
Internet Archive Scholar (search for fulltext): Anatomy of a Machine Learning Ecosystem: 2 Million Models on Hugging Face
Wikidata (metadata): Q135972441
Download: https://arxiv.org/abs/2508.06811

Summary

The paper empirically maps the open ML ecosystem on Hugging Face by analyzing ~1.86 million models and reconstructing “family trees” that link fine-tuned descendants to their base models. Using repository metadata and model-card text, the authors quantify how model traits drift along these lineages and find that (i) sibling models resemble each other more than parent–child pairs do, indicating fast, directed mutation; (ii) licenses tend to drift from restrictive or commercial terms toward permissive or copyleft ones, sometimes in conflict with upstream terms; (iii) models drift from multilingual toward English-only language support; and (iv) model cards grow shorter and more standardized, with increasing use of templated or auto-generated text.

Theoretical and Practical Relevance

The work contributes ecosystem-level evidence on how open-weight development evolves, adapting an evolutionary/phylogenetic lens to quantify lineage structure and mutation in model traits. Practically, the findings bear on governance and platform design: license drift highlights compliance and stewardship challenges for derivative models; language and documentation drift inform evaluation reliability and discoverability; and the observed lineage dynamics suggest that norms and instrumentation at the base-model level can shape downstream behavior across large families.