Citation: Omar Benjelloun, Shiyu Chen, Natasha Noy. Google Dataset Search by the Numbers.
Internet Archive Scholar (search for fulltext): Google Dataset Search by the Numbers
Wikidata (metadata): Q101086249
Download: https://arxiv.org/abs/2006.06894
Tagged:

Summary

The authors analyze the metadata corpus of Google Dataset Search. As of March 2020, the corpus contained 28 million datasets from more than 3,700 sites.

Findings include:

breakdown of top-level domains and of the domains where data described by Schema.org or DCAT metadata is found
breakdown of human languages used on pages describing datasets
99% of datasets are described using Schema.org
breakdown of other metadata published with dataset descriptions
breakdown of the formats in which datasets are published; fewer than 1% are in a linked data format
large amount of churn and growth in published datasets (3% deleted, 7% new in a given day?)
34% of datasets have metadata specifying a license and 44% a download URL
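
To make the license and download URL findings concrete, the sketch below checks a single Schema.org Dataset record for those two fields. It is an illustration only: the JSON-LD record, the URLs, and the helper names (as_list, has_license, download_urls) are hypothetical, and this summary does not describe how the authors actually computed their percentages. The property names used (license, distribution, contentUrl) are standard Schema.org terms.

```python
import json

# Hypothetical JSON-LD record, invented for illustration; not taken from the paper.
EXAMPLE_JSONLD = """
{
  "@context": "https://schema.org/",
  "@type": "Dataset",
  "name": "Example rainfall measurements",
  "description": "Hypothetical dataset used only for illustration.",
  "license": "https://creativecommons.org/licenses/by/4.0/",
  "distribution": {
    "@type": "DataDownload",
    "encodingFormat": "text/csv",
    "contentUrl": "https://example.org/rainfall.csv"
  }
}
"""


def as_list(value):
    """Wrap a single value in a list; JSON-LD allows one value or many."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def has_license(record):
    """Return True if the record carries a non-empty 'license' property."""
    return bool(record.get("license"))


def download_urls(record):
    """Collect 'contentUrl' values from the record's 'distribution' entries."""
    urls = []
    for dist in as_list(record.get("distribution")):
        if isinstance(dist, dict) and dist.get("contentUrl"):
            urls.append(dist["contentUrl"])
    return urls


record = json.loads(EXAMPLE_JSONLD)
print("has license:", has_license(record))
print("download URLs:", download_urls(record))
```

Running the same two checks over every record in a corpus and dividing by the record count would give figures comparable in spirit to the 34% and 44% reported above, assuming the corpus uses these property names.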

The authors describe future work to improve metadata or dataset quality.

Theoretical and Practical Relevance