Google Dataset Search by the Numbers
From AcaWiki
Citation: Omar Benjelloun, Shiyu Chen, Natasha Noy. Google Dataset Search by the Numbers.
Internet Archive Scholar (search for fulltext): Google Dataset Search by the Numbers
Wikidata (metadata): Q101086249
Download: https://arxiv.org/abs/2006.06894
Summary
The authors analyze the metadata corpus of Google Dataset Search. As of March 2020, the corpus contained 28 million datasets from more than 3,700 sites.
Findings include:
- a breakdown of the top-level domains and domains where data described by Schema.org or DCAT metadata is found
- a breakdown of the human languages used on pages describing datasets
- 99% of datasets are described using Schema.org
- a breakdown of the other metadata published with dataset descriptions
- a breakdown of the formats in which datasets are published; fewer than 1% are in a linked data format
- a large amount of churn and growth in published datasets (3% deleted and 7% new, possibly in a given day)
- 34% of datasets have metadata specifying a license and 44% specify a download URL
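To make the license and download URL findings concrete, here is a minimal sketch of a Schema.org Dataset description and a check for those two fields. The record, its values, and the helper names `has_license` and `download_urls` are hypothetical and chosen only for illustration; they are not taken from the paper or from Google Dataset Search itself.

```python
import json

# Hypothetical Schema.org Dataset record; every value here is illustrative.
dataset = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "Example rainfall measurements",
    "description": "Illustrative dataset record, not taken from the paper.",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "distribution": [
        {
            "@type": "DataDownload",
            "encodingFormat": "text/csv",
            "contentUrl": "https://example.org/rainfall.csv",
        }
    ],
}


def has_license(record):
    """Return True if the record declares a non-empty license."""
    return bool(record.get("license"))


def download_urls(record):
    """Collect contentUrl values from the record's distribution entries."""
    dist = record.get("distribution", [])
    if isinstance(dist, dict):  # a single distribution may be a bare object
        dist = [dist]
    return [
        d["contentUrl"]
        for d in dist
        if isinstance(d, dict) and d.get("contentUrl")
    ]


# Serialized form, as it might be embedded in a page as JSON-LD.
jsonld = json.dumps(dataset, indent=2)
```

A record that omits `license` or `distribution` would fall into the majority of datasets that, per the figures above, do not specify a license or a download URL.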
The authors describe future work to improve metadata or dataset quality.
Theoretical and Practical Relevance
Google AI Blog post on the analysis: https://ai.googleblog.com/2020/08/an-analysis-of-online-datasets-using.html
Advice for publishers: https://support.google.com/webmasters/thread/1960710