Google Dataset Search by the Numbers

From AcaWiki

Citation: Omar Benjelloun, Shiyu Chen, Natasha Noy. Google Dataset Search by the Numbers.
Internet Archive Scholar (search for fulltext): Google Dataset Search by the Numbers
Wikidata (metadata): Q101086249
Download: https://arxiv.org/abs/2006.06894

Summary

The authors analyze the corpus of metadata behind Google Dataset Search. As of March 2020, the corpus contained 28 million datasets from more than 3,700 sites.

Findings include:

  • a breakdown of the top-level domains and domains where data described by Schema.org or DCAT metadata is found
  • a breakdown of the human languages used on pages describing datasets
  • 99% of datasets are described using Schema.org
  • a breakdown of the other metadata published with dataset descriptions
  • a breakdown of the formats in which datasets are published; fewer than 1% are in a linked data format
  • a large amount of churn and growth in published datasets (roughly 3% deleted and 7% new, possibly within a single day)
  • 34% of datasets have metadata specifying a license, and 44% specify a download URL
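
To make the license and download URL findings concrete, here is a minimal sketch of what such metadata can look like and how its presence might be checked. It assumes a Schema.org Dataset record in JSON-LD that uses the `license` property and a `distribution` of type `DataDownload` with a `contentUrl`. The record itself, its name, and its URLs are hypothetical placeholders, and the helper functions are illustrative only; they are not the method used in the paper.

```python
import json

# Hypothetical JSON-LD record; the name and URLs are placeholders,
# not taken from the paper or from any real dataset page.
EXAMPLE_JSONLD = """
{
  "@context": "https://schema.org/",
  "@type": "Dataset",
  "name": "Example rainfall measurements",
  "description": "Placeholder description for illustration only.",
  "license": "https://creativecommons.org/licenses/by/4.0/",
  "distribution": [
    {
      "@type": "DataDownload",
      "encodingFormat": "text/csv",
      "contentUrl": "https://example.org/rainfall.csv"
    }
  ]
}
"""


def has_license(record: dict) -> bool:
    """Return True if the record carries a non-empty `license` value."""
    return bool(record.get("license"))


def download_urls(record: dict) -> list[str]:
    """Collect every `contentUrl` found under `distribution`."""
    distributions = record.get("distribution", [])
    # `distribution` may be a single object rather than a list.
    if isinstance(distributions, dict):
        distributions = [distributions]
    return [
        d["contentUrl"]
        for d in distributions
        if isinstance(d, dict) and d.get("contentUrl")
    ]


record = json.loads(EXAMPLE_JSONLD)
print("license present:", has_license(record))
print("download URLs:", download_urls(record))
```

A record that omits both fields would count toward the majority of datasets that, per the figures above, lack a license or a download URL.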

The authors describe future work to improve metadata or dataset quality.

Theoretical and Practical Relevance

Related Google AI Blog post: https://ai.googleblog.com/2020/08/an-analysis-of-online-datasets-using.html

Advice for publishers: https://support.google.com/webmasters/thread/1960710