Automatic Interlinking of Music Datasets on the Semantic Web



Citation: Yves Raimond, Christopher Sutton, Mark Sandler (2008) Automatic Interlinking of Music Datasets on the Semantic Web. Linked Data on the Web



Download: http://events.linkeddata.org/ldow2008/papers/18-raimond-sutton-automatic-interlinking.pdf

Tagged: musicbrainz


Summary:

"we need a way to automatically detect the overlapping parts of heterogeneous datasets. In this paper, we detail a few algorithms that have been developed, implemented and practically deployed to interlink different music-related datasets. We mainly focus on the most sophisticated one, applicable in a Linked Data context, and taking into account not only the similarities of single resources but also the similarities of their neighbours. We evaluate how this algorithm performs when applied to link a real-world Creative Commons dataset to an editorial one. We also show how a personal music collection can be treated as one such dataset, enabling a user to benefit from the growing body of knowledge on the Semantic Web in a personally meaningful way."

Naive Interlinking

  • simple literal lookups: use a literal string, e.g. a location in the Jamendo dataset (such as "Moselle, France"), to look up the corresponding GeoNames entry; suitable where the literal string alone reliably disambiguates
  • extended literal lookups: add constraints (e.g. on resource type) to a simple literal lookup to filter out incorrect matches
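
The two naive strategies above can be illustrated with a minimal sketch. It assumes a tiny in-memory index in place of a real lookup service; the `Entry` class, the `example.org` URIs, the `kind` field and the function names are all hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Entry:
    uri: str    # hypothetical identifier, not a real GeoNames URI
    label: str  # literal used for the lookup
    kind: str   # resource type used as the extra constraint


# Hypothetical toy index standing in for the target dataset.
TOY_INDEX = [
    Entry("http://example.org/place/1", "Moselle", "Place"),
    Entry("http://example.org/river/1", "Moselle", "River"),
    Entry("http://example.org/place/2", "Paris", "Place"),
]


def simple_lookup(literal: str, index: List[Entry]) -> List[Entry]:
    """Simple literal lookup: case-insensitive exact match on the label."""
    key = literal.strip().lower()
    return [e for e in index if e.label.lower() == key]


def extended_lookup(literal: str, index: List[Entry], kind: str) -> List[Entry]:
    """Extended literal lookup: same match, then filter by resource type."""
    return [e for e in simple_lookup(literal, index) if e.kind == kind]


def link(literal: str, index: List[Entry], kind: Optional[str] = None) -> Optional[str]:
    """Return a URI only when exactly one candidate remains, else None."""
    if kind is None:
        candidates = simple_lookup(literal, index)
    else:
        candidates = extended_lookup(literal, index, kind)
    return candidates[0].uri if len(candidates) == 1 else None
```

In this toy index, `link("Moselle", TOY_INDEX)` is ambiguous and yields no link, while adding the type constraint with `link("Moselle", TOY_INDEX, kind="Place")` leaves a single candidate.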

Graph matching

  • offline: compute a graph similarity measure and map each resource to its closest match that meets some threshold, e.g. artists in two datasets with the same releases are probably the same artist
  • linked data context: apply the graph similarity computation incrementally, updating it as linked data is retrieved, until a decision is reached (the authors provide an algorithm definition in pseudo-code)
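
A rough sketch of the offline variant may help make the idea concrete. It is not the authors' pseudo-code: the scoring formula (an equal-weight average of name similarity and release-title similarity), the use of `difflib.SequenceMatcher` as the string measure, the 0.8 threshold and the dictionary layout are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def string_sim(a: str, b: str) -> float:
    """String similarity in [0, 1]; an assumed stand-in for a domain measure."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def graph_sim(seed: dict, candidate: dict) -> float:
    """Combine the similarity of two artists with that of their neighbours.

    Each artist is a dict with a 'name' and a list of 'releases' (titles).
    The neighbour score is the mean, over the seed's releases, of the best
    matching release title on the candidate side.
    """
    name_score = string_sim(seed["name"], candidate["name"])
    if seed["releases"] and candidate["releases"]:
        neighbour_score = sum(
            max(string_sim(r, c) for c in candidate["releases"])
            for r in seed["releases"]
        ) / len(seed["releases"])
    else:
        neighbour_score = 0.0  # no neighbours to compare
    return 0.5 * name_score + 0.5 * neighbour_score  # assumed equal weights


def best_match(seed: dict, candidates: list, threshold: float = 0.8):
    """Return the closest candidate if it meets the threshold, else None."""
    if not candidates:
        return None
    score, best = max(
        ((graph_sim(seed, c), c) for c in candidates), key=lambda pair: pair[0]
    )
    return best if score >= threshold else None
```

With this scoring, two artists that share a name but have unrelated releases fall below the threshold, which is the kind of disambiguation a plain string match cannot provide.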

Experiments using linked data context graph matching

  • authors matched 60 artists in the MusicBrainz and Jamendo datasets and obtained 5 accurate matches, 0 inaccurate matches, 53 accurate non-matches (most Jamendo artists have no MusicBrainz entry) and 2 inaccurate non-matches. Disambiguation (a step beyond naive string match?) was needed in 16 cases (e.g. one band named Hair in Jamendo and 4 in MusicBrainz, none of which proved to match). The 2 inaccurate non-matches were due to an implementation mistake that arises when the target graph is bigger than the seed (the artist has more releases in MusicBrainz than in Jamendo) and to an outdated MusicBrainz RDF dump (the artist was in the live MusicBrainz database)
  • authors use the GNAT implementation of the algorithm to match metadata (artist/album/title) into which various artificial errors have been introduced
  • authors use GNARQL to crawl the links output by GNAT and provide a SPARQL endpoint and web front end for exploring the data, e.g. a world map plotting artist locations in a personal music collection
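
The artist-matching counts reported in this section can be turned into standard metrics. The sketch below is plain arithmetic on the four counts given in this summary (5 accurate matches, 0 inaccurate matches, 53 accurate non-matches, 2 inaccurate non-matches); the function name and the choice of precision, recall and accuracy as metrics are assumptions and may not match how the paper itself reports results.

```python
def evaluate(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Derive precision, recall and accuracy from match/non-match counts."""
    total = tp + fp + tn + fn
    return {
        "total": total,
        # Share of proposed matches that were correct.
        "precision": tp / (tp + fp) if (tp + fp) else None,
        # Share of true matches that were found.
        "recall": tp / (tp + fn) if (tp + fn) else None,
        # Share of all decisions (match or non-match) that were correct.
        "accuracy": (tp + tn) / total if total else None,
    }


# Counts taken from the summary of the experiment.
metrics = evaluate(tp=5, fp=0, tn=53, fn=2)
```

Under these definitions no incorrect link was produced (precision 5/5), 5 of the 7 true matches were found (recall 5/7), and 58 of the 60 decisions were correct.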

Future work

  • Algorithm could specify heuristics to use if ontologies differ
  • Another algorithm emphasizing graph structure could be called for where domain similarity measure not available
  • GNAT could use audio fingerprinting where no embedded metadata available
  • More work could be done on GNARQL's crawling; more generally, linked open data publishing practices are still immature

Theoretical and practical relevance:

An attempt to realize practical benefit from linked open data.