Ontology-based modelling of related work sections in research articles: Using CRFs for developing semantic data based information retrieval systems

Citation: M. A. Angrosh, Stephen Cranefield, Nigel Stanger (2010) Ontology-based modelling of related work sections in research articles: Using CRFs for developing semantic data based information retrieval systems. I-SEMANTICS 2010
DOI (original publisher): 10.1145/1839707.1839725
Tagged: Computer Science, ontologies, CRFs, conditional random fields, information retrieval, Semantic Web, RDF, SPARQL

Summary

This paper uses conditional random fields (CRFs), a supervised machine learning technique, to classify the sentences in related work sections of research articles and provide "descriptive sentences". The corpus consists of the related work sections of 50 research articles from Springer's LNCS (1063 sentences in 200 paragraphs), manually annotated with the sets of features described below. The learned classification is marked up in RDF, with the classification serving as a sentence context ontology; this markup underlies an ontology-based information retrieval system that is queried with SPARQL.
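
As a rough illustration of what sequence labelling with a linear-chain CRF involves at prediction time, the sketch below finds the best label sequence for a paragraph with the Viterbi algorithm. The label names, feature names, and weights are hand-picked assumptions for illustration only; in the paper the model is learned from the annotated corpus, and training is not shown here.

```python
from typing import Dict, List, Tuple

# Hypothetical label set; the paper's actual sentence classes may differ.
LABELS = ["BACKGROUND", "RELATED_OUTCOME", "RELATED_SHORTCOMING"]

# Hand-set weights (assumptions). A trained CRF would learn these from data.
# Emission weights score a (feature, label) pair for a single sentence.
EMISSION: Dict[Tuple[str, str], float] = {
    ("has_citation", "RELATED_OUTCOME"): 2.0,
    ("shortcoming_term", "RELATED_SHORTCOMING"): 3.0,
    ("prev_has_citation", "RELATED_SHORTCOMING"): 1.0,
    ("no_citation", "BACKGROUND"): 1.0,
}
# Transition weights score a (previous label, current label) pair.
TRANSITION: Dict[Tuple[str, str], float] = {
    ("RELATED_OUTCOME", "RELATED_SHORTCOMING"): 1.5,
    ("BACKGROUND", "BACKGROUND"): 0.5,
}


def emit(features: List[str], label: str) -> float:
    """Sum the emission weights of the active features for one label."""
    return sum(EMISSION.get((f, label), 0.0) for f in features)


def viterbi(sentences: List[List[str]]) -> List[str]:
    """Return the highest-scoring label sequence for per-sentence feature lists."""
    if not sentences:
        return []
    # Each column maps label -> (best score of a path ending here, previous label).
    columns = [{lab: (emit(sentences[0], lab), None) for lab in LABELS}]
    for features in sentences[1:]:
        previous = columns[-1]
        column = {}
        for lab in LABELS:
            best_prev, best_score = max(
                ((p, previous[p][0] + TRANSITION.get((p, lab), 0.0)) for p in LABELS),
                key=lambda pair: pair[1],
            )
            column[lab] = (best_score + emit(features, lab), best_prev)
        columns.append(column)
    # Follow the back-pointers from the best final label.
    label = max(LABELS, key=lambda lab: columns[-1][lab][0])
    path = [label]
    for column in reversed(columns[1:]):
        label = column[label][1]
        path.append(label)
    path.reverse()
    return path


paragraph = [
    ["no_citation"],
    ["has_citation"],
    ["shortcoming_term", "prev_has_citation"],
]
labels = viterbi(paragraph)
```

The transition weights are what distinguish this from classifying each sentence in isolation: a sentence with no active features can still be labelled as a shortcoming if it follows a sentence labelled as a related work outcome.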

This work draws from argumentative zoning, especially Argumentative Zoning for improved citation indexing. In screenshots of a citation-based contextual information system, the authors show the citation sentences and shortcomings of the work (similar to what was presented in Argumentative Zoning for improved citation indexing).

One major difference seems to be the features used, which are simpler than those in Argumentative Zoning for improved citation indexing or Discourse-level argumentation in scientific articles: Human and automatic annotation. In particular, the authors focus on nine categories of terms, along with two citation features (whether the current sentence has a citation, and whether the previous sentence has a citation) and two "compound features". The compound features indicate, for each sentence containing a subject of inquiry term (e.g. examine, propose, state), whether or not it has a citation.

These are the categories of terms:

  1. Inquiry terms (examine, propose, state)
  2. Outcome terms (show, develop)
  3. Strength terms (improve, better performance, aids)
  4. Shortcoming terms in related work (Nevertheless, do not, however, but)
  5. Subjective pronouns (They, The authors, In)
  6. Words of stress (Moreover, In addition, Therefore)
  7. Alternate approach terms (may be, another approach, alternative)
  8. Result terms (we have shown, our work shows, this paper)
  9. Contrasting terms (In contrast, our work differs from that of)

Using both citation features and sentence features improves the classification.
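
The term categories and citation features described above could be turned into per-sentence features roughly as follows. This is a minimal sketch, not the authors' implementation: the term lists contain only the examples given in this summary, the citation pattern assumes numeric bracketed citations such as [3], and the function and feature names are made up for illustration.

```python
import re
from typing import Dict, List

# Example terms copied from the summary; the paper's full lexicons are not
# reproduced here. The "In" example for subjective pronouns is left out
# because, as a bare word, it would match far too many sentences.
TERM_CATEGORIES: Dict[str, List[str]] = {
    "inquiry": ["examine", "propose", "state"],
    "outcome": ["show", "develop"],
    "strength": ["improve", "better performance", "aids"],
    "shortcoming": ["nevertheless", "do not", "however", "but"],
    "subjective_pronoun": ["they", "the authors"],
    "stress": ["moreover", "in addition", "therefore"],
    "alternate": ["may be", "another approach", "alternative"],
    "result": ["we have shown", "our work shows", "this paper"],
    "contrast": ["in contrast", "our work differs from that of"],
}

# Assumption: citations look like [3] or [3, 7]. Other styles need other patterns.
CITATION = re.compile(r"\[\d+(?:\s*,\s*\d+)*\]")


def has_term(sentence: str, terms: List[str]) -> bool:
    """True if any term occurs as a whole word or phrase (no stemming)."""
    return any(
        re.search(r"\b" + re.escape(term) + r"\b", sentence, re.IGNORECASE)
        for term in terms
    )


def sentence_features(sentences: List[str]) -> List[Dict[str, bool]]:
    """Compute term, citation, and compound features for consecutive sentences."""
    rows = []
    prev_cited = False
    for sentence in sentences:
        feats = {name: has_term(sentence, terms) for name, terms in TERM_CATEGORIES.items()}
        cited = bool(CITATION.search(sentence))
        # The two citation features.
        feats["has_citation"] = cited
        feats["prev_has_citation"] = prev_cited
        # The two compound features: inquiry term with or without a citation.
        feats["inquiry_with_citation"] = feats["inquiry"] and cited
        feats["inquiry_without_citation"] = feats["inquiry"] and not cited
        rows.append(feats)
        prev_cited = cited
    return rows


rows = sentence_features([
    "Smith et al. [3] propose a graph-based ranking method.",
    "However, their approach does not scale to large corpora.",
    "In this paper we propose a different model.",
])
```

Exact whole-word matching keeps the sketch short but misses inflected forms such as "proposes" or "showed"; a real feature extractor would likely need stemming or fuller term lists.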

Information that can be retrieved

The authors list the following kinds of information that can be retrieved using their system (citation context mining + ontology + SPARQL endpoint):

  1. List current work outcomes and current work shortcomings of a research article
  2. List the context of related work cited by the author in the article
  3. List the outcomes of related work mentioned by the author in the article
  4. List related work that strongly supports the article (related work strengths)
  5. List the shortcomings of a specific cited work in an article
  6. List alternative statements for a cited work (as opined by the author)
  7. List contrasting work for a cited work
  8. Identify the use of a cited work in different articles
  9. Identify the context in which a cited work is used in different articles
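
To make items 5 and 8 of the list above concrete, the sketch below stores classified sentences as subject-predicate-object triples and answers both questions by pattern matching, which is roughly what a SPARQL query over the RDF markup would do. Everything here is an assumption for illustration: the `ex:` property and class names, the identifiers, and the toy sentences are invented and are not taken from the paper's ontology.

```python
from typing import List, Optional, Tuple

Triple = Tuple[str, str, str]

# Toy data with a hypothetical vocabulary (not the paper's ontology).
TRIPLES: List[Triple] = [
    ("article:A", "ex:hasSentence", "sent:1"),
    ("sent:1", "rdf:type", "ex:RelatedWorkShortcoming"),
    ("sent:1", "ex:cites", "work:W1"),
    ("sent:1", "ex:text", "However, W1 does not handle noisy input."),
    ("article:A", "ex:hasSentence", "sent:2"),
    ("sent:2", "rdf:type", "ex:RelatedWorkOutcome"),
    ("sent:2", "ex:cites", "work:W1"),
    ("sent:2", "ex:text", "W1 develop a rule-based tagger."),
    ("article:B", "ex:hasSentence", "sent:3"),
    ("sent:3", "rdf:type", "ex:RelatedWorkOutcome"),
    ("sent:3", "ex:cites", "work:W1"),
    ("sent:3", "ex:text", "W1 show that tagging helps retrieval."),
]


def match(triples: List[Triple], s: Optional[str] = None,
          p: Optional[str] = None, o: Optional[str] = None) -> List[Triple]:
    """Return the triples matching a pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def shortcomings_of(triples: List[Triple], article: str, work: str) -> List[str]:
    """Item 5: shortcoming sentences about one cited work in one article.

    Comparable in spirit to a SPARQL query such as:
      SELECT ?text WHERE {
        <article> ex:hasSentence ?s .
        ?s a ex:RelatedWorkShortcoming ; ex:cites <work> ; ex:text ?text .
      }
    """
    texts = []
    for _, _, sent in match(triples, s=article, p="ex:hasSentence"):
        is_shortcoming = match(triples, s=sent, p="rdf:type", o="ex:RelatedWorkShortcoming")
        cites_work = match(triples, s=sent, p="ex:cites", o=work)
        if is_shortcoming and cites_work:
            texts.extend(o for _, _, o in match(triples, s=sent, p="ex:text"))
    return texts


def articles_citing(triples: List[Triple], work: str) -> List[str]:
    """Item 8: articles containing at least one sentence that cites the work."""
    citing = {s for s, _, _ in match(triples, p="ex:cites", o=work)}
    return sorted({a for a, _, sent in match(triples, p="ex:hasSentence") if sent in citing})


found = shortcomings_of(TRIPLES, "article:A", "work:W1")
users = articles_citing(TRIPLES, "work:W1")
```

A real deployment would load the RDF into a triple store and send the query to its SPARQL endpoint; the plain-list version only shows which joins the retrieval tasks require.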

Theoretical and Practical Relevance

This is a proof of concept of a semantic information retrieval system that focuses on citations. Future work could apply Argumentative Zoning for improved citation indexing more rigorously, focusing, for instance, on the CONTRAST and BASIS terms, or on paradigm shifts as in Sandor's work (among others: Christine Chichester, Frédérique Lisacek, Aaron Kaplan, and Agnes Sandor. 2005. Discovering paradigm shift patterns in biomedical abstracts: Application to neurodegenerative diseases. In Proceedings of the First International Symposium on Semantic Mining in Biomedicine).