Mining the peanut gallery: Opinion extraction and semantic classification of product reviews

From AcaWiki

Citation: Kushal Dave, Steve Lawrence, and David M. Pennock (2003) Mining the peanut gallery: Opinion extraction and semantic classification of product reviews. Proceedings of the 12th international conference on World Wide Web
DOI (original publisher): 10.1145/775152.775226
Download: http://doi.acm.org/10.1145/775152.775226
Tagged: Computer Science, opinion mining, sentiment analysis, product reviews, machine learning, information extraction

Summary

This 2003 paper provides a useful guide to related work in several areas:

  • objectivity/subjectivity classification
  • word classification, e.g. using textual conjunctions like "fair and legitimate" or "simplistic but well-received" to separate similarly- and oppositely-connoted words (as in "Predicting the semantic orientation of adjectives")
  • sentiment classification
  • recommendations
  • commercial products

The authors compare information-retrieval-style scoring approaches with machine learning classifiers.

Review Processing

Sources

They process reviews from two consumer websites, CNET and Amazon. They use two test conditions: the unprocessed review sets, and balanced sets of randomly selected reviews (equal numbers of positive and negative).
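
The balanced-sampling condition can be sketched as follows (the function name, the "pos"/"neg" labels, and the fixed seed are illustrative assumptions, not details from the paper):

```python
import random

def balanced_sample(reviews, labels, n_per_class, seed=0):
    """Randomly draw an equal number of positive and negative reviews."""
    rng = random.Random(seed)
    pos = [r for r, y in zip(reviews, labels) if y == "pos"]
    neg = [r for r, y in zip(reviews, labels) if y == "neg"]
    k = min(n_per_class, len(pos), len(neg))
    return rng.sample(pos, k), rng.sample(neg, k)
```

Capping at the smaller class size keeps the sample balanced even when one sentiment dominates, which is the usual case on retail sites.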

Substitutions

They try several kinds of substitutions:

  • number and category (e.g. replacing the product names with a generic variable)
  • linguistic substitutions using WordNet collocations
  • Porter stemming
  • negatives
  • N-grams and proximity
  • substrings

Overgeneralization seems to cause many problems with the substitutions chosen.
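
A few of the substitutions above can be sketched in simplified form (the token names `_PRODUCT_` and `_NUMBER_`, the negation-word set, and the `NOT_` prefix are illustrative assumptions; the paper's actual transformations are more involved):

```python
import re

def substitute(text, product_names):
    """Replace known product names and numbers with generic tokens."""
    for name in product_names:
        text = re.sub(re.escape(name), "_PRODUCT_", text, flags=re.IGNORECASE)
    return re.sub(r"\d+", "_NUMBER_", text)

def mark_negation(tokens):
    """Prefix tokens after a negation word with NOT_ until punctuation."""
    out, negate = [], False
    for tok in tokens:
        if tok in {"not", "no", "never", "n't"}:
            negate = True
            out.append(tok)
        elif tok in {".", ",", ";", "!", "?"}:
            negate = False
            out.append(tok)
        else:
            out.append("NOT_" + tok if negate else tok)
    return out

def ngrams(tokens, n):
    """Contiguous n-grams used as features alongside single tokens."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
```

The generic tokens let evidence pool across products ("the _PRODUCT_ broke" generalizes), which is also where overgeneralization can creep in.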

Outcomes

Then they count:

  • how many times each term occurs
  • how many documents each term occurs in
  • how many categories a term occurs in

They smooth these counts, score the reviews (trying various machine learning algorithms), and reweight the features. New documents can then be classified from their feature vectors. They detail further experiments, such as rescaling the feature vectors.
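
A minimal smoothed scoring classifier in the spirit of this pipeline might look like the following (the add-one smoothing constant and function names are assumptions; the paper evaluates several scoring and learning variants):

```python
from collections import Counter

def train_scores(docs_pos, docs_neg, alpha=1.0):
    """Assign each term a smoothed score in [-1, 1]: positive values
    lean toward the positive class, negative toward the negative."""
    cp, cn = Counter(), Counter()
    for d in docs_pos:
        cp.update(d)
    for d in docs_neg:
        cn.update(d)
    vocab = set(cp) | set(cn)
    tp, tn = sum(cp.values()), sum(cn.values())
    scores = {}
    for t in vocab:
        p = (cp[t] + alpha) / (tp + alpha * len(vocab))
        q = (cn[t] + alpha) / (tn + alpha * len(vocab))
        scores[t] = (p - q) / (p + q)
    return scores

def classify(doc, scores):
    """Label a tokenized document by the sign of its summed term scores."""
    s = sum(scores.get(t, 0.0) for t in doc)
    return "pos" if s > 0 else "neg"
```

Smoothing keeps rare terms from getting extreme scores, and unseen terms contribute nothing, so classification degrades gracefully on new vocabulary.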

They present a system called ReviewSeer that collects product mentions from search engines, mines reviews for particular products, groups them into categories, and assigns assessments.

As an initial corpus, they select and manually tag 600 sentences (200 for each of 3 products). Many sentences are ambiguous out of context, do not express an opinion, or do not describe the product. They conclude that it is important to first find "coherent, topical opinions".

They also present conclusions and ideas for future work.
