Mining the peanut gallery: Opinion extraction and semantic classification of product reviews
   Citation: Kushal Dave, Steve Lawrence and David M. Pennock (2003) Mining the peanut gallery: Opinion extraction and semantic classification of product reviews. Proceedings of the 12th International Conference on World Wide Web
DOI (original publisher): 10.1145/775152.775226 
 Download: http://doi.acm.org/10.1145/775152.775226 
Tagged: Computer Science, opinion mining, sentiment analysis, product reviews, machine learning, information extraction
Summary
This 2003 paper provides a useful guide to related work in several areas:
- objectivity/subjectivity classification
- word classification (e.g. "Predicting the semantic orientation of adjectives", which uses 'textual conjunctions like "fair and legitimate" or "simplistic but well-received" to separate similarly- and oppositely-connoted words')
- sentiment classification
- recommendations
- commercial products
They compare information retrieval approaches with machine learning.
Review Processing
Sources
They process reviews from two consumer websites, CNET and Amazon. They run two kinds of tests: one on the unprocessed reviews, and one on a balanced sample (i.e. equal numbers of randomly selected positive and negative reviews).
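The balanced test can be illustrated with a small downsampling routine. This is a minimal sketch, not the authors' code: the `(text, label)` pair format, the `balance_reviews` name and the fixed seed are all assumptions made for the example.

```python
import random
from collections import defaultdict


def balance_reviews(reviews, seed=0):
    """Downsample so that every label keeps as many reviews as the rarest label.

    `reviews` is a non-empty list of (text, label) pairs (an assumed format).
    """
    by_label = defaultdict(list)
    for text, label in reviews:
        by_label[label].append((text, label))

    # Size of the smallest class decides how many reviews each class keeps.
    smallest = min(len(group) for group in by_label.values())

    rng = random.Random(seed)  # fixed seed so the selection is repeatable
    balanced = []
    for label in sorted(by_label):
        balanced.extend(rng.sample(by_label[label], smallest))
    rng.shuffle(balanced)
    return balanced
```

Product reviews tend to be skewed towards positive ratings, so a classifier evaluated on the raw data can look accurate simply by predicting the majority class; a balanced sample removes that effect.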
Substitutions
They try several kinds of substitutions:
- number and category (e.g. replacing the product names with a generic variable)
- linguistic substitutions using WordNet collocations
- Porter stemming
- negation
- N-grams and proximity
- substrings
Overgeneralization seems to cause many problems with the substitutions chosen.
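A few of these substitutions can be sketched in a handful of lines. The placeholder tokens `_productname` and `_number`, the `not_` prefix, the short negation word list and the function names are hypothetical choices for this illustration; the paper's exact rules are not reproduced here.

```python
import re

# A deliberately tiny negation list, assumed for the example.
NEGATIONS = {"not", "no", "never"}


def substitute(text, product_names=()):
    """Return tokens with product names, numbers and negations generalized."""
    text = text.lower()
    # Category substitution: replace known product names with one placeholder.
    for name in product_names:
        text = text.replace(name.lower(), "_productname")

    # Words (letters and underscores) or numbers such as 299 or 299.99.
    tokens = re.findall(r"[a-z_]+|\d+(?:\.\d+)?", text)
    # Number substitution: every numeric token becomes the same placeholder.
    tokens = ["_number" if t[0].isdigit() else t for t in tokens]

    # Negation: fold a negation word into the token that follows it.
    out = []
    negate = False
    for t in tokens:
        if t in NEGATIONS:
            negate = True
            continue
        out.append("not_" + t if negate else t)
        negate = False
    return out


def ngrams(tokens, n):
    """Return all contiguous n-token sequences as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
```

For instance, `substitute("The Nikon 990 is not good", ["Nikon 990"])` yields `['the', '_productname', 'is', 'not_good']`. The sketch also hints at the overgeneralization risk: once every number maps to `_number`, a price and a star rating become indistinguishable.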
Outcomes
Then they count:
- how many times each term occurs
- how many documents each term occurs in
- how many categories a term occurs in
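The counts listed above can be gathered in a single pass over the corpus. This is a sketch under the assumption that each document is already tokenized and paired with a category label; `count_terms` is a name invented for the example.

```python
from collections import Counter, defaultdict


def count_terms(docs):
    """Collect term, document and category frequencies.

    `docs` is a list of (tokens, category) pairs (an assumed format).
    """
    term_freq = Counter()          # how many times each term occurs
    doc_freq = Counter()           # how many documents each term occurs in
    categories = defaultdict(set)  # which categories each term occurs in

    for tokens, category in docs:
        term_freq.update(tokens)
        unique = set(tokens)
        doc_freq.update(unique)
        for term in unique:
            categories[term].add(category)

    cat_freq = {term: len(cats) for term, cats in categories.items()}
    return term_freq, doc_freq, cat_freq
```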
They then smooth the counts, score the reviews (trying various machine learning algorithms), and reweight the features. This lets them classify new documents based on those documents' feature vectors. They also detail further experiments, such as scaling the feature records.
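To make the smooth-score-classify pipeline concrete, here is one simple possibility. It is an illustrative assumption rather than the paper's exact method: it uses add-one (Laplace) smoothing, scores each term by the normalized difference of its smoothed probabilities in the two classes, and classifies a document by the sign of its summed term scores. The labels `"pos"` and `"neg"` and both function names are hypothetical.

```python
from collections import Counter


def train_scores(docs, alpha=1.0):
    """Score each term in [-1, 1]; positive values lean towards "pos".

    `docs` is a list of (tokens, label) pairs with label "pos" or "neg".
    `alpha` is the additive smoothing constant (1.0 gives Laplace smoothing).
    """
    counts = {"pos": Counter(), "neg": Counter()}
    for tokens, label in docs:
        counts[label].update(tokens)

    vocab = set(counts["pos"]) | set(counts["neg"])
    # Smoothed denominators: class token total plus alpha per vocabulary term.
    totals = {c: sum(counts[c].values()) + alpha * len(vocab) for c in counts}

    scores = {}
    for term in vocab:
        p_pos = (counts["pos"][term] + alpha) / totals["pos"]
        p_neg = (counts["neg"][term] + alpha) / totals["neg"]
        scores[term] = (p_pos - p_neg) / (p_pos + p_neg)
    return scores


def classify(tokens, scores):
    """Sum the scores of known terms; unseen terms contribute nothing."""
    total = sum(scores.get(t, 0.0) for t in tokens)
    return "pos" if total > 0 else "neg"
```

Smoothing matters here because a term seen in only one class would otherwise get a probability of zero in the other, pushing its score to exactly 1 or -1 on the strength of a single occurrence.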
They present a system called ReviewSeer that collects mentions of particular products from search engines, groups them into categories, and gives assessments.
As an initial corpus, they select and manually tag 600 sentences (200 for each of 3 products). Many sentences are ambiguous out of context, do not express an opinion, or do not describe the product. They conclude that it is important to first find "coherent, topical opinions".
They also present conclusions and ideas for future work.
Selected References
- Vasileios Hatzivassiloglou and Kathleen R. McKeown. Predicting the semantic orientation of adjectives. In Proceedings of the 35th Annual Meeting of ACL, 1997.
 

