Twitter as a corpus for sentiment analysis and opinion mining
Citation: A. Pak, P. Paroubek (2010) Twitter as a corpus for sentiment analysis and opinion mining. LREC
Tagged: Computer Science
Twitter, opinion mining, sentiment analysis, Naive Bayes
Summary
This paper builds and evaluates a sentiment classifier trained on a corpus of 300,000 tweets labeled as positive, negative, or neutral. The authors perform a statistical linguistic analysis of the corpus and train a multinomial Naive Bayes classifier on it. Full text: http://deepthoughtinc.com/wp-content/uploads/2011/01/Twitter-as-a-Corpus-for-Sentiment-Analysis-and-Opinion-Mining.pdf
Methodology
The authors first perform a statistical linguistic analysis of the corpus. To build the classifier, they filter the collected tweets to remove URLs, user names, "RT", and emoticons, then tokenize (keeping words with apostrophes intact) and remove stopwords. They construct n-grams, attaching a negation such as "not" to the word that precedes or follows it. They also tried SVM and CRF classifiers and found that Naive Bayes performed best. To increase accuracy, they discard n-grams with high entropy or low salience. The classifier is evaluated on a hand-annotated subset.
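The preprocessing steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' code: the regular expressions, the stopword and negation lists, the "+" joining convention, and the function names (preprocess, attach_negation, ngrams, class_entropy) are all hypothetical, and the base-2 logarithm in the entropy is an arbitrary choice.

```python
import math
import re

# All patterns and word lists below are illustrative assumptions.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
USER_RE = re.compile(r"@\w+")
RT_RE = re.compile(r"\bRT\b")
EMOTICON_RE = re.compile(r"[:;=]-?[()D]")
TOKEN_RE = re.compile(r"[a-z]+(?:'[a-z]+)*")  # keeps words such as "don't" whole
STOPWORDS = {"a", "an", "the"}
NEGATIONS = {"no", "not"}


def preprocess(tweet):
    """Filter URLs, user names, "RT" and emoticons, then tokenize and drop stopwords."""
    text = tweet
    for pattern in (URL_RE, USER_RE, RT_RE, EMOTICON_RE):
        text = pattern.sub(" ", text)
    tokens = TOKEN_RE.findall(text.lower())
    return [t for t in tokens if t not in STOPWORDS]


def attach_negation(tokens):
    """Merge each negation into the word before it and the word after it."""
    out = []
    for i, tok in enumerate(tokens):
        if tok in NEGATIONS:
            continue  # the negation is carried by its neighbours
        prev_neg = i > 0 and tokens[i - 1] in NEGATIONS
        next_neg = i + 1 < len(tokens) and tokens[i + 1] in NEGATIONS
        if prev_neg:
            out.append(tokens[i - 1] + "+" + tok)
        if next_neg:
            out.append(tok + "+" + tokens[i + 1])
        if not prev_neg and not next_neg:
            out.append(tok)
    return out


def ngrams(tokens, n):
    """Return all contiguous n-grams as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def class_entropy(counts):
    """Shannon entropy (base 2) of an n-gram's distribution over sentiment classes."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values() if c > 0)
```

For example, preprocess("RT @bob I do not like fish :( http://t.co/x") yields the tokens i, do, not, like, fish, and attach_negation turns them into i, do+not, not+like, fish. An n-gram that occurs equally often in every class has maximal entropy and carries little sentiment signal, which is the motivation for discarding high-entropy n-grams.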
Corpus
The corpus contains 300,000 posts in English: 100,000 expressing positive emotion, 100,000 expressing negative emotion, and 100,000 neutral (objective) posts "that only state a fact or do not express any emotions". Objective texts were collected from the Twitter accounts of newspapers. Positive and negative texts were found by querying for a variety of emoticons (“:-)”, “:)”, “=)”, “:D” etc. for positive and “:-(”, “:(”, “=(”, “;(” for negative).
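The emoticon-based labeling can be sketched as below. This is a hedged illustration, not the authors' collection code: the function name emoticon_label is hypothetical, only the emoticons listed above are used (the paper's "etc." is not filled in), and discarding posts that contain both kinds of emoticon or neither is an assumption.

```python
# Emoticons taken from the lists above; the paper's full lists may be longer.
POSITIVE = (":-)", ":)", "=)", ":D")
NEGATIVE = (":-(", ":(", "=(", ";(")


def emoticon_label(tweet):
    """Return "positive", "negative", or None when the emoticons give no clear label."""
    has_pos = any(e in tweet for e in POSITIVE)
    has_neg = any(e in tweet for e in NEGATIVE)
    if has_pos and not has_neg:
        return "positive"
    if has_neg and not has_pos:
        return "negative"
    return None  # assumption: ambiguous or emoticon-free posts are skipped
```

Because the emoticons serve as noisy labels, they are removed from the text before training (see Methodology), so the classifier cannot simply memorize them.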
See also
- Interesting tool from Alec Go, Lei Huang, and Richa Bhayani. 2009. Twitter sentiment analysis. Final Projects from CS224N for Spring 2008/2009 at The Stanford Natural Language Processing Group. http://www-nlp.stanford.edu/courses/cs224n/2009/fp/3.pdf
- Keeping the "not" with the previous or following word: Theresa Wilson, Janyce Wiebe, and Paul Hoffmann. 2005. Recognizing contextual polarity in phrase-level sentiment analysis. In HLT ’05: Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing, pages 347–354, Morristown, NJ, USA. Association for Computational Linguistics.
Theoretical and Practical Relevance
POS tagging of tweets is challenging because of nonstandard spelling, e.g. "whose" written in place of "who is".