Twitter as a corpus for sentiment analysis and opinion mining

From AcaWiki



Citation: A. Pak, P. Paroubek (2010) Twitter as a corpus for sentiment analysis and opinion mining. LREC



Tagged: Computer Science, Twitter, opinion mining, sentiment analysis, Naive Bayes


Summary:

This paper builds and evaluates a sentiment classifier trained on a corpus of 300,000 tweets labeled as positive, negative, or neutral. The authors analyze the corpus with statistical linguistic methods and classify posts with a multinomial Naive Bayes classifier.

Methodology

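The authors first perform a statistical linguistic analysis of the corpus. After collecting the corpus, they filter each post to remove URLs, user names, the "RT" marker, and emoticons. They then tokenize the text (keeping words with apostrophes as single tokens) and remove stopwords. From the tokens they construct n-grams, attaching a negation such as "not" to the preceding or following word. They tried SVM and CRF classifiers but found that Naive Bayes performed best. To increase accuracy, they discard n-grams with high entropy or low salience, that is, n-grams that are spread about evenly across the sentiment classes and so carry little signal. The classifier is evaluated on a hand-annotated subset.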
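The pipeline described above can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the stopword list, the set of negation words, the exact way a negation is merged into its neighbours, the add-one smoothing, and all function and class names are assumptions made for the example. Entropy is estimated here from raw per-class counts, which is only a stand-in for whatever estimate the paper uses.

```python
import math
import re
from collections import Counter, defaultdict

STOPWORDS = {"a", "an", "the"}   # hypothetical list; the summary names none
NEGATIONS = {"not", "no"}        # assumption: tokens treated as negation


def preprocess(text):
    """Drop URLs, @user names and the RT marker, tokenize, remove stopwords."""
    text = re.sub(r"https?://\S+", " ", text)
    text = re.sub(r"@\w+", " ", text)
    text = re.sub(r"\bRT\b", " ", text)
    # Letters only, so emoticons vanish; an apostrophe may join two letter runs.
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


def attach_negation(tokens):
    """Merge each negation into the word before it and the word after it."""
    out = []
    for i, tok in enumerate(tokens):
        if tok in NEGATIONS:
            continue  # negations survive only inside merged tokens
        prev_neg = i > 0 and tokens[i - 1] in NEGATIONS
        next_neg = i + 1 < len(tokens) and tokens[i + 1] in NEGATIONS
        if next_neg:
            out.append(tok + "+" + tokens[i + 1])
        if prev_neg:
            out.append(tokens[i - 1] + "+" + tok)
        if not (prev_neg or next_neg):
            out.append(tok)
    return out


def ngrams(tokens, n):
    """Return all contiguous n-grams as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def features(text):
    """Bigram features with negation attached (n=2 is an arbitrary choice)."""
    return ngrams(attach_negation(preprocess(text)), 2)


def entropy(ngram, feature_counts):
    """Entropy (in bits) of an n-gram's distribution over sentiment classes."""
    counts = [c[ngram] for c in feature_counts.values()]
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in counts if c)


class MultinomialNB:
    """Multinomial Naive Bayes with additive (Laplace) smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.class_docs = Counter()                 # documents per class
        self.feature_counts = defaultdict(Counter)  # n-gram counts per class
        self.vocab = set()

    def fit(self, docs, labels):
        for feats, label in zip(docs, labels):
            self.class_docs[label] += 1
            self.feature_counts[label].update(feats)
            self.vocab.update(feats)
        return self

    def predict(self, feats):
        total_docs = sum(self.class_docs.values())
        best, best_score = None, -math.inf
        for label, n_docs in self.class_docs.items():
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for f in feats:
                if f in self.vocab:  # ignore n-grams never seen in training
                    score += math.log((counts[f] + self.alpha) / denom)
            if score > best_score:
                best, best_score = label, score
        return best


# Toy training data, invented for illustration only.
train = [("I do not like rain", "negative"), ("I like sunshine", "positive")]
clf = MultinomialNB().fit([features(t) for t, _ in train], [y for _, y in train])
print(clf.predict(features("I do not like mondays")))
```

To mimic the filtering step, one could drop every n-gram whose `entropy(...)` over `clf.feature_counts` exceeds a chosen threshold before training; the threshold value is a tuning choice the summary does not state.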

Corpus

The corpus contains 300,000 posts in English: 100,000 containing positive emotion, 100,000 containing negative emotion, and 100,000 neutral posts "that only state a fact or do not express any emotions". Objective texts were collected by querying the Twitter accounts of newspapers. Positive and negative texts were found by querying for a variety of emoticons (“:-)”, “:)”, “=)”, “:D” etc. for positive and “:-(”, “:(”, “=(”, “;(” for negative).
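The emoticon-based labeling can be illustrated with a short sketch. It uses only the emoticons listed above (the positive list is marked "etc." in the summary, so it is incomplete here), and the function name and the rule of skipping posts with mixed or no emoticons are assumptions made for this example.

```python
POSITIVE = (":-)", ":)", "=)", ":D")
NEGATIVE = (":-(", ":(", "=(", ";(")


def emoticon_label(text):
    """Return 'positive', 'negative', or None when the post gives no clear signal."""
    has_pos = any(e in text for e in POSITIVE)
    has_neg = any(e in text for e in NEGATIVE)
    if has_pos == has_neg:
        # Either no emoticon at all or both kinds present: skip (an assumption).
        return None
    return "positive" if has_pos else "negative"


for post in ["great day :)", "missed the bus :(", "train leaves at 9"]:
    print(post, "->", emoticon_label(post))
```

The emoticons serve only as noisy labels; as the methodology notes, they are stripped from the text before training so the classifier cannot simply memorize them.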

See also

Alec Go, Lei Huang, and Richa Bhayani. 2009. Twitter sentiment analysis. Final Projects from CS224N for Spring 2008/2009 at The Stanford Natural Language Processing Group. http://www-nlp.stanford.edu/courses/cs224n/2009/fp/3.pdf

Theresa Wilson, Janyce Wiebe, and Paul Hoffmann. 2005. Recognizing contextual polarity in phrase-level sentiment analysis. In HLT ’05: Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing, pages 347–354, Morristown, NJ, USA. Association for Computational Linguistics.

Theoretical and practical relevance:

POS tagging of tweets is challenging because of misspellings, e.g. "whose" written in place of "who is".


