Citation: A. Pak, P. Paroubek (2010) Twitter as a corpus for sentiment analysis and opinion mining. LREC
Tagged: Computer Science, Twitter, opinion mining, sentiment analysis, Naive Bayes

Summary

This paper builds and evaluates a sentiment classifier trained on a corpus of 300,000 tweets labeled as positive, negative, or neutral. The authors perform a statistical linguistic analysis of the corpus and classify tweets with a multinomial Naive Bayes classifier. Full text: http://deepthoughtinc.com/wp-content/uploads/2011/01/Twitter-as-a-Corpus-for-Sentiment-Analysis-and-Opinion-Mining.pdf

Methodology

The authors first perform a statistical linguistic analysis of the collected corpus. To build the classifier, they filter the tweets to remove URLs, user names, "RT" markers, and emoticons, then tokenize (keeping words with apostrophes intact) and remove stopwords. They construct n-grams, attaching "not" to the word that precedes or follows it. They also tried SVM and CRF classifiers and found that Naive Bayes performed best. To increase accuracy, they remove n-grams with high entropy or low salience. The classifier is evaluated on a hand-annotated subset.
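The preprocessing steps above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the authors' code: the regular expressions, the tiny stopword list, the "+" joiner, and the function names preprocess, attach_negation, ngrams, and class_entropy are all hypothetical choices made for this example. The entropy function uses the standard Shannon entropy of an n-gram's class counts, which is one plausible reading of "high entropy" here.

```python
import math
import re

# Illustrative patterns only; the exact filtering rules are assumptions.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
USER_RE = re.compile(r"@\w+")
RT_RE = re.compile(r"\bRT\b")
EMOTICON_RE = re.compile(r"[:;=][-']?[()DPp]")
# Words may contain apostrophes, e.g. "don't" stays one token.
TOKEN_RE = re.compile(r"[a-z]+(?:'[a-z]+)*")
# Hypothetical, deliberately tiny stopword list; "not" must not be in it.
STOPWORDS = {"a", "an", "the"}


def preprocess(tweet):
    """Remove URLs, user names, RT markers and emoticons, then tokenize."""
    text = tweet
    for pattern in (URL_RE, USER_RE, RT_RE, EMOTICON_RE):
        text = pattern.sub(" ", text)
    tokens = TOKEN_RE.findall(text.lower())
    return [t for t in tokens if t not in STOPWORDS]


def attach_negation(tokens, negations=("not",)):
    """Merge each negation word into the word before it and the word after it."""
    merged = []
    for i, tok in enumerate(tokens):
        if tok in negations:
            continue  # the negation survives only inside its neighbours
        if i > 0 and tokens[i - 1] in negations:
            merged.append(tokens[i - 1] + "+" + tok)
        elif i + 1 < len(tokens) and tokens[i + 1] in negations:
            merged.append(tok + "+" + tokens[i + 1])
        else:
            merged.append(tok)
    return merged


def ngrams(tokens, n):
    """Return all contiguous n-grams as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def class_entropy(counts):
    """Shannon entropy (in bits) of an n-gram's counts across sentiment classes."""
    total = sum(counts.values())
    probs = [c / total for c in counts.values() if c > 0]
    return -sum(p * math.log2(p) for p in probs)
```

With these definitions, the tokens of "I do not like fish" become i, do+not, not+like, fish after attach_negation, so the bigrams carry the negation on both sides. An n-gram seen equally often in positive and negative tweets has an entropy of 1 bit and would be a candidate for removal; one seen in a single class has an entropy of 0.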

Corpus

The corpus consists of 300,000 posts in English: 100,000 containing positive emotion, 100,000 containing negative emotion, and 100,000 neutral posts "that only state a fact or do not express any emotions". Objective texts were collected by querying newspapers' accounts. Positive and negative texts were found by querying for a variety of emoticons: “:-)”, “:)”, “=)”, “:D”, etc. for positive and “:-(”, “:(”, “=(”, “;(” for negative.
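The emoticon-based labeling can be illustrated with a short Python sketch. It assumes simple substring matching and uses only the emoticons listed above (the positive list is given with "etc.", so it is incomplete). The function name emoticon_label and the choice to skip tweets containing both kinds of emoticon are assumptions made for this example, not details taken from the summary.

```python
# Emoticons quoted in the summary; the positive list there ends with "etc.".
POSITIVE = (":-)", ":)", "=)", ":D")
NEGATIVE = (":-(", ":(", "=(", ";(")


def emoticon_label(tweet):
    """Return 'positive', 'negative', or None when the tweet gives no clear signal."""
    has_pos = any(e in tweet for e in POSITIVE)
    has_neg = any(e in tweet for e in NEGATIVE)
    if has_pos == has_neg:
        # Either no emoticon at all or both kinds: skipped (an assumption).
        return None
    return "positive" if has_pos else "negative"
```

A tweet labeled this way would then have its emoticons stripped before training, so that the classifier learns from the words rather than from the label signal itself.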

Theoretical and Practical Relevance

POS tagging of tweets is challenging because of nonstandard spelling, e.g. "whose" written in place of "who is".
