Twitter as a corpus for sentiment analysis and opinion mining

{{Summary http://deepthoughtinc.com/wp-content/uploads/2011/01/Twitter-as-a-Corpus-for-Sentiment-Analysis-and-Opinion-Mining.pdf
 * title=Twitter as a corpus for sentiment analysis and opinion mining
 * authors=A. Pak, P. Paroubek
 * tags=Twitter, opinion mining, sentiment analysis, Naive Bayes
 * summary=This paper builds and evaluates a sentiment classifier trained on a corpus of 300,000 tweets labeled as positive, negative, or neutral, using statistical linguistic analysis and a multinomial Naive Bayes classifier.

Methodology
The authors perform a statistical linguistic analysis of the corpus. After collecting it, they filter out URLs, user names, the retweet marker "RT", and emoticons. They then tokenize the text (keeping words with apostrophes, such as "don't", as single tokens) and remove stopwords. They construct n-grams, attaching the negation "not" to the previous or following word. They tried SVM and CRF classifiers but found that Naive Bayes performed best. To increase accuracy, they discard n-grams with high entropy or low salience. The classifier is evaluated on a hand-annotated subset.
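
The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the regular expressions, the three-word stopword list, and the function names (clean, tokenize, attach_negation, bigrams, entropy) are all assumptions made for the example. The entropy function computes the Shannon entropy of an n-gram's distribution over sentiment classes, which is one plausible reading of the "high entropy" filter; the salience measure is not reproduced here.

```python
import math
import re

# Assumed, illustrative stopword list; the summary does not specify one.
STOPWORDS = {"a", "an", "the"}


def clean(tweet):
    """Remove URLs, @user names, the retweet marker RT, and simple emoticons."""
    tweet = re.sub(r"https?://\S+", " ", tweet)
    tweet = re.sub(r"@\w+", " ", tweet)
    tweet = re.sub(r"\bRT\b", " ", tweet)
    tweet = re.sub(r"[:;=]-?[()D]", " ", tweet)
    return tweet


def tokenize(tweet):
    """Lowercase, keep apostrophes inside words, and drop stopwords."""
    words = re.findall(r"[a-z]+(?:'[a-z]+)*", clean(tweet).lower())
    return [w for w in words if w not in STOPWORDS]


def attach_negation(tokens, to_previous=True):
    """Glue each "not" onto the previous word, or onto the following word."""
    merged = []
    i = 0
    while i < len(tokens):
        if tokens[i] == "not":
            if to_previous and merged:
                merged[-1] += "+not"
                i += 1
                continue
            if not to_previous and i + 1 < len(tokens):
                merged.append("not+" + tokens[i + 1])
                i += 2
                continue
        merged.append(tokens[i])
        i += 1
    return merged


def bigrams(tokens):
    """Pairs of adjacent tokens."""
    return list(zip(tokens, tokens[1:]))


def entropy(class_counts):
    """Shannon entropy (in nats) of an n-gram's counts per sentiment class."""
    total = sum(class_counts.values())
    probs = [c / total for c in class_counts.values() if c > 0]
    return -sum(p * math.log(p) for p in probs)
```

For the sentence "I do not like fish", attaching the negation to the previous word gives the tokens i, do+not, like, fish, and attaching it to the following word gives i, do, not+like, fish. An n-gram that occurs equally often in every class has maximal entropy and carries little sentiment information, which is why such n-grams are candidates for removal.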

Corpus
The corpus contains 300,000 posts in English: 100,000 containing positive emotion, 100,000 containing negative emotion, and 100,000 neutral posts "that only state a fact or do not express any emotions". Objective texts were collected from the Twitter accounts of newspapers and magazines. Positive and negative texts were found by querying for a variety of emoticons (“:-)”, “:)”, “=)”, “:D”, etc. for positive and “:-(”, “:(”, “=(”, “;(” for negative).
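
The emoticon-based labeling can be illustrated with a short sketch. The function name label_by_emoticon is hypothetical, the positive list is only the part quoted above (the source ends it with "etc."), and returning None for posts with no emoticons or with emoticons of both kinds is an assumption of this example rather than something the summary states.

```python
# Emoticon lists as quoted in the corpus description; the positive list is partial.
POSITIVE = (":-)", ":)", "=)", ":D")
NEGATIVE = (":-(", ":(", "=(", ";(")


def label_by_emoticon(tweet):
    """Return "positive", "negative", or None when the signal is absent or mixed."""
    has_pos = any(e in tweet for e in POSITIVE)
    has_neg = any(e in tweet for e in NEGATIVE)
    if has_pos == has_neg:
        # Either no emoticon at all, or both kinds: treat as unlabeled (assumption).
        return None
    return "positive" if has_pos else "negative"
```

The emoticon serves only as a noisy label for building the training set; as the methodology notes, emoticons are filtered out of the text before features are extracted.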