Characterizing microblogs with topic models

{{Summary
 * title=Characterizing Microblogs with Topic Models
 * authors=Daniel Ramage, Susan Dumais, Dan Liebling
 * url=http://aaai.org/ocs/index.php/ICWSM/ICWSM10/paper/view/1528/1846
 * tags=Twitter, microblogging, streams, filtering, machine learning
 * summary=Motivated by a model of user behavior from interviews and user surveys (at Microsoft), this paper argues that better models of tweets would be useful for two major problems Twitter users have: finding new users and topics to follow, and filtering out "noise" in feeds.

To model tweets, the paper uses machine learning techniques. Training labels are derived from features of each post: hashtags, emoticons, @user mentions, replies, and questions. The model is Labeled LDA (Ramage et al., 2009), an extension of Latent Dirichlet Allocation (Blei et al., 2003).
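As an illustration of how such labels might be derived from raw post text, here is a minimal Python sketch. It is not the authors' code: the function name `extract_labels`, the regular expressions, and the rules for "reply" (post starts with @) and "question" (post contains a question mark) are all assumptions made for this example.

```python
import re

# Hypothetical patterns; the paper's actual label definitions may differ.
HASHTAG = re.compile(r"#\w+")
MENTION = re.compile(r"@\w+")
EMOTICON = re.compile(r"[:;]-?[()DP]")


def extract_labels(tweet):
    """Return the set of labels observed on one post (illustrative rules only)."""
    labels = set()
    # Each distinct hashtag becomes its own label.
    labels.update(tag.lower() for tag in HASHTAG.findall(tweet))
    # Any @user mention anywhere in the post.
    if MENTION.search(tweet):
        labels.add("@user")
    # Assumed rule: a post that starts with a mention is a reply.
    if tweet.lstrip().startswith("@"):
        labels.add("reply")
    # Assumed rule: a question mark marks a question.
    if "?" in tweet:
        labels.add("question")
    # A small, non-exhaustive emoticon set.
    if EMOTICON.search(tweet):
        labels.add("emoticon")
    return labels


print(extract_labels("@bob are you coming to #ICWSM? :)"))
```

In Labeled LDA each post's topics are restricted to its observed labels, so a label set like this one determines which labelled dimensions a post can draw on.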

The data set consisted of 8,214,019 Twitter posts from one week in November 2009.

The "200 latent dimensions" learned in the run were manually labelled by four raters, using the "4S" dimensions listed below plus "other". These dimensions first arose in the user interviews. "At the word level, Twitter is 11% substance, 5% status, 16% style, 10% social, and 56% other."
 * 1) substance
 * 2) social
 * 3) status
 * 4) style
 * 5) other

The model is used for two tasks, both of which were tested with users at Microsoft:
 * 1) ranking posts from a user's current feed
 * 2) recommending new users to follow
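
This summary does not say how candidates are scored in either task. One plausible approach, shown here purely as an assumption, is to represent a user's interests and each post as vectors over the model's topic dimensions and rank by cosine similarity. The names `cosine` and `rank_posts` and the three-dimensional vectors are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_posts(profile, posts):
    """Order post ids from most to least similar to the user's profile vector.

    `posts` maps a post id to its topic vector (assumed to come from the model).
    """
    return sorted(posts, key=lambda pid: cosine(profile, posts[pid]), reverse=True)


# Toy vectors over three topic dimensions, invented for illustration.
profile = [0.7, 0.2, 0.1]
posts = {
    "a": [0.1, 0.1, 0.8],
    "b": [0.6, 0.3, 0.1],
    "c": [0.3, 0.4, 0.3],
}
print(rank_posts(profile, posts))
```

The same scoring could in principle serve the second task by comparing one user's profile vector against the aggregated vectors of candidate users to follow.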

Selected References

 * Blei, D., Ng, A., & Jordan, M. (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research.
 * Ramage, D., Hall, D., Nallapati, R., & Manning, C. D. (2009). Labeled LDA: A supervised topic model for credit attribution in multi-label corpora. EMNLP 2009.
 * relevance=Labeled LDA, the technique used, could be useful for other studies, and the authors' notion of what data to provide as labels is interesting. The "4S" dimensions could be validated by further studies.

See a summary and discussion of other papers using LDA on Twitter }}
 * journal=Proceedings of the Fourth International AAAI Conference on Weblogs and Social Media
 * pub_date=2010
 * subject=Computer Science
 * pub_open_access=yes