Characterizing microblogs with topic models
Citation: Daniel Ramage, Susan Dumais, Dan Liebling (2010) Characterizing microblogs with topic models. Proceedings of the Fourth International AAAI Conference on Weblogs and Social Media (ICWSM)
Motivated by a model of user behavior derived from interviews and user surveys (at Microsoft), this paper argues that better models of tweets would help with two major problems Twitter users have: finding new users and topics to follow, and filtering out "noise" in feeds.
To model tweets, the paper uses machine learning techniques. The labels used as training signal are hashtags, emoticons, @user mentions, replies, and questions, and the model is Labeled LDA (Ramage et al., 2009), an extension of Latent Dirichlet Allocation (Blei et al., 2003).
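In Labeled LDA, each document may only draw on the topics that correspond to its own labels, so the first practical step is turning each tweet into a label set. The following is a minimal sketch of that step, not the authors' code: the function name `tweet_labels`, the emoticon table, the regular expressions, and the label strings are all assumptions made for illustration.

```python
import re

# Hypothetical emoticon table; the paper's actual emoticon set is not
# given in this summary.
EMOTICONS = {":)": "smile", ":(": "frown", ";)": "wink"}

HASHTAG_RE = re.compile(r"#(\w+)")
MENTION_RE = re.compile(r"@(\w+)")


def tweet_labels(text):
    """Return the set of labels observed in a single tweet (illustrative)."""
    labels = set()

    # Each hashtag becomes its own label, lower-cased.
    for tag in HASHTAG_RE.findall(text):
        labels.add("#" + tag.lower())

    # One label per emoticon type found in the text.
    for symbol, name in EMOTICONS.items():
        if symbol in text:
            labels.add("emoticon:" + name)

    # Assumption: a tweet that starts with @user is a reply; an @user
    # anywhere else is a plain mention.
    if text.lstrip().startswith("@"):
        labels.add("reply")
    elif MENTION_RE.search(text):
        labels.add("@user")

    # Assumption: any question mark marks the tweet as a question.
    if "?" in text:
        labels.add("question")

    return labels


print(tweet_labels("@bob are you going to #ICWSM? :)"))
```

A Labeled LDA implementation would then take these label sets alongside the tweet text and restrict each tweet's topic assignments to the topics for its labels.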
The data set was 8,214,019 Twitter posts from one week in November 2009.
After the model was run, its terms (the "200 latent dimensions") were manually labelled by four raters according to the "4S" dimensions: substance, status, style, and social.
These dimensions first arose in the user interviews. "At the word level, Twitter is 11% substance, 5% status, 16% style, 10% social, and 56% other."
The resulting model is used for two tasks:
- ranking posts from a user's current feed
- recommending new users to follow
which are tested with users at Microsoft.
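This summary does not say how posts are scored for ranking. One plausible sketch, assuming each post and each user has a topic-distribution vector from the model and that posts are ordered by cosine similarity to the user's profile, is shown below. The function names, the similarity measure, and the toy vectors are assumptions for illustration, not the paper's stated method.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_posts(user_profile, post_topics):
    """Return post ids ordered from most to least similar to the user.

    user_profile: topic vector for the user (hypothetical input).
    post_topics:  dict mapping post id -> topic vector for that post.
    """
    return sorted(
        post_topics,
        key=lambda post_id: cosine(user_profile, post_topics[post_id]),
        reverse=True,
    )


# Toy three-topic example with made-up numbers.
profile = [0.7, 0.2, 0.1]
posts = {
    "a": [0.1, 0.1, 0.8],
    "b": [0.6, 0.3, 0.1],
    "c": [0.3, 0.4, 0.3],
}
print(rank_posts(profile, posts))  # ['b', 'c', 'a']
```

The same scoring could in principle serve the second task by comparing a user's profile with the aggregated topic vectors of candidate accounts, again as an assumption rather than a description of the paper.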
- Blei, D., Ng, A., & Jordan, M. (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research.
- Ramage, D., Hall, D., Nallapati, R., & Manning, C. D. (2009). Labeled LDA: A supervised topic model for credit attribution in multi-label corpora. EMNLP 2009.
Theoretical and practical relevance:
Labeled LDA, the technique used, could be useful for other studies, and the authors' choice of which kinds of labels to supply to the model is interesting. The "4S" dimensions could be validated by further studies.
This was published in an open access journal.