Characterizing microblogs with topic models

Citation: Daniel Ramage, Susan Dumais, Dan Liebling (2010) Characterizing microblogs with topic models. Proceedings of the Fourth International AAAI Conference on Weblogs and Social Media (ICWSM)
Internet Archive Scholar (search for fulltext): Characterizing microblogs with topic models
Download: http://aaai.org/ocs/index.php/ICWSM/ICWSM10/paper/view/1528/1846
Tagged: Computer Science, Twitter, microblogging, streams, filtering, machine learning

Summary

Motivated by a model of user behavior from interviews and user surveys (at Microsoft), this paper argues that better models of tweets would be useful for two major problems Twitter users have: finding new users and topics to follow, and filtering out "noise" in feeds.

To model tweets, the paper uses machine learning techniques. The labels used for training consist of hashtags, emoticons, @user mentions, replies, and questions; the model is Labeled LDA, an extension of Latent Dirichlet Allocation (2003).
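
As a rough illustration of the kind of per-tweet labels described above, the sketch below derives a label set from raw tweet text. The function name `extract_labels`, the regular expressions, and the short emoticon list are assumptions made for this example; they are not taken from the paper, whose exact labeling rules are not given in this summary.

```python
import re

# Hypothetical emoticon list; the paper's actual set is not given in this summary.
EMOTICONS = (":)", ":(", ";)", ":D", ":P")

def extract_labels(tweet: str) -> set:
    """Return an illustrative set of labels for one tweet."""
    labels = set()
    # Each hashtag becomes its own label, lowercased.
    labels.update(tag.lower() for tag in re.findall(r"#\w+", tweet))
    # Each emoticon found in the text becomes a label.
    labels.update(e for e in EMOTICONS if e in tweet)
    # A tweet that starts with @user is treated as a reply;
    # a mention elsewhere in the text gets the generic "@user" label.
    if re.match(r"\s*@\w+", tweet):
        labels.add("reply")
    elif re.search(r"@\w+", tweet):
        labels.add("@user")
    # A question mark anywhere marks the tweet as a question.
    if "?" in tweet:
        labels.add("question")
    return labels
```

A model such as Labeled LDA would then be trained with each tweet restricted to the topics named by its labels (typically alongside some unlabeled latent topics); that training step is outside the scope of this sketch.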

The data consisted of 8,214,019 Twitter posts from one week in November 2009.

Following the run, the learned terms ("200 latent dimensions") were manually labelled by four raters with one of the "4S" dimensions, or as "other":

  1. substance
  2. social
  3. status
  4. style
  5. other

These dimensions first arose in the user interviews. "At the word level, Twitter is 11% substance, 5% status, 16% style, 10% social, and 56% other."
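
One plausible reading of the word-level percentages is that every word token is assigned to a latent dimension, every dimension carries a rater-assigned category, and a category's share is the fraction of tokens falling in its dimensions. The sketch below computes shares under that assumption; the function `category_shares` and the toy inputs are hypothetical and do not reproduce the paper's data or exact procedure.

```python
from collections import Counter

def category_shares(token_topics, topic_category):
    """Fraction of word tokens per category.

    token_topics   -- one topic id per word token in the corpus
    topic_category -- maps a topic id to its rater-assigned category;
                      topics missing from the map count as "other"
    """
    counts = Counter(topic_category.get(t, "other") for t in token_topics)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {category: n / total for category, n in counts.items()}

# Toy example: 8 tokens spread over 4 topics, 3 of which have a category.
shares = category_shares(
    [0, 0, 1, 2, 3, 3, 3, 3],
    {0: "substance", 1: "status", 2: "style"},
)
```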

The data is used for two tasks:

  1. ranking posts from a user's current feed
  2. recommending new users to follow

which are tested with users at Microsoft.
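
To make the first task concrete, here is a minimal sketch of ranking a feed by topical similarity. It assumes each post and each user profile has already been reduced to a vector of topic weights, and it uses cosine similarity as the scoring function. Both the helper names (`cosine`, `rank_posts`) and the choice of cosine similarity are assumptions for illustration, not a description of the paper's exact scoring method.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank_posts(user_vec, posts):
    """Sort (post_id, topic_vector) pairs by similarity to the user's vector, best first."""
    return sorted(posts, key=lambda p: cosine(user_vec, p[1]), reverse=True)

# Toy example with three topics.
feed = [("a", [0, 1, 0]), ("b", [1, 0, 0]), ("c", [1, 1, 0])]
ranked_ids = [post_id for post_id, _ in rank_posts([1, 0, 0], feed)]
```

The same scoring could in principle be applied to the second task by comparing a user's vector against vectors built from other users' posts.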

Theoretical and Practical Relevance

Labeled LDA, the technique used, could be useful for other studies, and the authors' choice of which labels to supply to the model is interesting. The "4S" dimensions could be validated by further studies.

See a summary and discussion of other papers using LDA on Twitter

This was published in an open access journal.