Unsupervised modeling of Twitter conversations

From AcaWiki

Citation: A. Ritter, C. Cherry, B. Dolan (2010) Unsupervised modeling of Twitter conversations. HLT-NAACL
Download: http://www.cs.washington.edu/homes/aritter/twitter chat.pdf
Tagged: Computer Science, Twitter, LDA, dialogue acts, speech act theory

Summary

This paper models dialogue acts in Twitter conversations and presents a corpus of 1.3 million conversations. The authors provide a state transition diagram showing the likelihood of transitions between dialogue acts.

Figure: Transitions between dialogue acts in Twitter conversations.
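
As a rough illustration of what the transition diagram summarizes, the sketch below estimates the probability of moving from one dialogue act to the next by counting adjacent pairs. It assumes each post already carries an act label, whereas the paper infers acts without supervision; the label names, the `<start>`/`<end>` markers, and the function name are hypothetical.

```python
from collections import Counter, defaultdict

START, END = "<start>", "<end>"  # hypothetical boundary markers


def transition_probabilities(sequences):
    """Estimate P(next_act | current_act) from sequences of act labels."""
    counts = defaultdict(Counter)
    for acts in sequences:
        path = [START, *acts, END]
        # Count every adjacent (current, next) pair in the conversation.
        for current, nxt in zip(path, path[1:]):
            counts[current][nxt] += 1
    # Normalize each row of counts into a probability distribution.
    return {
        current: {nxt: n / sum(following.values()) for nxt, n in following.items()}
        for current, following in counts.items()
    }


# Made-up act labels, for illustration only (not the paper's inferred acts).
example_sequences = [
    ["status", "question", "answer"],
    ["status", "comment"],
    ["question", "answer", "comment"],
]
probs = transition_probabilities(example_sequences)
```

Each row of `probs` sums to 1, so it can be read as the set of outgoing edge weights for one node of such a diagram.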

Methodology

The authors apply unsupervised, LDA-style modelling to Twitter conversations and evaluate the models on held-out test conversations. Their conversation+topic model assigns each word in a post to one of three sources: the topic of the conversation, the dialogue act, or something else. The models are trained on 10,000 conversations randomly sampled from the corpus, each 3 to 6 posts long.
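
The word-source idea can be sketched as a toy generative process: for each word position, first pick a source (dialogue act, conversation topic, or general), then pick a word from that source's distribution. This is a simplified illustration, not the authors' model or inference procedure; the function name, vocabularies, and weights are all made up.

```python
import random


def generate_post(act_words, topic_words, general_words, source_weights, length, rng):
    """Generate (word, source) pairs by choosing a source, then a word, per position."""
    sources = {"act": act_words, "topic": topic_words, "general": general_words}
    names = list(sources)
    weights = [source_weights[name] for name in names]
    post = []
    for _ in range(length):
        # Step 1: decide which source explains this word.
        source = rng.choices(names, weights=weights, k=1)[0]
        # Step 2: draw the word from that source's word distribution.
        vocab = sources[source]
        word = rng.choices(list(vocab), weights=list(vocab.values()), k=1)[0]
        post.append((word, source))
    return post


# Hypothetical distributions for a "question" act in a conversation about music.
act_words = {"what": 0.4, "why": 0.3, "?": 0.3}
topic_words = {"album": 0.5, "concert": 0.5}
general_words = {"the": 0.6, "lol": 0.4}
source_weights = {"act": 0.4, "topic": 0.3, "general": 0.3}

post = generate_post(act_words, topic_words, general_words,
                     source_weights, length=8, rng=random.Random(0))
```

Fitting such a model runs in the opposite direction: given only the observed words, it has to recover which source most plausibly produced each one.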

Corpus

The corpus contains 1.3 million conversations, each with between 2 and 243 posts. In summer 2009, the authors selected a random sample of Twitter users by gathering 20 randomly selected posts per minute, then queried Twitter to retrieve all posts by those users. They followed any replies to collect full conversations, and removed non-English conversations and non-reply posts.
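
The reply-following step can be sketched as walking back along reply links from the last post of each chain. The record layout (`id`, `in_reply_to`) and the function name are assumptions for illustration, not Twitter's API or the authors' code, and language filtering is left out.

```python
def build_conversations(posts, min_posts=2):
    """Group posts into reply chains, returned as lists of post ids in time order."""
    by_id = {post["id"]: post for post in posts}
    replied_to = {post["in_reply_to"] for post in posts
                  if post["in_reply_to"] is not None}
    conversations = []
    for post in posts:
        if post["id"] in replied_to:
            continue  # another post replies to this one, so it is not a chain end
        chain, seen = [], set()
        current = post
        # Walk backwards through reply links until the root or a missing parent.
        while current is not None and current["id"] not in seen:
            seen.add(current["id"])
            chain.append(current["id"])
            current = by_id.get(current["in_reply_to"])
        chain.reverse()
        # Chains shorter than min_posts are standalone (non-reply) posts.
        if len(chain) >= min_posts:
            conversations.append(chain)
    return conversations


# Hypothetical posts: 1 <- 2 <- 3 form a conversation; 4 stands alone;
# 5 replies to a post that was never collected.
example_posts = [
    {"id": 1, "in_reply_to": None},
    {"id": 2, "in_reply_to": 1},
    {"id": 3, "in_reply_to": 2},
    {"id": 4, "in_reply_to": None},
    {"id": 5, "in_reply_to": 99},
]
conversations = build_conversations(example_posts)
```

Here post 4 is dropped as a non-reply post, and post 5 is dropped because its parent is missing, leaving a chain of length one.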

See also

  • Jcluster word clustering algorithm (Joshua T. Goodman. 2001. A bit of progress in language modeling. Technical report.)