Leave a reply: An analysis of weblog comments

{{Summary
 * title=Leave a reply: An analysis of weblog comments
 * authors=Gilad Mishne and Natalie Glance
 * url=http://www.blogpulse.com/www2006-workshop/papers/wwe2006-blogcomments.pdf
 * tags=blogging, comments, argumentation, replies, online argumentation
 * summary=The authors study blog comments and the argumentative nature of (some of) those comments. First they build a corpus of about 645,000 comments from all active blogs in the Blogpulse index; since comment information is generally unsyndicated, they run an extraction process to recover the comments from the post pages.

The extraction process algorithmically identifies the comment region, finds dates (which are typical of comments) within that region, and expands each date to a complete comment by analyzing the surrounding text.
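The extraction heuristic could be sketched as follows. This is an illustrative assumption, not the authors' actual implementation: the "comments" marker, the date pattern, and the function name are all ours.

```python
import re

# Date pattern of the "Month D, YYYY" form often attached to comments
# (an illustrative assumption; real blog templates vary widely).
DATE_RE = re.compile(
    r"\b(?:January|February|March|April|May|June|July|August|"
    r"September|October|November|December)\s+\d{1,2},\s+\d{4}\b"
)

def extract_comments(page_text: str) -> list[str]:
    """Return candidate comment bodies from a blog-post page."""
    # 1. Identify the comment region: everything after a "comment(s)" marker.
    marker = re.search(r"\bcomments?\b", page_text, re.IGNORECASE)
    region = page_text[marker.end():] if marker else page_text

    # 2. Find dates, which typically anchor individual comments.
    dates = list(DATE_RE.finditer(region))

    # 3. Expand each date to a full comment: the text up to the next date.
    comments = []
    for i, d in enumerate(dates):
        end = dates[i + 1].start() if i + 1 < len(dates) else len(region)
        body = region[d.end():end].strip()
        if body:
            comments.append(body)
    return comments
```

A date-anchored approach like this explains the evaluation results below: counting dates (and hence comments) is more robust than delimiting each comment's text exactly.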

They manually evaluate the coverage of this technique on a set of 500 randomly selected posts. It is successful on posts with no comments (only 3% false positives) but only 65% correct on posts with comments. However, they note that retrieving the number of comments, rather than their content, is easier: success on that task was 70%. Multiple languages made the task harder; on English pages, they estimate coverage at 80%.

15% of posts and 28% of blogs had comments; commenting was disabled in fewer than 20% of weblogs in their sample.

They "estimate that the number of blog comments in the entire blogosphere is comparable to the number of posts in active, non-spam blogs", i.e. 15-30% of the (numerical) size of the blogosphere. Because individual comments are relatively short, they estimate that comment text amounts to 10-20% of the blogosphere's text volume.

They use comments to augment the post and find a "notable contribution" of comment content to overall recall. They note that sometimes posts are "almost-empty", with just a link and a short remark (e.g. 'unbelievable'). In that case, comments provide more content, along with keywords that give context and allow retrieval.

They compare comment volume with other popularity and influence measures, using the Blogpulse index as a source for incoming-link counts and Sitemeter for page views. They had both pieces of information for 8,824 blogs; their full corpus contained 36,044 blogs, 10,132 of them with comments. They conclude that "the existence of many comments" indicates the popularity of a post, and then discuss outliers. Some high-ranked blogs moderate or disable comments, leading to "too few" comments. Conversely, some low-ranked blogs have "too many" comments: these tend to be chat-oriented comments on personal journals, which may not relate closely to the posts, or comments on non-tech blogs, which (judging by the difference in readership) attract fewer incoming links than tech blogs.

Posts that are heavily commented relative to the median for their blog tend to cover highly controversial topics or to have received a high level of traffic (e.g. from mainstream media).

Disputative comments
In contrast to "thanks" comments or personally oriented comments (posted by friends), the authors distinguish disputative comments, which disagree with the blogger or with other commenters, "forming an online debate".

About 21% of the comment threads were tagged disputative by a classifier; they are largely about politics.

Classifier
They built a classifier to detect disagreement in comments, based on:
 * frequency counts of words, bigrams, and a manually built lexicon of disputative phrases ("I don't think that", "you are wrong", etc.)
 * level of subjectivity ("I believe that", "In my opinion", "I don't", "you have to" -- which appear more often in disputative comments)
 * They built this lexicon in an interesting way: comparing Wikipedia entries with their Talk pages using a log-likelihood corpus divergence metric (see Comparing corpora)
 * length (average sentence length, average comment length in the thread, number of comments in the thread) -- disputative comments tend to be longer and appear in longer threads
 * punctuation (frequency counts of punctuation symbols, features for excessive punctuation -- see A Bayesian approach to filtering junk e-mail)
 * polarity
 * referral -- a quote from the post, a quote from another comment, references to other authors' names, increased use of the second person (existence as well as location in the post)
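The Wikipedia-based lexicon construction above rests on a corpus-comparison score. A minimal sketch of a log-likelihood keyness metric in the spirit of Kilgarriff's Comparing corpora; the function names, tokenization, and toy corpora are ours, not the paper's:

```python
import math
from collections import Counter

def log_likelihood(a: int, b: int, n1: int, n2: int) -> float:
    """Log-likelihood keyness of a word occurring a times in corpus 1
    (n1 tokens total) and b times in corpus 2 (n2 tokens total)."""
    e1 = n1 * (a + b) / (n1 + n2)   # expected count in corpus 1
    e2 = n2 * (a + b) / (n1 + n2)   # expected count in corpus 2
    ll = 0.0
    if a:                            # zero counts contribute nothing
        ll += a * math.log(a / e1)
    if b:
        ll += b * math.log(b / e2)
    return 2 * ll

def divergent_words(corpus1: list[str], corpus2: list[str], top: int = 10):
    """Rank words by how strongly their frequency differs across corpora."""
    c1, c2 = Counter(corpus1), Counter(corpus2)
    n1, n2 = sum(c1.values()), sum(c2.values())
    scores = {w: log_likelihood(c1[w], c2[w], n1, n2)
              for w in set(c1) | set(c2)}
    return sorted(scores, key=scores.get, reverse=True)[:top]
```

Applied to Wikipedia entries versus their Talk pages, the top-ranked words of the Talk side would surface discussion-flavored phrases, which is presumably how the disputative lexicon was seeded.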

Most important classifier features

 * A referral quote in the first part of the comment
 * Use of a question mark early in the comment
 * Disputative phrases
 * Number of comments in the thread
 * Polarity of the first sentence of the comment
 * Level of subjectivity
 * Important words and bigrams: pronouns, negating words ("not", "but")
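These features could be assembled into a vector for any off-the-shelf classifier. A hypothetical sketch: the phrase lexicon, the "first third = early" threshold, and the quote markers are illustrative assumptions, not the authors' exact choices.

```python
import re

# Toy stand-in for the manually built disputative lexicon (illustrative).
DISPUTATIVE_PHRASES = ("i don't think", "you are wrong", "i disagree")

def features(comment: str, thread_len: int) -> dict[str, float]:
    """Map one comment (plus its thread length) to classifier features."""
    text = comment.lower()
    # "Early" is approximated here as the first third of the comment.
    first_part = text[: max(1, len(text) // 3)]
    return {
        # Referral quote near the start ('"' or '>' as quote markers).
        "quote_early": float('"' in first_part or ">" in first_part),
        "question_early": float("?" in first_part),
        "disputative_phrases": float(sum(p in text
                                         for p in DISPUTATIVE_PHRASES)),
        "thread_len": float(thread_len),
        "second_person": float(len(re.findall(r"\byou\b", text))),
        "negations": float(len(re.findall(r"\b(?:not|but)\b", text))),
    }
```

Each returned dict would feed a standard learner; the feature names mirror the list above.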

Selected References

 * S. C. Herring, L. A. Scheidt, S. Bonus, and E. Wright. Bridging the gap: A genre analysis of weblogs. In The 37th Annual Hawaii International Conference on System Sciences (HICSS’04), 2004.
 * A. Kilgarriff. Comparing corpora. International Journal of Corpus Linguistics, 6(1):1–37, 2001.
 * A. de Moor and L. Efimova. An argumentation analysis of weblog conversations. In The 9th International Working Conference on the Language-Action Perspective on Communication Modelling (LAP 2004), 2004.
 * M. Sahami, S. Dumais, D. Heckerman, and E. Horvitz. A Bayesian approach to filtering junk e-mail. In Learning for Text Categorization: Papers from the 1998 Workshop, Madison, Wisconsin, 1998. AAAI Technical Report WS-98-05.
 * relevance=http://asist.typepad.com/sig_bwp/2006/04/leave_a_reply_a.html

Comments make up about 30% of the volume of blog posts. }}
 * journal=Third Annual Workshop on the Weblogging Ecosystem, WWW 2006
 * pub_date=2006
 * subject=Computer Science