What Wikipedia deletes: Characterizing dangerous collaborative content

Citation: Andrew G. West and Insup Lee (2011) What Wikipedia deletes: Characterizing dangerous collaborative content. Proceedings of the International Symposium on Wikis and Open Collaboration

Download: https://repository.upenn.edu/cis_papers/478/

Tagged: Computer Science, Wikipedia, user-generated content, deletion, redaction, information security, revision deletion


This paper analyzes one year of revision deletions on Wikipedia. It reviews the feature's history (it was enabled for oversighters in January 2009 and for administrators in May 2010) and describes what can be redacted (a revision's content, username, and/or summary). Revision deletions are publicly logged (usually with an indication of the criterion under which the content was deleted), and the redacted material remains viewable by those who can perform revision deletion. A stronger form (suppression, or oversight) is limited to oversight users and is not publicly logged.

Data used

  • Public deletion logs from January 2010-January 2011
  • About 50,000 redactions; these are resolved into 18,907 incidents (which may contain more than one revision)
  • Textual content of Wikipedia revisions from August 2010 (about 4 million edits)
  • Per-article view statistics for 2010
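
The summary does not say how redactions were resolved into incidents. As a minimal sketch, assuming an incident is a run of redacted revisions that sit next to each other in one page's history (a hypothetical rule, not confirmed as the authors' method), the grouping could look like this. The `(page_id, rev_index)` layout is also an assumption made for illustration.

```python
from itertools import groupby


def group_into_incidents(redactions):
    """Group redacted revisions into incidents.

    redactions: iterable of (page_id, rev_index) pairs, where rev_index is the
    revision's position in that page's history (hypothetical layout).
    Assumed rule: adjacent redacted revisions on one page form one incident.
    """
    incidents = []
    ordered = sorted(set(redactions))  # drop duplicates, order by page then index
    for page_id, entries in groupby(ordered, key=lambda r: r[0]):
        run = []
        for _, idx in entries:
            if run and idx != run[-1] + 1:
                # Gap in the history: close the current incident.
                incidents.append((page_id, run))
                run = []
            run.append(idx)
        incidents.append((page_id, run))
    return incidents


# Made-up log entries: four redactions that collapse into three incidents.
sample = [(1, 5), (1, 6), (2, 3), (1, 9)]
print(group_into_incidents(sample))
```

Under this rule the number of incidents is always at most the number of redactions, which matches the direction of the reported figures (about 50,000 redactions, 18,907 incidents).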

They use the final state of redactions, noting that this tends to include more information/fewer deletions than intermediate states.



Username deletion is rare. Content is deleted in 75% of incidents. The summary is deleted in 25% of incidents.

Reasons for Redaction

There are 6 reasons for redaction:

  1. Blatant copyright violations (RD1)
  2. Grossly insulting/offensive (RD2)
  3. Purely disruptive material (RD3)
  4. Revision pending suppression (RD4)
  5. Other valid deletion (RD5)
  6. Non-contentious housekeeping (RD6)

They focus on the first three of these, viewing them as "dangerous content".


Only 42% of the incidents redacted in 2010 occurred in 2010; the remainder reflect use of the tool on a backlog of older revisions. Even so, at least 0.05% of revisions made in 2010 contained dangerous content that was later redacted.

Response Time

The median detection period was 1.5 years, reflecting the tool's use on a backlog of older revisions. After May 2010, however, the median interval was 2 hours (but 21.6 days for copyright incidents).

Further, they define the "active duration" as the amount of time during which the damaging revision was the default (most recent) version of the page. The median active duration is 2 minutes for all incidents but 21 days for copyright incidents.
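
To make the "active duration" definition concrete, here is a minimal sketch that measures how long a damaging revision stayed the most recent version of a page. The page-history layout, the function name, and the sample timestamps are illustrative assumptions, not taken from the paper.

```python
from datetime import datetime, timedelta
from statistics import median


def active_duration(history, bad_ids):
    """Total time during which a redacted revision was the page's current version.

    history: list of (rev_id, timestamp) pairs, oldest first (hypothetical layout).
    bad_ids: set of rev_ids belonging to one incident.
    A bad revision that is still the latest one has no end time and is skipped.
    """
    total = timedelta(0)
    for (rev_id, start), (_, end) in zip(history, history[1:]):
        if rev_id in bad_ids:
            total += end - start
    return total


# Made-up page history: revision 2 is the damaging edit, replaced two minutes later.
history = [
    (1, datetime(2010, 8, 1, 12, 0)),
    (2, datetime(2010, 8, 1, 12, 5)),
    (3, datetime(2010, 8, 1, 12, 7)),
]
duration = active_duration(history, {2})
print(duration)  # 0:02:00

# Median over several made-up incidents, in seconds.
durations = [duration, timedelta(minutes=1), timedelta(days=21)]
print(median(d.total_seconds() for d in durations))  # 120.0
```

The median is used rather than the mean because a few long-lived incidents (such as copyright violations) would otherwise dominate the result.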

Public Exposure to Redacted Content

The median case receives about 1.25 views before it is deleted. However, copyright incidents, which are harder to detect, have a median of 36 views and an average of 12.5 revisions.

Overall, only 0.007% of Wikipedia page views (11 views/minute) are of content later deleted in this fashion.
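
As a rough cross-check of the two figures above, the share and the per-minute rate together imply an overall traffic level. The sketch below also shows one plausible way to estimate exposure from per-article view statistics (view rate multiplied by active duration); that estimator is an assumption for illustration, not confirmed as the authors' method, and both function names are hypothetical.

```python
def implied_total_views_per_minute(redacted_views_per_minute, redacted_share_percent):
    """Total page views per minute implied by a rate and its percentage share."""
    return redacted_views_per_minute / (redacted_share_percent / 100)


def estimated_views(active_minutes, article_views_per_minute):
    """Assumed estimator: views of a revision while it was the current version."""
    return active_minutes * article_views_per_minute


# Figures from the summary: 11 views/minute is 0.007% of all page views.
total = implied_total_views_per_minute(11, 0.007)
print(round(total))  # roughly 157,000 views per minute

# Made-up example: a revision active for 2 minutes on a page with 0.5 views/minute.
print(estimated_views(2, 0.5))  # 1.0
```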

Oversighted content

Oversighted content was not analyzed as heavily, because suppression actions are not publicly logged. Content and usernames were the most commonly suppressed fields, while summaries were very rarely suppressed. Based on manual inspection of their August 2010 archives, the most common reason for suppression is the publication of individuals' addresses and phone numbers.

What is deleted

  • "content exhibiting the characteristics of libel, copyright infringement, and privacy violations"

Manual inspection of their August 2010 archives showed that offensive (RD2) content was almost all directed at individual people.

Theoretical and practical relevance:

This paper provides further evidence on the effects of severe vandalism on Wikipedia. It may also be of interest to moderators of user-generated content.
