Crowdsourcing user studies with Mechanical Turk

{{Summary
 * title=Crowdsourcing user studies with Mechanical Turk
 * authors=Aniket Kittur, Ed H Chi, Bongwon Suh
 * url=http://dx.doi.org/10.1145/1357054.1357127
 * tags=Mechanical Turk, user studies, usability testing, Wikipedia, micro-task markets, microtask markets
 * summary=This paper gives advice for using micro-task markets for user studies, to get quick (and yet reliable) feedback from users. The way a task is defined makes a significant difference in the results, and good design can reduce the number of users "gaming the system". They conclude that micro-task markets may be useful for user studies that combine objective and subjective information gathering, and provide specific advice (below).

This paper defines a "micro-task market": a shared system into which short tasks (taking seconds or minutes) are entered, and from which users select and complete them for some reward (generally money or reputation). The advantages of micro-task markets for user studies are a global, diverse user pool, very quick turnaround (responses within 24-48 hours), and low cost (e.g. 5 cents per rating). The disadvantages are the lack of demographic information, the lack of verifiable credentials, and limited experimenter contact.

The paper compares two user studies, run for the same purpose, on Amazon's Mechanical Turk.

Experiments
The purpose of both experiments was to collect quality ratings of 14 Wikipedia articles. These were compared against expert ratings by Wikipedia administrators, collected in an earlier study ("He says, she says: Conflict and coordination in Wikipedia"). In that study, admins rated articles on a 7-point Likert scale on several factors:
 * well-written
 * factually accurate
 * neutral
 * well structured
 * overall high quality

Experiment 1

 * In addition to rating the factors above, users were required to fill out a free-form text box describing what improvements the article needed.
 * Pay was 5 cents per task (a sketch of how such a task might be posted today follows this list).
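
For concreteness, here is a minimal sketch of how a 5-cent rating task like this could be posted today with the boto3 MTurk client -- an API that did not exist in 2008 and is not what the authors used. The endpoint, reward, assignment counts, and form text are all illustrative assumptions.

```python
import boto3

# Sketch only: today's boto3 MTurk API, not the 2008-era interface the authors
# used. All titles, durations, and counts below are illustrative assumptions.
mturk = boto3.client(
    "mturk",
    region_name="us-east-1",
    # Sandbox endpoint so test HITs cost nothing; remove for the live marketplace.
    endpoint_url="https://mturk-requester-sandbox.us-east-1.amazonaws.com",
)

# HTMLQuestion wrapper. A real task would also need a form that posts the
# answers back to MTurk's external-submit URL; omitted here for brevity.
question_xml = """\
<HTMLQuestion xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2011-11-11/HTMLQuestion.xsd">
  <HTMLContent><![CDATA[
    <p>Read the Wikipedia article and rate it on a 7-point scale for:
       well written, factually accurate, neutral, well structured,
       and overall high quality.</p>
    <p>Then describe what improvements the article needs.</p>
  ]]></HTMLContent>
  <FrameHeight>600</FrameHeight>
</HTMLQuestion>
"""

hit = mturk.create_hit(
    Title="Rate the quality of a Wikipedia article",
    Description="Rate one article on five 7-point scales and suggest improvements.",
    Keywords="wikipedia, rating, article quality",
    Reward="0.05",                     # 5 cents per task, as in Experiment 1
    MaxAssignments=15,                 # ratings wanted per article (illustrative)
    LifetimeInSeconds=48 * 3600,       # keep the task available for 48 hours
    AssignmentDurationInSeconds=30 * 60,
    Question=question_xml,
)
print(hit["HIT"]["HITId"])
```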

Results

 * 58 users provided 210 ratings for 14 articles
 * 93 ratings were received in the first 24 hours after the task was posted; the rest came within the next 24 hours. Some tasks were completed within minutes of entry.
 * Correlation with expert Wikipedia admin ratings was marginal
 * A small number of users "gamed" the system
 * 64 ratings were completed in less than 1 minute (not even long enough to read the article)
 * 48.6% of the free-text responses were uninformative
 * 58.6% of ratings were flagged as potentially invalid, based on duration or comments -- however, only 8 users gave 5 or more flagged responses, and these users accounted for 73% of the flagged responses

Experiment 2

 * Redesigned the experiment to make it easier to complete the task honestly than to "game" it
 * First, 4 questions with verifiable, quantitative answers were added before users rated the quality of the article:
 * 1) how many references the article has
 * 2) how many images
 * 3) how many sections
 * 4) 4-6 keywords summarizing the article's contents
 * Second, a concrete description of "overall quality of the article" was given:
 * "By quality we mean that it is well written, factually comprehensive and accurate, fair and without bias, well-structured and organized, etc."
 * Third, in the free-text field users were asked to explain their decision (a sketch of checking the verifiable answers follows this list).
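
As an illustration of what "verifiable, quantitative answers" buys you, here is a minimal sketch of cross-checking a worker's counts against the actual article. Field names, the tolerance, and the example values are assumptions, not details from the paper.

```python
# Minimal sketch of checking the verifiable questions from Experiment 2.
# Ground-truth counts would come from parsing the article itself; here they
# are supplied directly. Field names and the +/-2 tolerance are assumptions.

def answers_check_out(answer, truth, tolerance=2):
    """Return True if the worker's counts are close to the article's actual counts."""
    for field in ("references", "images", "sections"):
        if abs(answer[field] - truth[field]) > tolerance:
            return False
    # Keywords are harder to verify automatically; at least require that the
    # worker supplied the requested 4-6 of them.
    return 4 <= len(answer["keywords"]) <= 6

worker_answer = {
    "references": 12, "images": 3, "sections": 7,
    "keywords": ["history", "governance", "economy", "geography"],
    "quality_rating": 5,   # 7-point rating of overall quality
    "explanation": "Well structured, but later sections lack citations.",
}
article_truth = {"references": 11, "images": 3, "sections": 7}

print(answers_check_out(worker_answer, article_truth))  # True
```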

Results

 * 124 users provided 277 ratings for 14 articles
 * Fewer ratings per user, and ratings were more evenly distributed across users
 * Higher correlation with Wikipedia admin ratings -- statistically significant (see the correlation sketch after this list)
 * Only 7 responses flagged for content
 * Only 18 responses were completed in less than 1 minute
 * Median completion time was 4:06, versus 1:30 in Experiment 1
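
A minimal sketch of one common way to compute such a per-article correlation (assuming Pearson's r; the notes above do not specify the statistic), with made-up numbers rather than the paper's data:

```python
from statistics import mean
from scipy.stats import pearsonr

# Made-up numbers for illustration only; the paper's data is not reproduced here.
# Each article gets the mean of its Turker quality ratings and one admin rating.
turker_ratings = {
    "Article A": [5, 6, 4, 5],
    "Article B": [3, 2, 3],
    "Article C": [6, 7, 6, 7, 5],
    "Article D": [4, 4, 5],
}
admin_ratings = {"Article A": 5, "Article B": 2, "Article C": 7, "Article D": 4}

articles = sorted(turker_ratings)
x = [mean(turker_ratings[a]) for a in articles]
y = [admin_ratings[a] for a in articles]

r, p = pearsonr(x, y)   # in the real study this would run over the 14 articles
print(f"r = {r:.2f}, p = {p:.3f}")
```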

Cited by (selected)

 * "Crowdsourcing graphical perception: Using Mechanical Turk to assess visualization design", CHI 2010
 * relevance=Design tasks so that "creating believable invalid responses" takes as much work as completing the task in good faith.

Specifically:
 * Use explicitly verifiable questions
 * If you ask users to evaluate content:
 * Engage users with the content (e.g. ask users to generate tags)
 * Structure tasks so that they do what a good evaluator would do.
 * Have multiple ways to detect suspect responses (a small sketch of such checks follows this list). These might include:
 * extremely short task durations
 * comments that are repeated verbatim across multiple tasks
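
A minimal sketch of operationalizing those two checks, assuming each response record carries a worker id, a task duration in seconds, and the free-text comment (field names and the 60-second threshold are assumptions):

```python
from collections import Counter

def find_suspect_responses(responses, min_seconds=60):
    """Flag responses that were finished implausibly fast or whose comment is
    repeated verbatim across multiple tasks. Thresholds are illustrative."""
    comment_counts = Counter(r["comment"].strip().lower() for r in responses)
    suspects = []
    for r in responses:
        too_fast = r["duration"] < min_seconds
        repeated = comment_counts[r["comment"].strip().lower()] > 1
        if too_fast or repeated:
            suspects.append({"worker_id": r["worker_id"],
                             "too_fast": too_fast,
                             "repeated_comment": repeated})
    return suspects

# Hypothetical response records; the field names are assumptions.
responses = [
    {"worker_id": "w01", "duration": 22, "comment": "nice article"},
    {"worker_id": "w01", "duration": 25, "comment": "nice article"},
    {"worker_id": "w02", "duration": 290, "comment": "Lead section lacks citations."},
]
print(find_suspect_responses(responses))
```

Grouping the resulting flags by worker id would recover the kind of observation made in Experiment 1, where a handful of users accounted for most of the flagged responses.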

Further, verifiable questions "[signal] to users that their answers will be scrutinized": this may help reduce invalid responses and may increase the time spent on the task. }}
 * journal=Proceedings of the twenty-sixth annual SIGCHI conference on Human factors in computing systems
 * pub_date=2008
 * doi=10.1145/1357054.1357127
 * subject=Computer Science
 * pub_open_access=no