Crowdsourcing user studies with Mechanical Turk

From AcaWiki



Citation: Aniket Kittur, Ed H Chi, Bongwon Suh (2008) Crowdsourcing user studies with Mechanical Turk. Proceeding of the twenty-sixth annual SIGCHI conference on Human factors in computing systems

doi: 10.1145/1357054.1357127

Download: http://dx.doi.org/10.1145/1357054.1357127

Tagged: Computer Science, Mechanical Turk, user studies, usability testing, Wikipedia, micro-task markets, microtask markets


Summary:

This paper gives advice on using micro-task markets for user studies to get quick yet reliable feedback from users. How a task is defined makes a significant difference in the results, and good design can reduce the number of users "gaming the system". The authors conclude that micro-task markets may be useful for user studies that combine objective and subjective information gathering, and they provide specific advice (below).

The paper defines a "micro-task market" as a shared system into which short tasks (taking seconds or minutes) are entered, and from which users select and complete them for some reward (generally money or reputation). The advantages of micro-task markets for user studies are that they are global and diverse, with very quick turnaround times (responses within 24-48 hours) at inexpensive rates (e.g. 5 cents per rating). The disadvantages are the lack of demographic information, the lack of verifiable credentials, and limited experimenter contact.

The paper compares two user studies, run for the same purpose, on Amazon's Mechanical Turk.

Experiments

The purpose of both experiments was to get ratings of 14 Wikipedia articles. These ratings were compared against the expert opinions of Wikipedia administrators, which were collected in a previous experiment reported in "He says, she says: Conflict and coordination in Wikipedia". In that experiment, admins rated articles on a 7-point Likert scale with several factors:

Experiment 1

Results

Experiment 2

  1. how many references
  2. how many images
  3. how many sections
  4. 4-6 keywords to summarize the article contents

Results


Cited by (selected)

Theoretical and practical relevance:

Design tasks so that "creating believable invalid responses" takes as much work as completing the task in good faith.

Specifically:

Further, verifiable questions "[signal] to users that their answers will be scrutinized": this may help reduce invalid responses and may increase the time spent on the task.
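One way to picture this advice is a screening step that checks each worker's answers to the verifiable questions (the counts of references, images, and sections listed under Experiment 2) before trusting the subjective rating. The Python below is a minimal sketch under stated assumptions, not the authors' procedure: the `Response` fields, the `is_plausible` helper, the 25% tolerance, the 30-second minimum, and all sample data are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Response:
    worker_id: str
    rating: int           # subjective quality rating on a 1-7 scale
    n_references: int     # worker's answers to the verifiable questions
    n_images: int
    n_sections: int
    seconds_spent: float


def close_enough(answer: int, actual: int, tolerance: float) -> bool:
    # Allow an absolute slack of 1 or a relative slack, whichever is larger.
    return abs(answer - actual) <= max(1, tolerance * actual)


def is_plausible(resp: Response, truth: dict,
                 tolerance: float = 0.25, min_seconds: float = 30.0) -> bool:
    # Hypothetical thresholds: tune them for the task at hand.
    if not 1 <= resp.rating <= 7:
        return False
    if resp.seconds_spent < min_seconds:
        return False
    fields = ("n_references", "n_images", "n_sections")
    return all(close_enough(getattr(resp, f), truth[f], tolerance)
               for f in fields)


# Made-up ground truth for one article and three made-up responses.
truth = {"n_references": 40, "n_images": 5, "n_sections": 12}
responses = [
    Response("w1", 5, 38, 5, 12, 240.0),
    Response("w2", 7, 0, 0, 1, 12.0),    # too fast, counts far off
    Response("w3", 4, 45, 6, 11, 180.0),
]

kept = [r.worker_id for r in responses if is_plausible(r, truth)]
print(kept)
```

With this sample data, `w2` is screened out and the other two responses are kept. Because the counts can be checked against the article itself, a worker who wants to pass the screen has to look at the article, which is the effort the advice aims to require.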


