The Good Judgment Project: A Large Scale Test of Different Methods of Combining Expert Predictions


Citation: Lyle Ungar, Barbara Mellers, Ville Satopää, Philip Tetlock, Jon Baron (2012) The Good Judgment Project: A Large Scale Test of Different Methods of Combining Expert Predictions. 2012 AAAI Fall Symposium Series


Tagged: prediction


The study addresses three questions about how experts can best estimate event probabilities:

  • Whether forecasters should work alone, in prediction markets, or in teams. Discussion among experts might help accuracy or harm it (groupthink). Prediction markets are zero-sum and thus discourage sharing of non-price information, but they facilitate consensus in the form of a market price. Organizations form teams in the belief that a team estimate will be more accurate.
  • Whether forecasters should receive training. Even people with statistics degrees have been shown to follow incorrect heuristics.
  • What formula to use when combining individual estimates. Many studies show that a uniform average of forecasts is hard to beat.

2000 forecasters were presented with dozens of possible events and scored on how closely their estimates, averaged over all days each question was open, matched the actual outcomes. Accuracy is reported as Brier scores (the sum of squared differences between forecast and outcome).
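As a minimal sketch of this scoring rule, the Python below computes a Brier score as the sum of squared differences between a forecast and the realized outcome, then averages it over the days a question was open. The function names, the sample data, and the convention of summing over all answer options (which gives a 0 to 2 range for binary questions) are assumptions for illustration, not details confirmed by this summary.

```python
def brier_score(forecast, outcome_index):
    """Sum of squared differences between forecast probabilities and the 0/1 outcome.

    forecast: probabilities for each answer option (should sum to 1).
    outcome_index: index of the option that actually occurred.
    """
    return sum(
        (p - (1.0 if i == outcome_index else 0.0)) ** 2
        for i, p in enumerate(forecast)
    )


def mean_daily_brier(daily_forecasts, outcome_index):
    """Average the Brier score over every day the question was open."""
    scores = [brier_score(f, outcome_index) for f in daily_forecasts]
    return sum(scores) / len(scores)


# Hypothetical binary question open for three days; option 0 occurred.
daily = [[0.5, 0.5], [0.7, 0.3], [0.9, 0.1]]
score = mean_daily_brier(daily, outcome_index=0)
print(round(score, 4))
```

Under this convention a forecast of 0.5 on a binary question always scores 0.5, which is a useful baseline when reading the scores reported further down.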

Aggregation methods attempted included:

  • weighting of forecasters based on forecaster attributes
  • weighting of forecasts by recency
  • transformation of forecasts to push them away from 0.5, toward more extreme values
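The first two aggregation ideas above can be sketched as a single weighted average. This is an illustrative sketch only: the tuple layout, the `half_life_days` parameter, and the use of exponential half-life decay for recency are assumptions, and the per-forecaster weights are hypothetical rather than the attributes the authors used.

```python
def weighted_recency_average(forecasts, now, half_life_days=7.0):
    """Combine forecasts using forecaster weights and exponential recency decay.

    forecasts: list of (probability, forecaster_weight, day_made) tuples.
    now: current day number; older forecasts count for less.
    half_life_days: hypothetical parameter; a forecast this many days old
        gets half the weight of one made today.
    """
    numerator = 0.0
    denominator = 0.0
    for probability, forecaster_weight, day_made in forecasts:
        decay = 0.5 ** ((now - day_made) / half_life_days)
        weight = forecaster_weight * decay
        numerator += weight * probability
        denominator += weight
    return numerator / denominator


# Two equally weighted forecasters; the second forecast is 7 days old.
combined = weighted_recency_average([(0.8, 1.0, 10), (0.4, 1.0, 3)], now=10)
print(round(combined, 4))
```

With equal forecaster weights and no decay this reduces to the uniform average that the study uses as its baseline.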


Findings:

  • Training in probability and scenario analysis was beneficial
  • Letting forecasters see each other's forecasts and explanations was beneficial
  • Pushing forecasts away from 0.5 was the most beneficial

Regarding the last result, the authors distinguish irredeemable uncertainty (shared by the whole group) from personal uncertainty (individual ignorance). Aggregates of individual forecasts tend toward 0.5 because of personal uncertainty. Methods of accounting for this include:

  • Ask forecasters how uncertain they are and use the answers in weighting
  • Transform all individual forecasts away from 0.5 before aggregation
  • Use the median of forecasts rather than the mean
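To make the idea of pushing forecasts away from 0.5 concrete, here is a hedged sketch. The transform p^a / (p^a + (1 - p)^a) with a > 1 is one common extremizing function; this summary does not state the exact formula or exponent the authors used, so the function `extremize`, the value a = 2, and the sample forecasts are all assumptions.

```python
from statistics import mean, median


def extremize(p, a=2.0):
    """Push a probability away from 0.5 (assumed form; a > 1 extremizes,
    a = 1 leaves p unchanged)."""
    return p**a / (p**a + (1.0 - p) ** a)


forecasts = [0.6, 0.7, 0.9, 0.55]  # hypothetical individual forecasts

# Aggregate first, then transform the combined forecast.
mean_then_transform = extremize(mean(forecasts))

# Transform each individual forecast, then aggregate.
transform_then_mean = mean(extremize(p) for p in forecasts)

# The median is less affected by a few forecasters hedging toward 0.5.
median_forecast = median(forecasts)

print(round(mean_then_transform, 3), round(transform_then_mean, 3), median_forecast)
```

The transform is symmetric around 0.5, so a forecast of exactly 0.5 is left unchanged while forecasts on either side move toward 0 or 1.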

Brier scores (lower is better; values are approximate, read from a chart):

  • 0.36 pool of weaker or less involved forecasters, uniformly averaged
  • 0.34 pool of better forecasters, uniformly averaged
  • 0.26 teams
  • 0.25 prediction market
  • 0.23 teams with weighting, exponential decay, and transformation away from 0.5


Overall conclusions:

  • Working in groups greatly improves accuracy
  • Transforming weighted averages away from 0.5 improves accuracy