Don’t trust the Turk


Dan Kahan gives a bunch of reasons not to trust Mechanical Turk in psychology experiments, in particular when studying “hypotheses about cognition and political conflict over societal risks and other policy-relevant facts.”

5 Responses to Don’t trust the Turk

  1. Adam Berinsky July 10, 2013 at 11:20 pm #

    It seems there are some fair points, but I’m not sure the claims made here damn MT (or any subject pool) out of hand. I have said before — and I think Kahan would agree — that when running experiments, we want to think about how the composition of the sample affects our results; we don’t want to focus on the sample in isolation from our experiments. So if you want to study the behavior of conservatives and you have reason to think that conservatives in a given sample (say MT) react to your task differently than conservatives in the population, that is something you should be worried about. But to go the next step and dismiss a subject recruitment process out of hand for all purposes seems to me a step too far.

    I also think that the point about repeated exposure seems fair (and I could see that it might be a bigger problem now than it was when we collected our data in 2010) but this is not just a problem for MT. Student subject pools and samples collected through labs that advertise for subjects have similar problems with professionalized subjects. Again, this is not a reason to dismiss studies that use these subject pools out of hand — researchers need to be careful about how their subjects react to tasks; prior exposure (whether through another study or reading about similar studies in a psych class) is a potential threat that always needs to be considered.

    My bottom line is that MT is a subject recruitment tool with advantages (cost, speed) and disadvantages (those raised by Kahan). But all subject recruitment tools used by social scientists have advantages and disadvantages. It might be ideal to run our experiments on random samples of the U.S. population — that’s what Time Sharing Experiments in the Social Sciences aims to do — but the high monetary cost of such a recruitment strategy might not be worth it for a particular study. The key is to carefully consider how the nature of the sample might interact with a given experiment to shape the results — this must be done on a case-by-case basis. So though I am sympathetic to many of the points raised by Kahan, I would push back strongly against his conclusion that we should never trust MT samples.

    • Andrew Gelman July 11, 2013 at 12:35 am #

      Adam:

      Perhaps my title was misleading. Kahan does not say “we should never trust Mechanical Turk samples.” He says, “on the basis of this pragmatic understanding of what sample validity consists in, MT samples aren’t valid for the study of culturally or ideologically grounded forms of ‘motivated reasoning’ and like dynamics that it is reasonable to believe account for polarization over climate change, gun control, nuclear power, and other facts that admit of empirical study.” That is, he’s specifically talking about the topics he studies.

      In particular, Kahan writes:

      MT grossly underrepresents individuals who identify themselves as “conservatives.” . . . In addition to how they “identify” themselves, MT worker samples don’t behave like ones that consisted of ordinary U.S. conservatives (a point that will take on more significance when I return to their falsification of their nationality). In a 2012 Election Day survey, Richey & Taylor (2012) report that “73% of these MTurk workers voted for Obama, 15% for Romney, and 12% for ‘Other’ ” (this assumes we can believe they were eligible to vote in the U.S. & did; I’ll get to this).

      But the reason to worry about the underrepresentation of conservatives in MT samples is not simply that the samples are ideologically “unrepresentative” of the general population. If that were the only issue, one could simply oversample conservatives when doing MT studies (as I’ve seen at least some authors do).

      The problem is what the underrepresentation of conservatives implies about the selection of individuals into the MT worker “sample.” . . . since we know that conservatives by and large are reticent to join the MT workforce, we also can infer there is something different about the conservatives who do sign up from the ones who don’t. . . . Whatever it is that is deterring so many conservatives from joining the MT workforce, moreover, might well be causing them to respond differently in studies from how ordinary conservatives in the U.S. population would. . . .

      • Gabriele Paolacci July 24, 2013 at 3:34 pm #

        The title and the entire post are misleading and sort of irresponsible. Dan Kahan made very specific (and insightful) points that do not speak to whether or not we should “trust MTurk for psychology experiments.” It’s evident if one reads the post, so I won’t go through it.

        Jesse Chandler (a psychologist and, with Pam Mueller and me, coauthor of one paper that Dan Kahan largely draws upon for his arguments) eventually addressed some of those—again very specific—points: http://www.culturalcognition.net/blog/2013/7/18/a-measured-view-of-what-can-be-validly-measured-with-m-turk.html

  2. Adam Berinsky July 11, 2013 at 10:07 am #

    Andrew: I read Kahan’s post, and while he does make some specific claims about those topics, he seems to me to be making a broader statement about the utility of the MT subject pool (and furthermore, it’s not clear to me what comparison subject pool he has in mind). And I think even the statement about his area of interest is overly broad. But like I said, he raises a number of interesting and important points that all social scientists need to consider.

  3. Erik Nisbet July 11, 2013 at 12:26 pm #

    As a political psychologist/political communication scholar I have used student samples and MTurk for many studies involving culturally or ideologically motivated reasoning or related processes — and I have found both kinds of samples highly problematic and skewed compared to adult samples collected either through general-population survey experiments (i.e., RDD samples, KnowledgeNetworks) or simply from opt-in online survey panels maintained by Survey Sampling International or procured via Qualtrics. Based on my experience to date, I would never use student or MTurk samples for survey or experimental work on ideologically or culturally sensitive topics, or on political or policy issues, except for a pilot/stimulus test or for scale development. Also — I posed a question to Kahan on his blog earlier today about online panels in comparison to MTurk; below is his response.

    @Erik:

    Great questions.

    As I’ve tried to stress, my own view is that “sample validity” has to be assessed in relation to the phenomena being investigated: given what we know about *this* sample — who it comprises, how it was assembled, etc. — can we draw valid inferences from it about how the dynamics of interest operate in populations of interest “outside the lab?”

    I think MT samples don’t support valid inferences about individual differences in cognition that (a) are measured with performance measures the validity of which will be compromised by repeat exposure; or that (b) occur in members of the general population who vary in “ideology” or political party affiliation or cultural worldviews or like measures.

    But MTurk samples might be fine for other things — say, perceptual dynamics that one believes are invariant across people & are presented in experimental stimuli the “workers” could not have seen before.

    I’d think about the validity of Qualtrics-assembled panels the same way.

    I have had discussions w/ the technical staff at Qualtrics, who I found to be super knowledgeable & very straightforward in addressing issues like this. Not surprising, since Qualtrics is a professional survey firm & recognizes that the demand for its product will depend on it being in a position to control the quality of the samples it assembles.

    Based on those discussions, my understanding is that Qualtrics is essentially a broker in the sampling recruitment industry. It will arrange to supply customers w/ samples collected from whatever independent sampling firms are in a position to supply the sort of sample that the customer needs — i.e., the kind that will enable the customer to draw the sorts of inferences the customer wants to draw from collecting data.

    Qualtrics, on my understanding, doesn’t represent that they can assemble a “nationally representative” general population sample. If you need that, then they’ll tell you to go talk to YouGov or Knowledge Networks & see if you are persuaded that the recruitment and stratification procedures they use satisfy you that the samples they are able to assemble meet your needs. I’m persuaded that both of those firms, which charge a lot more than Qualtrics, will deliver a sample that is valid in that regard. Indeed, they are likely to get you something that the experience-rating that Nate Silver does suggests is better than many of the “established,” blue-chip professional survey firms (e.g., Gallup, for sure) that use “random digit dial,” a mode of sampling that, it’s well known, is suffering a validity “crisis” right now due to the wasting death spiral of “land-line” phones & the degrading response rates for that mode of surveying!

    But unless you are in the business of forecasting election results or reporting “x% of American public believes y” types of findings, then it is unlikely validity for you will require a sample that is genuinely “representative” of the general public.

    Say you are studying some phenomenon in which individual differences are critical. In that case, you will be interested in assuring only (1) that the sample has enough of the kind of subjects who differ in the way that is relevant to your study hypotheses and (2) that the recruitment methods were ones that didn’t discourage typical representatives of people like that or unduly encourage atypical ones.

    For that, my sense is that Qualtrics might well be able to deliver. They certainly get that that is what they need to supply you assurance of. Again, they broker sampling services. So they will have to arrange to get samples from firms that can supply subjects of the type you want, and recruit them in ways that would satisfy you that there is no selection bias issue, and likely use some post-collection stratification procedures too that they can convince you are valid (in talking to them, I could see they understood all of this very well & they didn’t at all try to discount the complexities involved).

    BTW, they don’t use MT to construct samples! They laughed out loud when I posed this question to them. That deepened my confidence that they both know what they are doing and care about doing it right.

    I myself haven’t used Qualtrics nor have I seen any data that could help to demonstrate that in fact a sample they recruited for, say, a study of ideologically motivated reasoning on some disputed issue of risk (maybe climate change) is satisfactory. But I would be open to dealing with them myself & seeing what they came up with, and open to taking very seriously the study results of any scholar who relied on them & who could simply tell me what he or she did to attain a reasonable degree of assurance in that regard. (I do think Qualtrics would make its services more useful and attractive to scholars if it issued a “white paper” or equivalent that is super clear about the methods it uses, the standards it employs to assess sample quality, any data it has collected to validate its methods, and the identity of published studies etc that have used samples they’ve collected–if one submits a paper to a peer-reviewed journal, one would like to be able to supply such info to assure reviewers that the sample is valid!)

    But another thing about Qualtrics: if you need a specialized nonrepresentative sample, you might well conclude that they are the most likely to be able to get you what you need. Firms that specialize in constructing large national panels suitable for assembling “nationally representative” samples are unlikely to be able to give you an N = 1,000 sample of “doctors” or “actuaries” etc! But if you want that, Qualtrics will see what it can do; or at least that’s what they told me, and I found that very impressive, a fact I tucked away for future reference, since indeed I could see myself being interested in doing a study like that.

    So those are my thoughts. Am eager to hear more of yours & more of others who are wrestling w/ these questions.