Why You Should Never Trust a Data Scientist

by Henry Farrell on July 18, 2013


Pete Warden:

The wonderful thing about being a data scientist is that I get all of the credibility of genuine science, with none of the irritating peer review or reproducibility worries. My first taste of this was my Facebook friends connection map. The underlying data was sound, derived from 220m public profiles. The network visualization, which drew lines between the top ten links for each city, had issues but was defensible. The clustering was produced by me squinting at all the lines, coloring in some areas that seemed more connected in a paint program, and picking silly names for the areas. I thought I was publishing an entertaining view of some data I’d extracted, but it was treated like a scientific study. A New York Times columnist used it as evidence that the US was perilously divided. White supremacists dug into the tool to show that Juan was more popular than John in Texan border towns, and so the country was on the verge of being swamped by Hispanics. …
I’ve enjoyed publishing a lot of data-driven stories since then, but I’ve never ceased to be disturbed at how the inclusion of numbers and the mention of large data sets numbs criticism. The articles live in a strange purgatory between journalism, which most readers have a healthy skepticism towards, and science, where we sub-contract verification to other scientists and so trust the public output far more. … If a sociologist tells you that people in Utah only have friends in Utah, you can follow a web of references and peer review to understand if she’s believable. If I, or somebody at a large tech company, tells you the same, there’s no way to check. The source data is proprietary, and in a lot of cases may no longer exist in the exact same form, as databases turn over and users delete or update their information. Even other data scientists outside the team won’t be able to verify the results. The data scientists I know are honest people, but there are no external checks in the system to keep them that way.

[via Cosma – Cross-posted at CT]

{ 6 comments }

Zach July 18, 2013 at 3:42 pm

This is a pretty sensationalist title, which, ironically, is part of the reason why this sort of thing occurs so frequently.

Juan July 18, 2013 at 6:33 pm

I think that this has little to do with the prestige of “data science” and a lot to do with the statistical, numerical, and research design illiteracy of journalists and media pundits.

Larry Bartels July 18, 2013 at 6:50 pm

Not just journalists and pundits: http://journal.sjdm.org/12/12810/jdm12810.pdf.

Patrick Ball July 18, 2013 at 8:06 pm

I’m curious: do MC readers consider “statistical, numerical, and research design illiteracy” to include a tendency to believe inferences drawn from nonrandomly selected data? I follow MC occasionally, and I’m surprised at the number of posts based on data that is surely biased by its collection process. Commentary rarely draws attention to these issues. Given the explosion in nonrandom samples, it might be nice to have a subfield developing more ways to deal with such data. We’ve got raking, capture-recapture, and sensitivity analyses of various kinds. Surely we could have more and better? And insist on them when presented with analysis of, say, traffic patterns based on RFID payment schemes (e.g., E-ZPass, FasTrak) which exclude drivers without bank accounts (from an article in Significance, not MC, but it’s at the top of my mind right now)?
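[Ed.: to make one of the techniques Ball names concrete, here is a minimal Python sketch of two-sample capture-recapture estimation, using Chapman’s bias-corrected form of the Lincoln-Petersen estimator. The counts are invented for illustration; this is not from Ball’s comment or his work.]

```python
# A minimal sketch of two-sample capture-recapture (mark-recapture)
# estimation, one of the techniques mentioned above for reasoning
# about data whose collection process misses part of the population.
# Assumes a closed population and two independent samples; the
# counts below are made up for illustration.

def chapman_estimate(n1, n2, m):
    """Chapman's bias-corrected Lincoln-Petersen estimate of total
    population size from two overlapping samples.

    n1 -- size of the first sample (individuals "marked")
    n2 -- size of the second, independent sample
    m  -- number of individuals appearing in both samples
    """
    return (n1 + 1) * (n2 + 1) / (m + 1) - 1

# Two incident lists compiled by independent observers:
# 400 on the first, 300 on the second, 120 on both.
n_hat = chapman_estimate(400, 300, 120)
print(f"Estimated total population: {n_hat:.0f}")  # ~997

# Only 400 + 300 - 120 = 580 unique incidents were actually recorded,
# so the estimate quantifies how much both collection processes missed.
```

Real applications use more than two lists and model dependence between them; the two-sample version here is only the simplest case.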

JRLRC July 19, 2013 at 5:05 pm

If a “data scientist” just collects and presents data, is she an actual/true scientist?

Robert Young July 20, 2013 at 11:07 am

Never forget: it was the “data scientists” on Wall Street (a.k.a. quants) who crashed the world’s economy. Still. So, the title isn’t nearly as sensational as it could be. “Data Scientists are Evil Predators”, for example.

And don’t get me going on Bayesian priors.

