In 2013, social psychologist David Kidd, then a graduate student at The New School in New York City, learned that his very first paper as lead author had passed peer review and would be published in Science. Now, Kidd’s paper, which suggested that reading literary fiction improves a person’s ability to intuit the mental states of others, has come under scrutiny again—with a less gratifying outcome. It is among eight studies called into question by a painstaking effort to replicate all 21 experimental social science papers published in Science and Nature between 2010 and 2015.
Called the Social Sciences Replication Project, it is the latest bid by the nonprofit Center for Open Science (COS) in Charlottesville, Virginia, and far-flung collaborators to quality check the scientific literature. Like its predecessors, the new effort found that a large fraction of published studies don’t yield the same results when done a second time. But this time, the five independent research teams that did the replications strove to give the studies the benefit of the doubt: They increased the statistical power of the studies by enlisting, on average, five times as many participants as the originals. “This is an effort to be very generous,” says Brian Nosek, a psychologist at the University of Virginia in Charlottesville who co-founded COS and whose lab conducted five of the new replications.
That may help explain why the new project successfully replicated 62% of the experiments, compared with 39% in a much larger study of papers in three psychology journals, which COS and collaborators released in 2015. A similar project scrutinizing economics studies reported in 2016 that it had replicated 61% of experiments.
The current findings, published this week in Nature Human Behaviour, seem to contradict the claim that studies in high-profile journals, which put a premium on groundbreaking or surprising results, are less reproducible than those in more specialized journals. But cognitive psychologist Hal Pashler at the University of California, San Diego, cautions that differences in replication rates between the various projects probably aren’t statistically significant. And the 62% figure “certainly is consistent with there being a problem” in the field, he says. “It seems funny that there’s been a drift in standards to the point where 62% seems very respectable.”
The teams aimed to test the notion that many studies are hard to reproduce because the claimed effect, though real, is inflated. If so, a replication effort would need to be more sensitive to find that smaller positive effect, Nosek says. “We didn’t want low power to be an explanation for why some effects didn’t replicate.”
The replication efforts, almost all of them designed in collaboration with original authors, were large enough to be sensitive to an effect only 75% of the size originally reported. If an initial replication attempt failed, the researchers added even more participants. The approach made a difference: Two experiments made the cut only after the sample sizes ballooned.
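The arithmetic behind that design choice can be sketched with a standard power calculation. The effect size below is hypothetical (the article doesn’t report one), and the formula is the usual normal approximation for a two-sample comparison, not the replication teams’ actual procedure:

```python
from math import ceil
from statistics import NormalDist

def n_per_group(d, alpha=0.05, power=0.90):
    """Approximate participants per group needed to detect a standardized
    effect d in a two-sample test, via the normal approximation:
    n ~ 2 * ((z_{1-alpha/2} + z_{power}) / d) ** 2."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for a two-sided test
    z_beta = z.inv_cdf(power)           # quantile corresponding to desired power
    return ceil(2 * ((z_alpha + z_beta) / d) ** 2)

# Hypothetical original effect size (Cohen's d) and the 75%-sized
# effect the replications were powered to detect.
d_original = 0.50
d_target = 0.75 * d_original

print(n_per_group(d_original))  # sample needed per group at the original effect
print(n_per_group(d_target))    # sample needed at the smaller target effect
```

Because required sample size scales with 1/d², powering a study for an effect 75% as large demands roughly 1/0.75² ≈ 1.8 times as many participants, before any further enlargement after a failed first attempt.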
Kidd, now a postdoctoral researcher at Harvard University, says the extra rigor makes it easier to accept the project’s negative verdict on his study. “I can’t imagine a reason why one would privilege the original findings over this replication.” But in commentaries published alongside the study, Kidd and other authors defend the underlying hypotheses of their papers. Kidd, for example, points out that the project repeated only one experiment from each paper, and in his case, it wasn’t the strongest or the most important.
Even repeating just a fraction of the work in the papers took years and cost more than $200,000, with plenty of donated labor and lab time. But the new project also highlights a cheaper way to gauge a paper’s replicability. The authors asked roughly 200 scientists and students, most of them psychologists and economists, to guess how likely each study was to be replicated. These experts also participated in an online “prediction market,” trading shares that corresponded to studies, which paid out only if the given study was replicated.
Both approaches did well at predicting the outcome for individual studies, and they predicted an overall replication rate very close to the actual figure of 62%. The finding echoes others suggesting expert judgment is a highly accurate proxy for replication. “There is definitely some wisdom of crowds going on here,” economist Anna Dreber of the Stockholm School of Economics, a member of the replication research team, said in a press conference. Anecdotal feedback from the expert evaluators shows they had higher confidence in the replicability of studies with larger sample sizes, and were more dubious of those with surprising or counterintuitive findings.
If experts can instinctively spot an irreproducible finding, “that kind of begs the question of why that doesn’t seem to be happening in peer review,” says Fiona Fidler, a philosopher of science at The University of Melbourne in Australia. But if future studies can identify and weigh the best predictors of replicability, reviewers might be given a rubric to help them weed out problematic work before it’s published.
Another trend may also help tame the problem of irreproducible studies: the push in many fields for authors to share the design of their studies ahead of time, to keep them from changing their approach midstream in search of a flashy, statistically significant result. The studies analyzed here mostly predate that shift, Nosek says. Whether it will really boost social science’s track record, he says, is “the next big question.”