By Jacob Falkovich

Bad news: the replication crisis replicated. Out of 21 randomly chosen psychology papers published in the prestigious Nature and Science journals in 2010-2015, only 13 survived a high-powered replication.

Good news: a prediction market where research peers could bet on which results would replicate guessed almost all of them correctly. So did a simple survey of peers with no monetary incentive.

Better news: so could I.

Best news: so can you.

Rob Wiblin of 80,000 Hours put together a quiz that offers descriptions of the 21 studies and lets you guess if their main finding replicated or not. I recommend trying this out for yourself, and if you’re not confident in your ability you can review some of my previous posts on defense against the dark arts (of bullshit statistics).

The stench of bad research is difficult to hide, and a few simple rules are enough to tell the true insights into human nature from the p-hacked travesties of science. Here’s what you need to know to ace the quiz and avoid falling for the next piece of psych nonsense.

Rule 1: The Rule of Anti-Significance

If a study has p=0.049 it is fake.

There are two studies in the quiz with p-values just below the common 0.05 threshold. I immediately (and correctly) identified both as fake without reading anything else.

If you take a Statistics 101 class at most universities, you are taught the following rule of statistical significance:

A result with a p-value above 0.05 is probably false. A p-value below 0.05 is statistically significant, meaning the result is true.

It’s never phrased like that explicitly, but that is the implied rule that people learn as they “test” hypotheses against the 0.05 threshold to get an A in the class. I got an A+ in my statistics class in business school by following that rule religiously.

But that was a long time ago. Today, allow me to present Jacob’s Rule of Anti-Significance:

A result with a p-value just above 0.05 could well be true. A result with a p-value just below 0.05 is almost certainly false.

If you understand why this is so, you know all you need to about statistics in research. If you don’t understand why this is so, allow me to explain.

Let’s start with the first part: how likely is a result with p=0.06 to replicate?

p=0.06 roughly means that the measured effect is 1.5-1.9 times the standard error, depending on the test used. A positive result is usually due to some combination of luck and true effect. If the two contributed equal amounts, the true effect may be only 0.8 times the SE.

When you run a replication with 10 times the sample size (which many studies in the replication project did), the standard error will be 3.1 times lower because it decreases as the square root of the sample size. This means that the true effect is now 0.8*3.1 = 2.5 times the standard error of the larger sample, which is more than enough to count as a successful replication. Some p=0.06 results will be entirely due to luck, but a lot of them will point to a real effect.
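The arithmetic above fits in a few lines (a sketch with illustrative numbers from the paragraph: an observed effect of 1.6 standard errors, half of it luck):

```python
import math

# Observed effect at p ~ 0.06, in units of the original standard error
observed_z = 1.6
# Assume half the observed effect was luck, half was real
true_effect_z = observed_z / 2

# A replication with 10x the sample size shrinks the standard error
# by sqrt(10), so the same true effect is worth more z.
scale = math.sqrt(10)
replication_z = true_effect_z * scale
print(round(replication_z, 2))  # 2.53, comfortably past the 1.96 threshold
```

So even a half-lucky p=0.06 result clears significance easily once the replication's standard error shrinks.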

More importantly, p=0.06 means that the researchers are honest. They could have easily p-hacked the results below 0.05 but chose not to. The opposite is true when p=0.049.

The chance that the p-value of a study will land precisely in the 0.045-0.05 range is 0.005 (1/200) if the effect doesn’t exist, or 1/60 for the most optimistic case where the effect falls right on the p=0.05 line.

But if a study was p-hacked, if the researchers kept juggling different hypotheses, including and excluding outliers, and tweaking the measurements, then it is almost guaranteed to land in the 0.045-0.05 range for p-values because that’s where the hacking will stop.

In Bayesian terms, which are the terms we should be using anyway, a p-value in the 0.045-0.05 range gives a 60-200 times higher likelihood to the hypothesis “the study was p-hacked by unscrupulous researchers” than to the hypothesis “the study got that p-value by accident”. And since unscrupulous (or just clueless) researchers in psychology are certainly more common than 1 in 60, the conclusion (i.e., posterior) is that a study with p=0.049 got that p-value by bullshit means, and its result is bullshit.
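Both likelihoods can be checked with a quick Monte Carlo sketch (the "most optimistic case" here is modeled as a true effect sitting exactly on the p=0.05 line, i.e. 1.96 standard errors):

```python
import random, math

def p_two_sided(z):
    # Two-sided p-value for a z statistic (normal approximation)
    return math.erfc(abs(z) / math.sqrt(2))

random.seed(0)
N = 200_000
null_hits = effect_hits = 0
for _ in range(N):
    noise = random.gauss(0, 1)
    # No real effect: the z statistic is pure noise
    if 0.045 <= p_two_sided(noise) < 0.05:
        null_hits += 1
    # True effect sitting exactly on the p = 0.05 line (z = 1.96)
    if 0.045 <= p_two_sided(1.96 + noise) < 0.05:
        effect_hits += 1

print(null_hits / N)    # ~0.005: 1 in 200
print(effect_hits / N)  # ~0.017: about 1 in 60
```

A p-hacked study, by contrast, lands in that narrow window close to 100% of the time, because that's where the hacking stops.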

Rule 2: The Rule of Taleb’s Grandma

If the purported effect sounds implausible, it is.

You have a mind capable of simulating itself, which lets you replicate any psychological study inside your own head with N=1.

Example 1: “People prefer watching TV for 12 minutes to being alone with their thoughts for 12 minutes.” Right now, you’re probably procrastinating by reading this blog because you don’t want to be alone with your thoughts. This one replicated easily.

Example 2: “If you imagine eating an M&M 30 times, immediately afterward you will eat fewer M&Ms from a bowl.” Do it. Imagine yourself eating an M&M: picking it up, chewing, swallowing. Now do it 29 more times. You can almost certainly feel your attitude towards M&Ms changing.

I don’t know if I would have guessed ahead of time that the effect would be to make me want fewer M&Ms, but it’s certainly plausible from my N=1 thought experiment that there would be a detectable effect one way or another. Of course, if the effect was to make people eat more M&Ms, the study would still be published! This study also replicated, with good effect size.

Example 3: “Washing your hands, but not examining soap, makes you less likely to want to justify your decision of how you ranked music albums.” Imagine yourself washing your hands. Do you feel any impact on your desire to rationalize decisions? Now imagine explaining this study to Nassim Taleb’s grandma.

Psychologist: You see, Taleb’s grandma, there’s a clear link between washing your hands and justifying album-ranking choices.

Taleb’s Grandma: What the fuck are you talking about?

Psycho: Cleaning one’s hands “eliminates the postdecisional dissonance effect” by priming you to think of a “clean slate”.

Grandma: Just because we use the word “clean” in English to refer both to hands and to your conscience doesn’t mean that thinking about cleanliness in one context will change your behavior in the other context. That’s cockamamie.

Psycho: No, no, just thinking about washing your hands is not enough to prime you, even though every other priming study says it’s enough to just think of things. Thinking about soap doesn’t do anything. You need to actually wash your hands to get the effect, and not just because we tried different ways of priming and only reported the one that gave us a publishable p-value.

Grandma: Ok, so you’re saying that washing my hands makes me want to “come clean” and explain my decision on how I ranked some albums?

Psycho: It’s the opposite! Washing your hands makes you less likely to explain your decision because you already think of yourself as metaphorically clean.

Grandma: This story about washing hands and explaining decisions depends on a conjunction of multiple steps, every single one of them preposterous by itself, and with the effect direction at each step chosen completely at random. There are more burdensome details in this hypothesis than can be lifted by 40 exhaustive studies with hundreds of participants each, let alone a single study with 40 undergrads. This is ridiculous bullshit, and I need to wash my ears with soap just to remove all trace of this nonsense from my brain.

Psycho: Well, it was good enough to get published in Science. Are you saying that peer review by experts isn’t a guarantee of true results?

Grandma: Wait till I tell my grandson about this, he’ll make an entire career out of mocking people like you. #IYI #SkinInTheGame #LindyEffect

We can summarize the takeaway in an addendum to rule 2.

Rule 2b: we should all be embarrassed that we believed anything about priming even for a second.

Rule 3: The Rule of Multiplicity

If the study looks like it tried 20 different things to get a p-value, it has. Whatever effect it claims to have found is just an artifact of multiple hypothesis testing.

I wrote a couple thousand words already about why a study that tries several hypotheses and doesn’t correct for multiplicity isn’t worth the pixels it is written on. That’s my most underrated post ever, probably because even readers who click on a self-proclaimed “math blog” called “Put a number on it” don’t want too much actual math in their blog posts.
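The core of the multiplicity problem fits in one line of arithmetic: with k independent tests at the 0.05 level and no real effects anywhere, the chance of at least one “significant” result is 1 − 0.95^k.

```python
# Chance of at least one false positive among k independent tests at alpha = 0.05
for k in (1, 5, 20, 100):
    print(f"{k:3d} tests: {1 - 0.95 ** k:.0%} chance of a spurious 'significant' result")
```

Twenty uncorrected tests give you roughly a two-in-three chance of “finding” something in pure noise.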

The fun part is that you can guess which studies are multiplicitous just from their abstracts. Here’s how one of the studies was summarized on the 80,000 Hours quiz:

When holding and writing on a heavier clipboard, people assessing job applicants rate them as ‘better overall’, and ‘more seriously interested in the position’.

This sentence already stank of implausible priming, but as soon as I saw the word “and” in the description, I clicked “fake” without even looking at the sample size or p-value. If you’re familiar with psychology literature, you immediately know that the study tried several clipboards, several measures of impact, and a couple dozen interaction effects just to be sure that something somewhere will hit a publishable p-value.

Here are some excerpts from the actual paper (courtesy of our heroes at Sci-Hub):

Physical touch experiences may create an ontological scaffold for the development of intrapersonal and interpersonal conceptual and metaphorical knowledge.

The first sign that you’re about to be fed bullshit is an abstract full of 4-syllable words where 2-syllable words would do.

The experience of weight, exemplified by heaviness and lightness, is metaphorically associated with concepts of seriousness and importance. This is exemplified in the idioms “thinking about weighty matters” and “gravity of the situation.”

Priming is really like the Kabbalah, where semi-arbitrary coincidences of language have the power to shape worlds.

In our first study, testing influences of weight on impression formation, we had 54 passersby evaluate a job candidate by reviewing resumes on either light (340.2 g) or heavy (2041.2 g) clipboards. Participants using heavy clipboards rated the candidate as better overall and specifically as displaying more serious interest in the position. However, the candidate was not rated as more likely to “get along” with co-workers, suggesting that the weight cue affected impressions of the candidate’s performance and seriousness, consistent with a “heavy” metaphor, but not the metaphorically irrelevant trait of social likeability.

Does anyone actually believe that if the candidate was rated as easier to get along with they would admit that it contradicts their hypothesis instead of making up a just-so story about how the candidate is a “solid person” you can “lean on”?

Our second study investigated how metaphorical associations with weight affect decisionmaking […] Here, a main effect of clipboard condition was qualified by an interaction with participant gender.

When you’re desperate for p-values and need to come up with 100 new hypotheses to test, breaking your group into arbitrary categories (by gender, age, race, astrological sign…) is the easiest way to do so. This is the “elderly Hispanic woman effect”.

Comparable to study five, participants who sat in hard chairs judged the employee to be both more stable (p = 0.030) and less emotional (p = 0.028), but not more positive overall. On the negotiation task, no differences in offer prices emerged (p > 0.14).

We next calculated the change in offer prices from first to second offer, on the presumption that activating the concepts of stability and rigidity should reduce people’s decision malleability or willingness to change their offers.

Among participants who made a second offer, hard chairs indeed produced less change in offer price (M = $896.5, SD = $529.6) than did soft chairs (M = $1243.6, SD = $775.9).

This study is basically a p-hacking manual. They’re not even trying to hide it, instead describing in detail how, when a hypothesis failed to yield a p-value below 0.05, they tried more and more things until something publishable popped out by chance.

It’s OK if one study finds that clipboard weight only affects measures A and B and not C, and only does so for women and not men, if you then run another study that only looks at A, B, and women. But a study that tried 100 things and tells you about 3 of them is like a criminal on trial who mentions that there are some banks that he didn’t rob.

Rule 4: The Rule of Silicon Boobs

If it’s sexy, it’s probably fake.

“Sexy” means “likely to get published in the New York Times and/or get the researcher on a TEDx stage”. Actual sexiness research is not “sexy” because it keeps running into inconvenient results, for example, that rich and high-status men in their forties and skinny women in their early twenties tend to find each other very sexy. The only way to make a result like that “sexy” is to blame it on the patriarchy, and the norms in psychology aren’t that far gone (yet).

So: “Participants automatically project agents’ beliefs and store them in a way similar to that of their own representation about the environment (a comparison of the mean reaction time between the P-A- treatment and the P-A+ treatment)”. This is terribly unsexy. It’s also quite true, and replicated with a large effect size.

“Participants in a condition that simulated the stress of being poor did worse on an attention task than those who simulated the ease of being rich.” Very sexy, as is anything that has to do with educational interventions, wealth inequality being bad, discrimination being really bad, or any other result that easily projects to a progressive policy platform. Of course, the replication found an almost-significant result in the opposite direction of the original, where people in the “poor condition” paid more attention and did better.

Anything counterintuitive is also sexy, and (according to Rule 2) less likely to be true. So is anything novel that isn’t based on solid existing research. After all, the Times is in the newspaper business, not the truthspaper one.

Finding robust results is very hard, but getting sexy results published is very easy. Thus, sexy results generally lack robustness. I personally find robustness quite sexy, but that attitude seems to have gone out of fashion since the Renaissance.

Reasons for Optimism

Andrew Gelman wrote in 2016:

Let’s just put a bright line down right now. 2016 is year 1. Everything published before 2016 is provisional. Don’t take publication as meaning much of anything, and just cos a paper’s been cited approvingly, that’s not enough either. You have to read each paper on its own. Anything published in 2015 or earlier is part of the “too big to fail” era, it’s potentially a junk bond supported by toxic loans and you shouldn’t rely on it.

While it’s certainly true that a lot of psychology was junk science in the pre-2016 era, it wasn’t clear whether things would improve from 2016 onwards.

The replication crisis in psychology is not a new phenomenon. Statistician Jacob Cohen noted back in 1962 that most studies in psychology are underpowered and full of false positives. In 1990, he noted that things had only gotten worse. Why were voices like Cohen’s ignored for more than five decades?

My hypothesis is that:

1. Most psychologists couldn’t understand the mathematics of what was wrong, or didn’t care to try. The standards of the field were such that they could get away with criminal methodology.
2. The psychologists who did care about mathematical rigor were at a disadvantage since they couldn’t match the publication output of their p-hacking counterparts. A lot of them probably left to do something else, like advertising in the 1960s or consumer data science in the 2010s.

But it’s harder to get away with bullshit studies if everybody knows how to spot them and everybody knows that everybody knows. If you and I can guess which studies will replicate with close to 90% accuracy, so can the editors of Nature and Science, and now they’ll have to do better than batting 62% (13/21). Researchers can’t pretend that the “replication was unlucky” if everyone could tell at a glance that the study would not replicate.

There are ways to improve the reliability of psychology research that require learning some math, although not beyond what you can learn from reading Putanumonit: estimating experimental power, calculating the likelihood of alternatives instead of null-hypothesis testing, correcting for multiplicity. But there are also fixes that don’t require knowing any math at all, like preregistering the analysis, being suspicious of interaction effects that were not in the main hypothesis, and getting a larger sample size than 20 undergrads who do it for course credit.
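Estimating experimental power, for example, takes only a normal approximation (a sketch, assuming a true effect of d = 0.4 standard deviations, a middling size by psychology standards):

```python
import math

def power_two_sample(effect_d, n_per_group, alpha_z=1.96):
    # Approximate power of a two-sample test (normal approximation)
    # for a true effect of `effect_d` standard deviations.
    se = math.sqrt(2 / n_per_group)
    z = effect_d / se
    # P(|Z| > 1.96) where Z ~ N(z, 1); the far tail is negligible
    return 0.5 * math.erfc((alpha_z - z) / math.sqrt(2))

print(round(power_two_sample(0.4, 20), 2))   # 0.24 with 20 per group
print(round(power_two_sample(0.4, 100), 2))  # 0.81 with 100 per group
```

Twenty undergrads per group gives you roughly a one-in-four chance of detecting such an effect; at that power, most results that do cross p=0.05 are noise or wildly inflated.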

Hopefully, psychology researchers have started doing these things in the couple of years since it became clear that bullshit will be caught. And if they haven’t, we’ll catch them.