Suppose you’re proofreading a book. If you’ve read 20 pages and found 7 typos, you might reasonably estimate that the chances of a page having a typo are 7/20. But what if you’ve read 20 pages and found no typos? Are you willing to conclude that the chances of a page having a typo are 0/20, i.e. the book has absolutely no typos?
To take another example, suppose you are testing children for perfect pitch. You’ve tested 100 children so far and haven’t found any with perfect pitch. Do you conclude that children don’t have perfect pitch? You know that some do, because you’ve heard of instances before. But your data suggest perfect pitch in children is at least rare. How rare?
The rule of three gives a quick and dirty way to estimate these kinds of probabilities. It says that if you’ve tested N cases and haven’t found what you’re looking for, a reasonable estimate is that the probability is less than 3/N. So in our proofreading example, if you haven’t found any typos in 20 pages, you could estimate that the probability of a page having a typo is less than 15%. In the perfect pitch example, you could conclude that fewer than 3% of children have perfect pitch.
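The rule of three is simple enough to state in a couple of lines of code. Here is a minimal sketch (the function name is mine, not standard terminology):

```python
def rule_of_three(n):
    """Upper bound on the probability of an event that has not
    occurred in n independent observations, per the rule of three."""
    return 3.0 / n

# 20 typo-free pages: chance of a typo per page is likely under 15%.
print(rule_of_three(20))   # 0.15
# 100 children tested, none with perfect pitch: likely under 3%.
print(rule_of_three(100))  # 0.03
```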
Note that the rule of three says that your probability estimate goes down in proportion to the number of cases you’ve studied. If you’d read 200 pages without finding a typo, your estimate would drop from 15% to 1.5%. But it doesn’t suddenly drop to zero. I imagine most people would harbor a suspicion that there may be typos even though they haven’t seen any in the first few pages. But at some point they might say “I’ve read so many pages without finding any errors, there must not be any.” The situation is a little different with the perfect pitch example, however, because you may know before you start that the probability cannot be zero.
If the sight of math makes you squeamish, you might want to stop reading now. Just remember that if you haven’t seen something happen in N observations, a good estimate is that the chances of it happening are less than 3/N.
What makes the rule of three work? Suppose the probability of what you’re looking for is p. If we want a 95% confidence interval, we want to find the largest p so that the probability of no successes out of n trials is 0.05, i.e. we want to solve (1 − p)^n = 0.05 for p. Taking logs of both sides, n log(1 − p) = log(0.05) ≈ −3. Since log(1 − p) is approximately −p for small values of p, we have p ≈ 3/n.
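You can check how good the approximation is by solving (1 − p)^n = 0.05 exactly for p and comparing with 3/n. A quick sketch (the function name is mine; the sample sizes are just illustrative):

```python
def exact_upper_bound(n, alpha=0.05):
    """Solve (1 - p)**n = alpha exactly for p."""
    return 1 - alpha ** (1 / n)

# Compare the exact bound to the rule-of-three approximation 3/n.
for n in (20, 100, 1000):
    print(f"n={n}: exact {exact_upper_bound(n):.4f}, rule of three {3/n:.4f}")
```

The two agree more closely as n grows, since the approximation log(1 − p) ≈ −p improves as p shrinks.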
The derivation above gives the frequentist perspective. I’ll now give the Bayesian derivation of the same result. Then you can say “p is probably less than 3/N” with a clear conscience, since Bayesians are allowed to make such statements.
Suppose you start with a uniform prior on p. The posterior distribution on p after having seen 0 successes and N failures has a beta(1, N+1) distribution. If you calculate the posterior probability of p being less than 3/N you get an expression that approaches 1 – exp(-3) as N gets large, and 1 – exp(-3) ≈ 0.95.
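To see this numerically, note that a beta(1, b) distribution has the closed-form CDF F(x) = 1 − (1 − x)^b, so the posterior probability that p < 3/N is easy to compute directly. A sketch (variable names are mine):

```python
from math import exp

def posterior_prob_below_3_over_n(n):
    """P(p < 3/n) under a beta(1, n+1) posterior, i.e. a uniform
    prior updated on 0 successes and n failures.
    Uses the beta(1, b) CDF: F(x) = 1 - (1 - x)**b."""
    return 1 - (1 - 3 / n) ** (n + 1)

for n in (10, 100, 10000):
    print(f"N={n}: {posterior_prob_below_3_over_n(n):.4f}")
print(f"limit 1 - exp(-3) = {1 - exp(-3):.4f}")
```

So the bound 3/N is a slightly conservative 95% credible upper bound for every N, approaching exactly 1 − exp(−3) ≈ 0.95 in the limit.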
Update: Italian translation of this post.