An obscure controversy has reared its ugly head again this past month. Two icons of the quantitative analysis community have locked horns on the greatest of public stages, Twitter. You may be forgiven for not following the controversy: I’ll do a quick review for the uninitiated. All code and data used to create this article can be forklifted from this MatrixDS project.
Nate Silver is the co-founder of FiveThirtyEight, a massively popular data-focused blog that gained fame for the accuracy of its predictions in the 2008 U.S. elections. Silver generates predictions using a clever poll-aggregating technique that accounts for biases, such as pollsters who only call people with landlines.
A trained statistician, via economics, he directed his passion for baseball (sabermetrics) and poker analytics into the arena of politics. In fact, the name FiveThirtyEight is a nod to the number of U.S. electoral votes (538 of them). However, the blog also covers other interest areas like sports. Nate sold his blog to ESPN and took the job of Editor in Chief. ESPN used it as a platform to feed its audience forecasts of sporting events; FiveThirtyEight has since moved to ABC. A routine visit to their website greets you with a mix of political and sports articles featuring detailed predictions and data visualizations.
Nate’s forecasting prowess has become accepted as canon in the popular media. He is a routine guest on many nationally televised shows to discuss his predictions during every national election cycle. So it came as quite a shock when Nassim Taleb, a best-selling author and quantitative risk expert, publicly announced that FiveThirtyEight does not know how to forecast elections properly!
For his part, Taleb has become extremely successful due to his shrewd understanding of probability in the real world. His books are both philosophical and technical, with a focus on uncertainty and risk. Specifically, he believes that the vast majority of quantitative models used in practice do not sufficiently account for real-world risk. Instead, they give the illusion of short-term value (like being accurate in some well-understood situations) but expose the unknowing users to enormous systemic risk when they experience situations the models are not designed to understand.
Taleb gained fame, in part, because he puts his philosophy into action by exposing his wealth. Malcolm Gladwell wrote an article in the New Yorker on how Taleb turned his philosophy on risk into an incredibly successful investment strategy. He has since gained significant wealth during unforeseen market events such as the Russian debt default, 9/11, and the financial crisis of 2008. Taleb now spends much of his time writing and deadlifting (I’m jealous of this bit). He is not shy about telling someone publicly that he disagrees with them: One of those people is Nate Silver.
However, Silver isn’t taking the insults lying down!
Silver and Taleb, with three million and 300k followers respectively, create an enormous buzz with these exchanges (starting back in 2016). However, a quick read through the comment threads and you will realize that few people understand the arguments. Even Silver himself seems taken off guard by Taleb’s attack.
I think, however, this is a great opportunity for a data science professional (or aspiring professional) to dig deeper into what is being said. There are implications for how we choose to model and present our work in a reliable and verifiable way. You must decide for yourself whether Taleb has a point or is just another crazy rich person with too much time on his hands.
The primary source of controversy and confusion surrounding FiveThirtyEight’s predictions is that they are ‘probabilistic.’ Practically, what this means is that they do not predict a winner or loser but instead report a likelihood. Further complicating the issue, these predictions are reported as point estimates (sometimes with model-implied error), well in advance of the event. For example, six months before polls open, this was their forecast of the 2016 presidential election.
Their forecast process is to build a quantitative replica of a system with expert knowledge (elections, sporting events, etc.) then run a Monte Carlo simulation. If the model closely represents the real-world, the simulation averages can be reliably used for probabilistic statements. So what FiveThirtyEight is actually saying is:
x% of the time our Monte Carlo simulation resulted in this particular outcome
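To make the mechanics concrete, here is a toy sketch of a Monte Carlo election simulation. The state names, win probabilities, and electoral votes are all invented for illustration; notably, this sketch treats states as independent, which is exactly the kind of simplification Silver criticizes in competing models.

```python
import random

random.seed(42)

# Hypothetical per-state inputs: (prob. candidate X wins, electoral votes).
# These are illustrative numbers, not FiveThirtyEight's actual model inputs.
states = {
    "A": (0.80, 20),
    "B": (0.55, 15),
    "C": (0.45, 30),
    "D": (0.30, 25),
}

def simulate_election(states):
    """One Monte Carlo draw: tally votes from independently simulated states."""
    return sum(votes for p, votes in states.values() if random.random() < p)

n_sims = 100_000
needed = 46  # majority of the 90 electoral votes in this toy map
wins = sum(simulate_election(states) >= needed for _ in range(n_sims))

print(f"Candidate X wins in {wins / n_sims:.1%} of simulations")
```

The reported ‘probability’ is nothing more than the fraction of simulated worlds in which the candidate won, so it is only as trustworthy as the model that generated those worlds.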
The problem is that models are not perfect replicas of the real world and are, as a matter of fact, always wrong in some way. This type of model building allows for some amount of subjectivity in construction. For example, Silver has said on numerous occasions that other competitive models do not correctly incorporate correlation. When describing modeling approaches, he also makes clear that they tune outcomes (like artificially increasing variance based on the time until an event or similar adjustments). This creates an infinitely recursive debate as to whose model is the ‘best’ or most like the real world. Of course, to judge this, you could look at who performed better in the long run. This is where things go off the rails a bit.
Because FiveThirtyEight only predicts probabilities, they never take an absolute stand on an outcome: no ‘skin in the game,’ as Taleb would say. Their readers, however, do not follow suit. In the public eye, they (FiveThirtyEight) are judged on how many events with forecasted probabilities above 50% happened, and below 50% didn’t (in a binary setting). Or the readers just treat the highest reported probability as the intended forecast. For example, they were showered with accolades after ‘calling 49 of 50 states in the 2008 presidential race correctly,’ and Nate Silver was placed on Time’s list of the 100 most influential people. He should not have accepted the honor if he didn’t call a winner in any of the states!
The public can be excused for using the 50% rule without asking. For example, in supervised machine learning, a classification model must have a characteristic called a ‘decision boundary.’ This is often decided a priori and is a fundamental part of understanding the quality of the model after it is trained. Above this boundary, the machine believes one thing and below it the opposite (in the binary case).
For standard models, like logistic regression, the default decision boundary is assumed to be 50% (or 0.5 on a 0 to 1 scale), or the alternative with the highest value. Classical neural networks designed for classification often use softmax functions, which are interpreted in just this way. Here is an example of a Convolutional Neural Network performing image classification using computer vision. Even this basic Artificial Intelligence model manages to make a decision.
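Both decision rules mentioned above can be sketched in a few lines. This is a minimal illustration with made-up probabilities and labels, not any particular model’s output: the binary case applies a 0.5 threshold, and the multi-class case takes the largest softmax probability.

```python
import math

def softmax(scores):
    """Convert raw scores (logits) into probabilities that sum to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Binary case: a 0.5 decision boundary turns a probability into a call.
p_win = 0.71  # hypothetical forecast probability
call = "win" if p_win >= 0.5 else "lose"
print(call)  # the model commits to an outcome

# Multi-class case: the decision is the label with the highest probability.
labels = ["cat", "dog", "bird"]
probs = softmax([2.0, 1.0, 0.1])  # hypothetical logits
decision = labels[probs.index(max(probs))]
print(decision)
```

The point is that a standard classifier, however simple, comes packaged with an explicit rule for converting probabilities into decisions, which is precisely what FiveThirtyEight’s forecasts lack.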
If FiveThirtyEight has no stated decision boundary, it can be difficult to know how good their model actually is. The confusion is compounded when they are crowned with plaudits of crystal-ball-like precision, as in 2008 and 2012, and gladly accept them, all due to the implied decision boundary. However, when they are accused of being wrong, they fall back to a simple quip:
Often this is followed up with an exposé about how they only reported x%, so that means that (1-x)% can also happen. It’s a perfect scenario; they can never be wrong! We should all be so lucky. Of course, this probabilistic argument may be valid, but it can cause some angst if it seems disingenuous. Even the Washington Post had an opinion piece which opined as much during the 2016 election.
What is not clear to the FiveThirtyEight reader is that there is a hidden factor. Predictions have two types of uncertainty: aleatory and epistemic. Aleatory uncertainty concerns the randomness of a known system (the probability of rolling a six on a standard die). Epistemic uncertainty concerns our knowledge of the system itself (how many sides does this die even have, and so what is the probability of rolling a six?). With the latter, you have to guess both the game and the outcome; like an election!
Bespoke models, like FiveThirtyEight’s, report to the public only aleatory uncertainty, as it concerns their statistical outputs (inference by Monte Carlo in this case). The trouble is that epistemic uncertainty is very difficult (sometimes impossible) to estimate. For example, why didn’t FiveThirtyEight’s model incorporate, before it happened, the chance that Comey would re-open his investigation into Clinton’s emails? Instead, this event seems to have caused a massive spike in the variation of the prediction, likely because it was impossible to forecast.
Instead, epistemically uncertain events are ignored a priori, and FiveThirtyEight treats the wild fluctuations in a prediction caused by unforeseen events as a normal part of forecasting. This should lead us to ask: ‘If the model is ignoring some of the most consequential uncertainties, are we really getting a reliable probability?’
To expand on this further, I have consolidated some of FiveThirtyEight’s predictions, using their open source data, for two very different types of events: U.S. Senate elections and National Football League (NFL) games. Here is a comparison between the final forecast probabilities and the actual proportion of outcomes.
The sports data (NFL games) has an excellent linear relationship. These proportions are built using 30K data points, so, if we assume the system is stable, we have averaged out any sampling error. However, as you can see, there is still a noticeable variation of 2–5% between actual proportions and predictions. This is a signal of unaddressed epistemic uncertainty. It also means you cannot take one of these forecast probabilities at face value.
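This kind of comparison is a calibration check: bin the forecasts, then compare each bin’s stated probability to the fraction of events that actually happened. The sketch below runs the check on synthetic data (not FiveThirtyEight’s actual records), with a small random miscalibration baked in to mimic unaddressed uncertainty.

```python
import random

random.seed(1)

# Synthetic stand-in for the game data: (forecast probability, outcome) pairs.
# Outcomes are drawn from a rate shifted up to 3% off the forecast, to mimic
# the 2-5% miscalibration seen in the real comparison.
records = []
for _ in range(30_000):
    p = random.random()
    true_p = min(max(p + random.uniform(-0.03, 0.03), 0.0), 1.0)
    records.append((p, random.random() < true_p))

# Bin forecasts into deciles and compare to the actual win proportion per bin.
bins = [[] for _ in range(10)]
for p, won in records:
    bins[min(int(p * 10), 9)].append(won)

for i, outcomes in enumerate(bins):
    forecast_mid = (i + 0.5) / 10
    actual = sum(outcomes) / len(outcomes)
    print(f"forecast ~{forecast_mid:.0%}  actual {actual:.1%}  (n={len(outcomes)})")
```

A well-calibrated forecaster’s bins land on the diagonal; persistent gaps, like the ones in the NFL data, are the footprint of uncertainty the model did not capture.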
Sports, like other games of chance, have very well-defined mechanisms which lend themselves to statistical analysis. Highly non-linear events, like contested elections, may not. With far fewer data points, you can see the variation of the Senate predictions is enormous. Gauging the performance of models on these types of events becomes doubly difficult. It isn’t clear whether a prediction is wrong owing to the quality of the model (epistemic) or just luck (aleatory).
One of the most troubling things about this approach to forecasting is that it opens Pandora’s box for narrative fallacies. Why did Clinton lose? Comey? Email servers? People can then justify possibly spurious inferences by eyeballing events which occur around the forecast variation. ‘Just look at how the forecast is changing with all this news!’
I think this is what has Taleb up in arms. The blog feels more like a slick sales pitch, complete with quantitative buzzwords, than unbiased analysis (though it may very well be). If a prediction does not obey some fundamental characteristics, it should not be marketed as a probability. More importantly, a prediction should be judged from the time it is given to the public and not just the moment before the event. A forecaster should be held responsible for both aleatory and epistemic uncertainty.
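One standard way to hold a forecaster accountable for the whole forecast path, not just the final number, is a proper scoring rule such as the Brier score averaged over every published forecast. This is a sketch of the general idea with invented forecast paths; it is not the specific scoring Taleb develops in his paper.

```python
def brier_score(forecasts, outcome):
    """Mean squared error between each published probability and the 0/1 outcome.

    Scoring every published forecast, not just the last one, penalizes a
    noisy forecast path even when the final number looks good.
    """
    return sum((p - outcome) ** 2 for p in forecasts) / len(forecasts)

# Hypothetical forecast paths for the same event (outcome = 1: it happened).
steady = [0.60, 0.62, 0.61, 0.63, 0.65]    # low-noise path
volatile = [0.90, 0.30, 0.80, 0.20, 0.65]  # wild swings, same final number

print(f"steady:   {brier_score(steady, 1):.3f}")
print(f"volatile: {brier_score(volatile, 1):.3f}")  # higher (worse) score
```

Both paths end at 65%, yet the volatile one scores worse, which is the sense in which a forecaster can be ‘held responsible’ for noise published along the way.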
When viewed this way, it is clear that FiveThirtyEight reports too much noise leading up to an event and not enough signal. This is great for driving users to read long series of related articles on the same topic, but not rigorous enough to bet your fortune on. Taleb’s and Silver’s takes on how FiveThirtyEight should be judged can be visualized like this.
Because there is so much uncertainty around non-linear events, like an election, it could reasonably be considered frivolous to report early-stage forecasts. The only conceivable reason to do so is to capture (and monetize?) the interest of a public hungry to know the future. I will not go into the technical arguments; Taleb has written and published a paper on the key issues, along with a solution.
Here we can say, with some confidence, that FiveThirtyEight’s predictions are not reliable probabilities, however much they masquerade as such, being between 0 and 1 and all. This is Taleb’s primary argument: FiveThirtyEight’s predictions do not behave like probabilities that incorporate all uncertainty and should not be passed off as them.
I do not want to suggest that FiveThirtyEight is bad at their craft. They are, likely, the best poll aggregator in the business. If we only look at the last reported probabilistic forecast and use the public’s decision boundary, they are more successful than any other source attempting the same task. However, positioning yourself to appear correct regardless of the outcome, making users infer their own decision boundaries, over-reporting predictions, and ignoring epistemic uncertainty should not be overlooked. As goes FiveThirtyEight’s reputation, so goes much of the data community’s.
Be clear about your suggested decision boundary, your probabilistic statements, and your assumptions about uncertainty, and you’ll be less likely to misguide stakeholders.
Follow me on LinkedIn: https://www.linkedin.com/in/isaacfaber/
Follow me on MatrixDS: https://community.platform.matrixds.com/community/isaacfab/overview