Applying Predictive Analytics to Flight Delays

By Ian Cassidy

Flying for business is full of uncertainty. For travelers with a tight connection window or an arrival time close to an important meeting, even a short flight delay can cause serious anxiety. Nearly a third of Upside’s business travelers encountered flight delays in the last 30 days alone. Of the delayed flights, 11% were delayed more than one hour and 4% were delayed more than two! Delays aren’t just a nuisance for the traveler, either: in lost productivity alone, they can cost companies hundreds of thousands, if not millions, of dollars per year.

At Upside, we want to help our customers mitigate these scenarios by predicting flight delays prior to their trips and, when possible, providing alternate flight options to get them to their destinations on time.

According to the Bureau of Transportation Statistics, domestic airlines reported an on-time arrival performance of about 80% for 2015–2016. Of the delayed flights, between 25% and 35% (depending on the month of the year) were delayed due to bad weather. Another 5–10% were due to late-arriving aircraft. If we can model these delays accurately, we’ll account for almost half of all delayed flights per year!

This is how the Delay Predictor came to be.

To produce its monthly and yearly airline metrics, the Bureau of Transportation Statistics also publishes the underlying data that tracks the performance of every domestic flight operated by large air carriers. We downloaded this dataset for historical flights going back to 2012 and continuously append new data as it becomes available for more recent flights. A similar dataset was published on Kaggle for all flights in 2015. As of last count, we have over 40 million rows of on-time performance data stored in a Snowflake table that is accessible to our entire data science team.

This dataset is amazingly clean, with very few missing or extreme values. In addition to expected fields such as flight number, flight duration, and scheduled departure/arrival times, it also breaks the delays out by type, like weather and late aircraft.
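As a rough sketch of what working with this table looks like, here’s a pull of a recent slice into pandas via the Snowflake connector. The connection parameters, table name, and column names are illustrative stand-ins that mirror the BTS schema, not our actual setup.

```python
import pandas as pd
import snowflake.connector

# Connection parameters are placeholders; the table and column names
# mirror the BTS on-time performance schema but are illustrative here.
conn = snowflake.connector.connect(
    account="YOUR_ACCOUNT",
    user="YOUR_USER",
    password="YOUR_PASSWORD",
    warehouse="ANALYTICS_WH",
    database="FLIGHTS",
    schema="PUBLIC",
)

query = """
    SELECT fl_date, op_carrier, origin, dest,
           crs_dep_time, crs_arr_time, arr_delay,
           carrier_delay, weather_delay, nas_delay,
           security_delay, late_aircraft_delay
    FROM on_time_performance
    WHERE fl_date >= '2018-08-01'  -- illustrative date window
"""

# Pull the slice of rows we care about into a DataFrame.
df = pd.read_sql(query, conn)

# The delay-by-type columns let us isolate weather-induced delays.
weather_delayed = df[df["weather_delay"] > 0]
```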

An obvious place to start the modeling effort was by predicting weather-induced flight delays. In order to do this, we chose to use Dark Sky’s API because it provides both historical and forecasted weather conditions using the same REST endpoint. At this point, I’d like to pause and bow down before whoever built the /forecast endpoint at Dark Sky. It’s super simple to use, relatively inexpensive, and almost always returns a valid response. Also, our team at Upside is obsessed with the Dark Sky smartphone app and I encourage everyone to download and use it!
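For the curious, a call looks roughly like the sketch below (the API key and coordinates are placeholders): appending a UNIX timestamp to the location switches the same /forecast endpoint from forecasted to historical “Time Machine” conditions.

```python
import time
import requests

API_KEY = "YOUR_DARK_SKY_KEY"  # placeholder

def get_conditions(lat, lon, unix_time=None):
    """Query the Dark Sky /forecast endpoint.

    With a timestamp appended to the location, the same endpoint returns
    historical ("Time Machine") conditions; without one, it returns a forecast.
    """
    loc = f"{lat},{lon}"
    if unix_time is not None:
        loc += f",{int(unix_time)}"
    url = f"https://api.darksky.net/forecast/{API_KEY}/{loc}"
    resp = requests.get(url, params={"exclude": "minutely,daily,alerts"})
    resp.raise_for_status()
    return resp.json()

# Forecasted conditions at DCA (38.8512, -77.0402) right now...
forecast = get_conditions(38.8512, -77.0402)
# ...and historical conditions 24 hours ago, via the same endpoint.
past = get_conditions(38.8512, -77.0402, time.time() - 24 * 3600)

print(forecast["currently"]["summary"], past["currently"]["summary"])
```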

Originally, the model was constructed as a binary classifier: either we predicted that a flight would be on time or that it would be delayed. It performed so well, both in training and in testing on real-time data, that we decided to turn it into a multi-class model to predict the magnitude of the delay.

Picking the delay classes was a bit tricky, but, with the help of the histogram below, we decided to go with 0–30 minutes, 30–60 minutes, 60–120 minutes, and 120+ minutes, where the 0–30 minutes class is essentially “on time.” Interestingly, there are clear dips in the histogram at 30, 45, 60, 80, and 110 minutes, which could suggest that the airlines are doing something to avoid being late by those exact durations. One thing we do know is that most airlines issue travel waivers (whereby you can change your flight without paying a change fee) if your flight is delayed more than one hour, so including 60 minutes as a class boundary made sense. Also, it’s worth noting that there aren’t any reported arrival delays under 15 minutes (the Bureau of Transportation Statistics only counts a flight as delayed if it arrives 15 or more minutes behind schedule), which justifies treating the first delay class as “on time.”

Histogram of arrival delays
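Mapping raw arrival delays onto these classes is a one-liner with pandas; here’s a toy sketch with made-up delay values:

```python
import numpy as np
import pandas as pd

# Toy arrival delays in minutes (the BTS data reports delays under
# 15 minutes as on time, so the smallest nonzero delays start at 15).
arr_delay = pd.Series([0, 18, 42, 75, 150])

# Class boundaries chosen from the histogram: 0–30, 30–60, 60–120, 120+.
bins = [-np.inf, 30, 60, 120, np.inf]
labels = [0, 1, 2, 3]  # ascending delay duration; 0 is effectively "on time"

delay_class = pd.cut(arr_delay, bins=bins, labels=labels)
print(delay_class.tolist())  # [0, 0, 1, 2, 3]
```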

I’m not going to go into the details of the features or models we are using because, well, that’s proprietary (but hey, we’re hiring). However, I will point out a few non-trivial techniques that we are using to improve model performance, which I think can be generally applied to machine learning solutions:

  1. Remove the effects of seasonality. Weather-induced airline delays are heavily influenced by seasonality; e.g., hurricanes in the fall (h/t Florence) and snowstorms in the winter result in higher occurrences of delays. Machine learning models tend to perform better if you remove these effects and only train on homogeneous data. For the problem at hand, that means using a small window of flight dates to train the model, and then retraining and updating the model in production as time goes on. More about how we’re doing this in the next section.
  2. Collect multiple samples of time-dependent signals. Since the weather is dynamic, we want to include these effects as features in our model. For example, for each flight we query the Dark Sky endpoint at multiple times (say, right at take-off and 3 hours before take-off) and derive features based on the temporal derivatives of the weather signals (a toy version appears in the sketch after this list). Luckily, this was easy to do because the Dark Sky API is so awesome (did I mention that already?).
  3. Balance your classes. Machine learning models are highly susceptible to bias, and one of the biggest causes is class imbalance. How you balance the classes in your training data should be decided on a case-by-case basis to reduce bias and overfitting. A great tool I’ve been using lately for class balancing is imbalanced-learn (also shown in the sketch below).
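To make points 2 and 3 concrete, here’s a minimal sketch: toy “temporal derivative” weather features computed from two queries taken a few hours apart, followed by oversampling the minority delay classes with imbalanced-learn’s SMOTE. The weather keys and the synthetic training data are illustrative, not our actual feature set.

```python
import numpy as np
from imblearn.over_sampling import SMOTE

def weather_delta_features(at_takeoff, hours_before, window_hrs=3.0):
    """Finite-difference ("temporal derivative") weather features from
    two weather queries taken window_hrs apart. The keys are illustrative."""
    keys = ["temperature", "windSpeed", "pressure", "precipIntensity"]
    return np.array(
        [(at_takeoff[k] - hours_before[k]) / window_hrs for k in keys]
    )

# e.g., a falling barometer and rising winds over the 3-hour window
deltas = weather_delta_features(
    {"temperature": 71.0, "windSpeed": 18.0, "pressure": 1003.0, "precipIntensity": 0.12},
    {"temperature": 75.0, "windSpeed": 6.0, "pressure": 1011.0, "precipIntensity": 0.0},
)

# Toy feature matrix and heavily imbalanced labels standing in for the
# real training data (most flights are on time, i.e., class 0).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = rng.choice([0, 1, 2, 3], size=1000, p=[0.85, 0.08, 0.05, 0.02])

# Oversample the minority delay classes so the model isn't biased
# toward predicting everything as "on time".
X_balanced, y_balanced = SMOTE(random_state=0).fit_resample(X, y)
```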

To give some idea of how our model is performing, the figure below shows test performance for a model trained on flights from late August. In the confusion matrix, the 0–3 labels correspond to our delay classes in ascending delay-duration order; i.e., 0 = 0–30 minutes. A weighted f1-score of 0.62 is quite good for a 4-class problem, since random guessing over balanced classes would score about 0.25.

Test performance of the multi-class flight delay model using late August data
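The confusion matrix and weighted f1-score are one-liners in scikit-learn; for reference, a toy sketch with stand-in labels for the four classes:

```python
from sklearn.metrics import confusion_matrix, f1_score

# Stand-in held-out labels and predictions over the four delay classes.
y_true = [0, 0, 0, 1, 1, 2, 2, 3, 0, 1]
y_pred = [0, 0, 1, 1, 0, 2, 1, 3, 0, 1]

print(confusion_matrix(y_true, y_pred))              # 4x4 matrix, classes 0-3
print(f1_score(y_true, y_pred, average="weighted"))  # weighted f1-score
```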

The above metrics provide an idea of how good the model is at predicting the magnitude of a delay. If we “collapse” the delay classes (1–3) into a single delay class and present the above results as if they came from a binary classifier, we can examine the model’s ability to predict any delay. An ROC-AUC score of 0.83 for a binary classification problem of this complexity is pretty good! However, looking at the confusion matrix, there are many more false negatives than false positives. This isn’t ideal for predicting flight delays, as we’d like to be overly aggressive in notifying a customer of a possible delay (i.e., skew towards more false positives). In the future, it may make sense to optimize the model for a weighted recall that puts a higher penalty on misclassifying delayed flights as on time.

“Collapsed” test performance of the multi-class flight delay model using late August data
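One way to do this collapse, assuming the model emits class probabilities: treat 1 minus the “on time” probability as the delay score and hand that to scikit-learn’s ROC-AUC. The probabilities below are stand-ins.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Stand-in predicted class probabilities from the multi-class model;
# columns are the four delay classes in ascending order.
proba = np.array([
    [0.9, 0.05, 0.03, 0.02],
    [0.2, 0.5, 0.2, 0.1],
    [0.7, 0.2, 0.05, 0.05],
    [0.1, 0.3, 0.4, 0.2],
])
y_true = np.array([0, 1, 0, 3])

# Collapse classes 1-3 into a single "delayed" class.
y_binary = (y_true > 0).astype(int)
delay_score = 1.0 - proba[:, 0]  # probability of any delay

print(roc_auc_score(y_binary, delay_score))
```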

In testing the model on real-time data where we don’t know the exact cause of the delay, we have seen precision and recall scores around 0.4–0.5. In addition, we have been able to predict delays as far as 24 hours prior to the scheduled departure time! This is because we are relying on Dark Sky’s ability to forecast the weather, which is often very accurate.

As previously mentioned, we are handling the seasonality effects of weather-induced flight delays by only training the model on small windows of flight dates. As such, the model that is used in our production API for predicting flight delays must be retrained constantly. With the help of our amazing SRE team, we built a worker that is scheduled using a cron job to automatically retrain the model and store the best result as a pickle file in an AWS S3 bucket. The retraining workflow and model API look roughly like the block diagram pictured below.

System architecture describing the model training worker and flight delay model API
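In spirit, the worker body is just “fit, pickle, upload.” Here’s a hedged sketch using boto3; the bucket, key, and model type are placeholders, not our actual configuration.

```python
import pickle
import boto3
from sklearn.ensemble import GradientBoostingClassifier

# Names here (bucket, key, model choice) are illustrative placeholders.
BUCKET = "example-model-artifacts"
KEY = "delay-predictor/latest.pkl"

def retrain_and_publish(X_train, y_train):
    """Worker body: retrain on a recent window of flights, then
    store the fitted model as a pickle file in S3."""
    model = GradientBoostingClassifier()
    model.fit(X_train, y_train)

    with open("/tmp/model.pkl", "wb") as f:
        pickle.dump(model, f)

    boto3.client("s3").upload_file("/tmp/model.pkl", BUCKET, KEY)

def load_latest_model():
    """The model API loads the latest artifact at startup."""
    boto3.client("s3").download_file(BUCKET, KEY, "/tmp/model.pkl")
    with open("/tmp/model.pkl", "rb") as f:
        return pickle.load(f)
```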

One of the most important steps in this workflow is hyperparameter tuning of the different model architectures. We tune several different types of models using RandomizedSearchCV or hyperopt (depending on the type of model) and then pick the one that gives the best performance. It’s important to note that with any automated machine learning pipeline, logging the inputs and outputs of each step of the process is crucial to monitoring the overall health of the system. We log our results in Splunk and also post the outcome of the training process to a Slack channel.
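As a rough sketch of the RandomizedSearchCV half of that step (the model type, search space, and toy data are illustrative, not our real configuration):

```python
import numpy as np
from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Stand-in training data; in production this would be the current
# window of flights. hyperopt handles the model types that don't fit
# neatly into scikit-learn's search API.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(500, 4)), rng.integers(0, 4, size=500)

# Illustrative search space for one of the candidate model types.
param_distributions = {
    "n_estimators": randint(100, 500),
    "max_depth": randint(3, 15),
    "min_samples_leaf": randint(1, 10),
}

search = RandomizedSearchCV(
    RandomForestClassifier(),
    param_distributions=param_distributions,
    n_iter=25,
    cv=3,
    scoring="f1_weighted",
    random_state=0,
)
search.fit(X, y)

# Log the winning configuration and score (we push these to Splunk
# and a Slack channel).
print(search.best_params_, search.best_score_)
```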

We are actively working to extend the flight delay model to other types of delays. Late-arriving aircraft delays and National Airspace System (NAS) delays are two types that we believe can have a big impact on the performance of our model. We are pursuing an approach that combines the on-time performance data with the FlightAware API to build separate models for these delay types. Once built, we plan to ensemble them with the weather model, with the goal of not just predicting a delay’s magnitude, but also explaining its cause.

Interested in trying out the Delay Predictor? Check it out here and let us know what you think!

If you’d like to learn more about Upside Corporate, click here! Even better, if you’d like to work with Ian & join our team, visit our team page here.