Two years ago, my colleague and I came across TensorFlow, and despite having no idea what backpropagation or hidden layers were, we decided it would be cool to learn how to build machine learning web services. We found an abundance of resources for learning the basics of training machine learning models, but less information about deploying models as scalable web services. There wasn’t a clear path from Jupyter notebook to production.
In retrospect, this isn’t surprising, because working with TensorFlow, PyTorch, or other machine learning frameworks requires a very different skill set from dealing with Docker, Kubernetes, NVIDIA drivers, and various AWS services.
Without the right infrastructure, models can take weeks instead of minutes to go from laptop to cloud, request latencies can be too high to provide an acceptable user experience, and production workloads can incur massive compute costs.
We realized that big tech companies like Uber, Netflix, and Spotify have in-house machine learning infrastructure teams to empower their data scientists and engineers to deploy machine learning models in production. We’re building Cortex, an open-source, cloud-native platform for scaling real-time inference APIs, so that any developer at any company can deploy machine learning models in production.
Scaling machine learning inference is hard
Our goal is not to reinvent the wheel but to use as much existing technology as possible to solve the problem. There are, however, some unique challenges in scaling machine learning inference that require Cortex to work differently from other deployment platforms.
Let’s assume you want to deploy OpenAI’s 1.5B parameter GPT-2 as a web service to add text generation functionality to your app:
GPT-2 is compute hungry: It may utilize a CPU at nearly 100% for several minutes to return one paragraph of text. That kind of latency may be annoying while testing, but in production users frequently abandon websites with seconds of latency, let alone minutes.
GPT-2 is memory hungry: Besides CPU, GPT-2 needs a lot of RAM for a single inference. If your underlying web server can’t provide enough RAM, latency gets even worse or the API may crash.
GPT-2 is >5GB: Just loading it into memory takes a while, so naive approaches to updating a live web service could result in minutes of downtime on every update.
GPT-2 in production is expensive: You may need to deploy more servers than you have concurrent users if each user is making several requests per minute.
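To make the cost point concrete, here is a back-of-the-envelope sketch. It assumes each server handles one request at a time and uses made-up traffic numbers; the function name `servers_needed` is ours for illustration and is not part of Cortex.

```python
import math


def servers_needed(concurrent_users: int,
                   requests_per_user_per_minute: float,
                   seconds_per_request: float) -> int:
    """Servers required if each server handles one request at a time."""
    # Average requests in flight = arrival rate * time spent per request
    arrival_rate_per_second = concurrent_users * requests_per_user_per_minute / 60
    in_flight = arrival_rate_per_second * seconds_per_request
    return math.ceil(in_flight)


# Hypothetical numbers: 10 users, 3 requests per minute each,
# and 120 seconds of CPU time per request.
print(servers_needed(10, 3, 120))  # 60
```

Under these assumptions, 10 users need 60 servers, or six servers per user. Cutting the time per request is what brings that ratio down.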
Cortex makes scaling machine learning inference easy
Cortex is a platform for deploying machine learning models as production web services. It is designed specifically for running real-time inference at scale. Autoscaling, CPU and GPU support, and spot instance support allow you to run large inference workloads without racking up huge AWS bills. Rolling updates, log streaming, and prediction monitoring enable rapid iteration while minimizing downtime. Support for multiple frameworks, combined with minimal configuration, makes Cortex clusters and deployments easy to launch and maintain.
Production workloads aren’t always predictable, so Cortex automatically scales your prediction APIs up to handle peak traffic without high latency, and scales them down when traffic drops to reduce your AWS bill.
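The general idea behind this kind of scaling can be sketched in a few lines. This is a generic target-based rule, not Cortex’s actual autoscaling algorithm; the function name, parameters, and default limits are all assumptions made for illustration.

```python
import math


def desired_replicas(in_flight_requests: int,
                     target_per_replica: int,
                     min_replicas: int = 1,
                     max_replicas: int = 10) -> int:
    """Replica count that keeps each replica at or below its target load."""
    needed = math.ceil(in_flight_requests / target_per_replica)
    # Clamp so the API never drops below the minimum or exceeds the budget
    return max(min_replicas, min(max_replicas, needed))


# Hypothetical load levels, with each replica targeting 4 in-flight requests
for load in (0, 7, 40, 500):
    print(load, desired_replicas(load, target_per_replica=4))
```

The upper bound is what keeps a traffic spike from turning into an unbounded bill, and the lower bound keeps at least one replica warm when traffic is quiet.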
CPU and GPU support
Cortex web services can run on CPUs, GPUs, or both. While CPUs get the job done for simple models, GPUs are necessary to run large deep learning models fast enough to provide API responses in real time without compromising the end user experience.
Machine learning inference can get expensive fast because it can be so compute intensive. That being said, spot instances can unlock significant discounts with the caveat that AWS can reclaim the instance at any time. Cortex has built-in fault tolerance so you don’t have to worry.
Suppose you have hundreds of GPU instances serving requests to your users, and now you’ve figured out a way to train a more accurate model. Cortex makes it easy to transition your web service to the new model without affecting its availability or latency.
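For readers new to the idea, a rolling update replaces replicas in small batches so that most of the fleet keeps serving traffic. The toy simulation below shows only the batching logic; it is not how Cortex implements updates, and `rolling_update` and `max_unavailable` are hypothetical names.

```python
from typing import Iterator, List


def rolling_update(replicas: List[str],
                   new_version: str,
                   max_unavailable: int = 1) -> Iterator[List[str]]:
    """Yield the fleet state after each batch is switched to new_version."""
    if max_unavailable < 1:
        raise ValueError("max_unavailable must be at least 1")
    fleet = list(replicas)
    for start in range(0, len(fleet), max_unavailable):
        end = min(start + max_unavailable, len(fleet))
        # In a real system, each new replica would load the model and pass
        # a health check before the old one stops receiving traffic.
        for i in range(start, end):
            fleet[i] = new_version
        yield list(fleet)


# Four replicas on "v1", replaced two at a time with "v2"
for state in rolling_update(["v1"] * 4, "v2", max_unavailable=2):
    print(state)
```

With a model that takes minutes to load, the batch size is a trade-off: smaller batches protect capacity, larger ones finish the update sooner.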
Debugging machine learning models is hard, but seeing the logs in real time can help streamline the process. For example, you can watch the logs to confirm that request payloads are being transformed correctly to match the model’s input schema.
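Here is a minimal sketch of the kind of logging that makes this check possible, using only Python’s standard `logging` module. The `transform` function and the field names are hypothetical; a real model would have its own schema.

```python
import logging
from typing import Any, Dict, List

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("inference")


def transform(payload: Dict[str, Any], expected_fields: List[str]) -> List[float]:
    """Convert a JSON-style payload into the ordered feature list a model expects."""
    missing = [field for field in expected_fields if field not in payload]
    if missing:
        # Logged before raising so the problem shows up in the log stream
        logger.error("payload missing fields: %s", missing)
        raise ValueError(f"missing fields: {missing}")
    features = [float(payload[field]) for field in expected_fields]
    logger.info("transformed payload %s -> %s", payload, features)
    return features


# Hypothetical request: values arrive as strings or ints and become floats
print(transform({"sepal_length": "5.1", "sepal_width": 3}, ["sepal_length", "sepal_width"]))
```

Logging both the raw payload and the transformed features on one line makes a mismatch easy to spot while the API is serving traffic.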
Production web services need to be monitored. For machine learning APIs, it’s especially important to track predictions to ensure that models are performing as expected.
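One simple form of prediction tracking is counting how often each class is predicted, so that a sudden shift in the distribution stands out. The class below is an illustrative sketch, not Cortex’s monitoring implementation, and `PredictionMonitor` is a hypothetical name.

```python
from collections import Counter
from typing import Dict


class PredictionMonitor:
    """Tracks how often each class label is predicted."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()

    def record(self, label: str) -> None:
        self.counts[label] += 1

    def distribution(self) -> Dict[str, float]:
        """Share of predictions per label, or an empty dict if none were recorded."""
        total = sum(self.counts.values())
        if total == 0:
            return {}
        return {label: count / total for label, count in self.counts.items()}


monitor = PredictionMonitor()
for label in ["positive", "positive", "negative", "positive"]:
    monitor.record(label)
print(monitor.distribution())  # {'positive': 0.75, 'negative': 0.25}
```

If a sentiment model that normally predicts a roughly even split starts returning one class almost exclusively, that is worth investigating even when every request succeeds.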
Cortex supports all the Python machine learning frameworks: TensorFlow, PyTorch, scikit-learn, XGBoost, etc. Data scientists and machine learning engineers have different preferences when it comes to the tools they use to build models, and their deployment infrastructure should accommodate all frameworks with a deployment API that’s as close to uniform as possible.
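To illustrate what a uniform deployment API can look like, the sketch below hides a framework-specific inference call behind a single `predict` method. The `Predictor` protocol, `CallablePredictor`, and `handle_request` are hypothetical names chosen for this example and should not be read as Cortex’s actual interface.

```python
from typing import Any, Callable, Dict, List, Protocol


class Predictor(Protocol):
    """The one method the serving layer relies on, whatever the framework."""

    def predict(self, payload: Dict[str, Any]) -> Any: ...


class CallablePredictor:
    """Wraps any framework's inference call behind the same predict() method."""

    def __init__(self, infer: Callable[[List[float]], Any]) -> None:
        self._infer = infer

    def predict(self, payload: Dict[str, Any]) -> Any:
        return self._infer(payload["features"])


def handle_request(predictor: Predictor, payload: Dict[str, Any]) -> Dict[str, Any]:
    # The serving code never needs to know which framework is underneath
    return {"prediction": predictor.predict(payload)}


# A plain function stands in for a TensorFlow, PyTorch, or scikit-learn model
sum_model = CallablePredictor(lambda features: sum(features))
print(handle_request(sum_model, {"features": [1, 2, 3]}))  # {'prediction': 6}
```

Swapping frameworks then means writing a new wrapper, while the request handling and everything around it stays the same.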
Configuration should be simple, flexible, and reproducible. cluster.yaml files create predictable clusters and cortex.yaml files create predictable model deployments with minimal verbosity.
Our high-level philosophy is that shipping production machine learning services requires both machine learning and distributed systems expertise. It’s rare to find people who have experience with both. We decided to be opinionated when it comes to infrastructure decisions, and leave all the data science decisions to our users.
Cortex is for production use cases. Some of our users run clusters with hundreds of GPUs to handle their production traffic. That’s hard to do on a laptop.
Cortex can be deployed in your AWS account. That means that you’ll have access to all the instances, autoscaling groups, security groups, and other resources that get provisioned for you when you launch a Cortex cluster. It also means that your machine learning infrastructure spending is fully visible in your AWS billing dashboard.
We get a lot of questions about GCP, Azure, and private cloud support. Right now, we’re focused on AWS because we’re a small team and we really want to get the experience right without spreading ourselves too thin.
Choosing the right AWS compute service wasn’t obvious. We wanted a managed service that didn’t charge a premium as a function of EC2 costs, which ruled out Fargate and SageMaker.
We also wanted to be able to run arbitrary containers with potentially large compute and memory needs, which ruled out Lambda. That left ECS and EKS, and we ultimately chose EKS because building on the Kubernetes APIs opens the door to supporting other cloud providers more easily.