JVM and cache warm-up strategy for high traffic services

By Baptiste Haudegand

One of the main components of our platform is the SSP (short for Supply Side Platform) that automates the selling of advertising spaces.

When a user arrives on a website that includes a tag from Teads, our SSP receives an ad call and sends bid requests to several DSPs (Demand Side Platforms, which automate the buying of inventory) to ask whether they want to place an ad on this opportunity. DSPs respond with an ad and a bid price, and the winner of the auction is displayed. The whole process has to be executed in a few hundred milliseconds.

To give you an idea of our scale, we reach 1.2 billion users per month. Every second, we generate more than one million bid requests to fill the tens of thousands of slots requesting an ad.

When receiving an ad call, the SSP needs to resolve a real-time bidding auction. In order to match our advertisers’ and publishers’ requirements, we have to get data about the auction context, which includes user data, page data, and also data coming from 3rd party partners.

Without a good cache strategy, this hydration phase could take a while and lead to a timeout. Next, the auction resolution needs to be executed as fast as possible. This step mainly consists of filtering bids according to the requirements and electing the winning bid.
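To make the auction resolution step more concrete, here is a minimal sketch of "filter, then elect". It is not our production logic: the `Bid` fields, the floor price, and the blocked categories are hypothetical requirements chosen for illustration, and the winner is simply the highest eligible bid.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Bid:
    dsp: str        # hypothetical DSP identifier
    price: float    # bid price returned by the DSP
    category: str   # hypothetical ad category used for filtering


def resolve_auction(bids, floor_price, blocked_categories) -> Optional[Bid]:
    """Filter bids against the requirements, then elect the highest one."""
    eligible = [
        b for b in bids
        if b.price >= floor_price and b.category not in blocked_categories
    ]
    # max() with default=None returns None when no bid is eligible.
    return max(eligible, key=lambda b: b.price, default=None)


bids = [
    Bid("dsp-a", 2.10, "retail"),
    Bid("dsp-b", 3.40, "gambling"),    # filtered out: blocked category
    Bid("dsp-c", 0.80, "travel"),      # filtered out: below the floor
    Bid("dsp-d", 2.75, "automotive"),
]
winner = resolve_auction(bids, floor_price=1.00, blocked_categories={"gambling"})
```

Here `winner` is the bid from `dsp-d`: the highest bid overall is rejected by the category filter before the election happens.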

The schema below shows the benefits of warming up the hydration and auction resolution logic.

SSP ad call timeline schema, showing the benefits of warming up

Our SSP is built in Scala and based on a zero downtime architecture. As with any high-traffic application, we frequently perform scale-out operations and need to have new instances ready to receive traffic within a few minutes. This is critical, as a cold instance will generate timeouts and lead to business losses.

Our use case also involves sudden increases in traffic, so we need to anticipate cases where a newly spawned instance is not able to enter the production pool.

SSP requests/min showing the traffic variation

When a JVM-based app is launched, the first requests it receives are generally significantly slower than the average response time. This warm-up effect is usually due to class loading and bytecode interpretation at startup.

After about 10k iterations, the main code path has been compiled by the JIT compiler and is “hot”. As previously mentioned, the SSP makes a lot of calls to external services. We use caches to avoid unnecessary calls for the same data. But while our application starts, and until these caches are all filled up, we experience high latencies.
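The effect of a cold cache can be illustrated with a minimal cache-aside sketch. The `CachedLookup` class and `fetch_placement` function are hypothetical stand-ins, not our actual cache layer, and expiry is left out for brevity.

```python
class CachedLookup:
    """Cache-aside wrapper around a slow external lookup."""

    def __init__(self, fetch):
        self._fetch = fetch         # function that calls the external service
        self._cache = {}
        self.external_calls = 0     # counter that makes cold vs. warm visible

    def get(self, key):
        if key not in self._cache:
            self.external_calls += 1
            self._cache[key] = self._fetch(key)
        return self._cache[key]


def fetch_placement(placement_id):
    # Stand-in for a network call to an external service.
    return {"placement_id": placement_id, "format": "video"}


placements = CachedLookup(fetch_placement)

# Cold cache: each distinct key costs one external call.
for pid in [1, 2, 3]:
    placements.get(pid)
cold_calls = placements.external_calls

# Warm cache: the same keys are now served from memory.
for pid in [1, 2, 3]:
    placements.get(pid)
warm_calls = placements.external_calls - cold_calls
```

The first pass pays one external call per key, while the second pass pays none. A freshly started instance is in the first situation for every key it sees.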

To make the most out of our application, we need to exercise latency-sensitive code paths and fill its caches before it enters the pool of production instances.

However, implementing a warm-up strategy can be dangerous for use cases that call a lot of third parties. If not performed with caution, warm-ups can generate many side-effects:

  • Sending non-production traffic to external tech platforms can trigger behavioral changes from them,
  • From an internal point of view, this can have a lot of undesired impacts on Analytics.

To avoid these potential issues during the warm-up sequence, we either sample our calls to external dependencies or don’t call them at all.
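One possible way to express this sampling is a small gate placed in front of every outgoing call. This is a sketch under assumptions: `ExternalCallGate`, `call_partner`, and the sample rates are hypothetical names and values, not our implementation.

```python
import random


class ExternalCallGate:
    """Decides whether an outgoing call to a third party may be made."""

    def __init__(self, sample_rate, rng=None):
        self.sample_rate = sample_rate    # fraction of warm-up calls let through
        self.warming_up = True
        self._rng = rng or random.Random()

    def allow(self):
        if not self.warming_up:
            return True                   # production traffic: always call
        # random() is in [0, 1), so a rate of 0.0 blocks every warm-up call.
        return self._rng.random() < self.sample_rate


def call_partner(gate, send, payload):
    """Send the payload, or return None when the gate suppresses the call."""
    if gate.allow():
        return send(payload)
    return None


sent = []
gate = ExternalCallGate(sample_rate=0.0)      # "don't call them at all"
call_partner(gate, sent.append, "warm-up bid request")
gate.warming_up = False                       # instance joined the pool
call_partner(gate, sent.append, "production bid request")
```

After this run, `sent` only contains the production request. A dependency that benefits from some warm-up traffic could use a small non-zero `sample_rate` instead.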

Each request we receive contains information about the route called, including:

  • Some user information,
  • The ID of the ad placement, which triggers a process to load the data linked to it,
  • Prediction information,
  • etc.

We need to consider data freshness to warm up our application efficiently. It would be useless to warm up instances with out-of-date values.

We first built a process that generates proper warm-up logs. Thankfully, we use an ELB (Elastic Load Balancer) to dispatch requests to the registered instances, so all the logs we need are already stored on S3 (Simple Storage Service).

We built a service called the Cache Feeder containing a job that streams these logs every 10 minutes and:

  • Parses the data,
  • Filters out unneeded requests: we only need to warm up specific routes, so requests to routes that are not on ad delivery critical paths are ignored,
  • Stores the results in Redis.

The Cache Feeder streams ELB logs until we have 10 000 valid requests. This dataset is renewed every 10 minutes.
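The parse, filter, and store steps above can be sketched as follows. This is an illustration rather than the Cache Feeder itself: it assumes each log line contains the request as a quoted "METHOD URL PROTOCOL" field, the routes in `WARM_UP_ROUTES` are hypothetical, and the returned list stands in for what would be written to Redis.

```python
import re
from urllib.parse import urlsplit

# Hypothetical routes considered to be on the ad delivery critical path.
WARM_UP_ROUTES = frozenset({"/ad", "/bid"})
MAX_REQUESTS = 10_000

# Assumes the request is logged as a quoted "METHOD URL PROTOCOL" field.
REQUEST_FIELD = re.compile(r'"([A-Z]+) (\S+) [^"\s]+"')


def parse_line(line):
    """Return (method, path, query) for a log line, or None if unparsable."""
    match = REQUEST_FIELD.search(line)
    if match is None:
        return None
    method, url = match.groups()
    parts = urlsplit(url)
    return method, parts.path, parts.query


def build_dataset(lines, limit=MAX_REQUESTS):
    """Stream log lines and keep valid requests until the limit is reached."""
    dataset = []
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue                      # malformed line: skip
        method, path, query = parsed
        if path not in WARM_UP_ROUTES:
            continue                      # not on a critical path: skip
        dataset.append(f"{method} {path}?{query}" if query else f"{method} {path}")
        if len(dataset) >= limit:
            break                         # stop streaming once we have enough
    return dataset


sample_lines = [
    '2024-01-01T00:00:00.000000Z my-elb 203.0.113.7:2817 10.0.0.1:80 0.00007 0.001 '
    '0.00005 200 200 0 29 "GET http://ssp.example.com:80/ad?placement=42 HTTP/1.1" '
    '"Mozilla/5.0" - -',
    '2024-01-01T00:00:01.000000Z my-elb 203.0.113.8:2818 10.0.0.1:80 0.00007 0.001 '
    '0.00005 200 200 0 2 "GET http://ssp.example.com:80/health HTTP/1.1" '
    '"ELB-HealthChecker/1.0" - -',
    "a truncated or malformed line",
]
dataset = build_dataset(sample_lines)
```

Only the first sample line survives: the health check is not on a warm-up route and the last line cannot be parsed. Rebuilding the dataset from the most recent logs on every run is what keeps the replayed requests fresh.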

We then implemented a specific start-up process for our instances. When we spawn a new instance, it first starts different services (Kafka driver, Cassandra driver, etc.). Once ready, it opens an HTTP port.

During the warm-up sequence, the instance calls itself and replays the 10 000 logs over and over for a few minutes. Our ELB calls a specific route on the instance, which returns a “not ready yet” message until the instance is hot and ready to enter the production pool and receive external traffic.
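A minimal sketch of this replay loop and readiness route could look like the following. The `WarmUp` class, the injected `send` and `clock` functions, and the 503/200 status codes are assumptions made for the example. The source only states that the route answers "not ready yet" until the instance is hot.

```python
import itertools
import time


class WarmUp:
    """Replays recorded requests against the local instance for a fixed duration."""

    def __init__(self, requests, send, duration_s, clock=time.monotonic):
        self._requests = requests
        self._send = send              # function that issues one request to localhost
        self._duration_s = duration_s
        self._clock = clock            # injectable so the loop can be tested
        self.ready = False
        self.replayed = 0

    def run(self):
        deadline = self._clock() + self._duration_s
        # cycle() starts over from the first request once the dataset is exhausted.
        for request in itertools.cycle(self._requests):
            if self._clock() >= deadline:
                break
            self._send(request)
            self.replayed += 1
        self.ready = True

    def health_check(self):
        """Status returned on the route polled by the load balancer."""
        return (200, "ready") if self.ready else (503, "not ready yet")
```

In a real instance, `send` would issue an HTTP request to the local port and `run()` would execute in the background while the health check route keeps answering "not ready yet". Basing the loop on a duration rather than a request count means the dataset is simply replayed as many times as the time budget allows.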

Here is a step-by-step of the process:

SSP Warm-up process
  1. Requests coming from the publisher’s website go through the ELB, which then dispatches them to the SSP instances,
  2. The ELB sends the logs to S3,
  3. The Cache Feeder periodically reads and parses the most recent ELB logs from S3,
  4. Then it writes the parsed ELB logs to Redis,
  5. When new instances pop, we first warm them up with the ELB logs from Redis,
  6. Once the warm-up is finished, the instances are added to the pool of SSP production instances and start to serve production requests.

We had to try different warm-up durations. We finally settled on a warm-up of 2 minutes and 40 seconds and got great results. As you can see below, without any warm-up we observe latency spikes and timeouts during the first minutes.

Deploying an instance with warm-up disabled

With a warmed-up instance, we no longer have timeouts and the large latency spikes disappear.

Deploying an instance with warm-up enabled

The downside of a warm-up is that it adds a small delay before an instance is ready to handle production traffic (5–6 minutes). Traffic can increase a lot in 6 minutes and even outpace what newly spawned instances are able to deal with.

Solving the warm-up issue is only one part of the problem: we also need the right scaling strategy to get new instances on time.

The SSP is CPU-bound so our upscale and downscale strategies are based on a simple CPU threshold. Our initial scaling strategy spawned 3 new instances when the threshold was reached.

But we had situations where this was not enough to cope with sudden increases in traffic, not to mention instances crashing during startup.

SSP’s initial upscale strategy

To solve this, we adopted a much more aggressive approach. We decreased the upscaling threshold and decided to spawn 10 instances at a time instead of 3.

We also adapted the downscale strategy and increased the downscaling threshold so that we can quickly withdraw unnecessary resources (and save on infrastructure costs thanks to Per-Second Billing for EC2 Instances and EBS volumes).

SSP +10 upscale strategy illustration, each upscale is usually followed by a downscale
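The difference between the two strategies can be sketched as a simple threshold rule. Only the step sizes (3 and 10 instances) come from our setup as described above. The CPU thresholds and the downscale step below are hypothetical values chosen for illustration.

```python
def scaling_decision(avg_cpu, up_threshold, down_threshold, up_step, down_step=1):
    """Return the change in instance count for the observed average CPU usage."""
    if avg_cpu >= up_threshold:
        return up_step            # add instances
    if avg_cpu <= down_threshold:
        return -down_step         # withdraw instances
    return 0                      # stay as is


# Hypothetical thresholds: the aggressive policy upscales earlier (lower
# upscaling threshold) and downscales sooner (higher downscaling threshold).
INITIAL = dict(up_threshold=0.70, down_threshold=0.30, up_step=3)
AGGRESSIVE = dict(up_threshold=0.55, down_threshold=0.45, up_step=10)

at_60_percent = (scaling_decision(0.60, **INITIAL), scaling_decision(0.60, **AGGRESSIVE))
at_40_percent = (scaling_decision(0.40, **INITIAL), scaling_decision(0.40, **AGGRESSIVE))
```

With these example values, at 60% CPU the initial policy does nothing while the aggressive one already adds 10 instances, and at 40% CPU only the aggressive policy starts withdrawing capacity. Upscaling earlier leaves time for the new instances to complete their warm-up before they are really needed.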

By combining a warm-up sequence and an aggressive scaling strategy, we no longer have timeouts or difficulty serving requests.

  • Extra attention is needed to avoid critical side effects on external dependencies.
  • The logs used to feed the warm-up must contain fresh data, otherwise the warm-up will be ineffective.
  • There is a tradeoff between a long, effective warm-up and the delay it adds before an instance can serve traffic.
  • In our case, the scaling strategy is critical and has to be adapted to the warm-up duration.
  • An aggressive scaling strategy can generate additional costs. However, by adapting our downscale strategy (CPU threshold) we are able to mitigate this. In the end, it’s a win compared to having instances failing and hence causing revenue losses.
  • It’s a never-ending topic in a growing context. We need to continuously monitor the warm-up to ensure that everything is fine and keep on improving it to reduce its duration.

If you are interested in what we do and how we are organized in the SSP team you can have a look at this previous post:

By Baptiste Haudegand and Thomas Mouron, with the help of Benjamin Davy.