Dropbox traffic infrastructure: Edge network

By Oleg Guba and Alexey Ivanov

In this post we will describe the Edge network part of Dropbox traffic infrastructure. This is an extended transcript of our NginxConf 2018 presentation. Around the same time last year we described low-level aspects of our infra in the Optimizing web servers for high throughput and low latency post. This time we’ll cover higher-level things like our points of presence around the world, GSLB, RUM DNS, L4 loadbalancers, nginx setup and its dynamic configuration, and a bit of gRPC proxying.

Dropbox scale

Dropbox has more than half a billion registered users who trust us with exabytes of data and petabytes of corresponding metadata. For the Traffic team this means millions of HTTP requests and terabits of traffic. To support all of that we’ve built an extensive network of points of presence (PoPs) around the world that we call Edge.

Why do we need Edge?

Many of the readers know the basic idea of CDNs: terminating TCP and TLS connections closer to the user leads to improved user experience because of greatly reduced latencies. Here is a quick reminder: a comparison of an HTTP request latency made directly to a data center and the same request made through a PoP:

Numbers are given based on:
20ms PoP↔user latency, 150ms PoP↔DC latency, 100ms server execution time

As you can see, by simply putting the PoP close to the user one can improve latency by more than factor of two.

But that is not all. Our users benefit greatly from faster file uploads and downloads. Let’s see how latency affects TCP congestion window (CWND) growth during file uploads:

Here we can see a low and high latency connection comparison that represents a connection with and without a PoP respectively:

  • The client with lower latency is progressing through slow start faster, because it is an RTT-bound process
  • This client also settles on the higher threshold for congestion avoidance, because of the lower level of packet loss. This happens because that packets need to spend less time on the public, and possibly congested, internet links—once they hit the PoP we take them into our own backbone network

Aside from all latency-related performance gains, building our own Edge network gives us (traffic engineers) a lot of freedom: for example we can easily experiment with new low- and high-level protocols, external and internal loadbalancing algorithms, and more closely integrate with the rest of the infrastructure.
We can do things like research BBR congestion control effects on file download speed, latency, and packet loss.

Picking PoP locations

As of today, Dropbox has 20 PoPs around the world:

We’ve just announced our PoP in Toronto, and will get two more in Scandinavia by the end of the year.

In 2019, we are planning to look at increasing our Edge footprint by researching the viability of PoPs in LATAM, Middle East, and APAC.

The process of PoP selection, which was easy at first, now becomes more and more complicated: we need to consider backbone capacity, peering connectivity, submarine cables, but most importantly the location with respect to all the other PoPs we have.

The current PoP selection procedure is human guided but algorithm-assisted. Even with a small number of PoPs without assistive software it may be challenging to choose between, for example, a PoP in Brazil and a PoP in Australia. The problem persists as the number of PoPs grows: e.g. what location will benefit Dropbox users better, Vienna or Warsaw?

We try to alternate new PoP placement between selecting the most advantageous PoP for the existing and potential Dropbox users.

A tiny script helps us brute-force the problem by:

  1. Splitting the Earth into 7th level s2 regions
  2. Placing all the existing PoPs
  3. Computing the distance to the nearest PoP for all the regions weighted by “population”
  4. Doing exhaustive search to find the “best” location for the new PoP
  5. Adding it to the map
  6. Looping back to step 3, etc.

By “population” one can use pretty much any metric we want to optimize, for example total number of people in the area, or number of existing/potential users. As for the loss function to determine the score of each placement one can use something standard like L1 or L2 loss. In our case we try to overcompensate for the effects of latency on the TCP throughput.

Some of you may see that the problem here that can be solved by more sophisticated methods like Gradient Decent or Bayesian Optimization. This is indeed true, but because our problem space is so small (there are less than 100K 7th level s2 cells) we can just brute-force through it and get a definitively optimal result instead of the one that can get stuck on a local optimum.

GSLB

Let’s start with the most important part of the Edge—GSLB. GSLB is responsible for loadbalancing users across PoPs. That usually means sending each user to the closest PoP, unless it is over capacity or under maintenance.

GSLB is called the “most important part” here because if it misroutes users to the suboptimal PoPs frequently, then it makes the Edge network useless, and potentially even harms performance.

The following is a discussion of commonly used GSLB techniques, their pros and cons, and how we use them at Dropbox.

BGP anycast

Anycast is the easiest loadbalancing method that relies on the core internet routing protocol, BGP. To start using anycast it is sufficient to just start advertising the same subnet from all the PoPs and internet will deliver packet to the “optimal” one automagically.

Even though we get automatic failover and simplicity of the setup, anycast has many drawbacks, so let’s go over them one by one.

Anycast performance

Above, we mentioned that BGP selects the “optimal” route and for the most part that is true. The problem is that BGP does not know anything about link latency, throughput, packet loss, and so on. Generally in the presence of multiple routes to the destination, it just selects one with the least number of hops.

Anycast-based loadbalancing is mostly optimal but it behaves poorly on high percentiles.

This is true for a small and medium number of PoPs. But there is a conjecture that “critical” misrouting probability (e.g. probability of routing user to a different continent) in an anycasted network drops sharply with number of PoPs. Therefore it is possible that with increasing number of PoPs, anycast may eventually start outperforming GeoDNS. We’ll continue looking at how our anycast performance scales with the number of PoPs.

Traffic steering
With anycast, we have very limited control over traffic. It is hard to explicitly move traffic from one PoP to another. We can do some traffic steering using MED attributes, prepending AS_PATHs to our announces, and by explicitly communicating with traffic providers, but this is not scalable.

Also note that in the N WLLA OMNI mnemonic AS_PATH is somewhere in the middle. This effectively means that it can be easily overridden by an administrator and in practice this makes BGP anycast pick the “cheapest” route, not the “nearest” or the “fastest.”

Another property of anycast is that graceful drain of the PoP is impossible—since BGP balances packets and not connections. After the routing table change, all inflight TCP sessions will immediately be routed to the next best PoP and users will get an RST from there.

Troubleshooting
Generally reasoning about traffic routing with anycast becomes very non-trivial, since it involves the state of internet routing at a given time. Troubleshooting performance issues with anycast is hard and usually involves a lot of traceroutes, looking glasses, and back and forth communication with providers along the way.

Note that, as in the case of a PoP drain, any connectivity change in the internet has a possibility of breaking users’ connections to anycasted IP addresses. Troubleshooting intermittent connection issues due to internet routing changes or faulty/misconfigured hardware can be challenging.

Here is an example of an epic anycast troubleshooting by Fastly NOC from NANOG mailing list: Service provider story about tracking down TCP RSTs.
TL;DR SYNs passing through the router had different TTL and at the same time this IP field was used for the ECMP flow hashing.

Tools
Here are couple of tricks you can use to make troubleshooting a bit easier (especially in case of anycast).

Of course having a random request ID associated with every request that goes through the system and can be traced in the logs is a must. In case of the Edge, it is also helpful to echo back a header with the name of the PoP you’re connected to (or embed this into the unique request ID).

Another useful thing that is commonly used is to create “debug” sites that can pre-collect all the troubleshooting data for the user so that they can attach it to the support ticket e.g.: github-debug.com, fastly-debug.com, and of course dropbox-debug.com, which was heavily inspired by them.

Traffic team projects like dropbox-debug, Brotli static precompression, BBR evaluation and rollout, RUM DNS, and many others came out of Hack Week: a company-wide event that inspires us to try something new and exciting!

Anycast at Dropbox
With all that said, we still use anycast for our APEX domains like dropbox.com (without www) and as a fallback in case of major DDoS attacks.

GeoDNS

Let’s talk about another common solution for implementing GSLB: GeoDNS. In this approach each PoP has its own unique unicast IP address space and DNS is responsible for handing off different IP addresses to different users based on their geographical location.

This gives us control over traffic steering and allows graceful drain. It is worth mentioning that any kind of reasoning about unicast-based setup is much easier, therefore troubleshooting becomes simpler.

As you can see, there are a lot of variables involved: we rely on a DNS provider guessing user IP by their DNS resolver (or trust EDNS CS data), then guessing user location by their IP address, then approximate physical proximity to latency.

Note that different DNS providers will likely end up with different decisions, based on their algorithms and quality of their GeoIP database, therefore monitoring performance of multi-provider DNS setup is much harder.

Aside from that, DNS also has a major problem with stale data. Long story short: DNS TTL is a lie. Even though we have TTL of one minute for www.dropbox.com, it still takes 15 minutes to drain 90% of traffic, and it may take a full hour to drain 95% of traffic:

Here we also need to mention the myriad embedded devices using Dropbox API that range from video cameras to smart fridges which have a tendency of resolving DNS addresses only during power-on.

GeoDNS at Dropbox

Our DNS setup evolved quite a bit over last few years: we started with a simple continent→PoP mappings, then switched to country→PoP with a per-state mapping data for serving network traffic to large countries like the US, Canada, etc. At the moment, we are juggling relatively complex LatLong-based routing with AS-based overrides to work around quirks in internet connectivity and peering.

Hybrid unicast/anycast GSLB

Let’s very briefly cover one of the composite approaches to GSLB: hybrid unicast/anycast setup. By combining unicast and anycast announces along with GeoDNS mapping, one can get all the benefits of unicast along with an ability to quickly drain PoPs in case of an outage.

One can enable this hybrid GSLB by announcing both PoP’s unicast subnet (e.g. /24) and one of its supernets (e.g. /19) from all of the PoPs (including itself).

This implies that every PoP should be set up to handle traffic destined to any PoP: i.e. have all the VIPs from all the PoPs in the BGP daemons/L4 balancers/L7 proxies configs.

Such an approach gives us the ability to quickly switch between unicast and anycast addresses and therefore immediate fallback without waiting for DNS TTL to expire. This also allows graceful PoP draining and all the other benefits of DNS traffic steering. All of that comes at a relatively small operational cost of a more complicated setup and may cause scalability problems once you reach the high thousands of VIPs. On the bright side, all PoP configs now become more uniform.

Real User Metrics

All the GSLB methods discussed up until now have one critical problem: none of them uses actual user-perceived performance as a signal, but instead rely on some approximations: BGP uses number of hops as a signal, while GeoIP uses physical proximity. We want to fix that by using Real User Metrics (RUM) collection pipeline based on performance data from our desktop clients.

Companies that do not have an app usually do latency measurements with the JS-based prober on their website.

Years ago we invested in an availability measurement framework in our Desktop Clients to help us estimate the user-perceived reliability of our Edge network. The system is pretty simple: once in a while a sample of clients run availability measurements against all of our PoPs and report back the results. We extended this system to also log latency information, which gave us sufficient data to start building our own map of the internet.

We also built a separate resolver_ip→client_ip submap by joining DNS and HTTP server logs for http requests to random subdomain of a wildcard DNS record. On top of which we apply a tiny bit of post-processing for EDNS ClientSubnet-capable resolvers.

We combine the aggregated latencies, resolver_ip→client_ip map, BGP fullview, peering information, and capacity data from our monitoring system to produce the final map of client_subnet→PoP.

We are also considering adding a signal from the web server logs, since we already have TCP_INFO data, including number of retransmits, cwnd/rwnd, and rtt.

After which we pack this map into a radix tree and upload it to a DNS server, after which it is compared to both anycast and GeoIP solutions.

Specifics of map generation are up in the air right now: we’ve tried (and continue trying out) different approaches: from simple HiveQL query that does per-/24 aggregation to ML-based solutions like Random Forests, stacks of XGBoosts, and DNNs. Sophisticated solutions are giving slightly better, but ultimately comparable results, at the cost of way longer training and reverse engineering complexity.
At least for now, we are sticking with the solution that is easier to reason about and easier to troubleshoot.

Data anonymization and aggregation
We anonymize and aggregate all latency and availability data by /24 subnet in case of IPv4 and /56 in case of IPv6. We don’t operate directly on real user IPs and enforce strict ACL and retention policies for all RUM data.

Data cleanup
Data cleanup is a very important step in the map data pipeline. Here are couple of common patterns that we’ve found during our map construction:

  • Standard GetTickCount64 timer on Windows is quantized by around 16ms. In our Python client we’ve switched to time.perf_counter().
  • TCP and HTTP probes are way less reliable than HTTPS. This is mostly due to IP and DNS hijacking in the wild. Good examples of this are Wi-Fi captive portals.
  • Even unique DNS requests can be received multiple times. Both due to lost responses and proactive cache refreshes by some DNS servers even unique queries like UUID4.perf.dropbox.com can be duplicated. We take that into account when joining the HTTP and DNS logs.
  • And of course there are all kinds of weird timing results from negative and submicrosecond results to ones that are older than our universe (they probably came from another one).

Data extrapolation
Currently we use the following techniques for speculatively expanding the resulting map:

  • If all the samples for an AS end up in the same “best” PoP we consider that all IP ranges announced by that AS should go to that PoP.
  • If AS has multiple “best” PoPs then we break it down into announced IP ranges. For each one we assume that if all measurements in a range end up at the same PoP we can extrapolate that choice to the whole range.

This technique allows us to double our map coverage, make it more robust to changes, and generate a map using a smaller dataset.

Troubleshooting DNS map
Once a RUM-based map is constructed, it is crucial to be able to estimate how good it is by using a single value, something like an F1 score used for binary classification or BLEU score used for evaluating machine translation. That way, one can not only automatically prevent bad maps from going live, but also numerically compare the quality of different map iterations and construction algorithms.

Another common approach for the map evaluation is to test it against the subset of data that training process did not see.

For interactive slicing and dicing of data and ad-hoc troubleshooting, we map subnets back into the lat/long coordinates, aggregate their stats by h3 regions and then draw them with kepler.gl. This is very helpful to quickly eyeball maps that have low scores.

We went with h3 here instead of s2 because Kepler has built-in support for it, and generally h3 has simpler Python interface, therefore making it easier for us to experiment with visualizations.
Whisper: also hexagons look cooler =)

The same approach can be used for visualizing current performance, week-over-week difference, difference between GeoIP database versions, and much more.

You can clearly see here where our PoPs are located. All of the big patches of blue and violet colors are getting their PoPs later this year or next year.

Another way of visualizing IP maps is to skip the whole GeoDNS mapping step and plot IP addresses on the 2D plane by mapping them on a space filling curve, e.g. Hilbert curve. One can also place additional data in the height and color dimensions. This approach will require some heavy regularization for it to be consumable by humans and even more ColorBrewer2 magic to be aesthetically pleasing.

RUM DNS at Dropbox
RUM-based DNS is an actively evolving project, and we have not shipped it to our main VIPs yet, but the data we’ve collected from our GSLB experiments shows that it is the only way we can properly utilize more than 25-30 PoPs. This project will be one of our top priorities in 2019, because even metrics collected from an early map prototypes show that it can improve effectiveness of our Edge network by up to 30% using RUM DNS.

It will also provide all the byproducts needed for the Explicit Loadbalancer… Speaking of which…

Explicit loadbalancing

A quick note about another more explicit way of routing users to PoPs. All these dances with guessing users’ IP address based on their resolver, GeoIP effectiveness, optimality of decisions made by BGP, etc. are all no longer necessary after a request arrives at the PoP. Because at that point in time, we know the users’ IP and even have an RTT measurement to them. At that point, we can route users on a higher level, like for example embedding a link to a specific PoP in the html, or handing off a different domain to a desktop client trying to download files.

The same IP→PoP map that was constructed for RUM DNS can be reused here, now exposed as an RPC service.

This loadbalancing method allows for a very granular traffic steering, even based on per-resource information, like user ID, file size, and physical location in our distributed storage. Another benefit is almost immediate draining of new connections, though references to resources that were once given out may live for extended periods of time.

Very complex schemes can be invented here, for example we can hand off whole URLs that in the domain name embed information for external DNS-based routing and at the same time embed information for internal routing inside path/queryargs, that will allow PoP to make more optimal routing decision. Another approach is to put that additional data as an opaque blob into the encrypted/signed cookie.

All of these possibilities sound exciting, so care must be taken to not overcomplicate the system.