Back in 2015 we deployed ECMP routing - Equal Cost Multi Path - within our datacenters. This technology allowed us to spread traffic heading to a single IP address across multiple physical servers.
You can think about it as a third layer of load balancing.
- First we split the traffic across multiple IP addresses with DNS.
- Then we split the traffic across multiple datacenters with Anycast.
- Finally, we split the traffic across multiple servers with ECMP.
When deploying ECMP we hit a problem with Path MTU discovery. ICMP packets destined for our Anycast IPs were being dropped. You can read more about that (and the solution) in the 2015 blog post Path MTU Discovery in practice.
To solve the problem we created a small piece of software called pmtud (https://github.com/cloudflare/pmtud). Since deploying pmtud, our ECMP setup has been working smoothly.
Hardcoding IPv6 MTU
During that initial ECMP rollout things were broken. To keep services running until pmtud was done, we deployed a quick hack. We reduced the MTU of IPv6 traffic to the minimum possible value: 1280 bytes.
This was done as a tag on a default route. This is how our routing table used to look:
$ ip -6 route show
...
default via 2400:xxxx::1 dev eth0 src 2400:xxxx:2 metric 1024 mtu 1280
Notice the mtu 1280 in the default route.
With this setting our servers never transmitted IPv6 packets larger than 1280 bytes, therefore "fixing" the issue. Since all IPv6 routers must have an MTU of at least 1280, we could expect that no ICMP Packet-Too-Big message would ever be sent to us.
Remember - the original problem introduced by ECMP was that ICMP packets routed back to our Anycast addresses could go to the wrong machine within the ECMP group. We therefore became ICMP black holes: Cloudflare would send large packets, they would be dropped, and an ICMP PTB packet would fly back to us, only to be delivered to the wrong machine because of ECMP.
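To make the failure mode concrete, here is a minimal sketch of flow-hash based server selection. Everything in it is an assumption made for illustration: the hash function, the server names and the documentation-prefix addresses are hypothetical, and real routers use their own hash over similar fields.

```python
import hashlib

# Hypothetical ECMP group behind one Anycast address.
SERVERS = ["server-a", "server-b", "server-c", "server-d"]

def ecmp_pick(src_ip, dst_ip, proto, src_port=0, dst_port=0):
    """Pick a server by hashing the flow tuple, roughly as an ECMP router does."""
    key = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
    digest = hashlib.sha256(key).digest()
    return SERVERS[int.from_bytes(digest[:8], "big") % len(SERVERS)]

# Every TCP segment of one client connection carries the same tuple,
# so the whole connection lands on the same server.
tcp_server = ecmp_pick("2001:db8::c1", "2001:db8:a::1", "tcp", 51000, 443)

# An ICMP PTB is sent by a router on the path: a different source address
# and no ports. Its hash is unrelated to the hash of the TCP flow.
icmp_server = ecmp_pick("2001:db8::ffff", "2001:db8:a::1", "icmpv6")
```

Nothing forces `icmp_server` to equal `tcp_server`, so with several servers in the group the PTB message will usually reach a machine that knows nothing about the connection it refers to.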
But why did this problem not appear for IPv4 traffic? We believe the same issue exists on IPv4, but it's less damaging due to the different nature of the network. IPv4 is more mature and the great majority of end-hosts either support MTU 1500 or have their MSS option well configured - or clamped by some middle box. This is different in IPv6, where a large proportion of users use tunnels, have a Path MTU strictly smaller than 1500 and use incorrect MSS settings in the TCP header. Finally, Linux implements RFC4821 for IPv4 but not IPv6. RFC4821 (PLPMTUD) has its disadvantages, but does slightly help to alleviate the ICMP blackhole issue.
Our "fix" - reducing the MTU to 1280 - was serving us well and we had no pressing reason to revert it.
Researchers did notice though. We were caught red-handed twice:
When small MTU is too small
This changed recently, when we started working on Cloudflare Spectrum support for UDP. Spectrum is a terminating proxy, able to handle protocols other than HTTP. Getting Spectrum to forward TCP was relatively straightforward (barring a couple of awesome hacks). UDP is different.
One of the major issues we hit was related to the MTU on our servers.
During tests we wanted to forward UDP VPN packets through Spectrum. As you can imagine, any VPN would encapsulate a packet in another packet. Spectrum received packets like this:
+---------------------+------------------------------------------------+
| IPv6 + UDP header   | UDP payload encapsulating a 1280 byte packet   |
+---------------------+------------------------------------------------+
It's pretty obvious that our edge servers, supporting IPv6 packets of at most 1280 bytes, won't be able to handle this type of traffic. We are going to need an MTU of at least 1280+40+8 bytes! Hardcoding MTU=1280 in IPv6 may be an acceptable solution if you are an end-node on the internet, but it is definitely too small when forwarding tunneled traffic.
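The arithmetic is small enough to spell out. The sketch below only counts the fixed 40-byte IPv6 header and the 8-byte UDP header from the diagram above; any header the VPN protocol adds on top is ignored, so the real requirement can only be larger. The function name is made up for this example.

```python
IPV6_HEADER = 40  # fixed IPv6 header, in bytes
UDP_HEADER = 8    # UDP header, in bytes

def required_mtu(inner_packet_size):
    """Smallest MTU that carries the inner packet in IPv6 + UDP without fragmentation."""
    return inner_packet_size + IPV6_HEADER + UDP_HEADER

needed = required_mtu(1280)    # 1280 + 40 + 8 = 1328
fits_old_mtu = needed <= 1280  # False: the hardcoded MTU is too small
```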
Picking a new MTU
But what MTU value should we use instead? Let's see what other major internet companies do. Here are a couple of examples of advertised MSS values in TCP SYN+ACK packets over IPv6:
+---- site -----+- MSS --+- estimated MTU -+
| google.com    |  1360  |      1420       |
+---------------+--------+-----------------+
| facebook.com  |  1410  |      1470       |
+---------------+--------+-----------------+
| wikipedia.org |  1440  |      1500       |
+---------------+--------+-----------------+
I believe Google and Facebook adjust their MTU due to their use of L4 load balancers. Their implementations do IP-in-IP encapsulation, so they need a bit of space for the extra header. Read more:
There may be other reasons for having a smaller MTU. A reduced value may decrease the probability of the Path MTU detection algorithm kicking in (i.e. relying on ICMP PTB). We can theorize that for the misconfigured eyeballs:
- MTU=1280 will never run Path MTU detection
- MTU=1500 will always run it.
- In-between values have a chance of hitting the problem that increases with the MTU.
But just what is the chance of that?
A quick unscientific study of the MSS values we encountered from eyeballs shows the following distributions. For connections going over IPv4:
IPv4 eyeball advertised MSS in SYN:
 value |--------------------------------------------------  count  cumulative
  1300 |                                                 *  1.28%  98.53%
  1360 |                                              ****  4.40%  95.68%
  1370 |                                                 *  1.15%  91.05%
  1380 |                                               ***  3.35%  89.81%
  1400 |                                          ********  7.95%  84.79%
  1410 |                                                 *  1.17%  76.66%
  1412 |                                              ****  4.58%  75.49%
  1440 |                                            ******  6.14%  65.71%
  1452 |                                      ************ 11.50%  58.94%
  1460 |************************************************** 47.09%  47.34%
Assuming the majority of clients have MSS configured right, we can say that 89.8% of connections advertised MTU=1380+40=1420 or higher. 75% had MTU >= 1452.
For IPv6 connections we saw:
IPv6 eyeball advertised MSS in SYN:
 value |--------------------------------------------------  count  cumulative
  1220 |                                               ***  4.21%  99.96%
  1340 |                                                **  3.11%  93.23%
  1362 |                                                 *  1.31%  87.70%
  1368 |                                               ***  3.38%  86.36%
  1370 |                                               ***  4.24%  82.98%
  1380 |                                               ***  3.52%  78.65%
  1390 |                                                 *  2.11%  75.10%
  1400 |                                               ***  3.89%  72.25%
  1412 |                                               ***  3.64%  68.21%
  1420 |                                                 *  2.02%  64.54%
  1440 |************************************************** 54.31%  54.34%
On IPv6 87.7% of connections had MTU >= 1422 (1362+60). 75% had MTU >= 1450. (See also: MTU distribution of DNS servers).
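The MSS-to-MTU conversions used in this section all follow the same rule, sketched below. It assumes a 20-byte TCP header and IP headers without options or extension headers (20 bytes for IPv4, 40 for IPv6); the helper names are made up for this example.

```python
TCP_HEADER = 20             # TCP header without options, in bytes
IP_HEADER = {4: 20, 6: 40}  # IPv4 without options, fixed IPv6 header

def estimated_mtu(mss, ip_version):
    """MTU implied by an advertised MSS, assuming minimal headers."""
    return mss + IP_HEADER[ip_version] + TCP_HEADER

def mss_for_mtu(mtu, ip_version):
    """Largest MSS that fits in the given MTU, assuming minimal headers."""
    return mtu - IP_HEADER[ip_version] - TCP_HEADER

ipv4_example = estimated_mtu(1380, 4)  # 1380 + 40 = 1420
ipv6_example = estimated_mtu(1362, 6)  # 1362 + 60 = 1422
minimum_ipv6 = mss_for_mtu(1280, 6)    # 1220
```

The last value shows where the 1220 row in the IPv6 histogram comes from: it is the MSS that corresponds to the minimum IPv6 MTU of 1280.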
Before we move on it's worth reiterating the original problem. Each connection from an eyeball to our Anycast network has three numbers related to it:
- Client advertised MTU - seen in MSS option in TCP header
- True Path MTU value - generally unknown until measured
- Our Edge server MTU - value we are trying to optimize in this exercise
(This is a slight simplification: paths on the internet aren't symmetric, so the path from an eyeball to Cloudflare could have a different Path MTU than the reverse path.)
In order for the connection to misbehave, three conditions must be met:
- Client advertised MTU must be "wrong", that is: larger than True Path MTU
- Our edge server must be willing to send such large packets: Edge server MTU >= True Path MTU
- The ICMP PTB messages must fail to be delivered to our edge server - preventing Path MTU detection from working.
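The conditions can be folded into one small predicate. This is only a sketch with a made-up function name, and it simplifies in two ways: the first two conditions are merged into the size of the largest packet we would actually send, and the comparison is strict because a packet exactly as large as the Path MTU still fits.

```python
def connection_blackholed(advertised_mtu, true_path_mtu, edge_mtu, ptb_delivered):
    """True if large packets on this connection are silently lost."""
    # We never send more than our own MTU or more than the client asked for.
    largest_sent = min(advertised_mtu, edge_mtu)
    too_big_for_path = largest_sent > true_path_mtu
    # A delivered PTB lets Path MTU detection recover, at the cost of a hiccup.
    return too_big_for_path and not ptb_delivered

# Tunnel user advertising 1500 on a path that really carries 1280:
old_setup = connection_blackholed(1500, 1280, edge_mtu=1280, ptb_delivered=False)  # False
ptb_lost = connection_blackholed(1500, 1280, edge_mtu=1400, ptb_delivered=False)   # True
ptb_ok = connection_blackholed(1500, 1280, edge_mtu=1400, ptb_delivered=True)      # False
```

With the edge MTU pinned at 1280 the predicate can never be true for a valid IPv6 path, which is why the old hack worked.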
The last condition could occur for one of several reasons:
- the routers on the path are misbehaving and perhaps firewalling ICMP
- due to the asymmetric nature of the internet the ICMP back is routed to the wrong Anycast datacenter
- something is wrong on our side, for example
In the past we limited our Edge Server MTU value to the smallest possible, to make sure we never encounter the problem. Due to the development of Spectrum UDP support we must increase the Edge Server MTU, while still minimizing the probability of the issue happening.
Finally, relying on ICMP PTB messages for a large fraction of traffic is a bad idea. It's easy to imagine the cost this induces: even with Path MTU detection working fine, the affected connection will suffer a hiccup. A couple of large packets will be dropped before the returning ICMP message gets through and updates the saved Path MTU value. This is not optimal for latency.
In recent days we increased the IPv6 MTU. As part of the process we could have chosen 1300, 1350, or 1400. We chose 1400 because we think it's the next best value to use after 1280. With 1400 we believe 93.2% of IPv6 connections will not need to rely on Path MTU Detection/ICMP. In the near future we plan to increase this value further. We won't settle on 1500 though - we want to leave a couple of bytes for IPv4 encapsulation, to allow the most popular tunnels to keep working without suffering poor latency when Path MTU Detection kicks in.
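The 93.2% figure can be read off the cumulative column of the IPv6 histogram. Here is a rough sketch of that lookup, assuming the cumulative column counts connections advertising that MSS or a larger one. The histogram only lists the most common values, so for a candidate MTU that falls between rows the result is a lower bound rather than an exact share. The function name is made up for this example.

```python
# (MSS, cumulative %) rows copied from the IPv6 histogram, smallest MSS first.
IPV6_CUMULATIVE = [
    (1220, 99.96), (1340, 93.23), (1362, 87.70), (1368, 86.36),
    (1370, 82.98), (1380, 78.65), (1390, 75.10), (1400, 72.25),
    (1412, 68.21), (1420, 64.54), (1440, 54.34),
]

def share_advertising_at_least(edge_mtu):
    """Lower bound on the % of connections advertising an MTU >= edge_mtu."""
    needed_mss = edge_mtu - 60  # 40-byte IPv6 header + 20-byte TCP header
    for mss, cumulative in IPV6_CUMULATIVE:
        if mss >= needed_mss:
            return cumulative
    return 0.0  # no listed row is large enough

candidates = {mtu: share_advertising_at_least(mtu) for mtu in (1300, 1350, 1400, 1500)}
```

In the listed rows the candidates 1300, 1350 and 1400 all resolve to the same 1340 row (93.23%), while 1500 resolves to the 1440 row (54.34%).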
Since the rollout we've been monitoring the Icmp6InPktTooBigs counter:
$ nstat -az | grep Icmp6InPktTooBigs
Icmp6InPktTooBigs               738748             0.0
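If you want to track the same counter from a script, one option is to sample it twice and look at the difference. The sketch below assumes the counter is also exposed in /proc/net/snmp6 as whitespace-separated "name value" lines, which is worth checking on your own kernel. The sample text is made up, apart from the counter value shown above.

```python
def parse_snmp6(text):
    """Parse 'name value' lines into a dict of integer counters."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters

def ptb_delta(before, after):
    """Number of ICMPv6 Packet Too Big messages received between two samples."""
    name = "Icmp6InPktTooBigs"
    return after.get(name, 0) - before.get(name, 0)

# Made-up samples standing in for two reads of /proc/net/snmp6.
SAMPLE_BEFORE = "Icmp6InMsgs                     900000\nIcmp6InPktTooBigs               738748\n"
SAMPLE_AFTER = "Icmp6InMsgs                     900100\nIcmp6InPktTooBigs               738760\n"

delta = ptb_delta(parse_snmp6(SAMPLE_BEFORE), parse_snmp6(SAMPLE_AFTER))  # 12
```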
Here is a chart of the ICMP PTB packets we received over the last 7 days. You can clearly see that when the rollout started, we saw a large increase in ICMP PTB messages (Y label - packet count - deliberately obfuscated):
Interestingly the majority of the ICMP packets are concentrated in our Frankfurt datacenter:
We estimate that in our Frankfurt datacenter, we receive an ICMP PTB message on 2 out of every 100 IPv6 TCP connections. These seem to come from only a handful of ASNs:
- AS6830 - Liberty Global Operations B.V.
- AS20825 - Unitymedia NRW GmbH
- AS31334 - Vodafone Kabel Deutschland GmbH
- AS29562 - Kabel BW GmbH
These networks send us ICMP PTB messages, usually reporting that their MTU is 1280. For example:
$ sudo tcpdump -tvvvni eth0 icmp6 and ip6[40+0]==2
IP6 2a02:908:xxx > 2400:xxx ICMP6, packet too big, mtu 1280, length 1240
IP6 2a02:810d:xx > 2400:xxx ICMP6, packet too big, mtu 1280, length 1240
IP6 2001:ac8:xxx > 2400:xxx ICMP6, packet too big, mtu 1390, length 1240
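The filter ip6[40+0]==2 matches on the byte right after the fixed 40-byte IPv6 header, which is the ICMPv6 type field when no extension headers are present. Type 2 is Packet Too Big. The sketch below does the same check on raw bytes and extracts the reported MTU. The sample packet is hand-built for illustration, with zeroed addresses and an invalid zero checksum.

```python
import struct

IPV6_HEADER_LEN = 40
ICMPV6_NEXT_HEADER = 58
ICMPV6_PACKET_TOO_BIG = 2

def parse_ptb(packet):
    """Return the MTU from an ICMPv6 Packet Too Big message, or None."""
    if len(packet) < IPV6_HEADER_LEN + 8:
        return None
    # Byte 0 holds the IP version in its top nibble, byte 6 is Next Header.
    if packet[0] >> 4 != 6 or packet[6] != ICMPV6_NEXT_HEADER:
        return None
    # ICMPv6 PTB layout: type, code, checksum, then a 32-bit MTU.
    icmp_type, _code, _checksum, mtu = struct.unpack_from("!BBHI", packet, IPV6_HEADER_LEN)
    if icmp_type != ICMPV6_PACKET_TOO_BIG:
        return None
    return mtu

def build_sample_ptb(mtu):
    """Hand-built IPv6 + ICMPv6 PTB header, for demonstration only."""
    ipv6 = struct.pack("!IHBB", 6 << 28, 8, ICMPV6_NEXT_HEADER, 64) + bytes(32)
    icmp = struct.pack("!BBHI", ICMPV6_PACKET_TOO_BIG, 0, 0, mtu)
    return ipv6 + icmp

reported = parse_ptb(build_sample_ptb(1280))  # 1280
```

Like the tcpdump filter, this sketch does not walk extension headers, so a PTB message that follows one would be missed.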
Finally, if you are an IPv6 user with a weird MTU and have misconfigured MSS - basically if you are doing tunneling - please let us know of any issues. We know that debugging MTU issues is notoriously hard. To aid that we created an online fragmentation and ICMP delivery test. You can run it:
If you are a server operator running IPv6 applications, you should not worry. In most cases leaving the MTU at the default of 1500 is a good choice and should work for the majority of connections. Just remember to allow ICMP PTB packets on the firewall and you should be good. If you serve a variety of IPv6 users and need to optimize latency, you may consider choosing a slightly smaller MTU for outbound packets, to reduce the risk of relying on Path MTU Detection / ICMP.
Low level network tuning sound interesting? Join our world famous team in London, Austin, San Francisco, Champaign and our elite office in Warsaw, Poland.