This is the second in a series of blog posts in which we outline our experience with developing the first HTML5 game-streaming service for the web.
In the previous entry in this series, we briefly covered networking and how we leverage WebRTC as our preferred communication method, falling back to WebSockets as a last resort to ensure users can connect to their gaming PC from anywhere. Today we are going to dive into the challenges we faced building this reliable fallback and how we solved them.
When we first entered beta back in January, we did so to much fanfare, perhaps a little too much. Our servers were instantly overwhelmed, and our production environment came to a halt. After days of work, we were able to get back up and running, but a new issue arose: we crashed our DNS provider's data center.
Our DNS records were set to cache for only 60 seconds, since a user's IP can change for any number of reasons, and we needed to manage their public and local addresses on a valid hostname for SSL/TLS certificate generation.
Our provider was understanding of our use case, and we appreciated their patience, so to remedy the issue we raised our TTL to 24 hours, which introduced a slew of new problems. Now if a user's IP changed, the old IP would be used unless they manually cleared their DNS cache, which is terrible for the user experience. We also began implementing support for multiple hosts, meaning users could now need dozens of IPs under a single account. It became very apparent that having dozens of records per user with long TTLs wouldn't scale, and we needed to think outside the box.
Our solution to this issue is Mimic, a high-performance dynamic DNS server built using .NET Core that allows records to be resolved to a domain instantly, without propagation. Mimic is stateless and deterministically maps domain names to IP addresses (IPv4 and IPv6) by parsing DNS requests for IP information that is appended as a sub-domain to a domain.
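To make the idea concrete, here is a minimal sketch of how an address could be recovered from a hostname alone. It is not Mimic's actual code: the SubdomainIp class and TryParseIPv4 method are hypothetical names, and the sketch assumes the leftmost label encodes an IPv4 address with dashes in place of dots, as in 51-183-103-23.user62.cyr.ax. IPv6 is left out for brevity.

```csharp
using System;
using System.Net;

// Hypothetical helper for illustration only; names and encoding are assumptions.
public static class SubdomainIp
{
    // Tries to read a dash-separated IPv4 address from the leftmost label of a host name.
    public static bool TryParseIPv4(ReadOnlySpan<char> host, out IPAddress address)
    {
        address = IPAddress.None;

        // Only the leftmost label (everything before the first dot) carries the address.
        int labelEnd = host.IndexOf('.');
        if (labelEnd <= 0) return false;
        ReadOnlySpan<char> label = host.Slice(0, labelEnd);

        // Four bytes on the stack, so parsing itself needs no heap buffer.
        Span<byte> octets = stackalloc byte[4];
        int octetIndex = 0, value = 0, digits = 0;

        foreach (char c in label)
        {
            if (c == '-')
            {
                // A dash closes the current octet; reject empty octets and a fifth octet.
                if (digits == 0 || octetIndex == 3) return false;
                octets[octetIndex++] = (byte)value;
                value = 0;
                digits = 0;
            }
            else if (c >= '0' && c <= '9')
            {
                value = value * 10 + (c - '0');
                if (++digits > 3 || value > 255) return false;
            }
            else
            {
                return false; // Anything else means this label is not an encoded address.
            }
        }

        // The last octet has no trailing dash, so store it here.
        if (digits == 0 || octetIndex != 3) return false;
        octets[3] = (byte)value;

        address = new IPAddress(octets);
        return true;
    }
}
```

Calling SubdomainIp.TryParseIPv4("51-183-103-23.user62.cyr.ax", out var ip) yields the address 51.183.103.23 without consulting any record store, which is what lets a server built this way stay stateless.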
Mimic operates on the principle of fast returns and a low memory footprint. We achieve this by leveraging APIs like Span<T> along with unsafe low-level operations. When handling millions of requests per hour, string parsing can allocate gigabytes of memory, and calls to string.Substring are relatively slow. Span<T> enables the representation of contiguous regions of arbitrary memory, regardless of whether that memory is associated with a managed object, is provided by native code via interop, or is on the stack. It does so while still providing safe access with performance characteristics like those of arrays.
string str = "hello, world";
string worldString = str.Substring(startIndex: 7, length: 5); // Allocates
ReadOnlySpan<char> worldSpan = str.AsSpan().Slice(start: 7, length: 5); // No allocation
Before the data can be handed off as read-only, it still needs to be processed. Suppose a DNS request arrives that contains uppercase characters, such as 51-183-103-23.user62.cYR.ax (yes, a browser did this). We need to flatten this to lowercase to shrink the set of ASCII characters the parser has to handle, which makes parsing faster. Usually the following call would be just fine.
var lowerString = Encoding.ASCII.GetString(domainBytes).ToLower();
However, this allocates memory and becomes incredibly slow as the length of the data increases. A fast solution is to walk the bytes and set them to their lowercase equivalents manually.
const byte asciiUFirst = 65; // 'A'
const byte asciiULast = 90;  // 'Z'
// The Span is a view over domainBytes and can be reused without reallocation.
Span<byte> domainSpan = domainBytes;
for (var i = 0; i < domainSpan.Length; i++)
{
    // Setting bit 5 (value 32) maps 'A'-'Z' onto 'a'-'z' in place.
    if (asciiUFirst <= domainSpan[i] && domainSpan[i] <= asciiULast)
    {
        domainSpan[i] = (byte)(domainSpan[i] | 32);
    }
}
The execution of the above function takes less than a millisecond and allocates zero new memory. When we are ready to reply to the DNS request, the response is built in a previously stack-allocated buffer, which avoids allocating additional memory and takes us from a few gigabytes per million requests to only a few megabytes, all in a managed language.
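As an illustration of building part of a response without touching the heap, the sketch below writes a single A-record answer into a buffer supplied by the caller. It is an assumption about how such a writer might look, not Mimic's implementation: DnsAnswerWriter, WriteARecord and ARecordLength are hypothetical names, and the answer name is written as a compression pointer to offset 12, where the question name starts in a standard DNS message.

```csharp
using System;
using System.Buffers.Binary;

// Hypothetical helper for illustration only; not taken from Mimic.
public static class DnsAnswerWriter
{
    // NAME pointer (2) + TYPE (2) + CLASS (2) + TTL (4) + RDLENGTH (2) + RDATA (4).
    public const int ARecordLength = 16;

    // Writes one A-record answer into destination and returns the number of bytes written.
    public static int WriteARecord(Span<byte> destination, ReadOnlySpan<byte> ipv4, uint ttlSeconds)
    {
        if (ipv4.Length != 4)
            throw new ArgumentException("An IPv4 address must be 4 bytes.", nameof(ipv4));
        if (destination.Length < ARecordLength)
            throw new ArgumentException("Destination buffer is too small.", nameof(destination));

        // 0xC00C is a compression pointer to offset 12, the question name after the 12-byte header.
        BinaryPrimitives.WriteUInt16BigEndian(destination, 0xC00C);
        BinaryPrimitives.WriteUInt16BigEndian(destination.Slice(2), 1);          // TYPE = A
        BinaryPrimitives.WriteUInt16BigEndian(destination.Slice(4), 1);          // CLASS = IN
        BinaryPrimitives.WriteUInt32BigEndian(destination.Slice(6), ttlSeconds); // TTL in seconds
        BinaryPrimitives.WriteUInt16BigEndian(destination.Slice(10), 4);         // RDLENGTH = 4
        ipv4.CopyTo(destination.Slice(12));                                      // RDATA = address bytes

        return ARecordLength;
    }
}
```

A caller could reserve the buffer on the stack with Span<byte> buffer = stackalloc byte[DnsAnswerWriter.ARecordLength]; and pass it in, so the answer is assembled without any heap allocation. Because the method only sees a Span<byte>, the same code works equally well against a pooled array or a slice of a larger response buffer.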
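To reduce load, we also set a high TTL on our responses so that browsers persistently cache addresses and avoid needless resolve requests to our backend. Because the IP is encoded in the hostname itself, a changed IP produces a new hostname, so a long TTL cannot leave a user with a stale address. Combine this with our globally distributed infrastructure and records can be resolved faster than Cloudflare.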
To ensure each user is protected, domains are isolated from one another. Much like “google.com” and “youtube.com” are separate websites, user domains are unable to share cookies or make cross-origin requests to one another.
Creating Mimic has had some positive side effects on our user experience. For instance, the time it takes to connect over a WebSocket has been reduced from hundreds of milliseconds to only a few, making the application feel faster.
We also use Let’s Encrypt to generate certificates for our users and validate requests via TXT records. In the past, we used our DNS provider, and TXT records could take up to 60 seconds to propagate, which is a long time to keep a user waiting. Now with Mimic and a fast key-value database such as Redis, we can instantly validate TXT records and continue doing wide-scale automated SSL/TLS certificate deployment in seconds, not minutes.
Building out our DNS infrastructure was an intimidating task, but the results have been more than worth the effort. Mimic is still in its incubation period and will be fully rolled out in January for launch.
If you’d like to learn more about Rainway, please check out our other blog posts, or join our Discord if you have any questions or would like to chat.
Until next time!