This is the first in a series of blog posts in which we outline our experience with developing the first HTML5 game-streaming service for the web.
With Microsoft's recent announcement of xCloud and Google's Project Stream entering beta testing, we are in the early phases of a new streaming revolution. Companies like Spotify and Netflix forever reshaped how we listen to music and watch our favorite shows and movies, and now that same change is coming to gaming.
While we’re big fans of Cloud Gaming at Rainway and have known a “Netflix for games” was inevitable, our focus has always been to be more of a “Plex for video games” by letting users enjoy the games they own anywhere, on anything. Today we’d like to share the lessons we’ve learned building this service over the last year and how we plan to continue improving it.
WebRTC is our preferred method of communication, as "unreliable" data channels allow for UDP-like performance over SCTP. We've also chosen to open-source our WebRTC library, Spitfire, so developers can build exciting web experiences with a native WebRTC server.
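What makes a data channel "unreliable" is a pair of options passed at creation time. A minimal sketch (the channel name and the surrounding function are illustrative, not Spitfire's API):

```javascript
// Sketch: opening an "unreliable" WebRTC data channel. Disabling
// ordering and retransmits means stale packets are dropped instead of
// blocking newer ones, giving UDP-like behavior over SCTP.
function createStreamChannel(peerConnection) {
  return peerConnection.createDataChannel('stream', unreliableOptions);
}

// The two options that make the channel unreliable:
const unreliableOptions = {
  ordered: false,    // don't stall on out-of-order packets
  maxRetransmits: 0, // never retransmit; drop lost packets
};
```

For real-time input and media, dropping a late packet is almost always better than waiting for it, which is why both flags are set.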
Before a WebRTC connection can be established, the peers must signal. Signaling is the process of coordinating communication: for a WebRTC application to set up a connection, its clients need to exchange information about their networks as well as other metadata, such as session descriptions.
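Concretely, a signaling service just relays opaque payloads (SDP offers/answers and ICE candidates) between two peers. The envelope shape below is hypothetical; WebRTC only requires that the payloads reach the other side:

```javascript
// Sketch of the messages a signaling service relays. WebRTC does not
// dictate a wire format, so this envelope is purely illustrative.
function makeSignal(type, from, to, payload) {
  return JSON.stringify({ type, from, to, payload });
}

// In the browser, the exchange looks roughly like (illustrative):
// pc.onicecandidate = (e) => {
//   if (e.candidate) socket.send(makeSignal('ice', me, peer, e.candidate));
// };
// const offer = await pc.createOffer();
// await pc.setLocalDescription(offer);
// socket.send(makeSignal('offer', me, peer, offer));
```

The answering peer does the mirror image: it sets the offer as its remote description, creates an answer, and sends it back through the same channel.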
Our signaling is currently powered by Thunderstorm, a .NET Core service that brokers information to connected peers behind NATs. We now run a single instance of Thunderstorm to handle millions of messages from our users and, while it does so without breaking a sweat, it will soon be replaced by a new project called Aqueduct, which will be covered in the next entry in this series.
While WebRTC is a fantastic technology for the web, it does have trouble connecting from behind some particularly nasty NATs. Without a TURN server to relay data, most applications give up at this point and show a fatal error message.
Accepting this didn’t sit well with us. Users shouldn’t have to miss out just because of their network. So we tackled the problem head-on by building our own PKI (public key infrastructure) on top of Let’s Encrypt (whom we are proud to sponsor), which allows us to do wide-scale, automated SSL/TLS certificate deployment for our users. Once a user has a certificate on their account, we fall back to WebSockets, which we have optimized for low latency, and perform TCP hole punching through port 443. We have been pleased with the success rate, but we still plan to improve this method further by rolling out Mimic, our zero-propagation dynamic DNS, which will also be covered in this series.
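The client side of that fallback is straightforward. A minimal sketch, assuming a hypothetical host name scheme (the real endpoint layout is Rainway's, not shown here):

```javascript
// Sketch: falling back to a secure WebSocket when WebRTC cannot
// traverse the NAT. The host naming scheme is hypothetical; the
// per-user TLS certificate is what lets the host terminate wss://.
function buildFallbackUrl(hostId) {
  // Port 443, so the connection looks like ordinary HTTPS traffic
  // to middleboxes and restrictive firewalls.
  return `wss://${hostId}.example-relay.net:443/stream`;
}

function connectFallback(hostId) {
  const socket = new WebSocket(buildFallbackUrl(hostId));
  socket.binaryType = 'arraybuffer'; // avoid Blob overhead for video data
  return socket;
}
```

Using `arraybuffer` as the binary type matters for streaming: the default `Blob` type adds an extra asynchronous read on every message.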
Finally, both networking methods take advantage of Sachiel, a simple-to-use messaging framework built on Protocol Buffers. With Sachiel, we can create schemas and generate request/response models in any language. Prebaking our models at runtime also allows us to do sub-millisecond deserialization, even in the browser.
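To make the request/response idea concrete, here is an illustrative dispatcher in the spirit of such a layer (this is not Sachiel's actual API; the decoder functions stand in for generated protobuf deserializers):

```javascript
// Illustrative sketch: route incoming messages to handlers by a
// numeric method id, the way a schema-generated messaging layer might.
function createDispatcher() {
  const routes = new Map();
  return {
    // decode: bytes -> model (stands in for a generated deserializer)
    // handle: model -> response
    register(methodId, decode, handle) {
      routes.set(methodId, { decode, handle });
    },
    dispatch(methodId, bytes) {
      const route = routes.get(methodId);
      if (!route) throw new Error(`unknown method ${methodId}`);
      return route.handle(route.decode(bytes));
    },
  };
}
```

Keying dispatch on a small numeric id (rather than a string) keeps the per-message overhead tiny, which is part of how sub-millisecond handling stays achievable.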
As with networking, there are not many options in browsers for real-time video and audio. WebRTC is the obvious choice and is widely supported. Sites like Mixer have used it to significant effect to achieve sub-second live streaming, and Project Stream uses WebRTC video for its service. Using it, however, comes with caveats.
The underlying byte stream and networking optimizations are mostly out of your hands, meaning you have to trust the browser to handle them for you. And while Google is optimizing their implementation because of Project Stream, our goal is not to create a Chrome-only web application.
With no other choice, we turned to the Media Source Extensions API, a straightforward API that allows us to render fragmented MP4 files with hardware acceleration. Unfortunately, while this sounds like the perfect solution, the Media Source spec was not designed for our use case. It is very sensitive to missing data, can create huge buffers that increase input latency, and drifts easily due to browser throttling. There are some nasty hacks that try to work around this, such as recreating the video element on every keyframe — but you end up breaking most browsers, getting poor performance, and delivering a lackluster user experience.
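The basic MSE plumbing looks like the sketch below. The codec string is an example (H.264), and the queueing is needed because a SourceBuffer rejects appends while it is still processing the previous one:

```javascript
// Sketch: feeding fragmented MP4 to a <video> element via MSE.
function attachStream(videoElement, onReady) {
  const mediaSource = new MediaSource();
  videoElement.src = URL.createObjectURL(mediaSource);
  mediaSource.addEventListener('sourceopen', () => {
    // Example codec string; the real one must match the encoder output.
    const sb = mediaSource.addSourceBuffer('video/mp4; codecs="avc1.42E01E"');
    onReady(sb);
  });
}

// Append incoming fragments, queueing while the SourceBuffer is busy.
function makeAppender(sourceBuffer) {
  const queue = [];
  sourceBuffer.addEventListener('updateend', () => {
    if (queue.length) sourceBuffer.appendBuffer(queue.shift());
  });
  return (fragment) => {
    if (sourceBuffer.updating || queue.length) queue.push(fragment);
    else sourceBuffer.appendBuffer(fragment);
  };
}
```

Even this minimal version shows the latency problem: every fragment sitting in that queue, or in the SourceBuffer itself, is time between the user's input and the frame they see.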
All is not lost, however. There are ongoing discussions about a “real-time” mode being introduced in MSE vNext, and in the meantime, we were able to solve this problem by making meticulous optimizations to our video data that, for lack of a better description, encourage decoders not to wait for more video input before decoding. This is not to be confused with the trick of using heuristics to get Chrome to use its low-delay mode, which at best gets you about 70 ms of video buffer.
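Keeping the buffer from drifting is the other half of the battle. A common pattern (a sketch with illustrative thresholds, not Rainway's tuned values) is to measure how far the playhead lags the live edge and gently speed playback up to drain any excess:

```javascript
// Seconds of decoded video ahead of the playhead.
function bufferAhead(buffered, currentTime) {
  if (buffered.length === 0) return 0;
  return buffered.end(buffered.length - 1) - currentTime;
}

// Above ~250 ms of excess buffer, play slightly fast to catch up.
// Thresholds and rates here are illustrative.
function correctionRate(aheadSeconds) {
  return aheadSeconds > 0.25 ? 1.05 : 1.0;
}

// Driven periodically in the browser (illustrative):
// video.playbackRate = correctionRate(bufferAhead(video.buffered, video.currentTime));
```

A small rate nudge is preferred over seeking, since a seek forces the decoder to resynchronize and produces a visible stutter.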
The result of all our work? You can play Black Ops 4 on a 32-bit Intel Compute Stick.
While we do make use of vendor-specific APIs such as Keyboard Lock (which we hope to see other browsers implement), from the very beginning we knew we had to support all browsers. Users deserve a choice in how they use web applications, and forcing them to switch to a particular browser risks bringing back the days of Internet Explorer. We are proud to say we support all spec-compliant browsers so users can retain their freedom and choice when playing their games via Rainway — we have even gone as far as to support browsers such as Brave and Vivaldi.
Unfortunately, not all browsers work so well. Edge, for example, has an issue where audio and video data need at least 3 seconds of buffer before playing, and even after a lot of reports we still haven’t seen improvements. It also does not support WebRTC data channels, listing them as a low priority, and its WebSockets are limited to public addresses unless you run a PowerShell command and edit the registry manually. What makes this genuinely tragic is that Edge is quite good at rendering our UI quickly, yet these pitfalls make it unusable for our users.
On the host side, we can embed performance data related to the encoding process (framerate, processing time, etc.) directly into our video, both as an overlay and as parseable data at a known byte offset. The client side has its own logic for measuring rendering and network latency, as well as data throughput. Combining all of this, we can immediately spot the bottleneck in a user’s setup and determine whether it is the host, the network, or the client.
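On the client, the core of such measurement can be as simple as a rolling window of per-frame receive-to-render deltas. A minimal sketch (Rainway's actual instrumentation is more involved; the window size here is arbitrary):

```javascript
// Sketch: track client-side render latency as a rolling average of
// (render timestamp - receive timestamp) over the last N frames.
function makeLatencyTracker(windowSize = 120) {
  const samples = [];
  return {
    record(receivedAtMs, renderedAtMs) {
      samples.push(renderedAtMs - receivedAtMs);
      if (samples.length > windowSize) samples.shift();
    },
    average() {
      if (samples.length === 0) return 0;
      return samples.reduce((a, b) => a + b, 0) / samples.length;
    },
  };
}
```

Comparing this client-side figure against the encoder timings embedded in the video is what lets you attribute a slow session to the host, the network, or the client.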
A great heuristic for tracking whether our overall performance is good is the number of games being played via Rainway after each update. Thus far we are at well over 200,000 unique sessions, and we see no signs of it slowing down.
The ever-talented Rainway team was able to bring our fast game-streaming technology to the web by pushing hard against obstacles and never cutting corners. We plan to continue optimizing our Connector web application by leveraging powerful browser APIs such as WebAssembly to introduce our new low-level network layer, Coffee, which is capable of feats such as blazing-fast forward error correction at rates up to 40,000 MB/s. As a Techstars company, we follow the rule of “give first,” so Coffee will be fully open source once it is ready, so anyone can leverage it to build blazing-fast real-time applications in the browser on unreliable networks.
I hope you’ve enjoyed this first deep dive into how we created Rainway. The last eight months have been a lot of trial and error, and with our plans to leave beta in January 2019, we are working around the clock to push out the optimizations from all the lessons we’ve learned.
If you’d like to learn more about Rainway, please check out our other blogs or join our Discord if you have any questions or would love to chat.
Until next time!
The next entry in this series will cover our cloud infrastructure and how it’s evolved over time.