Posted by Natalie Silvanovich, Project Zero
Over the past five years, video conferencing support in websites and applications has exploded. Facebook, WhatsApp, FaceTime and Signal are just a few of the many ways that users can make audio and video calls across networks. While a lot of research has been done into the cryptographic and privacy properties of video conferencing, there is limited information available about the attack surface of these platforms and their susceptibility to vulnerabilities. We reviewed the three most widely-used video conferencing implementations. In this series of blog posts, we describe what we found.
This part will discuss our analysis of WebRTC. Part 2 will cover our analysis of FaceTime. Part 3 will discuss how we fuzzed WhatsApp. Part 4 will describe some attacks against WhatsApp that didn’t work out. And finally, Part 5 will discuss the future of video conferencing and steps that developers can take to improve the security of their implementation.
All the video conferencing implementations we investigated allow at least two peers anywhere on the Internet to communicate through audiovisual streams. Implementing this capability so that it is reliable and has good audio and video quality presents several challenges. First, the peers need to be able to find each other and establish a connection regardless of NATs or other network infrastructure. Then they need to be able to communicate, even though they could be on different platforms, application versions or browsers. Finally, they need to maintain audio and video quality, even if the connection is low-bandwidth or noisy.
Almost all video conferencing solutions have converged on a single architecture. It assumes that two peers can communicate via a secure, integrity checked channel which may have low bandwidth or involve an intermediary server, and it allows them to create a faster, higher-bandwidth peer-to-peer channel.
Once signalling is complete, the peers find a way to route traffic to each other using the STUN, TURN and ICE protocols. Based on what these protocols determine, the peers can create UDP, UDP-over-STUN and occasionally TCP connections based of what is favorable for the network conditions.
I started by looking at WebRTC signalling, because it is an attack surface that does not require any user interaction. Protocols like RTP usually start being processed after a user has picked up the video call, but signalling is performed before the user is notified of the call. WebRTC uses SDP for signalling.
I reviewed the WebRTC SDP parser code, but did not find any bugs. I also compiled it so it would accept an SDP file on the commandline and fuzzed it, but I did not find any bugs through fuzzing either. I later discovered that WebRTC signalling is not implemented consistently across browsers anyhow. Chrome uses the main WebRTC implementation, Safari has branched slightly and Firefox uses their own implementation. Most mobile applications that use WebRTC implement their own signalling in a protocol that is not SDP as well. So it is not likely that a bug in WebRTC signalling would affect a wide variety of targets.
I then decided to look at how RTP is processed in WebRTC. While RTP is not an interaction-less attack surface because the user usually has to answer the call before RTP traffic is processed, picking up a call is a reasonable action to expect a user to take. I started by looking at the WebRTC source, but it is very large and complex, so I decided fuzzing would be a better approach.
The WebRTC repository contains fuzzers written for OSS-Fuzz for every protocol and codec supported by WebRTC, but they do not simulate the interactions between the various parsers, and do not maintain state between test cases, so it seemed likely that end-to-end fuzzing would provide additional coverage.
Setting up end-to-end fuzzing was fairly time intensive, so to see if it was likely to find many bugs, I altered Chrome to send malformed RTP packets. I changed the srtp_protect function in libsrtp so that it ran the following fuzzer on every packet:
void fuzz(char* buf, int len){ int q = rand()%10; if (q == 7){ int ind = rand()%len; buf[ind] = rand(); } if(q == 5){ for(int i = 0; i < len; i++) buf[i] = rand(); } |
RTP fuzzer (fuzzer q)
When this version was used to make a WebRTC call to an unmodified instance of Chrome, it crashed roughly every 30 seconds.
Most of the crashes were due to divide-by-zero exceptions, which I submitted patches for, but there were three interesting crashes. I reproduced them by altering the WebRTC source in Chrome so that it would generate the packets that caused the same crashes, and then set up a standalone build of WebRTC to reproduce them, so that it was not necessary to rebuild Chrome to reproduce the issues.
if (frame->frame_type() == kVideoFrameKey) {
...
GofInfo info = gof_info_.find(codec_header.tl0_pic_idx)->second;
FrameReceivedVp9(frame->id.picture_id, &info);
UnwrapPictureIds(frame);
return kHandOff;
}
If this value does not exist in the gof_info_ array, std::map::find returns the end value of the map, which points to one element past the allocated values for the map. Depending on memory layout, dereferencing this iterator will either crash or return the contents of unallocated memory.
This tool also had the benefit of enabling much faster fuzzing, as it was possible to fuzz RTP by fuzzing the RTP dump file and loading it into video_replay. There were some false positives, as it was also possible that fuzzing caused bugs in parsing the RTP dump file format, but most of the bugs were actually in RTP processing.
Fuzzing with the video_replay tool with code coverage and ASAN enabled led to four more bugs. We ran the fuzzer on 50 cores for about two weeks to find these issues.
Overall, end-to-end fuzzing found a lot of bugs in WebRTC, and a few were fairly serious. They have all now been fixed. This shows that end-to-end fuzzing is an effective approach for finding vulnerabilities in this type of video conferencing solution. In Part 2, we will try a similar approach on FaceTime. Stay tuned!