Into the Borg – SSRF inside Google production network


In March 2018, I reported an XSS in Google Caja, a tool to securely embed arbitrary HTML/JavaScript in a webpage.
In May 2018, after the XSS was fixed, I realised that Google Sites was using an unpatched version of Google Caja, so I checked whether it was vulnerable to the same XSS. However, the XSS wasn't exploitable there.

Google Caja parses HTML/JavaScript and rewrites it to remove any security-sensitive content, such as iframe or object tags and sensitive JavaScript properties like document.cookie. Caja mostly parses and sanitizes HTML tags on the client side. However, for remote JavaScript tags (<script src="xxx">), the remote resource was fetched, parsed and sanitized on the server side.
I hosted a JavaScript file on my server (https://[attacker].com/script.js) to check whether the Google Sites server would fall for the XSS when parsing it server-side, but the server replied that https://[attacker].com/script.js was not accessible.
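
For illustration, the page content I submitted looked something like this (a sketch; [attacker].com stands for my server):

    <!-- Inline scripts are parsed and sanitized client-side by Caja -->
    <script>document.cookie;</script>

    <!-- Remote scripts are fetched, parsed and sanitized server-side -->
    <script src="https://[attacker].com/script.js"></script>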

After a few tests, I realised that the Google Sites Caja server would only fetch Google-owned resources like https://www.google.com or https://www.gstatic.com, but no external resources like https://www.facebook.com.
That's strange behavior, because this functionality is meant to fetch external resources, so it looks like a broken feature. More interestingly, given the breadth of Google services, it is hard to determine whether an arbitrary URL belongs to Google or not. Unless…

Finding an SSRF in Google

Whenever I find an endpoint that fetches arbitrary content server-side, I always test for SSRF. I had done it a hundred times on Google services and never had any luck. Still, the only explanation for the weird behavior of the Google Caja server was that the fetch was happening inside Google's internal network, which is why it could reach Google-owned resources but not external ones. I already knew this was a bug; now the question was whether it was a security bug!

It's very easy to host and run arbitrary code on Google servers: just use Google Cloud! I created a Google App Engine instance and hosted a JavaScript file on it. I then used the URL of this JavaScript file as an external script resource on Google Sites and updated the Google Sites page. The JavaScript was successfully fetched and parsed by the Google Caja server. I then checked my Google App Engine instance logs to see where the resource was fetched from, and the request came from 10.x.x.201, a private network IP! This looked very promising.
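
The App Engine side was essentially just serving a static script and logging who fetched it. Here is a minimal Node.js sketch of that setup (hypothetical code, not my exact instance):

    // server.js - serve a harmless script and log the fetcher's address.
    const http = require('http');

    http.createServer((req, res) => {
      // Log the direct peer address; on App Engine, the original client IP
      // typically also shows up in the X-Forwarded-For header.
      console.log('fetched by', req.socket.remoteAddress,
                  'x-forwarded-for:', req.headers['x-forwarded-for']);
      res.writeHead(200, { 'Content-Type': 'application/javascript' });
      res.end('console.log("hello from script.js");');
    }).listen(process.env.PORT || 8080);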

I used the private IP as the URL of the Google Sites external script resource (see the sketch below) and waited for the moment of truth. The request took more than 30 seconds to complete, and by then I really thought it had been blocked; I almost closed the page, since I had never had any luck with SSRF on Google before. However, when Google Caja replied, I saw that the response wasn't around 1 KB like a typical error message, but 1 MB! One million bytes of information coming from a 10.x.x.x IP on Google's internal network. I can tell you I was excited at this point! 🙂
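Concretely, the submitted content simply pointed the external script at the internal address (a sketch):

    <script src="http://10.x.x.201/"></script>
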
I opened the file and indeed it was full of private information from Google! \o/

Google, from the inside

First, I want to say that I didn't scan Google's internal network. I made only three requests on the network to confirm the vulnerability and immediately sent a report to Google VRP. It took Google 48 hours to fix the issue (I reported it on a Saturday), so in the meantime I couldn't help but send 2-3 more requests to try to pivot the SSRF vulnerability into unrestricted file access or RCE, but without luck.

Architecture of Borg

The first request was to http://10.x.x.201/. It responded with the server status monitoring page of a "Borglet". After a Google search, I could confirm that I was indeed inside Borg, Google's internal large-scale cluster management system (here is an overview of the architecture). Google open-sourced Borg's successor, Kubernetes, in 2014. It seems that while Kubernetes is getting more and more popular, Google still relies on Borg for its internal production infrastructure, but I can tell you it's not because of the design of the Borg interfaces! (edit: this is intended as a joke 😛)
The second request was to http://10.x.x.1/, also a monitoring page for another Borglet. The third request was to http://10.x.x.1/getstatus, a different Borglet status monitoring page with more details on the jobs, such as permissions and arguments.

Each Borglet represents a machine, a server.

On the hardware side, both servers used Haswell CPUs at 2.30 GHz with 72 cores, which corresponds to a set of two or three Xeon E5 v3 chips. Both servers had their CPUs at 77% usage. They had 250 GB of RAM each, used at around 70%. Each had a single 2 TB HDD and no SSD. The HDDs were almost empty, with only 15 GB used, so the data must be stored elsewhere.

The processing jobs (allocs and tasks) are very diverse; I believe this optimizes resource usage, with some jobs using memory, others CPU or network, some running at high priority, and so on. Some services seem very active: video encoding, Gmail and Ads. That should not be surprising, since video processing is very heavy, Gmail is one of the main Google services and Ads is, well, Google's core business. 😉
I didn't see Google Sites or Caja in the jobs list, so either the SSRF was going through a proxy, or the Borglet answering on 10.x.x.201 was on a different network than the 10.x.x.201 that appeared in my Google App Engine instance logs (private IP ranges can be reused across networks).

Regarding the architecture, we can find jobs related to almost all of the components of the Google stack, in particular MapReduce, BigTable, Flume, GFS… On the technology side, Java seems to be heavily used. I didn't see any mention of Python, C++, NodeJS or Go, but that doesn't mean they weren't used, so don't draw conclusions. 😛

I should mention that Borg, like Kubernetes, relies on containers (like Docker) and on VMs. For video processing, it seems they are using gVisor, an open-source Google tool that looks like a trade-off between container performance and VM security.
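
For context, this is how gVisor is used with Docker on the outside: it plugs in as an alternative container runtime (runsc), so the container talks to a user-space kernel instead of directly to the host kernel. An example from gVisor's public setup (nothing to do with Google's internal configuration):

    # Run a container under gVisor's runsc runtime instead of the default runc.
    docker run --rm --runtime=runsc ubuntu dmesg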