Last week Fortune asked Mark Hurd, Oracle co-CEO, how Oracle was going to compete in cloud computing when their capital spending came in at $1.7B whereas the aggregate spending of the three cloud players was $31B. Essentially the question was, if you assume the big three are spending roughly equally, how can $1.7B compete with more than $10B when it comes to serving customers? It’s a pretty good question and Mark’s answer was an interesting one: “If I have two-times faster computers, I don’t need as many data centers. If I can speed up the database, maybe I need one fourth as many data centers.”
Of course, I don’t believe that Oracle has, or will ever get, servers 2x faster than the big three cloud providers. I also would argue that “speeding up the database” isn’t something Oracle is uniquely positioned to offer. All major cloud providers have deep database investments but, ignoring that, extraordinary database performance won’t change most of the factors that force successful cloud providers to offer a large multi-national data center footprint to serve the world. Still, Hurd’s offhand comment raises the interesting question of how many data centers will be required by successful international cloud service providers.
I’ll argue the number is considerably bigger than that deployed by even the largest providers today. Yes, this represents massive cost given that even a medium-sized data center will likely exceed $200M. All the providers are very focused on cost and none want to open the massive number of facilities I predict, so let’s look deeper at the myriad of drivers for large data center counts.
*N+1 Redundancy: The most efficient number of data centers per region is one. There are some scaling gains in having a single, very large facility. But one facility will have some very serious and difficult-to-avoid full-facility fault modes like flood and, to a lesser extent, fire. It’s absolutely necessary to have two independent facilities per region and it’s actually much more efficient and easy to manage with three. 2+1 redundancy is cheaper than 1+1 and, when there are 3 facilities, a single facility can experience a fault without eliminating all redundancy from the system. Consequently, whenever AWS goes into a new region, it’s usual that three new facilities be opened rather than just one with some racks on different power domains.
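One way to read the claim that 2+1 is cheaper than 1+1 is as a capacity overhead calculation. The sketch below assumes equal-sized facilities and assumes the region must still carry its full load after losing any one facility; the function names are illustrative only and do not describe how AWS actually sizes regions.

```python
def provisioned_overhead(active: int, spare: int = 1) -> float:
    """Provisioned capacity divided by required capacity for an
    active+spare design built from equal-sized facilities (assumption)."""
    if active < 1 or spare < 0:
        raise ValueError("need at least one active facility and spare >= 0")
    return (active + spare) / active


def still_redundant_after_one_fault(total_facilities: int) -> bool:
    """True if at least two facilities survive a single full-facility fault."""
    return total_facilities - 1 >= 2


if __name__ == "__main__":
    for active in (1, 2, 3):
        total = active + 1
        print(
            f"{active}+1: overhead {provisioned_overhead(active):.2f}x, "
            f"redundant after one fault: {still_redundant_after_one_fault(total)}"
        )
```

Under these assumptions, 1+1 provisions twice the required capacity while 2+1 provisions one and a half times, and only the three-facility layout keeps two live facilities after a fault.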
*Too Big to Fail: Even when building three new data centers when opening up a new region, there are some very good reasons to have more than three data centers as a region grows. There is some absolute data center size where the facility becomes “too big to fail.” This line is gray and open to debate but the limiting factor is how big of a facility can an operator lose before the lost resources and the massive network access pattern changes on failure can’t be hidden from customers. AWS can easily build 100-megawatt facilities, but the cost savings from scaling a single facility without bound are logarithmic, whereas the negative impact of blast radius is linear. When facing seriously sub-linear gains for linear risk, it makes sense to cap the maximum facility size. Over time this cap may change as technology evolves but AWS currently elects to build right around 32MW. If we instead built to 100MW and just pocketed the slight gains, it’s unlikely anyone would notice. But there is a slim chance of full-facility fault, so we elect to limit the blast radius in our current builds to around 32MW.
These groupings of multiple data centers in a redundancy group are often referred to as a region. As the region scales, to avoid allowing any of the facilities that make up the region to become too big to fail, the number of data centers can easily escalate to far beyond ten. AWS already has regions scaled far beyond 10 data centers.
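To see how a per-facility size cap drives up the facility count, here is a minimal sketch. The 32MW and 100MW figures come from the text above; the regional load values and the single spare facility are hypothetical inputs chosen only for illustration.

```python
import math


def facilities_needed(region_load_mw: float, cap_mw: float = 32.0, spares: int = 1) -> int:
    """Facilities required to carry region_load_mw when no facility may
    exceed cap_mw, plus `spares` extra facilities for redundancy (assumption)."""
    if region_load_mw <= 0 or cap_mw <= 0:
        raise ValueError("load and cap must be positive")
    return math.ceil(region_load_mw / cap_mw) + spares


def blast_radius_share(region_load_mw: float, cap_mw: float = 32.0) -> float:
    """Largest fraction of regional load a single facility can hold."""
    return min(cap_mw, region_load_mw) / region_load_mw


if __name__ == "__main__":
    for load in (64, 320, 640):  # hypothetical regional loads in MW
        print(
            f"{load} MW: {facilities_needed(load)} facilities at a 32 MW cap, "
            f"{facilities_needed(load, cap_mw=100)} at a 100 MW cap, "
            f"worst single-facility share {blast_radius_share(load):.0%}"
        )
```

With these made-up loads, a region only has to reach a few hundred megawatts before the 32MW cap pushes it past ten facilities, while the share of the region lost to any single failure keeps shrinking.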
What factors drive a large scale operator to offer more than a single region and how big might this number of regions get for successful international operators? Clearly the most efficient number of regions is one covering the entire planet just as one is the most efficient number of data centers if other factors are ignored. There are some significant scaling cost gains that can be achieved by only deploying a single region.
*Blast Radius: Just as we discovered that a single facility eventually gets too big to fail, the same thing happens with a very large mega-region. If an operator were to concentrate their world-wide capacity in a single region, it would quickly become too big to fail.
I’m proud to say that AWS hasn’t had a regional failure in recent history, but the industry continues to see them, if only rarely. They have never been common but they are still within the realm of possibility, so a single-region deployment model doesn’t seem ideal for customers. The mega-region would also suffer from decaying economics: just as with the single large data center, the incremental cost reductions from scaling the region become ever smaller while the downside risk continues to escalate.
The mega-region downside risks can be at least partially mitigated by essentially dividing the region up into smaller independent regions, but this increases costs and further decreases the scaling gains. Eventually it just makes better sense to offer customers alternative regions rather than attempting to scale a single region, and the argument in favor of multiple regions becomes even stronger when other factors are considered.
*Latency and the Speed of Light: The speed of light remains hard to exceed and the round trip time just across North America is nearly 100 ms (Why are there data centers in NY, Hong Kong, and Tokyo). Low latency is a very important success factor in many industries so, for latency reasons alone, the world will not be well served by a single data center or a single region.
Actually, it turns out that the speed of light in fiber is about 30% less than the speed of light in a vacuum or in air, so it actually is possible to run faster than fiber (Communicating data beyond the speed of light). But, without a more fundamental solution to the speed of light problem, many regions are the only practical way to effectively serve the entire planet for many workloads.
There are many factors beyond latency that will push cloud providers to offer a large number of regions and I’m going to argue that latency is not the prime driver of very large numbers of regions. If latency were the only driver, the number of required regions would likely be in the 30 to 100 range. Akamai, the world’s leading Content Distribution Network (CDN), reports more than 1,500 PoPs (Points of Presence) but many experts see that as 10x more than latency alone would strictly require. Another major CDN, Limelight, reports more than 80 PoPs. This number is closer to the one I would come up with for the number of PoPs required if latency were the only concern. However, latency isn’t the only concern and the upward pressure from other factors appears to dominate latency.
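The latency floor set by the speed of light can be sketched with simple arithmetic. The example below assumes light in fiber travels at roughly 70% of its vacuum speed (the "about 30% less" figure above) and uses an illustrative 4,000 km path; real routes are longer than the straight-line distance and add switching and queuing delay, so observed round trip times sit well above this floor.

```python
C_VACUUM_KM_PER_S = 299_792.458  # speed of light in a vacuum
FIBER_VELOCITY_FACTOR = 0.7      # assumption: "about 30% less" than vacuum


def min_round_trip_ms(path_km: float, velocity_factor: float = FIBER_VELOCITY_FACTOR) -> float:
    """Lower bound on round trip time over a path of path_km kilometers,
    counting propagation delay only (no routing, switching, or queuing)."""
    if path_km < 0 or not 0 < velocity_factor <= 1:
        raise ValueError("path_km must be >= 0 and velocity_factor in (0, 1]")
    one_way_s = path_km / (C_VACUUM_KM_PER_S * velocity_factor)
    return 2 * one_way_s * 1000


if __name__ == "__main__":
    path_km = 4_000  # hypothetical straight-line path length
    print(f"fiber floor:  {min_round_trip_ms(path_km):.1f} ms")
    print(f"vacuum floor: {min_round_trip_ms(path_km, velocity_factor=1.0):.1f} ms")
```

Because the floor grows linearly with distance, the only way to lower it for a given user is to put capacity physically closer, which is the argument for many regions.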
*Networking Ecosystem Inefficiencies: The world telecom market is a bit of a mess with many regions being served by state-sponsored agents, monopolies, or a small number of providers that, for a variety of reasons, don’t compete efficiently. Many regions are underserved by providers that have trouble with the capital investment to roll out the needed capacity. Some providers lack the technical ability to roll out capacity at the needed rate. All these factors conspire to produce more than an order of magnitude difference in cost between the (sort of) competitive US market and some other important world-wide markets.
Imagine a $20,000 car in one market costing far more than $200,000 in another market. That’s where we are in the network transit world. This is one of the reasons why all the major cloud providers have private world-wide networks. This is a sensible step and certainly does help but it doesn’t fully address the market inefficiencies around last-mile networks. Most users are only served by a single access network and these last-mile network providers often can’t or don’t own the interconnection networks that link different access networks together. Each access network must be reached by all cloud providers, and each of these access networks itself faces sometimes unreasonable interconnection fees that increase its costs, especially for video content.
Netflix took an interesting approach to the access network cost problem. Their approach helps Netflix customers and, at the same time, helps access networks serve customers better. Netflix offers to place caching servers (essentially Netflix-specific CDN nodes) in the central offices of access networks. This allows the access network to avoid having to pay the cost to their transit providers to move the bits required to serve their Netflix customers. This also gives the customers of these access networks a potentially higher quality of service (for Netflix content). A further advantage for Netflix is that, by reducing its dependence on the large transit providers, it reduces the control these transit providers have over Netflix and Netflix customers. This was a brilliant move and it’s another data point on how many points of presence might be required to serve the world. Netflix reports they have close to 1,000 separate locations around the world.
*Social and Political Factors: We have seen good reason to have O(10^2) regions to deliver the latency required by the most demanding customers. We have also looked at economic anomalies in networking costs requiring O(10^3) regions to fully serve the world economically. What we haven’t talked about yet are the potentially more important social and political factors. Some cloud computing users really want to serve their customers from local data centers and this will impact their cloud provider choices. In addition, some national jurisdictions will put in place legal restrictions that make it difficult to fully serve the market without a local region. Even within a single nation, there will sometimes be local government restrictions that won’t allow certain types of data to be housed outside of their jurisdiction, so even a region within the same country won’t meet the needs of all customers and political bodies. These social and political drivers again require O(10^3) points of presence and perhaps that many full regions.
As the percentage of server-side computing hosted in the cloud swings closer to 100%, the above factors will cause the largest of the international cloud providers to have between several hundred and a thousand regions. Each region will require at least three data centers and the largest will run tens of independent facilities. Taking both the number of regions and the number of data centers required in each of these regions into account argues the total data center count of the world’s largest cloud operators will rise from the current O(10^2) to O(10^5).
It may be the case that there will be many regional cloud providers rather than a small group of international providers. I can see arguments and factors supporting both outcomes but, whatever the outcome, the number of world-wide cloud data centers will far exceed O(10^5) and these will be medium to large data centers. When a competitor argues that fast computers or databases will save them from this outcome, don’t believe it.
Oracle is hardly unique in having their own semiconductor team. Amazon does custom ASICs. Google acquired an ARM team and has done custom ASICs for machine learning. Microsoft has done significant work with FPGAs and is also an ARM licensee. All the big players have major custom hardware investments underway and some are even doing custom ASICs. It’s hard to call which company is delivering the most customer value from these investments, but it certainly doesn’t look like Oracle is ahead.
We will all work hard to eliminate every penny of unneeded infrastructure investment, but there will be no escaping the massive data center counts outlined here nor the billions these deployments will cost. There is no shortcut and the only way to achieve excellent world-wide cloud services is to deploy at massive scale.