Scalability is the capability of a system, network, or process to handle a growing amount of work, or its potential to be enlarged in order to accommodate that growth. - Wikipedia
It’s safe to say there’s been a steady growth in the amount of data stored since ‘The Digital Age’ began. As seen in the infographic below, our ability to store more data has led to an exponential growth in the amount of data stored.
The total amount of digital data generated in 2013 was about 4.4 zettabytes (that's 4.4 × 10²¹ bytes). This is conservatively projected to grow to 44 zettabytes by 2020. (source)
As hardware continues to improve at a rapid pace (mass storage devices, CPUs and networks), this in turn increases the capacity of existing database software, enabling growth in the number of transactions per second. But there is still the need to add capacity from time to time.
There are two approaches to scaling a database, both with their own pros and cons: Vertical (Scale Up) and Horizontal (Scale Out).
In the 1975 movie Jaws, three men go hunting for the largest shark ever seen. Martin Brody (played by Roy Scheider) catches a very quick glimpse of the shark, and with terror etched into his face, backs into the boat and delivers one of the best lines in movie history:
“You’re gonna need a bigger boat”
Vertically Scaling your Database
This approach involves adding more physical or virtual resources to the underlying server hosting the database – more CPU, more memory or more storage. Basically, you need a bigger boat – er, server. This is the traditional approach, and pretty much every database can be scaled up.
Pros:
- From a development standpoint, there’s no need to change anything. The same code will connect to the existing database without any issues.
- Easier to implement and administer – there’s a single server to manage, and that’s it. Setup is also straightforward.
- Datacenter costs in terms of space, cooling and power are lower.
Cons:
- While the initial software costs are low, if licensing is based on the number of cores, then you have to pay more every time you scale up.
- If you’re implementing this on a VMware virtual environment, larger VMs (more virtual cores) will inevitably lead to high CPU Ready time – the dreaded Ready Queue – which is difficult to identify, and even harder to solve.
- Hardware costs can also be high, as you have to purchase ‘high-end’ servers.
- There is limited scope for upgrades. A server can only be so big. What happens when your database can no longer fit on the largest available server?
- There’s also the issue of vendor lock-in. You are tied to a single database vendor and if you decide to switch, this might involve a very difficult migration, or starting again from scratch.
In the movie Multiplicity, Doug Kinney (Michael Keaton) is a construction worker who is struggling with his work-life balance. First of all, he gets a clone to take over at work, so he can spend time with his wife. Then he gets another clone to help out at home. And then the other two clones get together to make a third clone. As you can imagine, this arrangement becomes a little too complex, and almost gets out of hand.
Horizontally Scaling your Database
This approach involves adding more instances/nodes of the database to deal with increased workload. When you need more capacity, you simply add more servers to the cluster. In addition, the hardware used tends to be smaller, cheaper servers.
Most database products will not scale in this way, and depending on how it is implemented, applications may need to be re-written to work with the database. There are two techniques that can be used to achieve this.
Data replication – For read-intensive workloads, you can have one primary copy that accepts data changes and multiple read-only replicas of that data. The downside of this is that for data writes, the primary copy becomes a bottleneck.
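In practice, read/write splitting is usually handled in the application or a proxy layer. A minimal sketch of the idea (the connection objects and class name here are placeholders, not any particular driver's API):

```python
import random

class ReplicatedConnectionRouter:
    """Route writes to the primary and reads to a random replica.

    Hypothetical sketch: 'primary' and 'replicas' stand in for real
    driver connections (e.g. from psycopg2 or mysqlclient).
    """

    def __init__(self, primary, replicas):
        self.primary = primary
        self.replicas = replicas

    def connection_for(self, sql):
        # Statements that change data (or schema) must go to the primary copy.
        verb = sql.lstrip().split(None, 1)[0].upper()
        if verb in ("INSERT", "UPDATE", "DELETE", "CREATE", "ALTER", "DROP"):
            return self.primary
        # Reads can be spread across the read-only replicas.
        return random.choice(self.replicas) if self.replicas else self.primary
```

Note the write-path limitation described above is visible here: no matter how many replicas you add, every `INSERT` still lands on the single primary.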
Federated database – This involves distributing reads and writes across many nodes. This is achieved by partitioning (sharding) the data across multiple database servers. There is still an element of replication, in that some data is kept on all or multiple nodes. Some database products (for example, Cassandra) will do this for you, but for the most part, sharding is done on the client side of the application.
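Client-side sharding can be as simple as hashing a record's key to pick a server. A toy sketch (the shard names are placeholders; production systems typically use consistent hashing instead):

```python
import hashlib

def shard_for(key, shards):
    """Deterministically map a record key onto one shard server."""
    # Hash the key so rows spread evenly, then map onto the shard list.
    digest = hashlib.sha256(str(key).encode()).hexdigest()
    return shards[int(digest, 16) % len(shards)]

# Every read and write for "user:42" is routed to the same server.
shards = ["db-shard-0", "db-shard-1", "db-shard-2"]
target = shard_for("user:42", shards)
```

Because the mapping is deterministic, any application instance can locate a row without a central lookup service.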
Pros:
- Much cheaper hardware costs, as the individual nodes are smaller.
- Upgrades are easier – all you need to do is add an additional node. In theory, you can add an unlimited number of nodes.
- Resilience and fault tolerance are generally easier to achieve and manage due to the multiple nodes.
Cons:
- The massive increase in complexity leads to a number of issues:
  - A higher incidence of software bugs, as the code is more complex.
  - Where sharding is used, backups have to contain data from all of the database partitions.
  - Management of the infrastructure becomes more complex with the increased number of nodes.
- Licensing fees can be higher as you have more nodes to license.
- Datacenter costs in terms of space, cooling and power are higher.
- Networking complexity and costs are higher, and replication lag between nodes can lead to temporarily inconsistent data.
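One concrete taste of that complexity: with naive hash-modulo sharding, "just add a node" remaps most keys to a different shard, forcing a large data migration. A toy illustration (consistent hashing is the usual mitigation):

```python
import hashlib

def shard_index(key, node_count):
    # Naive placement: hash the key and take it modulo the node count.
    digest = hashlib.sha256(str(key).encode()).hexdigest()
    return int(digest, 16) % node_count

# Count how many keys land on a different shard after growing the cluster.
keys = [f"user:{i}" for i in range(10_000)]
moved = sum(1 for k in keys if shard_index(k, 4) != shard_index(k, 5))
print(f"{moved / len(keys):.0%} of keys change shards when growing 4 -> 5 nodes")
```

Most keys move, and every moved key means data copied across the network – one reason horizontal scaling demands more careful operational planning than simply racking another server.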
Scaling your database is a great way to deal with performance bottlenecks. As Big Data becomes more and more prevalent, even the smallest organisations are going to have to look at how to store and process even larger datasets.
We are now well and truly in the era of Big Data – with data sets that are so large, or so complex, that they cannot be processed or stored using traditional databases or applications. Big Data (and by extension, Machine learning) offers a lot of promise, and we are already seeing successes in almost every field. And the way we store and retrieve data is evolving to cope with this trend.
Horizontal scaling is regarded as the more modern approach; however, as you can see, it comes with increased complexity. Your application has to support this model, and as described above, not all database products can achieve it.
Every organisation will need to determine what their requirements are for every application, and which approach works best for them.