The distributed nature of Apache Cassandra, with an architecture that avoids any single point of failure, enables the open source database to deliver the high scalability, near 100% availability, and strong read and write performance required for many data-heavy use cases. These capabilities have deservedly fueled a steady rise in Cassandra’s popularity, but many developers new to the NoSQL database find that they first need to understand a few of its complexities in order to get the best performance from it.
Here are four of the most common pitfalls that challenge those new to Cassandra – and how to avoid them:
1. Immediately Changing Cassandra’s Default Settings
It can be tempting to tinker with Cassandra’s default settings right out of the gate. But it’s a practice that is almost certain to have unintended and negative consequences.
Settings should be fine-tuned only after gaining a solid understanding of the Cassandra approach to data modeling, the historical trends and usage patterns of the client applications, and the potential ramifications for the cluster. Too often, developers make the mistake of prematurely optimizing Cassandra’s settings without first establishing this information.
For example, they will allocate extra memory to Cassandra’s JVM heap, a practice that, while well-intentioned, can actually increase latency (a larger heap can mean longer garbage collection pauses) and degrade the experience for application users. Those just starting out with Cassandra should become well-versed in the database before making even basic changes.
2. Treating Cassandra Like a Relational Database
The practices that optimize relational database modeling do not apply to Cassandra and are often counterproductive. Even so, developers coming to Cassandra from a relational background frequently apply those practices anyway and find this out the hard way.
For example, one of Cassandra’s strengths is its ability to handle high volumes of writes. Developers can take advantage of this to support new access patterns by denormalizing their data models, which results in duplicated data. While duplication is considered an anti-pattern in relational data modeling, developers who work to minimize writes or data duplication in their Cassandra data models are wasting their time and missing out on Cassandra’s advantages in these areas. They should also avoid building data models around data relationships or objects, as they would in a relational database, and instead model around the queries the application will run.
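As a minimal sketch of query-first modeling, the Python snippet below plans one write per query table for a single user record. The table names users_by_id and users_by_email, their columns, and the helper plan_user_writes are hypothetical, chosen only for illustration. The statements are returned as plain strings with their parameters rather than sent to a cluster.

```python
from typing import List, Tuple

# Hypothetical tables, one per query the application needs to serve.
# Each table stores the same user data under a different partition key.
INSERT_BY_ID = "INSERT INTO users_by_id (user_id, email, name) VALUES (?, ?, ?)"
INSERT_BY_EMAIL = "INSERT INTO users_by_email (email, user_id, name) VALUES (?, ?, ?)"


def plan_user_writes(user_id: str, email: str, name: str) -> List[Tuple[str, tuple]]:
    """Return one (statement, parameters) pair per query table."""
    return [
        (INSERT_BY_ID, (user_id, email, name)),
        (INSERT_BY_EMAIL, (email, user_id, name)),
    ]


if __name__ == "__main__":
    # The same record is written twice, once for each lookup path.
    for statement, params in plan_user_writes("u-1", "ada@example.com", "Ada"):
        print(statement, params)
```

The duplication is deliberate: a lookup by ID and a lookup by email each read a single partition, at the cost of one extra write per record.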
Cassandra will face performance challenges when serving applications that produce a heavy load of updates or deletes. To address this, developers should work to spread data evenly throughout the cluster and to minimize the number of partitions each query must read. Time-series data modeling and Cassandra’s TTL (time-to-live) feature can also help when designing the data model. Most of all, many developers will need to recalibrate their ideas about database optimization and develop new skills that fit how Cassandra actually functions.
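One way to picture the time-series and TTL advice is the sketch below, which buckets sensor readings by calendar day and attaches an expiry to each write. The table readings_by_sensor_day, its columns, the seven-day retention period, and the helper partition_key are all assumptions made for illustration, not settings from any particular deployment.

```python
from datetime import datetime, timezone
from typing import Tuple

SEVEN_DAYS_IN_SECONDS = 7 * 24 * 60 * 60  # assumed retention period

# Hypothetical table: partition key is (sensor_id, day), clustering column is ts.
INSERT_READING = (
    "INSERT INTO readings_by_sensor_day (sensor_id, day, ts, value) "
    f"VALUES (?, ?, ?, ?) USING TTL {SEVEN_DAYS_IN_SECONDS}"
)


def partition_key(sensor_id: str, ts: datetime) -> Tuple[str, str]:
    """Bucket a reading by sensor and UTC calendar day.

    ts is expected to be timezone-aware.
    """
    day = ts.astimezone(timezone.utc).strftime("%Y-%m-%d")
    return (sensor_id, day)
```

Bucketing by day bounds the size of each partition and lets a query for one sensor's recent data read a small, known set of partitions, while the TTL expires old rows without the application issuing explicit deletes.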
3. Failing to Continuously Monitor Cassandra
Successfully navigating around the two pitfalls above isn’t the end of the story. While Cassandra is powerful and will prove itself capable of overcoming outages and network partitions, don’t make the mistake of treating it as a set-it-and-forget-it solution. Constant monitoring of key performance indicators such as latency, disk usage, and throughput is critical to maintaining an optimal deployment.
24/7 monitoring is necessary because both internal and external changes to Cassandra usage patterns are very common. Applications developed using agile methods undergo regular revisions, often introducing new features that interact with data in new ways. At the same time, external users shift their patterns, both in response to application changes and their own evolving needs. Changes in usage patterns, even if temporary, can create spikes in read and write operations or changes in the types of data being added to the database. All of this has the potential to trigger increased latency or even data availability issues.
It is crucial that Cassandra operators put in place the infrastructure required to continuously monitor Cassandra deployments. It is equally important that Cassandra operators and developers gain the necessary skills to comprehend the monitoring data produced by Cassandra, so that they are able to effectively respond to alerts, resolve issues, and optimize Cassandra deployments.
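As a rough illustration of turning monitoring data into alerts, the sketch below compares a snapshot of metrics against fixed thresholds. The metric names and threshold values are hypothetical placeholders, not names emitted by Cassandra itself, and a real deployment would more likely rely on a dedicated monitoring stack than on a hand-written check like this one.

```python
from typing import Dict, List

# Assumed alert thresholds; real values depend on the workload and hardware.
THRESHOLDS: Dict[str, float] = {
    "read_latency_p99_ms": 50.0,
    "write_latency_p99_ms": 20.0,
    "disk_usage_pct": 70.0,
}


def find_alerts(
    metrics: Dict[str, float],
    thresholds: Dict[str, float] = THRESHOLDS,
) -> List[str]:
    """Return a sorted list of metric names whose value exceeds its threshold."""
    return sorted(
        name
        for name, limit in thresholds.items()
        # A metric missing from the snapshot is treated as 0.0 and never alerts.
        if metrics.get(name, 0.0) > limit
    )
```

Static thresholds are only a starting point. Because usage patterns shift over time, the limits themselves need to be revisited as the workload changes.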
4. Overlooking the Importance of Security
With all the work it takes to complete production deployments on schedule, it can be tempting to give security a lower priority than it deserves. However, ensuring that Cassandra’s security features are correctly configured is crucial to long-term performance and availability.
An end-to-end security strategy is, of course, essential for protecting proprietary data and other valuable information from continuously evolving threats. But certain industries are also governed by specific legal and regulatory requirements, which delineate how data must be handled and secured. Failure to effectively comply with such requirements may result in substantial penalties. Effective security calls for the implementation of measures that can prevent, detect, and remediate the effects of data breaches. Inherent to Cassandra are many powerful security features that developers would be wise to fully understand and utilize (many do not).
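To make "correctly configured" a little more concrete, the sketch below audits an already-parsed configuration dictionary for a few security-related settings. It assumes the setting names and permissive defaults used in cassandra.yaml in many Cassandra releases (authenticator, authorizer, client_encryption_options, server_encryption_options). These should be checked against the documentation for the version in use, and the helper audit_security is hypothetical.

```python
from typing import Any, Dict, List


def audit_security(config: Dict[str, Any]) -> List[str]:
    """Return warnings for settings that leave authentication,
    authorization, or encryption switched off."""
    warnings: List[str] = []

    # Assumed defaults: a missing setting is treated as the permissive value.
    if config.get("authenticator", "AllowAllAuthenticator") == "AllowAllAuthenticator":
        warnings.append("authentication is disabled")
    if config.get("authorizer", "AllowAllAuthorizer") == "AllowAllAuthorizer":
        warnings.append("authorization is disabled")

    client = config.get("client_encryption_options") or {}
    if not client.get("enabled", False):
        warnings.append("client-to-node encryption is disabled")

    server = config.get("server_encryption_options") or {}
    if server.get("internode_encryption", "none") == "none":
        warnings.append("node-to-node encryption is disabled")

    return warnings
```

A check like this covers only a small part of a security strategy. It says nothing about role management, network controls, or auditing.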
Operators should be well versed in the process of patching Cassandra in production environments. The regular application of software upgrades and patches is critically important for the proper maintenance and security of all open-source technologies, and Cassandra is no exception.
Justin Cameron is a senior software engineer at Instaclustr, where his responsibilities include working with organizations developing complex applications on top of Cassandra, Kafka, and Spark. Prior to joining Instaclustr, Justin was a Security Researcher at BAE Systems.