Apache Cassandra can be a powerful open source data store for applications dealing with large amounts of data. In this blog, we look at what Cassandra is, how it works, and when to use it.
Apache Cassandra is a wide-column datastore designed to run on a large number of commodity hosts spread across multiple datacenters, with no single point of failure.
This means that it can do some impressive things architecturally if you know how to design around the strengths and weaknesses of Cassandra.
As a primer, Cassandra is designed for heavy write ingest, with a high number of parallel writes, which differentiates it from other NoSQL stores like the document-oriented MongoDB, which doesn’t scale as well on writes.
For a broad overview of features and benefits, including answers to frequently asked questions, be sure to check out our Apache Cassandra Overview.
Counters, in particular, are something Cassandra handles in a way that is unlike a traditional RDBMS such as PostgreSQL, MariaDB, or Oracle.
Consider the following business requirement:
We’re an event management company with a limited, finite, discrete number of tickets for a general admissions event. I need to be able to have ticket buyers express an interest in how many tickets they want to buy, and reserve that number of tickets from the pool. If the customer completes the transaction, the tickets are sold, and the total number of tickets is decremented for the event. If the transaction expires after a certain time, the number of available tickets is incremented for the event.
If you’re an experienced DBA, you’ve already realized there are a few ways one might model the data for this in a relational database that supports ACID transactions, specifically row-level locking. Let’s pick a sample schema:
Let’s say that this table is called event_inventory, and it represents a list of events and their price and quantity available. A real application would have a large number of additional requirements to talk about, but we’re going to keep it simple for this blog post.
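The schema itself isn’t reproduced here, so as a hedged sketch, here is one way the relational version of this flow might look. SQLite stands in for PostgreSQL, the table and column names `event_inventory` and `tickets_available` come from the text, and everything else (column list, prices, quantities) is an assumption for illustration:

```python
import sqlite3

# In-memory stand-in for a relational database such as PostgreSQL.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE event_inventory (
        event_id          INTEGER PRIMARY KEY,
        event_name        TEXT NOT NULL,
        ticket_price      REAL NOT NULL,
        tickets_available INTEGER NOT NULL CHECK (tickets_available >= 0)
    )
""")
conn.execute(
    "INSERT INTO event_inventory VALUES (1, 'General Admission', 49.50, 200)"
)

def reserve_tickets(conn, event_id, qty):
    """Atomically decrement inventory; succeed only if enough tickets remain."""
    with conn:  # one transaction: the conditional UPDATE acts as the lock
        cur = conn.execute(
            "UPDATE event_inventory "
            "SET tickets_available = tickets_available - ? "
            "WHERE event_id = ? AND tickets_available >= ?",
            (qty, event_id, qty),
        )
        return cur.rowcount == 1

print(reserve_tickets(conn, 1, 150))  # True: 200 -> 50
print(reserve_tickets(conn, 1, 100))  # False: only 50 left, row untouched
```

The key property is that the decrement and the availability check happen in one atomic statement, which is exactly the row-locking behavior the relational model gives you for free.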
Cassandra, in this example, would be absolutely the worst at updating tickets_available. This data model is almost fundamentally incompatible with Cassandra. Let's look at why.
Depending on how the developer configures the Consistency Level (CL) for an operation, Apache Cassandra guarantees that a write has been acknowledged by a certain number of replicas in the cluster. If one were to choose a CL of ALL, the write would hang open on the client waiting for an OK from every replica in the globally distributed cluster, and would fail if any of them timed out or were down for maintenance. A CL of ALL is about as resilient as running your entire enterprise on a single MySQL server, because the outage of a single node brings the whole operation crashing down.
A CL of ANY means we’re very quickly going to accept that our write went somewhere in the cluster and move on to our next operation. This is no good if we’ve got racks on the East and West Coasts, with concert-goers vying for hard-to-get tickets the moment they go on sale. If there’s a network delay, we could oversell these tickets as sales pile up in one rack but not the other.
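The overselling scenario can be shown with a toy simulation (an assumption for illustration, not real Cassandra or driver code): each coast acknowledges sales against its own last-known copy of the counter, and reconciliation reveals more tickets sold than ever existed.

```python
# Toy simulation of why a fast, locally-acknowledged write path
# can oversell a shared counter.
INVENTORY = 100

# Each rack applies decrements against its own replica of the count.
east_view = INVENTORY
west_view = INVENTORY

east_sold = 60   # sales acknowledged on the east coast before replication
west_sold = 60   # sales acknowledged on the west coast in the same window

east_view -= east_sold   # east believes 40 tickets remain
west_view -= west_sold   # west believes 40 tickets remain

# When the racks finally reconcile, total sales exceed the inventory.
total_sold = east_sold + west_sold
print(total_sold)              # 120 tickets sold...
print(total_sold - INVENTORY)  # ...20 more than ever existed
```

Neither rack did anything wrong locally; the inconsistency only appears when the two views are merged, which is exactly what a network delay between coasts produces.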
If we pick a CL strategy that ensures every geography has a copy of the data, and that the nodes have to agree that each rack in the cluster holds the latest, greatest data from each coast, is Cassandra ready to go with this model? Unfortunately not: we’ve now asked Cassandra to guarantee consistency of this data, which is extremely expensive for it.
This is because Cassandra’s storage engine is append-only. When a value is updated, it isn’t overwritten in place; the new value is written as a fresh entry in a journal, and deletes are recorded as tombstone markers rather than removed immediately.
This means Cassandra essentially has to read through a series of values, perhaps hundreds or thousands of them, to get to its final answer of “200” tickets available. Amplify this across hundreds or thousands of “how many tickets are available?” page views per second, and you get the picture. Suddenly we’re working very hard to do something that PostgreSQL or MySQL does very easily.
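A toy model of that append-only journal (an assumption for illustration, not Cassandra’s actual storage engine) makes the read amplification concrete: every sale appends a delta, and answering the availability question means folding over all of them, where a relational database reads one in-place cell.

```python
# Toy model of read amplification on an append-only counter:
# every update is a new journal entry, never an overwrite,
# and a read must fold over all of them to produce one number.
journal = []

def sell(qty):
    journal.append(-qty)   # updates append; nothing is modified in place

STARTING_TICKETS = 1000
for _ in range(800):
    sell(1)

# One "how many tickets are available?" page view = a scan of 800 entries.
tickets_available = STARTING_TICKETS + sum(journal)
print(tickets_available)   # 200
print(len(journal))        # 800 entries read to answer one question
```

Compaction eventually merges old entries, but between compactions every read of a hot counter pays this scan cost again.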
As an architect, I look at the business problem first. If the business problem is “move from PostgreSQL to Cassandra,” you need to address other problems. If the problem is, “we need a highly available, screaming-fast transactional increment/decrement counter,” Cassandra isn’t the answer here. If the problem is, “we’re generating a ton of metadata around the purchasing experience, we’re streaming that chronology to a MySQL server for analytics, it’s bogging down, and we can’t remarket in time for a successful ad campaign because the batch job takes too long,” then stop! You’ve found your Cassandra use case.
Cassandra is an excellent time-series database. If you partition the data by region, you can balance the user story by geography or market, allowing this theoretical ad remarketing team to develop some interesting applications. Cassandra is a great time-series database for the same reason it’s a terrible transactional counter: it’s doing all of this writing without guaranteeing consistency.
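The partition-by-region idea can be sketched in miniature (a toy model under our own assumptions, not CQL or driver code): bucket events under a (region, day) partition key so each market’s stream lands together, with rows kept in timestamp order the way a clustering column would keep them.

```python
from collections import defaultdict
from datetime import datetime, timezone

# Toy model of a time-series layout: the partition key is (region, day),
# so each market's events land in their own bucket, and rows within a
# bucket stay in timestamp order, like a clustering column.
partitions = defaultdict(list)

def record_event(region, ts, payload):
    partition_key = (region, ts.date().isoformat())
    partitions[partition_key].append((ts, payload))
    partitions[partition_key].sort()  # clustering order: oldest first

record_event("us-east", datetime(2023, 5, 1, 12, 0, tzinfo=timezone.utc), "view")
record_event("us-east", datetime(2023, 5, 1, 9, 30, tzinfo=timezone.utc), "purchase")
record_event("us-west", datetime(2023, 5, 1, 12, 0, tzinfo=timezone.utc), "view")

# The remarketing team reads one region's day without touching the other coast.
east_day = partitions[("us-east", "2023-05-01")]
print([payload for _, payload in east_day])  # ['purchase', 'view']
print(len(partitions))                       # 2 partitions, one per region
```

Because each write only appends to its own regional bucket, the ingest path never has to coordinate across coasts, which is precisely the workload shape Cassandra is built for.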
If these kinds of conversations are the ones you want to be having with your team, we would love to schedule a Webex with your team and our architects to discuss Cassandra support. If you’re just getting into Cassandra or excavating yourself from a bit of a Proof of Concept gone wrong, we can share some incredible insights that our team has developed from decades serving the Fortune 500.
Want to try OpenLogic support for free? Learn more about our free support trial.
This case study looks at how a SaaS provider improved system stability and scale with a Cassandra deployment.
See how Cassandra, Kafka, and Spark can team up to tackle large scale data streaming in this white paper.
Enterprise Architect, OpenLogic by Perforce
With over a decade of experience in enterprise software architecture, engineering, and operations for the Fortune 500, Connor is working to build and support cloud native solutions for OpenLogic customers around the world.