Architecting Applications with Apache Cassandra
Apache Cassandra can be a powerful open source data store for applications dealing with large amounts of data. In this blog, we look at what Cassandra is, how it works, and when to use it.
Apache Cassandra is a wide-column datastore designed to run on a large number of commodity hosts spread across multiple geographically distributed datacenters, with no single point of failure.
This means that it can do some impressive things architecturally if you know how to design around the strengths and weaknesses of Cassandra.
How It's Used
As a primer, Cassandra is designed for heavy ingress, with a high number of parallel writes, which differentiates it from document stores like MongoDB that don’t scale as well on writes.
For a broad overview of features and benefits, including answers to frequently asked questions, be sure to check out our Apache Cassandra Overview.
What Apache Cassandra Shouldn’t Be Used For
Consider the following business requirement:
We’re an event management company with a limited, finite, discrete number of tickets for a general admissions event. I need to be able to have ticket buyers express an interest in how many tickets they want to buy, and reserve that number of tickets from the pool. If the customer completes the transaction, the tickets are sold, and the total number of tickets is decremented for the event. If the transaction expires after a certain time, the number of available tickets is incremented for the event.
If you’re an experienced DBA, you’ve already realized there are a few ways one might model this data in a relational database that supports ACID transactions, specifically the ability to lock a row. Let’s pick a sample schema:
Let’s say that this table is called event_inventory, and it represents a list of events and their price and quantity available. A real application would have a large number of additional requirements to talk about, but we’re going to keep it simple for this blog post.
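To make that concrete, here is a minimal sketch of the relational approach, using SQLite as a stand-in ACID database. The column names (event_id, ticket_price, tickets_available) are assumptions for illustration, not from any real system. The single UPDATE both checks availability and decrements the counter atomically, which is exactly the row-locking behavior this post argues Cassandra is poorly suited for:

```python
import sqlite3

# Hypothetical event_inventory schema; column names are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE event_inventory (
        event_id          INTEGER PRIMARY KEY,
        event_name        TEXT    NOT NULL,
        ticket_price      REAL    NOT NULL,
        tickets_available INTEGER NOT NULL
    )
""")
conn.execute(
    "INSERT INTO event_inventory VALUES (1, 'General Admission', 45.00, 200)"
)

def reserve_tickets(conn, event_id, qty):
    """Atomically reserve qty tickets, relying on the database's locking."""
    with conn:  # opens a transaction; commits on success, rolls back on error
        cur = conn.execute(
            "UPDATE event_inventory "
            "SET tickets_available = tickets_available - ? "
            "WHERE event_id = ? AND tickets_available >= ?",
            (qty, event_id, qty),
        )
        return cur.rowcount == 1  # False if not enough tickets remained

print(reserve_tickets(conn, 1, 2))   # True: 2 tickets reserved
remaining = conn.execute(
    "SELECT tickets_available FROM event_inventory WHERE event_id = 1"
).fetchone()[0]
print(remaining)                     # 198
```

The WHERE clause doubles as the availability check, so two concurrent buyers can never drive the counter negative; the database serializes them on the row.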
Cassandra, in this example, would be absolutely the worst choice for updating tickets_available. This data model is fundamentally incompatible with Cassandra. Let's look at why.
Attaining Consistent Data
Depending on the Consistency Level (CL) the developer configures for an operation, Apache Cassandra guarantees that a write has been acknowledged by some number of replica nodes in the cluster. If one were to choose ALL, the write would hang open on the client waiting for an OK from every replica in the globally distributed cluster, and it would fail if any of them timed out or were down for maintenance. A CL of ALL is about as resilient as having a single MySQL server for your entire enterprise, because the outage of a single node brings the entire operation crashing down.
A CL of ANY means we very quickly accept that our write went somewhere in the cluster and move on to our next operation. This is no good if we’ve got racks on the east and west coasts and concert-goers vying for hard-to-get tickets the moment they go on sale. If there’s a network delay, we could over-sell these tickets as sales pile up locally in one rack but not the other.
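Here is a toy simulation of that over-sell, with two "coasts" each decrementing a local view of the same ticket pool before replication reconciles them. The numbers are illustrative, not Cassandra internals:

```python
# Toy model of weak-consistency writes: each coast accepts sales against
# its own local copy of the counter and reconciles later.
TICKETS = 10

east_sold = 0
west_sold = 0

# Before replication catches up, each coast checks only its local view.
for _ in range(8):
    if east_sold < TICKETS:          # east believes tickets remain
        east_sold += 1
for _ in range(8):
    if west_sold < TICKETS:          # west believes tickets remain
        west_sold += 1

total_sold = east_sold + west_sold   # reconciliation reveals the overlap
print(total_sold)                    # 16 tickets "sold" from a pool of 10
```

Both coasts were locally correct, yet the merged result exceeds the pool; that is exactly the hazard of accepting a write anywhere and sorting it out later.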
Picking a Consistency Level
If we pick a CL strategy that ensures every geography has a copy of the data, and that all of the nodes have to agree each rack in the cluster has the latest, greatest data from each coast, is Cassandra ready to go with this model? Unfortunately not, because we’ve asked Cassandra to guarantee consistency of this data, which is extremely expensive for it.
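The rule behind this, common to Dynamo-style stores like Cassandra, is that a read is only guaranteed to see the latest write when the read and write replica sets overlap: with N replicas, reading R of them and writing W of them, you need R + W > N. A sketch:

```python
# Quorum overlap rule for Dynamo-style replication: every read quorum
# must intersect every write quorum to guarantee the read sees the
# most recent write.
def overlaps(n, r, w):
    """True if any R-replica read must intersect any W-replica write."""
    return r + w > n

N = 3                      # replication factor of 3
print(overlaps(N, 2, 2))   # QUORUM reads + QUORUM writes: consistent
print(overlaps(N, 1, 1))   # CL=ONE both ways: stale reads possible
```

The cost is built into the inequality: stronger consistency means waiting on more replicas per operation, which is precisely the expense the paragraph above describes.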
This is because Cassandra uses an append-only storage model, with tombstones to mark deleted values. When a value is updated, it isn’t overwritten in place; a new version is appended to the log, and old versions linger until compaction merges them away.
This means Cassandra essentially has to read through a series of versions, perhaps hundreds or thousands of them, to arrive at its final answer of “200” tickets available. Amplify this across hundreds or thousands of requests per second, one for every “how many tickets are available” page view, and you get the picture. Suddenly we’re working very hard to do something that PostgreSQL or MySQL does very easily.
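Here is a toy append-only store that mimics this read amplification: every decrement is written as a fresh version, and reading the "current" count means scanning versions rather than reading one cell. This is a deliberate simplification of Cassandra's real SSTable and compaction machinery:

```python
# Toy append-only counter: updates are new (timestamp, value) records,
# never in-place overwrites, so reads must resolve many versions.
log = []  # append-only list of (timestamp, value) pairs

for ts in range(1000):             # a thousand ticket sales...
    log.append((ts, 1199 - ts))    # ...each written as a fresh version

def current_value(log):
    # Last-write-wins: scan every version to find the newest one.
    latest = max(log, key=lambda rec: rec[0])
    return latest[1]

print(current_value(log))  # 200 tickets available, after scanning
print(len(log))            # 1000 versions read to answer one query
```

A relational database answers the same question by reading one row; here the work grows with the update history until a compaction pass cleans it up.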
How to Accomplish More with Cassandra
As an architect, I look at the business problem first. If the business problem is “move from PostgreSQL to Cassandra,” you need to address other problems first. If the problem is “we need a highly available, screaming-fast transactional increment/decrement counter,” Cassandra isn’t the answer. If the problem is “we’re generating a ton of metadata around the purchasing experience, we’re streaming that chronology to a MySQL server for analytics, it’s bogging down, and we can’t remarket in time for a successful ad campaign because batch takes too long...” well, stop! You’ve found your Cassandra use case.
Powerful Time-Series Database
Cassandra is an excellent time-series database. If you partition the data by region, you can balance load by geography or market, allowing this theoretical ad remarketing team to build some interesting applications. Cassandra is a great time-series database for the same reason it’s a terrible transactional counter: it does all of this writing without guaranteeing consistency.
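As a sketch of what that partitioning might look like, here is a hypothetical layout keyed by (region, day), mirroring a CQL primary key along the lines of PRIMARY KEY ((region, day), event_time). All table and column names here are invented for illustration:

```python
# Hypothetical time-series layout: each (region, day) pair is its own
# partition, so every geography's writes land on separate partitions
# and a day of one market's activity is a single-partition read.
from collections import defaultdict
from datetime import datetime

partitions = defaultdict(list)

def record_event(region, event_time, payload):
    day = event_time.strftime("%Y-%m-%d")
    partitions[(region, day)].append((event_time, payload))

record_event("us-east", datetime(2024, 5, 1, 9, 30), {"page": "checkout"})
record_event("us-east", datetime(2024, 5, 1, 10, 0), {"page": "cart"})
record_event("us-west", datetime(2024, 5, 1, 9, 45), {"page": "landing"})

# Reading one market's day of activity touches exactly one partition.
east_day = sorted(partitions[("us-east", "2024-05-01")])
print(len(east_day))  # 2 events, in time order
```

Because each write is an append to its own partition and no read ever needs a globally agreed counter, this workload plays to Cassandra's strengths rather than against them.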
Getting Support for Your Apache Cassandra Deployment
If these kinds of conversations are the ones you want to be having with your team, we would love to schedule a Webex with your team and our architects to discuss Cassandra support. If you’re just getting into Cassandra or excavating yourself from a bit of a Proof of Concept gone wrong, we can share some incredible insights that our team has developed from decades serving the Fortune 500.
This case study looks at how a SaaS provider improved system stability and scale with a Cassandra deployment.
See how Cassandra, Kafka, and Spark can team up to tackle large scale data streaming in this white paper.