Login | Register   
RSS Feed
Download our iPhone app
Browse DevX
Sign up for e-mail newsletters from DevX


Cassandra for SQL Developers

Dive deeper into Cassandra's design and implementation and find sensible advice and guidelines to get your development rolling.



Cassandra is a fantastic NoSQL database. It was designed with Big Data in mind. When I say big data I simply mean data that doesn't fit on one machine. Cassandra is fast, robust (no single point of failure), highly available and linearly scalable database with multi-data center support. It achieves all that by having laser focus and carefully crafting the features it supports--and just as importantly--the features it doesn't support.

Managing Large Volumes of Data with Apache Cassandra NoSQL does a great job of introducing Cassandra in general. In this article, I will dive deeper into Cassandra's design and implementation and provide sensible advice and guidelines.

30,000 Foot View of Cassandra's Storage Model

The first rule of the Cassandra storage model is that all disk access is just appends. Nothing is ever changing. This allows remarkable write performance. Cassandra doesn't need to do expansive seeks to find the right place to write some data. How can you do inserts, updates and deletes with just appends? The trick is using a cascading storage model. When a client requests to insert, update or delete some data. The request is written to an append-only commit log file. Once, the request is in the commit log the client will get an acknowledgment that his request succeeded. I will explain thoroughly fault tolerance, consistency and more in future installments.

Let's focus on the happy path for now. The requests in the commit log are then processed and indexed into a memtable (an in-memory data structure). When the memtable is full it is written to the workhorse of persistent storage - the SSTable (sorted string table). The entire SSTable is written once and it never changes. What happens when an update request comes in for some data that is stored in such an immutable SSTable? Easy. It is written to a memtable first and later to a new SSTable. When a query comes in for the modified data Cassandra checks if the data is in the memtable and returns it form there or in the SSTablesta from newest to oldest. This ensures that the client sees the latest data. For deleted data a special record called "tombstone" is created to avoid returning deleted data. Every now and then a process called "compaction" will consolidate multiple SSTables into a single new SSTable, which will contain only the latest data and will get rid of the all the old SSTables.

Cassandra CQL 3.0 - Almost, but not Completely, Different from SQL

The Cassandra storage model is very different than the way typical SQL database engines store their data, but the Cassandra designers wanted to provide the cozy familiarity of SQL to Cassandra users. This resulted in CQL 3.0, which features syntax very similar to SQL. Unfortunately, the semantics are often totally different. Basic terms like row, table and primary key don't mean the same thing in CQL as in SQL. On top of that Cassandra has its own terminology that is often mentioned in the documentation, but there aren't always direct analogs SQL nomenclature. This creates a serious cognitive dissonance when you start modeling with Cassandra.

The best way to understand why this situation exists is to imagine how you would implement the Cassandra core features using a relational DB. People have done it many times using as a shared relational DB. To split your data across multiple machines you put some of the data of each table on each machine. This is super complicated of course, because your application code needs to figure out where each piece of data should go both for inserting/updating/deleting and for querying. Then, when the data grows you need to add more machines and redistribute the data between the new machines and don't forget dealing with nodes going down. So, Cassandra did all this hard work but how do you put an SQL-like face on all of it? Keep reading if you dare...

SQL vs. CQL - Fight!

Suppose we are writing a chat application and we want to keep track of all the messages that users send in a database. For the purpose of this article our chat application has just a single shared room where every user can see the messages from all the other users.

In a relational database (e.g. MySQL) we may define a table like so:

CREATE TABLE messages( id INT, sender TEXT, message TEXT, timestamp TIMESTAMP, PRIMARY KEY (id));

In Cassandra, we can define it with exactly the same syntax:

CREATE TABLE messages( id INT, sender TEXT, message TEXT, timestamp TIMESTAMP, PRIMARY KEY (id));

Note, that Cassandra has its own list of data types, which is pretty close to the data types used by most relational databases. This is not a problem because even relational databases differ from each other in this regard.

Let's see in the command-line clients what this table looks like in each database. We can use the 'describe messages' command in MySQL client and 'describe table messages' in cqlsh.


In Cassandra:

cqlsh:test> describe table messages; CREATE TABLE messages ( id int, message text, sender text, "timestamp" timestamp, PRIMARY KEY (id) ) WITH bloom_filter_fp_chance=0.010000 AND caching='KEYS_ONLY' AND comment='' AND dclocal_read_repair_chance=0.000000 AND gc_grace_seconds=864000 AND index_interval=128 AND read_repair_chance=0.100000 AND replicate_on_write='true' AND populate_io_cache_on_flush='false' AND default_time_to_live=0 AND speculative_retry='NONE' AND memtable_flush_period_in_ms=0 AND compaction={'class': 'SizeTieredCompactionStrategy'} AND compression={'sstable_compression': 'LZ4Compressor'};

Ok. Never mind the details, but you can see that Cassandra attaches a lot of operational information to each table. Also, the fields are not ordered the same way in which they were specified in the CREATE TABLE command. The 'message' field' shows up before the 'sender' field.

Let's insert some data into both tables. Guess what? You can use the familiar INSERT syntax to insert data into both MySQL and Cassandra:

INSERT INTO messages (id, message, sender, timestamp) VALUES (1, 'Hello', 'clingy-gigi', '2015-01-10 11:32:00'); INSERT INTO messages (id, message, sender, timestamp) VALUES (2, 'Will anybody talk to me?', 'clingy-gigi', '2015-01-10 11:32:01'); INSERT INTO messages (id, message, sender, timestamp) VALUES (3, 'Talk to the hand', 'evil-gigi', '2015-01-10 11:32:02'); INSERT INTO messages (id, message, sender, timestamp) VALUES (4, 'Please?', 'clingy-gigi', '2015-01-10 11:32:03'); INSERT INTO messages (id, message, sender, timestamp) VALUES (5, 'Fugetaboutit', 'evil-gigi', '2015-01-10 11:32:04'); INSERT INTO messages (id, message, sender, timestamp) VALUES (6, 'Pretty, pretty please?', 'clingy-gigi', '2015-01-10 11:32:05'); INSERT INTO messages (id, message, sender, timestamp) VALUES (7, 'Ain''t nobody got time for that', 'evil-gigi', '2015-01-10 11:32:06'); INSERT INTO messages (id, message, sender, timestamp) VALUES (8, 'I really need someone to talk to', 'clingy-gigi', '2015-01-10 11:32:07'); INSERT INTO messages (id, message, sender, timestamp) VALUES (9, 'Y U No shut up?', 'evil-gigi', '2015-01-10 11:32:08');

How about some queries? Let's get all the messages from MySQL:

And, from Cassandra:

Still looks pretty much the same. Although, in Cassandra the data is not ordered by id as it is in MySQL. Hey, this Cassandra thing doesn't look that bad. You can just take your mad SQL skills and apply them to Cassandra... Or could you?

Let's try getting just the messages from evil-gigi.


In Cassandra:

cqlsh:test> select * from messages where sender='evil-gigi'; Bad Request: No indexed columns present in by-columns clause with Equal operator

Oops. What happened? Cassandra doesn't support ad-hoc queries. It just doesn't. This is just one among many limitations and the reason always comes back to Cassandra's design goals.

Cassandra does support secondary indexes that allow this type of query. But, secondary indexes come with so many strings attached and sinister warnings about when it is proper to use them that most people just avoid them. Here are a few good articles that discuss the finer points of secondary indexes:

Primary Keys in Cassandra

Let's talk about primary keys. In relational databases a table with a primary key is always indexed by its primary key. This means you can query efficiently for ranges of primary keys (or any indexed column for that matter) such as:

Surely Cassandra can do that too?

cqlsh:test> select * from messages where id < 5; Bad Request: Only EQ and IN relation are supported on the partition key (unless you use the token() function)

Nope. Not really. With Cassandra you can almost get that by specifying a list of ids:

Cassandra Partition Key and Clustering Key

There is much more to Cassandra's primary keys. First of all, unlike relational databases, there must be a primary key, but that's not that unusual because every relational database I have ever used always defined a primary key for each table as a best practice. But, Cassandra's primary keys are actually composed of two separate components: the partition key and the clustering columns. Each one of these components can be comprised of multiple columns. All the rows with the same partition key live on the same machine (well, there is replication, so all rows with the same partition key lives on each one of its replica set machine).

That has very important implications for how you should select your partition key. If too many rows have the same partition key and exceed the storage capacity of their designated machine you're in trouble even if the cluster as a whole has plenty of capacity left. Let's create another table that stores server log messages and evolve the Cassandra schema. Here is version 1:

CREATE TABLE log_central( server_ip TEXT, log_message TEXT, timestamp TIMESTAMP, PRIMARY KEY (server_ip);

The primary key here is the server ip address. We can query by the ip address and get all the messages from each server:

select * from log_central where server_ip='111.222.333.444'

That's pretty good, but there are two problems. First, a server may generate a lot of log messages, so that eventually it may exceed the capacity of the machine and we'll run out of storage space. Another problem is that normally you don't care about all the log messages. Typically people want to tail the logs (check the last N or the last 5 minutes of log messages) or check a specific time range close to a particular event (e.g. what happened in the last hour before the server ran out of memory).

Let's try version 2:

CREATE TABLE log_central( server_ip TEXT, timestamp TIMESTAMP, log_message TEXT, PRIMARY KEY ((server_ip, timestamp)));

Let's insert some data to understand what's going on:

INSERT INTO log_central (server_ip, timestamp, log_message) VALUES ('111.222.333.444', '2014-12-11 00:17:00', 'something wrong'); INSERT INTO log_central (server_ip, timestamp, log_message) VALUES ('111.222.333.444', '2015-01-11 00:19:22', 'something wrong'); INSERT INTO log_central (server_ip, timestamp, log_message) VALUES ('555.666.777.888', '2015-01-11 00:20:33', 'something wrong'); INSERT INTO log_central (server_ip, timestamp, log_message) VALUES ('111.222.333.444', '2015-01-11 00:25:17', 'something wrong');

Now the primary key has a compound partition key that is both the server_ip and the timestamp, which means that all the log messages from a certain server with the same timestamp have to be on the same replica set. There probably won't be too many log messages from the same server with the exact timestamp, so that solves problem #1.

No more running out of capacity because of too big partitions. But, we lost the ability to query all the messages from one server:

cqlsh:test> select * from log_central where server_ip='111.222.333.444'; Bad Request: Partition key part timestamp must be restricted since preceding part is

Oh, no. You have to specify the entire partition key, which means you need to know (or guess) the timestamp of each log message and then query for it explicitly:

There is definitely something wrong with that! What's the point of having a database if you have to know what's in it to query it?

You could execute a lot of queries for a whole range of time range and then merge the results from each one, but this is both inefficient and inelegant.

Let's add a new column 'month' to the table and update the schema for version 3:

CREATE TABLE log_central( server_ip TEXT, month TEXT, timestamp TIMESTAMP, log_message TEXT, PRIMARY KEY ((server_ip, month)));

Let's insert the data to the new version and add the month as well:

INSERT INTO log_central (server_ip, month, timestamp, log_message) VALUES ('111.222.333.444', '201412', '2014-12-11 00:17:00', 'something wrong'); INSERT INTO log_central (server_ip, month, timestamp, log_message) VALUES ('111.222.333.444', '201501', '2015-01-11 00:19:22', 'something wrong'); INSERT INTO log_central (server_ip, month, timestamp, log_message) VALUES ('555.666.777.888', '201501', '2015-01-11 00:20:33', 'something wrong'); INSERT INTO log_central (server_ip, month, timestamp, log_message) VALUES ('111.222.333.444', '201501', '2015-01-11 00:25:17', 'something wrong');

Now, the partitioning is at a much more coarse-grained level. The value of each month column is a string in the format 'YYYYMM', which represents the year and month of the message's timestamp. All the messages from one server over a whole month are grouped together on the same machine (replica set).

Let's check the data in the table:

What happened? There are only 3 rows instead of 4. Welcome to another Cassandra surprise and another deviation from the cozy relational model. An insert is always an insert. This means that if a row already exists with the same primary key then it is simply replaced by an insert with the same primary key.

Clustering Columns and Range Queries

Enter clustering keys. As I mentioned earlier, the Cassandra primary key is composed of two components. So far, we have used only the partitioning key component. Let's add a clustering column and explain what's going on.

Version 4:

CREATE TABLE log_central( server_ip TEXT, month TEXT, timestamp TIMESTAMP, log_message TEXT, PRIMARY KEY ((server_ip, month), timestamp));

The timestamp is added as a clustering key, so the entire primary for the purposes of distinguishing between rows includes the server_ip, the month and the timestamp messages from the same server in the same month are still considered different if they have a different time stamp.

The insert commands are the same, but now all the data is in the database:

And, you can finally do month queries:

But, that's not all! For clustering keys you can finally use range queries. For example to query all the log messages from a server in a particular hour you can run this:

This means that although the partitioning key is at whole month granularity, adding the timestamp as a clustering key allows for fine-grained querying within each month. This is where Cassandra shines. If you model and partition your data properly you'll get lightning fast distributed queries.

Note, that if you want to query data from multiple months on the same server you will have to execute multiple queries -- one for each (server, month) combination and merge the results yourself later. The same applies of course to data from multiple servers at the same time.

Data Modeling with Cassandra

So, despite the familiar SQL-like facade of Cassandra CQL the design goals and architecture of Cassandra requires a totally different modeling approach. With relational databases it is all about normalization as a default. That allows running arbitrary ad-hoc queries including joins between tables with a few sprinkled indexes here and there to make it run fast. People may denormalize their data as an optimization technique, but, in general, you can query anything anyway you want. Much of this school of thought is applicable when all the data is hosted on a local disk.

Cassandra is very different -- it was designed from the get go as a distributed database and every major design decision is impacted from the reality of multiple nodes interacting over a network connection.

As you saw earlier, coming up with the right schema requires understanding Cassandra's internals (and I've only scratched the surface), understanding the statistics of your data and the query access patterns.

In the last example, the schema I ended up with allows efficient querying of finely grained time-ranged data for a single server. But, what if you run an enormous elastic cloud where servers die and get replaced and server ips are not stable? In this scenario it would be difficult to manage without additional metadata to keep track of server ips and not trivial to specify the partition key for queries.


Cassandra CQL talks the SQL talk (syntax), but doesn't walk the SQL walk (semantics). The familiar SQL terms of table, row and primary key are used in CQL, but they don't mean the same thing. Queries are significantly more limited. Cassandra is an amazingly powerful database, but with great power comes great responsibility. In this case, your responsibility is to dive deep and really understand Cassandra and the use cases for which it is best suited. Don't let the shallow SQL pretense of Cassandra CQL confuse you.

Comment and Contribute






(Maximum characters: 1200). You have 1200 characters left.



Thanks for your registration, follow us on our social networks to keep up-to-date