Cassandra is a fantastic NoSQL database. It was designed with Big Data in mind. When I say big data I simply mean data that doesn’t fit on one machine. Cassandra is fast, robust (no single point of failure), highly available and linearly scalable database with multi-data center support. It achieves all that by having laser focus and carefully crafting the features it supports–and just as importantly–the features it doesn’t support.
Surely Cassandra can do that too?
cqlsh:test> select * from messages where id < 5;Bad Request: Only EQ and IN relation are supported on the partition key (unless you use the token() function)
Nope. Not really. With Cassandra you can almost get that by specifying a list of ids:
Cassandra Partition Key and Clustering Key
There is much more to Cassandra's primary keys. First of all, unlike relational databases, there must be a primary key, but that's not that unusual because every relational database I have ever used always defined a primary key for each table as a best practice. But, Cassandra's primary keys are actually composed of two separate components: the partition key and the clustering columns. Each one of these components can be comprised of multiple columns. All the rows with the same partition key live on the same machine (well, there is replication, so all rows with the same partition key lives on each one of its replica set machine).
That has very important implications for how you should select your partition key. If too many rows have the same partition key and exceed the storage capacity of their designated machine you're in trouble even if the cluster as a whole has plenty of capacity left. Let's create another table that stores server log messages and evolve the Cassandra schema. Here is version 1:
CREATE TABLE log_central( server_ip TEXT, log_message TEXT, timestamp TIMESTAMP, PRIMARY KEY (server_ip);
The primary key here is the server ip address. We can query by the ip address and get all the messages from each server:
select * from log_central where server_ip='111.222.333.444'
That's pretty good, but there are two problems. First, a server may generate a lot?of log messages, so that eventually it may exceed the capacity of the machine and we'll run out of storage space. Another problem is that normally you don't care about all?the log messages. Typically people want to tail the logs (check the last N or the last 5 minutes of log messages) or check a specific time range close to a particular event (e.g. what happened in the last hour before the server ran out of memory).
Let's try version 2:
CREATE TABLE log_central( server_ip TEXT, timestamp TIMESTAMP, log_message TEXT, PRIMARY KEY ((server_ip, timestamp)));
Let's insert some data to understand what's going on:
INSERT INTO log_central (server_ip, timestamp, log_message) VALUES ('111.222.333.444', '2014-12-11 00:17:00', 'something wrong');INSERT INTO log_central (server_ip, timestamp, log_message) VALUES ('111.222.333.444', '2015-01-11 00:19:22', 'something wrong');INSERT INTO log_central (server_ip, timestamp, log_message) VALUES ('555.666.777.888', '2015-01-11 00:20:33', 'something wrong');INSERT INTO log_central (server_ip, timestamp, log_message) VALUES ('111.222.333.444', '2015-01-11 00:25:17', 'something wrong');
Now the primary key has a compound partition key that is both the server_ip and the timestamp, which means that all the log messages from a certain server with the same timestamp have to be on the same replica set. There probably won't be too many log messages from the same server with the exact timestamp, so that solves problem #1.
No more running out of capacity because of too big partitions. But, we lost the ability to query all the messages from one server:
cqlsh:test> select * from log_central where server_ip='111.222.333.444';Bad Request: Partition key part timestamp must be restricted since preceding part is
Oh, no. You have?to specify the entire partition key, which means you need to know (or guess) the timestamp of each log message and then query for it explicitly:
There is definitely something wrong with that! What's the point of having a database if you have to know what's in it to query it?
You could execute a lot of queries for a whole range of time range and then merge the results from each one, but this is both inefficient and inelegant.
Let's add a new column 'month' to the table and update the schema for version 3:
CREATE TABLE log_central( server_ip TEXT, month TEXT, timestamp TIMESTAMP, log_message TEXT, PRIMARY KEY ((server_ip, month)));
Let's insert the data to the new version and add the month as well:
INSERT INTO log_central (server_ip, month, timestamp, log_message) VALUES ('111.222.333.444', '201412', '2014-12-11 00:17:00', 'something wrong');INSERT INTO log_central (server_ip, month, timestamp, log_message) VALUES ('111.222.333.444', '201501', '2015-01-11 00:19:22', 'something wrong');INSERT INTO log_central (server_ip, month, timestamp, log_message) VALUES ('555.666.777.888', '201501', '2015-01-11 00:20:33', 'something wrong');INSERT INTO log_central (server_ip, month, timestamp, log_message) VALUES ('111.222.333.444', '201501', '2015-01-11 00:25:17', 'something wrong');
Now, the partitioning is at a much more coarse-grained level. The value of each month column is a string in the format 'YYYYMM', which represents the year and month of the message's timestamp. All the messages from one server over a whole month are grouped together on the same machine (replica set).
Let's check the data in the table:
What happened? There are only 3 rows instead of 4. Welcome to another Cassandra surprise and another deviation from the cozy relational model. An insert is always an insert. This means that if a row already exists with the same primary key then it is simply replaced by an insert with the same primary key.
Clustering Columns and Range Queries
Enter clustering keys. As I mentioned earlier, the Cassandra primary key is composed of two components. So far, we have used only the partitioning key component. Let's add a clustering column and explain what's going on.
CREATE TABLE log_central( server_ip TEXT, month TEXT, timestamp TIMESTAMP, log_message TEXT, PRIMARY KEY ((server_ip, month), timestamp));
The timestamp is added as a clustering key, so the entire primary for the purposes of distinguishing between rows includes the server_ip, the month and the timestamp messages from the same server in the same month are still considered different if they have a different time stamp.
The insert commands are the same, but now all the data is in the database:
And, you can finally do month queries:
But, that's not all! For clustering keys you can finally use range queries. For example to query all the log messages from a server in a particular hour you can run this:
This means that although the partitioning key is at whole month granularity, adding the timestamp as a clustering key allows for fine-grained querying within each month. This is where Cassandra shines. If you model and partition your data properly you'll get lightning fast distributed queries.
Note, that if you want to query data from multiple months on the same server you will have to execute multiple queries -- one for each (server, month) combination and merge the results yourself later. The same applies of course to data from multiple servers at the same time.
Data Modeling with Cassandra
So, despite the familiar SQL-like facade of Cassandra CQL the design goals and architecture of Cassandra requires a totally different modeling approach. With relational databases it is all about normalization as a default. That allows running arbitrary ad-hoc queries including joins between tables with a few sprinkled indexes here and there to make it run fast. People may denormalize their data as an optimization technique, but, in general, you can query anything anyway you want. Much of this school of thought is applicable when all the data is hosted on a local disk.
Cassandra is very different -- it was designed from the get go as a distributed database and every major design decision is impacted from the reality of multiple nodes interacting over a network connection.
As you saw earlier, coming up with the right schema requires understanding Cassandra's internals (and I've only scratched the surface), understanding the statistics of your data and the query access patterns.
In the last example, the schema I ended up with allows efficient querying of finely grained time-ranged data for a single server. But, what if you run an enormous elastic cloud where servers die and get replaced and server ips are not stable? In this scenario it would be difficult to manage without additional metadata to keep track of server ips and not trivial to specify the partition key for queries.
Cassandra CQL talks the SQL talk (syntax), but doesn't walk the SQL walk (semantics). The familiar SQL terms of table, row and primary key are used in CQL, but they don't mean the same thing. Queries are significantly more limited. Cassandra is an amazingly powerful database, but with great power comes great responsibility. In this case, your responsibility is to dive deep and really understand Cassandra and the use cases for which it is best suited. Don't let the shallow SQL pretense of Cassandra CQL confuse you.