Managing Large Volumes of Data with Apache Cassandra NoSQL

Overview

Apache Cassandra is one of the most popular and scalable open source NoSQL databases. It is well suited to managing large volumes of unstructured, semi-structured and structured data across multiple data centers and cloud environments. Cassandra delivers high scalability and availability across many commodity servers without compromising performance. Its architecture has no single point of failure, and it provides a powerful data model designed for flexibility and fast response times. Linear scalability combined with fault-tolerant hardware or cloud infrastructure makes it a strong fit for critical data.

Introduction

Relational databases are very good at solving certain types of data storage problems. But because an RDBMS is designed with a different focus, problems arise when scaling up to a large volume of data. Joins become expensive at that scale, so we need a way to get rid of them, which means de-normalizing the data. De-normalization in turn means maintaining multiple copies of the data, and it compromises the design of both the database and the application. Under these conditions, the solutions provided by NoSQL seem less radical and less scary than we may have thought. The design goals of a NoSQL database must be understood clearly before using it in any application.
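The trade-off between joins and de-normalized copies can be illustrated with a small sketch. This is plain Python rather than Cassandra code, and the sample data and function names are hypothetical, chosen only to show that a de-normalized read needs no join but stores the same value more than once.

```python
# Hypothetical sample data; the names are illustrative only.
users = {1: {"name": "Ana"}, 2: {"name": "Raj"}}
orders = [
    {"order_id": 100, "user_id": 1, "item": "disk"},
    {"order_id": 101, "user_id": 2, "item": "cable"},
    {"order_id": 102, "user_id": 1, "item": "rack"},
]


def items_with_join(user_name):
    """Normalized layout: look up the user first, then match orders (a join)."""
    user_ids = {uid for uid, user in users.items() if user["name"] == user_name}
    return [order["item"] for order in orders if order["user_id"] in user_ids]


# De-normalized layout: every order row carries its own copy of the user name,
# so a read is a single lookup, at the cost of duplicated data.
orders_by_user_name = {}
for order in orders:
    name = users[order["user_id"]]["name"]
    row = {"order_id": order["order_id"], "item": order["item"], "user_name": name}
    orders_by_user_name.setdefault(name, []).append(row)


def items_without_join(user_name):
    """De-normalized layout: one lookup by user name, no join needed."""
    return [row["item"] for row in orders_by_user_name.get(user_name, [])]
```

Both functions return the same items for a given user. The second one avoids the join, but the user name is now stored once per order and every copy has to be updated if the name changes.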

Design goals of Cassandra NoSQL database

The design goals of a NoSQL database are completely different from those of a relational database, so the choice between a NoSQL DB and an RDBMS depends on the type of application and its requirements. ACID transactions provide a strong consistency model for traditionally designed web applications. But scalability comes at a cost and conflicts with some of the rules followed in RDBMS design. Promoting availability over consistency is therefore one of the key design factors for NoSQL databases. The common design goals of Cassandra are stated below.

High performance
Horizontal scalability
Simplicity
Schema flexibility

Cassandra architecture to manage large data volume

NoSQL databases are typically distributed across a number of commodity nodes. Cassandra is also distributed across a number of nodes, and it follows a ‘masterless’ architecture: all nodes are the same, and no single node controls the others. Cassandra automatically distributes data across all of the commodity nodes, which form the ‘ring’ known as a database cluster. Because the data is partitioned across the cluster automatically and transparently, developers do not need to do anything programmatically. Another important feature of the Cassandra architecture is its built-in, customizable replication. Redundant copies of the data are stored on multiple nodes in the Cassandra ring. If a node fails, the same data is retrieved from the other nodes that hold replicas. Replication can be configured in the following ways.

Across one data center
Across multiple data centers
Across multiple cloud infrastructures
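The ring, automatic partitioning and replication described above can be sketched as a toy model. This is not Cassandra's implementation: the Ring class, its method names, the one-token-per-node rule and the use of SHA-256 are all assumptions made for illustration.

```python
import bisect
import hashlib


class Ring:
    """Toy token ring: all nodes are equal, and keys map to nodes by hash."""

    def __init__(self, nodes, replication_factor=2):
        self.replication_factor = min(replication_factor, len(nodes))
        # Assumption: each node owns a single token derived from its name.
        self.tokens = sorted((self._hash(node), node) for node in nodes)

    @staticmethod
    def _hash(value):
        return int(hashlib.sha256(value.encode("utf-8")).hexdigest(), 16)

    def replicas(self, key):
        """Return the owner of `key` plus the next nodes clockwise on the ring."""
        positions = [token for token, _ in self.tokens]
        # The first token at or after the key's hash owns the key; wrap around.
        start = bisect.bisect_left(positions, self._hash(key)) % len(self.tokens)
        return [
            self.tokens[(start + offset) % len(self.tokens)][1]
            for offset in range(self.replication_factor)
        ]

    def read(self, key, down=()):
        """Return the first replica of `key` that is not marked as failed."""
        for node in self.replicas(key):
            if node not in down:
                return node
        raise RuntimeError("all replicas for this key are down")
```

For example, with `ring = Ring(["node-a", "node-b", "node-c", "node-d"], replication_factor=3)`, calling `ring.replicas("user:42")` returns three distinct nodes, and `ring.read("user:42", down={...})` falls back to the next replica when the first one is marked as failed. The caller never chooses a node, which mirrors the transparent partitioning described in the text.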

Another architectural feature is support for linear scalability: capacity can be increased simply by adding new nodes. For example, if 2 nodes can handle 1000 transactions/sec, then 4 nodes will support 2000 transactions/sec, and so on. The following picture shows the linear scalability of a Cassandra ring.
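The arithmetic behind this example can be written down as a short sketch. It assumes ideal linear scaling from the article's own baseline of 2 nodes and 1000 transactions/sec; the function name and parameters are hypothetical, and a real cluster may deviate from the projection.

```python
def projected_throughput(nodes, base_nodes=2, base_tps=1000):
    """Scale the throughput measured on `base_nodes` linearly to `nodes`."""
    if nodes < 1 or base_nodes < 1:
        raise ValueError("node counts must be positive")
    return base_tps * nodes / base_nodes


if __name__ == "__main__":
    for node_count in (2, 4, 8):
        print(node_count, "nodes:", projected_throughput(node_count), "transactions/sec")
```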

Accessing large volume of data

The first thing that comes to mind when developing a database-driven application is the availability of client libraries. For RDBMS products the available libraries are straightforward. For example, JDBC is the standard database access API for Java-based applications, and normally there is a single JDBC driver vendor for a particular type of database product. Cassandra, on the other hand, has approximately nine different clients for Java application development. More importantly, these clients offer different styles of managing the data: some provide object-relational mapping APIs, some offer CQL-based support, and so on. This flexibility in accessing the NoSQL DB is another major advantage for application development, because developers can choose the type of access that fits their requirements.

Large volumes of data in Cassandra can be accessed and managed through RPC-style APIs. Cassandra also provides a basic query language called CQL, which is similar to SQL to some extent. Either way, the application developer must have sound knowledge of the storage engine and how it works.

Standard use cases for Cassandra NoSQL DB

As we have already discussed, the standard use cases for Cassandra differ from those of traditional RDBMS applications. The following are some standard use cases.

Applications handling very large data volumes
Applications requiring high scalability and availability
Applications with high reliability requirements for data storage
Applications with a dynamic data model that is expected to change significantly over time
Applications distributed over different data centers

Downloading and Installing Cassandra

Now let us discuss downloading and installing Cassandra NoSQL DB. The download and installation will take some time.

Apache Cassandra can be downloaded from http://cassandra.apache.org. The binary distribution is named apache-cassandra--bin.tar.gz. The easiest way to install Cassandra is outlined in the following steps:

Download the binary distribution from the above website.
Extract the archive using any utility that supports the tar.gz format.
Once extracted, you should get the following directories:
- bin: this contains the executables to run Cassandra and the command line interface client.
- conf: this contains the files used to configure Cassandra.
- interface: this contains the client interface definition, which is written in Thrift syntax and provides an easy way to generate clients. To see all of the operations that Cassandra supports, open this file in a regular text editor. Through this interface, Cassandra supports clients for Java, C++, PHP, Ruby, Python, Perl, and C#.
- lib: this contains the external libraries that are required to run Cassandra.
- javadoc: this contains the documentation for Cassandra in HTML format.
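As a quick sanity check after unpacking, the directory list above can be verified with a few lines of Python. This is a minimal sketch: the missing_dirs helper is hypothetical, and the install path you pass in depends on where you unpacked the archive.

```python
from pathlib import Path

# Directory names taken from the list above.
EXPECTED_DIRS = ("bin", "conf", "interface", "lib", "javadoc")


def missing_dirs(install_root):
    """Return the expected directories that are absent under `install_root`."""
    root = Path(install_root)
    return [name for name in EXPECTED_DIRS if not (root / name).is_dir()]
```

An empty list means every directory is present. Any names returned suggest an incomplete extraction, or a distribution laid out differently from the one described here.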

Start the Cassandra NoSQL server

To start the Cassandra server on any OS, such as Linux or Windows, open a command prompt or terminal window. Go to the /bin where you unpacked Cassandra, and run the following command to start the server. If the installation was clean, you should see log statements such as:

Listing 1: starting Cassandra server

utpalb@Cassandraserver$ bin/cassandra -f
INFO 13:23:22,367 DiskAccessMode 'auto' determined to be standard, indexAccessMode is standard
INFO 13:23:22,475 Couldn't detect any schema definitions in local storage.
INFO 13:23:22,476 Found table data in data directories. Consider using JMX to call org.apache.cassandra.service.StorageService.loadSchemaFromYaml().
INFO 13:23:22,497 Cassandra version: 0.7.0-beta1
INFO 13:23:22,497 Thrift API version: 10.0.0
INFO 13:23:22,498 Saved Token not found. Using qFABQw5XJMvs47lg
INFO 13:23:22,498 Saved ClusterName not found. Using Test Cluster
INFO 13:23:22,502 Creating new commitlog segment /var/lib/cassandra/commitlog/CommitLog-1282508602502.log
INFO 13:23:22,507 switching in a fresh Memtable for LocationInfo at CommitLogContext(file='/var/lib/cassandra/commitlog/CommitLog-1282508602502.log', position=276)
INFO 13:23:22,510 Enqueuing flush of Memtable-LocationInfo@29857804(178 bytes, 4 operations)
INFO 13:23:22,511 Writing Memtable-LocationInfo@29857804(178 bytes, 4 operations)
INFO 13:23:22,691 Completed flushing /var/lib/cassandra/data/system/LocationInfo-e-1-Data.db
INFO 13:23:22,701 Starting up server gossip
INFO 13:23:22,750 Binding thrift service to localhost/127.0.0.1:9160
INFO 13:23:22,752 Using TFramedTransport with a max frame size of 15728640 bytes.
INFO 13:23:22,753 Listening for thrift clients...
INFO 13:23:22,792 mx4j successfully loaded HttpAdaptor version 3.0.2 started on port 8081

The -f option tells Cassandra to stay in the foreground instead of running as a background process. All of the server logs then print to standard out, so you can see them in your terminal window, which is useful for testing.
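Because the logs go to standard out, a test script can read them to decide when the server is ready. The sketch below assumes the line format shown in Listing 1 (level, timestamp, message) and treats the "Listening for thrift clients..." message as the readiness signal. Both are assumptions drawn from that listing, and other Cassandra versions may log differently. The function names are hypothetical.

```python
import re

# Pattern assumed from Listing 1: "INFO 13:23:22,753 Listening for thrift clients..."
LOG_LINE = re.compile(
    r"^(?P<level>[A-Z]+)\s+(?P<time>\d{2}:\d{2}:\d{2},\d{3})\s+(?P<message>.*)$"
)


def parse_line(line):
    """Split one log line into level, time and message; None if it does not match."""
    match = LOG_LINE.match(line.strip())
    return match.groupdict() if match else None


def is_ready(lines):
    """Return True once the readiness message from Listing 1 has been logged."""
    for line in lines:
        parsed = parse_line(line)
        if parsed and parsed["message"].startswith("Listening for thrift clients"):
            return True
    return False
```

In practice, `lines` could be the captured output of the server process, and a test would wait until `is_ready` returns True before connecting.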

In conclusion:

Apache Cassandra is a scalable NoSQL database.
It can be downloaded and installed from the Apache website.
Cassandra is an ideal database for managing large amounts of structured, semi-structured, and unstructured data across multiple data centers and the cloud.
Cassandra supports linear scalability and high performance across multiple commodity servers with no single point of failure, and provides a powerful dynamic data model designed for maximum flexibility and fast response time.

About the Author

Kaushik Pal is a technical architect with 15 years of experience in enterprise application and product development. He has expertise in web technologies, architecture/design, Java/J2EE, open source and big data technologies. You can find more of his work at www.techalpine.com.
