Getting Started with Apache HBase

Apache HBase is a distributed, non-relational, open source database written in Java that runs on top of HDFS. HBase is a suitable candidate when you have hundreds of millions or billions of rows and enough hardware to support them. Learn more about its practical use and architectural concepts.


Overview

Apache HBase is often described as the Hadoop database. It is a distributed, non-relational, open source database written in Java. It is modeled on Google's Bigtable and runs on HDFS (the Hadoop Distributed File System). Apache HBase is used when you need random, real-time access to a large volume of data. HBase is a suitable candidate when you have hundreds of millions or billions of rows and enough hardware to support them, because HBase is based on HDFS and HDFS performs well when there is a minimum of five data nodes. In short, HBase is a distributed data store suited to holding and processing large volumes of data. This article explains the details along with the architectural concepts.

Introduction

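Apache HBase is a NoSQL, column-oriented database management system that runs on top of HDFS. HBase does not natively support structured query languages such as SQL. HBase itself is written in Java, and applications usually talk to it through the Java client API; MapReduce jobs can also read from and write to HBase tables. Applications that are not written in Java can reach HBase through its REST and Thrift gateways (older releases also offered an Avro gateway). Some of the important features of HBase are listed below: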

  • HBase supports automatic sharding
  • HBase supports HDFS as distributed storage
  • HBase supports MapReduce for parallel processing of large volumes of data
  • HBase has support for Java client APIs
  • HBase supports strongly consistent read and write operations. It is suitable for high speed counter aggregation
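As a rough illustration of the automatic sharding mentioned in the list above: HBase divides a table into regions, each covering a contiguous range of row keys. The sketch below is a toy model in plain Python, not HBase client code. The split points and the function names `region_for` and `split_region` are made up for the example.

```python
from bisect import bisect_right

# Hypothetical split points. Each region covers a contiguous row-key range:
# region 0 holds keys < "g", region 1 holds "g" <= key < "p", region 2 the rest.
SPLIT_KEYS = ["g", "p"]

def region_for(row_key, split_keys=SPLIT_KEYS):
    """Return the index of the region whose key range contains row_key."""
    return bisect_right(split_keys, row_key)

def split_region(split_keys, new_split):
    """Return a new sorted list of split points with one more boundary added."""
    return sorted(set(split_keys) | {new_split})
```

With these split points, `region_for("apple")` falls in region 0 and `region_for("zebra")` in region 2. Adding a boundary with `split_region(SPLIT_KEYS, "k")` models a region being split as it grows.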

Difference between HBase and HDFS



We have seen that HBase runs on top of HDFS, so you might wonder whether the two are essentially the same. They are not. HDFS is not a simple file system; it is a distributed storage system suited to storing large volumes of data. HDFS does not support fast lookup of individual records in large volumes of data, but HBase works on top of HDFS and provides that missing fast lookup and update.
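The difference can be sketched in plain Python. This is only an analogy under simplifying assumptions, not how either system is implemented: `scan_lookup` stands in for reading through a plain file record by record, and `sorted_lookup` stands in for a store that keeps its keys sorted so a record can be located by binary search. All names and data here are hypothetical.

```python
from bisect import bisect_left

def scan_lookup(records, wanted_key):
    """Check every record in turn, as a job reading a plain file would."""
    for key, value in records:
        if key == wanted_key:
            return value
    return None

def sorted_lookup(sorted_keys, values, wanted_key):
    """Binary-search a list of keys that is kept in sorted order."""
    i = bisect_left(sorted_keys, wanted_key)
    if i < len(sorted_keys) and sorted_keys[i] == wanted_key:
        return values[i]
    return None

# Made-up data: 10,000 records whose keys are already in sorted order.
records = [("row-%05d" % n, n * n) for n in range(10000)]
keys = [k for k, _ in records]
vals = [v for _, v in records]
```

Both functions return the same value for the same key, but the scan touches up to 10,000 records while the binary search needs about 14 comparisons.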

When should you use HBase?

HBase is a typical NoSQL, column-oriented data store. Whether to choose a NoSQL database or an RDBMS depends on the requirements of the application, so understand the requirements clearly before selecting a database. Selecting a NoSQL DB without proper analysis can cause trouble and can also be a misuse of technology and resources. The following points should be considered when selecting a NoSQL DB like HBase.

Volume: The volume of data is the first criterion for selecting a NoSQL DB. You should have a very large amount of data (hundreds of millions or billions of rows) to store and process. If you only have a few thousand or a few million rows, a traditional RDBMS is the better fit. If you select HBase for a small amount of data, the data will accumulate on a single node and the other nodes in the cluster will sit idle.

Hardware support: HDFS performs efficiently when there are a minimum of five data nodes. Because HBase is based on HDFS, you should have sufficient hardware in place before implementing HBase.

No need for RDBMS features: Make sure that your application does not require the extra features provided by a typical RDBMS. Advanced features such as transactions, complex queries and triggers are not supported by HBase, so this is another important selection criterion.

HBase Design Concepts

The design concepts behind HBase are similar to those of HDFS and the MapReduce framework. Because all three work in a distributed environment, the general design is based on a master-slave architecture. HDFS has a NameNode with DataNodes as slaves, and MapReduce has a JobTracker with TaskTracker slaves. Similarly, HBase has the following master-slave architecture.

  • Master node manages the cluster
  • Region servers store table data and work on the data

As the master node is the main controller, HBase is very sensitive to the loss of its master node.
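The division of labour can be sketched as a toy model in plain Python. This is not HBase code. The class name `Master`, the server names, and the "least loaded server" placement rule are all assumptions chosen to keep the example short.

```python
class Master:
    """Toy master that tracks which region server holds which regions."""

    def __init__(self, region_servers):
        # server name -> list of region names it currently serves
        self.assignments = {server: [] for server in region_servers}

    def assign(self, region):
        """Place a region on the server that currently holds the fewest regions."""
        server = min(self.assignments, key=lambda s: len(self.assignments[s]))
        self.assignments[server].append(region)
        return server

    def handle_server_failure(self, failed_server):
        """Move the regions of a failed server onto the remaining servers."""
        orphaned = self.assignments.pop(failed_server)
        for region in orphaned:
            self.assign(region)


master = Master(["rs1", "rs2"])
for region in ["region-a", "region-b", "region-c"]:
    master.assign(region)
```

In this model the region servers only hold data, while every placement decision goes through the single `Master` object, which is why losing the master matters so much.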

HBase Views

HBase presents data in a tabular view built around the concept of the column family. An HBase table is made of rows and columns, and each column belongs to a column family. The row key is the primary key for table access. The row key can be any value, and the rows are kept sorted by row key. The following two views describe these concepts.

Conceptual View

In this section I will explain the conceptual view with an example. A table contains column families, and column families contain columns. By convention, a full column name is made of two parts: the column family name as a prefix, followed by the column name. The colon character (:) delimits the column family and the column. For example, take a table named 'hbasetable' with two column families, 'colfamily1' and 'colfamily2'. 'colfamily1' has two columns, 'name' and 'address'. 'colfamily2' has one column, 'telno'. The structure is shown below.

Table 'hbasetable'

colfamily1: name = "Ricardo"

colfamily1: address = "MA, USA"

colfamily2: telno = "2235678"

The tabular view will look like this:


Table1: Tabular view of 'hbasetable'
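The conceptual view of the example table above can be modelled with nested dictionaries. This is a minimal sketch in plain Python rather than HBase client code. The row key "row1" is a made-up placeholder, since the example does not give one, and the helper names are hypothetical.

```python
# Conceptual view: row key -> "family:column" -> value.
# The row key "row1" is a placeholder, not taken from the article.
hbasetable = {
    "row1": {
        "colfamily1:name": "Ricardo",
        "colfamily1:address": "MA, USA",
        "colfamily2:telno": "2235678",
    }
}

def split_column(column):
    """Split a full column name at the first colon into (family, column)."""
    family, qualifier = column.split(":", 1)
    return family, qualifier

def columns_in_family(row, family):
    """Return {column: value} for every column of one family in a row."""
    result = {}
    for column, value in row.items():
        fam, qualifier = split_column(column)
        if fam == family:
            result[qualifier] = value
    return result
```

For instance, `columns_in_family(hbasetable["row1"], "colfamily2")` returns only the 'telno' entry, which mirrors how the colon-delimited prefix groups columns into families.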

Physical View

We have already discussed the conceptual view of an HBase table and its contents, but the physical view is a bit different. Physically, HBase tables are stored on a per-column-family basis, so new columns can be added to a column family at any time without being declared in advance. This flexibility fits the linear scalability that HBase aims for.

Following are the tabular views of the two column families.


Table2: Showing colfamily1


Table3: Showing colfamily2

Please note that the empty cells displayed in the conceptual view are not actually stored, because the storage structure is column oriented and holds only cells that have a value. So if we query 'colfamily1' for a cell that has no value at a particular time stamp 'T1', the query returns nothing. The same is true for 'colfamily2'. All time stamps are stored in descending order, so the most recent value of a particular column is returned if no time stamp is mentioned in the query.
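The behaviour described above can be sketched with a toy per-column-family store in plain Python. It is an illustration under simplifying assumptions, not HBase's storage format: time stamps are plain integers, a query with a time stamp matches that exact time stamp only, and the class name, row key and values are hypothetical.

```python
class ColumnFamilyStore:
    """Toy store for one column family; only cells that have a value are kept."""

    def __init__(self):
        # (row_key, column) -> list of (timestamp, value), newest first
        self.cells = {}

    def put(self, row_key, column, timestamp, value):
        versions = self.cells.setdefault((row_key, column), [])
        versions.append((timestamp, value))
        # Keep versions in descending time stamp order.
        versions.sort(key=lambda tv: tv[0], reverse=True)

    def get(self, row_key, column, timestamp=None):
        versions = self.cells.get((row_key, column), [])
        if timestamp is None:
            # No time stamp given: return the most recent value, if any.
            return versions[0][1] if versions else None
        for ts, value in versions:
            if ts == timestamp:
                return value
        return None  # nothing stored for this cell at that time stamp


store = ColumnFamilyStore()
store.put("row1", "name", 3, "older value")
store.put("row1", "name", 7, "newer value")
```

Here `store.get("row1", "name")` returns the newest version, while asking for time stamp 5 or for the never-written 'address' column returns `None`, because no cell was ever stored for them.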

Conclusion

To conclude, keep in mind that HBase is an open source, NoSQL, distributed database suitable for storing and processing very large amounts of data. It grew out of the Apache Hadoop project and is built on HDFS. HBase integrates with MapReduce so that large jobs over its tables can run in parallel, although ordinary reads and writes go through the client API rather than MapReduce. The basic concept is the same as Google's Bigtable. A NoSQL database should be selected carefully. RDBMS design and NoSQL design are completely different, so data cannot simply be ported from an RDBMS to HBase as-is; the entire design has to be changed to shift from an RDBMS to HBase.

 

About the Author

Kaushik Pal is a technical architect with 15 years of experience in enterprise application and product development. He has expertise in web technologies, architecture/design, Java/J2EE, open source and big data technologies. You can find more of his work at www.techalpine.com.



   