Big data is an aptly named concept. Nowadays, huge amounts of data and information are generated by hundreds of millions of applications, devices and human beings all round the world. Be they user's personal data maintained by social networking sites, public sites hosting blogs, weather-related data generated by different types of sensors, or customer and product information maintained by a large enterprise organization -- they are all contributing to big data.
The amount of data sets and the volume of information being generated, processed and analyzed particularly for business intelligence and decision making is growing rapidly as well, making traditional warehousing solutions expensive, difficult to leverage and mostly ineffective. There clearly is a need for a generalized, flexible and scalable tool that can cope with the challenges of big data.
When developers deal with big data, the challenge generally boils down to either:
- storing the data in a way so that it is easy to access and manage
- processing the whole set of data in a way that makes the processing easy, efficient and faster
Enter Hadoop for Big Data
The Apache Hadoop is a popular software framework that supports data-intensive distributed applications. It is a map-reduce implementation inspired by Google's MapReduce and Google File System (GFS) papers. Hadoop is written in Java, and many large-scale applications and organizations (e.g. Yahoo and Facebook) use it to run large distributed computations on commodity hardware. It is designed in a way that it can scale up from a single server node to thousands of machines, each offering local computation and storage. All these features and capabilities make Hadoop the best candidate for developers dealing with big data.
However, the map-reduce programming used by Hadoop is very low level and Hadoop lacks the SQL-like expressiveness of query languages, which forces developers to spend a lot of time writing programs for even simple analyses. Hadoop also is not easy for those developers who are not familiar with the map-reduce concept. It requires them to write custom programs for each operation, even for simple tasks like getting the number of rows from a log or averages over some columns. Even the code generated is hard to maintain and reuse for different applications.
Apache Hive, a NoSQL type of database system that runs over Hadoop, overcomes these limitations and provides a neat and simple interface. Hive efficiently meets the challenges of storing, managing and processing big data, which is difficult with traditional RDBMS solutions.