Exploring Apache Shark

Learn more about Apache Shark and its features.



Apache Shark is a distributed query engine developed by the open source community. It is used mainly to query data stored in Hadoop, and it offers Hive users significantly better performance for analytical workloads.

This article will discuss Apache Shark and its features in detail.

Introduction

Apache Shark is a data warehouse system built on top of Apache Spark and designed to be compatible with Apache Hive. Shark can execute HiveQL queries up to 100 times faster than Hive without requiring any change to the existing queries. Shark supports most of Hive's features, such as the query language, metastore, serialization formats and user-defined functions, which makes integrating with existing Hive deployments straightforward.

Important Features of Apache Shark

Apache Shark offers the following important features:

  • Faster Execution Engine – Apache Shark is built on top of Apache Spark, a parallel data execution engine. Thanks to this faster engine, Shark performs well compared with similar systems even when the data resides on disk, and it avoids the overhead of Hadoop MapReduce. For complex queries, Shark can respond with sub-second latency.
  • Column-Wise Memory Store – Data analysis typically focuses on a small subset of the data, selected by time or by locality, for example. Such queries touch only a few dimension tables or a certain portion of the fact tables, and they exhibit strong temporal locality. This makes it possible to fit the working set into the cluster's memory.


    As a user, you can exploit this temporal locality by storing the working set of data in the cluster's memory, or by keeping in-memory materialized views in a database. Commonly used data can also be cached in a columnar format as primitive arrays, which are very efficient for both storage and garbage collection. This yields maximum performance, because the data is fetched from in-memory tables rather than from disk.

Setup and Execute Locally

Prerequisite – Before setting up Shark on your computer, make sure you have Java, Scala and Hive installed on your system (the environment variables below point to these installations).

The binary distribution of Shark can be downloaded from the AMPLab site on GitHub. The binary package contains two folders:

  • shark-0.8.0
  • hive-0.9.0-shark-0.8.0-bin

You need to set up the following environment variables in order to do the setup:

  • JAVA_HOME
  • HIVE_HOME
  • SCALA_HOME

Shark ships with a template env file, shark-env.sh.template. Make a copy of this template in shark-0.8.0/conf and name it shark-env.sh. Once the environment variables are set, you need to create the default Hive warehouse directory. This is the location where Hive stores the table data for native tables. When creating this directory, make sure its owner is the same user that runs the Shark setup.
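The steps above can be sketched in shell. All paths below are illustrative placeholders; point JAVA_HOME, HIVE_HOME and SCALA_HOME at your real installations, and the sandbox directory stands in for the unpacked Shark folder:

```shell
# Stand-in for the unpacked Shark folder (placeholder; use your real path).
SHARK_ROOT=$(mktemp -d)/shark-0.8.0
mkdir -p "$SHARK_ROOT/conf"
echo '# env template' > "$SHARK_ROOT/conf/shark-env.sh.template"

# 1. Copy the template env file to shark-env.sh in shark-0.8.0/conf.
cp "$SHARK_ROOT/conf/shark-env.sh.template" "$SHARK_ROOT/conf/shark-env.sh"

# 2. Set the required environment variables (illustrative values).
export JAVA_HOME=/usr/lib/jvm/java
export HIVE_HOME=/path/to/hive-0.9.0-shark-0.8.0-bin
export SCALA_HOME=/path/to/scala

# 3. Create the Hive warehouse directory; its owner must be the Shark user.
WAREHOUSE="$SHARK_ROOT/warehouse"
mkdir -p "$WAREHOUSE"

ls "$SHARK_ROOT/conf"
```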

Your Shark setup is now ready. Run the following command:

Listing 1 – Starting up Shark Command Line Interface

./bin/shark

In order to verify that Shark is up and running, run the following example that creates a table with some sample data:

Listing 2 – Sample Code to Create a Simple Table and then Load Some Data

CREATE TABLE SOURCE_MAP (key INT, value STRING);
LOAD DATA LOCAL INPATH '${env:HIVE_HOME}/examples/files/kv1.txt' INTO TABLE SOURCE_MAP;
SELECT COUNT(1) FROM SOURCE_MAP;
CREATE TABLE SOURCE_MAP_cached AS SELECT * FROM SOURCE_MAP;
SELECT COUNT(1) FROM SOURCE_MAP_cached;

In addition to the Shark command above, you have several other executables:

  • bin/shark-withdebug – This runs the Shark command line interface with debug level logs printed on the console.
  • bin/shark-withinfo – This runs the Shark command line interface with info level logs printed on the console.

Following the steps mentioned above, you can set up Shark on a single node. To run Shark on a cluster, follow these steps instead.

Prerequisite – Before setting up Shark on a cluster, make sure you have a Spark standalone cluster up and running. Unlike the earlier versions of Shark and Spark, the latest version no longer requires Apache Mesos.

First let's make some changes in the Spark environment:

  • Slave entries – The Spark slaves file, spark-0.8.0/conf/slaves, needs to be modified to add the host name of each slave, one host name per line.
  • Spark env file – The Spark env file, spark-0.8.0/conf/spark-env.sh, needs to have the following entries:
    • SCALA_HOME – as explained above.
    • SPARK_WORKER_MEMORY – The maximum amount of memory that Spark can use on each node. When setting this parameter, be sure to leave at least 1 GB of memory for the OS to function properly.
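Concretely, the two files might look like the fragments below. The host names and memory size are placeholders, not values from a real cluster:

```shell
# spark-0.8.0/conf/slaves — one slave host name per line
slave1.example.com
slave2.example.com

# spark-0.8.0/conf/spark-env.sh
export SCALA_HOME=/path/to/scala
export SPARK_WORKER_MEMORY=24g   # leave at least 1 GB for the OS
```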

Now let us make the corresponding changes in the Shark environment:

As mentioned above, download the binary distribution of Shark from the AMPLab site on GitHub. The binary package contains two folders:

  • shark-0.8.0
  • hive-0.9.0-shark-0.8.0-bin

Now open the shark-env.sh file and edit the following properties to match your environment:

  • JAVA_HOME
  • HIVE_HOME
  • SCALA_HOME
  • MASTER environmental values

The master URL must exactly match the spark:// URI shown on the standalone master's web UI at port 8080. The shark-env.sh file should look like the code below:

Listing 3 – Shark env File in case of Clustered Setup

HADOOP_HOME=/path/to/hadoop
HIVE_HOME=/path/to/hive
MASTER=
SPARK_HOME=/path/to/spark
SPARK_MEM=16g
source $SPARK_HOME/conf/spark-env.sh

The last line is added to avoid duplicate entries for SPARK_HOME. Once these parameters are added, make sure to export them using the standard UNIX export command. Note that the amount of memory set in SPARK_MEM must not exceed the SPARK_WORKER_MEMORY value mentioned above. If you want to use Shark with an existing Hive setup, make sure to set the HIVE_CONF_DIR parameter in the shark-env.sh file as well.
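That memory constraint is easy to get wrong, so a quick sanity check can help. The sketch below assumes both values use the same unit suffix (g here); the numbers are placeholders:

```shell
# Hypothetical sanity check: SPARK_MEM must not exceed SPARK_WORKER_MEMORY.
# Placeholder values; both must use the same suffix for the comparison to work.
SPARK_WORKER_MEMORY=24g
SPARK_MEM=16g

worker_gb=${SPARK_WORKER_MEMORY%g}   # strip the trailing "g"
mem_gb=${SPARK_MEM%g}
if [ "$mem_gb" -le "$worker_gb" ]; then
    echo "OK: SPARK_MEM=$SPARK_MEM fits within SPARK_WORKER_MEMORY=$SPARK_WORKER_MEMORY"
else
    echo "ERROR: lower SPARK_MEM so it does not exceed SPARK_WORKER_MEMORY"
fi
```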

The next step is to copy the Spark and Shark directories to the slaves. Once done, you can start the cluster by executing the following command:

Listing 4 – Launch the Spark Cluster

./bin/start-all.sh
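Copying the directories to the slaves can be scripted from the slaves file itself. The sketch below is a dry run that only prints the rsync commands it would issue; host names and paths are placeholders, and actually running the commands assumes passwordless SSH and identical paths on every slave:

```shell
# Dry run: print one rsync command per directory per slave.
# Placeholder slaves file; on a real cluster use spark-0.8.0/conf/slaves.
SLAVES_FILE=$(mktemp)
printf 'slave1.example.com\nslave2.example.com\n' > "$SLAVES_FILE"

while read -r host; do
    for dir in /path/to/spark-0.8.0 /path/to/shark-0.8.0; do
        # Remove the leading "echo" to actually copy the directories.
        echo rsync -az "$dir/" "$host:$dir/"
    done
done < "$SLAVES_FILE"
```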

The Shark Query Language

Shark implements its own subset of SQL, which closely follows the query language of Hive. For example, to create a cached table from the rows of an existing table, set the shark.cache table property as shown below:

Listing 5 – A Sample Shark Query

CREATE TABLE ... TBLPROPERTIES ("shark.cache" = "true") AS SELECT ...

HiveQL is also extended with a shortcut for this syntax: simply append '_cached' to the table name in a CREATE TABLE AS SELECT statement and the resulting table will be cached in memory.
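For example, assuming an existing table named logs, the two statements below would be equivalent ways to build an in-memory cached copy (the table names are illustrative, not from the Shark distribution):

```sql
-- Explicit table property:
CREATE TABLE logs_mem TBLPROPERTIES ("shark.cache" = "true")
  AS SELECT * FROM logs;

-- Shortcut: the '_cached' suffix implies shark.cache = true
CREATE TABLE logs_cached AS SELECT * FROM logs;
```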

Summary

  • Apache Shark is a distributed query engine developed by the open source community.
  • Apache Shark is a data warehouse system used with Apache Spark.
  • Apache Shark is compatible with HiveQL and can be easily integrated with Hive.
  • It can run in both standalone and clustered modes.

 

 

About the Author

Kaushik Pal is a technical architect with 15 years of experience in enterprise application and product development. He has expertise in web technologies, architecture/design, Java/J2EE, open source and big data technologies. You can find more of his work at www.techalpine.com.



   