Hadoop in Practice: Using Sqoop to Import Data from MySQL

Hadoop in Practice author Alex Holmes explains how you can use the Sqoop relational database import/export system with Hadoop.



Log data has long been prevalent across all our applications, but with Hadoop came the ability to process that data. Various systems produce log data, from network devices and operating systems to web servers and our own applications. They all offer the potential for valuable insights into how our systems and applications operate as well as how they're used. What unifies log files is that they tend to be in text form and line-oriented, making them easy to process. In this article, based on chapter 2 of Hadoop in Practice, author Alex Holmes shows how Sqoop can be used for relational database ingress.

Sqoop is a relational database import and export system. It was created by Cloudera, and is currently an Apache project in incubation status.

When you perform an import, Sqoop can write to HDFS, Hive, and HBase, and, for exports, it can do the reverse. Importing is broken down into two activities: connecting to the data source to gather some statistics, and then firing off a MapReduce job, which performs the actual import. Figure 1 shows these steps.




Figure 1. Sqoop import overview

Sqoop has the notion of connectors, which contain the specialized logic to read and write to external systems. Sqoop comes with two classes of connectors: a common connector for regular reads and writes, and a "fast" connector that uses database-proprietary batch mechanisms for efficient imports. Figure 2 shows these two classes of connectors and the databases that they support.


Figure 2. Sqoop connectors
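As a rough illustration of the fast connector path, Sqoop selects MySQL's bulk mechanism when you pass the --direct flag. The following is a minimal sketch, assuming the hip_sqoop_user account and sqoop_test database used in the examples below; the -P flag makes Sqoop prompt for the password, and MySQL's fast path also requires the mysqldump binary to be available on the cluster nodes.

$ sqoop import --username hip_sqoop_user -P \
  --direct \
  --connect jdbc:mysql://localhost/sqoop_test --table stocks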

In this technique, we will look at how to use Sqoop as a simple mechanism to bring relational data into Hadoop clusters. We'll walk through the process of importing data from MySQL into HDFS, covering the regular connector as well as how to do bulk imports using the fast connector.

Problem

You want to load relational data into your cluster and ensure that your writes are both efficient and idempotent.

Solution

Before you can get started, you'll need to install Sqoop. The first Sqoop command will be a basic import, where you'll specify connection information about your MySQL database and the table you want to import.

$ sqoop import --username hip_sqoop_user --password password \
  --connect jdbc:mysql://localhost/sqoop_test --table stocks
MySQL table names

MySQL table names on Linux are case sensitive. Make sure that the table name you supply in the Sqoop command matches the case of the table name in MySQL.

It's generally not a good idea to pass database passwords as arguments to a script, because other users can see your password with commands such as ps while the import is running, and it will also end up in your shell history file. A best practice is to write the password into a Sqoop options file and ensure that only you have read permission on it.

$ cat > ~/.sqoop_import_options.txt << EOF
import
--username
hip_sqoop_user
--password
password
EOF
$ chmod 700 ~/.sqoop_import_options.txt

Sqoop also supports a -P option, which, when present, prompts you for the password.
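For example, here's a hedged variant of the earlier import that uses -P in place of a command-line password; Sqoop will ask for the password interactively before the job starts:

$ sqoop import --username hip_sqoop_user -P \
  --connect jdbc:mysql://localhost/sqoop_test --table stocks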

Run the command again, this time specifying the options file you've created.

$ hadoop fs -rmr stocks
$ sqoop --options-file ~/.sqoop_import_options.txt \
  --connect jdbc:mysql://localhost/sqoop_test --table stocks

You may wonder why you had to delete the stocks directory in HDFS before re-running the import command. Sqoop by default uses the table name as the destination in HDFS for the MapReduce job that it launches to perform the import. If you run the same command again, the MapReduce job will fail because the directory already exists. Let's take a look at the stocks directory in HDFS.

$ hadoop fs -ls stocks
624 2011-11-24 11:07 /user/aholmes/stocks/part-m-00000
644 2011-11-24 11:07 /user/aholmes/stocks/part-m-00001
642 2011-11-24 11:07 /user/aholmes/stocks/part-m-00002
686 2011-11-24 11:07 /user/aholmes/stocks/part-m-00003

$ hadoop fs -cat stocks/part-m-00000
1,AAPL,2009-01-02,85.88,91.04,85.16,90.75,26643400,90.75
2,AAPL,2008-01-02,199.27,200.26,192.55,194.84,38542100,194.84
3,AAPL,2007-01-03,86.29,86.58,81.9,83.8,44225700,83.8
...
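As noted above, Sqoop writes to an HDFS directory named after the table by default, which is why the stocks directory had to be removed before re-running the import. A hedged alternative is to point each run at a fresh location with --target-dir (the same option used in the query example later in this article); the directory name below is purely illustrative:

$ sqoop --options-file ~/.sqoop_import_options.txt \
  --target-dir /user/aholmes/stocks-run2 \
  --connect jdbc:mysql://localhost/sqoop_test --table stocks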

Import data formats

Sqoop has imported your data as comma-separated text files. It supports a number of other file formats, which can be activated with the arguments listed in table 1.

Table 1. Sqoop arguments that control the file format of import commands

Argument             Description
--as-avrodatafile    Data is imported as Avro files.
--as-sequencefile    Data is imported as Sequence Files.
--as-textfile        The default file format; data is imported as CSV text files.

If you are importing large amounts of data, you may want to use a file format such as Avro, which is a compact data format, and use it in conjunction with compression. The following example uses the Snappy compression codec in conjunction with Avro files.

$ hadoop fs -rmr stocks
$ sqoop --options-file ~/.sqoop_import_options.txt \
        --as-avrodatafile \
        --compress \
        --compression-codec org.apache.hadoop.io.compress.SnappyCodec \
        --connect jdbc:mysql://localhost/sqoop_test \
        --table stocks

Note that the compression codec supplied on the command line must be registered in core-site.xml under the io.compression.codecs property. The Snappy codec also requires the Hadoop native libraries to be installed.
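For reference, a minimal sketch of the relevant core-site.xml entry; the exact list of codec classes varies by Hadoop distribution, so treat the value below as illustrative:

<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.SnappyCodec</value>
</property>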

We can introspect the structure of the Avro file to see how Sqoop has laid out the records by using an Avro dumper tool that we created. Sqoop uses Avro's GenericRecord for record-level storage. If we run our generic Avro dumper utility against the Sqoop-generated files in HDFS, we'll see the following:

$ bin/run.sh com.manning.hip.ch3.avro.AvroGenericFileDumper \
  stocks/part-m-00000.avro
{"id": 1, "symbol": "AAPL", "quote_date": "2009-01-02",
 "open_price": 85.88, "high_price": 91.04, "low_price": 85.16,
 "close_price": 90.75, "volume": 26643400, "adj_close_price": 90.75}
...
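If you don't have the book's dumper utility handy, Avro's own avro-tools jar can dump records as JSON in much the same way. This is a hedged sketch: it assumes you have downloaded an avro-tools jar (substitute whatever version you actually have) and it copies the file out of HDFS first:

$ hadoop fs -get stocks/part-m-00000.avro .
$ java -jar avro-tools-1.7.7.jar tojson part-m-00000.avro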

Using Sqoop in conjunction with Sequence Files

One of the things that makes Sequence Files hard to work with is that there isn't a generic way to access the data in a Sequence File: you must have access to the Writable class that was used to write the data. In Sqoop's case, it code-generates this class, which introduces a major problem: if you move to a newer version of Sqoop and that version modifies the code generator, there's a good chance your older, code-generated class won't work with Sequence Files generated by the newer version of Sqoop. You'll either need to migrate all of your old Sequence Files to the new version or somehow maintain the old generated classes. Because of this restriction, we don't recommend using Sequence Files with Sqoop. If you want to see what the generated class looks like, run the Sqoop import tool and look at the stocks.java file that it writes to your working directory.
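If you do want to inspect the generated class without running a full import, Sqoop's codegen tool will regenerate it on demand. A hedged sketch follows; --outdir controls where the .java source is written, and the directory name here is just an example:

$ sqoop codegen --username hip_sqoop_user -P \
  --connect jdbc:mysql://localhost/sqoop_test \
  --table stocks \
  --outdir /tmp/sqoop-codegen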

In reality, you'll more likely want to periodically import a subsection of your tables based on a query. What if you want to import all of the Apple and Google stocks for 2007 and put them into a custom HDFS directory? The following listing shows how you would do this with Sqoop.

$ hadoop fs -rmr 2007-stocks
$ GLOBIGNORE=*                                                 #1
$ read -d '' query << "EOF"
select * from stocks
where symbol in ("AAPL", "GOOG")
  and quote_date between "2007-01-01" and "2007-12-31"
  and $CONDITIONS                                              #2
EOF

$ sqoop --options-file ~/.sqoop_import_options.txt \
  --query "$query" \
  --split-by id \                                              #3
  --target-dir /user/aholmes/2007-stocks \
  --connect jdbc:mysql://localhost/sqoop_test

#1 -- Bash by default performs globbing, meaning that it'll expand wildcards like "*". We use this command to turn this off so that the next line generates the SQL correctly.

#2 -- Store the query in the variable "query." $CONDITIONS is a Sqoop macro that must be present in the WHERE clause of the query; Sqoop uses it to substitute LIMIT and OFFSET options when issuing MySQL queries.

#3 -- This argument must be supplied so that Sqoop can determine which table column to use for splitting.

The --query SQL shown in the previous listing can also be used to import only a subset of the columns in a table.
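For example, here's a hedged sketch that pulls just three of the columns seen in the earlier output; the target directory name is purely illustrative. Note that the query is single-quoted so the shell doesn't try to expand $CONDITIONS:

$ sqoop --options-file ~/.sqoop_import_options.txt \
  --query 'select id, symbol, close_price from stocks where $CONDITIONS' \
  --split-by id \
  --target-dir /user/aholmes/stocks-closing-prices \
  --connect jdbc:mysql://localhost/sqoop_test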


