Performance Tuning Data Load into Hadoop with Sqoop

Working with Hadoop means working with huge amounts of data. At times, it also means moving huge amounts of data out of traditional data stores such as MySQL and Oracle. Apache Sqoop is an excellent tool that aids in migrating data to and from a Hadoop cluster.

Data migration into Hadoop can become tricky and challenging due to:

  • Data Volume
  • Source Machine's Performance
  • Network Performance
  • Migration Tool's Performance
  • Hadoop Cluster's Performance

Before moving huge amounts of data, it is wise to test and fine-tune each part of the data flow to achieve the best possible performance. Any non-optimal part of the flow adds to the cost of the data transfer.

For the purposes of this post, we will run test data migrations from a MySQL instance to a Hadoop cluster. I will focus on all of the above concerns except network performance, and we will see that tuning the source, Sqoop and Hadoop yields a huge improvement in data transfer times.

Data Volume:

For the test, we will migrate a 7 GB MySQL table, varying parameters at the Hadoop and Sqoop level.
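
If you want to sanity-check the volume up front, the table size can be read from information_schema with the MySQL client. This is only an illustrative query; the host, database and table names are the placeholders used in the Sqoop command later in this post:

mysql -h foo-bar-server.eu-west-1.rds.amazonaws.com -u user -p -e "SELECT table_name, ROUND((data_length + index_length)/1024/1024/1024, 2) AS size_gb FROM information_schema.TABLES WHERE table_schema = 'FooBarDatabase' AND table_name = 'FooBarTable';"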

Source Machine:

The source machine for the test is an AWS RDS instance with provisioned IOPS: it is provisioned for 10,000 IOPS with 1 TB of storage, and holds over 600 GB of data in MySQL tables. It is important to note that the I/O speed of the source system is critical to the overall speed of the data migration. If this machine is slow, no matter how fast the other components of the data flow are, things will remain slow.
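
Before blaming Sqoop or Hadoop, it is worth gauging the source's raw read throughput on its own. One rough, illustrative check is to time a plain mysqldump of the table to /dev/null from the machine that will run Sqoop, again using the placeholder host and table names:

time mysqldump -h foo-bar-server.eu-west-1.rds.amazonaws.com -u user -p FooBarDatabase FooBarTable > /dev/null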

Hadoop Cluster:

The Hadoop cluster is a very small test cluster:

  • It has 1 Name Node and 2 Data Nodes.
  • Each data node has 2 /data directories with 400 GB each.
  • Each instance is an EC2 m1.large box with 7.5 GB of RAM.
  • Sqoop is installed on one of the data nodes.
  • The HDFS block size is set to 268435456 bytes, i.e. 256 MB.
  • Each data node is set to run 2 mappers.
  • The task JVM heap on each data node (mapred.child.java.opts) is set to 550 MB (see the config sketch after this list).
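
As a rough sketch, these settings correspond to something like the following hdfs-site.xml and mapred-site.xml properties on an MRv1-style cluster. Property names vary across Hadoop versions, so treat this as illustrative rather than the exact config used:

<!-- hdfs-site.xml: 256 MB HDFS block size -->
<property>
  <name>dfs.block.size</name>
  <value>268435456</value>
</property>

<!-- mapred-site.xml: 2 map slots per data node and a 550 MB task JVM heap -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>2</value>
</property>
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx550m</value>
</property>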

For specific tests, we will tweak certain settings. This is our base Sqoop command, to which we will add further options:

sqoop import --connect jdbc:mysql://foo-bar-server.eu-west-1.rds.amazonaws.com/FooBarDatabase --table FooBarTable --username user --password g00DPassword --hive-import  

Test 1:

Load data with no changes to the above settings on each data node:

Mappers: 2, Reducers: 1, mapred.child.java.opts: 550 MB

The data is loaded in 17 mins.

Test 2:

Load data with the following changes to the settings on each data node:

Mappers: 4, Reducers: 1, mapred.child.java.opts: 550 MB

Run the Sqoop command with the -m 8 option.
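
For reference, the full command for this test only adds the -m flag to the base command (the per-node map slot limit, mapred.tasktracker.map.tasks.maximum, is assumed to have been raised from 2 to 4 to match the settings above):

sqoop import --connect jdbc:mysql://foo-bar-server.eu-west-1.rds.amazonaws.com/FooBarDatabase --table FooBarTable --username user --password g00DPassword --hive-import -m 8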

The data is loaded in 12 mins.

Increasing the number of mappers does improve the run time, but there is a limit on how many mappers we can run on a single data node. To improve the speed further, we may need to add more data nodes to the cluster.

Test 3:

Load data with the following changes to the settings on each data node:

Mappers: 4, Reducers: 1, mapred.child.java.opts: 550 MB, use --direct in the Sqoop command
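
The command for this test adds --direct, presumably keeping the -m 8 parallelism from the previous test:

sqoop import --connect jdbc:mysql://foo-bar-server.eu-west-1.rds.amazonaws.com/FooBarDatabase --table FooBarTable --username user --password g00DPassword --hive-import -m 8 --direct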

The data is loaded in 5 mins and 54 secs.

With the same infrastructure and the same Hadoop settings, we have now halved the run time using Sqoop's --direct option, which internally uses mysqldump.

Test 4:

Load data with the following changes to the settings on each data node:

Mappers: 4, Reducers: 1, mapred.child.java.opts: 550 MB, use --direct in the Sqoop command, use compression

--compress --compression-codec com.hadoop.compression.lzo.LzopCodec
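
Putting it all together, the full command for this test would look something like the following (again assuming the -m 8 and --direct options are carried over):

sqoop import --connect jdbc:mysql://foo-bar-server.eu-west-1.rds.amazonaws.com/FooBarDatabase --table FooBarTable --username user --password g00DPassword --hive-import -m 8 --direct --compress --compression-codec com.hadoop.compression.lzo.LzopCodec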

The data is now loaded in 4 mins and 37 secs, plus an additional 20 secs for LZO index generation.
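
The LZO index generation mentioned above can be done with the indexer shipped in the hadoop-lzo library, along these lines (the jar path and HDFS directory are placeholders):

hadoop jar /path/to/hadoop-lzo.jar com.hadoop.compression.lzo.DistributedLzoIndexer /user/hive/warehouse/foobartable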

By tweaking Hadoop and Sqoop parameters, we have reduced the run time from roughly 17 mins to under 5 mins. Over the long run, with huge amounts of data, this will save a lot of CPU and data transfer cost.

These times will not hold true in all cases: the test cluster used here was not loaded with any other tasks. The results also depend on where the source and destination sit on the network; in this case, everything was on AWS infrastructure.