December 13, 2012 · Hadoop HBase Hive Zookeeper

Hbase Hive Integration

While working with hadoop related technologies one touches several tools/frameworks. Once we have a hadoop cluster running, the next thing we want is to update records and have sql-like features. Hbase provides a way to fulfil the former and Hive fulfils the later. But what fun it would be if we stopped there, we want to connect everything. We want to utilise whatever we have done with one technology in slightly better and/or fetaure-rich way. And we want to connect Hbase and Hive. We want to expose the tables (columnar of course) stored in Hbase in Hive so that we can do good old select stuff. Lets see how its done.

Assuming, we have:

All of the above are huge tasks in themselves. Setting each one up is a task. Further we assume that we are dealing with MRv1 , mysql metastore for Hive and are using an external zookeeper for hbase.

Enough assumptions, this how we do it:

Hbase Hive integration is possible because of storage handlers. Storage handlers essentially provide all the behind the scenes functionality to take to a different data store. Different here means anything other than hive itself. So we need storage handlers, which, un-surprisingly would come up as jar files. One would need three jar files on the hive classpath for hbase hive integration to work, we can add them on the hive shell as:

add jar /usr/lib/hive/lib/hive-hbase-handler-0.9.0-cdh4.1.2.jar;  
add jar /usr/lib/hive/lib/hbase.jar;  
add jar /usr/lib/hive/lib/zookeeper.jar;  

Now we are ready to define hive tables which would point to hbase tables.

CREATE EXTERNAL TABLE logs(key int, logDate string, log string)  
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH  
SERDEPROPERTIES ("hbase.columns.mapping" = ":key,p:logDate,p:log")  
TBLPROPERTIES ("" = "hbase_logs");  

The STORED BY clause specifies a Hbase storage handler class which does the magic for us. Now if we get errors like:

error : column family - not defined  

that means we do not have a column family defined and if we are sure that we have defined one then its a case of /n/r in the create table statement. Removing all such newline characters would make the statement work. Fire a show tables command to list the newly created table:

hive>show tables;  

Now if this simple command or any other previous hive command results in error like below:

org.apache.commons.dbcp.SQLNestedException: Cannot get a connection,pool error Could not create a validated object,  
cause: A read-only user or a user in a read-only database is not permitted to disable read-only mode on a connection  

means, most probably, you are using derby as your meta store and had run hive previously with root user (or a user with better permissions than the current user). Doing so creates a lock file which for some reason does not get removed on hive session termination. The solution is to remove that lock file:

cd /var/lib/hive/metastore/metastore_db (your meta store location) rm *.lck  

Now test if we can get data out with a hive query:

select * from logs LIMIT 10;  

In all possibilities this would work. A better test is to fire a query that invokes some map reduce jobs. For an example,

select log from logs LIMIT 10;  

Now this could fail and it may give several different reasons on the hive console saying why it failed. One error can be:

Exception in thread "Thread-27" java.lang.NullPointerException at org.apache.hadoop.hive.shims.Hadoop23Shims.getTaskAttemptLogUrl(  

This is where all the pain starts. Now to debug this, one has to look into three layers: hive, hbase and hadoop! We are not sure what is wrong and at what level. May be hbase master had an error or may be region server, may be hadoop job tracker or namenode or the datanode?! or something else. Anything.
Best way to approach this is to have a look at all the logs starting from top towards bottom, and starting from masters towards slaves. On investigating logs i found one error in the hadoop jobtracker’s log file:

INFO org.apache.hadoop.mapred.TaskInProgress:  
Error from attempt_201212131135_0013_m_000000_3: org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation@78c08593 closed  

This suggests something wrong with hbase client while the job is trying to connect to the hbase cluster. On further investigation i found errors in hadoop job‘s syslog file on the data node where the job was scheduled:

INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server Will not attempt to authenticate using SASL (unknown error)  
WARN org.apache.zookeeper.ClientCnxn: Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect Connection refused  

Since this error is on the datanode, its clear that the datanode is trying to connect to a zookeeper instance on its localhost and its failing as there is no zookeeper instance running on the datanode! And this is the real problem not the NullPointerException we got on the hive shell. Now, how do we tell hive to pass the correct zookeeper address to the hadoop job its forking? We need to configure that in the hive-sites.xml file or pass on the command line:

Command line:  
hive -hiveconf hbase.zookeeper.quorum=your-zookeeper-machine-address  

And this should solve the problem. Another common error is to run an external zookeeper and forget to change hbase configuration to not to fork its own zookeeper. In file one needs to turn internal zookeeper off before using an external one.

 export HBASE_MANAGES_ZK=false