Creating a Hadoop cluster is quite a task. It becomes all the more challenging because of the ever-changing nature of the technologies involved. First there was Apache Hadoop, then Cloudera's distribution, and now there are many more (Hortonworks, MapR, etc.). Regardless of the flavor one chooses, getting a Hadoop cluster going brings some surprises and headaches.

One can install a cluster in primarily three ways:

  • Start with Apache Hadoop tarballs and do it yourself
  • Start with CDH tarballs and do it yourself, or use Cloudera Manager
  • Use systems like Puppet, Chef or Whirr

The first approach is good if one wants to learn what really goes on behind the scenes. The second approach works well and is mostly clicking buttons on a web UI (but that does not mean you are done after the last click; things may still not work).

The third approach gives, in theory, everything the second does, minus the nice web UI. Puppet and Chef have been around for some time and are widely used configuration management systems. Apache Whirr is the new kid on the block. It utilizes Puppet, Chef and many more systems behind the scenes.

I chose Whirr after glancing through Puppet for a while. Puppet and Chef require one to learn certain things; with Whirr one need not. Define a configuration file, invoke whirr from the command line and voila, you are done… well, in theory. Let's see how it went for me. I chose:

  • Whirr 0.8.1
  • AWS
  • Ubuntu 12.04
  • CDH4 (Yarn)

Whirr can be installed on one’s local machine or on a cloud instance. We will see how Whirr generates proxy settings so that one can launch MapReduce jobs from the machine the Hadoop cluster was created from. To get started one would do the following:

  • Download and install whirr
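  • Set up SSH keys to be used for the cluster setup, for example: ssh-keygen -t rsa -P '' -f ~/.ssh/whirr_id_rsa (the file name matches the key paths in the properties file below)

With Whirr one defines a properties file for the cluster configuration, for example: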

whirr.cluster-name=yourclustername  
whirr.instance-templates=1 hadoop-namenode+yarn-resourcemanager+mapreduce-historyserver, 2 hadoop-datanode+yarn-nodemanager  
whirr.provider=aws-ec2  
whirr.identity=your-aws-key  
whirr.credential=your-aws-secret  
whirr.private-key-file=${sys:user.home}/.ssh/whirr_id_rsa  
whirr.public-key-file=${sys:user.home}/.ssh/whirr_id_rsa.pub  
whirr.env.mapreduce_version=2  
whirr.env.repo=cdh4  
whirr.hadoop.install-function=install_cdh_hadoop  
whirr.hadoop.configure-function=configure_cdh_hadoop  
whirr.mr_jobhistory.start-function=start_cdh_mr_jobhistory  
whirr.yarn.configure-function=configure_cdh_yarn  
whirr.yarn.start-function=start_cdh_yarn  
whirr.hardware-id=m1.large  
whirr.image-id=eu-west-1/ami-81c5fdf5  
whirr.location-id=eu-west-1  
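
Before launching, it can help to confirm that the properties file actually contains the keys the launch depends on. The following is only a minimal sketch: the function name check_whirr_properties is made up for illustration, and the list of "required" keys is an assumption based on the example above, not something Whirr itself defines.

```shell
# Sketch: report which expected Whirr properties are missing from a file.
# Assumption: the key list below is taken from the example properties file;
# it is not an official list of mandatory Whirr settings.
check_whirr_properties() {
    props_file=$1
    missing=0
    for key in whirr.cluster-name whirr.instance-templates whirr.provider \
               whirr.identity whirr.credential; do
        # A key counts as present if some line starts with "key=".
        if ! grep -q "^${key}=" "$props_file"; then
            echo "missing: $key"
            missing=1
        fi
    done
    # Exit status 0 when everything was found, 1 otherwise.
    return "$missing"
}
```

For example, check_whirr_properties hadoop.properties prints one line per missing key and returns a non-zero status if any key is absent.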

Whirr defines roles for Hadoop instances. With the properties file above I have asked it to create one master (namenode, YARN resourcemanager and MapReduce history server) and two workers (each a datanode plus a YARN nodemanager).

In theory, I would just fire the command below and have 3 AWS EC2 instances configured properly and running.

/bin/whirr launch-cluster --config hadoop.properties

But the world does not run on ideal conditions. As expected, I hit problems. It did create 3 instances, but those instances had no Java and no Hadoop. Upon some investigation I found that it was trying to install OpenJDK, and on Ubuntu 12.04 the dependencies could not be found. OpenJDK is the default Java installation option with Whirr. It allows one to change to Oracle Java (I hate saying it; Sun Java sounds far better).

whirr.java.install-function=install_oracle_jdk6  

Adding the above line to the properties file should install the Oracle JDK. That failed too when I ran the launch-cluster command. This prompted me to modify certain shell scripts in Whirr to fire a sudo apt-get update before trying to install Java. I modified the scripts inside

whirr-0.8.1/core/src/main/resources/functions/  

There one can find install_oracle_jdk6.sh and install_openjdk.sh. It cannot be that easy, right? It was not. Just modifying these scripts does not mean they will be picked up by the whirr command line. The whirr command uses jars included in a lib folder. To have the modified scripts picked up at run time, one needs to put them into the jar files or build Whirr. Building Whirr requires one to:

  • install Maven
  • install Ruby
  • install Java

Building it is a long shot. I modified the jar files instead and ran the command again. No success. Still no Java. Then I found oab-java. And to my surprise it was included in the Whirr scripts. I modified my properties file to have

whirr.java.install-function=install_oab_java  

and yes, it worked. It correctly installs Sun (Oracle) Java 6 on the machines, and since Java was installed, Hadoop was installed too. As I had said: some surprises and some headaches.

(Edit: After writing the above paragraph, I found that one can create a functions directory in the Whirr installation directory; putting all modified shell scripts there does the trick.)
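
The functions-directory approach can be sketched roughly as follows. This is an illustration under assumptions, not the exact change I made: the helper name make_function_override is hypothetical, and it assumes the stock script installs Java through a line that contains apt-get, which should be checked against the actual script before relying on it.

```shell
# Sketch: copy a stock Whirr function script into an override directory and
# insert "sudo apt-get update" just before the first line mentioning apt-get.
# Assumption: the stock script contains such a line; if it does not, the copy
# is left unchanged.
make_function_override() {
    src=$1        # a stock script, e.g. from core/src/main/resources/functions/
    dest_dir=$2   # the functions directory inside the Whirr installation
    mkdir -p "$dest_dir"
    awk '
        !patched && /apt-get/ {
            indent = $0
            sub(/[^ \t].*$/, "", indent)   # keep only the leading whitespace
            print indent "sudo apt-get update"
            patched = 1
        }
        { print }
    ' "$src" > "$dest_dir/$(basename "$src")"
}
```

It would be called with the path of the stock script and the target functions directory, after which the patched copy should be reviewed by hand before launching a cluster.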

In between I tried to install things on Ubuntu 11.10 too, to check if it works. It did not. I also found a few constraints on the cluster name that goes in the properties file.

The cluster name needs to be all lowercase, else this error comes:
java.lang.IllegalArgumentException: Object 'youeNAME' doesn't match dns naming constraints.Reason: Should be only lowercase  

Here is what you get if you use an underscore in your cluster name:

Should have lowercase ASCII letters, numbers, or dashes  
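
A quick local check can catch a bad cluster name before any instances are started. This is a minimal sketch covering only the two rules seen in the errors above (lowercase ASCII letters, numbers, or dashes); the function name is_valid_cluster_name is made up here, and jclouds may enforce further rules that this does not cover.

```shell
# Sketch: accept a cluster name only if it consists solely of lowercase
# ASCII letters, digits and dashes. Other constraints are not checked.
is_valid_cluster_name() {
    # LC_ALL=C keeps the a-z range strictly ASCII regardless of locale.
    printf '%s\n' "$1" | LC_ALL=C grep -Eq '^[a-z0-9-]+$'
}
```

For example, is_valid_cluster_name "my_cluster" || echo "bad cluster name" would flag the underscore.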

Other than that, hey…. it works.

Wait a minute, isn’t “it works” a subjective feeling? Indeed, it is. So let's verify whether things really work.

What? Still no java?

If using install_oab_java does not install Java for some reason, the last option is to create a custom AMI with Java pre-installed (launch a stock Ubuntu instance, install Java on it, and save it as an image).

Once the custom AMI is created, use it in the Whirr properties file to launch a cluster. Most likely this error will come up:

error acquiring SFTPClient() (out of retries - max 7): Invalid packet: indicated length 1349281121 too large  

This is because jclouds is not able to identify the right user to configure the machines with. With stock AMIs, jclouds knows that the user is ubuntu; with custom AMIs it does not, and it needs to be specified as below:

whirr.bootstrap-user=ubuntu  

Now, run the whirr launch command and it should work!

Installation Succeeds but no nodemanager

If you are using Whirr 0.8.1 or prior(?), it's highly likely that your cluster will be running but none of the slave machines will have a nodemanager running. This is due to a bug. The solution is to specify the mapreduce_version property in UPPERCASE in the properties file.

whirr.env.MAPREDUCE_VERSION=2  
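
If the lowercase key is already in a properties file, the rename can be scripted. A minimal sketch, with a made-up function name (uppercase_mapreduce_key); it only touches a line that starts with the lowercase key and leaves everything else as it is.

```shell
# Sketch: rewrite whirr.env.mapreduce_version to whirr.env.MAPREDUCE_VERSION
# in a properties file, keeping the value and all other lines unchanged.
uppercase_mapreduce_key() {
    props_file=$1
    tmp_file=$(mktemp)
    sed 's/^whirr\.env\.mapreduce_version=/whirr.env.MAPREDUCE_VERSION=/' \
        "$props_file" > "$tmp_file"
    # Write back through cat so the original file keeps its permissions.
    cat "$tmp_file" > "$props_file"
    rm -f "$tmp_file"
}
```

For example: uppercase_mapreduce_key hadoop.properties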

Test Hadoop cluster

Note that all the above commands were fired from a local machine or a cloud instance which is not part of the Hadoop cluster being created. Once the cluster setup is done, a success message like the one below appears on the command line:

Namenode web UI available at http://ec2-XXX-XXX-XXX-XXX.compute-1.amazonaws.com:50070

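SSH from any machine of choice to this machine, and check whether port 50070 shows a web page (to view it through an SSH tunnel, add -L 50070:localhost:50070 to the command and browse to http://localhost:50070). You will need the SSH keys you used to create the cluster. Your SSH command would look like this: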

ssh -i ~/.ssh/whirr_id_rsa ubuntu@ec2-XXX-XXX-XXX-XXX.compute-1.amazonaws.com  

If you get a web page, two things have been verified:

  • SSH to the namenode is working with proper keys
  • Namenode web UI is working

Assuming SSH to the machine worked, verify that Java is installed:

java -version  

Verify if hadoop is installed and running:

hadoop fs -ls /  
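
The two checks above fail in a confusing way if the binary is not on the PATH at all, so it can be useful to test for that first. A small sketch; check_tool is a hypothetical helper name, and it only tests whether a command can be found, not whether the service behind it is healthy.

```shell
# Sketch: report whether a command is available on the PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "ok: $1 found at $(command -v "$1")"
    else
        echo "missing: $1"
        return 1
    fi
}
```

On the namenode one might run check_tool java && check_tool hadoop before trying java -version and hadoop fs -ls /.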

Is this enough verification? No. Check whether you can run some Hadoop MapReduce examples on the cluster. Whirr provides a proxy settings file which can be used to point a local Hadoop installation at the cluster configuration. First make sure that you have a local installation running. Once done, set HADOOP_CONF_DIR to point to Whirr’s Hadoop configuration files:

export HADOOP_CONF_DIR=$HOME/.whirr/yourclustername/  

Start the Hadoop proxy in a different shell and keep it running:

sh ~/.whirr/yourclustername/hadoop-proxy.sh  

Fire a hadoop ls; this should list the cluster directories:

hadoop fs -ls /  

Now, run some MapReduce jobs to ascertain that it's really working and there is nothing wrong with the cluster.

export HADOOP_HOME=/usr/lib/hadoop  
wget www.nytimes.com  
hadoop fs -mkdir input  
hadoop fs -put index.html input  
hadoop jar $HADOOP_HOME/hadoop-examples-*.jar wordcount input output  

The above commands assume that you have already created a directory for your username under the /user directory inside the Hadoop file system. For example, if you are running these commands as ubuntu, there needs to be a /user/ubuntu directory in HDFS and it needs to have proper permissions. If that's not the case, you would get weird-looking permission errors, such as

security.UserGroupInformation: PriviledgedActionException  

Whirr creates the entire Hadoop configuration with proper users: hdfs, mapred, yarn etc. To allow other Linux users to run Hadoop jobs, one needs to create a corresponding directory for each of them inside the /user directory in HDFS. Besides that, there are several instructions regarding granting access on the /tmp folder etc. The Cloudera documentation covers these in detail.
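
Creating the per-user directory usually comes down to two commands run as the hdfs user. The sketch below only prints them so they can be reviewed first; hdfs_home_commands is a made-up helper name, and it assumes that hdfs is the HDFS superuser and that sudo is available on the node where the commands are run.

```shell
# Sketch: print the commands that create /user/<name> in HDFS and hand
# ownership to that user. Nothing is executed by this function itself.
# Assumption: the hdfs account is the HDFS superuser on the cluster.
hdfs_home_commands() {
    user=$1
    echo "sudo -u hdfs hadoop fs -mkdir /user/$user"
    echo "sudo -u hdfs hadoop fs -chown $user /user/$user"
}
```

For example, hdfs_home_commands ubuntu prints the two commands for the ubuntu user; once they look right, they can be pasted into a shell on a cluster node or piped to sh.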