Apache Whirr: Create Hadoop Cluster Automatically

Creating a Hadoop cluster is quite a task. It becomes all the more challenging because of the ever-changing, evolving nature of the technologies involved. First there was Apache Hadoop, then we had Cloudera's Hadoop, and now we have very many flavors (Hortonworks, MapR etc.). Regardless of the flavor one chooses, getting a Hadoop cluster going will come with some surprises and headaches.

One can install a cluster in primarily 3 ways:

  • Start with Apache Hadoop tarballs and do it yourself
  • Start with CDH tarballs and do it yourself, or use Cloudera Manager
  • Use systems like Puppet, Chef or Whirr
The first approach is good if one wants to learn and see what really goes on behind the scenes. The second approach works well, and it's mostly clicking buttons on a web UI (but that does not mean you are done after the last click; things may still not work).

The third approach gives, in theory, everything the second approach does, minus the nice web UI. Puppet and Chef have been around for some time now and are widely used configuration management systems. Apache Whirr is the new kid on the block. It utilizes Puppet, Chef and many more systems behind the scenes.

I chose Whirr after glancing through Puppet for a while. Puppet and Chef would require one to learn certain things; with Whirr one need not. Define a configuration file, invoke whirr from the command line and voila, you are done… well, in theory. Let's see how it went for me.
I chose:
  • Whirr 0.8.1
  • AWS
  • Ubuntu 12.04
  • CDH4 (Yarn)
Whirr can be installed on one's local machine or on a cloud instance. We will see how Whirr comes up with proxy settings so that one can launch map reduce jobs from the machine the Hadoop cluster was created from. To get started one would do the following:
  • Download and install whirr
  • Set up SSH keys to be used with cluster setup
With Whirr one defines a properties file for the cluster configuration, for example:
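A minimal sketch, modeled on the CDH4 YARN recipe that ships with Whirr (the AWS keys come from environment variables; the SSH key paths are whatever you generated in the previous step):

    whirr.cluster-name=myhadoopcluster
    whirr.instance-templates=1 hadoop-namenode+yarn-resourcemanager+mapreduce-historyserver,2 hadoop-datanode+yarn-nodemanager
    whirr.provider=aws-ec2
    whirr.identity=${env:AWS_ACCESS_KEY_ID}
    whirr.credential=${env:AWS_SECRET_ACCESS_KEY}
    whirr.private-key-file=${sys:user.home}/.ssh/whirr_id_rsa
    whirr.public-key-file=${sys:user.home}/.ssh/whirr_id_rsa.pub
    whirr.env.repo=cdh4
    whirr.env.mapreduce_version=yarn
    whirr.hadoop.install-function=install_cdh_hadoop
    whirr.hadoop.configure-function=configure_cdh_hadoop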

Whirr defines roles for Hadoop instances. With the properties file above I have asked it to create 1 Namenode and 2 Datanodes.

In theory, I would just fire the command below and have 3 AWS EC2 instances configured properly and running.
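Assuming the properties file above was saved as hadoop.properties:

    bin/whirr launch-cluster --config hadoop.properties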

But the world does not run on ideal conditions. As expected, I hit problems. It did create 3 instances, but those instances had no Java and no Hadoop. Upon some investigation I found that it was trying to install OpenJDK, and on Ubuntu 12.04 the dependencies could not be found. OpenJDK is the default Java installation option with Whirr, but it allows one to change to Oracle Java (I hate saying that; Sun Java sounds far better).
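The switch is a single property pointing at the Oracle JDK install function bundled with Whirr:

    whirr.java.install-function=install_oracle_jdk6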

Adding the above line to the properties file would install the Oracle JDK. That failed too when I ran the launch-cluster command, which prompted me to modify certain shell script files in Whirr to fire a sudo apt-get update before trying to install Java. I modified the scripts inside Whirr's functions directory.

There one can find install_oracle_jdk6.sh and install_openjdk.sh. It cannot be that easy, right? It was not. Just modifying these scripts does not mean they would be picked up by the whirr command line; the whirr command uses the jars included in its lib folder. To have the modified scripts picked up at run time, one needs to put them into the jar files or build Whirr. For building Whirr, one needs to:

  • install maven
  • install ruby
  • install java
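Alternatively, one can push the patched script straight back into the jar. A sketch, assuming the modified script sits in a local functions/ directory and that the bundled scripts live in the whirr-core jar of the 0.8.1 release:

    jar uf lib/whirr-core-0.8.1.jar functions/install_oracle_jdk6.sh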
Building it is a long shot, so I did modify the jar files as above and ran the command again. No success: still no Java. Then I found this: oab-java. And to my surprise it was included in the Whirr scripts. I modified my properties file to have
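    whirr.java.install-function=install_oab_java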

and yes, it worked. It correctly installs Sun (Oracle) Java 6 on the machines, and since Java was installed, Hadoop got installed too. As I had said: some surprises and some headaches.

(Edit: After writing the above para, I have found that one can create a functions directory in the Whirr installation directory, and putting all modified shell scripts there does the trick.)
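Something like this, assuming WHIRR_HOME points at the Whirr installation directory:

    mkdir -p $WHIRR_HOME/functions
    cp install_oracle_jdk6.sh $WHIRR_HOME/functions/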

In between I tried to install things on Ubuntu 11.10 too, to check if it works. It did not. I also found a few constraints on the clustername that goes in the properties file:

  • clustername needs to be all lowercase, otherwise the launch fails with an error.

  • An underscore in the cluster name fails too, with a different error.

Other than that, hey…. it works.

Wait a minute, isn’t “it works” a subjective feeling? Indeed it is. So let's verify if things really work.

What? Still no Java?

If using install_oab_java does not install Java for some reason, the last option is to create a custom AMI with Java pre-installed. Here is what needs to be done (see the sketch after this list):

  • Pick an AMI and start a machine
  • SSH to that machine and install Java using oab-java
  • Check that Java is properly installed
  • Now create an AMI from that machine
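A sketch of the Java installation step, using the oab-java script (URL from the flexiondotorg/oab-java6 project as it was at the time; verify it before running anything with sudo):

    # fetch and run the oab-java build script on the instance
    wget https://raw.github.com/flexiondotorg/oab-java6/master/oab-java.sh
    sudo bash oab-java.sh
    # the script publishes Sun/Oracle Java 6 packages to a local apt repo
    sudo apt-get install sun-java6-jdk
    java -version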
Once the new custom AMI is created, use that in the whirr properties file to launch a cluster. And most likely an error would come up.

This is because jclouds is not able to identify the right user to configure the machines with. With stock AMIs jclouds knows that the user is ubuntu; with custom AMIs it does not, and it needs to be specified as below:
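A sketch of the two relevant properties (the AMI id is a placeholder for your custom image; whirr.login-user overrides the user jclouds bootstraps with):

    whirr.image-id=us-east-1/ami-xxxxxxxx
    whirr.login-user=ubuntu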

Now, run the whirr launch command and it should work!

Installation Succeeds but no nodemanager

If you are using Whirr 0.8.1 or prior(?), it is highly likely that your cluster will be running but none of the slave machines will have the nodemanager running. This is due to a bug. The solution is to specify the mapreduce_version property in UPPERCASE in the properties file.
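My reading of the fix, uppercasing the value (and the variable name, to be safe):

    whirr.env.MAPREDUCE_VERSION=YARN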

Test Hadoop cluster

The thing to note here is that all the above commands were fired from a local machine or a cloud instance which is not part of the Hadoop cluster being created. Once the cluster setup is done, Whirr prints a success message on the command line and writes the cluster's configuration (including the proxy script used below) under ~/.whirr/<clustername>.

Open an SSH tunnel from any machine of your choice to the namenode and check if port 50070 shows a web page. You would need the SSH keys you used to create the cluster. Your SSH command would look something like this:
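A sketch, with the key path and namenode address as placeholders:

    ssh -i ~/.ssh/whirr_id_rsa -L 50070:localhost:50070 ubuntu@<namenode-public-dns>

Then point a browser at http://localhost:50070.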

If you get a web page, two things have been verified:

  • SSH to the namenode is working with proper keys
  • Namenode web UI is working

Assuming SSH to the machine worked, verify that Java is installed:
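    java -version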

Verify that Hadoop is installed and running:
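    hadoop version
    ps aux | grep -i namenode    # the daemon should show up on the namenode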

Is this enough verification? No. Check and see if you can run some Hadoop map reduce examples on the cluster. Whirr provides a proxy settings file which can be used to point a local Hadoop installation at the cluster configuration. First make sure that you have a local installation running. Once done, point HADOOP_CONF_DIR at Whirr's generated Hadoop configuration files:
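A sketch, assuming the cluster was named myhadoopcluster:

    # Whirr writes the cluster's hadoop-site.xml under ~/.whirr/<clustername>
    export HADOOP_CONF_DIR=~/.whirr/myhadoopcluster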

Start the hadoop proxy in a different shell and keep it running:
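    . ~/.whirr/myhadoopcluster/hadoop-proxy.sh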

Fire a hadoop ls; this should list the cluster's directories:
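    hadoop fs -ls /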

Now, run some map reduce jobs to ascertain that it is really working and there is nothing wrong with the cluster.
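For instance, the bundled pi estimator (the jar path below is the CDH4 default; adjust it for your local installation):

    hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar pi 10 100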

The above commands assume that you have already created a directory for your username under the /user directory inside the Hadoop file system. For example, if you are running these commands as ubuntu, there needs to be a /user/ubuntu directory in HDFS, and it needs to have proper permissions. If that is not the case, you would get weird-looking permission errors (org.apache.hadoop.security.AccessControlException: Permission denied).

Whirr creates the entire Hadoop configuration with the proper users: hdfs, mapred, yarn etc. To allow a Linux user to run Hadoop jobs, one needs to create the corresponding directory inside the /user directory in HDFS. Besides that, there are several instructions regarding granting access to the /tmp folder etc.; see the detailed Cloudera documentation.
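For the ubuntu user it would look like this, run as the hdfs superuser:

    sudo -u hdfs hadoop fs -mkdir /user/ubuntu
    sudo -u hdfs hadoop fs -chown ubuntu /user/ubuntu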
