Impala and MySQL: Comparing apples to oranges

Hadoop has become the de facto standard for big data and batch processing. Think of a data pipeline and you end up with Hadoop. The Hadoop ecosystem is changing at a rapid pace. Hadoop works very well with Hive for batch-oriented tasks. But as the ecosystem moves towards maturity, users have started demanding traditional ...

The Dependency Injection Debate

Hacker News is an awesome place. It keeps people involved and interested. There are articles, questions, answers and debates. One such ongoing debate is about Dependency Injection. It was declared a non-virtue, and subsequently someone else declared it a virtue. Virtue or not, one thing is clear: it is important. While people have strong opinions about it, I have ...

Comparison Matrix: Real time data processing systems

There are several tools/frameworks available that help process data as it arrives. In the past I did a comparative study of the following four systems: Apache Kafka, Facebook Scribe, Cloudera Flume and Apache Chukwa.
                 Kafka     Scribe            Flume               Chukwa
Current Version  0.6 [1]   2.2?              0.9.4 [1,2]         0.4 [1]
Site & Docs      Average   Very Poor         Good                Poor
Topology         P2P       Master/Slave [3]  Master/Slave [3,4]  ...

Hadoop Cluster on AWS VPC with Apache Whirr

Setting up a Hadoop cluster on cloud providers has been made relatively easy by tools such as Apache Whirr, Cloudera Manager and jclouds. Whirr uses jclouds internally. But what if one wanted to create a cluster that's not in the open public cloud? What if one wanted to create a cluster in an AWS VPC or on their ...

HBase Hive Integration

While working with Hadoop-related technologies, one touches several tools/frameworks. Once we have a Hadoop cluster running, the next thing we want is to update records and have SQL-like features. HBase fulfils the former and Hive the latter. But what fun would it be if we stopped there? We want to connect ...

ElasticSearch Cluster: Configuration & Best Practices 1

I was recently working on setting up an ElasticSearch cluster with Apache Whirr. Setting up a cluster is one thing; running it is something else entirely. Running a cluster is far more complex than setting one up, and things are no different for an ElasticSearch cluster. There are several things one needs to be aware of ...

Apache Whirr: ElasticSearch Cluster Setup on Amazon EC2

The last time I tried to create a cluster, I had several problems. I was creating a Hadoop cluster on Amazon EC2. Taking those lessons forward, I decided to create an ElasticSearch cluster on EC2 with Apache Whirr. The idea of launching several machines with one command is really enticing, and it's wonderful if it works. ...

Common Hadoop (YARN) Errors

While installing or running Hadoop, one gets different errors at different times. Here is a list of some that I could think of: "class org.apache.hadoop.mapred.ShuffleHandler not found": This error occurs in the YARN NodeManager. It arises because the system could not find the YARN MapReduce folder. In a CDH installation, check for the /usr/lib/hadoop-mapreduce folder. If its ...

Apache Whirr: Create Hadoop Cluster Automatically

Creating a Hadoop cluster is quite a task. It becomes all the more challenging because of the ever-changing nature of the technologies involved. First there was Apache Hadoop, then we had Cloudera Hadoop, and now we have many more (Hortonworks, MapR, etc.). Regardless of the flavor one chooses, getting a Hadoop cluster going will have some surprises and ...

AWS for Dummies

Here are some of the basic questions that one comes across while working with AWS. I will try to keep adding more to the list. Can I change the security group of my EC2 instance? The answer is no. One can change the security group only for an instance that’s inside a VPC. Once an instance has started, ...