ElasticSearch Query: Performance Optimisation

In one of my previous posts on Elasticsearch, I shared my understanding of Elasticsearch configurations and best practices. That was mostly from an indexing perspective. There are several tweaks one can use to optimise query performance as well. Improving query time can be even more challenging than improving indexing time. Let's see why ...

Why Windows HDInsights is going nowhere

Everyone wants to do big data, and Microsoft is no exception. Jumping on the big data bandwagon and cashing in is something no one wants to miss. There are so many players in the market: AWS, Cloudera, MapR, Hortonworks, IBM, Intel and, of course, the open source Hadoop ecosystem. Stakes are high and Microsoft knows it. That is ...

Tools of the Trade

This is a note-to-self post. I am listing here some of the tools that I use on a daily basis. Some of these tools are just awesome, like powerline. The picture below is the current state of my terminal window. Vi Commands Vim :set paste :%le -> indent everything to the left r -> ...

Performance Tuning Data Load into Hadoop with Sqoop 1

Working with Hadoop involves working with huge amounts of data. At times, it also involves moving huge amounts of data from traditional data stores such as MySQL and Oracle. Apache Sqoop is an excellent tool that aids in migrating data to and from a Hadoop cluster. Data migration into Hadoop can become tricky and challenging ...

Impala and Mysql: Comparing apples to oranges

Hadoop has become the de facto standard for big data and batch processing. Think of a data pipeline and you end up with Hadoop. The Hadoop ecosystem is changing, and changing at a rapid pace. Hadoop serves very well with Hive for batch-oriented tasks. But as the ecosystem moves towards maturity, users have started demanding traditional ...

The Dependency Injection Debate

Hacker News is an awesome place. It keeps people involved and interested. There are articles, questions, answers and debates. One such debate going on is about Dependency Injection. It was declared a non-virtue, and subsequently someone else declared it a virtue. Virtue or not, one thing is clear: it is important. While people have strong opinions about it, I have ...

Comparison Matrix: Real time data processing systems

There are several tools and frameworks available that help process data as it arrives. In the past I did a comparative study of the four systems below:

Apache Kafka
Facebook Scribe
Cloudera Flume
Apache Chukwa

                 Kafka     Scribe          Flume                Chukwa
Current Version  0.6^1     2.2?            0.9.4^1,2            0.4^1
Site & Docs      Average   Very Poor       Good                 Poor
Topology         P2P       Master/Slave^3  Master/Slave^3, 4 ...

Hadoop Cluster on AWS VPC with Apache Whirr

Setting up a Hadoop cluster on cloud providers has been made relatively easy by tools such as Apache Whirr, Cloudera Manager and jclouds. Whirr uses jclouds internally. But what if one wanted to create a cluster that's not in the open public cloud? What if one wanted to create a cluster in AWS VPC or on their ...

Hbase Hive Integration

While working with Hadoop-related technologies, one touches several tools and frameworks. Once we have a Hadoop cluster running, the next thing we want is to update records and have SQL-like features. HBase provides a way to fulfil the former and Hive fulfils the latter. But what fun would it be if we stopped there? We want to connect ...

ElasticSearch Cluster: Configuration & Best Practices

I was recently working on setting up an Elasticsearch cluster with Apache Whirr. Setting up a cluster is one thing; running it is entirely different, and far more complex. Things are no different for an Elasticsearch cluster. There are several things one needs to be aware of ...