World of Analytics

There’s been a lot of buzz in the tech world lately about analytics. Plenty of people are encouraging you to take a look at your data and build an environment around it. So naturally, you go out and deploy a Hadoop cluster and BOOM, you have an analytics environment. You are “Big Data Ready”. And the truth is you are ready! …Mostly. The extremely interesting and somewhat complicated part of analytics and big data is that there is no “one size fits all” product for your overall analytics environment. Much like your traditional IT environment, it involves multiple hardware, software and application teams working together to stand up the overall solution. So while Hadoop is great at batch processing and chewing through large datasets, it is going to underwhelm you for real-time analysis.

Stream Processing in Hadoop

I don’t want this to sound like I’m bashing Hadoop at all. I just want to make sure that we use the proper tools in their correct roles. Simply put, you wouldn’t want a Ferrari to pull a camping trailer, just as you wouldn’t want to take your F-150 street racing. Fortunately, a large number of open source projects have hit the streets that really excel where Hadoop doesn’t. Bonus: some of them just happen to be really great! Now comes the fun part of trying to make sense of where to put what; but don’t let that scare you off… there’s a great blueprint to follow.

Nathan Marz, a software engineer at Twitter, was faced with the task of building a solution to optimize his data processing environments. This solution, coined “Lambda Architecture”, builds a fault-tolerant, scale-out system that achieves low-latency reads and writes across a vast range of workloads and use cases. Basically, he created a tiering platform for your data-intensive environment. While it sounds extremely complicated, let’s take a look at the diagram to see how this works:

Lambda Architecture

  1. All data entering the system is dispatched to both the batch layer and the speed layer for processing.
  2. The batch layer has two functions: (i) managing the master dataset (an immutable, append-only set of raw data), and (ii) pre-computing the batch views.
  3. The serving layer indexes the batch views so that they can be queried in a low-latency, ad-hoc way.
  4. The speed layer compensates for the high latency of updates to the serving layer and deals with recent data only.
  5. Any incoming query can be answered by merging results from batch views and real-time views.
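
To make the five steps above a little more concrete, here is a minimal, single-process sketch in Python that counts hashtags. It is only an illustration of the data flow, not how any real project implements it: the class name `LambdaSketch` and its methods are hypothetical, and plain in-memory structures stand in for the batch, serving and speed layers.

```python
from collections import Counter


class LambdaSketch:
    """Toy, in-memory illustration of the Lambda Architecture data flow.

    All names here are hypothetical; a real deployment would use separate
    distributed systems for each layer.
    """

    def __init__(self):
        self.master_dataset = []        # batch layer: append-only raw data
        self.batch_view = Counter()     # serving layer: precomputed batch view
        self.realtime_view = Counter()  # speed layer: recent data only

    def ingest(self, hashtag):
        # Step 1: every event goes to both the batch layer and the speed layer.
        self.master_dataset.append(hashtag)
        self.realtime_view[hashtag] += 1

    def run_batch(self):
        # Step 2: recompute the batch view from the entire master dataset.
        self.batch_view = Counter(self.master_dataset)
        # Step 4: the batch view has caught up, so the speed layer
        # can discard what it was covering in the meantime.
        self.realtime_view.clear()

    def query(self, hashtag):
        # Step 5: merge the batch view with the real-time view.
        return self.batch_view[hashtag] + self.realtime_view[hashtag]


system = LambdaSketch()
for tag in ["#hadoop", "#storm", "#hadoop"]:
    system.ingest(tag)

before_batch = system.query("#hadoop")  # answered by the speed layer alone
system.run_batch()
system.ingest("#hadoop")                # arrives after the batch run
after_batch = system.query("#hadoop")   # batch view plus real-time view
```

The point of the sketch is that a query never has to wait for the slow batch recomputation: anything the batch view has not absorbed yet is covered by the real-time view.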

While I hope this diagram helped give you an idea of how Lambda Architecture functions, let’s put some different projects on there to help give you an idea of how this works:

Lambda Architecture for Real Time Analytics

I’m extremely excited about Lambda and the idea that it brings to the table: tiering in your analytics environment. It helps place the correct workload within the appropriate framework. To put this into a real scenario, check out this post using Lambda Architecture for real-time analysis on Twitter hashtags. This is a huge success for us in the IT field, as it transforms your analytics environment by improving your ROI, stabilizing your platforms and meeting the varying SLAs of your business users. Now get out there and build one for yourself.
