Thursday, January 28, 2016

Lambda Architecture

Today let's think about Lambda Architecture. In this article, I would like to explain the subject in a simple way, and I will try to give you some examples along the way.

First, we should understand what Lambda Architecture is.

In the Big Data world, we need to deal with two basic kinds of processing: batch and online. To bring both into a single design, we need an architecture. These kinds of processing bring their own rules, classifications and methods, and putting those together leads to the final design. A lot of effort has gone into reaching this point.

Actually, I don't want to tell you the history of this architecture. Put simply, Lambda Architecture is a way to make your Big Data environment more understandable, because it gives you structure, rules and so on.

As I mentioned before, two kinds of processing are fundamental in your environment: batch and online.
We can add a serving layer to them.
Now we have three processes, so we need to talk about three layers: the Batch Layer, the Speed Layer (online) and the Serving Layer.

Batch Layer: If you have to handle large quantities of data, you need to produce results, reports and so on from this big collection of data, and you don't have tight time restrictions, you are in the right layer. In the batch layer you need a good cluster to hold the master dataset, that is, all the data you will work with, and you are responsible for managing this data in this layer.

In this layer you can put your Big Data cluster, for example a Hadoop cluster, and you can use a Hadoop distribution such as Cloudera or Hortonworks.

You can run batch MapReduce jobs, Spark Core applications, Hive queries, Pig scripts and so on to pre-compute your batch views. (Batch views are the precomputed results of functions applied to the master dataset. You can think of them as what makes the raw data meaningful.)

So, this layer has two major functions:
     1 -      Managing the master dataset (an immutable, append-only set of raw data)
     2 -      Pre-computing the batch views
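
To make these two responsibilities concrete, here is a minimal sketch in plain Python. It is only an illustration and not how Hadoop does it: the PageView record, the MasterDataset class and the compute_batch_view function are hypothetical names I made up, and an in-memory list stands in for the cluster.

```python
from collections import Counter
from typing import NamedTuple


class PageView(NamedTuple):
    """One immutable raw fact in the master dataset (hypothetical schema)."""
    user: str
    url: str
    timestamp: int


class MasterDataset:
    """Append-only store: records can be added but never changed or removed."""

    def __init__(self) -> None:
        self._records: list[PageView] = []

    def append(self, record: PageView) -> None:
        self._records.append(record)

    def scan(self) -> tuple[PageView, ...]:
        # Return a tuple so callers cannot mutate the stored data.
        return tuple(self._records)


def compute_batch_view(master: MasterDataset) -> dict[str, int]:
    """Recompute the page-view count per URL from all of the raw data."""
    return dict(Counter(record.url for record in master.scan()))


master = MasterDataset()
master.append(PageView("alice", "/home", 1))
master.append(PageView("bob", "/home", 2))
master.append(PageView("alice", "/cart", 3))

batch_view = compute_batch_view(master)
print(batch_view)
```

The batch view is always rebuilt from the complete master dataset, so a bug in the view logic can be fixed by simply running the job again. That is the main reason for keeping the raw data immutable.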


Speed Layer: The most exciting layer. Online data is collected, filtered and run through some simple calculations, and the results are used to prepare real-time views. For example, you can read your web site's logs, and if a customer buys one of your products you can recommend another one immediately.

In this layer, you first need to get the online data. If you want to listen to logs, you can use Logstash, or you can find another open source technology. After that you have to move your data to a real-time computation system (Storm is great). To move your data to a real-time computation system like Storm, Kafka is my recommendation.

After that, you can do your computation in Storm, prepare your real-time view and send it to a NoSQL database like HBase, MongoDB or Cassandra.

Therefore the Speed Layer can look like this:

Logstash -> Kafka -> Storm -> Cassandra(HBase)
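
The flow of this pipeline can be sketched in plain Python without any of the real tools. Everything here is an assumption for illustration: the EVENTS list stands in for the log events that would arrive through Logstash and Kafka, only_purchases plays the role of the filtering step, and the hypothetical RealtimeView class plays the role of the view that Storm would keep up to date in a NoSQL database.

```python
from collections import defaultdict
from typing import Iterable, Iterator

# Hypothetical log events, standing in for what would arrive from the queue.
EVENTS = [
    {"type": "purchase", "user": "alice", "product": "laptop"},
    {"type": "page_view", "user": "bob", "product": "laptop"},
    {"type": "purchase", "user": "bob", "product": "mouse"},
    {"type": "purchase", "user": "carol", "product": "laptop"},
]


def only_purchases(events: Iterable[dict]) -> Iterator[dict]:
    """Filtering step: drop everything that is not a purchase."""
    return (event for event in events if event["type"] == "purchase")


class RealtimeView:
    """Purchase count per product, updated one event at a time."""

    def __init__(self) -> None:
        self.counts: dict[str, int] = defaultdict(int)

    def update(self, event: dict) -> None:
        # Incremental update: touch only the key this event affects.
        self.counts[event["product"]] += 1


view = RealtimeView()
for event in only_purchases(EVENTS):
    view.update(event)

realtime_view = dict(view.counts)
print(realtime_view)
```

The important difference from the batch layer is that the view is updated incrementally, event by event, instead of being recomputed from all the data. That is what keeps the latency low.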

Serving Layer: Ad-hoc query time. I think this is the best layer. You have prepared your batch views and real-time views, so you have analytical and transactional data in the same place, for example in a NoSQL database.

In this architecture, HBase is the most recommended one, because it lives in the Hadoop cluster and can handle terabytes of data easily. But HBase is not good for preparing analytical reports. You should not forget that.



Therefore we can join the views from the other two layers and serve them as ad-hoc queries. Before that, please do not forget to index the views, so that you can decrease the latency.
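
Joining the two views can be as simple as the following Python sketch. It rests on one assumption that you should check in your own design: the real-time view only contains events that arrived after the last batch run, so adding the two counts does not count anything twice. The names merge_views and query and the sample numbers are hypothetical, and the dictionaries stand in for indexed tables in a NoSQL database.

```python
def merge_views(batch_view: dict, realtime_view: dict) -> dict:
    """Combine per-key counts from the batch view and the real-time view."""
    merged = dict(batch_view)
    for key, count in realtime_view.items():
        merged[key] = merged.get(key, 0) + count
    return merged


def query(key: str, batch_view: dict, realtime_view: dict) -> int:
    """Answer one ad-hoc query by key lookup in both views."""
    # A dictionary lookup plays the role of an indexed read here.
    return batch_view.get(key, 0) + realtime_view.get(key, 0)


# Made-up sample data: counts up to the last batch run, and counts since then.
batch_view = {"laptop": 120, "mouse": 45}
realtime_view = {"laptop": 2, "keyboard": 1}

merged = merge_views(batch_view, realtime_view)
print(merged)
print(query("laptop", batch_view, realtime_view))
```

When the next batch run finishes, its view already covers the events that were in the real-time view, so those real-time entries can be thrown away.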

I hope you now understand Lambda Architecture and that this article gives you some ideas for your next Lambda Architecture.

Best Regards,


OD