Today, let's think about Lambda Architecture. In this article I would like to explain
the subject in a simple way, and I will try to give you some examples of Lambda
Architecture.
First, we should understand what Lambda Architecture is.
In the Big Data world, we need to deal with two basic kinds of processing: batch and online.
To bring both of them into one design, we need an architecture.
Rules, classifications and methods grow up around these operations, and in the end
we arrive at the final design of the architecture. A lot of effort has gone into
reaching this point.
Actually, I don't want to tell you the history of this architecture. What you should
take away is that Lambda Architecture makes your Big Data environment easier to
understand, because it gives you structure and rules.
As I mentioned before, two kinds of processing are fundamental in your environment: batch and online.
On top of them we can add a serving layer.
Now we have three processes, so we need to talk about three layers: the Batch Layer,
the Speed Layer (online) and the Serving Layer.
Batch Layer: If you have to handle large quantities of data, you need to produce
results, reports and so on from this big collection, and you have no tight time
restrictions, you are in the right layer. In the batch layer you need a cluster that
can hold all the data you will work with (the master dataset), and you are
responsible for managing this data at this layer.
In this layer you can put your Big Data cluster, for example a Hadoop cluster, and you
can use a Hadoop distribution such as Cloudera or Hortonworks.
You can run batch MapReduce jobs, Spark Core applications, Hive queries, Pig scripts
and so on to pre-compute your batch views. (A batch view is the result of running a
function over the whole master dataset, stored in a form that is quick to query. You
can think of it as making the raw data meaningful.)
So, this layer has 2 major functions:
1 - Managing the master dataset (an immutable, append-only set of raw data)
2 - Pre-computing the batch views
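To make these two functions a bit more concrete, here is a small Python sketch. It is only an illustration in plain Python, not Hadoop code: the `PageView` schema, the `MasterDataset` class and the `compute_batch_view` function are names I made up for this example, and a real batch layer would keep the data in HDFS and run the computation as a MapReduce, Spark or Hive job.

```python
from collections import Counter
from typing import Dict, List, NamedTuple, Tuple


class PageView(NamedTuple):
    """One raw event in the master dataset (hypothetical schema)."""
    user_id: str
    url: str
    timestamp: int


class MasterDataset:
    """Append-only store: records are added, never updated or deleted."""

    def __init__(self) -> None:
        self._records: List[PageView] = []

    def append(self, record: PageView) -> None:
        self._records.append(record)

    def scan(self) -> Tuple[PageView, ...]:
        # Hand out an immutable snapshot so callers cannot alter the raw data.
        return tuple(self._records)


def compute_batch_view(master: MasterDataset) -> Dict[str, int]:
    """Recompute the view from scratch over all raw data: page views per URL."""
    return dict(Counter(record.url for record in master.scan()))


master = MasterDataset()
master.append(PageView("alice", "/home", 1))
master.append(PageView("bob", "/home", 2))
master.append(PageView("alice", "/cart", 3))

batch_view = compute_batch_view(master)
print(batch_view)
```

The important point is that the batch view is always rebuilt from the complete raw data. If the view logic has a bug, you fix the function and recompute, because the master dataset itself was never changed.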
Speed Layer: The most exciting layer. Online data is collected, filtered and run
through some simple calculations, and the result is used to prepare real-time views.
For example, you can read your web site's logs, and if a customer buys one of your
products you can recommend another one immediately.
In this layer, you first need to get the online data. If you want to listen to logs,
you can use Logstash or another open source technology. After that you have to move
your data to a real-time computation system (Storm is great). To move the data there, Kafka is my recommendation.
After that, in Storm, you can do your computation, prepare your real-time view and
send the view to a NoSQL database like HBase, MongoDB or Cassandra.
Therefore the Speed Layer can look like this:
Logstash -> Kafka -> Storm -> Cassandra (or HBase)
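The pipeline above can be sketched in plain Python to show what each stage is responsible for. This is a toy model under my own assumptions and uses none of the real client libraries: a `deque` stands in for the Kafka topic, a `defaultdict` stands in for the NoSQL table, and the JSON log format with `action` and `product` fields is invented for the example.

```python
import json
from collections import defaultdict, deque
from typing import Deque, Dict, Optional


def parse_log_line(line: str) -> Optional[dict]:
    """Collector step (the role Logstash plays): raw log line -> event."""
    try:
        event = json.loads(line)
    except json.JSONDecodeError:
        return None  # drop malformed lines
    return event if isinstance(event, dict) else None


def process_stream(queue: Deque[str], realtime_view: Dict[str, int]) -> None:
    """Processing step (the role Storm plays): filter, then update the view."""
    while queue:
        event = parse_log_line(queue.popleft())
        if event is None or event.get("action") != "purchase":
            continue  # keep only purchase events
        product = event.get("product")
        if product is None:
            continue
        # Incremental update: only the new event is touched, nothing is recomputed.
        realtime_view[product] += 1


# The deque stands in for a Kafka topic, the dict for a Cassandra/HBase table.
queue: Deque[str] = deque([
    '{"action": "purchase", "product": "book"}',
    '{"action": "view", "product": "book"}',
    'this line is not JSON',
    '{"action": "purchase", "product": "pen"}',
    '{"action": "purchase", "product": "book"}',
])
realtime_view: Dict[str, int] = defaultdict(int)

process_stream(queue, realtime_view)
print(dict(realtime_view))
```

Notice the difference from the batch layer: here the view is updated event by event instead of being rebuilt from all the data, which is what keeps the latency low.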
Serving Layer: Ad-hoc query time. I think this is the best layer. You have prepared
your batch views and real-time views, so you have analytical and transactional
data in the same place, for example in a NoSQL database.
In this architecture HBase is the one I see recommended most, because it runs inside
the Hadoop cluster and can handle terabytes of data easily. But HBase is not good
for preparing analytical reports. You should not forget that.
Therefore we can join the views from the other two layers and serve them through
ad-hoc queries. Before that, please do not forget to index the views, so you can
decrease the latency.
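Here is a minimal sketch of that join in Python. I am assuming that both views hold purchase counts per product, that the real-time view only contains events the batch view has not absorbed yet (so the two can simply be added), and that a dictionary keyed by product is a fair stand-in for an indexed table. The numbers and the function names `query_purchases` and `top_products` are made up for the example.

```python
from typing import Dict, List, Tuple

# Views keyed by product; the key lookup plays the role of the index.
batch_view: Dict[str, int] = {"book": 120, "pen": 45}   # from the batch layer
realtime_view: Dict[str, int] = {"book": 2, "lamp": 1}  # from the speed layer


def query_purchases(product: str,
                    batch: Dict[str, int],
                    realtime: Dict[str, int]) -> int:
    """Answer a point query by merging both views at read time."""
    return batch.get(product, 0) + realtime.get(product, 0)


def top_products(batch: Dict[str, int],
                 realtime: Dict[str, int],
                 limit: int) -> List[Tuple[str, int]]:
    """Ad-hoc query: best-selling products across both views."""
    products = set(batch) | set(realtime)
    merged = {p: query_purchases(p, batch, realtime) for p in products}
    # Sort by count (descending), then by name so ties are deterministic.
    return sorted(merged.items(), key=lambda item: (-item[1], item[0]))[:limit]


print(query_purchases("book", batch_view, realtime_view))
print(top_products(batch_view, realtime_view, 2))
```

A simple sum works here only because counts are additive. For other kinds of views (averages, distinct counts and so on) the merge step needs more care.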
I hope you now understand Lambda Architecture and that this article gives you an idea for your next Lambda Architecture design.
Best Regards,
OD