Art of Big Data

Thursday, January 28, 2016

Lambda Architecture

Today let's think about Lambda Architecture. In this article I would like to explain the subject in a simple way, and I will try to give you some examples along the way.

First we should understand what Lambda Architecture is.

In the Big Data world, we need to deal with two basic kinds of processing: batch and online. To bring both into one design, we need an architecture. Rules, classifications and methods come along with these operations, and together they lead to the final design. A lot of effort has gone into reaching this point.

Actually, I don't want to tell you the history of this architecture. Put simply, Lambda Architecture is a way to make your Big Data environment more understandable, because it gives you structure, rules and so on.

As I mentioned before, two operations are fundamental in your environment: batch and online.

And we can add a Serving layer to them.

Now we have three processes, so we need to talk about three layers: Batch Layer, Speed Layer (online) and Serving Layer.

Batch Layer: If you have to handle large quantities of data, you need to produce results, reports and so on from this big collection, and you don't have tight time restrictions, you are in the right layer. In the batch layer you need a good cluster to hold the master dataset, that is, all the data you will work with, and you are responsible for managing this data at this layer.

In this layer you can put your Big Data cluster, for example a Hadoop cluster, and you can use a Hadoop distribution such as Cloudera or Hortonworks.

You can run your batch MapReduce jobs, Spark Core applications, Hive queries, Pig scripts and so on to pre-compute your batch views. (Batch views are the pre-computed results of arbitrary query functions over the master dataset. You can think of them as making the data meaningful.)

So, this layer has 2 major functions:

1 - Managing the master dataset (an immutable, append-only set of raw data)

2 - Pre-computing the batch views
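The two functions above can be sketched in a few lines of plain Python. This is only a minimal illustration, not a real Hadoop job: the `PageView` schema, the `MasterDataset` class and `compute_batch_view` are hypothetical names I made up for the example, and an in-memory list stands in for the cluster.

```python
from collections import Counter
from typing import NamedTuple


class PageView(NamedTuple):
    """One immutable raw event in the master dataset (hypothetical schema)."""
    user: str
    url: str


class MasterDataset:
    """Append-only store: records can be added, never updated or deleted."""

    def __init__(self):
        self._records = []

    def append(self, record):
        self._records.append(record)

    def scan(self):
        # Return a tuple so callers cannot mutate the stored data.
        return tuple(self._records)


def compute_batch_view(master):
    """Recompute the whole view from scratch: number of views per URL."""
    return dict(Counter(record.url for record in master.scan()))


master = MasterDataset()
for event in [PageView("ann", "/home"), PageView("bob", "/home"), PageView("ann", "/cart")]:
    master.append(event)

batch_view = compute_batch_view(master)
print(batch_view)  # {'/home': 2, '/cart': 1}
```

Note that the view is always rebuilt from all the raw data. That is slow, but it means a bug in the view logic can be fixed by simply running the job again.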

Speed Layer: The most exciting layer. Online data is collected, filtered and run through some simple calculations, and the results are used to prepare real-time views. For example, you can read your web site's logs, and if a customer buys one of your products you can recommend another one immediately.

In this layer, you first need to get the online data. If you would like to listen to logs, you can use Logstash or another open source technology. After that you have to move your data to a real-time computation system (Storm is great). To move your data into a system like Storm, Kafka is my recommendation.

Then in Storm you can do your computation, prepare your real-time view and send it to a NoSQL database like HBase, MongoDB or Cassandra.

Therefore the Speed Layer can look like this:

Logstash -> Kafka -> Storm -> Cassandra(HBase)
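To make the flow concrete, here is a rough sketch of the same pipeline in plain Python. It does not use Logstash, Kafka, Storm or Cassandra at all: a list of made-up log lines stands in for the log source, a generator stands in for the message queue, and a dictionary stands in for the NoSQL table. The log format and the function names are assumptions for the example.

```python
from collections import defaultdict

# Hypothetical raw log lines, standing in for what a log shipper would send.
LOG_LINES = [
    "ann VIEW /home",
    "ann BUY laptop",
    "bob VIEW /cart",
    "bob BUY mouse",
    "ann BUY mouse",
]


def parse(lines):
    """Split each line into (user, action, item), one event at a time."""
    for line in lines:
        user, action, item = line.split()
        yield user, action, item


def update_realtime_view(events, view=None):
    """Filter purchase events and increment a per-product counter."""
    view = defaultdict(int) if view is None else view
    for user, action, item in events:
        if action == "BUY":   # filtering step
            view[item] += 1   # simple incremental calculation
    return view


realtime_view = update_realtime_view(parse(LOG_LINES))
print(dict(realtime_view))  # {'laptop': 1, 'mouse': 2}
```

Unlike the batch layer, the view here is updated event by event and never recomputed from the full history, which is what keeps the latency low.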

Serving Layer: Ad-hoc query time. I think this is the best layer. You have prepared your batch views and real-time views, so you have analytical and transactional data in the same place, for example in a NoSQL database.

In this architecture HBase is the most recommended one, because it lives in the Hadoop cluster and it can handle terabytes of data easily. But HBase is not good for preparing analytical reports. You should not forget that.

Therefore we can join the views from the other two layers and serve them to ad-hoc queries. Before that, please do not forget to index the views, so you can decrease the latency.
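As a small sketch of that join, assume both views are per-product purchase counts. The numbers, `merge_views` and `query` are invented for the example, and a Python dictionary plays the role of the indexed view, since a lookup by key does not scan the data.

```python
def merge_views(batch_view, realtime_view):
    """Combine old (batch) and recent (real-time) counts into one view."""
    merged = dict(batch_view)
    for key, count in realtime_view.items():
        merged[key] = merged.get(key, 0) + count
    return merged


# Hypothetical contents of the two views.
batch_view = {"laptop": 120, "mouse": 340}   # computed by last night's batch job
realtime_view = {"mouse": 2, "keyboard": 1}  # events since that job ran

serving_view = merge_views(batch_view, realtime_view)


def query(product):
    """Ad-hoc lookup; the dict key works like an index on the product name."""
    return serving_view.get(product, 0)


print(query("mouse"))     # 342
print(query("keyboard"))  # 1
print(query("tablet"))    # 0
```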

I hope you now understand the Lambda Architecture and that this article gives you some ideas for your next one.

Best Regards,

OD

Thursday, December 31, 2015

What is CAP Theorem?

In this blog, you will find the answer to this kind of question.

First of all, knowing a subject theoretically makes it easier to gain experience with new technologies.

Then you can easily understand why a technology behaves the way it does.

Today's subject is the CAP Theorem.

The CAP Theorem is related to database systems. In the Big Data world, databases have a very important place. They can be used as both source and target: you can write your data to a database, or you can feed your Big Data architecture from the data stored in a database.

There are many kinds of databases. I will not cover them today. I will cover how databases behave when they run and how the CAP Theorem can be used to classify that behaviour.

As you can see, this theorem is all about database behaviour.

First of all, please make sure you know what a distributed system is. :) Today is only about the CAP Theorem.

The CAP Theorem was actually formulated for distributed systems. There are 3 main points:

C for Consistency

A for Availability

P for Partition Tolerance

What are these terms?

Before explaining them, let's assume we have a distributed system with 3 nodes, and we have installed some kind of NoSQL database on them (MongoDB, Cassandra, HBase etc.).

Consistency: When you send a request to your new distributed NoSQL database, you should get the same response no matter which node you send it to.

For example: an update arrives at node1, and right after that update node1 suddenly dies.

At the same time you send a request to node2.

What happens now?

If your system is consistent, you get the updated response; if not, you get the older version.

It is as simple as that.
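The node1 and node2 story can be played out in a toy Python model. This is only a sketch under my own assumptions: `Node`, `write_sync` and `write_async` are hypothetical names, and "consistent" is modelled as copying the update to the replica before the write is acknowledged.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.data = {}
        self.alive = True


def write_sync(primary, replicas, key, value):
    """Acknowledge only after every replica has the new value."""
    primary.data[key] = value
    for replica in replicas:
        replica.data[key] = value


def write_async(primary, pending, key, value):
    """Acknowledge at once; replication is queued for later."""
    primary.data[key] = value
    pending.append((key, value))


# Consistent setup: the update reaches node2 before node1 dies.
node1, node2 = Node("node1"), Node("node2")
node1.data["price"] = node2.data["price"] = 10
write_sync(node1, [node2], "price", 12)
node1.alive = False
consistent_read = node2.data["price"]

# Non-consistent setup: node1 dies before the queued update is shipped.
node1, node2 = Node("node1"), Node("node2")
node1.data["price"] = node2.data["price"] = 10
pending = []
write_async(node1, pending, "price", 12)
node1.alive = False
stale_read = node2.data["price"]

print(consistent_read, stale_read)  # 12 10
```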

Who is Consistent?

Among NoSQL databases, MongoDB, HBase, MemcacheDB and Redis are popular examples usually classified as consistent.

Availability: This is the one I like most. One way or another, every request has to get a response, because the system as a whole is available. The nodes can hold replicas of the data, or there can be a coordinating system (like ZooKeeper, details later).

For example: this time you send a request to the system through node2, and then node2 dies.

If your system has Availability, you still get a result in the end. That is guaranteed.

Let's change the situation and make it more difficult. You update data on node1, and node1 goes down at almost the same time. Your system has Availability, OK? So you get a response for sure.

Is it the updated data?

If your system is not consistent, unfortunately you get the older version of the data.

But if it is consistent, you get the correct, updated response.

Therefore Availability is great but by itself it is not enough :(

Who has Availability?

Among NoSQL databases, Cassandra, CouchDB and Riak are popular examples usually classified as available.

Partition Tolerance: This property is very useful too. It means the system keeps working even if some of the nodes are down or cannot reach each other over the network. Actually, we have already met this property in the examples and questions above, because our nodes died but we still got a response. This property is a kind of prerequisite for being a distributed system.

So we can clearly see that these 3 properties have to work together. But unfortunately it is impossible to guarantee all 3 at the same time. The reason is Partition Tolerance.

If your system is distributed, it can be Available and Partition Tolerant (AP) or Consistent and Partition Tolerant (CP); if it is not distributed, it can be Consistent and Available (CA). Relational databases are the best examples of the CA structure.

Why do we have to choose 2 of these 3?

In a distributed system, when there is an inevitable network partition (and the cluster breaks into two or more “islands”), you can’t guarantee both Availability (for updates) and Consistency.
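Here is a toy simulation of that choice, written as a sketch only. The `Cluster` class and its `mode` switch are my own invention and do not model any real database: in CP mode a write during a partition is rejected, and in AP mode it is accepted on one island so the replicas disagree.

```python
class Replica:
    def __init__(self, name):
        self.name = name
        self.value = "v1"


class Cluster:
    """Two replicas that normally copy every write to each other."""

    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.partitioned = False
        self.node1 = Replica("node1")
        self.node2 = Replica("node2")

    def write(self, node, value):
        other = self.node2 if node is self.node1 else self.node1
        if self.partitioned:
            if self.mode == "CP":
                # Give up availability: refuse rather than let replicas diverge.
                raise RuntimeError("unavailable: cannot reach the other replica")
            # Give up consistency: accept locally, replicas now disagree.
            node.value = value
            return
        node.value = value
        other.value = value

    def read(self, node):
        return node.value


cp = Cluster("CP")
cp.partitioned = True
try:
    cp.write(cp.node1, "v2")
    cp_result = "accepted"
except RuntimeError:
    cp_result = "rejected"

ap = Cluster("AP")
ap.partitioned = True
ap.write(ap.node1, "v2")

print(cp_result, cp.read(cp.node1), cp.read(cp.node2))  # rejected v1 v1
print(ap.read(ap.node1), ap.read(ap.node2))             # v2 v1
```

The CP cluster stays consistent but says no to the update. The AP cluster always answers, but a reader on node2 sees the old value until the partition heals.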

Finally we reach this famous triangle picture.

It is a very useful and great summary of the CAP Theorem:

I believe we all understood the CAP Theorem.

Wait for the next Big Data Topics,

Best Regards,

Tuesday, December 22, 2015

Intro

Welcome to Art of Big Data

First of all, this blog is designed only for Big Data and Big Data related topics.
The goal is to make this blog a kind of dictionary for Big Data.

In particular, I will share my experiences, the subjects I study and my research topics on this blog. If you want to become a certified Big Data expert, this blog is designed for you.

If you want to find the answer to the question "What is Big Data?", please google it.
Here you can find related topics and lessons, examples, use cases and the architectural side of the Big Data concept.

Every week at least one new subject will appear on this blog.

I hope this blog will be useful for all of us and that Art of Big Data will become a significant resource for experts all around the world.

Best Regards,

OD