Apache Hadoop Foundation Course - Bigdata and Hadoop


Big data is now a dictionary term, and I guess everyone understands it. However, if you ask Google, it gives you the following answer.

Fig. 1 - What is big data?

One thing is evident from the Google response: big data is a large data set. But the question is, how large?

How big is big data?

Do you call a 100 GB data set Big Data? Or should it be in terabytes or petabytes to be called Big Data? This confusion existed for some time until the industry agreed on an acceptable definition.
Gartner gives the most widely accepted definition of big data:
"Big data is high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation."
If you quickly analyse the definition, you will see three characteristics.

  1. Volume
  2. Velocity
  3. Variety

Some people call these the 3Vs of big data.
So, coming back to our question: do we call a 100 GB data set Big Data?
In this case, we already know that the volume is 100 GB. That doesn't appear to be too big. To call it big data, we need details about the other two Vs: velocity and variety. If we realize that we are receiving data at a rate of 100 GB per minute, and there is a need to store or process it at the same speed, I would want to call it big data.
You may still ask: why would you call it big data?
Ok, let's look at the other parts of the definition. If you don't have a clear, cost-effective solution for the problem created by a combination of the 3Vs, and you need to innovate to solve that problem, believe me, you have a big data problem.
So, Big Data is a problem characterized by the 3Vs. To address it cost-effectively, you need innovative thinking and innovative tools and techniques. If your situation fits this definition, you have a big data problem.


How did Big Data start?

The Big Data problem started more than a decade ago. People may debate this, but I believe it began with the growth and popularity of the World Wide Web. Search engine companies like Google and Yahoo were the first to recognize it, but soon many Internet-scale companies like Amazon and Facebook started facing the problem as well. They were in the front row because they had to deal with internet-scale volume, velocity, and variety.
In today's world, data has become synonymous with oil and electricity. Organizations run on data. Every business has begun to depend on data to derive insights and use them in decision making. Now, they are moving to the next level and have started using data to automate systems, processes, and even decisions.
So, if you haven't already started learning and becoming comfortable with a whole new fleet of tools and technologies to deal with the Big Data problem, you are losing opportunities, just like those organizations that still haven't recognized their Big Data problem.

History of Big Data and Hadoop

Hadoop is one of the toolsets that enable us to deal with Big Data. But Hadoop is not the only one. Innovations are happening all over the field, and you will find specialized tools for specific problems. Still, Hadoop is one of the most successful and widely adopted tools in this space.
As I said, search engine companies were the first to realize the Big Data problem, and Google was the first to solve the puzzle. They disclosed the first part of the solution in 2003 by publishing the paper "The Google File System." That paper presented a distributed file system for writing and reading large volumes of data efficiently. The second part of the solution came in 2004 with another paper, "MapReduce: Simplified Data Processing on Large Clusters." That paper presented a framework for processing files stored on the Google File System.
I am not sure how many people realized the importance of those two Google research papers, but Doug Cutting did. He recognized their importance because he was working on a project, the open-source web search engine Nutch, and was facing the Big Data problem himself. So, he decided to implement the ideas described in those two Google papers, and he built them successfully to a demonstrable extent.
Later, in 2006, he was hired by Yahoo and given a dedicated and talented team to create a distributed computing platform. He named it Hadoop, and Yahoo decided to open-source it. After several iterations, Hadoop 1.0.0 came out in December 2011.
So, initially, Hadoop had just two core components.

  1. A distributed file system – the Hadoop Distributed File System (HDFS).
  2. A distributed programming framework – MapReduce.

Both components are implementations of the ideas in those two Google papers.
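To give you an early feel for the MapReduce programming model, below is a minimal word-count sketch written against the Hadoop MapReduce Java API. It is essentially the canonical example from Hadoop's own documentation, shown here only for illustration; the input and output paths are placeholders that you supply at run time. The mapper emits a (word, 1) pair for every word in its input split, and the reducer sums the counts for each word.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every word in the input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts collected for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // optional local aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // args[0] and args[1] are HDFS input/output paths supplied at run time.
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

You would package this class into a jar and run it with something like hadoop jar wordcount.jar WordCount /input /output, where /input and /output are HDFS paths. Don't worry about the details yet; we will cover HDFS and MapReduce thoroughly in the coming sessions.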
The open-source community released Hadoop 2.0 in May 2012. With this version, they added another core component called YARN (Yet Another Resource Negotiator), a cluster resource management layer.
As we stand today, Hadoop 3.0.0-alpha2 came out in January 2017. However, no new core component was introduced in that release.
In upcoming videos, we will learn about these three main parts. We will cover Hadoop 2.x and the relevant features introduced in Hadoop 3.x.

What is the Hadoop ecosystem?

Now we know about the Hadoop core components. But in today's world, when people refer to Hadoop, they don't just mean these three essential elements. They include a set of tools that work on top of, or around, the Hadoop core. They call it the Hadoop ecosystem. There is no precise definition of what falls under the Hadoop ecosystem and what stands outside it. Without going into that debate, I want to list some names that are widely considered part of the Hadoop ecosystem.

  1. Hive
  2. Pig
  3. HBase
  4. Spark
  5. Sqoop
  6. Flume
  7. Kafka
  8. Oozie

There are many more names, but I have listed only those that evolved alongside Hadoop. This list is meant to give you a sense of the Hadoop ecosystem; you don't need to learn all of them. Many of them could not win wide adoption. Hive, Spark, and Kafka are the three most widely adopted components on this list.
This tutorial will build a sound foundation for the Hadoop core. We will cover individual components separately. We already have a separate Apache Kafka Tutorial and Apache Spark Tutorial, and we will create similar tutorials for every other tool. Our focus will primarily be the perspective of developers and application architects.
So please subscribe to our YouTube channel and stay tuned.
Thank you very much. Keep learning and keep growing.

