Chapter 1 - Dawn of Big Data
The internet and the world wide web have had a revolutionary impact on our life, culture, commerce, and technology. The search engine has become a necessity in our daily life. However, until the year 1993, the world wide web was indexed by hand: Sir Tim Berners-Lee used to edit a list of web servers by hand and host it on the CERN web server. Later, Yahoo emerged as the first popular method for people to find web pages on the internet. However, it was based on a web directory rather than a full-text index of the web pages. Around the year 2000, a new player, Google, rose to prominence on the strength of an innovative idea: the PageRank algorithm. PageRank scored a web page based on the number and the authority of the other websites linking to it. One of the critical requirements of implementing the PageRank algorithm was to crawl and collect a massive amount of data from the world wide web. It is this moment, and this need, that I would attribute to the birth of Big Data.
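To make the idea concrete, here is a minimal sketch of the PageRank iteration. This is an illustration of the published algorithm's core idea, not Google's implementation: it uses a uniform damping factor, ignores dangling pages, and all names in it are mine.

```python
# Minimal PageRank sketch (illustrative; not Google's implementation).
# `links` maps each page to the list of pages it links out to.
def pagerank(links, damping=0.85, iterations=50):
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}          # start with uniform rank
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if not outlinks:
                continue                        # simplification: skip dangling pages
            share = rank[page] / len(outlinks)  # split rank among outlinks
            for target in outlinks:
                new_rank[target] += damping * share
        rank = new_rank
    return rank

# A toy three-page web: pages with more inbound links score higher.
ranks = pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
```

In this toy graph, page "c" ends up with the highest rank because both "a" and "b" link to it, which is exactly the "linked-to by many authoritative pages" intuition described above.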
Inception
Google was not only the first to recognize the Big Data problem, but also the first to develop a viable solution and build a commercially successful business around it. At a high level, Google had four main problems to solve.
- Discover and crawl web pages across the world wide web and collect information about those pages.
- Store the massive amount of data collected by the web crawler.
- Apply the PageRank algorithm to the collected data and create an index for the search engine.
- Organize the index in a random-access database that can support high-speed queries from the Google search engine application.
These four problems are at the core of any data-centric application, and they translate into four categories of the problem domain.
- Data Ingestion/Integration
- Data storage and management
- Data processing and transformation
- Data access and retrieval
These problems were not new; in general, they existed even before the year 2000, when Google set out to solve them with a new and innovative approach. The industry was already generating, collecting, processing, and accessing data, and database management systems (DBMS) like Oracle and MS SQL Server were at the centre of such applications. At the time, some organizations were already managing terabytes of data using systems like Teradata. However, the problem that Google was trying to solve combined three unique attributes that made it much harder to address with the DBMS of the time. Let's try to understand these attributes, which uniquely characterize the Big Data problem.
- Volume
- Variety
- Velocity
Volume
Creating a web crawler was not a big deal for Google. It was a simple program that took a page URL, retrieved the content of the page, and stored it in the storage system. The main problem in this process was the amount of data the crawler collected. The web crawler was supposed to read all the web pages on the internet and store a copy of those pages in the storage system. Google knew that they would be dealing with a gigantic volume of data, and no DBMS at the time could scale to manage that quantity.
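The "simple program" part can be sketched in a few lines. The following is a toy breadth-first crawler, purely for illustration: the `fetch` and `extract_links` functions are injected so that it runs against a fake in-memory web, and a real crawler would use an HTTP client, parse HTML, respect robots.txt, and handle failures and politeness delays.

```python
from collections import deque
from urllib.parse import urljoin

# Minimal breadth-first crawler sketch (illustrative only).
def crawl(seed_url, fetch, extract_links, max_pages=100):
    store = {}                       # url -> page content (the "storage system")
    queue = deque([seed_url])
    while queue and len(store) < max_pages:
        url = queue.popleft()
        if url in store:
            continue                 # skip pages we have already crawled
        store[url] = fetch(url)
        for link in extract_links(store[url]):
            queue.append(urljoin(url, link))
    return store

# A tiny fake web standing in for real HTTP fetches.
fake_web = {
    "http://example.com/":  "links: /a /b",
    "http://example.com/a": "links: /b",
    "http://example.com/b": "links:",
}
pages = crawl(
    "http://example.com/",
    fetch=fake_web.__getitem__,
    extract_links=lambda content: content.split()[1:],
)
# `pages` now holds a copy of all three reachable pages.
```

Even this toy makes the volume problem visible: the `store` dictionary stands in for a storage system that, at web scale, must hold a copy of every page crawled.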
Variety
The content of the web pages had no structure. It was not possible to convert them into a row/column format without first storing and processing them. Google knew that they needed a system that could deal with raw data files arriving in a variety of formats. The DBMS of the time had no meaningful support for raw data.
Velocity
Velocity is one of the most critical and essential characteristics of any usable
system. Google needed to acquire, process, and use data quickly. The speed of
the Google search engine has been its USP ever since Google entered the search
engine business.
It is this speed that is driving the new age of Big Data applications that must
deal with data in real time. In some cases, the data points are highly
time-sensitive, like a patient's vitals. Such data elements have a limited
shelf life: their value diminishes with time, in some cases very quickly. I
will come back to velocity later, because velocity is the driving force of a
real-time system, the main subject of this book.
Explication
Google successfully solved all of the above problems, and they were generous
enough to reveal their solutions to the rest of the world in a series of three
whitepapers. These whitepapers described Google's approach to data storage,
data processing, and data retrieval.
The first whitepaper, published by Google in 2003, elucidated the architecture
of a scalable, distributed filesystem, which they called the Google File
System (GFS). GFS solved their storage problem by leveraging commodity
hardware to store data on a distributed cluster.
The second whitepaper, published by Google in 2004, revealed the MapReduce
(MR) programming model. The MR model adopted a functional programming style,
and Google was able to parallelize the MR functions on a large cluster of
commodity machines. The MR model solved their problem of processing large
quantities of data on a distributed cluster.
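The shape of the MR model can be illustrated with the classic word-count example. This is a local, single-process simulation of the map, shuffle, and reduce phases; the real framework runs the map and reduce functions in parallel across a cluster and performs the shuffle over the network.

```python
from collections import defaultdict

# Word count in the MapReduce style (local simulation, for illustration).
def map_phase(document):
    for word in document.split():
        yield (word, 1)                 # emit (word, 1) per occurrence

def shuffle(pairs):
    groups = defaultdict(list)          # group all values by key
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    return (key, sum(values))           # aggregate the counts per word

def mapreduce(documents):
    pairs = [kv for doc in documents for kv in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

counts = mapreduce(["a b a", "b c"])
# counts == {"a": 2, "b": 2, "c": 1}
```

Because `map_phase` and `reduce_phase` are pure functions with no shared state, the framework is free to run many copies of them on different machines over different slices of the data, which is exactly what made the model parallelizable on commodity clusters.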
The last paper in the series, published by Google in 2006, described the
design of a distributed database for managing structured data across
thousands of commodity servers. They named it Google Bigtable and used it for
storing petabytes of data and serving it at low latency.
All three whitepapers were well received by the open source community, and
they formed the basis for the design and development of a similar open source
implementation: Hadoop. GFS was implemented as the Hadoop Distributed File
System (HDFS). Google MR was implemented as the Hadoop MapReduce programming
framework on top of HDFS. Finally, Bigtable was implemented as the Hadoop
database, HBase.
Since then, many other solutions have been developed on the Hadoop platform
and open sourced by various organizations. Some of the most widely adopted are
Pig, Hive, and, finally, Apache Spark. All of these solutions are
simplifications of and improvements over the Hadoop MapReduce framework, and
they share the common trait of processing large volumes of data on a
distributed cluster of commodity hardware.
Expansion
Hadoop and the other solutions developed over HDFS set out to solve a common
problem: processing large volumes of data that have previously been stored in
distributed storage. This approach is popularly termed batch processing. In
the batch processing approach, data is collected and stored in the
distributed system. A batch job, typically written in MapReduce, then reads
the data and processes it using the distributed cluster. While you are
processing the previously stored data, new data keeps arriving in storage.
You must then run the next batch and take care of combining its results with
the outcome of the previous batch. This process goes on in a series of
batches.
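The batch-and-merge cycle just described can be sketched in a few lines. This is a schematic illustration, not any particular framework's API: each batch processes only the records that arrived since the last run, and its partial result is merged into the running totals from all previous batches.

```python
# Sketch of the batch-merge pattern (illustrative only).
def process_batch(records):
    counts = {}
    for key in records:                 # one batch job: count occurrences
        counts[key] = counts.get(key, 0) + 1
    return counts

def merge(totals, batch_result):
    for key, count in batch_result.items():
        totals[key] = totals.get(key, 0) + count   # fold batch into totals
    return totals

totals = {}
for batch in [["a", "b"], ["b", "c"], ["a"]]:      # batches arriving over time
    totals = merge(totals, process_batch(batch))
# totals == {"a": 2, "b": 2, "c": 1}
```

Note that the combined result is only correct once the final merge has run; until then, `totals` lags the incoming data by at least one batch interval, which is exactly the latency problem discussed next.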
In the batch processing approach, the outcome becomes available only after a
delay that depends on the frequency of your batches and the time each batch
takes to complete. The insights derived from these processing batches are
valuable. However, not all insights are equal. Some insights have a much
higher value shortly after the data is generated, and that value diminishes
very fast with time.
Many use cases need to collect data about events in real time as they occur
and extract actionable insights as quickly as possible. For example, a fraud
detection system is much more valuable if it can identify a fraudulent
transaction even before the transaction completes. Similarly, in a healthcare
ICU or a surgical operation setup, the data from various monitors can be used
in real time to generate alerts for nurses and doctors, so they know
instantly about changes in a patient's condition.
In many cases, the data points are highly time-sensitive, and they need to be
actioned in minutes, seconds, or even milliseconds. Enormous amounts of
time-sensitive data are being generated by a rapidly growing set of disparate
data sources. We will discuss some prevalent use cases and data sources for
such requirements in the next chapter. However, it is essential to understand
that the demand for speed in processing time-sensitive data is continuously
pushing the limits of Big Data processing solutions. These requirements are
the source of many new and innovative frameworks such as Kafka Streams, Spark
Streaming, Apache Storm, Apache Flink, and Apache Samza, and of cloud
services like Google Cloud Dataflow and Amazon Kinesis. These new solutions
are evolving to meet real-time requirements and go by many names: real-time
stream processing, event processing, and complex event processing are among
the most common. In this book, however, we will refer to it as stream
processing.
In the next chapter, I will talk about some popular industry use cases of
stream processing. It will help you develop a better understanding of
real-time stream processing requirements and differentiate them from batch
processing needs. You should then be able to clearly identify when to use
real-time stream processing vs. batch processing.
Summary
In this chapter, I talked about the Big Data problem and how it started. We also talked about the innovative solutions that Google created and then shared with the world. We briefly touched on the open source implementation of the Google whitepapers under the banner of Hadoop, which grabbed immense attention and popularity among organizations and professionals. We further discussed the growing expectation and need to handle time-sensitive data at speed, and why it is fostering new innovations in real-time stream processing. In the next chapter, we will progress our discussion of real-time stream processing to develop a better understanding of the subject area.