Chapter 1 - Dawn of Bigdata
The internet and the world wide web had a revolutionary impact on our life, culture, commerce, and technology. The search engine has become a necessity in our daily life. However, until the year 1993, the world wide web was indexed by hand. Sir Tim Berners-Lee used to edit that list by hand and host it on the CERN web server. Later, Yahoo emerged as the first popular method for people to find web pages on the internet. However, it was based on a web directory rather than a full-text index of the web pages. Around the year 2000, a new player Google rose to the prominence by an innovative idea of PageRank algorithm. The PageRank algorithm was based on the number and the domain authority of other websites that link to the given web page. One of the critical requirements of implementing the PageRank algorithm was to crawl and collect a massive amount of data from the world wide web. This is the time and the need that I would attribute to the birth of Big data.
Google was not only the first to realize the Big data problem, but also the first to develop a viable solution and establish a commercially successful business around the solution. At a high level, Google had four main problems to solve.
- Discover and crawl the web pages over world wide web and collect the information about those pages.
- Store the massive amount of data collected by the web crawler.
- Apply the PageRank algorithm on the received data and create an index to be used by the search engine.
- Organize the index in a random-access database that can support high-speed queries by the Google search engine application.
These four problems are at the core of any data-centric application, and they translate into four categories of the problem domain.
- Data Ingestion/Integration
- Data storage and management
- Data processing and transformation
- Data access and retrieval
These problems were not new, and in general, they existed even before the year 2000 when Google was trying to solve them using a new and innovative approach. The industry was already generating, collecting, processing and accessing data and the database management systems (DBMS) like Oracle and MS SQL server were in the centre of such applications. At the time, some organizations were already managing terabytes of data volume using systems like Teradata. However, the problem that Google was trying to solve had a combination of three unique attributes that made it much more difficult to be addressed using DBMS of the time. Let’s try to understand these attributes that uniquely characterize the Big data problem.
Creating a web crawler was not a big deal for Google. It was a simple program that takes a page URL, retrieves the content of the page and stores it in the storage system. But in this process, the main problem was the amount of data that the crawler collects. The web crawler was supposed to read all the web pages over the internet and store a copy of those pages in the storage system. Google knew that they will be dealing with the gigantic volume of data and no DBMS at the time was scalable to manage that quantity.
The content of the web pages had no structure. It was not possible to convert them into a row/column format without storing and processing them. Google knew that they need a system to deal with the raw data files that may come in a variety of formats. The DBMS of the time had no meaningful support to deal with the raw data.
Velocity is one of the most critical and essential characteristics of any usable
system. Google needed to acquire data quickly, process it and use it at a faster
rate. The speed of the Google search engine has been its USP since they came into
the search engine business.
It is the speed that is driving the new age of big data applications that need to deal with the data in real-time. In some cases, the data points are highly time sensitive like a patient’s vitals. Such data elements have a limited shelf-life where their value can be futile with time - in some cases, very quickly. I will come back to the velocity once again because the velocity is the driving force of a real-time system – the main subject of this book.
Google successfully solved all the above problems, and they were generous enough to
reveal their solution to the rest of the world in a series of three white papers.
These three whitepapers talked about Google’s approach to solving data storage,
data processing, and the data retrieval.
The first whitepaper was published by Google in the year 2003. The first paper elucidated the architecture of a scalable, distributed filesystem which they termed as Google File System (GFS). The GFS solved their problem of storage by leveraging commodity hardware to store data on a distributed cluster.
The second whitepaper was published by Google in the year 2004. This paper revealed the MapReduce (MR) programming model. The MR model adopted a functional programming style, and Google was able to parallelize the MR functions on a large cluster of commodity machines. The MR model solved their problem of processing large quantities of data in a distributed cluster.
The last paper in this series was published by Google in the year 2006. This paper described the design of a distributed database for managing structured data across thousands of commodity servers. They named it Google Bigtable and used it for storing petabytes of data and serve it at a lower latency.
All these three whitepapers were well appreciated by the open source community, and they formed the basis for the design and development of similar open source implementation – Hadoop. The GFS is implemented as a Hadoop distributed file system – HDFS. The Google MR is implemented as Hadoop MapReduce programming framework on top of HDFS. Finally, the Bigtable was implemented as Hadoop database – HBase.
Since then, there are many other solutions developed over the Hadoop platform and open sourced by various organizations. Some of the most widely adopted systems are Pig, Hive and finally Apache Spark. All these solutions were simplification and improvement over the Hadoop MapReduce framework, and they shared the common trait of processing large volumes of data on a distributed cluster of commodity hardware.
Hadoop and all the other solutions developed over the HDFS tried to solve a common
problem of processing large volumes of data that was previously stored in the
distributed storage. This approach is popularly termed as batch processing. In the
batch processing approach, the data is collected and stored in the distributed
system. Then a batch job that is written in MapReduce is used to read the data and
process the same by utilizing the distributed cluster. While you are processing the
previously stored data, the new data keeps arriving at the storage. Then you must
do the next batch and then take care of combining the results with the outcome of
the previous batch. This process goes on in a series of batches.
In the batch processing approach, the outcome is available after a specific time that depends upon the frequency of your batches and the time taken by the batch to complete the processing. The insights derived from these data processing batches are valuable. However, all such insights are not equal. Some insights may have a much higher value shortly after the data was initially generated and that value diminishes very fast with the time.
Many use cases need to collect the data about the events in real-time as they occur and extract an actionable insight as quickly as possible. For example, a fraud detection system is much more valuable if it can identify a fraudulent transaction even before the transaction completes. Similarly, in a healthcare ICU or in the surgical operation setup, the data from various monitors can be used in real time and generate alerts to nurses and doctors, so they know instantly about changes in a patient’s condition.
In many cases, the data points are highly time-sensitive, and they need to be actioned in minutes, seconds or even in milliseconds. Enormous amounts of time-sensitive data are being generated from a rapidly growing set of disparate data sources. We will discuss some of the prevalent use cases and data sources for such requirements in the next chapter. However, it is essential to comprehend that the demand for the speed in processing the time-sensitive data is continuously pushing the limits of Bigdata processing solutions. These requirements are the source of many innovative and new frameworks such as Kafka Streams, Spark streaming, Apache Storm, Apache Flink, Apache Samza and cloud services like Google cloud dataflow and Amazon Kinesis. These new solutions are evolving to meet the real-time requirements and are called by many names. Some of the commonly used names are Real-time stream processing, Event processing, and Complex event processing. However, in this book, we will be referring it as stream processing.
In the next chapter, I will talk about some popular industry use cases of stream processing. It will help you to develop a better understand of the real-time stream processing requirements and differentiate it from the batch processing needs. You should be able to clearly identify and answer when to use real-time stream processing vs. batch processing.
In this chapter, I talked about the big data problem and how it started. We also talked about the innovative solutions that Google created and then shared it with the world. We briefly touched on the opensource implementation of Google whitepapers under the banner of Hadoop. Hadoop grabbed immense attention and popularity among the organizations and professionals. We further discussed the growing expectation and the need to handle time-sensitive data at speed and why it is fostering new innovations in real-time stream processing. In the next chapter, we will progress our discussion about real-time stream processing to develop a better understanding of the subject area.