Apache Spark Introduction
We already know that when we have a massive volume of data, it won't be efficient and cost-effective to process it on a single computer. No matter how big and powerful the individual machine is, we will surely hit a bottleneck at some point. As an alternative, we need a distributed platform. A distributed computing platform is a cluster of computers and a set of services. These services pool the resources of the cluster members and offer a mechanism to distribute massive data processing work across the cluster.
Hadoop offered a revolutionary solution to this problem, solving it in two parts.
- HDFS - a distributed storage system.
- MapReduce - a distributed computing engine.
However, writing MapReduce programs turned out to be something of a black art. The developer community found MapReduce programs hard to develop and manage, and many also criticised MapReduce for its poor performance.
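To see why hand-written MapReduce felt so laborious, here is a minimal sketch of the map-shuffle-reduce model applied to a word count, written in plain Python. This is only the programming model, not the actual Hadoop API; the function names are illustrative:

```python
from collections import defaultdict

def map_phase(lines):
    # Mapper: emit a (word, 1) pair for every word in every input line.
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle_phase(pairs):
    # Shuffle: group all values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reducer: sum the counts collected for each word.
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data is big", "spark processes big data"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts["big"])   # 3
```

Even for a trivial word count, you have to think in terms of three separate phases and key-value plumbing; in real Hadoop jobs this came with considerable Java boilerplate on top.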
Later, Apache Spark came out of UC Berkeley. The community started looking at Spark as a compelling alternative to, or a replacement for, Hadoop's MapReduce. With time, Apache Spark has become the de facto standard for big data computing. We can describe Apache Spark as follows.
Apache Spark is a fast, general-purpose engine for large-scale data processing. Under the hood, it works on a cluster of computers.
However, this description is not enough to understand Apache Spark. In this article, I will try to uncover Apache Spark and explain every component of it. By the end of this article, you will be able to understand the jargon associated with Apache Spark, and you will also have a fair understanding of Spark's internal working.
Let's start with a diagram that represents the Apache Spark Ecosystem.
Based on the figure shown above, we can break the Apache Spark ecosystem into three layers.
- Storage and Cluster Manager
- Spark Core
- Libraries and DSL
Storage and Cluster Manager
Apache Spark is a distributed processing engine. However, it doesn't come with an inbuilt cluster resource manager or a distributed storage system, and there is a good reason behind that design decision. From the beginning, Apache Spark decoupled the functionality of the cluster resource manager, the distributed storage, and the distributed computing engine. This design allows us to use Apache Spark with any compatible cluster manager and storage solution. Hence, the storage and the cluster manager are part of the ecosystem, but they are not part of Apache Spark itself.

You can plug in a cluster manager and a storage system of your choice, and there are multiple alternatives. You can use Apache YARN, Mesos, or even Kubernetes as the cluster manager for Apache Spark. Similarly, for the storage system, you can use HDFS, Amazon S3, Google Cloud Storage, the Cassandra File System, and many others.
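As a concrete illustration of this pluggability, the same application can be submitted to different cluster managers simply by changing the `--master` URL passed to `spark-submit`. The host names, ports, and the `my_app.py` file below are placeholders:

```shell
# Run on a YARN cluster (Hadoop configuration is picked up from the environment)
spark-submit --master yarn --deploy-mode cluster my_app.py

# Run on a Spark standalone cluster
spark-submit --master spark://master-host:7077 my_app.py

# Run on Kubernetes (API server address is a placeholder)
spark-submit --master k8s://https://k8s-apiserver:6443 --deploy-mode cluster my_app.py
```

Nothing in the application code needs to change between these deployments; the cluster manager is purely a deployment-time choice.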
Spark Core
Apache Spark Core contains two main components.
- Spark Compute engine
- Spark Core APIs
The earlier discussion makes one thing clear: Apache Spark does not offer cluster management and storage management services. However, it has a compute engine as part of Spark Core. The compute engine provides some basic functionalities like memory management, task scheduling, fault recovery and, most importantly, interacting with the cluster manager and the storage system. So, it is the Spark compute engine that executes and manages our Spark jobs and provides a seamless experience to the end user. You just submit your job to Spark, and Spark Core takes care of everything else.
The second part of Spark Core is the core APIs. Spark Core offers two types of APIs.
- Structured API
- Unstructured API
The Structured APIs consist of DataFrames and Datasets. They are designed and optimized to work with structured data. The Unstructured APIs are the lower-level APIs, including RDDs, Accumulators, and Broadcast variables. These core APIs are available in Scala, Python, Java, and R.
Libraries and DSL
Outside Spark Core, we have four different sets of libraries and packages.
- Spark SQL - It allows you to use SQL queries for structured data processing.
- Spark Streaming - It helps you to consume and process continuous data streams.
- MLlib - It is a machine learning library that delivers high-quality algorithms.
- GraphX - It comes with a library of typical graph algorithms.
These are nothing but sets of packages and libraries. They offer you APIs, DSLs, and algorithms in multiple languages, and they directly depend on the Spark Core APIs to achieve distributed processing.
Why is Spark so popular?
At a very high level, there are three main reasons for its popularity and rapid adoption.
1. It abstracts away the fact that you are coding to execute on a cluster of computers. In the best-case scenario, you will be working with tables, as in any other database, and using SQL queries. In the worst-case scenario, you will be working with collections, and it will feel like working with a local Scala or Python collection. Everything else, all the complexity of distributed storage, computation, and parallel programming, is abstracted away by Spark Core.
2. Spark is a unified platform that combines batch processing, structured data handling with an SQL-like language, near real-time stream processing, graph processing, and machine learning, all in a single framework using your favorite programming language. You can mix and match these capabilities to solve many sophisticated requirements.
3. Ease of use. If you compare it with MapReduce code, Spark code is much shorter, simpler, and easier to read and understand. The growing ecosystem of libraries offers ready-to-use algorithms and tools, and the Spark community is continuously working towards making it even more straightforward.
Now we understand the Spark Ecosystem. Continue reading to uncover the internals.