In this video, I will give you a quick introduction to Apache Spark, its technical stack, and its primary components. Let's start
with some background.
We already know that when we have a massive volume of data, let's say hundreds of gigabytes
or terabytes, and we want to process it for whatever purpose, it won't be efficient or cost-effective
to do it on a single computer. No matter how big and powerful the individual machine is, we will
surely hit a bottleneck at some point.
We also know that Hadoop offered a revolutionary solution to this problem. Hadoop solved it
in two parts.
- HDFS - Offers distributed storage.
- MapReduce - Offers a distributed computing engine, or framework.
However, creating MapReduce programs has always been a kind of black magic. Most of the developer community found it difficult to code,
and many also criticised MapReduce for its poor performance.
Later, Apache Spark came out of UC Berkeley. You can think of Spark as a compelling alternative
to, or a replacement for, Hadoop's MapReduce. If you come from a Hadoop MapReduce background, Spark
is 10 to 100 times faster than Hadoop's MapReduce. If you know nothing about Hadoop, that's okay.
Hadoop is not a prerequisite for learning Spark. However, you can think of Spark as a successor to
Hadoop.
What is Apache Spark?
If you check the official Spark page, you might notice the following phrases.
"Lightning-fast cluster computing"
"A fast and general engine for large-scale data processing"
Check out the Databricks website; they are the primary force behind Apache Spark. You might notice the following.
"Powerful, open source, ease of use and what not"
That's correct. Spark is indeed all of this. But all that sounds like a marketing pitch.
What exactly is Apache Spark?
You can define it like this.
Apache Spark is a fast and general-purpose engine for large-scale data processing. Under the
hood, it works on a cluster of computers.
But I am still not sure.
What exactly is Apache Spark?
Let’s try to understand that.
There are two things.
- A cluster computing engine.
- A set of libraries, APIs, and DSLs.
These two things together make Apache Spark.
Now, look at the diagram below.
At the base, you have Spark Core. Spark Core itself has two parts.
- A compute engine.
- A set of core APIs.
Apache Spark is a distributed processing engine, but it doesn't come with an inbuilt cluster resource manager or a distributed
storage system. You have to plug in a cluster manager and a storage system of your choice. There are
multiple alternatives. You can use Apache YARN, Mesos, or Kubernetes as the cluster manager for Apache
Spark.
Similarly, for the storage system, you can use HDFS, Amazon S3, Google Cloud Storage, the Cassandra
File System, and more, as the short sketch below illustrates.
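Here is a minimal PySpark sketch of what that pluggability looks like in practice. The file paths and bucket name are hypothetical, and reading from S3 or any other external store assumes the corresponding connector libraries are available on your cluster.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("storage-demo").getOrCreate()

# The same read API works against different storage systems;
# only the URI scheme in the path changes. Paths below are made up.
df_hdfs  = spark.read.csv("hdfs:///data/sales.csv", header=True)    # HDFS
df_s3    = spark.read.csv("s3a://my-bucket/sales.csv", header=True) # Amazon S3
df_local = spark.read.csv("file:///tmp/sales.csv", header=True)     # local file system
```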
So, one thing is clear. Apache Spark doesn't offer cluster management and storage management
services. However, it has a compute engine. The compute engine provides some basic functionalities
such as memory management, task scheduling, fault recovery, and, most importantly, interacting with the
cluster manager and the storage system. So, it's the Spark Core, or we can say the Spark compute engine,
that executes and manages our Spark jobs and provides a seamless experience to the end user. You
just submit your job to Spark, and the Spark Core takes care of everything else.
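To make "submit your job" a little more concrete, here is a minimal PySpark sketch. The master setting decides which cluster manager runs the job; in practice you usually leave it out of the code and pass it to spark-submit instead. The application name and the tiny computation are placeholders for this illustration.

```python
from pyspark.sql import SparkSession

# "local[*]" runs everything on the local machine using all cores.
# On a real cluster, the master would point to YARN, Mesos, or Kubernetes,
# typically supplied at submit time rather than hard-coded here.
spark = (SparkSession.builder
         .appName("hello-spark")   # placeholder application name
         .master("local[*]")
         .getOrCreate())

# Spark Core handles scheduling, memory management, and fault recovery
# for whatever work we express through the APIs.
print(spark.range(1, 1000).count())

spark.stop()
```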
Now, coming back to the second part of Spark Core.
Spark Core APIs
Spark Core consists of two sets of APIs.
- Structured APIs
- Unstructured APIs
The Structured APIs consist of DataFrames and Datasets. They are designed and optimized to work with structured data.
The Unstructured APIs are the lower-level APIs, including RDDs, Accumulators, and Broadcast variables.
These core APIs are available in Scala, Python, Java, and R. We will learn more about these APIs
as we progress with the tutorial.
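As a small illustration of the difference, here is a sketch that builds the same tiny dataset first as an RDD (unstructured API) and then as a DataFrame (structured API). The column names and sample records are made up for the example.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("api-demo").master("local[*]").getOrCreate()
sc = spark.sparkContext

data = [("alice", 34), ("bob", 41)]   # made-up sample records

# Unstructured (lower-level) API: an RDD of plain Python tuples.
rdd = sc.parallelize(data)
adults_rdd = rdd.filter(lambda row: row[1] > 35)

# Structured API: a DataFrame with named columns that Spark can optimize.
df = spark.createDataFrame(data, ["name", "age"])
adults_df = df.filter(df.age > 35)

print(adults_rdd.collect())
adults_df.show()

spark.stop()
```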
Outside the Spark Core, we have four different sets of libraries and packages.
- Spark SQL - Allows you to use SQL queries for structured data processing.
- Spark Streaming - Helps you to consume and process continuous data streams.
- MLlib - A machine learning library that delivers high-quality algorithms.
- GraphX - Comes with a library of typical graph algorithms.
These are nothing but sets of packages and libraries. They offer APIs, DSLs, and algorithms in multiple languages, and they directly depend on the Spark Core APIs to achieve distributed processing. The short Spark SQL sketch below shows one of them in action.
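Here is a minimal Spark SQL sketch: register a DataFrame as a temporary view and query it with plain SQL. The table and column names are made up for the illustration.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-demo").master("local[*]").getOrCreate()

df = spark.createDataFrame(
    [("books", 120.0), ("music", 45.5), ("books", 80.0)],
    ["category", "amount"],   # made-up columns
)

# Expose the DataFrame to Spark SQL under a temporary view name.
df.createOrReplaceTempView("sales")

# A plain SQL query, executed by the same distributed engine.
spark.sql("""
    SELECT category, SUM(amount) AS total
    FROM sales
    GROUP BY category
""").show()

spark.stop()
```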
Why is Spark popular?
- It abstracts away the fact that you are coding to execute on a cluster of computers. In the best-case
scenario, you will be working with tables, just as in any other database, and using SQL queries. In
the worst-case scenario, you will be working with collections, and it will feel like working with
a local Scala or Python collection.
Everything else, all the complexity of distributed storage, computation, and parallel programming, is abstracted away by the Spark Core.
- Spark is a unified platform that combines batch processing, structured data handling with an SQL-like language, near real-time stream processing, graph processing, and machine learning, all in a single framework using your favourite programming language. You can mix and match them to solve many sophisticated requirements.
- Ease of use. If you compare it with MapReduce code, Spark code is much shorter, simpler, and easier to read and understand. There is also a growing ecosystem of libraries that offer ready-to-use algorithms and tools, and the Spark community is continuously working towards making it more straightforward with every new release. The classic word count shown below gives a feel for that brevity.
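Here is the classic word count written with the RDD API in PySpark. A comparable MapReduce job would typically need separate mapper and reducer classes plus driver boilerplate; the input path below is hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("word-count").master("local[*]").getOrCreate()
sc = spark.sparkContext

# Hypothetical input path; any text file on a supported storage system works.
counts = (sc.textFile("file:///tmp/input.txt")
            .flatMap(lambda line: line.split())     # split each line into words
            .map(lambda word: (word, 1))            # pair each word with a count of 1
            .reduceByKey(lambda a, b: a + b))       # sum the counts per word

for word, count in counts.take(10):
    print(word, count)

spark.stop()
```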