Apache Spark Introduction


We already know that when we have a massive volume of data, it isn't efficient or cost-effective to process it on a single computer. No matter how big and powerful the individual machine is, we will hit a bottleneck at some point. As an alternative, we need a distributed computing platform. A distributed computing platform is a cluster of computers and a set of services. These services pool the resources of the cluster members and offer a mechanism to accomplish massive data processing work.
Hadoop offered a revolutionary solution to this problem, solving it in two parts.

  1. HDFS - a distributed storage system.
  2. MapReduce - a distributed computing engine.

However, creating MapReduce programs turned out to be something of a black art. The developer community found it hard to develop and maintain MapReduce programs, and many also criticised MapReduce for its poor performance.


Apache Spark

Later, Apache Spark came out of UC Berkeley. The community initially looked at Spark as a compelling alternative to, or a replacement for, Hadoop's MapReduce. Over time, Apache Spark has become the de facto standard for big data computing. We can describe Apache Spark as follows.

Apache Spark is a fast, general-purpose engine for large-scale data processing. Under the hood, it works on a cluster of computers.

However, this description is not enough to understand Apache Spark. In this article, I will try to uncover Apache Spark and explain each of its components. By the end of this article, you will understand the jargon associated with Apache Spark and have a clear picture of how Spark works internally.
Let's start with a diagram that represents the Apache Spark Ecosystem.

Fig. 1 - The Apache Spark ecosystem.

Based on the figure shown above, we can break the Apache Spark ecosystem into three layers.

  1. Storage and Cluster Manager
  2. Spark Core
  3. Libraries and DSL

Storage and Cluster Manager

Apache Spark is a distributed processing engine. However, it doesn't come with an inbuilt cluster resource manager or a distributed storage system. There is a good reason behind that design decision. From the beginning, Apache Spark decoupled the functionality of the cluster resource manager, the distributed storage, and the distributed computing engine. This design allows us to use Apache Spark with any compatible cluster manager and storage solution. Hence, the storage and the cluster manager are part of the ecosystem, but they are not part of Apache Spark itself. You can plug in a cluster manager and a storage system of your choice, and there are multiple alternatives. You can use Apache YARN, Mesos, or even Kubernetes as the cluster manager for Apache Spark. Similarly, for the storage system, you can use HDFS, Amazon S3, Google Cloud Storage, the Cassandra file system, and many others.
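
To make this pluggability concrete, here is a minimal PySpark sketch (not from the original article) showing that the cluster manager is selected through the master URL and the storage system simply through the path's URI scheme. The master URL, bucket names, and paths below are placeholders, and cloud storage access assumes the matching Hadoop connector and credentials are available.

```python
from pyspark.sql import SparkSession

# The cluster manager is chosen through the master URL, typically passed
# via spark-submit --master (e.g. "yarn", "k8s://https://<api-server>",
# "spark://<host>:7077"). Here we use local mode for a quick test.
spark = (
    SparkSession.builder
    .appName("storage-and-cluster-manager-demo")
    .master("local[*]")   # swap for yarn / k8s:// / spark:// on a real cluster
    .getOrCreate()
)

# The storage system is chosen by the URI scheme of the path.
# These paths are placeholders for illustration only.
df_hdfs = spark.read.parquet("hdfs:///data/events")          # HDFS
df_s3   = spark.read.parquet("s3a://my-bucket/data/events")  # Amazon S3
df_gcs  = spark.read.parquet("gs://my-bucket/data/events")   # Google Cloud Storage
```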

Spark Core

Spark Core contains two main components.

  1. Spark Compute engine
  2. Spark Core APIs

The earlier discussion makes one thing clear: Apache Spark does not offer cluster management or storage management services. However, it has a compute engine as part of the Spark Core. The compute engine provides basic functionalities like memory management, task scheduling, fault recovery and, most importantly, interacting with the cluster manager and the storage system. So, it is the Spark compute engine that executes and manages our Spark jobs and provides a seamless experience to the end user. You just submit your job to Spark, and the Spark Core takes care of everything else.
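
As a rough sketch of what "just submit your job" looks like in practice, the hypothetical PySpark script below (file names, column names, and paths are invented for illustration) only describes a small aggregation; when it is submitted, the compute engine takes care of planning, scheduling, memory management, and fault recovery.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily-sales-report").getOrCreate()

# Read the input data; the path and columns are placeholders.
sales = spark.read.csv("hdfs:///data/sales.csv", header=True, inferSchema=True)

# Describe the computation. Nothing runs until an action is triggered.
report = (
    sales.groupBy("store_id")
         .agg(F.sum("amount").alias("total_amount"))
)

# The write is an action: at this point the compute engine builds the
# execution plan, asks the cluster manager for resources, schedules tasks
# on executors, and retries any tasks that fail.
report.write.mode("overwrite").parquet("hdfs:///reports/daily_sales")
spark.stop()
```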

The second part of the Spark Core is the core APIs. The Spark Core consists of two types of APIs.

  1. Structured API
  2. Unstructured API

The structured APIs consist of DataFrames and Datasets. They are designed and optimized to work with structured data. The unstructured APIs are the lower-level APIs, including RDDs, accumulators, and broadcast variables. These core APIs are available in Scala, Python, Java, and R.
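
A small PySpark sketch may help contrast the two API styles; the names and data here are invented for illustration. The DataFrame represents the structured side, while the RDD, broadcast variable, and accumulator represent the lower-level, unstructured side.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("api-comparison").getOrCreate()
sc = spark.sparkContext

# Structured API: a DataFrame with named columns, optimized by the engine.
people_df = spark.createDataFrame(
    [("Alice", 34), ("Bob", 29)], ["name", "age"]
)
people_df.where(people_df.age >= 30).show()

# Unstructured (lower-level) API: an RDD plus broadcast and accumulator.
roles = sc.broadcast({"Alice": "admin", "Bob": "user"})  # read-only shared value
counter = sc.accumulator(0)                              # counter updated by executors

people_rdd = sc.parallelize([("Alice", 34), ("Bob", 29)])

def tag_role(record):
    counter.add(1)                     # executors add, the driver reads the total
    name, age = record
    return (name, age, roles.value.get(name, "unknown"))

print(people_rdd.map(tag_role).collect())
print("records processed:", counter.value)
```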


Libraries and DSL

Outside the Spark Core, we have four different sets of libraries and packages.

  1. Spark SQL - It allows you to use SQL queries for structured data processing.
  2. Spark Streaming - It helps you to consume and process continuous data streams.
  3. MLlib - It is a machine learning library that delivers high-quality algorithms.
  4. GraphX - It comes with a library of typical graph algorithms.

These are essentially sets of packages and libraries that offer you APIs, DSLs, and algorithms in multiple languages. They all depend on the Spark Core APIs to achieve distributed processing.
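
As one concrete example of these libraries (with invented table and column names), the sketch below uses Spark SQL to query a DataFrame through a temporary view. Spark Streaming, MLlib, and GraphX are used in a similar way, each building on the Spark Core APIs.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-demo").getOrCreate()

orders = spark.createDataFrame(
    [(1, "books", 12.50), (2, "games", 30.00), (3, "books", 8.99)],
    ["order_id", "category", "amount"],
)

# Spark SQL: expose the DataFrame as a temporary view and query it with SQL.
orders.createOrReplaceTempView("orders")
totals = spark.sql("""
    SELECT category, SUM(amount) AS total
    FROM orders
    GROUP BY category
""")
totals.show()
```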

Why is Spark so popular?

At a very high level, there are three main reasons for its popularity and rapid adoption.

1. It abstracts away the fact that you are coding to execute on a cluster of computers. In the best case, you will be working with tables, just like in any other database, and using SQL queries. In the worst case, you will be working with collections, and it will feel like working with a local Scala or Python collection. Everything else, all the complexity of distributed storage, computation, and parallel programming, is abstracted away by the Spark Core.

2. Spark is a unified platform that combines batch processing, structured data handling with a SQL-like language, near real-time stream processing, graph processing, and machine learning, all in a single framework using your favorite programming language. You can mix and match these capabilities to solve many sophisticated requirements.

3. Ease of use. If you compare it with MapReduce code, Spark code is much shorter, simpler, and easier to read and understand. The growing ecosystem of libraries offers ready-to-use algorithms and tools, and the Spark community is continuously working towards making Spark more straightforward with every new release.
Now that we understand the Spark ecosystem, continue reading to uncover the internals.
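
To illustrate the point about brevity, here is the classic word count in PySpark; the input and output paths are placeholders. The equivalent MapReduce program typically runs to dozens of lines of Java.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("word-count").getOrCreate()
sc = spark.sparkContext

# Read text files, split lines into words, and count each word.
counts = (
    sc.textFile("hdfs:///data/books/*.txt")
      .flatMap(lambda line: line.split())
      .map(lambda word: (word, 1))
      .reduceByKey(lambda a, b: a + b)
)

counts.saveAsTextFile("hdfs:///output/word_counts")
```
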

Read Next

Spark Introduction | Spark Internals | Parallel Processing in Apache Spark

By Prashant Pandey


