In this session, I will talk about Apache Spark Architecture. We will try to understand various moving parts of Apache Spark,
and by the end of this video, you will have a clear understanding of much Spark-related jargon and
the anatomy of Spark Application execution.
We already had an introduction to Apache Spark. I guess you learned enough to answer these questions.
- What is Apache Spark?
A distributed computing platform.
- What do we do with Apache Spark?
We create programs and execute them on a Spark Cluster.
How do we execute Spark Programs?
There are two methods.
- Interactive clients (Scala Shell, Pyspark, Notebooks)
- Submit a job (Spark submit utility)
I already showed you Spark Installation. We accessed Spark using Scala Shell, Pyspark Shell, and Jupyter
notebooks. All of those are interactive clients. That's the first method for executing your code
on a Spark cluster. Most people use interactive clients while learning or developing.
Interactive clients are best suited for exploration.
But ultimately, all your exploration will end up in a full-fledged Spark application.
It may be a streaming application. For example, reading a news feed as it arrives and applying a machine learning algorithm
to figure out what type of users might be interested in this news.
It may be a batch job. For example, my YouTube statistics. Every twenty-four hours a batch job reads all the data collected during the period and computes the watch time minutes for that period. Finally, it inserts one record in some database, and I see it on my dashboard.
In both cases (a long-running streaming job or a periodic batch job), you must package your application and submit it to the Spark cluster for execution. That's the second method for executing your programs on a Spark cluster. For a production use case, you will be using this technique. Apache Spark comes with a spark-submit utility. We will learn more about it later, and I will show you its various options and how to use it.
Great, we have an answer to the following question.
How do we execute our programs on a Spark Cluster?
There are two methods.
- Interactive Clients
- Spark Submit utility
We use interactive clients for exploration and spark-submit for executing a production application.
How does Spark execute a program?
Spark is a distributed processing engine, and it follows the master-slave architecture. So, for every Spark App, it will
create one master process and multiple slave processes. In Spark terminology, the master is the driver,
and the slaves are the executors.
Let's try to understand it with a simple example.
Suppose you are using the Spark Submit utility. You execute an application 'A1' using Spark Submit, and Spark will create one driver process and some executor processes for A1. The entire set of driver and executors is exclusive to the application A1.
Now, you submit another application A2, and Spark will create one more driver process and some executor processes for A2.
So, for every application, Spark creates one driver and a bunch of executors. Since the driver is the master, it is responsible for analysing, distributing, scheduling and monitoring work across the executors. The driver is also responsible for maintaining all the necessary information during the lifetime of the application.
Now the executors, they are only responsible for executing the code assigned to them by the driver and reporting the status back to the driver.
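Here is a small sketch of that division of labour. It is not Spark code. It is a toy model in plain Python where a thread pool stands in for the executors, and the names run_application and executor_task are my own, made up for illustration. The driver splits the work, hands each piece to an executor, and collects the status reports.

```python
from concurrent.futures import ThreadPoolExecutor


def run_application(app_name, data, num_executors=3):
    """Toy 'driver': split the work, hand it to executors, collect their reports."""
    # The driver analyses the job and cuts it into one partition per executor.
    partitions = [data[i::num_executors] for i in range(num_executors)]

    def executor_task(executor_id, partition):
        # An executor only runs the code it was given and reports back.
        return {"executor": executor_id, "status": "SUCCESS", "result": sum(partition)}

    # Every application gets its own pool, like the dedicated executors per app.
    with ThreadPoolExecutor(max_workers=num_executors) as pool:
        futures = [pool.submit(executor_task, i, part)
                   for i, part in enumerate(partitions)]
        reports = [future.result() for future in futures]  # driver monitors each task

    total = sum(report["result"] for report in reports)
    return {"app": app_name, "total": total, "reports": reports}


if __name__ == "__main__":
    a1 = run_application("A1", list(range(1, 101)))
    a2 = run_application("A2", list(range(10)), num_executors=2)
    print(a1["app"], a1["total"], len(a1["reports"]))
    print(a2["app"], a2["total"], len(a2["reports"]))
```

Notice that A1 and A2 never share a pool. That mirrors the idea that each application owns its driver and its executors.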
Great. Now we know that every Spark application has a set of executors and one dedicated driver. The next question is this.
Who executes where?
We have a cluster, and we also have a local client machine. We start the Spark application from our client machine. So, the
question is what goes where?
The executors are always going to run on the cluster machines. There is no exception to this.
But you have the flexibility to start the driver on your local machine or on the cluster itself. You might be wondering why we have this flexibility.
I will explain that in a minute but let me formalize this idea.
When you start an application, you have a choice to specify the execution mode, and there are two options.
- Client Mode - Start the driver on your local machine
- Cluster Mode - Start the driver on the cluster.
The Client Mode will start the driver on your local machine, and the Cluster Mode will start the driver on the cluster.
You already know that the driver is responsible for the whole application. If anything goes wrong with the driver, your application state is gone. So, if you start the driver on your local machine, your application is directly dependent on your local computer. You don't want that dependency in a production application. After all, you have a dedicated cluster to run the job. Right?
Hence, the Cluster mode makes perfect sense for production deployment. Because after spark-submit, you can switch off your local computer and the application executes independently within the cluster.
On the other side, when you are exploring things or debugging an application, you want the driver to be running locally. If the driver is running locally, you can easily debug it, or at least it can throw back the output on your terminal. Right?
That's where the client mode makes more sense than the cluster mode. Hence, when you start a Spark shell or any other interactive client, you will be using the client mode.
So, if you are running a Spark shell, your driver is running locally within the same JVM process. You won't find a separate driver process. It's only the Spark shell, and the driver is embedded within the shell.
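To make the choice concrete, here is a minimal sketch that assembles a spark-submit command line for either mode. The helper build_spark_submit is hypothetical, something I made up for illustration. The --class and --deploy-mode options are real spark-submit options, and the class name and jar path are the SparkPi example used in my demo.

```python
import shlex

VALID_DEPLOY_MODES = ("client", "cluster")


def build_spark_submit(main_class, app_jar, app_args=(), deploy_mode="client"):
    """Return the argument list for a spark-submit call in the given deploy mode."""
    if deploy_mode not in VALID_DEPLOY_MODES:
        raise ValueError(f"deploy_mode must be one of {VALID_DEPLOY_MODES}")
    # client  -> the driver starts on the machine that runs spark-submit
    # cluster -> the driver starts on one of the cluster machines
    command = ["spark-submit", "--class", main_class, "--deploy-mode", deploy_mode]
    command.append(app_jar)
    command.extend(str(arg) for arg in app_args)
    return command


if __name__ == "__main__":
    jar = "file:///usr/lib/spark/examples/jars/spark-examples.jar"
    for mode in VALID_DEPLOY_MODES:
        cmd = build_spark_submit("org.apache.spark.examples.SparkPi", jar, [1000], mode)
        print(shlex.join(cmd))
```

The sketch only builds the command. You would still run it on a machine where Spark is installed and a cluster manager is configured.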
Great, we have an answer to the following question.
How does Spark execute our programs on a cluster?
You learned the answer. Spark will create one driver and a bunch of executors. If you are using an interactive client, your client tool itself is a driver, and you will have some executors on the cluster. If you are using spark-submit in cluster mode, Spark will start your driver and executors on the Cluster.
Who controls the cluster?
How does Spark get the resources for the driver and the executors?
That's where we need a cluster manager.
As of this writing, Apache Spark supports four different cluster managers.
- Apache YARN
- Apache Mesos
- Kubernetes
- Standalone
YARN is the cluster manager for Hadoop. As of this writing, YARN is the most widely used cluster manager for Apache Spark.
Apache Mesos is another general-purpose cluster manager. If you are not using Hadoop, you might be using Mesos for your Spark cluster.
I wouldn't call Kubernetes a cluster manager. In fact, it's a general-purpose container orchestration platform that originated at Google. As of this writing, Spark on Kubernetes is not yet production-ready, but the community is working on it.
Finally, the standalone. This one is a simple and basic cluster manager that comes with Spark and makes it easy to set up a Spark cluster very quickly. I don't think you would be using it in a production environment.
No matter which cluster manager we use, all of them primarily serve the same purpose.
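From the application's side, the cluster manager mostly shows up as the value you pass to the --master option. The sketch below maps each manager to its master URL format as I understand it from the Spark documentation. The function master_url and the host names and ports in the example are my assumptions for illustration, not something taken from a real cluster.

```python
def master_url(cluster_manager, host=None, port=None):
    """Return the --master value for a cluster manager (illustrative helper)."""
    manager = cluster_manager.lower()
    if manager == "yarn":
        # YARN's address comes from the Hadoop configuration, not from the URL.
        return "yarn"
    if host is None or port is None:
        raise ValueError(f"{cluster_manager} needs a host and a port")
    if manager == "mesos":
        return f"mesos://{host}:{port}"
    if manager == "standalone":
        return f"spark://{host}:{port}"
    if manager == "kubernetes":
        # Points at the Kubernetes API server.
        return f"k8s://https://{host}:{port}"
    raise ValueError(f"unknown cluster manager: {cluster_manager}")


if __name__ == "__main__":
    # Hypothetical hosts and ports, used only to show the URL shapes.
    print(master_url("yarn"))
    print(master_url("mesos", "mesos-master.example.com", 5050))
    print(master_url("standalone", "spark-master.example.com", 7077))
    print(master_url("kubernetes", "k8s-api.example.com", 6443))
```

Whichever URL you use, the rest of the application stays the same. Only the place where Spark asks for resources changes.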
You might be interested in the step-by-step process of resource allocation. You can watch the video. It explains the resource allocation process in a YARN cluster.
Great. We covered how Spark runs on a cluster. We learned that we have two options.
- Client Mode
- Cluster Mode
There is a third option as well. The Local Mode. You can also start a Spark application in the Local Mode.
When you don't have enough infrastructure to create a multi-node cluster but still want to set up Apache Spark, perhaps just for learning purposes, you can use the local mode.
The local mode is the most comfortable method to start a Spark Application. In this mode, you don't need any cluster. Neither YARN nor Mesos, nothing. You just need a local machine and the Spark binaries. You start a Spark Application in local mode. It begins in a JVM, and everything else, including the driver and the executor, runs in the same JVM. I showed you a Spark installation in the first video. That demo uses the local mode installation. The local mode is the most convenient method for setting up a quick learning environment.
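In local mode, the --master value itself says how many worker threads that single JVM should use: local means one thread, local[K] means K threads, and local[*] means one thread per logical core. Here is a small sketch that reads those strings. The function local_threads is a made-up helper for illustration, and it deliberately ignores the less common local[K,F] form.

```python
import os
import re


def local_threads(master):
    """Return how many worker threads a local-mode master string asks for."""
    if master == "local":
        return 1  # plain "local" runs with a single worker thread
    match = re.fullmatch(r"local\[(\d+|\*)\]", master)
    if match is None:
        raise ValueError(f"not a local-mode master: {master!r}")
    value = match.group(1)
    if value == "*":
        # "*" means as many threads as logical cores on this machine.
        return os.cpu_count() or 1
    return int(value)


if __name__ == "__main__":
    for master in ("local", "local[4]", "local[*]"):
        print(master, "->", local_threads(master), "worker thread(s)")
```

So a command such as spark-shell --master local[4] would give you a driver and four worker threads, all inside one JVM on your own machine.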
Great, I talked about five things.
- Driver
- Executors
- Client mode
- Cluster mode
- Local mode
Do you want to see them visually? Sounds interesting?
Watch the video. It shows them visually.
I am placing some code below. These are the commands that I used in the demo video. You can copy and paste them if you want to try those things yourself.
//Submit a Spark Job in client mode
spark-submit --class org.apache.spark.examples.SparkPi spark-home/spark-2.2.0-bin-hadoop2.6/examples/jars/spark-examples_2.11-2.2.0.jar 1000
//Start an SSH tunnel
gcloud compute ssh --zone=us-east1-c --ssh-flag="-D" --ssh-flag="10000" --ssh-flag="-N" "spark22-notebook-m"
//Start the chrome browser using the SSH tunnel
cd C:\Program Files (x86)\Google\Chrome\Application
chrome.exe "http://spark4-m:4040" --proxy-server="socks5://localhost:10000" --host-resolver-rules="MAP * 0.0.0.0 , EXCLUDE localhost" --user-data-dir=/tmp/spark22-notebook
//Start a Spark shell with three executors
spark-shell --num-executors 3
//Submit a Spark Job in cluster mode
spark-submit --class org.apache.spark.examples.SparkPi --deploy-mode cluster file:///usr/lib/spark/examples/jars/spark-examples.jar 1000
Great. We learned and explored many things. But this doesn't end the Spark Architecture discussion. I need at least one more
video to dive deeper into Apache Spark architecture and explore the following.
How does Spark break our application into smaller parts and get it done at the executors?