In the previous session, we learned about the application driver and the executors. We know that Apache Spark breaks our
application into many smaller tasks and assign them to executors. So, Spark executes the application
in parallel. But do you understand the internal mechanics?
That's the topic of this video. We will try to deep dive inside the Spark and investigate the internal mechanics of parallel processing. The objective of this video is to develop a solid understanding of Spark's execution model and Architecture.
So, fasten your seatbelt and let's start for the ride.
All that you are going to do in Apache Spark is to read some data from a source and load it into Spark. Process the data and hold the intermediate results, and finally write the results back to a destination.
But in this process, you need a data structure to hold the data in Spark. We have three alternatives to hold data in Spark.
- Data Frame.
We will learn about all of these. In fact, Spark 2.x recommends using the first two and avoid using RDDs. So, the primary focus of this training will be on first two items. But there is a critical fact to note about RDDs. Data Frames and Datasets, both are ultimately compiled down to an RDD. So, under the hood, everything in Spark is an RDD. And for that reason, I will start with RDDs and try to explain the mechanics of parallel processing.
What is Spark RDD?
Let's define the RDD. The name stands for Resilient Distributed Dataset. However, I can describe it like this.
Spark RDD is a resilient, partitioned, distributed and immutable collection of data.
Let's quickly review this description.
- Collection of data - This one is the most basic thing. They hold data and appears to be a Scala Collection.
- Resilient - That means, they can recover from a failure. RDDs are fault tolerant.
- Partitioned - Spark breaks the RDD into smaller chunks of data. These pieces are called partitions.
- Distributed - Instead of keeping these Partitions on a single machine, Spark spreads them across the cluster. So, they are a distributed collection of data.
- Immutable - Once defined, you can't change them. So, Spark RDD is a read-only data structure.
You can create an RDD using two methods.
- Load some data from a source.
- Create an RDD by transforming another RDD.
Let me show you an example. We will load some data from a file to create an RDD, and then I will show you the number of partitions.
Great, now we know that an RDD is a Partitioned Collection, and we can control the number of partitions. However, that wasn't the objective of this video. At the beginning of this video, I said that I am going to answer the below question.
How does the Spark break our code into a set of tasks and run it in parallel?
Let me ask you a simple question.
Given the above RDD, If I want to count the number of lines in that RDD, Can we do it in parallel?
No brainer. Right? We already have five partitions. I will give one partition to each executor and ask them to count the lines in the given partition. Then I will take the counts back from those executors and sum it. Simple. Isn't it? That's what the Spark does.
Let's execute the code to count the number of lines in this RDD.
Do you want to see what happens under the hood?
Watch the video. The video shows and explains following things.
- Spark Job
- Job Stages
- Spark Tasks
However, let me try to summarize it.
We loaded the file and asked for the count, and hence Spark started one Job. The job is to calculate the count.
Spark breaks that job into five tasks because we had five partitions. And it starts one counting task per partition. A task is the smallest unit of work, and it is performed by an executor.
We talk about stages in the next example.
For this example, I am executing it in local mode, and I have a single executor. Hence all these tasks are executed by the same executor.
You can try the same example on a real multi-node cluster and see the difference.
Great. I think by the end of the video, you will have a fair idea about the parallel processing in Apache Spark.
There are two main variables to control the degree of parallelism in Apache Spark.
- The number of partitions.
- The number of executors.
If you have ten partitions, you can achieve ten parallel processes at the most. However, if you have just two executors,
all those ten partitions will be queued to those two executors.
So far so good. Let's do something little more complicated and take our understanding to the next level.
Spark RDD Example
Here is an example in Scala as well as in Python. Let me quickly explain the code.
Line one loads a text file into an RDD. The file is quite small. If you keep it in HDFS, it may have one or two blocks in
HDFS, So, it is likely that you get one or two partitions by default. However, we want to make five
partitions of this data file.
The line two executes a map method on the first RDD and returns a new RDD. We already know that RDDs are immutable. So, we can't modify the first RDD. Instead, we take the first RDD, perform a map operation and create a new RDD. The map operation splits each line into an array of words. Hence, the new RDD is a collection of Arrays. The third line executes another Map method over the arrayRDD. This time, we generate a tuple. A key value pair. I am taking the second element of the array as my key. And the value is a hardcoded numeric one.
What am I trying to do?
Well, I am trying to count the number of files in each different directory. That's why I am taking the directory name as a key and one as a value. Once I have the kvRDD, I can easily count the number of files. All I have to do is to group all the values by the key and sum up the 1s. That's what the fourth line is doing. The ReduceByKey means, group by key and sum those 1s.
Finally, I collect all the data back from the executors to the driver.
I am assuming that you already know the mechanics of the map and reduce methods. I have covered all these fundamentals in my Scala tutorials. I have also included a lot of content about functional programming. If you are not familiar with those fundamentals, I strongly recommend that you first go through my Scala training to take full advantage of the Spark tutorials.
Let's execute it and see what's happening behind the scenes.
This time I want to use a mult-inode cluster.
Please watch the video. The video shows following things.
- Start a six-node Spark cluster.
- Create the data file on the master node.
- Copy the data file to HDFS.
- Start a Spark Shell.
- Paste the first four lines in the shell.
- Check the Spark UI.
At this stage, you won't see any Spark Jobs. There is a reason for that.
All these functions that we executed on various RDDs are lazy. They don't perform anything until you run a Function that is not lazy. We call the lazy Functions as transformations. The non-lazy functions are Actions.
Spark Transformations and Actions
RDDs offer two types of operations.
The transformation operations create a new distributed dataset from an existing distributed dataset. So, they create a new
RDD from an existing RDD.
The Actions are mainly performed to send results back to the driver, and hence they produce a non-distributed dataset.
The map and the reduceByKey are transformations whereas collect is an action.
All transformations in Spark are lazy. That means, they don't compute results until an action requires them to provide results. That's why you won't see any jobs in your Spark UI.
Now, you can execute the action on the final RDD and check the Spark UI once again.
The video shows you that there is one job, two stages, and ten tasks.
Apache Spark has completed the Job in two Stages. It couldn't do it in a single Stage, and we will look at the reason in a minute. But for now, just realize that the Spark has completed that Job in two stages. We had five partitions. So, I expected five tasks. Since the job went into two Stages, you will see ten Tasks. Five Task for each stage.
You will also see a DAG. The Spark DAG shows the whole process in a nice visualization.
The DAG shows that the Spark was able to complete first three activities in a single stage. But for the ReduceByKey function, it took a new Stage. The question is Why?
The video shows you a logical diagram to explain the reason and shows the shuffle and sort activity. The Shuffle is the main reason behind a new stage. So, whenever there is a need to move data across the executors, Spark needs a new Stage. Spark engine will identify such needs and break the Job into two Stages.
While learning Apache Spark, whenever you do something that needs moving data, for example, a group by operation or a join operation, you will notice a new Stage.
Once we have these key based partitions, it is simple for each executor to calculate the count for the keys that they own. Finally, they send these results to the driver because we executed the collect method.
This video gives you a fair Idea about the following things.
- RDDs – RDDs are the core of Spark. They represent a partitioned and distributed data set.
- Transformations and Actions – We can perform transformations and actions over the RDDs.
- Spark Job – An action on an RDD triggers a job.
- Stages – Spark breaks the job into stages.Shuffle and Sort – A shuffle activity is a reason to break the job into two stages.
- Shuffle and Sort – A shuffle activity is a reason to break the job into two stages.
- Tasks – Each stage is executed in parallel tasks. The number of parallel tasks is directly dependent on the number of partitions.
- Executors – Apart from the tasks, the number of available executors is also a constraint on the degree of parallelism.