Apache Spark Foundation Course - Spark Apps - Scala vs Python for Apache Spark


Welcome back to Learning Journal. In this video, I am going to talk about the choice of Spark programming languages. You already know that Spark APIs are available in Scala, Java, and Python. Recently, Spark also started supporting the R programming language. Spark APIs are available in these four main languages. However, there are a few more language bindings that the open-source community is working on. You can get a list of those languages in the Spark documentation. There are also people working on a few other language bindings that are not yet listed on that page.


Best language for Spark

Language selection is a common question among Spark learners. Should I use Scala or Python? Is there any downside to choosing Python over Scala? I often get this question from many people. As the number of Spark language bindings grows, this question is becoming more and more critical. I am not going to recommend a language to you. Instead, I will talk about the various considerations that should form the basis of your language selection. With that knowledge, you will be empowered to make an appropriate decision.
Great! Let's start.

How to select a Spark programming language?

There are three main parameters that we can use to evaluate a language for our Spark applications.

  1. Skills and proficiency in the language.
  2. Design goals and the availability of the required features.
  3. Performance implications.

Availability of skills is one of the critical factors. If you don't have Scala programmers in your team, or if you are finding it difficult to build a Scala team but already have a bunch of Python developers, you will give a plus to Python over Scala. Right?
Similarly, if your data science team is more comfortable with the R language, then you would prefer R over Python.
Your design goals and the capability of the language to model your requirements are another deciding factor. If you want to use a feature or an API that is not available in R, you will give a minus to the R language. Similarly, if you need to work with a specific binary data format and you already have a Java library to read and manipulate that format, you would give a plus to Java or Scala.

Why compromise on a single Spark language?

The most important point to note here is that you have a choice in your hands, and it is not necessary to do everything, or the entire project, in a single language. We often see people doing a few jobs in one language and other jobs in a different language. The idea is to break your project into smaller parts and do each piece in the most suitable language.


Apache Spark language types

Great! The last piece of this puzzle is the performance of the code. To understand the performance implications of a language, we need to dive deep into the Spark architecture. To start that discussion, let's classify the available Spark languages into two categories.

  1. JVM languages. This category includes Scala and Java.
  2. Non-JVM languages. This category is for Python and R.

Now the next thing that we need to understand is the answer to the following question.
How does a language binding work in Spark?
I mean, Spark core is written in Scala. Then how does it execute Python code or R code? Once you learn the high-level architecture of the language bindings, you will have a good understanding of the performance bottlenecks.

Spark performance for Scala

Let's start with a Spark architecture refresher.

Fig.1 - Spark Execution Model for Scala and Java

When we submit a Spark application to the cluster, the framework creates a driver process. The Spark framework is Scala code, and hence the driver process is a JVM process. Right? Now assume you wrote your Spark application in a JVM language. There is no complication in loading your code into a JVM. Your application is in Java or Scala, and the driver is a JVM process. Everything is in the same language. Right?
What happens next?
Since Spark is a distributed computing framework, the driver process does not execute all the code locally. Instead, the driver serializes the application code and sends it to the worker processes. The worker processes are responsible for the actual execution. Right?
These worker processes are also JVM processes. Why? That is because the Apache Spark framework is a JVM application. Right? Okay.
We wrote our job in a JVM language, and hence it executes in the worker JVM. If our application code loads some data from distributed storage, that data also comes and sits in the JVM. So, the application code and the data are both in the JVM at the worker end, and hence everything works smoothly without any complication. That is a high-level overview of how things work when everything is in Scala or Java.


Spark performance for Python

The complication starts when we write code using a non-JVM language like Python or R. For simplicity, we will talk about Python only. However, the architecture is almost the same for an R application as well.

Fig.2 - Spark Execution Model for Python and R

Let's try to understand the process when you submit a Python application to the Spark cluster.
Same as earlier, the Spark framework starts a Spark driver process.
But this time, the driver is a Python process because you submitted Python code. We can't run Python code in a JVM, so the framework starts a Python driver. However, the Spark core implementation is in Scala, and those APIs can't execute in a Python process. So, the Spark framework also starts a JVM driver. Now we have two driver processes. The first driver is a Python driver or an R driver, depending upon your application code. The second driver is always a JVM driver. The Python driver talks to the JVM driver using a socket-based API. What does that mean?
Let me explain. You have written Python code, and you are making calls to the PySpark APIs, but ultimately those calls invoke Java APIs on the JVM through socket-based calls.
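To see this bridge in action, here is a minimal PySpark sketch. The _jdf and _jvm attributes used below are internal Py4J handles rather than public APIs and may change between Spark versions; they are shown only to illustrate that every Python object is a thin wrapper around a Java object on the other side of the socket.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("py4j-peek").getOrCreate()
    df = spark.range(5)

    # The Python DataFrame wraps a Java Dataset that lives in the JVM driver.
    # The wrapper talks to it through the Py4J socket gateway.
    print(type(df._jdf))             # a py4j JavaObject proxy
    print(spark.sparkContext._jvm)   # the Py4J JVMView used to reach the JVM

Every transformation you call on df is forwarded over this gateway to the corresponding method of the Java object.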
The JVM driver is responsible for creating all the necessary objects and garbage-collecting them when they go out of scope on the Python side. That's a complicated architecture, and we don't need to get into too many of its details. However, the point that I want to make is straightforward.
Your Python and R code ultimately invokes a Java API. That inter-process communication happens over socket-based APIs, and it adds some overhead. However, the Spark creators claim that the cost is almost fixed and negligible, and they may improve the architecture further in newer releases.
However, if you are collecting the output data back to your driver from the workers, the cost is variable and depends on the amount of data. The output takes additional time to serialize and ship from the JVM to the Python driver. This serialization can take longer depending on how much data you want to retrieve. However, if you are not running interactive workloads and are saving the output to storage instead of bringing it back to the driver, you don't have to worry about this overhead.
So, there is not much of a problem on the driver side if you are not pulling data back to your driver. The only cost is that you will have two different processes, and they will spend a few hundred milliseconds passing control messages. That's it.
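As a rough illustration, compare the two actions below; the file paths are made up. The first one ships every row to the JVM driver and then serializes it into the Python driver, so its cost grows with the size of the result. The second one is written out by the JVM workers directly, and nothing crosses into the Python process.

    # Hypothetical input and output paths, for illustration only.
    events = spark.read.parquet("/data/events")

    rows = events.collect()   # result rows travel JVM workers -> JVM driver
                              # -> Python driver; cost grows with data volume

    events.write.mode("overwrite").parquet("/data/events_cleaned")
                              # written by the JVM workers; no data crosses
                              # into the Python driver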
Now let's look at the worker side. We already know that the JVM driver is responsible for starting the worker processes, and you will always have a JVM worker on the executor nodes. That's the Spark standard. Since your PySpark APIs internally invoke the Spark Java APIs on the driver, it is the Java code that gets serialized and sent to the workers for execution. So, in a typical case, your API calls are translated into Java APIs, and Spark can execute everything in a JVM worker.
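For instance, a job like the sketch below (with made-up paths and column names) uses only built-in DataFrame operations. Even though it is written in Python, the actual filtering, grouping, and writing run inside the JVM workers.

    from pyspark.sql import functions as F

    # Made-up paths and columns; every call maps to a built-in Spark SQL
    # operation, so the JVM workers do all of the real work.
    (spark.read.json("/data/orders")
          .where(F.col("status") == "SHIPPED")
          .groupBy("country")
          .agg(F.sum("amount").alias("revenue"))
          .write.parquet("/reports/revenue_by_country"))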
However, the problem starts when you write pure Python code that doesn't have a corresponding Spark Java API. One example is using a third-party Python library that has no equivalent Java implementation. Another example is creating a UDF in pure Python and calling that UDF in your application.
In these two scenarios, your Python code doesn't have a Java API, so your driver can't invoke a Java method using a socket-based API call. This scenario doesn't arise if you restrict your code to the PySpark APIs. However, if you create a Python UDF or use a third-party Python library that doesn't have a PySpark implementation, you will have to deal with the pure-Python code problem.
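A typical example looks like the sketch below; the function and the sample data are made up. Because mask_email is pure Python, there is no Java implementation behind it, and Spark cannot execute it inside the JVM worker.

    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    # A pure Python UDF: no equivalent Java API exists for this function,
    # so Spark must run it in a separate Python worker process.
    @F.udf(returnType=StringType())
    def mask_email(address):
        user, _, domain = address.partition("@")
        return user[:2] + "***@" + domain

    df = spark.createDataFrame([("alice@example.com",)], ["email"])
    df.select(mask_email("email")).show()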
What do you think? What will happen in that situation?
The driver will serialize the Python code and send it to the worker. The worker will start a Python worker process and execute the Python code.
This Python worker process is an additional overhead for the worker node because you already have a JVM worker there as well.
Invoking one or more Python workers and coordinating work between the JVM and the Python processes is an additional overhead for Spark.
The Spark creators claim that this kind of coordination is not a big deal and that, for a typical Spark job, the communication effort has a negligible impact on performance. However, if Spark needs to move data between these worker processes, it may severely impact the performance. If you have a Python function that needs to work on each row of a partition, then the JVM worker must serialize the data and send it to the Python worker. The Python worker will transform each row and serialize the output back to the JVM. The most critical point is that this problem scales with the volume of the data. The JVM and the Python process also compete for memory on the same worker node. If you don't have enough memory for the Python worker, you may even start seeing exceptions. That is where the Python and R languages start bleeding on a Spark cluster.
However, if you have created your UDF in Scala or used a Java/Scala library, everything happens in the JVM, and there is no need to move data from one process to another.
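If a JVM implementation of the function is available, you can even register it from PySpark and keep the execution inside the JVM workers. The sketch below assumes a hypothetical Scala class com.example.udfs.MaskEmail (implementing Spark's UDF1 interface) is already on the cluster classpath and that a contacts table exists.

    from pyspark.sql.types import StringType

    # Hypothetical class and table names; the function body runs in the JVM
    # workers, so no rows are shipped to a Python worker process.
    spark.udf.registerJavaFunction("mask_email_jvm",
                                   "com.example.udfs.MaskEmail",
                                   StringType())
    spark.sql("SELECT mask_email_jvm(email) FROM contacts").show()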
Great! I hope you now have some understanding of the language selection. You can use this knowledge to make the right decision. However, the bottom line is that a Spark application is a set of jobs, and you don't have to settle for a single language for all of them. You can take that decision on a job-by-job basis.
Thank you for watching Learning Journal. Keep learning and keep growing.

