Apache Spark Foundation Course - How to Create and Submit Spark Applications


Welcome back to Learning Journal. We have been learning Spark examples using the REPL. Now it's time to show you a method for creating a standalone spark application. In this video, I will create a simple example and help you understand the process of creating Spark applications. We will also cover how to deploy your applications on a spark cluster. So, let’s start.

Spark Application build tool

Scala is the Spark's native language, and hence we prefer to write spark applications in Scala. However, before you start writing spark applications, you need to decide on your choice of the build tool. You have two popular options.

  1. SBT
  2. Maven

I prefer using SBT for Apache Spark applications. You can also use Maven. However, I find SBT more simple and easy to use. You can download and Install SBT from the SBT website.
If you need more detail for Installing SBT, you can refer my video tutorials on how to install SBT.
I am assuming that you have already installed and configured SBT and you are familiar with the build process.

Understanding build.sbt for Spark application

Every SBT project needs a build.sbt file. There are many things that a complicated project build might require you to configure in the build.sbt file. However, there are three main things that you must Include in your build file.

  1. Project Metadata
  2. Dependencies
  3. Repository Resolvers

Here is an example of my simple build.sbt file. The build file must be placed in your project home directory. So, you should keep it in my project directory. Let's look inside the build.sbt.

                                
    name := "spark Test App"
    version := "0.1"
    organization := "guru.learningjournal"
    scalaVersion := "2.11.8"
    val sparkVersion = "2.2.0"
    
    libraryDependencies ++= Seq(
    "org.apache.spark" %% "spark-core" % sparkVersion,
    "org.apache.spark" %% "spark-sql" % sparkVersion) 
                         

The first four lines are to define the project metadata. We give a name and version to our project. We also define the organization name.
The next important decision is to define your Scala version as well as the Spark version. For a Spark project, those two are critical values. Both of those are used to resolve the appropriate packages. You need to choose these values based on your target cluster. So, if your production cluster is using Spark 2.2 for Scala 2.11, then you should select those versions accordingly.
I am going to execute my example application on my local mode cluster. So, I need to make sure that the Spark and Scala versions of my build is in sync with my target cluster. There are many ways to check the Spark version. However, you can quickly check your Spark and Scala versions for your cluster using the Spark Shell.

How to check spark scala version
Fig.1-How to check Spark Scala Version

The next part is to define the dependencies.
In this example, we are adding Spark core and Spark SQL. However, you might need to add other dependencies like spark mllib, spark streaming, spark hive, etc. depending upon your application.
In case of our example, the SBT will search for following spark packages.

                                
    org.apache.spark:spark-core_2.11:2.2.0 
    org.apache.spark:spark-sql_2.11:2.2.0    
                         

If you carefully look at the package name, it includes the Scala version 2.11 and Spark version 2.2.0.
And that's why knowing your target cluster versions and defining your Spark, and Scala versions are critically important.

Resolving external Spark dependencies

The last thing is to define a repository resolver. SBT comes with a predefined maven2 repository. So, if your dependencies are available in maven, then you don't need to specify a custom resolver. We only set the dependencies and SBT already knows where to look for those artifacts. Most of the Spark artifacts are available in Maven repository, so you might not even need to define a custom resolver. However, for learning, let's add a new dependency that is not available through standard maven repo. I am adding kafka-avro-serializer by confluent, and this artifact is not available in the standard Maven repository. I have listed the content of the new build file.

                                
    name := "spark Test App"
    version := "0.1"
    organization := "guru.learningjournal"
    scalaVersion := "2.11.8"
    val sparkVersion = "2.2.0"

    libraryDependencies ++= Seq(
    "org.apache.spark" %% "spark-core" % sparkVersion,
    "org.apache.spark" %% "spark-sql" % sparkVersion,
    "io.confluent" % "kafka-avro-serializer" % "3.1.1")
    
    resolvers += "confluent" at "http://packages.confluent.io/maven/"
        
                         

There is one more important point to note here. The Kafka Avro serializer is a Java artifact, and the package doesn't include Scala version in its name. So, we don't want to add the Scala version in the package name, and hence we use a single % symbol for kafka avro serializer. In this case, the SBT will look for the following artifact.

                                
    io.confluent:kafka-avro-serializer:3.1.1    
                         

The effect of removing the % symbol is that it doesn't include the Scala version in the artifact name like earlier.
Great! Now I must configure a dependency resolver for this artifact. And that's what the next line defines. We define that the SBT should also look at the following URL for the required dependencies.

                                
    http://packages.confluent.io/maven/    
                         

However, this dependency is not needed for our example, I added it here to show you an example. But for now, let's remove it from our build file to avoid unnecessary downloads and potential conflicts.
Great! That's all about the build file.

Developing and compiling Spark Application

We are now ready to write our Spark application code. In an ideal case, we need a project directory structure. So, the Scala source code should reside inside the src/main/scala directory. So, you should create the necessary directory structure and a test.scala file as /src/main/scala/test.scala.
Open the file in a text editor and you are ready to write some code.
I am assuming that you already know Scala, if not, I have an excellent tutorial for Scala. Start learning today.
We want to create a command line spark application, and that's the essential requirement to able to submit the application to a Spark cluster.

                                
    package guru.learningjournal.examples

    import org.apache.spark.sql._
    import org.apache.spark.sql.types._
    
    object SparkTestApp {
        def main(args: Array[String]) = {
            val spark  = SparkSession.builder()
                                     .appName("SparkTestApp")
                                     .enableHiveSupport()
                                     .getOrCreate()

            val df = spark.read.json(args(0))
            df.createTempView("surveys")
            spark.sql("""select age, count(*) as frequency
                         from surveys
                         where age between 20 and 65 group by age""")
                .write
                .saveAsTable("survey_frequency")

            spark.stop()
        }
    }            
                         

If you know Scala, you already know that we can create a main method in a Scala object. Scala classes do not allow us to create a static method. So, we create a Scala object. Then we create a main method.
When we work on a REPL, we already have a spark session object ready to use. However, for a Spark Application, you need to create a Spark session. That is simple to do. We have a SparkSession class and a companion object. You can get the details in the Spark Scala documentation. The SparkSession companion object offers you a builder method that returns a builder object.
The builder object provides a bunch of methods to set various properties for a Spark session and ultimately create a new session. The code shows how we do that. Create a SparkSession companion object. Make a call to the builder method. That gives you a builder object. Then you are ready to set some properties. The first one that I defined is the name of my application. The builder object also offers many variations of config methods. Those config methods allow you to set some Spark configurations. However, I don't want to hardcode configurations in my application until it makes sense to define it from inside the application. You can also specify the spark master node details. But again, I don't want to hardcode. Basically, we want to be able to submit it to any spark cluster and specify spark master at the time of submission.
The next line enables the hive support for this session. Finally, we call the getOrCreate method. We do not have an existing spark session in this application, so the method will create a brand-new spark session. Now the spark session should be available to the value spark. We can use this session to work with the spark cluster in the same way as we worked in the REPL.
I don't want to write a lot of code. However, I wanted to mimic a popular use case. We want to load some data, perform some transformations, and finally save the results as a table to be consumed by the downstream system. That's what we are doing in the above code. Load the data from a JSON file and register it as a temporary view. Then we calculate the age frequency, and write the output as a database table. That's it. The last line is to close the session.
We will compile it and package it as a jar file. Then we will submit it to Spark and go back to Spark SQL command line to check if the survey_frequency table is there. To compile and package the application in a jar file, execute the following sbt command.
sbt package
That's it. The jar file is ready, and it should be available in the target directory.

Submitting a Spark Applications

Now we are ready to submit this application to our spark cluster. You can use the below command to execute It in the local cluster.

                                
    spark-submit --master local --class guru.learningjournal.examples.SparkTestApp target/scala-2.11/spark-test-app-2.11-0.1.jar /home/prashant/spark-data/mental-health-in-tech-survey/json-data/surveys.json  
                         

The name for the tool to submit spark application is the spark-submit. As the first parameter, we tell it about the spark master. We are using a local mode spark cluster, and hence the value is local. If you are using a standalone cluster manager that comes along with spark, you would be using a URI for your master node in the below format.

                                
    --master spark://host:port   
                         

If you are using a Mesos managed spark cluster, you will be using a URI like below.

                                
    --master mesos://host:port    
                         

However, in most likely case, you will be using a YARN managed cluster. I have already explained the spark Application behaviour in a YARN cluster . In a YARN cluster, there is no spark master. We submit our application to YARN, and then the YARN starts an application master in a container. And hence, if you are submitting your application to a YARN cluster, all you need is to say the following.

                                
    --master yarn     
                         

You don't have to specify a master URI. Just tell YARN as we said local. That's it.
Rest of the parameters to spark-submit are straightforward. We specify the object name for our application. Then we specify the location of the application jar file. And finally, pass the command line arguments. That's it. Hit the enter button. It will take few seconds to complete.
Now you can go back to spark SQL command line tool or the zeppelin Notebook and check the output table. Amazing! Isn't it.
Thank you very much for watching Learning Journal. Keep learning and keep growing.



You will also like:


Kafka Core Concepts

Learn Apache Kafka core concepts and build a solid foundation on Apache Kafka.

Learning Journal

Apache Spark Introduction

What is Apache Spark and how it works? Learn Spark Architecture.

Learning Journal

Pattern Matching

Scala takes the credit to bring pattern matching to the center.

Learning Journal

Pure Function benefits

Pure Functions are used heavily in functional programming. Learn Why?

Learning Journal

Immutability in FP

The literal meaning of Immutability is unable to change? How to program?

Learning Journal