Apache Spark Foundation Course - Single Node Spark setup


In this video, I will help you set up your environment and get access to the latest and greatest version of Apache Spark. There are many ways to do this, and I will cover some convenient and modern methods. We will set up a standalone Spark on your local machine. We will go further and integrate Jupyter Notebook for Scala, Python, and Spark SQL. By the end of this video, you will be able to do the following things.

  1. Download and Install Apache Spark on your Linux machine.
  2. Access Spark from Spark Shell - Scala Shell.
  3. Access Spark from PySpark - Python.
  4. Install Jupyter notebooks - web interface to Spark.
  5. Install and Configure Apache Toree - a Jupyter kernel for Spark.
  6. Access Spark from Jupyter Notebook - Scala, Python, and Spark SQL.

I will execute one simple example to show you the whole process. If you are new to Notebooks, don't worry. I will cover that as well.
Great. Let's start.

Install Apache Spark

You need a Linux machine with JDK 8 installed. You can download the latest JDK RPM from the Oracle Technology Network and install it using yum. I recommend downloading an RPM. You can use the following commands.

                                
    wget -c --header "Cookie: oraclelicense=accept-securebackup-cookie" http://download.oracle.com/otn-pub/java/jdk/8u144-b01/090f390dda5b47b9b721c7dfaa008135/jdk-8u144-linux-x64.rpm
    yum localinstall jdk-8u144-linux-x64.rpm
                         

If you don't have a Linux machine, check out my Google Cloud tutorials. You can get a free VM on the Cloud.
You can download the latest version of Apache Spark from the official downloads page. If you are using a VM in Google Cloud, you can use the below command to download Spark directly to your Cloud VM.

                                
    wget -c https://d3kbcqa49mib13.cloudfront.net/spark-2.2.0-bin-hadoop2.7.tgz                                         
                        

Apache Spark installation is as simple as extracting the contents of the file. You can use the following commands.

                                
    mkdir spark
    tar -zxvf spark-2.2.0-bin-hadoop2.7.tgz -C spark/
                        

If you check the spark directory, you should see a subdirectory. The absolute path of that subdirectory is your Spark home. You should set it as an environment variable.

                                
    vi .bash_profile
    export SPARK_HOME=~/spark/spark-2.2.0-bin-hadoop2.7
    export PATH=$PATH:$SPARK_HOME/bin
    source .bash_profile
                        

Great. You are ready to start learning Spark. You don't need anything else. This setup is enough for learning purposes. You can start the Spark shell using the below command, load some data, and execute some queries.

                                
    spark-shell
    val df = spark.read.json("data/people.json")
    df.filter("age > 21").select("name","age").show()
                         
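The examples above read `data/people.json`. Spark ships a sample copy of this file at `examples/src/main/resources/people.json` under your Spark home, so you can copy it into a `data` directory in your working folder. If you prefer to create the file yourself, here is a minimal sketch in Python; the records mirror the bundled sample, and note the JSON Lines format, one object per line.

```python
import json
import os

# Records matching the sample people.json that Spark bundles.
# Spark's JSON reader expects JSON Lines: one object per line,
# not a single JSON array.
people = [
    {"name": "Michael"},
    {"name": "Andy", "age": 30},
    {"name": "Justin", "age": 19},
]

os.makedirs("data", exist_ok=True)
with open("data/people.json", "w") as f:
    for record in people:
        f.write(json.dumps(record) + "\n")
```

Note that the first record has no age field at all; Spark will read that column as null for that row.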

If you want, you can also use SQL.

                                
    df.createOrReplaceTempView("people")
    spark.sql("SELECT * FROM people where age > 21").show()                                     
                        

You can start a Python shell and do the same thing using Python.

                                
    pyspark
    df = spark.read.json("data/people.json")
    df.filter("age > 21").select("name","age").show()
    # You can use SQL
    df.createOrReplaceTempView("people")
    spark.sql("SELECT * FROM people where age > 21").show()                                         
                        
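If the DataFrame operations look opaque, the logic of `filter("age > 21").select("name","age")` can be sketched in plain Python over a list of dicts. This is only an illustration of the semantics, not how Spark executes it internally.

```python
# The same people records, as plain Python dicts.
people = [
    {"name": "Michael"},        # no age -> comparison is null, row dropped
    {"name": "Andy", "age": 30},
    {"name": "Justin", "age": 19},
]

# filter("age > 21"): keep rows whose age is present and greater than 21.
# select("name", "age"): keep only those two columns.
result = [
    {"name": p.get("name"), "age": p.get("age")}
    for p in people
    if p.get("age") is not None and p.get("age") > 21
]

print(result)  # [{'name': 'Andy', 'age': 30}]
```

As in Spark SQL, a row with a null `age` never satisfies the comparison, so Michael is dropped along with Justin.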

Great. Did you notice that the Scala syntax and the Python syntax are almost the same? That's because the Spark DataFrame API is, in effect, a language of its own, and it stays consistent across Scala and Python. We will learn more about it as we progress. I will try to cover both Scala and Python. But I still recommend that you learn Scala. Check out my Scala tutorials.
Okay. I showed you the first three items from the list. The next three items are about configuring a notebook for your Spark.


Spark Notebooks

Notebooks are already popular. If you haven't used them before, you might be wondering about the following questions.

  1. What are they?
  2. Why should I bother about them?

The Jupyter Notebook is an interactive computing environment that enables users to author notebook documents. These documents are not dumb text documents. You can include the following things in a notebook.

  1. Live running code
  2. Interactive widgets
  3. Plots and Graphs
  4. Narrative text
  5. Images
  6. Video

So, I can publish whatever code, plots, and examples I use in my Spark tutorials as a notebook for you. You can use it, execute it as it is, modify it, and also add your comments and notes inline. You will realize the benefits as soon as you start seeing it live in this tutorial.
Notebooks are live documents and are an excellent tool for collaboration among people in your team.
Great. Let's install Jupyter Notebooks. Go to the Jupyter website, and you will notice that they recommend installing Anaconda.
Anaconda is an open-source Python distribution, and it comes bundled with many things, including Jupyter. You install Anaconda, and you get Jupyter as well. So, download Anaconda for Linux. If you are using a Cloud VM, you can use the below command to download it directly to your VM.

                                
    wget -c https://repo.continuum.io/archive/Anaconda3-5.0.0.1-Linux-x86_64.sh                                         
                        

You can install it by executing the following command.

                                
    bash Anaconda3-5.0.0.1-Linux-x86_64.sh                                         
                        

Apache Toree

Installing Jupyter alone is not enough. You need a Jupyter kernel to integrate Jupyter Notebook with Apache Spark. As of now, we have a couple of options.

  1. Sparkmagic
  2. Apache Toree

We will use Apache Toree. The latest release of Toree is 0.1.0, and it doesn't work with Spark 2. However, you can download a Toree dev build that does.
Installing Toree is a two-step process. Use the following commands.

                                
    pip install toree-0.2.0.dev1.tar.gz
    jupyter toree install --spark_home=$SPARK_HOME --interpreters=Scala,PySpark,SQL --user                                    
                        

The above command configures the Scala, Python, and Spark SQL interpreters. The last option, --user, restricts the installation to the current user. If you don't give this option, you might end up in a permission-related problem.

Starting Spark Jupyter Notebook in Local VM

Now all that we need to do is to start a Jupyter Notebook. Create a working directory for yourself. Go to your working directory and start a Jupyter Notebook. If you are using a local Linux machine, you can start it using the below command.

                                
    jupyter notebook  --no-browser                                           
                        

The default IP is localhost, and the default port is 8888. The above command uses the defaults. The last option makes sure that Jupyter doesn't automatically launch a browser. The output of the above command should display a URL. Copy that URL and paste it into your browser. That's it. You should see the Jupyter Notebook.

Starting Spark Jupyter Notebook in Cloud VM

If you are using a cloud VM, the above command would not work for you. There are two extra steps.

  1. Upgrade your VM’s external IP address to a static IP.
  2. Add a firewall rule to open TCP 8888 port.

Please check out the video tutorial for a step-by-step process to complete these steps.
Once you complete those steps, start the Jupyter notebook using below command.

                                
    jupyter notebook --ip=0.0.0.0  --port=8888 --no-browser                                          
                        

The Jupyter server will give you a URL. Copy the URL and replace the 0.0.0.0 with your VM's external IP address. Paste the new URL into your browser. That's it. You can start a new Scala notebook.
If you are new to Jupyter Notebooks, the video gives you a quick introduction and shows some basic operations.
You can also access How to Use Jupyter Notebooks from the Jupyter website.

