In this video, I will help you set up your environment and get access to the latest version of Apache Spark. There are many ways to do this, and I will cover some convenient and modern methods. We will set up a standalone Spark on your local machine. We will go further and integrate Jupyter Notebook for Scala, Python, and Spark SQL. By the end of this video, you will be able to do the following things.
- Download and Install Apache Spark on your Linux machine.
- Access Spark from Spark Shell - Scala Shell.
- Access Spark from PySpark - Python.
- Install Jupyter notebooks - web interface to Spark.
- Install and Configure Apache Toree - Jupyter Kernel for Spark.
- Access Spark from Jupyter Notebook - Scala, Python, and Spark SQL.
I will execute one simple example to show you the whole process. If you are new to notebooks, don't worry. I will cover that as well.
Great. Let's start.
Install Apache Spark
You need a Linux machine with JDK 8 installed. You can download the latest JDK RPM from Oracle Technet and install it using yum. You can use the following commands.
wget -c --header "Cookie: oraclelicense=accept-securebackup-cookie" http://download.oracle.com/otn-pub/java/jdk/8u144-b01/090f390dda5b47b9b721c7dfaa008135/jdk-8u144-linux-x64.rpm
yum localinstall jdk-8u144-linux-x64.rpm
You can download the latest version of Apache Spark from the official downloads page. If you are using a VM in Google Cloud, you can use the command below to download Spark directly to your cloud VM.
wget -c https://d3kbcqa49mib13.cloudfront.net/spark-2.2.0-bin-hadoop2.7.tgz
Apache Spark installation is as simple as extracting the contents of the file. You can use the following commands.
mkdir spark
tar -zxvf spark-2.2.0-bin-hadoop2.7.tgz -C spark/
If you check the spark directory, you should see a subdirectory. The absolute path of that subdirectory is your Spark home. You should set it up as an environment variable by adding the export lines below to your .bash_profile.
vi .bash_profile
export SPARK_HOME=~/spark/spark-2.2.0-bin-hadoop2.7
export PATH=$PATH:$SPARK_HOME/bin
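If you are not sure which directory name the archive produced, a small Python helper can locate it and print the two export lines for you. This is only a sketch: the function names `find_spark_home` and `profile_lines` are made up for illustration, and it assumes the archive was extracted into ~/spark as shown above.

```python
from pathlib import Path


def find_spark_home(extract_dir):
    """Return the single spark-* subdirectory created by extracting the archive."""
    candidates = sorted(
        p for p in Path(extract_dir).iterdir()
        if p.is_dir() and p.name.startswith("spark-")
    )
    if len(candidates) != 1:
        raise RuntimeError(
            f"expected exactly one spark-* directory in {extract_dir}, "
            f"found {len(candidates)}"
        )
    return candidates[0].resolve()


def profile_lines(spark_home):
    """Build the two lines to append to .bash_profile."""
    return [
        f"export SPARK_HOME={spark_home}",
        "export PATH=$PATH:$SPARK_HOME/bin",
    ]


if __name__ == "__main__":
    # Assumption: the archive was extracted into ~/spark.
    extract_dir = Path.home() / "spark"
    try:
        print("\n".join(profile_lines(find_spark_home(extract_dir))))
    except (OSError, RuntimeError) as error:
        print(f"Could not determine Spark home: {error}")
```

Copy the printed lines into your .bash_profile and open a new shell so that the variables take effect.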
Great. You are ready to start learning Spark. You don't need anything else. This setup is enough for learning purposes. You can start the Spark shell using the command below, load some data, and execute some queries.
spark-shell
val df = spark.read.json("data/people.json")
df.filter("age > 21").select("name", "age").show()
If you want, you can also use SQL.
df.createOrReplaceTempView("people")
spark.sql("SELECT * FROM people WHERE age > 21").show()
You can start a Python shell and do the same thing using Python.
pyspark
df = spark.read.json("data/people.json")
df.filter("age > 21").select("name", "age").show()
# You can use SQL
df.createOrReplaceTempView("people")
spark.sql("SELECT * FROM people WHERE age > 21").show()
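The examples above read data/people.json, which is not shown here. By default, spark.read.json expects one JSON object per line, so a test file can be generated with plain Python. The sketch below is only an illustration: the records are invented sample data, and `write_json_lines` and `older_than` are hypothetical helper names.

```python
import json
from pathlib import Path

# Hypothetical sample records; the names and ages are made up for illustration.
PEOPLE = [
    {"name": "Michael", "age": 35},
    {"name": "Andy", "age": 30},
    {"name": "Justin", "age": 19},
]


def write_json_lines(path, records):
    """Write one JSON object per line, the layout spark.read.json reads by default."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record) + "\n")
    return path


def older_than(records, age):
    """Plain-Python counterpart of filter("age > 21").select("name", "age")."""
    return [
        (r["name"], r["age"])
        for r in records
        if r.get("age") is not None and r["age"] > age
    ]
```

Calling write_json_lines("data/people.json", PEOPLE) from the directory where you start the shell creates the file, and older_than(PEOPLE, 21) tells you which rows the Spark query should return for this sample data.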
Great. Did you notice that the Scala syntax and the Python syntax are almost the same? That's because both shells call the same Spark DataFrame API, so Spark feels like a language in itself.
We will learn more about it as we progress. I will try to cover both Scala and Python, but I still recommend that you learn Scala. Check out my
Okay. I showed you the first three items from the list. The next three items are about configuring a notebook for Spark.
Notebooks are already popular. If you haven't used one before, you might be wondering about the following questions.
- What are they?
- Why should I bother about them?
The Jupyter Notebook is an interactive computing environment that enables users to author notebook documents. These documents are not plain, static text documents. You can include the following things in a notebook.
- Live running code
- Interactive widgets
- Plots and Graphs
- Narrative text
So, whatever code, plots, and examples I use in my Spark tutorials, I can publish as a notebook for you. You can use it, execute it as it is, modify it, and add your own comments and notes inline. You will realize the benefits as soon as you start seeing it live in this tutorial.
Notebooks are live documents and are an excellent tool for collaboration among people in your team.
Great. Let's install Jupyter Notebooks. Go to the Jupyter website, and you will notice that they recommend installing Anaconda.
Anaconda is an open-source Python distribution, and it comes bundled with many things, including Jupyter. You install Anaconda, and you get Jupyter as well. So, download Anaconda for Linux. If you are using a cloud VM, you can use the command below to download it directly to your VM.
wget -c https://repo.continuum.io/archive/Anaconda3-184.108.40.206-Linux-x86_64.sh
You can install it by running the downloaded installer script with bash and following the prompts.
Installing Jupyter alone is not enough. You need a Jupyter kernel to integrate Jupyter Notebook with Apache Spark. As of now, we have a couple of options.
- Sparkmagic
- Apache Toree
We will use Apache Toree. The latest released version of Toree is 0.1.0, and it doesn't work with Spark 2. However, you can download a Toree dev build.
Installing Toree is a two-step process. Use the following commands.
pip install toree-0.2.0.dev1.tar.gz
jupyter toree install --spark_home=$SPARK_HOME --interpreters=Scala,PySpark,SQL --user
The jupyter toree install command configures kernels for Scala, Python, and Spark SQL. The last option, --user, restricts the installation to the current user. If you don't give this option, you might run into a permission problem.
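To confirm that the kernels were registered, you can run jupyter kernelspec list, or read the kernel.json files directly. The sketch below takes the second route. It assumes the per-user kernel directory on Linux is ~/.local/share/jupyter/kernels, and `list_kernels` is a hypothetical helper name. The exact directory names Toree creates may differ from what you expect, so treat the output as the source of truth.

```python
import json
from pathlib import Path


def list_kernels(kernels_dir):
    """Map each kernel directory name to the display_name in its kernel.json."""
    kernels = {}
    root = Path(kernels_dir)
    if not root.is_dir():
        return kernels
    for spec_file in sorted(root.glob("*/kernel.json")):
        with spec_file.open(encoding="utf-8") as handle:
            spec = json.load(handle)
        # Fall back to the directory name if display_name is missing.
        kernels[spec_file.parent.name] = spec.get("display_name", spec_file.parent.name)
    return kernels


if __name__ == "__main__":
    # Assumed per-user kernel location on Linux; adjust if yours differs.
    user_kernels = Path.home() / ".local" / "share" / "jupyter" / "kernels"
    for name, display_name in list_kernels(user_kernels).items():
        print(f"{name}: {display_name}")
```

You should see one entry per interpreter that you passed to --interpreters.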
Starting Spark Jupyter Notebook in Local VM
Now all we need to do is start a Jupyter Notebook. Create a working directory for yourself, go to that directory, and start the notebook server from there. If you are using a local Linux machine, you can start Jupyter Notebook using the command below.
jupyter notebook --no-browser
The default IP is localhost, and the default port is 8888. The above command uses these defaults. The --no-browser option makes sure that Jupyter doesn't automatically launch a browser. The output of the above command should display a URL. Copy that URL and paste it into your browser. That's it. You should see the Jupyter Notebook.
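If the page does not load, it helps to check whether anything is listening on the default address at all. Here is a minimal sketch using only the Python standard library. The helper name `port_is_open` is made up for illustration, and a successful connection only shows that some process accepts TCP connections on that port, not that it is Jupyter.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # 127.0.0.1 and 8888 are the Jupyter defaults mentioned above.
    state = "listening" if port_is_open("127.0.0.1", 8888) else "not reachable"
    print(f"127.0.0.1:8888 is {state}")
```

The same function can be pointed at a remote address later, for example to check that a firewall rule really lets traffic through to the port.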
Starting Spark Jupyter Notebook in Cloud VM
If you are using a cloud VM, the above command will not work for you. There are two extra steps.
- Upgrade your VM’s external IP address to a static IP.
- Add a firewall rule to open TCP port 8888.
Please check out the video tutorial for a step-by-step walkthrough of both tasks.
Once you complete those steps, start the Jupyter notebook using the command below.
jupyter notebook --ip=0.0.0.0 --port=8888 --no-browser
The Jupyter server will give you a URL. Copy the URL and replace 0.0.0.0 with your VM's external IP address. Paste the new URL into your browser. That's it. You can start a new Scala notebook.
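The host replacement described above can also be done programmatically. The following sketch swaps the host part of the URL and keeps the port, path, and token untouched. The function name `with_external_host` is hypothetical, the address 203.0.113.10 is a reserved documentation address, and the token is a placeholder, so substitute the values from your own VM and server output.

```python
from urllib.parse import urlsplit, urlunsplit


def with_external_host(url, external_ip):
    """Replace the host in a Jupyter URL while keeping port, path, and query."""
    parts = urlsplit(url)
    port = f":{parts.port}" if parts.port else ""
    return urlunsplit(parts._replace(netloc=f"{external_ip}{port}"))


if __name__ == "__main__":
    # Placeholder URL in the shape the server prints; the token is not real.
    printed_url = "http://0.0.0.0:8888/?token=abc123"
    print(with_external_host(printed_url, "203.0.113.10"))
```

Keeping the query string intact matters because the token in it is what authenticates your browser session.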
If you are new to Jupyter Notebooks, the video gives you a quick introduction and shows some basic operations.
You can also read How to Use Jupyter Notebooks on the Jupyter website.