Apache Spark Foundation Course - Multinode Spark


If you are learning Hadoop and Apache Spark, you will need some infrastructure. You can do a lot on your local machine or try things out in a VM on your laptop or desktop. But sooner or later, you will run into the following problems.

  1. You have limited resources, so you can only create a single-node or standalone environment. But you want to learn on a real multi-node cluster.
  2. Installation and setup of Hadoop and Spark is a tedious and time-consuming process. As an alternative, you can download a ready-to-use VM image, but those images don't give you a multi-node cluster.

In this video, I will set up a six-node Hadoop and Spark cluster. You can do that in less than two minutes. And most importantly, you don't have to buy expensive machines, download large files, or pay anything to get this cluster.
In the very first video of this tutorial, we created a free Google Cloud Platform account and a VM in GCP. You can use your free GCP account to set up a six-node Hadoop cluster.

Set up Google Cloud Dataproc

Google offers a managed Spark and Hadoop service. They call it Google Cloud Dataproc. It is almost the same as Amazon's EMR.
You can use the Dataproc service to create a Hadoop and Spark cluster in less than two minutes.
The video shows you the step-by-step process. However, the steps are summarized below.

  1. Set up your Google Cloud account and a default project.
  2. Start your Google Cloud Console.
  3. Go to the products and services menu.
  4. Scroll down to Dataproc and select Clusters.
  5. Hit the create cluster button.
  6. Give a name to your Cluster and choose your region.
  7. Choose a machine type for your master node.
  8. Select a cluster type.
  9. Select the disk size for the master node.
  10. Choose the data node configuration and the number of workers.
  11. Select the disk size for each worker.
  12. Select an initialization action (Optional)
  13. Hit the create button.

Wait a minute or two, and the Dataproc API will provision your cluster. There is nothing to download and nothing to install. Your Spark cluster should be ready in just a few clicks.
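If you prefer the command line over the console, the gcloud tool can create an equivalent cluster. The command below is only a sketch, assuming the Google Cloud SDK is installed and initialized; the cluster name, region, zone, machine types, and disk sizes are placeholders that you should adjust to your own setup.

    # One master node and five workers, i.e., a six-node cluster.
    gcloud dataproc clusters create spark-6 \
        --region=us-east1 \
        --zone=us-east1-c \
        --master-machine-type=n1-standard-2 \
        --master-boot-disk-size=50GB \
        --num-workers=5 \
        --worker-machine-type=n1-standard-2 \
        --worker-boot-disk-size=50GB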


How to access the Spark cluster in the cloud?

You would want to do at least two things.

  1. SSH to one of the nodes in your cluster.
  2. Access web-based UIs, for example, the Resource Manager UI.

The first part is simple. Click on the cluster in the Google Cloud console, and you should be able to see the list of VMs. You can SSH to the master node.
In fact, if you check your GCP dashboard, you will see all of your VMs. You can SSH to any of them. You may not want to do that, but GCP doesn't stop you from accessing your VMs.
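If you would rather use your own terminal than the browser-based SSH button, a command along these lines should also work once the Google Cloud SDK (covered in the next section) is installed. The node name and zone are only examples, taken from the cluster used later in this article.

    # SSH into the master node of the example cluster.
    gcloud compute ssh spark-6-m --zone=us-east1-c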
Now the next part: how to access the web-based UIs.
All such UIs are available on different ports. You could create a firewall rule and open those ports, but I don't recommend that, because those cluster services are not secure, and you don't want to leave several ports open to attackers.
There is a more secure alternative: an SSH tunnel.

How to create an SSH tunnel?

It's a two-step process.

  1. Build an SSH tunnel to the master node.
  2. Configure your browser to use SOCKS proxy. The proxy will route the data from your browser through the SSH tunnel.

Follow these steps.

  1. Download and install the Google Cloud SDK.
  2. Start the installer and follow the on-screen instructions. The installer automatically starts a terminal window and runs an init command. It will ask you for your GCP account credentials and the default project name.
  3. Start a terminal and use the gcloud compute ssh command shown after this list. The command opens a tunnel from port 10000 on your local machine to the master node spark-6-m in zone us-east1-c. You can change the zone name and the master node name based on your cluster setup.
    The -D flag enables dynamic port forwarding, and -N instructs SSH not to open a remote shell.
  4. After executing the command, minimize the command window.
  5. The next step is to start a new browser session that uses the SOCKS proxy through this tunnel. Start a new terminal and launch your browser using the command explained below.
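For step 3 above, the tunnel command could look like this. Treat it as a sketch: the zone us-east1-c and the node name spark-6-m come from the example cluster, and the exact way to pass the SSH flags can differ slightly between SDK versions and operating systems.

    # Open a SOCKS tunnel on local port 10000 to the master node, without a remote shell.
    gcloud compute ssh spark-6-m --zone=us-east1-c --ssh-flag="-D" --ssh-flag="10000" --ssh-flag="-N"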

I am starting chrome.exe with my YARN Resource Manager URL. The next option is the proxy server; it should use the socks5 protocol on my local machine's port 10000, which is the port where we started the SSH tunnel. The next flag prevents Chrome from doing any DNS resolution on its own. Finally, the last option points to a non-existent directory path, which forces Chrome to start a brand-new session.
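On Windows, the browser command could look something like this. Again, this is only a sketch: the Chrome installation path and the user-data directory are placeholders, spark-6-m is the example master node, and 8088 is the usual YARN Resource Manager port.

    "C:\Program Files (x86)\Google\Chrome\Application\chrome.exe" ^
      "http://spark-6-m:8088" ^
      --proxy-server="socks5://localhost:10000" ^
      --host-resolver-rules="MAP * 0.0.0.0 , EXCLUDE localhost" ^
      --user-data-dir="C:\tmp\spark-6-m"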
That's it. You can access the resource manager in the new browser.
This video demonstrated how to get a Spark cluster. You can access it over the web and over SSH. Execute your jobs, play with it, and later go back to your Dataproc clusters list and delete it. There is no option to keep the cluster in a shutdown state. Creating and removing a cluster is as simple as a few clicks. You can create a new one every day, use it, and then throw it away.

