Hadoop/Spark in Google cloud
If you are learning Hadoop and Apache Spark, you will need some infrastructure. You can do a lot on your local machines or try things in a VM on your laptop or a desktop. But sooner or later, you will realize following problems.
- You have limited resources to create a single node or a standalone environment. But you want to learn more using a real multi-node cluster.
- Installation and setup of Hadoop and Spark is a tedious and time taking process. As an alternative, you can download a ready to use VM image, but they does not offer you a multi-node cluster.
In this article, I will set up a six-node Hadoop and Spark cluster. I will do it in
less than two minutes. And most important
thing. You don't have to buy expensive machines, download large files, or pay
to get this cluster.
This article is also available as a video tutorial.
In the earlier article, we created a free account in Google Cloud Platform and also created a VM in GCP. You can use your Free GCP account to set up a six-node Hadoop cluster.
Provision a cluster
Google offers a managed Spark and Hadoop service. They call it Google Cloud Dataproc. It is almost same as Amazon's EMR. You can use Dataproc service to create a Hadoop and Spark cluster in less than two minutes. Let me outline a step by step process.
- I am assuming that you have already setup your Google cloud account and a default project as explained in the first part of this article.
- Start your Google Cloud Console. Go to products and services menu. Scroll down to Data Proc and select Clusters.
- Hit the create cluster button and give a name to your cluster.
- Choose your region. The default selection is global. We don't want our cluster to spread across the globe. I want to limit all my nodes in a single region and a single zone.
- Choose a machine type for your master node. With your free GCP account, you cannot have more than eight running CPU cores at a time. The paid account doesn't have such limits. But eight CPU cores are good enough for us. I will give two CPU cores to the master node.
- The next option is to select a cluster type. We have three choices.
- A single node cluster.
- A standard multi-node cluster.
- A High availability multi-node cluster.
- The last option (High availability multi-node cluster) will create three master nodes and enable High Availability. We will go with the second option (standard multi-node cluster).
- You should reduce the disk size to 32 GB. We don't need more than that on a machine.
- Choose the data node configuration. You should pick up a single CPU for each worker and take five workers.
- Reduce the disk size to 32 GB for each worker.
- The selected configuration gives you a six-node Hadoop cluster. One master and five workers. Hit the create button, and the Dataproc will launch a six-node cluster in less than two minutes. The new Cluster comes with Hadoop and Spark.
No downloads, No installation, nothing. Your cluster should be ready in just a few minutes. We used Cloud console UI to create this cluster, but you can use Cloud SDK to create a script and launch it by just executing a script file from your local machine. You can also achieve the same results using REST APIs. The SDK and the APIs are the primary tools for automating things in Google Cloud.
How to access Hadoop/Spark cluster?
Now the next part of it. How to access and use the cluster that we just created. You would want to do at least two things.
- SSH to one of the nodes in this cluster.
- Acces Web-based UIs (For example resource manager UI)
The first part is simple. Click on the cluster, and you should be able to see the
list of VMs. The first one listed there
is the hostname of your master, and rest of them are the worker nodes. You can ssh
the master node.
In fact, if you check your dashboard, you will see your VMs. You can ssh to any of those VMs. You may not want to do that, but the GCP doesn't stop you from accessing your VMs. Once you ssh to the master node, you will realize that these VMs are Debian. So, your CentOS or Redhat specific commands may not work. But that shouldn't be a problem.
Once you ssh into your VM, you can access HDFS using standard Hadoop shell commands. You can also start Spark shell. You can submit a spark job using spark-submit utility.
Check out my Spark Tutorials for more on Spark. I am using this cluster in many of my Spark tutorial videos.
How to access web based UIs?
All of the cluster UIs are available on different HTTP ports. You have an option to create a firewall rule and open those ports. However, I don't recommend you to do that. Because these services are not secure and you don't want to open several ports for attackers. Right? There is another easy alternative as explained below.
It's a two-step process.
- Create an SSH tunnel to the master node.
- Configure your browser to use SOCKS proxy. The proxy will route the data from your browser through the SSH tunnel.
Follow these steps to complete both of these activities.
- Install Google Cloud SDK. The link to download the SDK is available here.
- Once downloaded, start the SDK installer on your local machine and follow the on-screen instructions. The Installer automatically starts a terminal window and runs an init command. It will ask you for your GCP account credentials and the default project name.
- Once you have the SDK, you should be able to use gcloud and gsutil command line tools.
- Use below command to create an SSH tunnel.
gcloud compute ssh --zone=us-east1-c --ssh-flag="-D" --ssh-flag="10000" --ssh-flag="-N" "spark-6-m"
The command gcloud compute ssh will open a tunnel from port 10000 on your local machine to zone us-east1-c and the node spark-6-m. You can change the zone name and the master node name based on your actual cluster setup. The -D flag is to allow dynamic port forwarding and -N is to instruct gcloud that it should not open a remote shell.
- The above command should launch a new window. Minimize all the windows.
- The next step is to start a new browser session that uses the SOCKS proxy through the tunnel. Start a new terminal. Start your browser using the command shown below.
"C:\Program Files (x86)\Google\Chrome\Application\chrome.exe" "http://spark-6-m:8088" --proxy-server="socks5://localhost:10000" --host-resolver-rules="MAP * 0.0.0.0 , EXCLUDE localhost" --user-data-dir=/tmp/spark-6-m
I am starting chrome.exe with a URL (http://spark-6-m:8088). This URL is my YARN Resource Manager. Next item in the above command is the proxy server. It should use the socks5 protocol on your local machine's port 10000. That's the port where you started the SSH tunnel. Right? The next flag is to avoid any DNS resolves by chrome. Finally, the last option is a non-existent directory path. This option allows chrome to start a brand new session.
- The above command should launch a new chrome browser session. You should be able to access the resource manager UI.
We started the new browser window using a tunnel to the master
and hence, you can use any UI that is available over the master node by just
Great. You have the cluster. You can access it over the web and SSH. Execute your Jobs, play with it and later go back to your Dataproc clusters list and delete it. We don't have an option to keep it there in shutdown state.
Creating and removing a cluster is as simple as few clicks. You can create a new one every day, use it and then throw it away.
Where do I keep my data?
You might be wondering about your data. If you delete a cluster, what happens to
your data. Well, If you stored it in HDFS,
you lose it.
That's a fundamental architectural difference in elastic clusters and persistent clusters. Even if you are using Amazon's EMR, you don't keep data in HDFS. You would be keeping it in S3 or somewhere else.
In case of Google Dataproc, you have many options. If you have a data file, you will keep it in GCS Bucket (Google Cloud Storage Bucket). Check out my Spark training for more details. I am using GCS buckets in my Spark tutorials.