Apache Hadoop Foundation Course - Hadoop Architecture Part-2


In this video, I want to cover fault tolerance and high availability features of Hadoop.
Let’s start with Fault Tolerance.

Hadoop Fault Tolerance

Let me ask you a simple question. If a data node fails, what will happen to your file? I mean, your data file was broken into blocks, and you stored them on different data nodes. If one of those data nodes is not working, how will you read your file? You can't, right? You lost some part of your file with that faulty machine.
Hadoop offers a straightforward solution to this problem. Create a backup copy of each block and keep it on some other data node. That's it. If one copy is not available, you can read the other copy. In Hadoop's terminology, this is called the replication factor. We can set the replication factor on a file-by-file basis, and we can change it even after creating a file in HDFS. So, if I configure the replication factor of a file as 2, HDFS will automatically make two copies of each block of that file. Hadoop will also ensure that it keeps these two copies on two different machines. We typically set the replication factor to 3. Making three copies is reasonably good. However, if your file is super critical, you can increase the replication factor to some higher value.
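If you want to see how this looks in practice, here is a minimal sketch using the HDFS shell. The file path is a hypothetical example, not something from this course's setup.

    # Write a file with a replication factor of 2 for this file only
    hdfs dfs -D dfs.replication=2 -put sales.csv /user/demo/sales.csv

    # Change the replication factor of the existing file to 3;
    # -w waits until the new replication level is actually reached
    hdfs dfs -setrep -w 3 /user/demo/sales.csv

    # Verify: fsck reports the replication of every block of the file
    hdfs fsck /user/demo/sales.csv -files -blocks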
Now, let’s come back to our cluster architecture. Let’s assume that we have three copies of the file on three different nodes as shown in the figure.

Fault Tolerance in Hadoop
Fig.1- Fault Tolerance in Hadoop

You may ask a question. What will happen if the entire rack fails? All three copies are gone, isn't it? Hadoop offers a solution to this problem as well. You can configure Hadoop to become rack-aware. Once you set up rack awareness, Hadoop will ensure that at least one copy is placed in a different rack to protect you from a rack failure.
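As a hedged sketch: rack awareness is enabled by pointing Hadoop at a topology script (the net.topology.script.file.name property in core-site.xml) that maps each data node to a rack name such as /rack1 or /rack2. Once that is in place, you can verify where the replicas actually landed. The path below is hypothetical.

    # Print each block's replica locations along with their rack paths
    hdfs fsck /user/demo/sales.csv -files -blocks -racks
    # In the output, confirm that at least one replica sits in a different rack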
HDFS takes this replication factor seriously. Suppose you set the replication factor as 3 and HDFS created three copies, as shown in the figure.

Hadoop Replication Factor
Fig.2- Hadoop Replication Factor

Now, let's assume that one node fails. One failure leaves you with two copies, and another crash will leave you with just one copy, but you wanted HDFS to maintain three copies. We already know that each data node sends a periodic heartbeat to the name node. The name node will soon discover that this particular data node is not sending heartbeats, so it must have failed. In such situations, the name node will initiate replication of the lost blocks and bring the count back to three replicas.
The point that I want to make is that the name node continually tracks the replication factor of each block and initiates re-replication whenever necessary. The necessity for re-replication may arise due to many reasons (a quick way to check for such pending work follows the list):


  1. A data node becomes unavailable
  2. A replica becomes corrupted
  3. A hard disk on a data node fails
  4. You increase the replication factor of a file
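Here is a small, hedged check you can run: the fsck summary for the root path reports under-replicated, corrupt, and missing blocks across the cluster, which is exactly the work the name node will try to repair.

    hdfs fsck /
    # Look for these lines in the summary:
    #   Under-replicated blocks: ...
    #   Corrupt blocks:          ...
    # hdfs dfsadmin -report gives a similar cluster-wide view per data node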

The sole purpose of replication is to protect you against failures, but it costs you a lot of space. Making three copies of your file reduces the usable storage capacity of your cluster to one-third and increases cost. Hadoop 2.x offers storage policies to reduce this cost, and Hadoop 3.x provides erasure coding as an alternative to replication. I will cover both features in a later video.
However, replication is the traditional method for tolerating faults, and the cost concern is not too severe because disks are reasonably cheap.

High Availability in Hadoop

The next thing that I want to cover is the High Availability. Let me ask you some questions.

  1. What is High Availability? If you don't know the answer, there is no point in learning the HA feature of Hadoop.
  2. If you know HA, can you explain how it is different from fault tolerance?

Puzzled? Let me explain.
You already know that HA refers to the uptime of the system. It shows the percentage of time the service is up and operational. Every enterprise desires 99.999% uptime for its critical systems, which translates to barely five and a quarter minutes of downtime in an entire year.
The faults that we discussed earlier, such as a data node failure, a disk failure, or even a rack failure, don't bring the entire system down. Your Hadoop cluster remains available. Some files may not be available due to those faults, but we learned that replication is the solution for them. So, we already have protection against data node failures.

What happens when the Name Node fails?

We already learned that the name node maintains the filesystem namespace. The name node is the only machine in a Hadoop cluster that knows the list of directories and files. It also manages the file-to-block mapping. Every client interaction starts with the name node, so if the name node fails, we can't use the Hadoop cluster. We can't read or write anything to the cluster. So, the name node is the single point of failure in a Hadoop cluster.
To achieve high availability for a Hadoop cluster, we need to protect it against name node failures. The question is how to do it?

How to achieve High Availability in Hadoop?

The standard protection against any failure is a backup. That's it. In this case, we need to take care of two things.

  1. HDFS namespace information.
    All the information that the name node maintains should be continuously backed up at some other place, so that in the case of a failure, you have all the necessary information to start a new name node on a different machine.
  2. Standby name node machine.
    To minimize the time to start a new name node, we should already have a standby computer preconfigured and ready to take over the role of the name node.

Now, let's come to the namespace information backup. We already learned that the name node maintains the entire filesystem namespace in memory, and we call it the in-memory fsImage. The name node also maintains an edit log on its local disk. Every time the name node makes a change to the filesystem, it records that change in the editLog. The editLog is like a journal ledger of the name node. If we have the editLog, we can reconstruct the in-memory fsImage. So, we need to make a backup of the name node's editLog.
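If you are curious what these two files actually contain, Hadoop ships offline viewers for both. A hedged sketch follows; the file names are hypothetical examples of what you would find under the name node's metadata directory (dfs.namenode.name.dir).

    # Dump an on-disk fsImage file to readable XML
    hdfs oiv -i fsimage_0000000000000001234 -o fsimage.xml -p XML

    # Dump an edit log segment to readable XML
    hdfs oev -i edits_0000000000000001235-0000000000000001300 -o edits.xml -p XML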
But the question is where and how?


Hadoop Quorum Journal Manager

There are many options, but the recommended solution in the Hadoop 2.x architecture is the QJM. We call it the Quorum Journal Manager. The QJM is a set of at least three machines, each configured to run a JournalNode daemon. The JournalNode is a very lightweight process, so you can pick any three computers from your existing cluster; we don't need dedicated machines for JournalNodes. Once you have a QJM, the next thing is to configure the name node to write editLog entries to the QJM instead of the local disk.
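Here is a minimal sketch of those two steps, with hypothetical host names jn1, jn2, and jn3.

    # 1. Start a JournalNode daemon on each of the three chosen machines
    #    (Hadoop 3.x syntax; Hadoop 2.x uses: hadoop-daemon.sh start journalnode)
    hdfs --daemon start journalnode

    # 2. Point the name node's edit log at the QJM instead of the local disk,
    #    via this hdfs-site.xml property (shown here only as a reference value):
    #    dfs.namenode.shared.edits.dir = qjournal://jn1:8485;jn2:8485;jn3:8485/mycluster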
You might be wondering why we have three JournalNodes in a QJM. The editLog is so critical that we don't want a backup at only one other place. A quorum works on a majority, so with three JournalNodes the editLog survives the failure of any one of them. In case you need higher protection, you can have a QJM of five or seven nodes.
That's all about making a backup of namespace information.
Let's move to Standby name node.

Standby Hadoop Name Node

So, we add a new machine to the cluster and make it a standby name node. We also configure it to keep reading the editLog from the QJM and keep itself updated. This configuration makes the standby ready to take over the active name node's role in just a few seconds.
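For a rough idea of how this looks, here is a hedged sketch with a hypothetical nameservice "mycluster" and name nodes nn1 and nn2.

    # In hdfs-site.xml you declare the logical nameservice and its two name nodes
    # (reference values only):
    #   dfs.nameservices                        = mycluster
    #   dfs.ha.namenodes.mycluster              = nn1,nn2
    #   dfs.namenode.rpc-address.mycluster.nn1  = nn1-host:8020
    #   dfs.namenode.rpc-address.mycluster.nn2  = nn2-host:8020

    # On the new standby machine, copy the current namespace from the active
    # name node and start the daemon
    hdfs namenode -bootstrapStandby
    hdfs --daemon start namenode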
There are two other important things in an active-standby configuration.
First, all the data nodes are configured to send their block reports to both name nodes. A block report is a kind of health summary of the blocks maintained by the data node.
The second, and final, question for the HA configuration is this: how does the standby know that the active name node has failed and that it should take over the active role?

Hadoop Zookeeper failover controller

We achieve this with ZooKeeper and a failover controller (ZKFC) process running on each name node machine. The ZKFC of the active name node holds a lock in ZooKeeper. The standby name node keeps trying to get the lock, but since the active name node already holds it, the standby never gets it. In case the active name node fails or crashes, the lock held by the active name node expires, and the standby succeeds in acquiring it. That's it. As soon as the standby gets the lock, it transitions from the standby to the active name node role.
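A hedged sketch of enabling this automatic failover, assuming a ZooKeeper ensemble on the hypothetical hosts zk1, zk2, and zk3:

    # Relevant properties (reference values only):
    #   dfs.ha.automatic-failover.enabled = true
    #   ha.zookeeper.quorum               = zk1:2181,zk2:2181,zk3:2181

    # One-time: create the lock znode structure in ZooKeeper
    hdfs zkfc -formatZK

    # Start a failover controller on each name node machine (Hadoop 3.x syntax)
    hdfs --daemon start zkfc

    # Check which name node currently holds the active role
    hdfs haadmin -getServiceState nn1
    hdfs haadmin -getServiceState nn2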

Hadoop high availability using qjm
Fig.3- Hadoop high availability using QJM

What is Hadoop Secondary Name Node

There is another component of HDFS which is worth mentioning at this point.
Secondary name node.
The secondary name node is often confused with the standby name node. As I explained in this session, a standby is a backup for the name node. In the case of a name node failure, the standby should take over and perform the name node's responsibilities. However, the secondary name node takes care of a different responsibility.
Let me explain.
We already learned about the following two things.

  1. In memory fsImage
  2. edit logs

The in-memory fsImage is the latest and most up-to-date picture of the Hadoop filesystem namespace. The edit log is the persistent record of all the changes made to the filesystem. Correct?
Let me ask you a question.
What will happen if you restart your name node?
You may need to restart the name node due to some maintenance activity. On a restart, the name node will lose the fsImage because it is an in-memory image. Right? That shouldn't be a problem because we also have the editLog. The name node will read the editLog and create a new fsImage. There isn't any problem with this approach except one.
The problem is the time it takes to read the editLog. The log keeps growing bigger and bigger every day, and its size directly impacts the restart time of the name node. We don't want our name node to take an hour to start just because it is reading the edit log and rebuilding the picture of the latest filesystem state.

Hadoop Secondary Name Node Checkpoint

The secondary name node is deployed to solve this problem. The secondary name node performs a checkpoint activity every hour. During the checkpoint, the secondary name node reads the edit log, creates the latest filesystem state, and saves it to disk. This state is exactly the same as the in-memory fsImage, so we call it the on-disk fsImage. Once we have the on-disk fsImage, the secondary name node truncates the edit log because all the changes are already applied. Next time, I mean after an hour, the secondary name node will read the on-disk fsImage and apply all the changes from the edit log that accumulated during the last hour. It will then replace the old on-disk fsImage with the new one and truncate the edit log once again.
So, the checkpoint activity is nothing but merging the on-disk fsImage and the editLog. The checkpoint doesn't take much time because we have only an hour's worth of edit log to apply to the on-disk fsImage. The time to read the fsImage is short because the fsImage is small compared to the editLog, and I hope you understand the reason. The editLog is like a journal ledger that records every transaction, whereas the fsImage is like a balance sheet that shows only the final state. So, the fsImage is small.
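As a hedged reference, the "every hour" behaviour comes from two configurable properties, and you can also force the name node itself to write a fresh on-disk fsImage when needed (it must be in safe mode for that).

    # Checkpoint frequency, hdfs-site.xml (defaults shown as reference values):
    #   dfs.namenode.checkpoint.period = 3600      (seconds, i.e., once an hour)
    #   dfs.namenode.checkpoint.txns   = 1000000   (or after this many transactions)

    # Manually save a fresh on-disk fsImage
    hdfs dfsadmin -safemode enter
    hdfs dfsadmin -saveNamespace
    hdfs dfsadmin -safemode leave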
Great! So, in the case of a restart, the name node performs the same kind of merge: it loads the latest on-disk fsImage and replays only the small edit log, which it can finish in a short time.
I hope you understand this process. The whole purpose of the secondary name node and the checkpoint is to minimize the restart time of the name node.

What is Hadoop Secondary Name Node
Fig.4- What is Hadoop Secondary Name Node

The secondary name node service is not required when we implement the Hadoop HA configuration. The standby name node also performs the checkpoint activity, and hence we don't need a secondary name node in an HA configuration.
I talked about the various HDFS architecture features and tried to explain how they work. A good understanding of the architecture and its functionalities takes you a long way in designing and implementing a practical solution. The actual setup and configuration specifics are for admin and operations people. Since our focus is on developers and architects, we skip those details; however, I will cover some basics of installation and configuration in the following videos.
Thank you for watching Learning Journal. Keep learning and keep growing.

