In the previous session, we tried to understand HDFS at a high level. In this session, I will cover HDFS architecture and core concepts. This video will take you one level deeper into HDFS. So, let's start.
Hadoop Cluster Architecture
We already know that HDFS is a distributed file system. So, the first thing that comes to my mind is a bunch of computers. We need to network them together and form a cluster. The diagram below shows how a simple but typical Hadoop cluster is networked.
Hadoop Cluster Rack
One column of computers is called a rack. The term rack is essential because there are a few concepts associated with it. So, let me explain the rack.
A rack is nothing but a kind of box. We fix multiple computers into a rack. Typically, each rack has its own power supply and a dedicated network switch. So, if the switch fails or there is a problem with the rack's power supply, all the computers within the rack can go off the network. The point I am trying to make is that an entire rack can fail. Just keep this in mind and come back to the Hadoop cluster.
So, that is how a typical Hadoop cluster is networked. We have multiple racks, each with its own switch. Finally, we connect all these switches to a core switch. So, everything is on a network, and we call it the Hadoop cluster.
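If it helps to see the rack idea in code, here is a minimal Python sketch. It is only a toy model: the rack and host names are made up, and nothing here is a Hadoop API. It shows why we care about racks: when one rack fails, every computer in it drops off the network together.

```python
# Toy model of a cluster: each rack maps to the computers mounted in it.
# Rack and host names are hypothetical.
cluster = {
    "rack1": ["node1", "node2", "node3"],
    "rack2": ["node4", "node5", "node6"],
    "rack3": ["node7", "node8", "node9"],
}


def reachable_nodes(cluster, failed_racks):
    """Return the nodes still on the network when the given racks have failed.

    A rack fails as a unit (its switch or power supply goes down), so every
    node inside a failed rack becomes unreachable at once.
    """
    return [
        node
        for rack, nodes in cluster.items()
        if rack not in failed_racks
        for node in nodes
    ]


# If the switch of rack2 fails, all three of its nodes drop off the network.
print(reachable_nodes(cluster, failed_racks={"rack2"}))
```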
Hadoop Master Node
HDFS is designed using a master-slave architecture. In this architecture, there is one master, and all others are slaves. So, let’s assume that one of those computers is the master, and all others are slaves. The Hadoop master is called the Name Node, and the slaves are called Data Nodes. One friend asked me an interesting question: why do we call it the Name Node? Why not just a Master Node or a Super Node or the King Node?
Hadoop Name Node vs Data Node
We call it the Name Node because it stores and manages the names: the names of directories and the names of files. The data nodes store and manage the data of the files. So, we call them Data Nodes.
Let me explain this. Since HDFS is a filesystem, we can create directories and files using HDFS. There are many ways to do it, and we will look at some examples and a demo later, but for now, assume that you are creating a large file in HDFS. The question is, how does HDFS store the file on a cluster? When we create a file in HDFS, what happens under the hood?
How Does Hadoop Store a File?
There are three actors involved.
- Hadoop Client.
- Hadoop Name Node.
- Hadoop Data Nodes.
The Hadoop client will send a request to the Name Node saying that it wants to create a file. The client will also supply the target directory name and the filename. On receiving the request, the Name Node will perform various checks: the directory already exists, the file doesn’t already exist, and the client has the right to create a file. The Name Node can perform these checks because it maintains an image of the entire namespace in memory (the in-memory fsImage).
If all the checks pass, the Name Node will create an entry for the new file and return success to the client. The file entry now exists, but the file is empty; you haven’t started writing data to it yet. Now it’s time to start writing data.
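The create request can be sketched in a few lines of Python. This is a toy model under my own assumptions, not the real Name Node: the class `ToyNameNode`, its dictionaries, and the user names are all hypothetical. It only shows the three checks running against an in-memory namespace before an empty file entry is recorded.

```python
class ToyNameNode:
    """Hypothetical in-memory namespace; not the real Name Node implementation."""

    def __init__(self):
        # directory path -> set of file names inside it
        self.namespace = {"/": set()}
        # directory path -> set of users allowed to create files there
        self.writers = {"/": {"admin"}}

    def mkdir(self, path, allowed_users):
        """Register a directory and the users who may create files in it."""
        self.namespace[path] = set()
        self.writers[path] = set(allowed_users)

    def create_file(self, user, directory, filename):
        """Run the three checks, then record an empty file entry."""
        if directory not in self.namespace:
            return "error: directory does not exist"
        if filename in self.namespace[directory]:
            return "error: file already exists"
        if user not in self.writers[directory]:
            return "error: permission denied"
        self.namespace[directory].add(filename)
        return "success"


name_node = ToyNameNode()
name_node.mkdir("/data", allowed_users={"alice"})
print(name_node.create_file("alice", "/data", "sales.csv"))  # all checks pass
print(name_node.create_file("alice", "/data", "sales.csv"))  # name is taken now
print(name_node.create_file("bob", "/data", "other.csv"))    # bob has no rights
```

Because the whole namespace sits in memory, each check is just a dictionary or set lookup, which is the reason the text gives for the Name Node being able to answer quickly.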
So, the Client will create an FSDataOutputStream and start writing data to this stream. The FSDataOutputStream is the Hadoop streamer class, and it internally does a lot of work. It buffers the data locally until you accumulate a reasonable amount of data, let’s say 128 MB. We call it a block. An HDFS data block. Right?
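Here is a small Python sketch of that buffering idea. It is a simplified model of the behaviour described above and not the actual `FSDataOutputStream` code; the function name `buffer_into_blocks` is made up, and I use a block size of 8 bytes instead of 128 MB so the result is easy to read.

```python
def buffer_into_blocks(chunks, block_size):
    """Accumulate incoming byte chunks and yield a block whenever the buffer is full.

    The final, possibly smaller, block is yielded when the input ends.
    """
    buffer = bytearray()
    for chunk in chunks:
        buffer.extend(chunk)
        # Emit as many full blocks as the buffer currently holds.
        while len(buffer) >= block_size:
            yield bytes(buffer[:block_size])
            del buffer[:block_size]
    if buffer:
        yield bytes(buffer)


# The application writes small pieces; the streamer hands out whole blocks.
writes = [b"abcde", b"fghij", b"klmno", b"pqr"]
for block in buffer_into_blocks(writes, block_size=8):
    print(block)
```

The application keeps writing small pieces, and only full blocks (plus one last partial block) leave the buffer.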
So, once there is one block of data, the streamer reaches out to the Name Node asking for a block allocation. It is just like asking the Name Node, where do I store this block? The Name Node doesn't store data, but it knows the amount of free disk space at each data node. In fact, it knows the status of all the resources at each data node. With that information, the Name Node can easily assign a data node to store that block. So, the Name Node will perform this allocation and send back the data node name to the streamer.
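To make the allocation step concrete, here is a minimal Python sketch. Please treat the policy as my assumption: I simply pick the data node with the most free space, while the real Name Node weighs more factors than that. The function name and the node names are hypothetical.

```python
def allocate_data_node(free_space_mb, block_size_mb):
    """Pick the data node with the most free space that can still hold one block.

    free_space_mb maps a data node name to its free disk space in MB and is
    updated in place to reflect the new block. "Most free space wins" is an
    assumed policy for illustration only.
    """
    candidates = {
        node: free for node, free in free_space_mb.items() if free >= block_size_mb
    }
    if not candidates:
        raise RuntimeError("no data node has room for the block")
    chosen = max(candidates, key=candidates.get)
    free_space_mb[chosen] -= block_size_mb
    return chosen


free_space_mb = {"dn1": 500, "dn2": 600, "dn3": 100}
print(allocate_data_node(free_space_mb, 128))  # dn2 has the most free space
print(allocate_data_node(free_space_mb, 128))  # now dn1 has more room than dn2
```

Notice that the function never touches the block itself. It only answers the question "where do I store this block?", which is exactly the Name Node's part of the job.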
Now the streamer knows where to send the data block. That’s it. The streamer starts sending the block to the data node. If the file is larger than one block, the streamer will again reach out to the Name Node for a new block allocation. This time, the Name Node may assign some other data node. So, your next block goes to a different data node. Once you finish writing to the file, the Name Node will commit all the changes.
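A quick bit of arithmetic shows how many block allocations a file needs. The Python sketch below is only an illustration, and the helper name `split_into_blocks` is made up. It assumes the 128 MB block size used in this session.

```python
def split_into_blocks(file_size_mb, block_size_mb=128):
    """Return the sizes (in MB) of the blocks that a file of the given size occupies."""
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full_blocks
    if remainder:
        # The last block only holds whatever data is left over.
        sizes.append(remainder)
    return sizes


# A 300 MB file needs three allocations: two full blocks and one partial block.
print(split_into_blocks(300))
```

So for a 300 MB file, the streamer goes to the Name Node three times, and each of those blocks may land on a different data node.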
Hadoop Architecture Summary
I hope you followed this process. Let me summarize some takeaways from this entire discussion.
- HDFS has a master/slave architecture.
- An HDFS cluster consists of a single Name Node and several Data Nodes.
- The Name Node manages the file system namespace and regulates access to files by clients. When I say regulate, I mean checking access permissions and user quotas, etc.
- The Data Node stores file data in the form of blocks.
- Each data node periodically sends a Heartbeat to the Name Node to signal that it is alive. This Heartbeat also includes resource capacity information that helps the Name Node in various decisions. The data node also sends a block report to the Name Node. The block report lists all the blocks that the data node maintains.
- HDFS splits the file into one or more blocks and stores these blocks on different Data Nodes. The Name Node maintains the mapping of the blocks to the file, their order, and all other metadata.
- A typical block size used by HDFS is 128 MB. We can specify the block size on a per-file basis. You should notice that the block size in HDFS is very large compared to local file systems. It was a crucial design decision to minimize the cost of disk seeks. Some cluster setups configure an even larger block size, such as 256 MB. However, choosing too large a block size may have an adverse impact. We will revisit block size in a later video.
- The Name Node determines the mapping of blocks to Data Nodes. But after mapping, the client directly interacts with the Data Node for reading and writing.
- When a client is writing data to an HDFS file, the data first goes to a local buffer. This approach is adopted to provide streaming read/write capability to HDFS.
- The Name Node and Data Node are pieces of software. So, at the minimum configuration, you can run both on the same machine and create a single node Hadoop cluster. But a typical deployment has a dedicated computer that runs only the Name Node software. Each of the other machines in the cluster runs one instance of the Data Node software.
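The Heartbeat takeaway from the list above can also be sketched in Python. This is a toy model and not Hadoop code: the class `HeartbeatMonitor`, the 30-second timeout, and the node names are all my assumptions. Timestamps are passed in explicitly so the example is easy to follow.

```python
class HeartbeatMonitor:
    """Toy Name Node view of which data nodes are alive (hypothetical model)."""

    def __init__(self, timeout_seconds):
        self.timeout_seconds = timeout_seconds
        self.last_seen = {}      # data node -> time of its latest heartbeat
        self.free_space_mb = {}  # data node -> free space reported with it

    def heartbeat(self, node, now, free_mb):
        """Record a heartbeat together with the capacity information it carries."""
        self.last_seen[node] = now
        self.free_space_mb[node] = free_mb

    def live_nodes(self, now):
        """A node counts as alive if its last heartbeat is recent enough."""
        return sorted(
            node
            for node, seen in self.last_seen.items()
            if now - seen <= self.timeout_seconds
        )


monitor = HeartbeatMonitor(timeout_seconds=30)
monitor.heartbeat("dn1", now=0, free_mb=500)
monitor.heartbeat("dn2", now=0, free_mb=600)
monitor.heartbeat("dn1", now=25, free_mb=372)  # dn1 reports again, dn2 goes silent
print(monitor.live_nodes(now=40))              # only dn1 is still considered alive
```

The same message does two jobs here: it proves the data node is alive, and it refreshes the free space figure that the Name Node uses when it allocates blocks.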
Okay. So far, we have talked about the following core architecture elements of HDFS.
- Name Node
- Data Node
- Splitting of files and Data Blocks.
We also talked about
- Block report.
- Block sizes.
- Client-side buffering.
In the next video, I will cover Fault Tolerance and High Availability features of Hadoop.
Thank you for watching learning journal. Keep learning and keep growing.