Kerberos Authentication in Hadoop
Authentication is the first level of security for any system. It is all about validating the identity of a user or a process.
In the simplest case, it means verifying a username and password. In this article, we will try to understand the need for a secure authentication method and how it is implemented in a Hadoop cluster.
In a secure system, users and processes are required to identify themselves. The system then needs to validate that identity. It must ensure that you are who you claim to be.
Authentication doesn't end there. Once your identity is validated, it must flow further down into the system. Your identity must propagate along with every action you take and to every resource that you access on the network. This kind of authentication is needed not only for users; it is also mandatory for every process or service.
In the absence of authentication, a process or a user can pose as a trusted identity and gain access to the data. Most systems implement authentication and identity propagation.
For example, your Linux OS is capable of validating your credentials and propagating your identity further down. Now, coming back to a Hadoop cluster: why can't Hadoop rely on Linux authentication?
Hadoop works on a group of computers. Each computer runs an independent operating system. OS authentication works within the boundary of an OS, but Hadoop works across those boundaries. So, ideally, Hadoop should have a network-based authentication system.
But unfortunately, Hadoop doesn't have a built-in capability to authenticate users and propagate their identity. So, the community had the following options.
- Build an authentication capability into Hadoop.
- Integrate with some other system that is purposefully designed to provide the authentication capability over a networked environment.
They decided to go with the second option. So, Hadoop uses Kerberos for authentication and identity propagation. You may ask a question here: why Kerberos? Why not something else like SSL certificates or OAuth?
Well, OAuth was not there at that time. And they gave two reasons for choosing Kerberos over SSL.
- Performance
- Simplicity
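The reasoning is that Kerberos performs better than SSL, since it relies on symmetric-key operations rather than the more expensive public-key operations that SSL uses, and that managing users in Kerberos is much more straightforward. To remove a user, we just delete the user from Kerberos, whereas revoking an SSL certificate is a more complicated process.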
What is Kerberos?
Kerberos is a network authentication protocol created at MIT. It eliminates the need to transmit passwords across the network, which removes the threat of an attacker sniffing them off the wire.
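To get a feel for how an identity can be proven without the password ever crossing the network, here is a minimal Python sketch. It is not the Kerberos wire protocol: real Kerberos encrypts parts of the exchange with a key derived from the password, while this toy only signs a random challenge. The function names, the salt value, and the sample passwords are all made up for illustration.

```python
import hashlib
import hmac
import os


def derive_key(password: str, salt: str) -> bytes:
    """Turn a password into a long-term key (toy stand-in for Kerberos key derivation)."""
    return hashlib.pbkdf2_hmac("sha256", password.encode(), salt.encode(), 4096)


def prove(key: bytes, challenge: bytes) -> bytes:
    """Client side: show knowledge of the key without revealing it."""
    return hmac.new(key, challenge, hashlib.sha256).digest()


def verify(stored_key: bytes, challenge: bytes, proof: bytes) -> bool:
    """Server side: recompute the proof from the stored key and compare."""
    expected = hmac.new(stored_key, challenge, hashlib.sha256).digest()
    return hmac.compare_digest(expected, proof)


# The server holds only the derived key, registered ahead of time.
salt = "EXAMPLE.COMalice"  # hypothetical realm + user name
stored_key = derive_key("correct horse", salt)

challenge = os.urandom(16)  # random value sent by the server
good = prove(derive_key("correct horse", salt), challenge)
bad = prove(derive_key("wrong guess", salt), challenge)

print(verify(stored_key, challenge, good))  # True
print(verify(stored_key, challenge, bad))   # False
```

Only the challenge and the proof travel between the two sides, so an attacker sniffing the network never sees the password itself.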
To understand the Kerberos protocol and how it works, you need to know some of the terminology and the components of a Kerberos system. Let me introduce you to all of them.
The first one is the KDC, the Key Distribution Center. The KDC is the authentication server in a Kerberos environment. In most cases, it resides on a separate physical server. We can logically divide the KDC into three parts.
- A Database
- An Authentication Server (AS)
- A Ticket Granting Server (TGS)
The database stores user and service identities. These identities are known as principals. The KDC database also stores other information such as encryption keys, ticket validity duration, expiration date, etc.
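Principal names conventionally follow the form primary/instance@REALM, where the instance part is optional. The Python sketch below splits such a name into its parts and uses the result as the key of a toy stand-in for the KDC database. The principal names, the realm, and the max_ticket_life_hours attribute are hypothetical examples, not values from a real cluster.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Principal:
    primary: str   # user or service name, e.g. "alice" or "nn"
    instance: str  # optional qualifier, often a host name for services
    realm: str     # Kerberos realm, conventionally upper case


def parse_principal(text: str) -> Principal:
    """Split 'primary[/instance]@REALM' into its parts."""
    name, sep, realm = text.rpartition("@")
    if not sep or not name or not realm:
        raise ValueError(f"not a full principal name: {text!r}")
    primary, _, instance = name.partition("/")
    return Principal(primary, instance, realm)


# Toy stand-in for the KDC database: principal -> stored attributes.
kdc_db = {
    parse_principal("alice@EXAMPLE.COM"): {"max_ticket_life_hours": 24},
    parse_principal("nn/namenode.example.com@EXAMPLE.COM"): {"max_ticket_life_hours": 24},
}

for principal, attributes in kdc_db.items():
    print(principal, attributes)
```

A real KDC keeps this information in its own database format and manages it through its admin tools; the dictionary here only mirrors the idea that each principal has a record with a key and ticket settings attached.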
The Authentication Server authenticates the user and issues a TGT (Ticket Granting Ticket). If you have a valid TGT, it means the AS has verified your credentials.
The TGS is the part of the KDC that issues service tickets. Before accessing any service on a Hadoop cluster, you need to get a service ticket from the TGS.
How does Kerberos authentication work?
Let's assume you want to list a directory from HDFS on a Kerberos-enabled Hadoop cluster.
1. First, you must be authenticated by Kerberos. On a Linux machine, you can do that by running the kinit tool. The kinit program will ask you for your password. Then, it will send an authentication request to the Kerberos Authentication Server.
2. On successful authentication, the AS will respond with a TGT.
3. The kinit tool will store the TGT in your credentials cache. So, now you have your TGT, which means you have been authenticated and you are ready to execute a Hadoop command.
4. Let's say you run the following command.
hadoop fs -ls /
So, you are using the hadoop command. That's a Hadoop client, right?
5. Now, the Hadoop client will use your TGT and reach out to the TGS. The client asks the TGS for a service ticket for the Name Node service.
6. The TGS will grant you a service ticket, and the client will cache the service ticket.
7. Now, you have a ticket to communicate with the Name Node. So, the Hadoop RPC will use the service ticket to reach out to the Name Node.
8. The client and the Name Node then exchange credentials. Your ticket proves your identity, and the Name Node proves its own identity in return. Both are sure that they are talking to an authenticated entity. We call this mutual authentication.
9. The next part is authorization. If you have permission to list the root directory, the Name Node will return the results to you. That's all about Kerberos authentication in Hadoop.
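The steps above can be sketched as a small Python simulation. Treat it as a toy model under loud assumptions: tickets are only signed with HMAC instead of being encrypted, session keys and the mutual-authentication reply from step 8 are left out, and the freshness of the timestamp is not checked. The class names ToyKDC and ToyNameNode, the principal names, and the ticket lifetimes are all hypothetical.

```python
import hashlib
import hmac
import json
import os
import time


def mac(key: bytes, text: str) -> str:
    """Keyed signature used both for login proofs and for tickets."""
    return hmac.new(key, text.encode(), hashlib.sha256).hexdigest()


def seal(key: bytes, payload: dict) -> dict:
    """Sign a ticket so that only a holder of `key` can issue or verify it."""
    body = json.dumps(payload, sort_keys=True)
    return {"body": body, "mac": mac(key, body)}


def unseal(key: bytes, ticket: dict) -> dict:
    """Check the signature and the expiry time, then return the payload."""
    if not hmac.compare_digest(mac(key, ticket["body"]), ticket["mac"]):
        raise PermissionError("ticket not issued with the expected key")
    payload = json.loads(ticket["body"])
    if payload["expires"] < time.time():
        raise PermissionError("ticket expired")
    return payload


class ToyKDC:
    def __init__(self):
        self.tgs_key = os.urandom(32)  # known only to the KDC
        self.keys = {}                 # the database: principal -> long-term key

    def add_principal(self, name: str) -> bytes:
        self.keys[name] = os.urandom(32)
        return self.keys[name]

    def authenticate(self, user: str, timestamp: str, proof: str) -> dict:
        """AS role (steps 1-2): verify the proof, then issue a TGT."""
        key = self.keys.get(user)
        if key is None or not hmac.compare_digest(mac(key, timestamp), proof):
            raise PermissionError("authentication failed")
        return seal(self.tgs_key, {"client": user, "expires": time.time() + 3600})

    def grant(self, tgt: dict, service: str) -> dict:
        """TGS role (steps 5-6): verify the TGT, then issue a service ticket."""
        client = unseal(self.tgs_key, tgt)["client"]
        payload = {"client": client, "service": service, "expires": time.time() + 600}
        return seal(self.keys[service], payload)


class ToyNameNode:
    def __init__(self, principal: str, key: bytes):
        self.principal, self.key = principal, key

    def list_root(self, ticket: dict) -> str:
        """Steps 7-9: verify the service ticket, then serve the request."""
        payload = unseal(self.key, ticket)
        if payload["service"] != self.principal:
            raise PermissionError("ticket is for a different service")
        return f"listing / for {payload['client']}"


kdc = ToyKDC()
alice_key = kdc.add_principal("alice@EXAMPLE.COM")
nn = ToyNameNode("nn/namenode.example.com@EXAMPLE.COM",
                 kdc.add_principal("nn/namenode.example.com@EXAMPLE.COM"))

now = str(time.time())
tgt = kdc.authenticate("alice@EXAMPLE.COM", now, mac(alice_key, now))  # like kinit
ticket = kdc.grant(tgt, nn.principal)                                   # ask the TGS
print(nn.list_root(ticket))  # listing / for alice@EXAMPLE.COM
```

Notice that the Name Node never talks to the KDC in this flow. It trusts the service ticket because only the KDC and the Name Node itself hold the key that the ticket was sealed with.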
You might be interested in a step-by-step demo for setting up a Kerberised cluster. I have a video tutorial that does exactly that. Check out my Hadoop foundation training videos.