Apache Kafka Foundation Course - Fault Tolerance


Welcome to Apache Kafka tutorial at Learning journal. In this session, we will cover fault tolerance in Apache Kafka.
Do you understand the term, the fault tolerance?

What is fault tolerance?

In the previous session, we learned that Kafka is a distributed system and it works on a cluster of computers. Most of the time, Kafka will spread your data in partitions over various systems in the cluster. So, if one or two systems in a cluster fail, what will happen to your data? Will you be able to read it?
Probably not. That's a fault. Can we tolerate it?
The term fault tolerance is very common in distributed systems. It means, making your data available even in the case of some failures.
How to do it?
One simple solution is to make multiple copies of the data and keep it on separate systems. So if you have three copies of a partition, and Kafka stores them on three different machines, you should be able to avoid two failures. Since you have three copies on three different systems, even if two of them fails, you can still read your data from the third system.

Replication Factor

There is a particular term used for making multiple copies. We call it replication factor. So, if I say, replication factor is three, that means, I am maintaining three copies of my partition. If I say replication factor is two, that means we are keeping two copies of a partition. The replication factor of three is a reasonable number. You can even set it to higher if your data is supercritical or you are using cheap machines.
So, Kafka implements fault tolerance by applying replication to the partitions. We can define replication factor at the Topic level. We don't set a replication factor of partitions, we set it for a Topic, and it applies to all partitions within the Topic.

Understanding Kafka Replication

You may want to understand how it works in Kafka. I mean, How Kafka make these copies? Let me explain that as well.
Kafka implements a leader and follower model.
So, for every partition, One Broker is elected as a leader. And the Leader takes care of all client interactions. What does that mean? That means when a producer is willing to send some data. It connects to the Leader and starts sending data. It is Leader's responsibility to receive the message, store it in local disk and send back an acknowledgment to the producer.
Similarly, when a consumer is willing to read data, it sends a request to the leader. It is leader's responsibility to send requested data back to the consumer.
For every partition, we have a leader, and the leader takes care of all requests and responses. I hope that part is clear.
You may be wondering, that in all the above explanation, we haven't made any copy. That's where the followers come into play. So, if we create a topic with the replication factor set to three, A leader of the Topic is already maintaining the first copy. We need two more copies. So, Kafka will identify two more brokers as followers to make two copies. These Followers will copy the data from a leader. They don't talk to producer or consumer. They just copy data from a Leader. Simple, isn't it?
Can we see all this happening? Yes, let me show you a leader and followers in action so that you get a better understanding of these concepts.


Multi-node Kafka Cluster

To be able to demonstrate one Leader and two Followers, I need a three-node Kafka cluster. In an ideal Cluster, we install one Broker on one computer. But for a demonstration or a development activity, we can start multiple Brokers on a single machine. So let's do it.
I already started a Kafka cluster with my first Broker in an earlier video. Now, I am going to start two more brokers on the same machine. I hope you remember the command line tool that we used to start a Kafka broker. We used kafka-server-start.sh, and we passed the server-properties file as a parameter to the tool. We will follow the same method to start two more brokers. But before that, we will make a copy of the Broker config file and modify it. That's necessary to start new Brokers. We can't start multiple Brokers using same properties.
So, let's make two more copies of the original properties file and modify them. Then we will use these modified files to start two more Brokers.

After executing the above commands, you should have two more property files. Now, I want to change three properties in these files. Let me explain those properties, and then, we will go ahead and modify them.

  1. Broker id - The first property is Broker id. It's a unique identifier for the Broker. The default values for the first broker is zero, so we will change it to 1 for the second broker, and 2 for the third Broker. This change is to provide a unique identification to each broker.
  2. Broker port - The next property is the Broker port. It's a network port number to which Broker will bind itself. The Broker will use this port number to communicate with producers and consumers. We will just increment it to whatever the default value is there. In fact, when you start brokers on separate systems, you don't need to change this port number, but since we are starting them on a single machine, we need to change it. Otherwise, all brokers will start reading and writing on the same port number.
  3. Broker log directory - The third property is the Broker log directory. The Broker log directory is the main data directory of a Broker. We don't want all of the brokers to write into the same directory, so we need to change this value as well.

Great, go ahead and modify these properties in the files and prepare a new file for other two brokers.
You can start two more Brokers using those two new property files.


Great, you should have a three node Kafka cluster running.
We did all of this because I wanted to create a topic with a replication factor three, and show you the leader and the follower for each partition. So, let's do that.

Kafka Topic Leader and Follower

You already know how to create a topic. Create a new topic with replication factor 3. We also make sure that we have atleast two partitions.

The kafka-topics.sh is a great tool to manage a Kafka Topic. We call it Topic management tool. This tool also provides a describe command.

The output of the above command shows you the topic name. Then it tells you the number of partitions in that Topic. It shows replication factor for the Topic. It tells everything that you want to know about a Topic. Since we have two partitions on this Topic, it displays two rows, one for each partition. The video explains all this visually.
The first item is the partition id, and the id for the first partition in our example is 0, and it is 1 for the second partition.
The next information is the leader. So, for the first partition, Broker 1 is the leader. What does that mean? That means that the Broker 1 will store and maintain the first copy of this partition and it will also fulfil all client requests for this partition. Similarly, the Broker 2 is the leader of the second partition.
Let's come to the next information. The next column shows the list of the replicas. For the first partition, you will see three copies, Broker 1 maintains the first copy, and that one is the leader also. Broker 2 manages second copy, and Broker 0 holds the third copy. The Broker 2 and Broker 0 are the followers.

What is the ISR?

The ISR is a list of In Sync Replicas. You might have three copies, but one of them may not be in sync with the leader. So, The ISR shows the list of replicas that are in sync with the Leader. In our case, all three are in sync.

Good. So that's it for this session. We covered replication and fault tolerance in this video. We also learned to start three Brokers on a single machine.


You will also like: