Welcome to Kafka tutorials at Learning Journal. I hope you are following this training from the beginning. We have covered most of the basics of Kafka and explored Kafka producers in detail. Now it is time to explore the consumer side of it.
In this session, I will talk about consumer groups. I have already introduced consumer groups, so I assume that you have a fair idea about them. But before we start creating different types of consumers, it is necessary to understand some nuances of a consumer group. So, let's start.
If your producers are pushing data to the topic at a moderate speed, a single consumer may be enough to read and process that data. However, if you want to scale up your system and read data from Kafka in parallel, you need multiple consumers reading your topic in parallel. Many applications may have a clear need for multiple producers pushing data to a topic at one end and multiple consumers reading and processing data on the other end.
There is no complexity on the producer side. It is as simple as executing another instance of a producer. No coordination or sharing of information is needed among producers.
But on the consumer side, we have various considerations. Let's discuss those factors and understand the solution that Kafka provides. Let’s start with the first question.
Kafka Consumer Group?
When I talk about parallel reading, I am speaking about one single application consuming data in parallel. It is not about multiple applications reading the same Kafka topic in parallel.
So, the question is, how do we implement parallel reads in a single application?
I think you already know the answer.
We can do that by creating a group and starting multiple consumers in the same group. That part is straightforward, and we will see some code examples for creating multiple consumers in the same group.
However, there is a concern for duplicate reads.
If we have multiple consumers reading data in parallel from the same topic, don't you think that all of them can read the same message?
The answer is no. Kafka provides a very simple solution for this problem. Within a group, only one consumer owns a partition at any point in time. What does that mean? Let's take an example to understand this.
We have one topic, and there are four partitions. So, if we have only one consumer in a group, it reads from all four partitions. If you have two, each of them reads two partitions. If you have three, the arrangement may be something like a single consumer reading two partitions and the others owning a single partition each. So, the fundamental concept is that the consumers in a group do not share a partition. Since each partition has a single owner, two consumers in the group cannot both read the same message.
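The four-partition example can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the function name `assign_partitions`, the consumer names, and the simple round-robin rule are all made up for illustration, and Kafka's own assignment strategies may split the partitions differently.

```python
def assign_partitions(partitions, consumers):
    """Toy round-robin assignment: every partition gets exactly one owner."""
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(partitions):
        owner = consumers[index % len(consumers)]  # hypothetical rule, not Kafka's
        assignment[owner].append(partition)
    return assignment


partitions = [0, 1, 2, 3]  # one topic with four partitions

# Try the group with one, two, and three consumers.
for group in (["c1"], ["c1", "c2"], ["c1", "c2", "c3"]):
    print(len(group), "consumer(s):", assign_partitions(partitions, group))
```

Whatever the group size, each partition number appears under exactly one consumer, which is the property that keeps consumers in a group from reading each other's data.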
However, this solution also brings a limitation. The number of partitions on a topic is the upper limit on the number of consumers in a group that can actually read. So, in our example, if you have five consumers, one of them reads nothing. Kafka won't complain that you have four partitions but are starting five consumers. The fifth consumer will simply have nothing to read.
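To make the limit concrete, here is a small Python sketch, assuming a hypothetical helper named `split_active_and_idle`. Which consumer ends up idle is arbitrary in this toy model; the point is only that the number of consumers doing useful work never exceeds the number of partitions.

```python
def split_active_and_idle(num_partitions, consumers):
    """Return (consumers that own a partition, consumers with nothing to read)."""
    active = consumers[:num_partitions]  # at most one consumer per partition
    idle = consumers[num_partitions:]    # surplus members stay in the group, idle
    return active, idle


consumers = ["c1", "c2", "c3", "c4", "c5"]
active, idle = split_active_and_idle(4, consumers)

print("active:", active)
print("idle:", idle)
print("useful parallelism:", min(4, len(consumers)))
```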
So far so good. I have four partitions and four consumer processes, all reading in parallel, and no one is reading anyone else's data. So, no duplicate reads. However, I have another doubt.
How does a consumer enter and exit a group?
This question is obvious. Isn't it? You started with one consumer and wanted to scale up, so you added one more. Now you have two of them. Which partition should this new consumer read? Who should pull some partitions from the first consumer and assign them to the second consumer? Somebody should be there to manage this reassignment.
This reassignment problem does not end there. Assume you have four consumers, but one crashed, so you are left with three. What should happen to the partition it was reading? Who should read it now?
After some time, the crashed consumer recovers, so again you have four of them. Now, a reassignment will be required once again.
In a real distributed application, consumers keep joining and exiting. We do not have control over that. My question is, how does Kafka handle it? When a consumer joins a group, how is a partition assigned to it? Moreover, what happens to the partition when a consumer leaves the group? Who manages all of this?
Kafka Group Coordinator
The answer is simple. A group coordinator oversees all of this. One of the Kafka brokers acts as the group coordinator for the group.
When a consumer wants to join a group, it sends a request to the coordinator. The first consumer to participate in a group becomes the leader. All other consumers joining later become members of the group.
So, we have two actors: a coordinator and a group leader. The coordinator is responsible for managing the list of group members. So, every time a new member joins the group, or an existing member leaves the group, the coordinator modifies the list.
On a membership change, the coordinator realizes that it is time to rebalance the partition assignment. You may have a new member that needs some partitions, or a member may have left, and its partitions need to go to someone else. So, every time the list is modified, the coordinator initiates a rebalance activity.
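The coordinator's bookkeeping can be pictured with a toy Python class. Everything here is an assumption for illustration: the class `GroupCoordinator`, its methods, and the `generation` counter are invented names that only loosely mirror the idea of one rebalance per membership change, not the broker's real implementation or protocol.

```python
class GroupCoordinator:
    """Toy coordinator: keeps the member list and triggers a rebalance on change."""

    def __init__(self):
        self.members = []    # current group members, in join order
        self.generation = 0  # bumped once for every rebalance

    def join(self, consumer_id):
        if consumer_id not in self.members:
            self.members.append(consumer_id)
            self._start_rebalance("joined", consumer_id)

    def leave(self, consumer_id):
        if consumer_id in self.members:
            self.members.remove(consumer_id)
            self._start_rebalance("left", consumer_id)

    def _start_rebalance(self, event, consumer_id):
        # Every modification of the member list starts a new rebalance.
        self.generation += 1
        print(f"{consumer_id} {event}: rebalance #{self.generation}, members = {self.members}")


coordinator = GroupCoordinator()
for consumer_id in ["c1", "c2", "c3", "c4"]:
    coordinator.join(consumer_id)
coordinator.leave("c3")  # one consumer crashes
coordinator.join("c3")   # and later recovers
```

Four joins, one crash, and one recovery give six membership changes, so this sketch runs six rebalances.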
Kafka Group Leader
The group leader is responsible for executing the rebalance activity. The group leader takes the list of current members, assigns partitions to them, and sends the assignment back to the coordinator. The coordinator then communicates to each member its new partitions. The important thing to note here is that, during the rebalance activity, none of the consumers is allowed to read any messages.
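The division of work between the leader and the coordinator can be sketched as one rebalance round in Python. This is a simplified model under assumptions: the `Consumer` class, the `reading` flag, and the round-robin rule in `leader_assign` are hypothetical, and the real exchange happens over network requests rather than function calls.

```python
class Consumer:
    def __init__(self, name):
        self.name = name
        self.partitions = []  # partitions this consumer currently owns
        self.reading = False  # False while a rebalance is in progress


def leader_assign(member_names, partitions):
    """The leader's job: map every partition to exactly one member (toy round-robin)."""
    plan = {name: [] for name in member_names}
    for index, partition in enumerate(partitions):
        plan[member_names[index % len(member_names)]].append(partition)
    return plan


def rebalance(consumers, partitions):
    """Run one rebalance round and return the name of the group leader."""
    for consumer in consumers:
        consumer.reading = False      # nobody reads during the rebalance
    leader = consumers[0]             # the first consumer to join is the leader
    plan = leader_assign([c.name for c in consumers], partitions)
    for consumer in consumers:        # the coordinator relays each member's share
        consumer.partitions = plan[consumer.name]
        consumer.reading = True       # reading resumes with the new assignment
    return leader.name


group = [Consumer("c1"), Consumer("c2"), Consumer("c3")]
leader_name = rebalance(group, [0, 1, 2, 3])
print("leader:", leader_name)
for consumer in group:
    print(consumer.name, "->", consumer.partitions)
```

Note that in this sketch each member only receives its own partitions; only the leader sees the whole plan.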
Let us summarize it quickly.
- Consumer Groups - They are used to read and process data in parallel.
- Partitions are not shared - To prevent duplicate reads in a group, Kafka does not allow more than one consumer in the group to read data from a single partition at the same time.
- A Group Coordinator - A broker is designated as the group coordinator, and it maintains a list of active consumers.
- Rebalance - Every time the list of active consumers is modified, the coordinator asks the group leader to carry out a rebalance activity.
- The Group Leader - It executes the rebalance activity.
Rebalance activity is nothing but assigning partitions to individual consumers.
That’s it for this session. Thank you for visiting Learning Journal. Keep learning and keep growing.