Welcome to the Apache Kafka tutorial at Learning Journal. In this session, we will talk about some basic concepts associated
with Kafka. The objective of this article is to introduce you to the main terminology and build
a foundation to understand and grasp the rest of the training.
In this session, we will cover the following things.
- Producer
- Consumer
- Broker
- Cluster
- Topic
- Partitions
- Offset
- Consumer Groups
We will be using these terms extensively during our discussion of Apache Kafka. It is crucial that we both, you and I, have the same understanding of these concepts. So let me explain my understanding of these terms and some other related concepts associated with these keywords.
The first item is the producer. So, what is a producer?
The producer is an application that sends data. Some people call it data, but we will call it a message or a message record. These messages can be anything ranging from a simple string to a complex object. Ultimately, it is a small to medium-sized piece of data. The message may have a different meaning or schema for us. But for Kafka, it is a simple array of bytes.
For example, if I want to send a file to Kafka, I will create a producer application and push each line of the file as a message. In this case, a message is one line of text. But for Kafka, it is just an array of bytes. Similarly, if I want to send all the records from a table, I will submit each row as a message. Or if I want to send the result of a query, I will create a producer application, fire a query against my database, collect the result, and start throwing each row as a message. So, while working with Kafka, if you want to send some data, you have to create a producer application. It is unlikely that you will find a readymade producer that fits your purpose.
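To make the "array of bytes" point concrete, here is a minimal sketch of the serialization step a producer application performs before handing a message to Kafka. The function names are hypothetical; the point is simply that a line of text and a database row both end up as plain bytes:

```python
import json

def serialize_line(line: str) -> bytes:
    # One line of a text file becomes one message: just UTF-8 encoded bytes.
    return line.encode("utf-8")

def serialize_row(row: dict) -> bytes:
    # A table row can be serialized as JSON text, then encoded to bytes.
    # Kafka itself never interprets this; it stores only the byte array.
    return json.dumps(row).encode("utf-8")

msg1 = serialize_line("first line of the file")
msg2 = serialize_row({"order_id": 42, "amount": 9.99})

print(type(msg1).__name__)  # bytes
print(type(msg2).__name__)  # bytes
```

Whatever meaning or schema those bytes carry is an agreement between the producer and the consumer, not something Kafka enforces.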
The next thing is the consumer. The consumer is again an application that receives data. If producers are sending data, they
must be sending it to someone. Right? The consumers are the recipients. But remember that the producers
don't send data to a recipient address. They just send it to the Kafka server. And anyone who is interested
in that data can come forward and take it from the Kafka server. So, any application that requests data
from a Kafka server is a consumer, and consumers can ask for data sent by any producer, provided they have
permission to read it.
So, just continuing the file example, if I want to read the file sent by a producer, I will create a consumer application, and then I will request the data from Kafka. The Kafka server will send me some messages. I think you remember that each message is a line of text in this example.
So, the client application will receive some lines from the Kafka server, process them, and again request some more messages. The client keeps demanding data, and the Kafka server keeps giving message records as long as new messages are coming from the producer.
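The request-and-process loop described above can be sketched as a toy simulation. The in-memory `log` list and the `poll` function here are stand-ins for the broker, not real Kafka APIs; they just illustrate the rhythm of "receive a batch, process it, ask for more":

```python
# In-memory stand-in for a Kafka broker's stored messages (illustrative only).
log = [f"line {i}" for i in range(10)]

def poll(position: int, max_records: int = 4) -> list:
    # Return up to max_records messages starting at the given position.
    return log[position : position + max_records]

position = 0
received = []
while True:
    batch = poll(position)
    if not batch:            # no new messages from the producer yet
        break
    received.extend(batch)   # "process" the batch
    position += len(batch)   # then request the next batch

print(len(received))  # 10
```

A real consumer follows the same pattern: poll, process, poll again, for as long as the producer keeps sending.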
Now, let's move on and try to understand a broker. The broker is the Kafka server. It is just a meaningful name given to the Kafka server. And this title makes sense as well because all that Kafka does is act as a message broker between the producer and the consumer. The producer and consumer do not interact directly. They use the Kafka server as an agent or a broker to exchange messages.
Let's come to the next term: the cluster. This one is simple. If you have any background in distributed systems, you already know that a cluster is a group of computers acting together for a common purpose. Since Kafka is a distributed system, the cluster has the same meaning for Kafka. It is merely a group of computers, each executing one instance of a Kafka broker.
The next item is the topic. We learned that the producer sends data to the Kafka broker. Then a consumer can ask for data from the Kafka
broker. But the question is: which data?
Let's try to understand this by a simple conversation between Broker and the consumer.
Broker - I am collecting data from multiple producers, which one do you want?
Consumer - Give me the data sent by producer ABC.
Broker - Oh Man, producer ABC is pushing three different types of records. Which one do you want?
Consumer - Well, send me the sales data.
Broker - Ok, so you are looking for sales data. Two more producers are sending sales data.
Consumer - Gosh, we need to have some identification mechanism.
That's where the notion of the Kafka topic comes in. The topic is an arbitrary name given to a data set. Better still, it is a unique name for a data stream.
For example, we create a topic called Global Orders, and every point of sale may have a producer. They send their order details as messages to the single topic named Global Orders. And a subscriber interested in orders can subscribe to that same topic.
By now, you learned that the broker would store data for a topic. This data can be enormous. It may be larger than the storage
capacity of a single computer. In that case, the broker may have a challenge in storing that data.
One of the obvious solutions is to break it into two or more parts and distribute it to multiple
computers. Kafka is a distributed system that runs on a cluster of machines. So, it is self-evident
that Kafka can break a topic into partitions and store each partition on a separate computer. And that
is what a partition is.
You may be wondering how Kafka decides on the number of partitions. I mean, some topics may be large, but others may be relatively small. So how does Kafka know whether it should create 100 partitions or whether just ten partitions would be enough?
The answer is simple. Kafka doesn't take that decision. We, as developers, make that decision. When we create a topic, we specify the number of partitions, and the Kafka broker will create that many partitions for our topic. But remember that every partition sits on a single machine. You can't break it again. So, do some estimation and simple math to calculate the number of partitions.
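As a sketch of that "estimation and simple math", one common back-of-the-envelope approach is to divide the expected topic throughput by what a single partition (or its consumer) can handle. The numbers below are illustrative assumptions, not Kafka defaults:

```python
import math

# Assumed workload figures (hypothetical, for illustration).
expected_throughput_mb_s = 500   # total rate producers will push to the topic
per_partition_mb_s = 10          # rate one partition / one consumer can absorb

# Round up: a fractional partition is not possible.
partitions = math.ceil(expected_throughput_mb_s / per_partition_mb_s)
print(partitions)  # 50
```

With these assumed numbers, you would create the topic with 50 partitions; in practice you would also leave headroom for growth, since a partition cannot be split further once data is on it.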
Let's talk about the offset. The offset is simple. It is a sequence number of a message in a partition. This number is assigned
as the messages arrive in a partition. And these numbers, once assigned, never change. They
are immutable. This sequencing means that Kafka stores messages in the order of arrival within a
partition. The first message gets an offset of zero. The next message receives an offset of one, and so
on. But remember that there is no global offset across partitions. Offsets are local to the partition.
So, if you want to locate a message, you should know three things.
Topic name, Partition number, and an offset number. If you have these three things, you can directly locate a message.
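The triple (topic name, partition number, offset) acts as a message's address. Here is a toy in-memory model of that lookup; the topic name and stored values are hypothetical, and the nested structure only mimics how the three coordinates narrow down to one message:

```python
# topic -> partition number -> list of messages, indexed by offset.
cluster = {
    "global-orders": {
        0: ["order-a", "order-b"],   # partition 0: offsets 0 and 1
        1: ["order-c"],              # partition 1: offset 0 (its own sequence)
    }
}

def locate(topic: str, partition: int, offset: int) -> str:
    # The three coordinates together identify exactly one message.
    return cluster[topic][partition][offset]

print(locate("global-orders", 0, 1))  # order-b
print(locate("global-orders", 1, 0))  # order-c
```

Note how offset 0 exists in both partitions: that is the "no global offset" point from above, since each partition numbers its messages independently.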
Kafka Consumer Groups
Now we are left with the last thing: consumer groups. We already understand the consumer. What is a consumer group?
It is a group of consumers. Several consumers form a group to share the work. You can think of it like this: there is one large task, and you want to divide it among multiple people, so you create a group, and members of the same group share the work. Let me give you an example.
A Kafka Example
Let's assume that we have a retail chain. In every store, there are a few billing counters. You want to bring all the invoices from every billing counter to your data centre. You have learned Kafka, and you find it an excellent solution to transport data from the billing locations to the data centre, so you decide to implement it. The first thing you might want to do is create a producer at every billing site. These producers will send the bills as messages to a Kafka topic. The next thing you might want to do is create a consumer. The consumer will read data from the Kafka topic and write it into your data centre. It sounds like a perfect solution. Right? But there is a small problem. Think of the scale. You have hundreds of producers pushing data into a single topic. How will you handle that volume and velocity?
You learned Kafka exceptionally well. So, you decide to create a large Kafka cluster and partition your topic. Correct? So,
your topic is partitioned and distributed across the cluster. Now several brokers are sharing the
workload to receive and store data. On the source side, you have many producers and several brokers
to share the workload. What about the destination side? You have a single unfortunate consumer.
That is where the consumer group comes in. You create a consumer group and start executing many consumers, telling them to divide the work.
So far so good. But how do we split the work? That's not a difficult question. Suppose I have 600 partitions and I am starting 100 consumers. Then why doesn't each consumer take six partitions? We will see; if they can't handle six partitions each, we will start some more consumers in the same group. We can go up to 600 consumers, so that each consumer has just one partition to read.
If you followed this example correctly, you understand that partitioning and consumer groups are tools for scalability. And notice that the maximum number of consumers in a group is the total number of partitions you have on a topic. Kafka doesn't allow more than one consumer from the same group to read from the same partition simultaneously. This restriction is necessary to avoid double reading of records.
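The split described above can be sketched as a simple round-robin hand-out of partitions to group members. This mimics the spirit of Kafka's partition assignment, not any exact built-in assignor:

```python
def assign(partitions: int, consumers: int) -> dict:
    # Map each consumer id to the list of partition numbers it will read.
    assignment = {c: [] for c in range(consumers)}
    for p in range(partitions):
        assignment[p % consumers].append(p)  # round-robin hand-out
    return assignment

plan = assign(600, 100)
print(len(plan[0]))              # 6 partitions per consumer
print(len(assign(600, 600)[0]))  # 1 partition each at the maximum group size
```

Each partition lands with exactly one consumer, which is why the group can never usefully grow beyond 600 members here: a 601st consumer would simply sit idle.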
Great. I hope you have learned the core concepts of Kafka. Now you are familiar with the essential terminology that we will be using throughout the Kafka tutorials.