Apache Kafka Core Concepts
In this article, I will introduce you to Apache Kafka. We will try to understand
Kafka and some core concepts associated
with it. I am assuming that you have at least heard about Apache Kafka and you
know that it is an Open Source project. Kafka was initially developed at LinkedIn
later open sourced in 2011. Since then it has evolved and established itself as a
tool for building real-time data pipelines. Now it is securing its share in
streaming applications as well.
The Kafka documentation says it is a distributed streaming platform. That is good for definition. But I want to know what it can do for me or what I can do using Apache Kafka.
A messaging system
The official documentation says it is similar to enterprise messaging system. I guess, you already understand a messaging system. In a typical messaging system, there are three components.
The producers are the client applications, and they send some messages. The Brokers receive messages from publishers and store these messages. The consumers read the message records from brokers.
Apache Kafka Use Case
A typical messaging system appears very simple. Now let us look at the data integration problem in a large organization. I borrowed the below figure from Jey Creps blog.
The first part of the diagram shows the data integration requirement in a large
enterprise. Does it look like a mess? There
are many source systems and multiple destination systems. And you are given a task
create data pipelines to move data among these systems. For a growing company, the
of source and destination systems keep getting bigger and bigger. Finally, your
pipeline becomes a mess. Some part of your pipeline will keep breaking every day.
However, if we can use a messaging system for solving this kind of integration problem, the solution may be neater, and cleaner as shown in the second part of the above figure. That's the idea discovered by the team at LinkedIn. Then they started evaluating existing messaging systems, but none of them meet their criteria to support the desired throughput and scale. Finally, they end up creating Kafka.
What is Apache Kafka?
So at the core, Kafka is a highly scalable and fault tolerant enterprise messaging system. Let us look at the Kafka diagram from official documentation.
The first thing is a producer. I hope you already understand producers. They send
messages to Kafka cluster.
The second thing is the Cluster. A Kafka cluster is nothing but a bunch of brokers running in a group of computers. They take message records from producers and store it in Kafka message log.
At the bottom, we have consumer applications. The consumers read messages from Kafka cluster, processes it and do whatever they want to do. They might send the messages to Hadoop, Cassandra, HBase or may wish to push it back again into Kafka for someone else to read those modified or transformed records.
Now let us turn our focus on other two things in the above diagram.
Kafka offers a fantastic throughput and scalability that you can easily handle a continuous stream of messages. So, if you can just plug in some stream processing framework to Kafka, it could be your backbone infrastructure to create a real-time stream processing application. Hence diagram shows some stream processing applications. They read a continuous stream of data from Kafka, process them and then either store them back to Kafka or send them directly to other systems. Kafka provides some stream processing APIs as well. So you can do a lot of things using Kafka stream processing APIs, or you can use other stream processing frameworks like Spark streaming or storm.
The next thing is Kafka connector. These are another compelling features. They are ready to use connectors to import data from databases into Kafka or export data from Kafka to databases. These are not just out of the box connectors but also a framework to build specialized connectors for any other application.
Kafka is a distributed streaming platform. You can use it as an enterprise messaging system. That doesn't mean just a traditional messaging system. You can use it to simplify complex data pipelines that are made up of a vast number of consumers and producers. You can use it as a stream processing platform. It also provides connectors to export and import bulk data from databases and other systems.
However, implementing Apache Kafka is not that simple. There is no plug and play component. You need to use APIs and write a bunch of code. You need to understand some configuration parameters and tune or customize Kafka behavior according to your requirement and use case.
Let's talk about some basic concepts associated with Kafka. I will try to introduce you to the following terminologies.
- Consumer groups
What is a Kafka producer?
The producer is an application that sends data. Some people call it data, but we will call it a message or a message record. These messages can be anything ranging from a simple string to a complex object. Ultimately it is a small or a medium-size piece of data. The message may have different meaning or schema for us. But for Kafka, it is a simple array of bytes. For example, if I want to push a file to Kafka, I will create a producer application and send each line of the file as a message. In that case, a message would be a line of text. But for Kafka, it is just an array of bytes. Similarly, If I want to send all the records from a table, You would send each row as a message, or if I want to send the result of a query. I will create a producer application, fire a query against my database, collect the result and start sending each row as a message. So, while working with Kafka, if you want to send some data, you have to create a producer application.
What is a Kafka consumer?
So the consumer is an application that receives data. If producers are sending
data, they must be sending it to someone,
right? The consumers are the recipients. But remember that the producers don't send
to a recipient address. They just send it to Kafka server. And anyone who is
in that data can come forward and take it from Kafka server. So, any application
requests data from a Kafka server is a consumer, and they can ask for data send by
producer provided they have permissions to read it.
So just continuing on the file example, If I want to read the file sent by a producer, I will create a consumer application, then I will request Kafka for the data. The Kafka server will send me some messages. So the client application will receive some lines from Kafka server, it will process them and again request for some more messages. The client keeps asking data and Kafka will keep giving message records as long as new messages are coming from the producer.
What is a Kafka broker?
The broker is the Kafka server. It's just a meaningful name given to the Kafka server. And this name makes sense as well because all that Kafka does is act as a message broker between producer and consumer. The producer and consumer don't interact directly. They use Kafka server as an agent or a broker to exchange messages.
What is a Kafka cluster?
If you have any background in distributed systems, you already know that a cluster is a group of computers acting together for a common purpose. Since Kafka is a distributed system, so the cluster has the same meaning for Apache Kafka. It is merely a group of computers, each executing one instance of Kafka broker.
What is a Kafka topic?
We learned that producer sends data to the Kafka broker. Then a consumer can ask for data from the Kafka broker. But the question is, Which data? We need to have some identification mechanism to request data from a broker. There comes the notion of the topic. So the topic is an arbitrary name given to a data set. We better say that it is a unique name for a data stream. For example, We create a topic called Global Orders, and every point of sales may have a producer. They send their order details as a message to the single topic named as Global Orders. And a subscriber interested in Orders can subscribe to the same topic.
What is a Kafka partition?
By now, you learned that the broker would store data for a topic. This data can be
enormous. It may be larger than the storage
capacity of a single computer. In that case, the broker may have a challenge in
that data. One of the prominent solutions is to break it into two or more parts and
it to multiple computers.
Kafka is a distributed system that runs on a cluster of machines. So it is evident that Kafka can break a topic into partitions and store one partition on one computer. And that's what the partition means.
You might be wondering that how Kafka will decide on the number of partitions. I mean, some topics may be large, but others may be relatively small. So how Kafka knows that it should create 100 partitions or just ten partitions should be enough.
The answer is simple. Kafka doesn't take that decision. We have to make that decision. When we create a topic, we make that decision, and Kafka broker will create that many partitions for your topic. But remember that every partition sits on a single machine. You can't break it again. So do some estimation and simple math to calculate the number of partitions.
What is a Kafka offset?
The offset is simple. It is a sequence number of a message in a partition. This
number is assigned as the messages arrive
at the partition. And these numbers once assigned, they never change. They are
This sequencing means that Kafka stores messages in the order of arrival within a partition. The first message gets an offset zero. The next message receives an offset one and increments on to the following message. But remember that there is no global offset across partitions. Offset are local to the partition. So if you want to locate a message, you should know three things. Topic name, Partition number, an offset number. If you have these three things, you can directly locate a message.
What is a consumer group?
It is a group of consumers. So many consumers form a group to share the work. You
can think of it that there is one large
task and you want to divide it among multiple people. So, you create a group, and
of the same group share the work. Let me give you an example.
Let's assume that we have a retail chain. In every store, there are few billing counters. You want to bring all of the invoices from every billing counter to your data center. Since you learned Kafka and you find Kafka as an excellent solution to transport data from billing locations to the data center. You decided to implement it.
The first thing you might want to do is to create a producer at every billing site. These Producers will send bills as a message to a Kafka topic. The next thing you might want to do is to create a consumer. The consumer will read data from the Kafka Topic and write them into your data center. It sounds like a perfect solution. Right?
But there is a small problem. Think of the scale. You have hundreds of producers pushing data into a single topic. How will you handle that volume and velocity? You learned Kafka exceptionally well. So you decided to create large Kafka cluster and partition your topic. So your topic is partitioned and distributed across the cluster. Now several brokers are sharing the workload to receive and store data. From the source side, you have many producer and several brokers to share the workload. What about the destination side?
You have a single unfortunate consumer. There comes the consumer group. You create a consumer group and start executing many consumers in the same group, and tell them to share the workload. So far so good. But how do we split the work?
That's not a difficult question. I have 600 partitions. And I am starting 100 consumers. So each of the consumers takes six partitions. We will see, If they can't handle six partitions, we will start some more consumers in the same group. We can go up to 600 consumers, so each of them will have just one partition to read.
If you followed this example correctly, You understand that partitioning and consumer group is a tool for scalability. And also realize that the maximum number of consumers in a group is equal to the total number of partitions you have on a topic. Kafka doesn't allow more than one consumers to read from the same partition simultaneously. This restriction is necessary to avoid double reading of records.