Welcome to the Apache Kafka Tutorial at Learning Journal. In this session, I will introduce you to Kafka. We will try to understand
Kafka in less than 10 minutes. I am assuming that you have at least heard about Kafka and you already
know that it is an open-source project. Kafka was initially developed at LinkedIn and later open
sourced in 2011. Since then, it has evolved and established itself as a standard tool for building
real-time data pipelines. Now it is securing its share in real-time streaming applications as well.
The Kafka documentation says it is a distributed streaming platform. That's good as a definition, but I want to know what it can do for me, or what I can do using Kafka.
The official documentation says that Apache Kafka is similar to an enterprise messaging system. I guess you already understand a messaging system. In a typical messaging system, there are three components.
- Producer or Publisher - The producers are the client applications that send messages.
- Broker - The brokers receive those messages from publishers and store them.
- Consumer - The consumers read the message records from brokers.
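These three roles can be sketched with a toy in-memory example. This is plain Python, not the Kafka API; the class and method names here are made up purely for illustration:

```python
# Toy sketch of the three messaging-system roles.
# NOT the real Kafka API -- just an in-memory illustration.

class Broker:
    """Receives messages from producers and stores them."""
    def __init__(self):
        self.log = []                  # stored message records

    def receive(self, message):
        self.log.append(message)

    def read(self, offset):
        return self.log[offset]

class Producer:
    """Client application that sends messages to a broker."""
    def __init__(self, broker):
        self.broker = broker

    def send(self, message):
        self.broker.receive(message)

class Consumer:
    """Reads message records from the broker, tracking its own position."""
    def __init__(self, broker):
        self.broker = broker
        self.offset = 0                # next record to read

    def poll(self):
        message = self.broker.read(self.offset)
        self.offset += 1
        return message

broker = Broker()
Producer(broker).send("hello")
print(Consumer(broker).poll())         # -> hello
```

Notice that the broker sits in the middle and keeps the data; producers and consumers never talk to each other directly. That decoupling is exactly what makes the messaging pattern useful for the integration problem we look at next.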
Kafka Use Case
A messaging system looks very simple. Now let us look at the data integration problem in a large organization. I borrowed the diagram below from Jay Kreps' blog.
The above diagram shows the data integration requirement in a large enterprise.
Does it look like a mess?
There are many source systems and multiple destination systems, and you are given the task of creating data pipelines to move data among those systems. For a growing company, the number of source and destination systems keeps getting bigger. Eventually, your data pipeline looks like a mess. I am sure I don't need to explain that you can't manage and maintain that kind of data pipeline; some part of it will keep breaking every day.
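The mess grows multiplicatively: with point-to-point integration, every source may need a pipeline to every destination, while a broker in the middle needs only one connection per system. The system counts below are made up just to show the arithmetic:

```python
# Point-to-point integration vs. a central broker (hypothetical counts).
sources, destinations = 8, 6

point_to_point = sources * destinations   # one pipeline per (source, destination) pair
via_broker     = sources + destinations   # each system connects to the hub exactly once

print(point_to_point)   # 48 pipelines to build and maintain
print(via_broker)       # 14 connections with a broker in the middle
```

Add a few more systems and the point-to-point count explodes, while the broker-based count grows only linearly.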
However, if we use a messaging system to solve that kind of integration problem, the solution becomes much neater and cleaner, as shown below.
That's the idea the team at LinkedIn discovered. They started evaluating existing messaging systems, but none of them met their criteria for the desired throughput and scale. Finally, they ended up creating Kafka.
What is Kafka?
At its core, Kafka is a highly scalable and fault-tolerant enterprise messaging system. Take a look at the
Apache Kafka diagram from the official
documentation. I hope you recognize the producers, consumers, and brokers that the figure shows.
At the top of the diagram, the producer applications are sending messages to the Kafka cluster. The Kafka
cluster is nothing but a bunch of brokers running on a group of computers. They take message records
from producers and store them in the Kafka message log.
At the bottom of the picture, there are consumer applications. They read messages from the Kafka cluster, process them, and do whatever they want with them. They may send them to Hadoop, Cassandra, or HBase, or push the modified or transformed records back into Kafka for someone else to read.
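A key consequence of brokers storing records in a log is that reading does not remove them, so many consumers can read the same data independently, each at its own pace. Here is a toy sketch of that idea (not the real Kafka API; the consumer names are invented):

```python
# Toy sketch: one shared message log, two independent consumers.
log = ["record-0", "record-1", "record-2"]   # records stored by the brokers

# Each consumer only remembers its own read position (its "offset").
# Reading never removes records, so consumers do not interfere.
offsets = {"hadoop-loader": 0, "alerting-app": 0}

def poll(consumer):
    record = log[offsets[consumer]]
    offsets[consumer] += 1
    return record

print(poll("hadoop-loader"))   # record-0
print(poll("hadoop-loader"))   # record-1
print(poll("alerting-app"))    # record-0  (its own independent position)
```

This is why one Kafka cluster can feed Hadoop, Cassandra, HBase, and other applications from the same data at the same time.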
Now let us turn our focus to the other two things in this diagram.
Let me ask a question. What is a stream?
Well, I would say a continuous flow of data, or you can define it as a stream of messages.
Kafka, as a messaging system, is so powerful in terms of throughput and scalability that it can handle a continuous stream of messages. If you plug a stream processing framework into Kafka, it can become the backbone infrastructure for a real-time stream processing application. That is what the right side of the diagram is trying to explain: those are stream processing applications. They read a continuous stream of data from Kafka, process it, and then either store the results back in Kafka or send them directly to other systems. Kafka provides stream processing APIs as well, so you can do a lot with Kafka's own stream processing APIs, or you can use other stream processing frameworks like Spark Streaming or Storm.
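The read-process-write-back pattern can be sketched in a few lines. Real Kafka Streams or Spark Streaming jobs do this over a live, unbounded stream; here a plain Python generator stands in for the incoming data, and a list stands in for the output topic:

```python
# Toy stream processing: read records, transform them, write results back.
# A generator stands in for a live, unbounded Kafka stream.

def incoming_stream():
    for value in [3, 7, 12, 5, 20]:      # pretend these arrive continuously
        yield value

processed = []                            # stands in for an output topic
for value in incoming_stream():
    if value > 6:                         # filter step: keep large values
        processed.append(value * 10)      # transform step

print(processed)   # [70, 120, 200]
```

The structure is the same in a real streaming job: an unbounded input, a chain of filter/transform steps, and an output sink, with the framework handling the distribution and fault tolerance for you.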
The next thing is Kafka connectors. These are among the most compelling features: ready-to-use connectors to import data from databases into Kafka or export data from Kafka to databases. Kafka offers not just out-of-the-box connectors but also a framework to build specialized connectors for any other application.
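To give you a feel for how little code a connector needs, here is the sample standalone configuration for the FileStreamSource demo connector that ships with the Kafka distribution (the file and topic names follow Kafka's own quickstart example):

```properties
# connect-file-source.properties (sample shipped with Kafka)
name=local-file-source
connector.class=FileStreamSource
tasks.max=1
file=test.txt
topic=connect-test
```

With just these properties, Kafka Connect tails `test.txt` and publishes each new line as a record to the `connect-test` topic; no producer code is written at all.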
Let us summarize all that we learned in this session.
- Kafka is a distributed streaming platform. You can use it as an enterprise messaging system. That doesn't mean just a traditional messaging system. You can use it to simplify complex data pipelines that are made up of a vast number of consumers and producers.
- You can use it as a stream processing platform. There are two parts to stream processing: a stream and a processing framework. Kafka gives you the stream, and you can plug in a processing framework.
- Kafka also provides connectors to export and import bulk data from databases and other systems.
But implementing these things is not that simple. There is no plug-and-play component. You need to use APIs and write a fair amount
of code. You need to understand some configuration parameters and tune or customize Kafka's behavior
according to your requirements and use case.
We will cover all these things in this training. So, keep watching.