What is Kafka?
Apache Kafka is a distributed streaming platform built on the principles of a messaging system. Apache Kafka started as a messaging system for creating robust data pipelines. Over time, however, Kafka has evolved into a full-fledged streaming platform that offers all the core capabilities needed to implement stream processing applications over real-time data pipelines. The latest version of Apache Kafka comes in three flavors.
- Kafka Client APIs
- Kafka Connect
- Kafka Streams
In this article, I will briefly introduce you to all these three flavors of Apache Kafka.
Kafka Client APIs
Like any messaging system, Apache Kafka has three main components.
- Kafka Brokers
- Kafka Producers
- Kafka Consumers
These three components are at the core of Apache Kafka. Everything else is built on top of that core infrastructure. The three core components provide the capability to create a highly scalable messaging infrastructure and real-time streaming data pipelines. Like any other messaging system, Apache Kafka works in asynchronous mode. The following diagram explains the core functionality of Apache Kafka.
The Kafka broker is the core infrastructure of Kafka. It is a cluster of computers that run the Kafka broker service. In a typical case, one machine runs one instance of the Apache Kafka broker. The Kafka broker can store Kafka messages in local disk-based storage. The Kafka broker also comes with a replication capability, and hence a message that is received by one broker service is also copied and replicated to other brokers. Replication provides fault tolerance to the Kafka broker: if a broker service in the cluster goes down for some reason, the other brokers can serve the message from their copies.
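The replication and failover behavior described above can be sketched in a few lines of plain Python. This is a conceptual simulation with invented names (`Broker`, `replicate`, `fetch`), not Kafka's actual replication protocol:

```python
# Conceptual sketch: messages replicated across brokers survive a broker failure.
class Broker:
    def __init__(self, broker_id):
        self.broker_id = broker_id
        self.messages = []   # simplified stand-in for disk-based storage
        self.online = True

def replicate(brokers, message, replication_factor=2):
    """Store the message on `replication_factor` online brokers."""
    targets = [b for b in brokers if b.online][:replication_factor]
    for b in targets:
        b.messages.append(message)

def fetch(brokers, message):
    """Any online broker holding a copy can serve the message."""
    for b in brokers:
        if b.online and message in b.messages:
            return b.broker_id
    return None

cluster = [Broker(i) for i in range(3)]
replicate(cluster, "order-42", replication_factor=2)
cluster[0].online = False          # simulate one broker going down
print(fetch(cluster, "order-42"))  # a surviving replica still serves it
```

Even with broker 0 offline, the message is served from broker 1's copy, which is the essence of the fault tolerance that replication buys you.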
Kafka producers are the applications that send data to Kafka brokers using specific Kafka client APIs. Apache Kafka provides a set of producer APIs that allows applications to send continuous streams of data to the cluster of Kafka brokers. You can implement multiple instances of the producer applications, and all of them can simultaneously transmit data to the brokers. That notion of numerous Kafka producer applications sending data to Apache Kafka brokers is the core of Kafka's scalability on the producer side of the data pipeline. Apache Kafka also provides the notion of topics. Kafka producers always send data to a defined topic, and that allows multiple applications to group their data and keep it separate from other applications' data in their own topics.
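The idea of multiple producers writing to separate topics can be illustrated with a small, purely conceptual Python sketch; the `broker` dictionary and `send` function are stand-ins for illustration, not the real producer API:

```python
# Conceptual sketch: topics keep each application's data separate on the broker.
from collections import defaultdict

broker = defaultdict(list)  # topic name -> list of records

def send(topic, record):
    """A producer always sends a record to a defined topic."""
    broker[topic].append(record)

# Two independent applications write to their own topics.
send("orders", {"id": 1, "item": "book"})
send("clicks", {"user": "alice", "page": "/home"})
send("orders", {"id": 2, "item": "pen"})

print(len(broker["orders"]), len(broker["clicks"]))  # 2 1
```

The two applications' records never mix: each topic holds only the data sent to it, which is exactly the grouping that topics provide.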
Kafka consumers are the applications that request and consume data from the Kafka brokers using specialized Kafka client APIs. Apache Kafka provides a set of consumer APIs that allows applications to receive continuous streams of data from the cluster of Kafka brokers. You might implement multiple instances of consumer applications, and all of them can simultaneously read data from the brokers. The Kafka consumer APIs also offer the notion of consumer groups. You can group your consumers to share the workload of reading and processing data. Each consumer in the group receives a portion of the data, and the Kafka broker ensures that the same data record is not sent to two consumers in the same consumer group. The notion of consumer groups is the core of Kafka's scalability on the consumer side of the data pipeline.
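The load-sharing guarantee of a consumer group can be sketched conceptually: each record's key maps to exactly one consumer in the group, so no record is processed twice within the group. The `assign` function below is an invented stand-in, loosely mimicking partition assignment, not Kafka's actual protocol:

```python
# Conceptual sketch: within a consumer group, each record goes to exactly
# one consumer (here, chosen by hashing the record key).
def assign(record_key, consumers):
    """Route a key to exactly one consumer, mimicking partition assignment."""
    idx = hash(record_key) % len(consumers)
    return consumers[idx]

group = ["consumer-a", "consumer-b", "consumer-c"]
records = [f"key-{i}" for i in range(9)]
delivery = {key: assign(key, group) for key in records}

# Every record is delivered to exactly one member of the group.
assert all(delivery[key] in group for key in records)
```

Because the assignment is a function of the key, a record can never land on two consumers of the same group, while the nine records are spread across the three consumers to share the workload.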
Kafka Connect
Kafka Connect is built on top of the Kafka core components. Kafka Connect offers you a reliable and scalable method to move data between the Kafka broker and other data sources. Kafka Connect offers you two different things to achieve this data movement.
- Off-the-shelf Kafka connectors
- Kafka Connect APIs and a framework
Off-the-Shelf Kafka Connectors
These are ready-to-use, off-the-shelf Kafka connectors that you can use to move data between the Kafka broker and other applications. To use a Kafka connector, you do not need to write code or make changes to your applications; Kafka connectors are purely configuration based. You can classify these Kafka connectors into two groups.
- Source connectors
- Sink connectors
Source connectors are built on the foundation of Kafka producers. You can use a source connector to pull data from a source system (for example, an RDBMS) and send it to the Kafka broker.
Sink connectors are the complementary part of source connectors, and they are built on the foundation of Kafka consumers. You can use a sink connector to pull data from the Kafka broker and send it to a target system (for example, HDFS).
The Kafka community has developed many off-the-shelf source and sink connectors for a variety of systems. You can find an extensive list of Kafka connectors at Kafka Connect Hub.
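As an illustration of configuration-only usage, the simple file source connector that ships with Kafka is driven by a handful of properties; the connector name, file path, and topic name below are example values, not anything you must use:

```properties
name=local-file-source
connector.class=FileStreamSource
tasks.max=1
file=/tmp/input.txt
topic=file-lines
```

With just this configuration, each line appended to the file is published as a record to the topic, with no application code written at all.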
Kafka Connect Framework
The second part of Kafka Connect is a robust and easy-to-use development framework. The Kafka Connect framework allows you to quickly develop your own custom source and sink connectors. If a ready-to-use connector does not exist for your system, you can leverage the Kafka Connect framework to develop your own connectors. The framework makes it simpler for you to write high-quality, reliable, and high-performance custom connectors, and it shortens the development, testing, and production deployment lifecycle.
Why do we need Kafka Connect?
When creating a data pipeline using Kafka, you have a choice of implementing a producer/consumer or using a ready-to-use, off-the-shelf connector. You might also consider developing your own custom connector using the Kafka Connect framework. However, the question is: when to use what?
Kafka producers and consumers are embedded in your application; they become an integral part of it. Suppose your application persists data in a storage system such as a database or a log file, but you also want to send that data to a Kafka broker for further consumption by other applications. So you modify your application code and implement the Kafka producer APIs to send data to a Kafka broker. This approach works perfectly well when you have access to the application code and can modify it.
When you do not have access to the application code, or you do not want to embed a Kafka producer or consumer in your application (to achieve modularity and simplify management), you should prefer to use a connector. If your application persists data in some storage system and you have access to that storage system, you should prefer a Kafka connector to build your data pipeline. A Kafka connector should cut down your development effort, and it can be used and managed by non-developers.
If a connector does not exist for your storage system, and you have a choice between embedding producer/consumer code in the application and developing a new connector, it is recommended that you create a new connector, because the framework provides out-of-the-box features like configuration management, offset storage, parallelization, error handling, support for different data types, and standard management REST APIs.
Kafka Streams
The Apache Kafka client APIs, Kafka Connect, and the Kafka brokers together provide a reliable and highly scalable backbone infrastructure for delivering data streams among applications. You can use the stream processing system of your choice to develop a real-time streaming application; Apache Spark, Apache Storm, and Apache Flink are among the most popular stream processing frameworks. However, starting from the Kafka 0.10 release, Kafka includes a powerful stream processing library, and that is what we call Kafka Streams. The Kafka Streams library allows Kafka developers to extend their standard applications with the capability to consume, process, and produce new data streams. By using Kafka Streams for your real-time stream processing requirements, you can avoid the cost and overhead of maintaining the additional cluster that cluster-based stream processing systems like Apache Spark require. Apache Spark might make more sense when you have a distributed machine learning algorithm and you know that you will need the capabilities of the Spark cluster. However, for simple real-time applications where you do not have enough justification to build a Spark cluster, Kafka Streams is a great alternative.
Kafka Streams allows you to process your data in real time on a per-record basis. You do not need to group data into small batches and work on micro-batches like some other stream processing frameworks. The ability to work on each individual record as it arrives is critical for millisecond response times. A typical Kafka Streams application reads data from a Kafka topic in real time, performs the necessary actions on the data, and possibly sends it back to the Kafka broker on a different topic. You can still use Kafka producers, Kafka consumers, and Kafka connectors to handle the rest of your data integration needs within the same cluster. A typical implementation might use all three flavors of Apache Kafka to solve a bigger problem and create a robust application.
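The read-process-write pattern of a typical streams application can be sketched conceptually in plain Python. The real Kafka Streams library is Java; the topic names and `process` function below are invented purely for illustration:

```python
# Conceptual sketch: read from an input topic, transform each record as it
# arrives (no batching), and write the result to an output topic.
topics = {"words-in": ["kafka", "streams"], "words-out": []}

def process(record):
    """The per-record transformation; here, simply uppercase the value."""
    return record.upper()

for record in topics["words-in"]:             # one record at a time
    topics["words-out"].append(process(record))

print(topics["words-out"])  # ['KAFKA', 'STREAMS']
```

The loop handles each record individually the moment it is available, which is the per-record (rather than micro-batch) model the paragraph above describes.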