Kafka Enterprise Architecture
Apache Kafka is an open-source streaming platform designed to provide all the necessary components for managing data streams. However, developing enterprise-level solutions requires you to cover a number of additional aspects. In this article, we will examine some of the key components used in an enterprise Kafka deployment.
Apache Kafka provides the following core components to deal with streaming data.
- Kafka Broker
- Kafka Client APIs
- Kafka Connect
- Kafka Streams
If you are inclined towards developing a streaming platform using the Kafka stack, you might need to implement all of these components. Let’s quickly talk about the core purpose of each of these Kafka components.
Kafka brokers are the primary storage and messaging components of Apache Kafka. The servers that make up a Kafka cluster are called brokers. You will usually want at least three Kafka brokers in a cluster. However, if you want to maintain three replicas of your messages, we recommend having at least four brokers: this gives you some protection against a single node failure and enough time to fix the problem. Kafka brokers rely on ZooKeeper, which is a mandatory component of every Apache Kafka cluster. To provide high availability, you will need at least three ZooKeeper nodes, and it is recommended to install them on separate machines.
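As a concrete sketch, a minimal broker configuration for such a cluster might look like the following. The broker id, directory, and ZooKeeper addresses are placeholders for illustration; the replication settings reflect the three-replicas-on-four-brokers recommendation above.

```properties
# server.properties (one file per broker; broker.id must be unique)
broker.id=1
log.dirs=/var/lib/kafka/data
zookeeper.connect=zk1:2181,zk2:2181,zk3:2181

# Keep three copies of every partition so one broker can fail safely
default.replication.factor=3
min.insync.replicas=2
```

With `min.insync.replicas=2`, producers using `acks=all` can keep writing even while one of the three replicas is offline.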
Apache Kafka clients are used in applications that produce and consume messages. Apache Kafka provides a set of producer and consumer APIs that allow applications to send and receive continuous streams of data through the Kafka brokers. These APIs are available natively as Java APIs. However, there are alternatives for other languages such as C++, Python, Node.js, and Go. All of these other language APIs are based on librdkafka, a C library implementation of the Apache Kafka protocol; the other languages are supported as language bindings on top of librdkafka.
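One job the producer API performs is mapping each message key to a partition, which is what preserves per-key ordering. The sketch below illustrates the idea; note that Kafka's real default partitioner uses a murmur2 hash, and the md5-based hash here is only an illustrative stand-in with the same key property.

```python
import hashlib

def partition_for_key(key: bytes, num_partitions: int) -> int:
    """Map a message key to a partition deterministically.

    Kafka's real default partitioner uses murmur2; md5 here is just
    an illustrative stand-in with the same essential property: the
    same key always lands on the same partition.
    """
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# Messages with the same key are always routed to the same partition,
# so all events for one key are consumed in order.
p1 = partition_for_key(b"order-42", 6)
p2 = partition_for_key(b"order-42", 6)
```

Because the mapping depends only on the key and the partition count, every producer instance routes a given key identically.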
Kafka Connect is built on top of the Kafka core components. Kafka Connect includes a set of ready-to-use, off-the-shelf Kafka connectors that you can use to move data between Kafka brokers and other systems. To use a Kafka connector, you do not need to write code or make changes to your applications; Kafka connectors are driven purely by configuration.
Kafka Connect also offers a framework that allows you to develop your own custom source and sink connectors quickly. If a ready-to-use connector is not available for your system, you can leverage the Kafka Connect framework to develop your own.
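For example, the FileStreamSource connector that ships with Kafka can be configured entirely through a small JSON payload submitted to the Connect REST API; no application code is involved. The file path and topic name below are placeholders.

```json
{
  "name": "local-file-source",
  "config": {
    "connector.class": "org.apache.kafka.connect.file.FileStreamSourceConnector",
    "tasks.max": "1",
    "file": "/var/log/app/events.log",
    "topic": "app-events"
  }
}
```

Posting this configuration to a running Connect worker starts streaming new lines from the file into the `app-events` topic.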
Starting from the Kafka 0.10 release, Kafka includes a powerful stream processing library, and that is what we call Kafka Streams. The Kafka Streams library allows Kafka developers to extend their standard applications with the capability to consume, process, and produce new data streams. The Kafka Streams library handles load balancing and failover automatically. To maintain its application state, Kafka Streams uses an embedded RocksDB database, and it is recommended to use persistent SSD disks for the RocksDB storage.
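The role of the embedded RocksDB store can be illustrated with a stand-in: a stream processor that counts events per key and keeps that state in a local persistent store, so the state survives a restart. The sketch below uses SQLite purely as a stand-in for RocksDB; it is not how Kafka Streams is implemented.

```python
import sqlite3

class CountingProcessor:
    """Counts records per key, persisting state in a local store
    (SQLite here stands in for Kafka Streams' embedded RocksDB)."""

    def __init__(self, state_path: str):
        self.db = sqlite3.connect(state_path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS counts (key TEXT PRIMARY KEY, n INTEGER)"
        )

    def process(self, key: str) -> int:
        # Upsert: create the counter on first sight, increment afterwards.
        self.db.execute(
            "INSERT INTO counts (key, n) VALUES (?, 1) "
            "ON CONFLICT(key) DO UPDATE SET n = n + 1",
            (key,),
        )
        self.db.commit()
        return self.db.execute(
            "SELECT n FROM counts WHERE key = ?", (key,)
        ).fetchone()[0]

# In-memory here for brevity; a real deployment would point the state
# directory at SSD-backed storage, as recommended above.
proc = CountingProcessor(":memory:")
for k in ["user-1", "user-2", "user-1"]:
    last = proc.process(k)
```

Keeping the state local to the processing node is what lets a stateful stream application resume quickly after a restart instead of rebuilding its state from scratch.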
What more do you need?
The components that we have discussed so far are good enough for small to medium-sized systems. However, they are not enough to handle enterprise-grade requirements. An enterprise application may need many other tools and facilities. Some examples include the following.
- Schema Management
- REST Interface
- MQTT Interface
- Auto balancing
- Operations and Control
I have tried to cover all of the above, along with the Kafka core components, to show their place in an enterprise-level solution. Let’s quickly cover each of them in reference to the figure shown below.
Schema Management
You might be dealing with some unstructured data. However, in almost every case, at some point in your data pipeline, you will need to extract some structured information out of your unstructured data and use it further down the line. Every system, no matter how simple it is, will have some structured data in its data pipeline. That structured data will also have a schema. Whether you designed your system for schema-on-read or schema-on-write, you will have to deal with the schema.
You have a data pipeline, and your data flows from one system to another system. How do you propagate the schema over the pipeline? Every system that is part of the data pipeline must know the schema. Without the schema, they won’t be able to make any sense of the raw bytes of the data stream. You have two alternatives to solve this problem.
Include the schema with the data
This alternative is not a good option because including the schema with each record will increase the data size significantly. You are not sending a file, so you do not have the opportunity to include a header row that defines the schema for all the other rows. You are sending a continuous stream of records, and attaching the schema to each one is prohibitively wasteful.
Place the schema at a central location
Keeping your schema at a central location is an excellent solution. However, you need to have a schema registry to store and serve your metadata. Your schema registry must be able to support schema evolution and save multiple versions of your evolving schema. The schema registry solution should be tightly integrated with your Kafka application to simplify serialization and deserialization of your messages.
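The size impact of the two alternatives can be sketched with a quick measurement. Below, each record either carries a full copy of its schema or just a small numeric schema id pointing at a central registry; the schema, field names, and id are invented for illustration.

```python
import json

# A hypothetical record schema, for illustration only.
schema = {
    "type": "record",
    "name": "PageView",
    "fields": [
        {"name": "user_id", "type": "string"},
        {"name": "url", "type": "string"},
        {"name": "ts", "type": "long"},
    ],
}
record = {"user_id": "u-123", "url": "/home", "ts": 1690000000000}

# Alternative 1: embed the full schema in every single record.
embedded = json.dumps({"schema": schema, "payload": record}).encode()

# Alternative 2: store the schema centrally once; each record carries
# only a compact schema id (the schema-registry approach).
with_id = json.dumps({"schema_id": 7, "payload": record}).encode()

overhead = len(embedded) - len(with_id)
```

The per-record overhead of embedding grows with schema complexity, while the registry approach pays a fixed few bytes per record regardless of how large the schema becomes.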
REST Interface
Some of your applications might want to leverage the RESTful HTTP protocol for producing and consuming messages to and from your Kafka brokers. It may not be possible to use the native Kafka client APIs with all of your applications, for a variety of reasons. In such cases, you may need to implement an HTTP server that provides a RESTful interface to the Kafka cluster. The RESTful server is typically deployed on a separate set of machines. If you have many producer and consumer applications interacting with the REST interface and you need high availability, it is recommended to deploy multiple REST Proxy servers behind a load balancer. When using the high-level consumer API, it is important that all requests from the same consumer are directed to the same REST Proxy server, so a “sticky” load-balancing policy is recommended.
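Sticky routing can be sketched as a deterministic hash of the consumer instance id over the proxy pool, so a given consumer always reaches the same REST Proxy server. Real load balancers implement stickiness with cookies or source-IP hashing; the helper and host names below are illustrative.

```python
import hashlib

# Hypothetical REST Proxy pool behind the load balancer.
PROXIES = ["rest-proxy-1:8082", "rest-proxy-2:8082", "rest-proxy-3:8082"]

def route(consumer_instance_id: str) -> str:
    """Deterministically pin a consumer instance to one proxy server."""
    h = int.from_bytes(
        hashlib.sha256(consumer_instance_id.encode()).digest()[:4], "big"
    )
    return PROXIES[h % len(PROXIES)]

# Every request from this consumer instance hits the same proxy,
# which is what the high-level consumer API requires.
target = route("my-group.consumer-7")
```

Without stickiness, consecutive requests from one consumer could land on different proxies, each holding a different consumer session, and offsets would be tracked inconsistently.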
MQTT Interface
If you have IoT devices that need to send data to your Kafka brokers, you might need to support the MQTT protocol for your IoT devices to push data. MQTT is a machine-to-machine (M2M) IoT connectivity protocol. It was designed as an extremely lightweight publish/subscribe messaging transport, like Apache Kafka. However, unlike Kafka, MQTT is useful for connections with remote locations where a small code footprint is required and network bandwidth is at a premium. You have the option to implement an MQTT broker and use a Kafka MQTT connector to bring the data into the Kafka brokers. Alternatively, you can implement the Confluent MQTT Proxy to avoid intermediate MQTT brokers.
Auto balancing
You might start with a small Kafka cluster and add new brokers at a later stage to scale your cluster horizontally. In that case, you will need to move some partitions from the existing brokers to the newly added brokers to balance the load across all brokers. You need a solution that handles this requirement in an optimal way and helps you scale your Kafka cluster.
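The rebalancing step amounts to generating a new assignment of partition replicas across the expanded broker set. Kafka ships the kafka-reassign-partitions.sh tool for this; the round-robin sketch below only illustrates the idea, and the topic name is invented.

```python
def reassign(partitions, brokers, replication_factor):
    """Spread each partition's replicas round-robin over all brokers."""
    plan = {}
    for i, p in enumerate(partitions):
        # Stagger the starting broker per partition so leadership
        # and replica load both spread across the whole cluster.
        plan[p] = [
            brokers[(i + r) % len(brokers)] for r in range(replication_factor)
        ]
    return plan

# Four brokers (one newly added), six partitions, three replicas each.
plan = reassign([f"events-{i}" for i in range(6)], [1, 2, 3, 4], 3)
```

After generating a plan like this, the actual data movement is throttled and executed broker-side, so rebalancing does not require application downtime.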
Operations and Control
The Apache Kafka cluster is the core of your streaming infrastructure. You will need an application for monitoring your cluster, production data pipelines, and streaming applications. It is vital to be able to track your data streams end to end, from producers to consumers. You need the capability to verify that every message sent is received, and to measure system performance end to end. You also need information to understand your cluster usage better and identify any problems. You must be able to configure alerts to notify your teams when end-to-end performance does not match SLAs.
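End-to-end verification can be sketched as comparing what producers sent against what consumers received, and deriving latency from per-message timestamps. The delivery records and SLA threshold below are fabricated for illustration; a real monitoring tool would collect them from interceptors or audit topics.

```python
# Each entry: (message_id, produce_ts_ms, consume_ts_ms or None if never received)
delivery_log = [
    ("m1", 1000, 1040),
    ("m2", 1005, 1100),
    ("m3", 1010, None),   # sent but never consumed: a delivery gap
    ("m4", 1020, 1025),
]

# Completeness check: every message sent must eventually be received.
missing = [mid for mid, _, consumed in delivery_log if consumed is None]

# End-to-end latency for the messages that did arrive.
latencies = sorted(c - p for _, p, c in delivery_log if c is not None)
max_latency_ms = latencies[-1]

# Alert when messages go missing or latency breaches the SLA.
SLA_MS = 200
sla_breached = bool(missing) or max_latency_ms > SLA_MS
```

This is exactly the kind of signal that should feed the alerting described above: delivery gaps and latency percentiles, broken down per pipeline.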