Chapter 5 - Why Messaging System
In the earlier chapter, we evaluated database tables to provide the core infrastructure for real-time data integration among the applications. In later part of the lesson, we discovered that the log files might be a better representation of streams rather than the tables, and hence we resolved to Apache Kafka. Apache Kafka claims a central position in the event streaming architecture, and we also learned that Apache Kafka implements a messaging system or the pub/sub semantics. However, we do not know why? Before we start learning Kafka architecture, it makes sense to quickly refresh the main idea of a messaging system and highlight why and how Kafka implements the same notion.
Data Integration Problem
We cannot develop a single standalone application that does everything in an enterprise. Applications can be custom designed, or they can be purchased from the third-party vendors. Some of the applications may be running outside of the organization and maintained by the partners or the service providers. These applications may be running on multiple computers, on different platforms, using different technologies and could be geographically dispersed. We need to integrate these applications together to achieve a unified set of functionalities. The point is straightforward. Data integration among the systems is not a new requirement. However, when you are transforming your organization to respond to event streams, the data integration becomes more challenging.
Decision Criteria
Like any technological decision, real-time data integration requires a range of considerations and consequences. Let’s talk about some critical decision criteria.
- Time sensitivity of events
- Decoupling of applications
- Data format evolution and extensibility
- Reliability
- Scalability
One of the critical requirements of real-time data integration is to make data
available to other applications in a time-critical manner. The simplest method to
accomplish real-time integration is by exchanging data as frequently as possible
and do it in small chunks.
The next requirement is to achieve decoupling of applications. Strong coupling
of applications is based on some assumptions about how the other application works.
These assumptions create dependencies and hurdles in evolving the apps over time.
Therefore, the integration should be general enough to allow both the applications
to change as needed. The integration goal should be to minimize the dependency
among data producers and data consumers and yet achieve the integration needs.
The data format is another critical problem. We can’t change existing
applications to use a unified data format. We may need the flexibility to add an
intermediate translator to unify the applications that insist on different data
formats. The data producers may also evolve over time, and we need the flexibility
to support the evolution and extensibility of the data formats as the applications,
and the business need evolves.
Different applications may be scattered on discrete computers across
geographies. Establishing a synchronous connection or making a remote procedure
call may not be a reliable option as the systems may not be available all the time
or the network may be temporarily unavailable. These problems can be simplified by
using asynchronous communication and placing a reliable middleman to facilitate
communication.
Scalability is obvious. We are living in the big data era, and we want our
systems to be horizontally scalable. You may start with the couple hundred events a
day, but as the business grows, the solution should be scalable to supports
millions and trillions of events.
A good data integration solution should at least meet these five decision
criteria. Let’s evaluate some of the available options.
Integration Alternatives
There are many ways of integrating applications for the data exchange. However, all those approaches can be broadly grouped into four main patterns.
- Remote procedure invocation
- Shared database
- File transfer
- Messaging
RMI or RPC is mainly used to trigger some action using APIs at the other
application and provide the necessary data for the operation. The API is typically
offered by the receiver application, and it needs to be implemented by the sender
application. This approach creates a tight coupling among the applications. We
already evaluated this option in the earlier chapter and realized that it fails on
the decoupling and scalability parameters.
The next pattern is a shared database. We have partially evaluated this option
in the earlier chapter. In this approach, all the data producers write to a shared
database, and all the data consumers read from the same database. This option fails
on many parameters such as data format evolution and extensibility, scalability,
and reliably supporting applications dispersed across the locations.
Designing a unified schema that meets the need for all the participating
applications is a practically impossible task. This problem becomes more prominent
if you need to integrate with packaged applications or the external applications
that are not flexible enough to adapt to the shared schema design.
Multiple applications simultaneously reading and modifying the same data will
bottleneck the database performance. When systems are distributed across various
locations, accessing a single, shared database is not practical.
File transfer is the most straightforward pattern for implementing data
integration. The data producer application produces a file, and all the data
consumer applications are supposed to consume the data files. The most obvious
example is the export-import operations. These files could be XML, JSON or maybe a
simple delimited file formats such as CSV. However, this approach is obviously not
going to work for real-time requirements, and they fail on the time-sensitivity
parameter. Generating files and handling them on-time may get you into many other
trade-offs.
The final data integration pattern is the messaging system. Apache Kafka
adopted a messaging system pattern to solve the real-time data integration problem.
In the next section, we will try to understand the messaging system in general.
Rest of the chapter will talk about the Kafka architecture and how Kafka meets
those five decision criteria that we discussed in this section.
Enter the Messaging System
As the name applies, a messaging system is all about sending small chunks of data from one system to other systems. For our retail example, a single invoice could be considered as a message. When we talk about messages in this context, we mean those events that we discussed in the earlier chapter. It could be a JSON record or something similar. There are two approaches to messaging.
- Point to Point
- Publish/Subscribe or pub/sub
A point-to-point (PTP) messaging system is built on the concept of message queues, senders, and receivers. Each message is addressed to a specific queue, and receiving clients extract messages from their designated queues. These queues retain all messages sent to them until the messages are consumed or expire. PTP messaging has the following characteristics.
- Each message has only one consumer.
- Sender and receiver are not dependent on each other.
- The receiver acknowledges the successful receipt of a message.
As the name suggests, PTP is more suitable for the one-to-one scenario. Apache
Kafka rejected PTP approach because they wanted to address many-to-many
requirement.
A publish/subscribe (pub/sub) messaging system is built on the concept of
publishers, broker, topic, and subscribers. Let’s try to understand the components
of a pub/sub messaging system.
Publisher
Any application that sends data should act as a publisher. For our retail example, the POS application works as a publisher, and it sends the invoices for other apps. People use different names such as data publisher, data producer, and sender, however, it means the same thing – a publisher.
Subscriber
An application that reads data sent by the publisher is a subscriber. For our retail example, the services such as shipment and loyalty are subscribers. They read invoices that are posted by the POS systems. You might notice people referring to them as data consumers or receivers; however, all that means the same thing – a subscriber.
Broker
The broker is at the heart of the pub/sub messaging system. The broker is
responsible for receiving messages from the publishers, storing them into a log
file, and sending the messages to the subscribers. The broker acts as a middleman
between publishers and the subscribers.
Apache Kafka is a message broker, and it offers two sets of APIs.
- Producer API
- Consumer API
Any application that wants to send a message should use producer API to send the data to Apache Kafka. Kafka receives the message, sends acknowledgment and persists the data into a log file. When a consumer application wants to read the message, they use consumer API to read the messages from the broker.
Topic
The topic is the mechanism to categorize the messages. If you are coming from the database world, you can think of the topic as a table name. The producer always writes the message to a topic, and the consumer reads it from the topic. The broker creates a log file for each topic. When a producer sends a message, it specifies the topic name for the message and accordingly the broker persists the message in the corresponding log file. When a consumer wants to read the message, they consume messages from the specified topic.
Kafka in the data ecosystem
Apache Kafka can become a circulatory system of your data ecosystem that carries messages to various members of the infrastructure. It occupies a central place in your real-time data integration infrastructure.
The data producers can send data as messages, and they can send it to Kafka broker
as quickly as the business event occurs. Data consumers can consume the messages
from the broker as soon as they arrive at the broker. With careful design, the
messages can reach from producers to consumers in milliseconds.
The producers and consumers are completely decoupled, and they do not need a
tight coupling or direct connections. They always interact with the Kafka broker
using a consistent interface. Producers do not need to be concerned about who is
using the data, and they just send the data once without caring about how many
consumers would be reading the data. Producers and consumers can be added, removed,
and modified as the business case evolves.
When coupled with a system to provide message schema service and a connector
service, the producer and consumers do not need to agree on a unified schema. The
producer application can expose the data in the format they desire. You can always
place a connector that handles the translation process for the consumer. With the
flexibility to put the translator in between, you do not need to ensure that the
consumer understands the producer message format. Message schema service and the
connector framework are add-on features of Apache Kafka infrastructure. We will
cover them in a separate chapter.
The above discussion qualifies pub/sub messaging systems for the first three
decision criteria that we listed in the earlier section. However, we still need to
ensure reliability and scalability. These two features are offered by Kafka
architecture and design. Apache Kafka is not a traditional pub/sub messaging
system. It is a distributed, fault tolerant and highly scalable system that is
designed to provide the backbone infrastructure for real-time streaming
requirements. We will cover the reliability and scalability in the next chapter.
Summary
In this lesson, we listed out some critical requirements for an acceptable data integration solution. Then I talked about four high-level alternatives and evaluated all of them against the decision criteria. Finally, I introduced you to the messaging systems and explained the point-to-point as well as pub/sub semantics. Then we went ahead and evaluated the pub/sub messaging systems against our requirement for the real-time data integration solution. We concluded this chapter by realizing that the pub/sub could be an excellent alternative for real-time data integration. In the next chapter, we will talk about Kafka architecture and design to understand how Kafka provides reliability and scalability to the real-time streaming platform.
You will also like:
What is Programming and Programming Language
"PROGRAMMING" is the word which we all come across quite often. Let us see what that means.
Anonymous Functions
Learn Scala Anonymous Functions with suitable examples.
Statements and Expressions
Statements and Expressions in Scala. How are they different?
Artificial Intelligence and Machine Learning
With AI and Machine Learning evolving day by day, let us stay updated with the technology.
Pattern Matching
Scala takes the credit to bring pattern matching to the center.