Installing Multi-node Kafka Cluster

Transcript and Commands

Hello and welcome to Kafka Streams – Real-time stream processing! at Learning Journal. This video complements my book on Kafka Streams. You can get more details about the Kafka Streams Book here.

The book includes several code examples. If you want to follow along and try those examples yourself, you will need to set up a small Kafka cluster and some other tools, such as an IDE and a build tool. This video is based on the book’s Appendix A – Installing Kafka Cluster, and it provides detailed instructions for setting up the exact environment that was used to create and test the examples in the book. In this video, we will create a three-node Kafka cluster in the cloud. I will be using Google Cloud Platform to create three Kafka nodes and one Zookeeper server, so you will need four Linux VMs to follow along. We will be using the CentOS 7 operating system on all four VMs.
Great! Let’s start. We will follow a four-step process.


  1. Preparing the VMs for Kafka
  2. Configuring the Zookeeper
  3. Configuring Kafka Brokers
  4. Testing Kafka Installation

In the first step, we will create VMs in Google Cloud Platform and prepare them to run Kafka processes. The overall process remains the same on physical machines and on other cloud platforms. All you need is four CentOS 7 machines with sudo privileges.
In the second step, we will configure and start the Zookeeper server on one machine. We will also test the Zookeeper server process.
The third step is to configure and start Kafka brokers on three different machines. We will also configure the VMs so that the brokers start automatically after a reboot.
The fourth and final step is to restart all the machines and perform a sanity check for all the services.
That’s all. Once this is done, you will have the exact cluster environment that I will be using throughout the book to execute and test my examples.

Preparing the VMs for Kafka

Let’s start with the first step. Create four VMs in GCP. You can choose whatever names you want; I am naming them Kafka-0, Kafka-1, Kafka-2, and zookeeper. You can select the nearest zone location. I am creating them in the Mumbai data centre. Select your CPU and memory configuration. I think a single CPU core with 1.7 GB of RAM is good enough to start with. You can increase these resources later without reconfiguring your cluster. We want to use CentOS 7 as our base operating system. Let’s take a 10 GB disk on each machine.
Repeat the same process until you have all four VMs.
The first thing that we need on these four VMs is JDK 1.8. I will install OpenJDK for the sake of simplicity.
Let’s do that on all four machines. SSH to your VM and execute the yum command.

                                        
    sudo yum -y install java-1.8.0-openjdk
                    

Repeat the same on the other three VMs.
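If you want to confirm that the installation worked before moving on, a small check along these lines could help. This is only a sketch and not part of the book’s setup: the helper name is_java8 is made up for illustration, and it assumes the version banner contains text like version "1.8.0_181".

```shell
# Succeeds when the `java -version` banner read from stdin reports a 1.8
# runtime, for example: openjdk version "1.8.0_181"
is_java8() {
    grep -q 'version "1\.8\.'
}

# java -version writes its banner to stderr, so 2>&1 is needed to pipe it.
if command -v java > /dev/null 2>&1; then
    if java -version 2>&1 | is_java8; then
        echo "JDK 1.8 found"
    else
        echo "java is installed, but it does not report version 1.8"
    fi
else
    echo "java is not on the PATH"
fi
```

Running it on each VM should print one of the three messages, which tells you whether the yum step needs to be repeated there.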
Great! Now we want to download Apache Kafka binaries. I will need the wget tool to download anything on the VMs. So, let’s install wget. Execute the yum command. Simple, isn’t it?

                                        
    sudo yum -y install wget
                    

Repeat the same on all the four VMs.
Now you are ready to download Apache Kafka binaries. You can get the download link from the Apache Kafka mirrors website.

Copy one of the mirror URLs and download the Kafka binaries using the wget command.

                                        
    wget http://redrockdigimark.com/apachemirror/kafka/2.0.0/kafka_2.12-2.0.0.tgz 
                    

Done. Let’s uncompress the binaries. You can use the tar command.

                                        
    tar -xzf kafka_2.12-2.0.0.tgz 
                    

Repeat the same on all four VMs.
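Since every preparation step is repeated on all four VMs, you could script the repetition instead of logging in to each machine by hand. The sketch below assumes plain SSH access and uses hypothetical host names (kafka-0, kafka-1, kafka-2, zookeeper); run_on_all and DRY_RUN are illustrative names, not anything from the book. By default it only prints the commands it would run.

```shell
# Hypothetical host names; replace them with addresses your SSH client can reach.
VMS="kafka-0 kafka-1 kafka-2 zookeeper"

# Print (DRY_RUN=1, the default) or execute one command on every VM over SSH.
run_on_all() {
    for vm in $VMS; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "ssh $vm '$1'"
        else
            ssh "$vm" "$1"
        fi
    done
}

# Dry run: shows the four ssh commands without connecting anywhere.
run_on_all "tar -xzf kafka_2.12-2.0.0.tgz"
```

Setting DRY_RUN=0 before the call would actually execute the command on each host, so it is worth reading the dry-run output first.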
Let’s take a quick look at the uncompressed folder. There are two main directories that we will be referring to throughout this video: the bin directory and the config directory.
The bin folder holds all executables such as various Kafka and Zookeeper tools.
The config directory holds two main configuration files.


  1. zookeeper.properties
  2. server.properties

We define all Zookeeper configurations in the zookeeper.properties file.
And all Kafka broker configurations are defined in the server.properties file.
Great! We will be executing many commands that reside in the bin directory. I don’t want to include directory names all the time when I am running a Kafka command or a Zookeeper command. So, let’s add the bin directory to our PATH environment variable.
Open the .bash_profile and add the Kafka bin directory to your PATH.

                                        
    PATH=$PATH:$HOME/.local/bin:$HOME/bin:$HOME/kafka_2.12-2.0.0/bin
                                    

Repeat the same on all four VMs. Great! Your VMs are ready to start the actual configuration.
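If you would rather not edit .bash_profile by hand on four machines, the PATH change could be scripted roughly like this. It is a sketch under assumptions: add_kafka_to_path is a made-up helper, and instead of editing the existing PATH line it appends a new one, skipping the change when the directory is already mentioned in the file.

```shell
# Append the Kafka bin directory to PATH in the given profile file, unless a
# line mentioning that directory is already there.
add_kafka_to_path() {
    profile=$1
    kafka_bin=$2
    touch "$profile"
    if ! grep -qF -- "$kafka_bin" "$profile"; then
        # The single-quoted format keeps $PATH literal in the profile file.
        printf 'PATH=$PATH:%s\nexport PATH\n' "$kafka_bin" >> "$profile"
    fi
}

# Typical call on each VM. The single quotes keep $HOME unexpanded, so the
# profile line matches the one shown in the transcript:
# add_kafka_to_path "$HOME/.bash_profile" '$HOME/kafka_2.12-2.0.0/bin'
```

Because the helper checks before appending, running it twice leaves a single entry. Log out and back in (or source the profile) for the change to take effect.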
Let’s move on to the next step. Configure and start the Zookeeper server.

Configuring the Zookeeper

Apache Kafka needs Zookeeper. In a production environment, you would want to configure a Zookeeper cluster, known as a Zookeeper ensemble. However, for development activities, you can set up a single Zookeeper instance. I am keeping Zookeeper on a separate node because all my VMs are tiny machines with a single CPU core and less than 2 GB of RAM.
So, SSH to your zookeeper machine. We do not need to download Zookeeper separately. The Kafka download also includes a copy of Zookeeper.
The first thing is to check out the Zookeeper configuration file. Let’s open the zookeeper.properties file.

                                        
    vi $HOME/kafka_2.12-2.0.0/config/zookeeper.properties
                                    

The only configuration that I want to change is the data directory. The default value is specified as a key-value pair. If you want, you can use the default location. However, I am going to change it to some other appropriate location.

                                        
    dataDir=/home/prashant/zookeeper_data 
                                    

That’s all. We don’t want to change or add any other configuration.
Save the file.
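The same edit can be made without opening an editor. The helper below is a sketch, not the method used in the video: set_property is a made-up name, and it assumes a simple key=value file where the value contains no | or & characters.

```shell
# Set key=value in a properties file: rewrite the line if the key is already
# present, otherwise append it. The key is used as a regular expression, so a
# dot in it matches any character.
set_property() {
    file=$1
    key=$2
    value=$3
    if grep -q "^${key}=" "$file"; then
        # Write to a temporary file first, then replace the original.
        sed "s|^${key}=.*|${key}=${value}|" "$file" > "$file.tmp" && mv "$file.tmp" "$file"
    else
        printf '%s=%s\n' "$key" "$value" >> "$file"
    fi
}

# Usage on the Zookeeper VM:
# set_property "$HOME/kafka_2.12-2.0.0/config/zookeeper.properties" dataDir "$HOME/zookeeper_data"
```

The same helper could later be reused for server.properties on the broker machines.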
Let me create the Zookeeper data directory.

                                        
    mkdir $HOME/zookeeper_data 
                                    

Great! We are ready to start Zookeeper server. Starting Zookeeper is straightforward. All you need to do is to execute the zookeeper-server-start.sh and provide the zookeeper.properties as an argument.

                                        
    $HOME/kafka_2.12-2.0.0/bin/zookeeper-server-start.sh /home/prashant/kafka_2.12-2.0.0/config/zookeeper.properties 
                                    

Great! My Zookeeper server is running. Press CTRL+C to terminate the process. Now I am confident that the configuration is good and the server starts with no issues. For my day-to-day comfort, I want to place the Zookeeper start command in the rc.local file and enable the rc-local service through systemctl, so that Zookeeper starts automatically whenever I start the VM. Let’s do that.
Open your /etc/rc.d/rc.local file and place the start command at the bottom of the file. Make sure to specify the full path.

                                        
    /home/prashant/kafka_2.12-2.0.0/bin/zookeeper-server-start.sh /home/prashant/kafka_2.12-2.0.0/config/zookeeper.properties > /dev/null 2>&1 &
                                    

The trailing part of the command redirects standard output and standard error to /dev/null and runs Zookeeper in the background.
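If the redirection syntax is unfamiliar, the following self-contained demo may help. It does not touch Zookeeper at all; noisy_server is a made-up stand-in for any process that writes to both output streams.

```shell
# Stand-in for a server process that logs to both output streams.
noisy_server() {
    echo "log line on stdout"
    echo "log line on stderr" >&2
}

# "> /dev/null" sends stdout to /dev/null first; "2>&1" then points stderr at
# the same place, so nothing is left for the command substitution to capture.
silenced=$(noisy_server > /dev/null 2>&1)

# With the order reversed, stderr is attached to the original stdout before
# stdout is redirected, so the stderr line still gets through.
leaky=$(noisy_server 2>&1 > /dev/null)

# The trailing "&" runs the command in the background; $! holds its process id.
noisy_server > /dev/null 2>&1 &
server_pid=$!
wait "$server_pid"

echo "silenced: '$silenced'"
echo "leaky: '$leaky'"
```

The first variable ends up empty and the second one holds the stderr line, which is why the order of the two redirections in the rc.local command matters.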
Great! Save the file and give the execute permission to your rc.local.

                                        
    sudo chmod +x /etc/rc.d/rc.local
                                    

You also need to enable the rc-local service with systemctl.

                                        
    sudo systemctl enable rc-local
                                    

Finally, start your rc-local service.

                                        
    sudo systemctl start rc-local
                                    

Great! We are done. Do you want to test your Zookeeper server?
Let’s execute a Zookeeper shell command.

                                        
    kafka_2.12-2.0.0/bin/zookeeper-shell.sh 10.160.0.5:2181 ls /brokers/ids
                                    

This command should report that the node does not exist. That is expected because no broker has registered yet, and the reply still tells us that the Zookeeper server is responding. Once you start your Kafka brokers, the same command will give you a list of active Kafka brokers.
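A lighter check, which you could run from any of the broker VMs, is to test whether the Zookeeper port accepts TCP connections. This is a sketch under assumptions: port_open is a made-up helper, and it relies on bash’s /dev/tcp feature and the coreutils timeout command being available.

```shell
# Succeeds if a TCP connection to host:port can be opened within 3 seconds.
# Uses bash's /dev/tcp pseudo-device; errors from a refused connection are hidden.
port_open() {
    timeout 3 bash -c "exec 3<>/dev/tcp/$1/$2" 2> /dev/null
}

# Usage with the Zookeeper address from this walkthrough:
# if port_open 10.160.0.5 2181; then echo "Zookeeper port is reachable"; fi
```

A successful connection only shows that something is listening on the port, so the zookeeper-shell.sh command remains the more meaningful test.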
Great! We are done with step two. Do not perform the zookeeper configuration on any other node. We need the Zookeeper on a single machine. Right?
Great! The next step is to configure the Kafka broker on the remaining three nodes.


Configuring Kafka Brokers

Unlike Zookeeper, we will be changing and adding quite a few configuration properties for the Kafka brokers.
Let’s take a quick look at the main configuration properties that we want to change in the server.properties file on each Kafka node.
I have prepared a table with the details. Let me quickly walk you through the configurations.

Fig.1 - Configuring Kafka Brokers

The first property is the broker ID (broker.id). Every Kafka broker needs a unique ID. We will set this value to zero, one, and two for the three brokers.
The next item is the broker rack name (broker.rack). This property specifies the rack name of the broker, and it is used in rack-aware replica assignment for fault tolerance. We want the first two brokers to be part of RACK1 and the third one to be part of RACK2.
The next item is the log directory location (log.dirs). This location is the base directory where the Kafka broker stores partition replicas. You can keep the default value or change it to some other appropriate directory. We want to change it, and we also need to make sure that the directory already exists. So, I will create this directory on all the broker machines.
The next one is the number of partitions for the offsets topic (offsets.topic.num.partitions). Kafka internally creates a topic to store consumer offsets, and this configuration controls the number of partitions for that topic. The default value is 50, which is more than a dev environment needs. We want to change it to a lower value.
The next property is the replication factor for the offsets topic (offsets.topic.replication.factor). The default value is three, and we want to bring it down to two.
The next one is the minimum number of replicas in the ISR list (min.insync.replicas). I have talked about all these configurations in my book. The default value is one, and we want to change it to two.
The next one is the default replication factor for automatically created topics (default.replication.factor). We want to set this value to two.
Finally, the most essential configuration: the Zookeeper connection details (zookeeper.connect). This property specifies the Zookeeper hostname or IP and the port number. We have started Zookeeper on one of the VMs, and hence the value for this property should be the host_ip:port of that machine.
Good. Let’s create the data directory.

                                        
    mkdir /home/prashant/kafka_data
                                    

Now I can go ahead and modify the server.properties file for the first broker.
The broker ID is already zero. Let me add the broker rack here.
The next item is the log directory. Let me change it to the directory that I just created.
Let me change and add all the topic defaults at this place.
Good. The last one is the Zookeeper connection details.
That’s all. We are done with the configurations.
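To make the walkthrough concrete, the settings for each broker might look like the output of the sketch below. It is not the table from the video: broker_settings is a made-up helper, the Zookeeper address is the one used in the commands elsewhere in this transcript, and the value for offsets.topic.num.partitions is a placeholder because the transcript only says "a lower value".

```shell
# Print the server.properties settings discussed above for broker 0, 1, or 2.
broker_settings() {
    id=$1
    # Brokers 0 and 1 share RACK1; broker 2 sits in RACK2.
    if [ "$id" -eq 2 ]; then rack=RACK2; else rack=RACK1; fi
    echo "broker.id=$id"
    echo "broker.rack=$rack"
    echo "log.dirs=/home/prashant/kafka_data"
    # Assumption: 3 is only a placeholder for the unspecified lower value.
    echo "offsets.topic.num.partitions=3"
    echo "offsets.topic.replication.factor=2"
    echo "min.insync.replicas=2"
    echo "default.replication.factor=2"
    echo "zookeeper.connect=10.160.0.5:2181"
}

# Show the settings for the first broker.
broker_settings 0
```

The server.properties file that ships with Kafka already defines some of these keys, so it is cleaner to change those lines in place than to append duplicates at the end of the file.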
I am ready to start the broker. Starting the broker is as simple as executing kafka-server-start.sh with server.properties as an argument.
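The transcript does not show this command on screen, so here is a sketch of what the foreground start might look like, based on the paths used elsewhere in this walkthrough. The wrapper function start_broker is a made-up convenience that checks the files exist before launching.

```shell
# Start a broker in the foreground from the given Kafka directory.
# Returns 1 with a message if the start script or the config file is missing.
start_broker() {
    kafka_home=$1
    script="$kafka_home/bin/kafka-server-start.sh"
    config="$kafka_home/config/server.properties"
    if [ ! -x "$script" ] || [ ! -f "$config" ]; then
        echo "Kafka not found under $kafka_home" >&2
        return 1
    fi
    "$script" "$config"
}

# Usage on a broker VM (runs until you press CTRL+C):
# start_broker "$HOME/kafka_2.12-2.0.0"
```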
I don’t see any error messages, and my broker is running. Let me shut it down. Press CTRL+C.
Now I want to place the broker start command in the rc.local file and configure it to autostart as we did for the Zookeeper. Right?
Let’s do that. Open your rc.local file. Add the Kafka server start command at the bottom of the file. Once again, make sure to specify the full path. Let me redirect the standard output and standard error to /dev/null and run it as a background process.

                                        
    /home/prashant/kafka_2.12-2.0.0/bin/kafka-server-start.sh /home/prashant/kafka_2.12-2.0.0/config/server.properties > /dev/null 2>&1 &
                                    

What’s next? You already know that, right? We did it for the Zookeeper.
Give execute permission to your rc.local file.

                                        
    sudo chmod +x /etc/rc.d/rc.local
                                    

Enable your rc-local service with systemctl.

                                        
    sudo systemctl enable rc-local
                                    

Start your rc-local service.

                                        
    sudo systemctl start rc-local
                                    

Done. Repeat the same steps for all other brokers. I am doing it for three brokers. However, if you want, you can set up five or ten brokers by simply following the same steps.
Once you finish configuring all the brokers, stop all the VMs.
Now, I start the Zookeeper VM first. Once that is up and running, I start all the broker VMs. This stop and restart helps me verify that all the server processes start automatically.
Great! The final step. Test your cluster.

Testing Kafka Installation

SSH to one of the machines.
Execute Zookeeper shell and check the list of active broker IDs.

                                        
    kafka_2.12-2.0.0/bin/zookeeper-shell.sh 10.160.0.5:2181 ls /brokers/ids
                                    

Easy, isn’t it? We have three broker IDs: zero, one, and two. All three are active.
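If you want to automate this sanity check, the broker list can be pulled out of the shell output. The sketch below assumes that zookeeper-shell.sh prints the children as a bracketed list such as [0, 1, 2] on its own line, surrounded by connection messages; broker_ids is a made-up helper name.

```shell
# Read zookeeper-shell.sh output on stdin and print the broker ids from the
# last line that looks like a bracketed list of numbers.
broker_ids() {
    grep -E '^\[[0-9, ]*\]$' | tail -n 1 | sed 's/[][,]//g'
}

# Usage against the Zookeeper address from this walkthrough:
# ids=$(zookeeper-shell.sh 10.160.0.5:2181 ls /brokers/ids 2>&1 | broker_ids)
# [ "$ids" = "0 1 2" ] && echo "all three brokers are registered"
```

Comparing the result with the expected ids gives a quick pass or fail after every restart of the VMs.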
Let’s create a new topic.
You can use kafka-topics.sh. The first option is the create option, then the zookeeper details, replication factor, number of partitions, and finally the topic name.

                                        
    kafka-topics.sh --create --zookeeper 10.160.0.5:2181 --replication-factor 3 --partitions 3 --topic test
                                    

Now you can list the topics. Again kafka-topics.sh. The first option is the list option, then the zookeeper coordinates.

                                        
    kafka-topics.sh --list --zookeeper 10.160.0.5:2181
                                    

Now, since we have a topic, let’s start a console producer and send some messages. We will use kafka-console-producer.sh and give it at least one broker IP and port, followed by the name of the topic to which we want to send the messages.

                                        
    kafka-console-producer.sh --broker-list 10.160.0.2:9092 --topic test
                                    

Start typing some messages. Press CTRL+C to exit.
Now the last one. Start a console consumer and check the messages that we sent from the producer. We will use kafka-console-consumer.sh and give it at least one broker address to bootstrap the consumer, the topic name, and the offset from where we want to start reading the messages. This is the first time I am reading the topic, so let’s start from the beginning.

                                        
    kafka-console-consumer.sh --bootstrap-server 10.160.0.2:9092 --topic test --from-beginning
                                    

I can see all the messages. That’s all.
In this video, we created one Zookeeper server and three Kafka brokers, configured all of them to start automatically, and tested all our services.
Thank you very much. Please visit www.learningjournal.guru for the latest technology books and self-paced video training.
Keep learning and keep growing.

Author: Prashant Pandey

