LogScale on Bare Metal - Installing Apache Kafka Cluster
LogScale uses Apache Kafka to manage ingest processing and inter-node communication within the LogScale cluster. LogScale recommends a minimum 3-node installation.
When deploying a Kafka cluster that requires ZooKeeper, each Kafka and ZooKeeper node needs a unique host ID number.

For Kafka this is the `broker.id` configuration value in the `server.properties` file:

```ini
# The id of the broker. This must be set to a unique integer for each broker.
broker.id=1
```
For ZooKeeper this is a file called `myid` in the ZooKeeper data directory that contains the node ID number. To create it, echo the number to the file:

```shell
$ echo 1 > /kafka/zookeeper/myid
```
When creating a multi-node Kafka cluster these numbers must be unique for each host:
| Host | Kafka `broker.id` | ZooKeeper `myid` |
|---|---|---|
| kafka1 | 1 | 1 |
| kafka2 | 2 | 2 |
| kafka3 | 3 | 3 |
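As a quick sanity check, the two IDs on a node can be compared with a short script. This is a sketch run against throwaway files; on a real node, point `props` at `/opt/kafka/config/server.properties` and `myid_file` at `/kafka/zookeeper/myid`:

```shell
# Sanity check that a node's Kafka broker.id matches its ZooKeeper myid.
# Temp files stand in for the real config paths (assumption for illustration).
props=$(mktemp)
myid_file=$(mktemp)
echo "broker.id=1" > "$props"          # as set in server.properties
echo "1" > "$myid_file"                # as written to /kafka/zookeeper/myid
broker_id=$(grep '^broker.id=' "$props" | cut -d= -f2)
node_id=$(cat "$myid_file")
if [ "$broker_id" = "$node_id" ]; then
  echo "IDs match: $broker_id"
else
  echo "MISMATCH: broker.id=$broker_id myid=$node_id"
fi
```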
LogScale on Bare Metal - Apache Kafka Server Preparation
We recommend installing on Ubuntu, at least version 18.04. Before installing Kafka, make sure the server is up to date:

```shell
$ apt-get update
$ apt-get upgrade
```
Create a non-administrative user named `kafka` to run Kafka:

```shell
$ adduser kafka --shell=/bin/false --no-create-home --system --group
```
Add this user to the `DenyUsers` section of each node's `/etc/ssh/sshd_config` file to prevent it from being able to ssh or sftp into the node. Restart the sshd daemon after making the change. Once the system has finished updating and the user has been created, Kafka can be installed.
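For illustration, the sketch below appends the entry to a temporary copy of the file, since editing the real `/etc/ssh/sshd_config` requires root:

```shell
# Append a DenyUsers entry for the kafka user to a copy of sshd_config.
cfg=$(mktemp)                       # stand-in for /etc/ssh/sshd_config
echo "DenyUsers kafka" >> "$cfg"
grep '^DenyUsers' "$cfg"
# On a real node, validate the config with:   sshd -t
# then restart the daemon with:               systemctl restart sshd
```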
LogScale on Bare Metal - Apache Kafka Installation
To install Kafka and ZooKeeper:
Go to the `/opt` directory and download the latest release. The package can be downloaded using wget:

```shell
$ cd /opt
$ wget https://downloads.apache.org/kafka/3.7.0/kafka_2.13-3.7.0.tgz
```
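Optionally, verify the archive before extracting it; Apache publishes a checksum file alongside each release. Note that the published file's format varies, so `sha512sum -c` may not accept it directly, in which case compare the digests by eye. The sketch below demonstrates the mechanics on a throwaway file:

```shell
# Demonstrate a SHA-512 integrity check on a locally generated file.
# On a real node, fetch the checksum file published next to the archive
# on downloads.apache.org and compare its digest against your download.
f=$(mktemp)
echo "example archive contents" > "$f"
sha512sum "$f" > "$f.sha512"        # stand-in for the published checksum file
result=$(sha512sum -c "$f.sha512")
echo "$result"                      # reports "<file>: OK" when intact
```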
Extract the archive:

```shell
$ tar zxf kafka_2.13-3.7.0.tgz
```
Now create the directories where the information will be stored. We will use the top-level directory `/kafka` since that could be a mount point for a separate filesystem. We will also create a directory for application log files in `/var/log/kafka`:

```shell
$ mkdir /var/log/kafka
$ mkdir /var/log/zookeeper
$ mkdir -p /kafka/kafka
$ mkdir -p /kafka/zookeeper
$ chown kafka:kafka /var/log/kafka /var/log/zookeeper
$ chown kafka:kafka /kafka/kafka
$ chown kafka:kafka /kafka/zookeeper
```
Now link the application directory to `/opt/kafka`, which allows us to use `/opt/kafka` for the application and scripts, but update the version by downloading and relinking to the updated application directory:

```shell
$ ln -s /opt/kafka_2.13-3.7.0 /opt/kafka
```

Using a text editor, open the Kafka properties file, `server.properties`, located in the `kafka/config` sub-directory. The following options should be set for optimal configuration:

```ini
broker.id=1
log.dirs=/kafka/kafka
delete.topic.enable = true
```
The first line sets the `broker.id` value to match the server number (in the `myid` file) set when configuring ZooKeeper. The second sets the data directory. The third line should be added to the end of the configuration file. Save the file and change the owner of the installation to the `kafka` user:

```shell
$ chown -R kafka:kafka /opt/kafka_2.13-3.7.0
```

Modify the directory name according to the version of Kafka that has been installed. Note that changing the ownership of the link `/opt/kafka` doesn't change the ownership of the files in the directory, which is why the command targets the versioned directory.

If deploying a multi-node Kafka cluster, make sure that each node can resolve the hostname of each other node in the cluster. One way to achieve this is to edit the `/etc/hosts` file on each node with the host information:

```ini
192.168.1.15 kafka1
192.168.1.16 kafka2
192.168.1.17 kafka3
```
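Each node's hosts file can be checked with a short loop. Below, a temporary file stands in for `/etc/hosts`; on a real node, `getent hosts kafka1` (and so on) confirms resolution through whatever name service is actually configured:

```shell
# Check that every cluster hostname appears in a hosts file.
hosts_file=$(mktemp)                # stand-in for /etc/hosts
printf '192.168.1.15 kafka1\n192.168.1.16 kafka2\n192.168.1.17 kafka3\n' > "$hosts_file"
status=$(for h in kafka1 kafka2 kafka3; do
  grep -qw "$h" "$hosts_file" && echo "$h: ok" || echo "$h: MISSING"
done)
echo "$status"
```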
Important

Be aware that in some Linux distributions, the hosts file may contain a line that by default resolves the hostname to a loopback address such as `127.0.1.1`. This will cause servers to listen only on the localhost address and therefore be inaccessible to other hosts on the network. In this case, change the line:

```ini
127.0.1.1 kafka1 kafka1
```

updating the IP address to the public address of the host.
To configure the properties for ZooKeeper, edit the `config/zookeeper.properties` file with the following options:

```ini
dataDir=/kafka/zookeeper
clientPort=2181
maxClientCnxns=0
admin.enableServer=false
server.1=kafka1:2888:3888
server.2=kafka2:2888:3888
server.3=kafka3:2888:3888
4lw.commands.whitelist=*
tickTime=2000
initLimit=5
syncLimit=2
```
The `server.1`, `server.2` and `server.3` lines configure the hostname and host-to-host ports used to communicate. The number in each `server.N` entry must match the `myid` file value (and the `broker.id` Kafka configuration) on the corresponding host.

The last three lines are required by ZooKeeper in multi-node configurations to set the timing interval for communicating with the other hosts and the time limit before reporting an error.
Tip
This file can be copied to each node running ZooKeeper, as there are no node-specific configuration settings.
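Since the file is identical on every node, the `server.N` quorum lines can also be generated from a host list, which keeps the three copies consistent. A minimal sketch, assuming the hostnames used in this guide:

```shell
# Build the server.N quorum lines for zookeeper.properties from a host list.
hosts="kafka1 kafka2 kafka3"        # hostnames from this guide's example
n=1
quorum=""
for h in $hosts; do
  quorum="${quorum}server.$n=$h:2888:3888\n"
  n=$((n+1))
done
printf "$quorum"                    # three server.N=host:2888:3888 lines
```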
Set the node id for ZooKeeper on each node:

Node kafka1:

```shell
$ mkdir -p /kafka/zookeeper
$ echo 1 > /kafka/zookeeper/myid
$ chown -R kafka:kafka /kafka/zookeeper
```

Node kafka2:

```shell
$ mkdir -p /kafka/zookeeper
$ echo 2 > /kafka/zookeeper/myid
$ chown -R kafka:kafka /kafka/zookeeper
```

Node kafka3:

```shell
$ mkdir -p /kafka/zookeeper
$ echo 3 > /kafka/zookeeper/myid
$ chown -R kafka:kafka /kafka/zookeeper
```
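With the kafka1/kafka2/kafka3 naming scheme above, the node ID can also be derived from the hostname rather than typed on each node. A sketch, assuming that naming convention:

```shell
# Derive the ZooKeeper node id from a hostname like "kafka2".
host=kafka2                         # on a real node: host=$(hostname -s)
id=${host#kafka}                    # strip the "kafka" prefix
echo "$id"                          # this value would go in /kafka/zookeeper/myid
```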
Important

The number in `myid` must be unique on each host, and match the `broker.id` configured for Kafka.

Create a service file for ZooKeeper so that it will run as a system service and be automatically managed to keep running.
Create the file `/etc/systemd/system/zookeeper.service` and add the following lines:

```ini
[Unit]

[Service]
Type=simple
User=kafka
LimitNOFILE=800000
Environment="LOG_DIR=/var/log/zookeeper"
Environment="GC_LOG_ENABLED=true"
Environment="KAFKA_HEAP_OPTS=-Xms512M -Xmx4G"
ExecStart=/opt/kafka/bin/zookeeper-server-start.sh /opt/kafka/config/zookeeper.properties
Restart=on-failure
TimeoutSec=900

[Install]
WantedBy=multi-user.target
```
Now reload systemd so it picks up the new unit, then start the service:

```shell
$ systemctl daemon-reload
$ systemctl start zookeeper
```
Check if the service is running by using the `status` command:

```shell
$ systemctl status zookeeper
```
The output will be similar to the following, showing `active (running)` if the service is OK:

```
zookeeper.service
     Loaded: loaded (/etc/systemd/system/zookeeper.service; disabled; vendor preset: enabled)
     Active: active (running) since Thu 2024-03-07 05:31:36 GMT; 1s ago
   Main PID: 4968 (java)
      Tasks: 16 (limit: 1083)
     Memory: 24.6M
        CPU: 1.756s
     CGroup: /system.slice/zookeeper.service
             └─4968 java -Xms512M -Xmx4G -server -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35 -XX:+ExplicitGCInvokesConcurrent -XX:MaxInlineLevel=15 -Djava.awt.headless=true "-Xlog:gc*:file=/var/log/zookeeper/zookeeper-gc.log:time,tags:filecount=10,filesize=100M" -Dcom.sun.management.>

Mar 07 05:31:36 kafka1 systemd[1]: Started zookeeper.service.
```
The status output will report any issues, which should be addressed before starting the service again. If everything is OK, enable the service so that it will always start on boot:

```shell
$ systemctl enable zookeeper
```
Important

When running a multi-node service, repeat this process on each node, remembering to ensure that each node has a different number in its `myid` file.

Now create a service for Kafka. The configuration file is slightly different because a dependency is added so that the system will start ZooKeeper first, if it is not already running, before trying to start Kafka.
Create the file `/etc/systemd/system/kafka.service` and add the following lines:

```ini
[Unit]
Requires=zookeeper.service
After=zookeeper.service

[Service]
Type=simple
User=kafka
LimitNOFILE=800000
Environment="LOG_DIR=/var/log/kafka"
Environment="GC_LOG_ENABLED=true"
Environment="KAFKA_HEAP_OPTS=-Xms512M -Xmx4G"
ExecStart=/opt/kafka/bin/kafka-server-start.sh /opt/kafka/config/server.properties
Restart=on-failure
TimeoutSec=900

[Install]
WantedBy=multi-user.target
```
Now reload systemd, then start, check, and enable the Kafka service:

```shell
$ systemctl daemon-reload
$ systemctl start kafka
$ systemctl status kafka
$ systemctl enable kafka
```
These steps must be repeated on each host in a multi-node deployment.
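As a final cross-check before repeating the steps, the expected per-node values from the table at the top of this guide can be printed in one loop:

```shell
# Print the expected broker.id / myid pairing for each node in the cluster.
plan=$(for i in 1 2 3; do
  echo "kafka$i: broker.id=$i, myid=$i"
done)
echo "$plan"
```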