LogScale on Bare Metal - Installing Apache Kafka Cluster

LogScale uses Apache Kafka to manage ingest processing and inter-node communication within the LogScale cluster. LogScale recommends a minimum 3-node installation.

When deploying a Kafka cluster that requires ZooKeeper, each Kafka and ZooKeeper node needs a unique host ID number:

  • For Kafka this is the broker.id configuration value in the server.properties file:

    ini
    # The id of the broker. This must be set to a unique integer for each broker.
    broker.id=1
  • For ZooKeeper this is a file called myid in the ZooKeeper data directory that contains the node ID number. To create it, echo the number to the file:

    shell
    $ echo 1 > /kafka/zookeeper/myid

When creating a multi-node Kafka cluster these numbers must be unique for each host:

Host     Kafka broker.id   ZooKeeper myid
kafka1   1                 1
kafka2   2                 2
kafka3   3                 3
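A quick sanity check of this mapping can be scripted. The sketch below uses stand-in file names (server.properties.example, myid.example); on a real node the files would be /opt/kafka/config/server.properties and /kafka/zookeeper/myid:

```shell
# Compare broker.id in server.properties with the myid file contents.
# File names here are local stand-ins for the real paths.
props=server.properties.example
myid=myid.example
printf 'broker.id=1\n' > "$props"
printf '1\n' > "$myid"
bid=$(sed -n 's/^broker\.id=//p' "$props")
zid=$(cat "$myid")
if [ "$bid" = "$zid" ]; then echo "IDs match"; else echo "ID MISMATCH"; fi
```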

LogScale on Bare Metal - Apache Kafka Server Preparation

We recommend installing on Ubuntu, at least version 18.04. Before installing Kafka, make sure the server is up-to-date:

shell
$ apt-get update
$ apt-get upgrade

Create a non-administrative user named kafka to run Kafka:

shell
$ adduser kafka --shell=/bin/false --no-create-home --system --group

Add this user to the DenyUsers section of each node's /etc/ssh/sshd_config file to prevent it from being able to ssh or sftp into the node.

Restart the sshd daemon after making the change. Once the system has finished updating and the user has been created, Kafka can be installed.
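The change can be sketched as follows. It runs against a scratch copy of the file (sshd_config.example is a stand-in); on a real node you would edit /etc/ssh/sshd_config directly and then restart sshd with systemctl restart sshd:

```shell
# Append the kafka user to DenyUsers, creating the directive if absent.
# Operates on a local scratch copy rather than the real sshd_config.
cfg=sshd_config.example
printf 'PermitRootLogin no\n' > "$cfg"
if grep -q '^DenyUsers' "$cfg"; then
  sed -i 's/^DenyUsers.*/& kafka/' "$cfg"   # append to an existing list
else
  echo 'DenyUsers kafka' >> "$cfg"
fi
grep '^DenyUsers' "$cfg"
```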

LogScale on Bare Metal - Apache Kafka Installation

To install Kafka and ZooKeeper:

  1. Go to the /opt directory and download the latest release. The package can be downloaded using wget:

    shell
    $ cd /opt
    $ wget https://downloads.apache.org/kafka/3.7.0/kafka_2.13-3.7.0.tgz
  2. Extract the archive:

    shell
    $ tar zxf kafka_2.13-3.7.0.tgz
  3. Now create the directories where the information will be stored. We will use the top-level directory /kafka since that could be a mount point for a separate filesystem. We will also create directories for application log files in /var/log/kafka and /var/log/zookeeper:

    shell
    $ mkdir /var/log/kafka /var/log/zookeeper
    $ mkdir -p /kafka/kafka /kafka/zookeeper
    $ chown kafka:kafka /var/log/kafka /var/log/zookeeper
    $ chown kafka:kafka /kafka/kafka /kafka/zookeeper

    Now link the application directory to /opt/kafka. This allows the application and scripts to be referenced via /opt/kafka, while upgrading only requires downloading the new release and relinking /opt/kafka to the updated application directory:

    shell
    $ ln -s /opt/kafka_2.13-3.7.0 /opt/kafka
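The upgrade-by-relink pattern described above can be sketched with throwaway directory names (kafka_2.13-3.7.1 here is a hypothetical future release, and the demo directory stands in for /opt):

```shell
# Demonstrate repointing a version symlink, using scratch directories.
mkdir -p demo/kafka_2.13-3.7.0 demo/kafka_2.13-3.7.1
ln -sfn kafka_2.13-3.7.0 demo/kafka
readlink demo/kafka                     # current version
ln -sfn kafka_2.13-3.7.1 demo/kafka     # upgrade: repoint the link
readlink demo/kafka
```

The -n flag ensures the existing link is replaced rather than a new link being created inside the target directory.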
  4. Using a text editor, open the Kafka properties file, server.properties, located in the /opt/kafka/config sub-directory. The following options should be set for an optimal configuration:

    ini
    broker.id=1
    log.dirs=/kafka/kafka
    delete.topic.enable=true

    The first line sets the broker.id value to match the server number (in the myid file) set when configuring ZooKeeper. The second sets the data directory. The third line should be added to the end of the configuration file. Save the file and change the owner to the kafka user:

    shell
    $ chown -R kafka:kafka /opt/kafka_2.13-3.7.0

    Modify the directory name according to the version of Kafka that has been installed. Note that changing the ownership of the link /opt/kafka doesn't change the ownership of the files in the directory it points to, which is why the command targets the real directory.
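As a quick check that the three settings are in place, something like the following can be used. It runs against a local stand-in file here; on a node you would point grep at /opt/kafka/config/server.properties:

```shell
# List the three required settings from the properties file.
# server.properties.check is a local stand-in for the real file.
props=server.properties.check
printf 'broker.id=1\nlog.dirs=/kafka/kafka\ndelete.topic.enable=true\n' > "$props"
grep -E '^(broker\.id|log\.dirs|delete\.topic\.enable)=' "$props"
```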

  5. If deploying a multi-node Kafka cluster, make sure that each node can resolve the hostname of each other node in the cluster. One way to achieve this is to edit the /etc/hosts file on each node with the host information:

    ini
    192.168.1.15 kafka1
    192.168.1.16 kafka2
    192.168.1.17 kafka3

    Important

    Be aware that in some Linux distributions, the hosts file may contain a line that by default resolves the hostname to the localhost address, 127.0.0.1. This will cause servers to listen only on the localhost address, making them inaccessible to other hosts on the network. In this case, change the line:

    ini
    127.0.1.1 kafka1 kafka1

    Update the IP address to the public address of the host.
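A simple way to verify resolution on each node is getent. The loop below uses localhost as a stand-in so it runs anywhere; on a real node you would list the cluster hostnames (kafka1 kafka2 kafka3 in this guide's examples):

```shell
# Check that each hostname resolves; localhost stands in for kafka1..kafka3.
for h in localhost; do
  if getent hosts "$h" >/dev/null; then
    echo "$h resolves"
  else
    echo "$h FAILS"
  fi
done
```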

  6. To configure the properties for ZooKeeper, edit the config/zookeeper.properties file with the following options:

    ini
    dataDir=/kafka/zookeeper
    clientPort=2181
    maxClientCnxns=0
    admin.enableServer=false
    server.1=kafka1:2888:3888
    server.2=kafka2:2888:3888
    server.3=kafka3:2888:3888
    4lw.commands.whitelist=*
    tickTime=2000
    initLimit=5
    syncLimit=2

    The server.1, server.2 and server.3 lines configure the hostname and the host-to-host ports used for peer communication and leader election. The number in each server.N entry must match the myid file value on the corresponding host (and, following this guide's convention, that host's Kafka broker.id).

    The last three lines are required by ZooKeeper in multi-node configurations to set the timing interval for communicating with the other hosts and the time limit before reporting an error.

    Tip

    This file can be copied to each node running ZooKeeper, as there are no node-specific configuration settings.
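The timing settings translate into wall-clock limits as a product of ticks and tickTime, which can be checked with shell arithmetic:

```shell
# ZooKeeper timeouts are expressed in ticks of tickTime milliseconds.
tickTime=2000   # ms per tick
initLimit=5     # ticks a follower may take to connect and sync with the leader
syncLimit=2     # ticks a follower may lag before being dropped
echo "initial sync allowance: $((initLimit * tickTime)) ms"
echo "sync lag allowance: $((syncLimit * tickTime)) ms"
```

With the values above, a follower has 10000 ms to complete its initial sync and may lag the leader by at most 4000 ms.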

  7. Set the node id for ZooKeeper on each node:

    Node kafka1
    shell
    $ mkdir -p /kafka/zookeeper
    $ echo 1 > /kafka/zookeeper/myid
    $ chown -R kafka:kafka /kafka/zookeeper
    Node kafka2
    shell
    $ mkdir -p /kafka/zookeeper
    $ echo 2 > /kafka/zookeeper/myid
    $ chown -R kafka:kafka /kafka/zookeeper
    Node kafka3
    shell
    $ mkdir -p /kafka/zookeeper
    $ echo 3 > /kafka/zookeeper/myid
    $ chown -R kafka:kafka /kafka/zookeeper

    Important

    The number in myid must be unique on each host, and match the broker.id configured for Kafka.
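Uniqueness across nodes can be checked mechanically. The sketch below simulates the three nodes with local files; on real hosts you would compare the output of cat /kafka/zookeeper/myid from each node:

```shell
# Simulate the three nodes' myid files and check for duplicate values.
printf '1\n' > myid.kafka1
printf '2\n' > myid.kafka2
printf '3\n' > myid.kafka3
if [ -z "$(sort myid.kafka1 myid.kafka2 myid.kafka3 | uniq -d)" ]; then
  echo "all ids unique"
else
  echo "duplicate ids found"
fi
```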

  8. Create a service file for ZooKeeper so that it runs as a system service and is automatically restarted if it fails.

    Create the file /etc/systemd/system/zookeeper.service and add the following lines:

    ini
    [Unit]
    
    [Service]
    Type=simple
    User=kafka
    LimitNOFILE=800000
    Environment="LOG_DIR=/var/log/zookeeper"
    Environment="GC_LOG_ENABLED=true"
    Environment="KAFKA_HEAP_OPTS=-Xms512M -Xmx4G"
    ExecStart=/opt/kafka/bin/zookeeper-server-start.sh /opt/kafka/config/zookeeper.properties
    Restart=on-failure
    TimeoutSec=900
    
    [Install]
    WantedBy=multi-user.target

    Now start the service:

    shell
    $ systemctl start zookeeper

    Check if the service is running by using the status command:

    shell
    $ systemctl status zookeeper

    Output similar to the following, showing active (running), indicates the service is OK:

    zookeeper.service
         Loaded: loaded (/etc/systemd/system/zookeeper.service; disabled; vendor preset: enabled)
         Active: active (running) since Thu 2024-03-07 05:31:36 GMT; 1s ago
       Main PID: 4968 (java)
          Tasks: 16 (limit: 1083)
         Memory: 24.6M
            CPU: 1.756s
         CGroup: /system.slice/zookeeper.service
                 └─4968 java -Xms512M -Xmx4G -server -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35 -XX:+ExplicitGCInvokesConcurrent -XX:MaxInlineLevel=15 -Djava.awt.headless=true "-Xlog:gc*:file=/var/log/zookeeper/zookeeper-gc.log:time,tags:filecount=10,filesize=100M" -Dcom.sun.management.>
    
    Mar 07 05:31:36 kafka1 systemd[1]: Started zookeeper.service.

    The status output should report any issues, which should be addressed before starting the service again. If everything is OK, enable the service so that it will always start on boot:

    shell
    $ systemctl enable zookeeper

    Important

    When running a multi-node service, repeat this process on each node, remembering to ensure that each node has a different number in each myid file.

  9. Now create a service for Kafka. The configuration file is slightly different because a dependency is added so that the system will start ZooKeeper first, if it is not already running, before trying to start Kafka.

    Create the file /etc/systemd/system/kafka.service and add the following lines:

    ini
    [Unit]
    Requires=zookeeper.service
    After=zookeeper.service
    
    [Service]
    Type=simple
    User=kafka
    LimitNOFILE=800000
    Environment="LOG_DIR=/var/log/kafka"
    Environment="GC_LOG_ENABLED=true"
    Environment="KAFKA_HEAP_OPTS=-Xms512M -Xmx4G"
    ExecStart=/opt/kafka/bin/kafka-server-start.sh /opt/kafka/config/server.properties
    Restart=on-failure
    TimeoutSec=900
    
    [Install]
    WantedBy=multi-user.target
  10. Now start the Kafka service:

    shell
    $ systemctl start kafka
    $ systemctl status kafka
    $ systemctl enable kafka

    These steps must be repeated on each host in a multi-node deployment.