Kafka Usage & Installation

LogScale uses Apache Kafka internally for queuing incoming messages and for storing shared state when running LogScale in a cluster setup. This page describes how LogScale uses Kafka. If you already understand Kafka concepts, you can skip this and go to the instructions on how to Install Kafka, further down this page.

For more information on Kafka configuration and settings, see Kafka Configuration.

How LogScale Uses Kafka

LogScale creates the following queues in Kafka, each described in the sections below:

  • global-events

  • humio-ingest

  • transientChatter-events

You can set the environment variable HUMIO_KAFKA_TOPIC_PREFIX to add a prefix to the topic names in Kafka. Adding a prefix is recommended if you share the Kafka installation with applications other than LogScale, or with another LogScale instance. The default is not to add any prefix.
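For example, setting a prefix (the value logscale-a- here is purely illustrative) renames all three topics accordingly:

shell
# Hypothetical prefix; the topics then become logscale-a-global-events,
# logscale-a-humio-ingest, and logscale-a-transientChatter-events.
$ export HUMIO_KAFKA_TOPIC_PREFIX=logscale-a-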

LogScale configures default retention settings on the topics when it creates them. If they exist already, LogScale does not alter retention settings on the topics.

If you wish to inspect and change the topic configurations, such as the retention settings, to match the disk space available for Kafka, use the kafka-configs command. See below for an example that modifies the retention on the ingest queue to keep bursts of data for up to one hour only.
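As a minimal sketch, assuming the default (unprefixed) topic name humio-ingest and a ZooKeeper instance on localhost:2181, as in the examples further down this page:

shell
# Cap time-based retention on the ingest queue at one hour (3600000 ms):
$ kafka-configs.sh --zookeeper localhost:2181 --alter --entity-type topics --entity-name 'humio-ingest' --add-config 'retention.ms=3600000'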

global-events

This is LogScale's event-sourced database queue.

  • This queue has a relatively low throughput.

  • Allow messages of at least 2 MB to support large events:

    ini
    max.message.bytes=2097152
  • No log data is saved to this queue.

  • There should be a high number of replicas for this queue.

  • LogScale will raise the number of replicas on this queue to three if there are at least three brokers in the Kafka cluster and LogScale is allowed to manage the topic.

  • Default required replicas (provided there are three brokers when LogScale creates the topic):

    ini
    min.insync.replicas = 2
  • Default retention configuration (1 GB; time-based retention is disabled):

    ini
    retention.bytes = 1073741824
    retention.ms = -1
  • Compression should be set to:

    ini
    compression.type=producer
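Putting these together, the global-events defaults could be applied in a single kafka-configs call; a sketch, again assuming ZooKeeper on localhost:2181 and unprefixed topic names:

shell
# Apply the global-events settings above in one --add-config list:
$ kafka-configs.sh --zookeeper localhost:2181 --alter --entity-type topics --entity-name 'global-events' --add-config 'min.insync.replicas=2,retention.bytes=1073741824,retention.ms=-1,compression.type=producer,max.message.bytes=2097152'
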
humio-ingest

Ingested events are sent to this queue, before they are stored in LogScale. LogScale's front end will accept ingest requests, parse them, and put them on the queue. LogScale's back end processes events from the queue and stores them into the datastore. This queue will have high throughput corresponding to the ingest load. The number of replicas can be configured in accordance with data size, latency and throughput requirements, and how important it is not to lose in-flight data.

LogScale defaults to two replicas on this queue if at least two brokers exist in the Kafka cluster; this can be changed through the configuration parameter INGEST_QUEUE_REPLICATION_FACTOR, which defaults to 2. Once data is stored in LogScale's own datastore, it is no longer needed on the queue.

  • Default required replicas:

    ini
    min.insync.replicas = $INGEST_QUEUE_REPLICATION_FACTOR - 1

    Provided there are enough brokers when LogScale creates the topic.

  • Default retention configuration (7 days in milliseconds):

    ini
    retention.ms = 604800000
  • Set the retention configuration on the humio-ingest topic to:

    ini
    retention.bytes = disk_space_in_bytes_on_one_host / partitionCount

    with the actual setting based on the disk space available; see the worked example after this list.

  • Compression should be set to:

    ini
    compression.type=producer
  • Allow messages of at least 8 MB to support large events:

    ini
    max.message.bytes=8388608
  • Compaction is not allowed.
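As a purely hypothetical instance of the formula above: with 1 TB of disk reserved for Kafka on each host and 24 partitions on the ingest queue, the per-partition cap works out to roughly 45 GB, which could be applied like this:

shell
# 1 TB = 1099511627776 bytes; divided across 24 partitions:
# 1099511627776 / 24 ≈ 45812984490 bytes (~45 GB) per partition.
$ kafka-configs.sh --zookeeper localhost:2181 --alter --entity-type topics --entity-name 'humio-ingest' --add-config 'retention.bytes=45812984490'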

transientChatter-events

This queue is used for chatter between LogScale nodes and carries only transient data. LogScale will raise the number of replicas on this queue to three if there are at least three brokers in the Kafka cluster. The queue can have a short retention; it is not important to keep the data, as it goes stale very quickly.

  • Default required replicas (provided there are three brokers when LogScale creates the topic):

    ini
    min.insync.replicas = 2
  • Default retention configuration (one hour in milliseconds):

    ini
    retention.ms = 3600000
  • Compression should be set to:

    ini
    compression.type=producer
  • Compaction should be enabled, allowing Kafka to retain only the latest copy of each key:

    ini
    cleanup.policy=compact
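As with the other topics, these settings could be applied in one sketch of a kafka-configs call (same assumptions about the ZooKeeper address and unprefixed topic names):

shell
$ kafka-configs.sh --zookeeper localhost:2181 --alter --entity-type topics --entity-name 'transientChatter-events' --add-config 'min.insync.replicas=2,retention.ms=3600000,compression.type=producer,cleanup.policy=compact'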

Kafka Version

LogScale is capable of running on Kafka version 2.4.1 and greater and is usually tested against the latest Kafka version.

Note

Although any version from 2.4.1 onwards will work, it is strongly recommended to install the latest Kafka version possible for your environment. The currently available versions can be found on the Apache Kafka downloads page.

shell
## Example commands for setting protocol version on topic...
# See current config for topic, if any:
kafka-configs.sh --zookeeper localhost:2181 --describe --entity-type topics --entity-name 'humio-ingest'
# Set protocol version for topic:
kafka-configs.sh --zookeeper localhost:2181 --alter --entity-type topics --entity-name 'humio-ingest' --add-config 'message.format.version=0.11.0'
# Remove the setting, falling back to the broker's default:
kafka-configs.sh --zookeeper localhost:2181 --alter --entity-type topics --entity-name 'humio-ingest' --delete-config 'message.format.version'

Server Preparation

We recommend installing on Ubuntu 18.04 or later. Before installing Kafka, make sure the server is up to date. If you haven't already done so, you can upgrade the system with apt-get like so:

shell
$ apt-get update
$ apt-get upgrade

Next, create a non-administrative user named kafka to run Kafka. You can do this by executing the following from the command line:

shell
$ adduser kafka --shell=/bin/false --no-create-home --system --group

You should add this user to the DenyUsers section of your node's /etc/ssh/sshd_config file to prevent it from being able to ssh or sftp into the node, as shown below. Remember to restart the sshd daemon after making the change. Once the system has finished updating and the user has been created, you can install Kafka.
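A minimal sketch, assuming sshd_config contains no DenyUsers directive yet (if it does, append kafka to the existing line instead):

shell
# Deny the kafka user ssh/sftp access, then restart sshd to apply:
$ echo 'DenyUsers kafka' >> /etc/ssh/sshd_config
$ systemctl restart sshd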

Installation

To install Kafka, go to the /opt directory and download the latest release. You can do that with wget:

shell
$ cd /opt
$ wget https://www-us.apache.org/dist/kafka/x.x.x/kafka_x.x.x.x.tgz

You would adjust the last line, changing the Xs to the latest version number. Once the download completes, untar the file and then create the directories Kafka needs, like this:

shell
$ tar zxf kafka_x.x.x.x.tgz

$ mkdir /var/log/kafka
$ mkdir /var/kafka-data
$ chown kafka:kafka /var/log/kafka
$ chown kafka:kafka /var/kafka-data

$ ln -s /opt/kafka_x.x.x.x /opt/kafka

The four lines in the middle here create the directories for Kafka's logs and data, and change the ownership of those directories to the kafka user. The last line creates a symbolic link at /opt/kafka; again, replace the Xs with the version number you downloaded.

Using a simple text editor, open the Kafka properties file, server.properties, located in the kafka/config sub-directory. You'll need to set a few options; the lines below are not necessarily in the order in which they appear in the configuration file:

ini
broker.id=1
log.dirs=/var/kafka-data
delete.topic.enable=true

The first line sets the broker.id value to match the server number (myid) you set when configuring Zookeeper. The second sets the data directory. The third line should be added to the end of the configuration file. When you're finished, save the file and change the owner to the kafka user:

shell
$ chown -R kafka:kafka /opt/kafka_x.x.x.x

You'll have to adjust this to the version you installed. Note that changing the ownership of the symbolic link /opt/kafka doesn't change the ownership of the files in the directory it points to.

Now you'll need to create a service file for starting Kafka. Use a simple text editor to create a file named kafka.service in the /etc/systemd/system/ directory, then add the following lines to it:

ini
[Unit]
Requires=zookeeper.service
After=zookeeper.service

[Service]
Type=simple
User=kafka
LimitNOFILE=800000
Environment="LOG_DIR=/var/log/kafka"
Environment="GC_LOG_ENABLED=true"
Environment="KAFKA_HEAP_OPTS=-Xms512M -Xmx4G"
ExecStart=/opt/kafka/bin/kafka-server-start.sh /opt/kafka/config/server.properties
Restart=on-failure

[Install]
WantedBy=multi-user.target

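Because kafka.service is a newly created unit file, tell systemd to re-read its configuration before starting the service:

shell
# Make systemd pick up the new kafka.service unit:
$ systemctl daemon-reload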

Now you're ready to start the Kafka service. Enter the first line below to start it. When it finishes, enter the second line to check that it's running and there are no errors reported:

shell
$ systemctl start kafka
$ systemctl status kafka
$ systemctl enable kafka

After breaking out of the status by pressing q, enter the last line above to set the Kafka service to start when the server boots up.