Kafka Dashboard

Humio depends on ZooKeeper and Kafka clusters to keep running, so it is important to monitor your Kafka cluster.

Ingest Queue: Out-of-Sync Partitions

In your Kafka cluster, there will be a Kafka topic called humio-ingest. Ingested events are sent to this queue before they are stored in Humio. Humio’s front-end accepts ingest requests, parses them, and puts them on the Kafka ingest queue. Humio’s back-end processes events from the queue and stores them in the datastore.

If any partitions of the humio-ingest topic become out-of-sync, the number of affected partitions is shown here.

A healthy Kafka cluster will show none of these.

Global Events Queue: Out-of-Sync Partitions

In your Kafka cluster, there will be another Kafka topic called global-events. This widget shows the number of partitions out-of-sync on this Kafka topic.

A healthy Kafka cluster will show none of these.

TransientChatter Queue: Out-of-Sync Partitions

In your Kafka cluster, there will be a third Kafka topic called transientChatter-events. It is used for messages between Humio nodes within the Humio cluster. This widget shows the number of out-of-sync partitions for that topic.

A healthy Kafka cluster will show none of these.

Out-of-Sync Queues

This timechart will show you if any of the three Kafka topics used by Humio have had out-of-sync replicas.

A replica is considered to be out-of-sync, or lagging, when it falls sufficiently far behind the leader of the partition. The replica’s lag is measured either as the number of messages by which it trails the leader (replica.lag.max.messages, a setting that was removed in Kafka 0.9) or as the time for which the replica has not attempted to fetch new data from the leader (replica.lag.time.max.ms).
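The time-based rule above can be illustrated with a minimal sketch. This is not Kafka's implementation: the ReplicaState type, its last_caught_up_ms field, and the 30,000 ms threshold are assumptions chosen for illustration. The real threshold is whatever replica.lag.time.max.ms is set to on your brokers.

```python
from dataclasses import dataclass

# Assumed threshold for illustration only; the real value is the broker's
# replica.lag.time.max.ms setting.
REPLICA_LAG_TIME_MAX_MS = 30_000


@dataclass
class ReplicaState:
    """Hypothetical view of one replica of a partition."""
    broker_id: int
    last_caught_up_ms: int  # hypothetical: when the replica last kept up with the leader


def out_of_sync_replicas(replicas, now_ms, max_lag_ms=REPLICA_LAG_TIME_MAX_MS):
    """Return the broker ids whose replica has lagged for longer than max_lag_ms."""
    return [r.broker_id for r in replicas if now_ms - r.last_caught_up_ms > max_lag_ms]


replicas = [
    ReplicaState(broker_id=1, last_caught_up_ms=100_000),
    ReplicaState(broker_id=2, last_caught_up_ms=95_000),
    ReplicaState(broker_id=3, last_caught_up_ms=60_000),
]
print(out_of_sync_replicas(replicas, now_ms=100_000))  # prints [3]
```

Here only broker 3 exceeds the threshold, so it is the one replica that would be dropped from the in-sync set under these assumptions.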

A healthy Kafka cluster should not show any topics out-of-sync.

Kafka Topic Partitions

This table is a good reference for how the topics and their partitions currently look in the Kafka cluster. You can also view this information, in more detail, by going to the Cluster Administration page in the Humio User Interface and clicking on the Kafka Cluster page.

For a healthy system, you should ideally see every partition with topic_is_in_sync set to true and with topic_replicas listing the same set of nodes as topic_in_sync_replicas.
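As a rough sketch, the health check described above could be applied to exported table rows as follows. The field names come from the table, but the row shape (a dict with lists of node ids) and the topic and partition keys are assumptions made for illustration, not a documented Humio export format.

```python
def partition_is_healthy(row):
    """True when the partition is flagged in sync and every replica is in the in-sync set."""
    replicas = set(row["topic_replicas"])
    in_sync = set(row["topic_in_sync_replicas"])
    return bool(row["topic_is_in_sync"]) and replicas == in_sync


def unhealthy_partitions(rows):
    """Return (topic, partition) pairs for rows that fail the health check."""
    return [(r["topic"], r["partition"]) for r in rows if not partition_is_healthy(r)]


# Hypothetical rows; node ids are compared as sets, so their order does not matter.
rows = [
    {"topic": "humio-ingest", "partition": 0, "topic_is_in_sync": True,
     "topic_replicas": [1, 2, 3], "topic_in_sync_replicas": [3, 1, 2]},
    {"topic": "humio-ingest", "partition": 1, "topic_is_in_sync": False,
     "topic_replicas": [1, 2, 3], "topic_in_sync_replicas": [1, 2]},
]
print(unhealthy_partitions(rows))  # prints [('humio-ingest', 1)]
```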

Ingest Queue Put Response Times 75th Percentile (Millis)

This is a timechart of the metric kafka-ingestqueue-put, which is the time from adding an event to the ingest queue to getting an ack back.
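Several widgets on this dashboard plot a 75th percentile rather than an average. The sketch below shows one common definition, the nearest-rank percentile, applied to made-up response times in milliseconds. It only illustrates what the statistic means; how Humio actually computes percentiles for this metric is not stated here and may differ.

```python
import math


def percentile_nearest_rank(samples, pct):
    """Return the smallest sample such that at least pct percent of samples are <= it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(math.ceil(pct / 100 * len(ordered)), 1)
    return ordered[rank - 1]


# Hypothetical put response times in milliseconds, with one slow outlier.
put_times_ms = [12, 15, 11, 240, 14, 13, 16, 18]
print(percentile_nearest_rank(put_times_ms, 75))  # prints 16
```

In this sample the single 240 ms outlier pulls the mean above 42 ms, while the 75th percentile stays at 16 ms, which is why a percentile is often a steadier signal for response times.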

Ingest Queue: Uncompressed Bytes Written

This is a timechart of the metric ingest-writer-uncompressed-bytes. It shows the number of bytes per second written to the ingest queue in Kafka, measured before compression. The timechart shows the distribution across Humio hosts.

Ingest Queue Request Size 75th Percentiles

This timechart uses a metric that shows the number of bytes written to Kafka, after compression, for events in the ingest queue.

Global Requests per Second

Humio uses Kafka to move its global-data-snapshot.json file between nodes to ensure each Humio node is always up-to-date. This timechart uses the metric global-publish-wait-for-value, which measures the time from pushing an update to the global snapshot until the value is read back from Kafka.

The timechart then shows the number of these requests being made per second per Humio host.

Global Transactions per Second

Humio’s global-data-snapshot.json describes key information about the Humio cluster. When changes are made to the cluster, this can require an update to Global. This timechart shows the names of the different functions that make a change to the global-data-snapshot.json file.

Global Time Blocked Waiting for Write (P75) (Millis)

This uses the same metric as the Global Requests per Second timechart, but in this case it shows the maximum 75th percentile value, in milliseconds, rather than the request rate.

Lag Reading Ingest Queue

This timechart uses Kafka’s metric record-lag-max, which is the difference, in messages, between the log offset of the consumer reading from the ingest queue and the current log offset of the producer writing to the ingest queue. The timechart shows this record lag for each Kafka partition.

A healthy system should ideally have all partitions at 0, but spikes are fine.
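The lag described above is a per-partition subtraction, sketched here with made-up offsets. The function name and the dict-of-offsets input are assumptions for illustration; in practice the values come from Kafka's own consumer metrics rather than from code like this.

```python
def record_lag(producer_offsets, consumer_offsets):
    """Return messages not yet consumed per partition, never below zero."""
    return {
        partition: max(producer_offsets[partition] - consumer_offsets[partition], 0)
        for partition in producer_offsets
    }


# Hypothetical offsets keyed by partition number.
producer_offsets = {0: 1500, 1: 980, 2: 2010}
consumer_offsets = {0: 1500, 1: 950, 2: 2010}

lag = record_lag(producer_offsets, consumer_offsets)
print(lag)  # prints {0: 0, 1: 30, 2: 0}
print([p for p, n in lag.items() if n > 0])  # prints [1]
```

Partition 1 is 30 messages behind in this example. A short-lived value like that matches the spikes mentioned above, whereas a lag that keeps growing suggests the consumer is not keeping up.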