Kafka Dashboard
Since LogScale relies on a Kafka cluster to keep running, it's important to monitor your Kafka cluster.
Ingest Queue: Out-of-Sync Partitions
In your Kafka cluster, there will be a Kafka topic called humio-ingest. Ingested events are sent to this queue before they are stored in LogScale. LogScale's front-end accepts ingest requests, parses them, and puts them on the Kafka ingest queue. LogScale's back-end then processes events from the queue and stores them in the datastore.
If any of the Kafka partitions under the humio-ingest topic become out-of-sync, the number of affected partitions will be shown here.
A healthy Kafka cluster will show none of these.
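The same check can be reproduced outside the dashboard. The sketch below uses Kafka's Java AdminClient to count partitions of the humio-ingest topic whose in-sync replica set is smaller than the replica set; the bootstrap address is an assumption, so point it at your own brokers.

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.TopicDescription;

import java.util.Collections;
import java.util.Properties;

public class OutOfSyncCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumption: point at your brokers
        try (AdminClient admin = AdminClient.create(props)) {
            TopicDescription desc = admin.describeTopics(Collections.singleton("humio-ingest"))
                    .all().get().get("humio-ingest");
            // A partition is out-of-sync when its ISR is smaller than its replica set.
            long outOfSync = desc.partitions().stream()
                    .filter(p -> p.isr().size() < p.replicas().size())
                    .count();
            System.out.println("humio-ingest out-of-sync partitions: " + outOfSync);
        }
    }
}
```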
Global Events Queue: Out-of-Sync Partitions
In your Kafka cluster, there will be another Kafka topic called global-events. This widget shows the number of partitions out-of-sync on this Kafka topic.
A healthy Kafka cluster will show none of these.
TransientChatter Queue: Out-of-Sync Partitions
In your Kafka cluster, the other Kafka topic is called transientChatter-events. This is used for messages between LogScale nodes within the LogScale cluster. This widget will show the number of out-of-sync partitions for that particular topic.
A healthy Kafka cluster will show none of these.
Out-of-Sync Queues
This timechart will show you if any of the three Kafka topics used by LogScale have had out-of-sync replicas.
A replica is considered to be out-of-sync or lagging when it falls sufficiently behind the leader of the partition. The replica's lag is measured either in terms of the number of messages it is behind the leader (replica.lag.max.messages) or the time for which the replica has not attempted to fetch new data from the leader (replica.lag.time.max.ms).
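If you want to confirm which lag threshold your brokers are actually using, something along the following lines reads replica.lag.time.max.ms with the Java AdminClient; the broker id and bootstrap address are assumptions.

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.Config;
import org.apache.kafka.common.config.ConfigResource;

import java.util.Collections;
import java.util.Properties;

public class ReplicaLagConfig {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumption
        try (AdminClient admin = AdminClient.create(props)) {
            // Broker id "0" is an assumption; use an id from your own cluster.
            ConfigResource broker = new ConfigResource(ConfigResource.Type.BROKER, "0");
            Config config = admin.describeConfigs(Collections.singleton(broker))
                    .all().get().get(broker);
            System.out.println("replica.lag.time.max.ms = "
                    + config.get("replica.lag.time.max.ms").value());
        }
    }
}
```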
A healthy Kafka cluster should not show any topics out-of-sync.
Kafka Topic Partitions
This table is a good reference for how the topics and each of their partitions currently look in the Kafka cluster. You can also view this table by going to the Cluster Administration page in the LogScale User Interface and clicking on the Kafka Cluster page, which shows this information in more detail.
For a healthy system, you should ideally see all partitions with topic_is_in_sync set to true, and topic_replicas listing the same set of nodes as topic_in_sync_replicas.
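To reproduce roughly the same view outside the UI, the following sketch describes the three LogScale topics and prints each partition's leader, replica set, and in-sync replica set; the topic names assume LogScale's default names, and the bootstrap address is an assumption.

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.TopicDescription;

import java.util.Arrays;
import java.util.Map;
import java.util.Properties;

public class TopicPartitionTable {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumption
        try (AdminClient admin = AdminClient.create(props)) {
            Map<String, TopicDescription> topics = admin.describeTopics(
                    Arrays.asList("humio-ingest", "global-events", "transientChatter-events"))
                    .all().get();
            topics.forEach((name, desc) -> desc.partitions().forEach(p ->
                    // in_sync is true when every replica is also in the ISR
                    System.out.printf("%s-%d leader=%s replicas=%s isr=%s in_sync=%b%n",
                            name, p.partition(), p.leader(), p.replicas(), p.isr(),
                            p.isr().containsAll(p.replicas()))));
        }
    }
}
```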
Ingest Queue Put Response Times 75th Percentile (Millis)
This is a timechart of the metric kafka-ingestqueue-put, which is the time from adding an event to the ingest queue to getting an ack back.
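The measurement behind this metric is essentially a produce-and-wait-for-ack round trip. The sketch below is not LogScale's actual instrumentation, but it illustrates the idea: time a producer send until the broker's acknowledgement callback fires. The topic name probe-topic and the bootstrap address are hypothetical.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.Properties;

public class PutLatencyProbe {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumption
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            long start = System.nanoTime();
            // "probe-topic" is a hypothetical scratch topic, not LogScale's ingest queue.
            producer.send(new ProducerRecord<>("probe-topic", "example-event"),
                    (metadata, exception) -> {
                        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
                        if (exception == null) {
                            System.out.println("ack received after " + elapsedMs + " ms");
                        } else {
                            exception.printStackTrace();
                        }
                    });
            producer.flush();
        }
    }
}
```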
Ingest Queue: Uncompressed Bytes Written
This is a timechart of the metric ingest-writer-uncompressed-bytes. It shows the number of bytes per second written to Kafka before being compressed in the ingest queue. The timechart shows the distribution across LogScale hosts.
Ingest Queue Request Size 75th Percentiles
This timechart uses a metric that shows the number of bytes written to Kafka, after compression, for events in the ingest queue.
Global Requests per Second
LogScale uses Kafka to move its global-data-snapshot.json file between nodes to ensure each LogScale node is always up-to-date. This timechart uses the metric global-publish-wait-for-value, which measures the time from pushing an update to the global snapshot to seeing the value read back from Kafka.
The timechart then shows the number of these requests being made per second per LogScale host.
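The pattern behind global-publish-wait-for-value is a write-then-read-back round trip over Kafka. The sketch below illustrates that pattern against a scratch topic of your own (not LogScale's real global-events topic); the topic name, key, and bootstrap address are assumptions, and this is not LogScale's actual code.

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.TopicPartition;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import java.util.UUID;

public class PublishWaitForValueProbe {
    public static void main(String[] args) {
        String topic = "probe-topic"; // hypothetical scratch topic, not global-events
        String marker = UUID.randomUUID().toString();

        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "localhost:9092"); // assumption
        producerProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        producerProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", "localhost:9092"); // assumption
        consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps);
             KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps)) {
            TopicPartition tp = new TopicPartition(topic, 0);
            consumer.assign(Collections.singleton(tp));
            consumer.seekToEnd(Collections.singleton(tp));
            consumer.position(tp); // force the seek so only the new record is read

            long start = System.nanoTime();
            producer.send(new ProducerRecord<>(topic, 0, "probe", marker));
            producer.flush();

            boolean seen = false;
            while (!seen) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
                for (ConsumerRecord<String, String> record : records) {
                    seen = seen || marker.equals(record.value());
                }
            }
            System.out.println("value read back after "
                    + (System.nanoTime() - start) / 1_000_000 + " ms");
        }
    }
}
```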
Global Transactions per Second
LogScale's global-data-snapshot.json file describes key information about the LogScale cluster. When changes are made to the cluster, this can require an update to Global. This timechart shows the names of the different functions that make a change to the global-data-snapshot.json file.
Global Time Blocked Waiting for Write (P75) (Millis)
This uses the same metric as the Global Requests per Second timechart, but in this case it looks at the maximum 75th percentile of the time requests spend blocked waiting for the write, in milliseconds.
Lag Reading Ingest Queue
This timechart uses Kafka's metric record-lag-max, which is the difference, in messages, between the consumer's log offset when pulling off the ingest queue and the producer's current log offset when sending to the ingest queue. The timechart shows this record lag for each Kafka partition.
A healthy system should ideally have all partitions at 0, but spikes are fine.
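A comparable lag figure can be computed directly from Kafka by subtracting a consumer group's committed offsets from the current end offsets. The sketch below does this with the Java AdminClient; the group id logscale and the bootstrap address are assumptions, not the consumer group LogScale actually uses.

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.ListOffsetsResult;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;

public class IngestQueueLag {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumption
        try (AdminClient admin = AdminClient.create(props)) {
            // Committed offsets for the (assumed) consumer group "logscale".
            Map<TopicPartition, OffsetAndMetadata> committed =
                    admin.listConsumerGroupOffsets("logscale")
                         .partitionsToOffsetAndMetadata().get();
            // Latest end offsets for the ingest topic's partitions.
            Map<TopicPartition, OffsetSpec> latest = committed.keySet().stream()
                    .filter(tp -> tp.topic().equals("humio-ingest"))
                    .collect(Collectors.toMap(tp -> tp, tp -> OffsetSpec.latest()));
            Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> ends =
                    admin.listOffsets(latest).all().get();
            // Lag per partition = end offset minus committed offset.
            ends.forEach((tp, end) -> {
                long lag = end.offset() - committed.get(tp).offset();
                System.out.println(tp + " lag=" + lag);
            });
        }
    }
}
```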