This dashboard shows the key widgets to look for when monitoring a Humio cluster to get a sense of the overall health of the cluster.
This widget shows how much Ingest each Humio node is receiving in bytes per day.
The share of ingest a node receives is usually dictated by the number of Digest Partitions assigned to each host on the Cluster Administration page. If a node is receiving too little or too much ingest compared to other nodes, you may want to reconfigure the Digest Partitions so they are distributed evenly.
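As a rough illustration of what "too little or too much" could mean, the following Python sketch flags nodes whose daily ingest deviates from the cluster average by more than a chosen fraction. The function name, the node labels, and the 25% tolerance are assumptions made for this example; they are not Humio settings or part of any Humio API, and the input is assumed to be values read off the widget by hand or from an export.

```python
from statistics import mean


def find_unbalanced_nodes(ingest_bytes_per_day, tolerance=0.25):
    """Return {node: relative deviation} for nodes far from the cluster mean.

    ingest_bytes_per_day: dict mapping a node label to its ingest in bytes/day.
    tolerance: hypothetical threshold; 0.25 means more than 25% above or
    below the mean counts as unbalanced.
    """
    average = mean(ingest_bytes_per_day.values())
    if average == 0:
        # No ingest anywhere, so there is nothing to compare.
        return {}
    flagged = {}
    for node, ingest in ingest_bytes_per_day.items():
        deviation = (ingest - average) / average
        if abs(deviation) > tolerance:
            flagged[node] = deviation
    return flagged


# Hypothetical values: node-3 receives half the cluster average.
print(find_unbalanced_nodes({"node-1": 100.0, "node-2": 100.0, "node-3": 40.0}))
```

A negative deviation means the node receives less than the average, a positive one more. Which tolerance is appropriate depends on the cluster.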
This shows how much CPU each Humio node is using. Within a cluster, if each Humio node has the same specifications, and digest and storage partitions are evenly distributed, you would expect each Humio node to have about the same CPU usage.
If some nodes are experiencing particularly high usage, this may indicate that something is wrong with Humio or the cluster setup.
This is the number of segments to be queued for search by a query per vHost (i.e., Humio Node ID). When a query is run by a user or a dashboard or an alert, Humio needs resources to pull the segment files in question, have them scanned and then return the results to the query. If those resources aren’t available, queries get put into a queue.
Ideally, this value is kept at 0 for every Humio node. That means each node can start scanning segments as soon as it receives a query, without waiting. Spikes can be expected, especially during times when more queries are received than usual. A constant queue, however, could indicate built-up load on the nodes, which will mean slow queries.
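The difference between an expected spike and a constant queue can be sketched in Python as follows. This is only an illustration over a list of queue-length samples for one node; the function name and the choice of five consecutive non-zero samples are assumptions, not values defined by Humio.

```python
def has_sustained_queue(samples, min_consecutive=5):
    """Return True if the queue stayed above 0 for a run of samples.

    samples: queue lengths for one node, ordered by time.
    min_consecutive: hypothetical number of back-to-back non-zero samples
    that is treated as a constant queue rather than a spike.
    """
    run = 0
    for value in samples:
        # Extend the current run of non-zero samples, or reset it on a 0.
        run = run + 1 if value > 0 else 0
        if run >= min_consecutive:
            return True
    return False


# Isolated spikes that drain back to 0 are not flagged.
print(has_sustained_queue([0, 3, 0, 0, 7, 0], min_consecutive=3))  # False
# A queue that never drains is flagged.
print(has_sustained_queue([0, 2, 4, 5, 1, 0], min_consecutive=3))  # True
```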
This is a timechart of the Humio Metric named data-ingester-errors. It shows the errors per second for each repository in which there was an error parsing an event. To investigate, you can run a query in the repository affected by the errors that looks like this:
@error=true | groupby(@error_msg)
This will show you all of the ingest error messages. It should give you an indication as to what went wrong.
This is a very important metric in Humio, as it can indicate slowness in the cluster. This timechart shows the average and median of the ingest latency metric. Ingest latency is defined as the time taken from an event being inserted into the ingest queue to it being digested (before being parsed), with live queries updated and the event added to blocks ready for segment files.
Ideally, this value stays below 10 seconds per node, which is a sign that the cluster is healthy.
Continuous increases in latency on one or more nodes can suggest problems. This is usually because Humio is not digesting as fast as it is ingesting. This could mean Humio is being sent more data than its resources can handle, or that the resources are being used elsewhere.
Humio has a built-in threshold and will start rejecting events from log shippers if Ingest Latency reaches a certain limit. See the reference page on MAX_INGEST_DELAY_SECONDS.
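The two signals described for ingest latency, the 10-second guideline and a continuous increase, can be checked separately. The Python sketch below assumes a time-ordered list of latency samples in seconds for one node; the function name is hypothetical, and "continuous increase" is interpreted here as every sample being higher than the one before, which is a simplification.

```python
def assess_ingest_latency(samples, healthy_limit=10.0):
    """Return (above_limit, rising) for one node's latency samples.

    samples: ingest latency in seconds, ordered by time.
    healthy_limit: the 10-second guideline from the text above.
    """
    if not samples:
        raise ValueError("at least one latency sample is required")
    # The most recent sample decides whether the guideline is exceeded.
    above_limit = samples[-1] >= healthy_limit
    # Simplified trend check: strictly increasing across all samples.
    rising = len(samples) > 1 and all(
        later > earlier for earlier, later in zip(samples, samples[1:])
    )
    return above_limit, rising


print(assess_ingest_latency([1.0, 1.2, 0.9]))        # (False, False)
print(assess_ingest_latency([2.0, 5.0, 9.0, 14.0]))  # (True, True)
```

A node that is rising but still below the limit may deserve attention before it crosses the guideline. This check is independent of the MAX_INGEST_DELAY_SECONDS threshold, whose value is not assumed here.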
This shows the top Humio ERRORs in a cluster. The format is "$humioClass | $errorMessage | $exception". This might give you an indication of issues in a cluster.
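If rows in this format are exported for further analysis, they can be split back into their three parts. The Python sketch below assumes the separator is exactly " | " and that only the error message may itself contain that separator; the function name and the sample class and exception names are hypothetical.

```python
def parse_error_row(row):
    """Split a "$humioClass | $errorMessage | $exception" row into parts."""
    # The class is everything before the first separator.
    humio_class, _, rest = row.partition(" | ")
    # The exception is everything after the last separator.
    message, separator, exception = rest.rpartition(" | ")
    if not separator:
        # Only two fields present: treat the remainder as the message.
        message, exception = rest, ""
    return {"class": humio_class, "message": message, "exception": exception}


# Hypothetical row; the message itself contains a " | ".
row = "ExampleService | Could not parse | event | java.lang.RuntimeException"
print(parse_error_row(row))
```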
This is a timechart of the Errors Grouped over time.