Humio Insights Errors Dashboard

This page provides a more in-depth description of each of the widgets from the Humio Errors Dashboard.

Errors Grouped

This shows the top Humio ERRORs in a cluster. The format is "$humioClass | $errorMessage | $exception". This can give you an indication of potential or actual problems in your cluster.
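
If you want to approximate this grouping yourself, a minimal sketch of a query you could run in the humio repository follows. The class field exists in Humio’s internal logs, but the exact fields the widget groups on are assumptions here.

humio
loglevel=ERROR
| groupBy(class, function=count())
| sort(_count, order=desc)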

Errors Over Time

This is a timechart of the Errors Grouped over time.
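
As a sketch, a similar timechart can be produced in the humio repository, again assuming the internal class field:

humio
loglevel=ERROR
| timeChart(class)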

Missing Nodes

This is a Humio Metric named missing-cluster-nodes. The metric is provided by each Humio node and shows the number of nodes that each node has reported as dead. A healthy system should have none of these.

If this widget shows values greater than 0, you should determine which node is missing. To do this, go to the Cluster Administration page to see which nodes are offline.
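
To chart the metric yourself, something like the following could be run in the humio-metrics repository. This is a sketch: the name and value field names are assumptions, so verify them against an actual metric event first.

humio
// sketch: assumes a gauge metric with fields name and value
name=missing-cluster-nodes
| timeChart(#vhost, function=max(value))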

Node Shutdowns

This is a timechart showing which vHost (i.e., Humio Node ID) has shut down in a given time range. If a node shuts down unexpectedly, there may be ERROR log entries explaining why.

Check the Cluster Administration page to see if any Humio nodes are currently down. Run the query below if a shutdown is unexpected:

humio
#vhost=<HumioNodeID> loglevel=ERROR

Failed HTTP Checks

This is a Humio Metric named failed-http-checks. Reported by each Humio node, it’s the number of nodes that appear to be unreachable over HTTP.
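
A sketch for charting this metric per node, with the same field-name assumptions as the missing-cluster-nodes query above:

humio
name=failed-http-checks
| timeChart(#vhost, function=max(value))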

CPU Usage in Percent

This is the CPU usage per vHost (i.e., Humio Node ID). It’s an important metric since high CPU usage can indicate problems, including but not limited to the following:

  • The system is ingesting more than it can digest and spending all of its resources digesting; or

  • Inefficient parsers or queries that consume all of the CPU.
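
If you need to confirm CPU pressure directly on a suspect node, standard OS tooling is often the quickest cross-check. This is plain Linux, not a Humio command; mpstat comes from the sysstat package:

shell
# sample CPU usage once per second, five times
mpstat 1 5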

Ingest Latency

A Humio Metric named event-latency. It shows the overall latency between the ingest queue and the digest pipeline: the average time, per Humio node, from when an event is inserted into the ingest queue until it is added to a block stored in a segment file.

Ideally, this value stays below 10 seconds per node; that is a sign of a healthy system.

Continuous increases in latency on one or more nodes can suggest problems. This usually means Humio is not digesting as fast as it’s ingesting, either because too much data is being sent relative to the available resources, or because resources are being used elsewhere.
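
To break the latency down per node, here is a sketch of a query against the humio-metrics repository. The mean field is an assumption, so inspect an actual event-latency event to confirm which fields are available:

humio
// sketch: assumes the metric exposes a mean field
name=event-latency
| timeChart(#vhost, function=avg(mean))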

Search Queue

This is the number of segments queued for search per vHost (i.e., Humio Node ID). When a query is run by a user, a dashboard, or an alert, Humio needs resources to pull the segment files in question, scan them, and return the results to the query. If those resources aren’t available, queries are put in a queue until they are.

Ideally, this value is kept at 0 per Humio node, meaning no node has to wait to scan segments as soon as it receives a query. Spikes can be common, especially during times when more queries are received than usual. A constant queue, however, could indicate built-up load on the nodes, which will cause slow queries.

HTTP Errors 500s

This provides a timechart showing HTTP 500 errors across all Humio nodes. HTTP 500 is an internal server error and will usually correlate with Humio ERROR logs. To search for error logs, you can refer to the Errors Grouped dashboard widget.
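
To dig into the underlying requests, here is a sketch of a query in the humio repository; the statuscode field is an assumption about the internal access-log format:

humio
// sketch: statuscode is an assumed field name; verify against your events
statuscode>=500
| timeChart(#vhost)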

Missing Segments Reported by ClusterManagement

The Cluster Management page, which is accessible to Humio Root users, shows information about the Humio cluster, including missing segments. This timechart shows which Humio nodes have reported missing segments. A missing segment means a Humio node is supposed to have a segment file stored locally on its machine, but the file isn’t there.

To investigate this, look at which segments are missing using the Missing Segments API. Once you have the segment IDs, you can investigate which Humio nodes should hold each segment.

You can then check in the global-data-snapshot.json file to see which Humio host should hold this segment by running this command on any Humio node:

shell
grep -A10 <segmentID> global-data-snapshot.json

There should be two fields in there called currentHosts and ownerHosts. The numbers in those fields correspond to the Humio node IDs that the segment file should be on. You can then run this command on the relevant Humio node to see if the segment file is actually present:

humio
locate <segmentID>

If it’s present, contact Humio support to investigate further. If not, the segment file has probably been deleted, in which case you can use the Delete Missing Segments API to remove the Missing Segments warning from Humio.

Alert Notification Errors & Alert With Errors

This is a timechart showing "$dataspace/$alertName" for alerts that tried to fire but failed with an error. The Alerts page for that repository will usually list the error under the alert name. You can read more about this on the Errors & Warnings documentation page.

Alerts with Warnings (Not firing)

This is useful if the query used in an alert receives a warning while running: for example, “Humio is slow” or “Exceeded Groupby limit”. By default, the alert won’t fire in that case. To resolve this, you can fix the warnings or set the environment variable ALERT_DESPITE_WARNINGS to true. You can read more about its behaviour in the Errors & Warnings documentation.
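
How you set the variable depends on how you run Humio. As a sketch, for a shell-launched deployment it could be exported into the process environment before startup:

shell
# example only: set in the environment of the Humio process before it starts
export ALERT_DESPITE_WARNINGS=true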

Slow Warnings

This is a timechart showing the number of “Humio is slow” warnings for a query per vHost (i.e., Humio Node ID). You typically see this warning when running a query on a node that is slow. It is usually related to the node’s Ingest Latency, and reducing the ingest latency for the node with Slow Warnings should stop it.

Ingest Errors

This is a timechart of the Humio Metric named data-ingester-errors. It shows the errors per second, for each repository, where there was an error parsing an event. To investigate, you can run a query like this in the affected repository:

humio
@error=true | groupBy(@error_msg)

This will show you all of the ingest error messages. That should give you an indication as to what went wrong.

Global Snapshot File Size

In Humio there is a file called global-data-snapshot.json, also known as the Global file. It holds all of the key information about the Humio cluster and is constantly updated across all nodes. It’s where Humio stores all metadata on repositories, users, and all the other objects you can create through the user interface. It also holds the metadata on the segment files that hold the events shipped to Humio.

This Global file is handled by Humio and should be kept as small as possible to maintain high performance within Humio. A healthy system should not see the Global snapshot file exceed 1 GB; ideally, it should stay below 500 MB. If this is not the case, you should discuss it with Humio support.
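
To check the current size on a node, run this from the Humio data directory where the file lives:

shell
ls -lh global-data-snapshot.json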