Insights Errors Dashboard

This page provides a more in-depth description of each of the widgets from the Humio Errors Dashboard.

Errors Grouped

This shows the most frequent Humio ERROR log entries in a cluster, formatted as "$humioClass | $errorMessage | $exception". It can give you an indication of potential or actual problems in your cluster.
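As a sketch, a query along these lines in the humio repository can reproduce this grouping (the class and message field names are assumptions about Humio's internal log format):

humio
loglevel=ERROR | groupby([class, message])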

Errors Over Time

This is a timechart of the Errors Grouped over time.
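A hedged sketch of such a query, assuming Humio's internal logs carry a class field holding the logging class:

humio
loglevel=ERROR | timechart(class)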

Missing Nodes

This is the Humio metric named missing-cluster-nodes. This metric is provided by each Humio node and shows the number of nodes that each node has reported as dead. A healthy system should report zero.

If this widget shows values greater than 0, you should determine which node is missing. To do this, go to the Cluster Administration page to see which nodes are offline.
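To chart this metric yourself, a query along these lines may work in the repository where your cluster stores its metrics (humio-metrics is assumed here, as is the m1 field holding the metric value, based on how Humio typically exposes metrics):

humio
name=missing-cluster-nodes | timechart(#vhost, function=max(m1))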

Node Shutdowns

This is a timechart showing which vHost (i.e., Humio Node ID) has shut down in a given time range. If a node shuts down unexpectedly, there may be ERROR log entries explaining why.

Check the Cluster Administration page to see if any Humio nodes are currently down. Run the query below if a shutdown is unexpected:

humio
#vhost=<HumioNodeID> loglevel=ERROR

Failed HTTP Checks

This is a Node Level Metric named failed-http-checks. It's the number of nodes that appear to be unreachable over HTTP, as reported by each Humio node.

CPU Usage in Percent

This is the CPU usage per vHost (i.e., Humio Node ID). It's an important metric since high CPU usage can indicate problems, including but not limited to the following:

  • The system is ingesting more than it can digest and is spending all of its resources digesting; or

  • Inefficient parsers or queries that consume all of the CPU.

Ingest Latency

This is a Node Level Metric named event-latency. It shows the overall latency between the ingest queue and the digest pipeline: the average time per Humio node from when an event is inserted into the ingest queue until it is added to a block to be stored in a segment file.

Ideally, this value stays below 10 seconds per node; keeping it there is a sign of a healthy system.

A continuous increase in latency on one or more nodes can indicate problems. This usually means Humio is not digesting as fast as it's ingesting, which could mean too much data is being sent relative to the available resources, or that resources are being used elsewhere.
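As a sketch, you could chart this latency per node with a query like the following in the repository where your cluster stores its metrics (the humio-metrics repository name and the m1 field holding the metric value are assumptions):

humio
name=event-latency | timechart(#vhost, function=avg(m1))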

Search Queue

This is the number of segments queued for search per vHost (i.e., Humio Node ID). When a query is run by a user, a dashboard, or an alert, Humio needs resources to pull the segment files in question, scan them, and then return the results to the query. If those resources aren't available, queries are put in a queue until they are.

Ideally, this value is kept at 0 for each Humio node, meaning no node has to wait to scan segments as soon as it receives a query. Spikes can be common, especially during times when more queries are received than usual. A constant queue, however, could indicate built-up load on the nodes, which will cause slow queries.

HTTP Errors 500s

This provides a timechart showing HTTP 500 errors across all Humio nodes. HTTP 500 is an internal server error, and these will usually correlate with Humio ERROR logs. To search for error logs, you can refer to the Errors Grouped dashboard widget.

Missing Segments Reported by ClusterManagement

The Cluster Management page, accessible to Humio root users, shows information about the Humio cluster, including information related to missing segments. This timechart shows which Humio nodes have reported missing segments. A missing segment means a Humio node is supposed to have a segment file stored locally on its machine, but the file is not there.

To investigate this, you may want to look at which segments are missing using Missing Segments. Once you do that, you can investigate which Humio nodes should hold each segment.

You can then check in the global-data-snapshot.json file to see which Humio host should hold this segment by running this command on any Humio node:

humio
grep -A10 <segmentID> global-data-snapshot.json

There should be two fields in there called currentHosts and ownerHosts. The numbers in those fields correspond to the Humio node IDs that the segment file should be on. You can then run this command on the Humio node to see if that segment file is actually present:

humio
locate <segmentID>

If it's present, contact Humio support to investigate further. If not, the segment file has probably been deleted, in which case you can use Missing Segments to remove the missing-segments warning from Humio.

Alert Action Errors

This is a timechart showing "$dataspace/$alertName" for alerts that tried to fire actions but failed with an error. The Alerts page for that repository will usually list the error under the alert name. You can read more about this on the Actions documentation page.

Alerts With Other Errors

This is a timechart showing "$dataspace/$alertName" for alerts that failed with errors other than action-firing errors. The Alerts page for that repository will usually list the error under the alert name. You can read more about this on the Actions documentation page.

FDR Ingest Errors

This is a timechart showing "$dataspace/$fdrFeedName" for FDR feeds that encountered an error during ingest. You can read more about this on the Error Handling for FDR Ingestion documentation page.

Scheduled Search Action Errors

This is a timechart showing "$dataspace/$scheduledSearchName" for scheduled searches that tried to fire actions but failed with an error. The Scheduled Searches page for that repository will usually list the error under the scheduled search name. You can read more about this on the Actions documentation page.

Scheduled Searches With Other Errors

This is a timechart showing "$dataspace/$scheduledSearchName" for scheduled searches that failed with errors other than action-firing errors. The Scheduled Searches page for that repository will usually list the error under the scheduled search name. You can read more about this on the Actions documentation page.

Slow Warnings

This is a timechart showing the number of "Humio is slow" warnings for queries per vHost (i.e., Humio Node ID). You will typically see this warning when running a query on a node that is slow. This is usually related to the node's Ingest Latency; reducing ingest latency on the node with slow warnings should stop them.

Ingest Errors

This is a timechart of the Node Level Metric named data-ingester-errors. It shows the errors per second for each repository in which an event failed to parse. To investigate, you can run a query like this in the repository affected by the errors:

humio
@error=true | groupby(@error_msg)

This will show you all of the ingest error messages. That should give you an indication as to what went wrong.
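To see which error messages are most common, a sketch of an extended query with an explicit count and sort (count() and sort() are standard Humio query functions; groupby emits the count as _count):

humio
@error=true | groupby(@error_msg, function=count()) | sort(_count)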

Global Snapshot File Size

In Humio, there is a file called global-data-snapshot.json, also known as Humio's Global file (described in the Architecture of Humio documentation). It essentially holds all of the key information about the Humio cluster and is constantly updated across all nodes. It's where Humio stores all metadata on repositories, users, and all the other objects you can create through the User Interface. It also holds the metadata on the segment files that hold the events shipped to Humio.

This Global file is handled by Humio and should be kept as small as possible to maintain high performance within Humio. A healthy system should not see the Global snapshot file exceed 1 GB, and ideally it stays below 500 MB. If this is not the case, you should discuss it with Humio support.
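The thresholds above can be checked directly on a node. The following shell sketch assumes the file sits in the current working directory; point FILE at your Humio data directory instead:

```shell
# Classify the Global snapshot file size against the recommended limits.
# The path is an assumption -- adjust FILE to your Humio data directory.
FILE="global-data-snapshot.json"
SIZE=$(stat -c%s "$FILE" 2>/dev/null || echo 0)   # 0 if the file is absent
if [ "$SIZE" -gt $((1024 * 1024 * 1024)) ]; then
  STATUS="critical"   # over 1 GB: discuss with Humio support
elif [ "$SIZE" -gt $((500 * 1024 * 1024)) ]; then
  STATUS="warning"    # over 500 MB: watch for further growth
else
  STATUS="ok"
fi
echo "$FILE: $SIZE bytes ($STATUS)"
```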