Insights Errors Dashboard

This page provides a more in-depth description of each of the widgets from the LogScale Errors Dashboard.

Errors Grouped

This shows the top LogScale ERRORs in a cluster. The format is "$humioClass | $errorMessage | $exception". This might give you an indication of potential and real problems in your cluster.
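
If you want to reproduce this grouping manually, you can run a query along the following lines in the repository that holds the LogScale internal logs (typically the humio repository). This is only a sketch: it assumes the internal error logs expose class and message fields alongside loglevel, which may differ between LogScale versions.

logscale
// class and message are assumed field names in the internal logs
loglevel=ERROR
| groupBy([class, message])
| sort(_count, order=desc)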

Errors Over Time

This is a timechart of the Errors Grouped over time.
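
A manual version of this widget can be sketched with timeChart(), again assuming the internal logs expose a class field:

logscale
// class is an assumed field name in the internal logs
loglevel=ERROR
| timeChart(class, limit=10)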

Missing Nodes

This is one of the LogScale Metrics, named missing-cluster-nodes. The metric is provided by each LogScale node and shows the number of nodes that the reporting node considers dead. A healthy system should have none of these.

If this widget shows values greater than 0, you should determine which node is missing. To do this, go to the Cluster Administration page to see which nodes are offline.
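
If you want to query the metric directly, a rough sketch against the humio-metrics repository (where LogScale's own metrics are typically stored) could look like the following; the name and value field names, and the use of #vhost as the series, are assumptions that may vary between LogScale versions.

logscale
// name and value are assumed field names in humio-metrics
name=missing-cluster-nodes
| timeChart(#vhost, function=max(value))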

Node Shutdowns

This is a timechart showing which vHost (i.e. LogScale Node ID) has shut down in a given time range. If a node shuts down unexpectedly, there may be ERROR log entries explaining why.

Check the Cluster Administration page to see if any LogScale nodes are currently down. If a shutdown is unexpected, run the query below:

logscale
#vhost=<LogScaleNodeID> loglevel=ERROR

Failed HTTP Checks

This is one of the Node-Level Metrics, named failed-http-checks. It is the number of nodes that appear to be unreachable over HTTP, as reported by each LogScale node.
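
The same hedged pattern as for missing-cluster-nodes applies here; for a per-node view over the search window you could try something like:

logscale
// name and value are assumed field names in humio-metrics
name=failed-http-checks
| groupBy(#vhost, function=max(value))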

CPU Usage in Percent

This is the CPU usage per vHost (i.e. LogScale Node ID). It's an important metric since high CPU usage can indicate problems, including but not limited to the following:

  • The system is ingesting more than it can digest and is spending all of its resources digesting; or

  • Inefficient parsers or queries that consume all of the CPU.

Ingest Latency

This is one of the Node-Level Metrics, named event-latency. It shows the overall latency between the ingest queue and the digest pipeline: the average time, per LogScale node, from when an event is inserted into the ingest queue until it is added to a block to be stored in a segment file.

Ideally, this value stays below 10 seconds per node; that is a sign of a healthy system.

Continuous increases in latency on one or more nodes can suggest problems. This is usually because LogScale is not digesting as fast as it is ingesting, which could mean that too much data is being sent for the available resources, or that those resources are being used elsewhere.
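
To see how this latency develops per node, a hedged sketch against the humio-metrics repository could be the following; as above, the name and value fields are assumptions.

logscale
// name and value are assumed field names in humio-metrics
name=event-latency
| timeChart(#vhost, function=max(value))

Steady growth on a single series, rather than short spikes, is the pattern to look for.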

Search Queue

This is the number of segments queued for search per vHost (i.e. LogScale Node ID). When a query is run by a user, a dashboard, or an alert, LogScale needs resources to pull the segment files in question, scan them, and return the results to the query. If those resources aren't available, queries are put in a queue until they are.

Ideally, this value is kept at 0 for every LogScale node, meaning no node has to wait to scan segments as soon as it receives a query. Spikes can be common, especially during times when more queries are received than usual. A constant queue, however, could indicate built-up load on the nodes, which will cause slow queries.

HTTP Errors 500s

This provides a timechart showing HTTP 500 errors across all LogScale nodes. HTTP 500 is an internal server error and will usually correlate with LogScale ERROR logs. To search for error logs, you can refer to the Errors Grouped dashboard widget.
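
To drill into where the 500s come from, a rough sketch in the internal logs repository could be the following; it assumes the internal HTTP request logs carry a statuscode field, which may differ between LogScale versions.

logscale
// statuscode is an assumed field name; adjust to your version's request logs
statuscode=500
| timeChart(#vhost)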

Missing Segments Reported by ClusterManagement

The Cluster Management page, which is accessible by LogScale Root users, shows information about the LogScale cluster. It includes information related to missing segments. This timechart shows which LogScale nodes have reported missing segments. A missing segment means a LogScale node is supposed to have a segment file locally stored on its machine, but it's not there.

To investigate this, first look at which segments are missing using the Missing Segments. Once you know which segment is missing, you can investigate which LogScale nodes should hold it.

You can then check the global-data-snapshot.json file to see which LogScale host should hold this segment by running this command on any LogScale node:

shell
$ grep -A10 segmentID global-data-snapshot.json

There should be two fields in there called currentHosts and ownerHosts. The numbers in those fields correspond to the LogScale node IDs that the segment file should be on. You can then run this command on that LogScale node to see if the segment file is actually present:

shell
$ locate segmentID

If it is present, contact LogScale support to investigate further. If not, the segment file has probably been deleted, in which case you can use the Missing Segments to remove the Missing Segments warning from LogScale.

Alert Action Errors

This is a timechart showing the $dataspace/$alertName when the alert tried to fire actions but failed with an error. The Alerts page for that repository will usually list the error under the alert name. You can read more about this on the Actions documentation page.

Alerts With Other Errors

This is a timechart showing the $dataspace/$alertName when an alert fails with an error other than an error firing actions. The Alerts page for that repository will usually list the error under the alert name. You can read more about this on the Actions documentation page.

FDR Ingest Errors

This is a timechart showing the "$dataspace/$fdrFeedName" when there is an error with FDR ingest. You can read more about this on the Error Handling for FDR Ingestion documentation page.

Scheduled Search Action Errors

This is a timechart showing the "$dataspace/$scheduledSearchName" when the scheduled search tried to fire actions, but failed with an error. Looking into the Scheduled Searches page for that repository will usually list the error under the scheduled search name. You can read more about this on the Actions documentation page.

Scheduled Searches With Other Errors

This is a timechart showing the "$dataspace/$scheduledSearchName" when there is an error with scheduled searches other than errors with firing actions. Looking into the Scheduled Searches page for that repository will usually list the error under the scheduled search name. You can read more about this on the Actions documentation page.

Slow Warnings

This is a timechart showing the number of "LogScale is slow" warnings for a query per vHost (i.e. LogScale Node ID). You will typically see this warning if you run a query on a node that is slow. This is usually related to the Ingest Latency of a node. Reducing the ingest latency for the node with Slow Warnings should stop them.

Ingest Errors

This is a timechart of one of the Node-Level Metrics, named data-ingester-errors. It shows the errors per second for each repository in which there was an error parsing an event. To investigate, you can run a query like this in the repository affected by the errors:

logscale
@error=true
| groupBy(@error_msg)

This will show you all of the ingest error messages. That should give you an indication as to what went wrong.

Global Snapshot File Size

In LogScale there is a file called global-data-snapshot.json. It's also known as the LogScale Internal Architecture Global file, which essentially holds all of the key information about the LogScale cluster and is constantly updated across all nodes. It's where LogScale stores all metadata on repositories, users, and all the other objects you can create through the User Interface. It also holds the metadata on the segment files that hold the events shipped to LogScale.

This Global file is handled by LogScale and should be kept as small as possible to maintain high performance within LogScale. A healthy system should not see the Global snapshot file exceed 1 GB; ideally it stays below 500 MB. If this is not the case, you should discuss it with LogScale support.
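
To check the current size on disk, you can run the following from the same directory as the global-data-snapshot.json commands earlier on this page:

shell
$ ls -lh global-data-snapshot.json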