Insights Errors Dashboard
This page provides a more in-depth description of each of the widgets from the LogScale Errors Dashboard.
Errors Grouped
This shows the top LogScale ERRORs in a cluster. The format is "$humioClass | $errorMessage | $exception". This might give you an indication of potential and real problems in your cluster.
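If you want to reproduce a similar grouping ad hoc, a minimal query sketch against the internal debug log repository (typically named humio) could look like the following; the class and message field names are assumptions and may differ between LogScale versions:
loglevel=ERROR
| groupBy([class, message])
| sort(_count)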
Errors Over Time
This is a timechart of the Errors Grouped over time.
Missing Nodes
This is one of the LogScale Metrics, named missing-cluster-nodes.
This metric is provided by each LogScale node and shows the number of nodes that each node has reported as dead. A healthy system should have none of these.
If this widget shows values greater than 0, you should determine which node is missing. To do this, go to the Cluster Administration page to see which nodes are offline.
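To chart the metric yourself, a sketch along these lines can be run against the internal metrics repository (typically humio-metrics); the name and m1 field names are assumptions based on the internal metrics format and may vary by version:
name=missing-cluster-nodes
| timeChart(series=#vhost, function=max(m1))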
Node Shutdowns
This is a timechart showing which vHost (i.e. LogScale Node ID) has shut down in a given time range. If a node shuts down unexpectedly, there could be ERROR log entries explaining why.
Check the Cluster Administration page to see if any LogScale nodes are currently down. Run the query below if a shutdown is unexpected:
#vhost=<LogScaleNodeID> loglevel=ERROR
Failed HTTP Checks
This is one of the Node-Level Metrics, named failed-http-checks.
It shows the number of nodes that appear to be unreachable over HTTP, as reported by each LogScale node.
CPU Usage in Percent
This is the CPU usage per vHost (i.e. LogScale Node ID). It's an important metric since high CPU usage can indicate problems. Potential problems include, but are not limited to, the following:
The system is ingesting more than it can digest and spending all of its resources digesting; or
Inefficient parsers or queries that consume all of the CPU.
Ingest Latency
This is one of the Node-Level Metrics, named event-latency. It shows the overall latency between the ingest queue and the digest pipeline. It's the average difference in time, per LogScale node, between an event being inserted into the ingest queue and then being added as a block to be stored in a segment file.
Ideally, this value stays below 10 seconds per node; that is a sign of a healthy node.
Continuous increases in latency on one or more nodes can suggest problems. This is usually because LogScale is not digesting as fast as it's ingesting, which could mean that too much data is being sent compared to the capabilities of the available resources, or that resources are being used elsewhere.
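To follow the latency per node outside the dashboard, a similar hedged sketch against the internal metrics repository (typically humio-metrics) can be used; again, the name and m1 fields are assumptions, and the metric may expose different value fields in your version:
name=event-latency
| timeChart(series=#vhost, function=avg(m1))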
Search Queue
This is the number of segments queued for search by queries, per vHost (i.e. LogScale Node ID). When a query is run by a user, a dashboard, or an alert, LogScale needs resources to pull the segment files in question, scan them, and return the results to the query. If those resources aren't available, queries are put in a queue until they become available.
Ideally, this value is kept at 0 for every LogScale node, meaning that no node has to wait to scan segments as soon as it receives a query. Spikes can be common, especially during times when more queries are received than usual. A constant queue, however, could indicate built-up load on the nodes, which will cause slow queries.
HTTP Errors 500s
This provides a timechart showing HTTP 500 errors across all LogScale nodes. HTTP 500 is an internal server error and will usually correlate with LogScale ERROR logs. To search for error logs, you can refer to the Errors Grouped dashboard widget.
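To chart these errors yourself, a sketch like the one below can be run against the internal debug log repository, assuming the HTTP status is exposed in a field (here hypothetically named statuscode); adjust the field name to match your version's request logging:
statuscode>=500
| timeChart(series=#vhost)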
Missing Segments Reported by ClusterManagement
The Cluster Management page, which is accessible by LogScale Root users, shows information about the LogScale cluster. It includes information related to missing segments. This timechart shows which LogScale nodes have reported missing segments. A missing segment means a LogScale node is supposed to have a segment file locally stored on its machine, but it's not there.
To investigate this, you may want to look at which segments are missing using the Missing Segments. Once you have done that, you can investigate which LogScale nodes should hold the segment.
You can then check in the global-data-snapshot.json file to see which LogScale host should hold this segment by running this command on any LogScale node:
$ cat global-data-snapshot.json | grep -A10 segmentID
There should be two fields in there called currentHosts and ownerHosts. The numbers in those fields correspond to the LogScale node IDs that the segment file should be on. You can then run this command on the LogScale node to see if that segment file is actually present:
$ locate segmentID
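If locate is not installed, or its database is stale, a plain find works as well; the data directory below is only an example and should be replaced with your node's actual LogScale data directory:
$ find /data/humio-data -name "*segmentID*"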
If it is present, contact LogScale support to investigate further. If not, the segment file has probably been deleted, in which case you can use the Missing Segments to remove the Missing Segments warning from LogScale.
Alert Action Errors
This is a timechart showing the $dataspace/$alertName when the alert tried to fire actions, but failed with an error. The Alerts page for that repository will usually list the error under the alert name. You can read more about this on the Actions documentation page.
Alerts With Other Errors
This is a timechart showing the $dataspace/$alertName when there is an error with alerts other than errors with firing actions. The Alerts page for that repository will usually list the error under the alert name. You can read more about this on the Actions documentation page.
FDR Ingest Errors
This is a timechart showing the "$dataspace/$fdrFeedName" when there is an error with FDR ingest. You can read more about this on the Error Handling for FDR Ingestion documentation page.
Scheduled Search Action Errors
This is a timechart showing the "$dataspace/$scheduledSearchName" when the scheduled search tried to fire actions, but failed with an error. The Scheduled Searches page for that repository will usually list the error under the scheduled search name. You can read more about this on the Actions documentation page.
Scheduled Searches With Other Errors
This is a timechart showing the "$dataspace/$scheduledSearchName" when there is an error with scheduled searches other than errors with firing actions. The Scheduled Searches page for that repository will usually list the error under the scheduled search name. You can read more about this on the Actions documentation page.
Slow Warnings
This is a timechart showing the number of "LogScale is slow" warnings for queries per vHost (i.e. LogScale Node ID). You will typically see this warning if you try to run a query on a node that is slow. This is usually related to the Ingest Latency of the node. Reducing ingest latency for the node with Slow Warnings should stop this.
Ingest Errors
This is a timechart of one of the Node-Level Metrics, named data-ingester-errors. It shows the errors per second for each repository in which there was an error parsing an event. To investigate, you can run a query in the repository affected by the errors that looks like this:
@error=true
| groupBy(@error_msg)
This will show you all of the ingest error messages. That should give you an indication as to what went wrong.
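To see when the errors occur, rather than just what they are, you can extend the same query into a timechart, for example:
@error=true
| timeChart(series=@error_msg)
Note that the number of series can be high if there are many distinct error messages.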
Global Snapshot File Size
In LogScale there is a file called global-data-snapshot.json, also known as the LogScale Internal Architecture Global file. It essentially holds all of the key information about the LogScale cluster and is constantly updated across all nodes. It's where LogScale stores all metadata on repositories, users, and all the other objects you can create through the User Interface. It also holds the metadata on the segment files that hold the events shipped to LogScale.
This Global file is handled by LogScale and should be kept as small as possible to maintain high performance within LogScale. A healthy system should not see the Global snapshot file exceed 1 GB; ideally, it should be less than 500 MB. If this is not the case, you should discuss it with LogScale support.
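To check the current size of the file on disk, you can run a command like this from the LogScale data directory on a node (the directory location varies per installation):
$ ls -lh global-data-snapshot.json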