Insights Overview Dashboard

This dashboard collects the key widgets to watch when monitoring a LogScale cluster, giving a sense of the cluster's overall health.

Ingest per Host

This widget shows how much ingest each LogScale node is receiving, in bytes per day.

The distribution of ingest a node receives is usually dictated by the number of Digest Partitions configured for each host on the Cluster Administration page. If a node is receiving too little or too much ingest compared to other nodes, you may want to rebalance the Digest Partitions so they are distributed evenly.

CPU Usage in Percent

This widget shows how much CPU each LogScale node is using. Within a cluster, if every LogScale node has the same specifications and digest partitions are evenly distributed, you would expect each node to have about the same CPU usage.

If some nodes are experiencing particularly high usage, this may indicate that something is wrong with LogScale or the cluster setup.

Search Queue

This is the number of segments queued to be searched by queries, per vHost (i.e., LogScale node ID). When a query is run by a user, a dashboard, or an alert, LogScale needs resources to pull the relevant segment files, scan them, and return the results to the query. If those resources aren't available, the work is put into a queue.

Ideally, this value is kept at 0 for every LogScale node, meaning each node can scan segments as soon as it receives a query, without waiting. Spikes can be expected, especially during periods when more queries than usual are received. A persistent queue, however, can indicate built-up load on the nodes, which will mean slow queries.

Ingest Errors

This is a timechart of the node-level metric named data-ingester-errors. It shows the errors per second for each repository in which an event failed to parse. To investigate, you can run a query like the following in the affected repository:

logscale
@error=true
| groupBy(@error_msg)

This shows all of the ingest error messages and should give you an indication of what went wrong.
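
If there are many distinct messages, you can also rank them by how often they occur. The sketch below assumes the default _count field that groupBy() produces:

logscale
@error=true
| groupBy(@error_msg)
// sort by the count produced by groupBy() so the most frequent errors come first
| sort(_count, order=desc)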

Ingest Latency Per Host (Digest)

This is a very important metric in LogScale, as it can indicate slowness in the cluster. This timechart shows the average and median of the ingest latency metric. Ingest latency is defined as the time from an event being inserted into the ingest queue until it has been digested, that is, until live queries have been updated and the event has been added to blocks ready for segment files.

Ideally, this value stays below 10 seconds per node; keeping it there is a sign that the cluster is healthy.

Continuous increases in latency on one or more nodes can indicate problems, usually because LogScale is not digesting as fast as it is ingesting. This can mean that more data is being sent to LogScale than its resources can handle, or that those resources are being used elsewhere.
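
To chart this outside the dashboard, a minimal sketch like the following can be run against the repository that receives your cluster's metrics (commonly humio-metrics). The metric name event-latency and the fields name, host, and mean are assumptions here, not taken from the dashboard itself, so verify them against the metrics your cluster actually emits:

logscale
// hypothetical sketch: the metric name and the host and mean fields are assumptions
name="event-latency"
// chart average latency per host in 5-minute buckets
| timeChart(host, function=avg(mean), span=5m)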

LogScale has a built-in threshold and will start rejecting events from log shippers if ingest latency exceeds a certain limit. See the reference page for MAX_INGEST_DELAY_SECONDS.

Errors Grouped

This shows the top LogScale ERRORs in a cluster, formatted as "$humioClass | $errorMessage | $exception". This can give you an indication of issues in the cluster.
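
A rough equivalent can be queried from LogScale's own debug logs. In the sketch below, the loglevel filter and the class, message, and exception field names are assumptions about the internal log schema, so adjust them to match the events in your cluster:

logscale
// hypothetical sketch: field names are assumptions about LogScale's internal log schema
loglevel=ERROR
| groupBy([class, message, exception])
| sort(_count, order=desc)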

Errors Over Time

This is a timechart of the grouped errors, showing how they develop over time.