Insights Ingest Dashboard

In the Ingest Dashboard, from the Insights Package, you can see a wide range of information about the data you're sending to your LogScale cluster. This page explains how each of the widgets can be useful.

Ingest per Host

This widget shows how much Ingest each LogScale node is receiving in bytes per day.

The distribution of ingest a node receives is usually dictated by the number of Digest Rules configured per host on the Cluster Administration page. If a node is receiving too little or too much ingest compared to other nodes, you may want to reassign Digest Partitions so they are distributed evenly.

Ingest per Repository

This widget shows how much Ingest each LogScale repository is receiving in bytes per day.

This can be useful for monitoring where your ingest is coming from, for example when you need to reduce daily ingest for license reasons. To reduce the load on the cluster, you can stop shipping data to particular high-ingest repositories, or Block Ingest via the repository settings in the LogScale UI.

CPU Usage in Percent

This shows you how much CPU each LogScale node is utilising. Within a cluster, if each LogScale node has the same specifications, and digest partitions are evenly distributed, you would expect each LogScale node to have around the same CPU usage.

If some nodes are experiencing particularly high usage, this could indicate that something is wrong with LogScale or your cluster setup.

Ingest Latency per Host (Digest)

This is a very important metric in LogScale and can indicate slowness in the cluster. This timechart shows the average and median of the ingest latency metric. Ingest latency is defined as the time taken from an event being inserted into the ingest queue (after it has been parsed) until it has been digested, that is, until live queries have been updated and the event has been added to blocks ready for segment files.

Keeping this value below 10 seconds per node is a sign that the cluster is healthy.

A continuous increase in latency on one or more nodes can indicate a problem. This is usually because LogScale is not digesting as fast as it is ingesting: either more data is being sent than the available resources can handle, or those resources are being used elsewhere.
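
To see how this latency is trending per node outside of the dashboard, you can query the underlying metric directly. The following is a minimal sketch, assuming the ingest latency metric is exported under the name event-latency to a searchable metrics repository (for example humio-metrics) and that its maximum value is available in a max field; adjust the names to match your environment.

logscale
// Assumed names: event-latency metric, max field, #host series.
// Depending on your version, the node may instead be identified by #vhost.
name=event-latency
| timeChart(#host, function=avg(max))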

LogScale has a built-in threshold and will start rejecting events from log shippers if ingest latency exceeds a certain limit. See the reference page for the MAX_INGEST_DELAY_SECONDS configuration variable.

Ingest Latency per Host (Partition)

This is similar to the above, except that it measures latency per digest partition, as configured under Digest Rules.

Ingest Partition Changes

This timechart shows the number of changes to the set of active digest nodes triggered by digest coordination.

For a healthy system, this is close to zero, except when an administrator alters the desired digest partition scheme via the Digest Partitions page.

Node Shutdowns

This is a timechart of which LogScale nodes have been shut down in the last 24 hours.

If a node shutdown is unexpected, there may be ERROR logs in the humio repository explaining it. Running a query in the humio repository for logs from that particular host shortly before it shut down may also reveal the problem, as in the sketch below. Other widgets in the LogScale Insights App might also show that something is wrong with that particular node; be sure to check the Ingest Errors widget as well.
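
For example, a query along these lines groups recent ERROR logs by the class that emitted them. This is a minimal sketch that assumes the humio repository's standard loglevel and class fields; narrow it further with a filter for the affected node and a time window around the shutdown.

logscale
// Count recent ERROR logs by emitting class; add a filter for the
// affected node (e.g. its #vhost tag) and narrow the time window.
loglevel=ERROR
| groupBy(class, function=count())
| sort(_count)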

Killed Live Queries because of Ingest Delay

This timechart shows which LogScale nodes have cancelled their live queries due to ingest latency on that node. Because of the way live queries work, events are evaluated against them during ingest, before they are stored as segment files. This means that if there is significant ingest delay, these queries won't be accurate or up to date, which is especially important for alerts.

This behaviour is controlled by the LIVEQUERY_CANCEL_TRIGGER_DELAY_MS environment variable.

Ticks Spent in Digest and Live Searches (Millis Per Second)

This timechart shows the number of CPU ticks spent specifically on digest work and on live query work.

Time Spent in Parsers

This timechart shows the total time events spend in parsers, in milliseconds per second, across all LogScale hosts and parsers.

Parsers Using The Most Time (Millis)

This table shows you the parsers using the most time in your LogScale cluster. For each parser, it shows timeInParser, eventsPerSecond and timePerEvent.

The timeInParser is the total time, in milliseconds, that the parser spent parsing events in the last hour.

The timePerEvent is the average time spent on each event in the last hour. This is very important, as it can tell you whether you have an inefficient parser: anything averaging more than 1 millisecond per event should be investigated. An inefficient parser is often the cause of Ingest Latency per Host (Partition), since digesting new data can slow down significantly when LogScale nodes spend a lot of their time parsing data.

Ingest Errors

A timechart of the data-ingester-errors metric (see Node-Level Metrics). It shows the errors per second for each repository where there was an error parsing an event. To investigate, you can run a query like the following in the repository affected by the errors:

logscale
@error=true 
| groupBy(@error_msg)

This will show you all of the ingest error messages and should give you an indication of what went wrong.
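
Once you have identified the most frequent error message, you can look at the raw events behind it. The query below is a sketch; the @error_msg value is a hypothetical placeholder to replace with a message from the previous result.

logscale
// "example error message" is a placeholder; substitute a message from
// the previous result to inspect the raw events behind it.
@error=true @error_msg="example error message"
| tail(200)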

Datasources Increasing or Decreasing Auto-Sharding

This timechart shows the datasources that have increased or decreased their number of auto-shards (see Configure Auto-Sharding for High-Volume Data Sources). If a LogScale datasource is receiving a high volume of ingest, LogScale needs to allocate more resources to digest that volume. It does this by allocating auto-shards, which essentially dedicate more CPU cores to digesting that data.

This widget can be useful for identifying high-volume datasources that may be starting to struggle with their ingest load.

Datasources Hitting Max Auto-Shards

Following on from the above, this timechart shows which of those high-volume datasources are now hitting LogScale's maximum number of auto-shards.

LogScale has a configurable limit on the maximum number of auto-shards that can be applied to a datasource, with a default of 16. This limit is configurable as described in Configure Auto-Sharding for High-Volume Data Sources. If your LogScale cluster is ingesting multiple terabytes per day and you have datasources hitting this limit, you can increase it, but not beyond 128. Auto-sharding is only a temporary solution, though; for a more permanent solution, you should configure proper Event Tags on the datasource experiencing issues.

Typically, one core on a LogScale node should be enough to handle 256 GB of ingest per day, and each auto-shard is the equivalent of allocating another core to digesting the data, so the practical upper limit on auto-shards is the number of cores available in your LogScale cluster. By that rule of thumb, a datasource receiving 2 TB per day needs roughly eight auto-shards. If you're hitting the maximum number of auto-shards and that number is equal to the total number of cores in the LogScale cluster, you may need to allocate more resources to LogScale to handle the amount of ingest you are sending.

Datasources Auto-Sharding

This table shows the datasources that have required auto-sharding in the last 24 hours due to high-volume ingest, and the maximum number of auto-shards each one reached.

Number of Datasource per Repository

This timechart shows you the repositories with the most datasources (see LogScale Internal Architecture) and how many each repository has.

A datasource is a set of events that share the same Event Tags; LogScale divides each repository into multiple datasources based on those tags.

The maximum number of datasources allowed in a repository is 10,000, as defined by the MAX_DATASOURCES variable. LogScale can run into issues if you exceed this limit, and once you do, you will no longer be able to ingest data into the affected repository. It is therefore very important to apply Event Tags to low-cardinality fields to avoid getting into this situation; the sketch below shows one way to check tag cardinality.
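
To get a sense of how many tag combinations, and therefore datasources, a repository produces, you can group its events by their tags. This is a minimal sketch in which #host and #application are hypothetical tag names; substitute the tags your repository actually uses. Each row in the result corresponds roughly to one datasource.

logscale
// #host and #application are hypothetical tag names; each resulting row
// corresponds roughly to one datasource in the repository.
groupBy([#host, #application], function=count())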