Insights Search Dashboard

This dashboard provides further insight into the queries and searches run in a LogScale cluster.

Search Queue

This is the number of segments queued for search per vHost (i.e. LogScale node ID). When a query is run by a user, a dashboard, or an alert, LogScale needs resources to pull the relevant segment files, scan them, and return the results to the query. If those resources aren't available, queries are placed in a queue.

Ideally, this value is kept at 0 for every LogScale node, meaning no node has to wait to scan segments once it receives a query. Spikes can be expected, especially at times when more queries than usual are received. A constant queue, however, can indicate built-up load on the nodes, which results in slow queries.
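
If you want to inspect the underlying data yourself, the cluster's internal metrics are logged to the humio repository. The sketch below simply lists metric names that mention queueing so you can locate the one behind this widget; the #kind=metrics tag and the name field match how internal metrics are usually logged, but the "*queue*" filter is only an assumption about how the relevant metric is named.

logscale
// Sketch: list internal metric names mentioning "queue" to locate the
// metric behind this widget. The name field and the "*queue*" filter are
// assumptions; check your own metrics events.
#type=humio #kind=metrics name="*queue*"
| groupBy([name])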

CPU Usage in Percent

This shows the CPU usage of each LogScale node. Within a cluster, if each LogScale node has the same specifications and digest partitions are evenly distributed, you would expect each node to have about the same CPU usage.

If some nodes are experiencing particularly high usage, this could indicate that something is wrong with LogScale or the cluster setup.

Query Restarts By Reason

There are occasions when a query in LogScale needs to be restarted; it could be an alert or a dashboard query. This timechart shows, over time, the reasons why queries were restarted.

There are a few common reasons:

  • Ingest Partition changes in the LogScale cluster;

  • Lookup File changes that are used in the query;

  • Permission changes on the query;

  • View Connection changes.

A view connection change will update the repository connections in a view where the query might be running.

The common reasons listed above are to be expected and can usually be ignored. The reasons below, though, are worth investigating (a query sketch for charting restarts follows this list):

  • Poll Error because of a dead host. This means a LogScale node is down, and the cause should be investigated.

  • Statuscode=404. If the query is an alert, it is worth checking the query and investigating why it is returning a 404 error.
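
If you want to chart restarts yourself, a rough sketch against the humio repository is shown below. The "Restarting query" log text is an assumption about how these restarts are logged, and the vhost field is assumed to identify the node; inspect a matching event to find the field that records the restart reason before grouping on it.

logscale
// Sketch: count query-restart log lines per node over time.
// "Restarting query" and the vhost field are assumptions; confirm the
// actual log text and the field holding the restart reason in your events.
#type=humio #kind=logs "Restarting query"
| timeChart(vhost, function=count())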

Starved Searches

A starved search in LogScale is a query that cannot proceed to completion because it is restricted by the resources available for scanning segment files, or because segment files are still pending fetch. This timechart shows the number of starved-search log entries per LogScale node.

Each log entry with the starved-search text includes a queryID. You can then search for that queryID in the LogScale repository to find out which queries are affected.
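
A minimal sketch for doing that, assuming the log text contains the word "starved" and that the events carry queryID and vhost fields (both assumptions; confirm against a real event):

logscale
// Sketch: find which queries and nodes hit starved-search conditions most.
// The "starved" text and the queryID/vhost field names are assumptions.
#type=humio #kind=logs "starved"
| groupBy([vhost, queryID])
| sort(_count, order=desc)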

Query Total Cost

This utilises a LogScale metric called query-delta-total-cost. This metric is logged per LogScale host every 30 seconds and records the delta of the total cost of queries for the entire cluster.

Cost points are the unit LogScale uses to schedule, limit, and monitor queries. A cost point combines the memory and CPU consumption of a query, and can be used as a measurement of how expensive a query is overall.
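
As a sketch, you can chart this metric per host from the humio repository. The #kind=metrics tag and the name field match how internal metrics are normally logged, but the field holding the reading (m1 below) is an assumption; check a sample event to confirm it. Swapping in query-delta-total-memory-allocation or the static/live CPU-usage metrics described later gives the corresponding charts.

logscale
// Sketch: chart the query cost delta per LogScale host.
// The m1 value field is an assumption; inspect a metrics event to confirm
// which field carries the reading.
#type=humio #kind=metrics name="query-delta-total-cost"
| timeChart(vhost, function=max(m1))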

Query Memory Allocation Cost

This utilises a LogScale metric called query-delta-total-memory-allocation. This metric is logged per LogScale host every 30 seconds and records the delta of the total memory allocated to queries for the entire cluster.

This is important because when you run a query, all of the segment files within the query's timeframe are pulled into the memory of a LogScale node, where they are decompressed and scanned. If your cluster is maxed out on memory for queries, this can slow performance on the cluster or prevent queries from finishing.

Top Cost Queries

This shows the heaviest queries run on a LogScale cluster within the last hour, along with their cost.

Cost points are the unit LogScale uses to schedule, limit, and monitor queries. A cost point combines the memory and CPU consumption of a query, and can be used as a measurement of how expensive a query is overall.

This can help you gauge which queries are heavy and require a lot of work from the cluster. If a query is causing too much work, you may need to kill it to release resources back to your LogScale cluster. This is where Blocking Queries can help.

Top Cost Queries by User

This shows the heaviest query users in a LogScale cluster within the last hour, along with their cost.

This is where you may want to implement Query Quotas, if some users are using too many resources in the cluster with inefficient queries.

Query Historical Cost

Historical queries are essentially any static (non-live) queries. This utilises a LogScale metric called query-static-delta-cpu-usage. This metric is logged per LogScale host every 30 seconds and records the delta of the total cost of these historical or static queries for the entire cluster.

Query Threaddumps with Query IDs

Within a LogScale cluster, LogScale constantly logs what each thread is doing at a particular time into humio-threaddumps.log. Each threaddump contains the name of the group to which the thread belongs, which should logically indicate the type of work being done by LogScale.

In this case, we're looking at the query-mapper thread group, which also logs the queryID. This indicates which queries are taking up the most threads over the last 24 hours, and can tell you whether particular queries are using too many resources on the LogScale cluster.

To investigate any given queryID to find out more information, you can search for the queryID in the humio repository. Try a search like this:

logscale
queryID="$QUERY_ID" "createQuery"
| groupBy([queryID,dataspace,live,query])

Top Queries In Mapper Threads

This widget is very similar to Query Threaddumps with Query IDs, except that it presents the information in a table, along with the query being run.

The queries with the most threaddump logs are using more resources on the cluster than other queries.

Query Live Cost

This timechart looks specifically at the cost of live queries across the cluster. This utilises a LogScale metric called query-live-delta-cpu-usage. This metric is logged per LogScale host every 30 seconds and records the delta of the total cost of live queries for the entire cluster.

Time Spent Reading Segments

This timechart shows the average time, in milliseconds, spent per LogScale host reading (i.e. waiting for) blocks from segment files. This is indicative of query performance in LogScale, since reading blocks from segment files is part of executing a query and producing its results.

Keeping this value below 5 milliseconds is a sign of a healthy, performant cluster.

Read Segment Files Performance

This timechart shows the average number of bytes per second read from compressed blocks in segment files per LogScale host. Queries that look over large timeframes will need to scan more compressed blocks. Heavier queries like that can cause spikes in this graph.

CPU Usage: Thread Group Ticks

This widget shows the number of CPU ticks used by each LogScale thread group. Within a LogScale cluster, the number of threads is usually dictated by the number of cores on each node, and those threads are assigned to particular thread groups. The names of the groups should logically indicate the type of work being done by LogScale. This widget can indicate how much CPU time is being spent on each of these thread groups.

A common thread group which will consume resources is the humio-pekko thread group (formerly humio-akka). This is the group responsible for handling network requests; as a result, it may account for more time, since it spends much of its time waiting for responses.

Another common thread group is the digester. This is the thread that handles digesting all new data coming into LogScale.

For this dashboard, the runningqueries thread group in particular is interesting to compare against the other thread groups.
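
A rough way to approximate this comparison yourself is to count threaddump log lines per thread group in the humio repository. This is only a sketch: the "humio-threaddumps" text filter and the threadGroup field name are assumptions (the real field naming the group may differ); inspect a threaddump event to confirm them.

logscale
// Sketch: count threaddump entries per thread group.
// The "humio-threaddumps" filter and the threadGroup field are assumptions;
// check a real threaddump event for the field that names the group.
#type=humio "humio-threaddumps"
| groupBy([threadGroup])
| sort(_count, order=desc)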

Slow Warnings to Users

When running a query in LogScale, if one or more of the LogScale nodes is slow or not responding, you'll receive a warning in the User Interface letting you know that the query is slow. This widget lists how many times that warning was shown to users for each LogScale host.

To fix this, you will need to investigate which nodes are causing this warning. You can do this by running the following query:

logscale
#type=humio #kind=logs loglevel=WARN class="c.h.q.QuerySessions$" "user got a queryresult containing a warning" (warning="*slow*" or warning="*respon*") /server node \'(?<node>\S+)\'/
| groupBy([node])

This will return the nodes that are currently slow or not responding. It may be that there is Ingest Latency, or that a heavy query has consumed LogScale's resources.

A healthy system should show no slow warnings to users.

Live Queries per Host

Live queries in LogScale are analysed at ingest, as events come in, before the data is processed and stored as segment files. This widget shows how many live queries are running on each LogScale node as ingest comes in. Although live queries aren't very heavy work for LogScale nodes, this can be useful for seeing whether one LogScale node is doing more live-query work than others.

HTTP Internal Query Requests

Internal HTTP requests are initiated by LogScale nodes. This widget shows internal HTTP requests directly hitting the query endpoint. An example of an internal query to the query jobs endpoint would be proxying a query to the LogScale host that is in charge of that particular query.

This widget shows the number of these internal requests per second for each LogScale node.

HTTP External Query Requests

External HTTP requests are typically initiated by users of LogScale. This widget shows external HTTP requests to the query endpoint, which usually come from a dashboard widget, an alert query, or a user running an ad-hoc query.

This widget then shows the number of these external HTTP requests per second for each LogScale node.

HTTP Query Submits Per Repository

The widget shows the number of queries submitted per minute per repository on a LogScale cluster.