Insights Search Dashboard
This dashboard provides further insight into queries and searches made in a LogScale cluster.
Search Queue
This is the number of segments queued for search by queries, per vHost (i.e. LogScale Node ID). When a query is run by a user, a dashboard, or an alert, LogScale needs resources to pull the segment files in question, scan them, and return the results to the query. If those resources aren't available, the queries are put into a queue.
Ideally, this value is kept at 0 for every LogScale node. This means that no LogScale node has to wait to scan segments as soon as it receives the query. Spikes can be expected, especially during times when more queries are received than usual. A persistent queue, however, could indicate built-up load on the nodes, which will mean slow queries.
CPU Usage in Percent
This shows the CPU usage of each LogScale node. Within a cluster, if each LogScale node has the same specifications and digest partitions are evenly distributed, you would expect each LogScale node to have about the same CPU usage.
If some nodes are experiencing particularly high usage, this could indicate that something is wrong with LogScale or the cluster setup.
Query Restarts By Reason
There are occasions in LogScale when a query will need to be restarted. It could be an alert or a dashboard query. This timechart shows the reasons why a query might have been restarted over time.
There are a few common reasons:
Ingest Partition changes in the LogScale cluster;
Lookup File changes that are used in the query;
Permission changes on the query;
View Connection changes.
A view connection change will update the repository connections in a view where the query might be running.
The common reasons listed above are to be expected and can generally be ignored. The reasons listed below, though, are worth investigating:
Poll Error because of a dead host. This means a LogScale node is down, and the cause should be investigated.
Statuscode=404. If the query is an alert, it is worth checking the query to find out why it is causing a 404 error.
Starved Searches
A starved search in LogScale occurs when a query cannot finish because it is restricted by the resources available for scanning segment files, or because segment files are still pending to be fetched. This timechart shows the number of times the starved searches log has been recorded per LogScale node.
Each log with the starved searches text includes a queryID. You can then search for that queryID in the LogScale repository to find out which queries are having this issue.
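As a starting point, a search along these lines in the humio repository can surface the affected query IDs; the free-text term and the queryID field name here are assumptions and may differ between LogScale versions:
#type=humio #kind=logs "starved"
| groupBy([queryID])
| sort(_count, order=desc)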
Query Total Cost
This utilises a LogScale metric called query-delta-total-cost. This metric is logged per LogScale host every 30 seconds and records the delta of the total cost of queries for the entire cluster.
The Cost Points on the query are the unit LogScale uses to schedule, limit, and monitor queries. A cost point is a combination of both the memory and CPU consumption that a query has, and can be used as a measurement of how expensive a query is overall.
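As a sketch of how to chart this metric yourself, a query like the following against the humio repository breaks it out per node; the #kind=metrics and #vhost tags and the m1 (one-minute rate) field are assumptions that may differ between LogScale versions:
#type=humio #kind=metrics name="query-delta-total-cost"
| timeChart(#vhost, function=avg(m1))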
Query Memory Allocation Cost
This utilises a LogScale metric called query-delta-total-memory-allocation. This metric is logged per LogScale host every 30 seconds and records the delta of the total memory allocated to queries for the entire cluster.
This is important because, when you run a query, all of the segment files within that query's timeframe are pulled into the memory of a LogScale node, where they are decompressed and scanned. This means that if your cluster is maxed out on memory for queries, it could slow performance on the cluster or prevent queries from finishing.
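If you want to see which nodes are allocating the most query memory over the search window, a sketch like this could work, under the same assumptions about the metrics log format (the #kind=metrics and #vhost tags and the m1 field):
#type=humio #kind=metrics name="query-delta-total-memory-allocation"
| groupBy(#vhost, function=sum(m1))
| sort(_sum, order=desc)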
Top Cost Queries
This shows the heaviest queries run on a LogScale cluster within the last hour, along with their cost.
The Cost Points of the query are the unit LogScale uses to schedule, limit, and monitor queries. A cost point is a combination of both the memory and CPU consumption that a query has, and can be used as a measurement of how expensive a query is overall.
This can help you gauge which queries are heavy and require a lot of work from the cluster. You may have a query that is causing too much work; if so, you need to kill it to release resources back to your LogScale cluster. This is where you can use Blocking Queries.
Top Cost Queries by User
This shows the heaviest query users in a LogScale cluster within the last hour, along with their cost.
This is where you may want to implement Query Quotas, if some users are using too many resources in the cluster with inefficient queries.
Query Historical Cost
Historical queries are essentially any static or non-live query. This utilises a LogScale metric called query-static-delta-cpu-usage. This metric is logged per LogScale host every 30 seconds and records the delta of the total cost of these historical, or static, queries for the entire cluster.
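To put the historical cost next to the other query CPU-usage deltas, one option is to chart the related metrics as separate series; the metric-name pattern, the #kind=metrics tag, and the m1 field below are assumptions:
#type=humio #kind=metrics name=/query-.*delta-cpu-usage/
| timeChart(name, function=avg(m1))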
Query Threaddumps with Query IDs
Within a LogScale cluster, LogScale constantly logs what each thread is doing at a particular time into humio-threaddumps.log. Each threaddump contains the name of the group to which it belongs. This should logically indicate the type of work being done by LogScale.
In this case, we're looking at the query-mapper thread group, which also logs the queryID. This indicates which queries are taking up the most threads over the last 24 hours, which can tell you if particular queries are using too many resources on the LogScale cluster.
To investigate any given queryID to find out more information, you can search for the queryID in the humio repository. Try a search like this:
queryID="$QUERY_ID" "createQuery"
| groupBy([queryID,dataspace,live,query])
Top Queries In Mapper Threads
This widget is very similar to Query Threaddumps with Query IDs, except that it presents the information in a table, along with the query being run.
The queries with the most threaddump logs are using more resources on the cluster than other queries.
Query Live Cost
This timechart looks specifically at the cost of live queries across
the cluster. This utilises a LogScale Metrics called
query-static-delta-cpu-usage
.
There is a log of this metric per LogScale host every 30 seconds. It
logs the delta of the total cost on these historic/static queries for
the entire cluster.
Time Spent Reading Segments
This timechart shows the average time spent per LogScale host reading (i.e. waiting for) blocks from segment files, in milliseconds. This is indicative of query performance in LogScale, since reading the blocks from the segment files is part of executing the query and producing the query results.
Keeping this value below 5 milliseconds is a sign of a healthy cluster and good performance.
Read Segment Files Performance
This timechart shows the average number of bytes per second read from compressed blocks in segment files per LogScale host. Queries that look over large timeframes will need to scan more compressed blocks. Heavier queries like that can cause spikes in this graph.
CPU Usage: Thread Group Ticks
This widget shows the number of CPU ticks used by each LogScale thread group. Within a LogScale cluster, the number of threads is usually dictated by the number of cores on each node. These threads are then assigned to particular thread groups. The names of the groups should logically indicate the type of work being done by LogScale. This widget can indicate the amount of CPU time being spent on each of these thread groups.
A common thread group which will consume resources is the humio-pekko thread group (formerly humio-akka). This is the group responsible for handling network requests. As a result, it may take more time, since it spends a lot of time waiting for responses.
Another common thread group is the digester. This is the thread group that handles digesting all new data coming into LogScale.
The runningqueries thread group in particular will be interesting to compare to the other thread groups for this dashboard.
Slow Warnings to Users
When running a query in LogScale, if one or more of the LogScale nodes is slow or not responding, you'll receive a warning in the User Interface letting you know that the query is slow. This widget lists how many times that warning was shown to users for each LogScale host.
To fix this, you will need to investigate which nodes are triggering this warning. You can do this by running this query:
#type=humio #kind=logs loglevel=WARN class="c.h.q.QuerySessions$" "user got a queryresult containing a warning" (warning="*slow*" or warning="*respon*") /server node \'(?<node>\S+)\'/
| groupBy([node])
This will return the nodes currently not responding or that are slow. It may be that there is Ingest Latency or a heavy query has consumed LogScale's resources.
A healthy system should show no slow warnings to users.
Live Queries per Host
Live queries in LogScale are analysed at ingest, as events come in, before the data is processed and stored as segment files. This widget shows how many live queries are running on each LogScale node responsible for running them as ingest comes in. Although live queries aren't very heavy work for LogScale nodes, this can be useful to see if one LogScale node is doing more live query work than the others.
HTTP Internal Query Requests
Internal HTTP requests are initiated by LogScale nodes. This widget shows internal HTTP requests directly hitting the query endpoint. An example of an internal query to the query jobs endpoint would be proxying a query to the LogScale host that is supposed to be in charge of a particular query.
This widget shows the number of these internal requests per second for each LogScale node.
HTTP External Query Requests
External HTTP requests are typically initiated by users of LogScale. This widget shows the external HTTP requests to the query endpoint, which usually come from a dashboard widget, an alert query, or a user running an ad-hoc query.
This widget then shows the number of these external HTTP requests per second for each LogScale node.
HTTP Query Submits Per Repository
This widget shows the number of queries submitted per minute per repository on a LogScale cluster.