FAQ: Understanding the Query State Size
The query state size, also known as the state size or query state, quantifies the amount of memory used by a query during execution.
Queries mainly contain three types of operations: filters, mutators, and aggregators. While filters and mutators work on a single event at a time, aggregators collect the results of several events and contribute the most to the overall memory consumption. For an aggregation, the state size includes the events collected at that part of the query chain.
The size of the query state depends on the number of events and the type of operation:
With groupBy(), LogScale uses more memory because the function collects all possible values of a field or set of fields. The overall query state size depends on the function, the algorithm used, and the number of events in each group within the query.

With top(), for example to find the most accessed URLs in webserver logs, performing the calculation requires keeping all the distinct URLs in the search state and counting the number of occurrences of each. The more unique URLs, the larger the state.
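As a minimal sketch of the top() case, assuming webserver events with a url field (the field name is illustrative):

// count occurrences of each distinct URL and return the ten most frequent
top(url, limit=10)

// an equivalent formulation using groupBy(), building the same kind of state
groupBy(url, function=count())
| sort(_count, order=desc, limit=10)

While either query runs, every distinct url value and its running count is part of the query state, which is why the state grows with the number of unique URLs rather than the number of events.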
LogScale uses compression and other techniques to keep this value to a minimum, but it is difficult to predict the state size in advance. Approximation algorithms are used to provide numbers and counts when computing the exact value would be too computationally expensive.
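Distinct counting is one place where this typically surfaces (the field name below is illustrative):

// estimate the number of distinct URLs instead of tracking each one exactly
count(url, distinct=true)

For high-cardinality fields, an estimate like this trades a small error margin for a bounded state size.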
The effect of the query state size is that, for some queries and event collections, the amount of memory required can be considerable. This is one of the reasons why the number of events returned by a query is limited to 200 by default; the limit helps to reduce the overall state size.
When a search hits the limits on state size, LogScale warns the user; for example, a groupBy() on a high-cardinality field resulting in millions of groups.
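A hypothetical illustration (session_id is an assumed field name): a query such as

groupBy(session_id)

creates one group per distinct value, which for a high-cardinality field can mean millions of groups. Bounding it, for example with

groupBy(session_id, limit=20000)

caps the number of groups kept in the query state via the limit parameter.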
The state size is also related to the query cost, which is calculated by combining the memory used by the query and the CPU time required to produce its result.
Dealing With High Cardinality Data Sets
A query can be cancelled for the benefit of the cluster as a whole because some aggregators in the query create too many rows. This can happen, for instance, when groupBy() or stats() is given multiple sub-aggregators, each with high-cardinality output. High-cardinality fields are fields with many unique values.
To avoid cardinality issues with query aggregators, try reducing the number of sub-aggregators, or reduce their cardinality (for example, by lowering the limit parameter); otherwise, the aggregation can lead to performance issues and inaccurate results.
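A minimal sketch of reducing sub-aggregator cardinality (customer_id and url are assumed field names):

// higher state: an unbounded value collection kept per group
groupBy(customer_id, function=[count(), collect([url])])

// lower state: a single aggregate per group, with the group count bounded
groupBy(customer_id, function=count(), limit=1000)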
Consider the following:
Filter data before aggregating to reduce the data set size (see the sketch after this list).
Reduce the number of rows returned by aggregators. This can be done by reducing the number of sub-aggregators or reducing their cardinality.
Use time-based bucketing for time series data.
Limit the time range of the queries.
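As a combined sketch of these recommendations (status is an assumed field name), the query below filters first and then buckets by time rather than grouping on a raw high-cardinality field:

// keep only error events, then count them per one-hour bucket
status >= 500
| bucket(span=1h, function=count())

Run over a limited time range, this keeps one state entry per bucket instead of one per distinct field value.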