FAQ: Understanding the Query State Size
The query state size, also known as the state size or query state, quantifies the amount of memory used by a query during execution.
Queries mainly contain three types of operations: filters, mutators, and aggregators. While filters and mutators work on a single event at a time, aggregators collect the results of several events and contribute the most to the overall memory consumption. For an aggregation, the state size includes the events collected at that part of the query chain.
The size of the query state depends on the number of events and the type of operation:
With groupBy(), LogScale uses more memory because the function collects all possible values of a field or set of fields. The overall query state size depends on the function, the algorithm used, and the number of events in each group within the query.

With top(), for example to find the most accessed URLs in webserver logs, performing the calculation requires keeping all the distinct URLs in the search state and counting the number of occurrences of each. The more unique URLs, the larger the state.
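As a minimal sketch of the top() case, assuming webserver events with a url field (the field name is illustrative):

// count occurrences of each distinct URL and return the ten most frequent
top(url, limit=10)

// an equivalent formulation using groupBy(), building the same kind of state
groupBy(url, function=count())
| sort(_count, order=desc, limit=10)

While either query runs, every distinct url value and its running count is part of the query state, which is why the state grows with the number of unique URLs rather than the number of events.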
LogScale uses compression and other techniques to keep this value to a minimum, but it is difficult to predict the state size in advance. Approximation algorithms are used to provide numbers and counts when computing the exact value would be too computationally expensive.
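Distinct counting is one place where this typically surfaces (the field name below is illustrative):

// estimate the number of distinct URLs instead of tracking each one exactly
count(url, distinct=true)

For high-cardinality fields, an estimate like this trades a small error margin for a bounded state size.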
The effect of the query state size is that, for some queries and event collections, the amount of memory required can be considerable. This is one of the reasons why the number of events returned by a query is limited to 200 by default; the limit helps to reduce the overall state size.
When a search hits the limits on state size, LogScale warns the user; for example, a groupBy() on a high-cardinality field resulting in millions of groups.
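A hypothetical illustration (session_id is an assumed field name): a query such as

groupBy(session_id)

creates one group per distinct value, which for a high-cardinality field can mean millions of groups. Bounding it, for example with

groupBy(session_id, limit=20000)

caps the number of groups kept in the query state via the limit parameter.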
The state size is also related to the query cost, which is calculated by combining the memory used by the query and the CPU time required to produce its result.
Dealing With High Cardinality Data Sets
A query can be cancelled for the benefit of the cluster as a whole because some aggregators in the query create too many rows. This can happen, for instance, when groupBy() or stats() is given multiple sub-aggregators, each with high-cardinality output. High-cardinality fields are fields with many unique values.
To avoid cardinality issues with query aggregators, try reducing the number of sub-aggregators, or reduce their cardinality (for example, by lowering the limit parameter); otherwise, the aggregation can lead to performance issues and inaccurate results.
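A minimal sketch of reducing sub-aggregator cardinality (customer_id and url are assumed field names):

// higher state: an unbounded value collection kept per group
groupBy(customer_id, function=[count(), collect([url])])

// lower state: a single aggregate per group, with the group count bounded
groupBy(customer_id, function=count(), limit=1000)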
Consider the following:
Filter data before aggregating to reduce the data set size (see the sketch after this list).
Reduce the number of rows returned by aggregators. This can be done by reducing the number of sub-aggregators or reducing their cardinality.
Use time-based bucketing for time series data.
Limit the time range of the queries.
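As a combined sketch of these recommendations (status is an assumed field name), the query below filters first and then buckets by time rather than grouping on a raw high-cardinality field:

// keep only error events, then count them per one-hour bucket
status >= 500
| bucket(span=1h, function=count())

Run over a limited time range, this keeps one state entry per bucket instead of one per distinct field value.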