The groupBy() query function is used to group together events by one or more specified fields. This is similar to the GROUP BY method in SQL databases.

ParameterTypeRequiredDefault ValueDescription
field[a]Array of stringsrequired  This specifies which field or fields to group.
functionArray of Aggregate Functionsoptional[b]count(as=_count) This specifies which aggregate functions to use with each group.
limitstringoptional[b]20,000 This sets the limit for the number of group elements. Default value controlled by GroupDefaultLimit dynamic configuration, max value controlled by GroupMaxLimit.
  Maximum1,000,000 

[a] The argument name field can be omitted.

[b] Optional parameters use their default value unless explicitly set.

Hide omitted argument names for this function

Show omitted argument names for this function

The groupBy() can be used to execute aggregate functions on each group. The fields generated are grouped by the field or fields in the field parameter for each aggregate function. For example, by default the count() function is used and the grouped count returned in the _count field by default.

The default is to use count() as an aggregate function. However, if the goal is to only find the unique values of a field or combinations of multiple fields, then groupBy() can be provided with an empty list as its aggregate function.

When showing time series data, the timeChart() and bucket() functions are an extension of groupBy() that groups by time.

Warning

groupBy may return incomplete results limited to a restricted subset of groups and related events, in cases where the limit parameter is exceeded.

The groupBy() function is limited in the number of groups it will handle. Such limit is set in the LogScale configuration using GraphQL mutations bound by the GroupMaxLimit dynamic configuration.

Starting from version 1.127, the maximum value of the limit parameter is 1,000,000 by default (adjustable by the corresponding GroupMaxLimit dynamic configuration), which means it is possible to create up to 1 Million groups.

When the limit parameter is not supplied explicitly, its default value is determined by the GroupDefaultLimit dynamic configuration, which has a default value of 20,000.

GroupDefaultLimit cannot be higher than GroupMaxLimit.

Attention

The limit parameter is governed by the dynamic configurations GroupMaxLimit and GroupDefaultLimit to the same value for a seamless upgrade.

Additionally, like all LogScale functions, groupBy() has an internal memory limit determined by the dynamic configuration QueryCoordinatorMemoryLimit. This means that it is possible for the function to collect less than the specified limit number of groups, if the total amount of data collected by the function exceeds this limit. The internal memory limit is in place to protect the cluster health, and ensure sufficient memory for other queries to run. Should the memory limit be reached, no more groups are added. This may lead to results that are inconsistent with the results shown prior to the limit being reached. The groups that were present in prior results may be pushed out of the result set during merges between nodes, to ensure that the memory usage is kept within bounds. If a group is shown, the results for that group will be correct. In situations where you reach the memory limit, you may want to use the top() function to narrow the list of groups.

One of the reasons for this limitation is because groupBy() is implemented to work entirely in memory, and cannot spill to disk. The limit helps to prevent searches from consuming too much memory. This method and limitation may be changed in future versions of LogScale.

groupBy() Examples

Since the most common use-case is to count the distinct values of a field, the default behavior is to use count() to aggregate the events. The simplest use-case of groupBy() is, therefore, to count the occurence of the distinct values of a field:

logscale
groupBy(status_code)

Given input:

status_code ip
440 1.111.111.111
500 1.111.111.111
440 1.111.111.111
440 2.222.222.222

The output is:

status_code _count
440 3
500 1

The task is to only find the unique values, groupBy() can be provided with an empty list as the aggregate function, implying that nothing will be aggregated:

logscale
groupBy(status_code, function=[])

The output is:

status_code
440
500

Similarly, groupBy() can be used to find the unique combinations of field values present in the data:

logscale
groupBy([status_code, ip], function=[])

The output is:

status_code ip
440 1.111.111.111
500 1.111.111.111
440 2.222.222.222

A usage of the function such as the one above is also very useful for query-based parameters in dashboards, see Query Parameter for more explanations.

As an example of how to use this query function, suppose you have a LogScale repository that is ingesting data from one or more web servers. Now suppose you want to get a list of HTTP status codes, the results of users accessing the server. For instance, HTTP code 404 for Not Found — for web pages not found. You could get a list of all of the HTTP codes returned and a count of each by executing a query that counts the different http status codes, like this:

logscale
groupBy(field=status_code, function=count())
| sort(statuscode, order=asc)

In addition to the groupBy(), the results here are piped to the sort() function to sort the results from the lowest HTTP code to the highest.

statuscode_count
2007485
3012314
30465
400101
4046425
4052
4089

With the groupBy() function, you can have LogScale group by more than one field. You would just give the fields in an array, a comma-separated list within square-brackets. It would look something like this query in which events are grouped on HTTP method and status code:

logscale
groupBy(field=[method, statuscode], function=count())
| sort([method, statuscode], order=asc)

Will produce these results:

methodstatuscode
CONNECT301
GET200
GET301
GET304
GET400
GET404
HEAD200
HEAD301
HEAD400
HEAD404
OPTIONS200
OPTIONS301
POST200
POST301
POST400
POST404

Although this query works, you may also want to show the total for each method, not just the total for each status code of each method. To do that, you'll have to adjust the query to look like this:

logscale
[groupby(method, function=[count(as=method_total), groupby(statuscode, function=count(as=method_status_count))])]

In this query, a groupBy() is nested within another groupBy(). The results will look like this:

methodmethod_totalstatuscodemethod_status_count
GET140782007326
GET140783011246
GET1407830465
GET1407840070
GET140784045371
HEAD13920014
HEAD13930122
HEAD1394006
HEAD13940497
POST200520014
POST20053011002
POST200540025
POST2055404964

These results still might not be as tidy as you like. You might do better with separate queries for each method or each status code — assuming you're interested in only a few specific ones. Another possibility might be to use an API to be able to assemble the results in a manner that you prefer.

Instead of a single function, additional query functions can be also applied, allowing for calculations on the count that is obtained from groupBy(). The following query contains an embedded expression within groupBy(), as a demonstration of formatting/calculating the value within a groupBy() expression:

logscale
groupBy(host, function=[{count()
| esp:=_count/300}])

The query generates a count() of the number of hosts in the aggregation, which creates a field _count which is then divided by 300 and placed into the field esp.