The groupBy() query function is used to group together events by one or more specified fields. This is similar to the GROUP BY clause in SQL databases.
Parameter | Type | Required | Default Value | Description |
---|---|---|---|---|
field [a] | array of strings | required | | This specifies which field or fields to group. |
function | array of aggregate functions | optional[b] | count(as=_count) | This specifies which aggregate functions to use with each group. If several aggregators are listed for the function parameter, then their outputs are combined using the rules described for stats(). |
limit | string | optional[b] | 20,000 | This sets the limit for the number of group elements; the maximum is 1,000,000. The default value is controlled by the GroupDefaultLimit dynamic configuration, and the maximum value by GroupMaxLimit. |
[b] Optional parameters use their default value unless explicitly set. |
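As an illustration of the parameters used together, the following sketch groups by a hypothetical host field, applies two aggregate functions, and raises the group limit (the host and responsetime field names are assumptions, not from this document):

groupBy(field=host, function=[count(), avg(field=responsetime)], limit=1000)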
The groupBy() function can be used to execute aggregate functions on each group. The results are grouped by the field or fields given in the field parameter, and each aggregate function is applied to each group. For example, by default the count() function is used, and the count for each group is returned in the _count field.
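Several aggregate functions can be listed in the function parameter, and their outputs are combined per group. For example, the following sketch assumes a hypothetical bytes field:

groupBy(status_code, function=[count(), min(field=bytes), max(field=bytes)])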
The default is to use count() as the aggregate function. However, if the goal is only to find the unique values of a field, or unique combinations of multiple fields, then groupBy() can be provided with an empty list as its aggregate function.
When showing time series data, the timeChart() and bucket() functions are extensions of groupBy() that group by time.
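For example, the following sketch groups a hypothetical host field into one-hour time buckets, counting events per host in each bucket:

timeChart(host, function=count(), span=1h)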
Warning

groupBy() may return incomplete results, restricted to a subset of the groups and their related events, in cases where the limit parameter is exceeded.
The groupBy() function is limited in the number of groups it will handle. This limit is set in the LogScale configuration using GraphQL mutations and is bounded by the GroupMaxLimit dynamic configuration.

Starting from version 1.127, the maximum value of the limit parameter is 1,000,000 by default (adjustable via the GroupMaxLimit dynamic configuration), which means it is possible to create up to 1,000,000 groups.
When the limit parameter is not supplied explicitly, its default value is determined by the GroupDefaultLimit dynamic configuration, which has a default value of 20,000. GroupDefaultLimit cannot be higher than GroupMaxLimit.
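When the default is too low for a query, the limit can be set explicitly (the value here is only illustrative, and must not exceed GroupMaxLimit):

groupBy(status_code, function=count(), limit=50000)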
Attention

The limit parameter is governed by the dynamic configurations GroupMaxLimit and GroupDefaultLimit, which are set to the same value for a seamless upgrade.
Additionally, like all LogScale functions, groupBy() has an internal memory limit determined by the dynamic configuration QueryCoordinatorMemoryLimit. This means that the function may collect fewer groups than the specified limit if the total amount of data collected by the function exceeds this limit. The internal memory limit is in place to protect cluster health and to ensure sufficient memory for other queries to run. Once the memory limit is reached, no more groups are added. This may lead to results that are inconsistent with the results shown prior to the limit being reached: groups that were present in prior results may be pushed out of the result set during merges between nodes, to keep memory usage within bounds. If a group is shown, the results for that group will be correct. If you reach the memory limit, you may want to use the top() function to narrow the list of groups.
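For example, the following sketch uses top() to keep only the most common values of a hypothetical url field, rather than grouping on all of them:

top(url, limit=10)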
One reason for this limitation is that groupBy() is implemented to work entirely in memory and cannot spill to disk. The limit helps prevent searches from consuming too much memory. This method and limitation may change in future versions of LogScale.
groupBy()
Examples
Since the most common use case is to count the distinct values of a field, the default behavior is to use count() to aggregate the events. The simplest use of groupBy() is, therefore, to count the occurrences of the distinct values of a field:
groupBy(status_code)
Given input:
status_code | ip |
---|---|
440 | 1.111.111.111 |
500 | 1.111.111.111 |
440 | 1.111.111.111 |
440 | 2.222.222.222 |
The output is:
status_code | _count |
---|---|
440 | 3 |
500 | 1 |
If the task is to only find the unique values, groupBy() can be provided with an empty list as the aggregate function, implying that nothing will be aggregated:
groupBy(status_code, function=[])
The output is:
status_code |
---|
440 |
500 |
Similarly, groupBy()
can be used to find
the unique combinations of field values present in the data:
groupBy([status_code, ip], function=[])
The output is:
status_code | ip |
---|---|
440 | 1.111.111.111 |
500 | 1.111.111.111 |
440 | 2.222.222.222 |
Usage such as the above is also very useful for query-based parameters in dashboards; see Query Parameter for more explanation.
As an example of how to use this query function, suppose you have a LogScale repository that is ingesting data from one or more web servers. Now suppose you want to get a list of the HTTP status codes resulting from users accessing the server (for instance, HTTP code 404 for web pages not found). You could get a list of all of the HTTP codes returned, with a count of each, by executing a query that counts the different HTTP status codes, like this:
groupBy(field=statuscode, function=count())
| sort(statuscode, order=asc)
In addition to groupBy(), the results here are piped to the sort() function to sort the results from the lowest HTTP code to the highest.
statuscode | _count |
---|---|
200 | 7485 |
301 | 2314 |
304 | 65 |
400 | 101 |
404 | 6425 |
405 | 2 |
408 | 9 |
With the groupBy() function, you can have LogScale group by more than one field. You would just give the fields in an array: a comma-separated list within square brackets. It would look something like this query, in which events are grouped on HTTP method and status code:
groupBy(field=[method, statuscode], function=count())
| sort([method, statuscode], order=asc)
This will produce these results:
method | statuscode |
---|---|
CONNECT | 301 |
GET | 200 |
GET | 301 |
GET | 304 |
GET | 400 |
GET | 404 |
HEAD | 200 |
HEAD | 301 |
HEAD | 400 |
HEAD | 404 |
OPTIONS | 200 |
OPTIONS | 301 |
POST | 200 |
POST | 301 |
POST | 400 |
POST | 404 |
Although this query works, you may also want to show the total for each method, not just the total for each status code of each method. To do that, you'll have to adjust the query to look like this:
groupBy(method, function=[count(as=method_total), groupBy(statuscode, function=count(as=method_status_count))])
In this query, a groupBy()
is nested
within another groupBy()
. The results
will look like this:
method | method_total | statuscode | method_status_count |
---|---|---|---|
GET | 14078 | 200 | 7326 |
GET | 14078 | 301 | 1246 |
GET | 14078 | 304 | 65 |
GET | 14078 | 400 | 70 |
GET | 14078 | 404 | 5371 |
HEAD | 139 | 200 | 14 |
HEAD | 139 | 301 | 22 |
HEAD | 139 | 400 | 6 |
HEAD | 139 | 404 | 97 |
POST | 2005 | 200 | 14 |
POST | 2005 | 301 | 1002 |
POST | 2005 | 400 | 25 |
POST | 2005 | 404 | 964 |
These results still might not be as tidy as you like. You might do better with separate queries for each method or each status code — assuming you're interested in only a few specific ones. Another possibility might be to use an API to be able to assemble the results in a manner that you prefer.
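For example, a separate query for a single method can filter before grouping, keeping the result set small:

method=GET
| groupBy(statuscode, function=count())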
Instead of a single function, additional query functions can also be applied, allowing for calculations on the count that is obtained from groupBy(). The following query contains an embedded expression within groupBy(), as a demonstration of formatting and calculating a value within a groupBy() expression:

groupBy(host, function=[{count() | esp:=_count/300}])
The query generates a count() of the number of events in each group (one group per host), which creates a field _count that is then divided by 300 and placed into the field esp.
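Building on that query, the computed field can be used like any other field, for example to sort hosts by their computed rate (a sketch based on the query above):

groupBy(host, function=[{count() | esp:=_count/300}])
| sort(esp, order=desc)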