The groupBy()
query function is used to group
together events by one or more specified fields. This is similar to the
GROUP BY
method in SQL databases.
Parameter | Type | Required | Default Value | Description |
---|---|---|---|---|
field [a] | Array of strings | required | This specifies which field or fields to group. | |
function | Array of Aggregate Functions | optional[b] | count(as=_count) | This specifies which aggregate functions to use with each group. |
limit | string | optional[b] | 20,000 | This sets the limit for the number of group elements. Default value controlled by GroupDefaultLimit dynamic configuration, max value controlled by GroupMaxLimit . |
Maximum | 1,000,000 | |||
[b] Optional parameters use their default value unless explicitly set. |
Hide omitted argument names for this function
Omitted Argument NamesThe argument name for
field
can be omitted; the following forms of this function are equivalent:logscalegroupBy("field")
and:
logscalegroupBy(field="field")
These examples show basic structure only.
The groupBy()
can be used to execute aggregate
functions on each group. The fields generated are grouped by the field or
fields in the field
parameter
for each aggregate function. For example, by default the
count()
function is used and the grouped count
returned in the _count field by default.
The default is to use count()
as an aggregate
function. However, if the goal is to only find the unique values of a
field or combinations of multiple fields, then
groupBy()
can be provided with an empty list as its
aggregate function.
When showing time series data, the timeChart()
and
bucket()
functions are an extension of
groupBy()
that groups by time.
Warning
groupBy
may return incomplete results limited to a
restricted subset of groups and related events, in cases where the
limit
parameter is
exceeded.
The groupBy()
function is limited in the number of
groups it will handle. Such limit is set in the LogScale
configuration using GraphQL mutations bound by the
GroupMaxLimit
dynamic
configuration.
Starting from version 1.127, the maximum value of the
limit
parameter is 1,000,000
by default (adjustable by the corresponding GroupMaxLimit
dynamic configuration), which means it is possible to create up to 1
Million groups.
When the limit
parameter is
not supplied explicitly, its default value is determined by the
GroupDefaultLimit
dynamic configuration, which has a
default value of 20,000.
GroupDefaultLimit
cannot be higher than
GroupMaxLimit
.
Attention
The limit
parameter is governed by the dynamic
configurations GroupMaxLimit
and
GroupDefaultLimit
to the same value for a seamless
upgrade.
Additionally, like all LogScale functions,
groupBy()
has an internal memory limit determined by
the dynamic configuration QueryCoordinatorMemoryLimit
. This
means that it is possible for the function to collect less than the
specified limit number of groups, if the total amount of data collected by
the function exceeds this limit. The internal memory limit is in place to
protect the cluster health, and ensure sufficient memory for other queries
to run. Should the memory limit be reached, no more groups are added. This
may lead to results that are inconsistent with the results shown prior to
the limit being reached. The groups that were present in prior results may
be pushed out of the result set during merges between nodes, to ensure
that the memory usage is kept within bounds. If a group is shown, the
results for that group will be correct. In situations where you reach the
memory limit, you may want to use the top()
function
to narrow the list of groups.
One of the reasons for this limitation is because
groupBy()
is implemented to work entirely in memory,
and cannot spill to disk. The limit helps to prevent searches from
consuming too much memory. This method and limitation may be changed in
future versions of LogScale.
groupBy()
Examples
Since the most common use-case is to count the distinct values of a
field, the default behavior is to use count()
to
aggregate the events. The simplest use-case of
groupBy()
is, therefore, to count the occurence of
the distinct values of a field:
groupBy(status_code)
Given input:
status_code | ip |
---|---|
440 | 1.111.111.111 |
500 | 1.111.111.111 |
440 | 1.111.111.111 |
440 | 2.222.222.222 |
The output is:
status_code | _count |
---|---|
440 | 3 |
500 | 1 |
The task is to only find the unique values,
groupBy()
can be provided with an empty list as the
aggregate function, implying that nothing will be aggregated:
groupBy(status_code, function=[])
The output is:
status_code |
---|
440 |
500 |
Similarly, groupBy()
can be used to find the unique
combinations of field values present in the data:
groupBy([status_code, ip], function=[])
The output is:
status_code | ip |
---|---|
440 | 1.111.111.111 |
500 | 1.111.111.111 |
440 | 2.222.222.222 |
A usage of the function such as the one above is also very useful for query-based parameters in dashboards, see Query Parameter for more explanations.
As an example of how to use this query function, suppose you have a
LogScale repository that is ingesting data from one or more web
servers. Now suppose you want to get a list of HTTP status codes, the
results of users accessing the server. For instance, HTTP code 404 for
Not Found
— for web pages not
found. You could get a list of all of the HTTP codes returned and a
count of each by executing a query that counts the different http status
codes, like this:
groupBy(field=status_code, function=count())
| sort(statuscode, order=asc)
In addition to the groupBy()
, the results here are
piped to the sort()
function to sort the results
from the lowest HTTP code to the highest.
statuscode | _count |
---|---|
200 | 7485 |
301 | 2314 |
304 | 65 |
400 | 101 |
404 | 6425 |
405 | 2 |
408 | 9 |
With the groupBy()
function, you can have
LogScale group by more than one field. You would just give the
fields in an array, a comma-separated list within square-brackets. It
would look something like this query in which events are grouped on HTTP
method and status code:
groupBy(field=[method, statuscode], function=count())
| sort([method, statuscode], order=asc)
Will produce these results:
method | statuscode |
---|---|
CONNECT | 301 |
GET | 200 |
GET | 301 |
GET | 304 |
GET | 400 |
GET | 404 |
HEAD | 200 |
HEAD | 301 |
HEAD | 400 |
HEAD | 404 |
OPTIONS | 200 |
OPTIONS | 301 |
POST | 200 |
POST | 301 |
POST | 400 |
POST | 404 |
Although this query works, you may also want to show the total for each method, not just the total for each status code of each method. To do that, you'll have to adjust the query to look like this:
[groupby(method, function=[count(as=method_total), groupby(statuscode, function=count(as=method_status_count))])]
In this query, a groupBy()
is nested within another
groupBy()
. The results will look like this:
method | method_total | statuscode | method_status_count |
---|---|---|---|
GET | 14078 | 200 | 7326 |
GET | 14078 | 301 | 1246 |
GET | 14078 | 304 | 65 |
GET | 14078 | 400 | 70 |
GET | 14078 | 404 | 5371 |
HEAD | 139 | 200 | 14 |
HEAD | 139 | 301 | 22 |
HEAD | 139 | 400 | 6 |
HEAD | 139 | 404 | 97 |
POST | 2005 | 200 | 14 |
POST | 2005 | 301 | 1002 |
POST | 2005 | 400 | 25 |
POST | 2055 | 404 | 964 |
These results still might not be as tidy as you like. You might do better with separate queries for each method or each status code — assuming you're interested in only a few specific ones. Another possibility might be to use an API to be able to assemble the results in a manner that you prefer.
Instead of a single function, additional query functions can be also
applied, allowing for calculations on the count that is obtained from
groupBy()
. The following query contains an embedded
expression within groupBy()
, as a demonstration of
formatting/calculating the value within a groupBy()
expression:
groupBy(host, function=[{count()
| esp:=_count/300}])
The query generates a count()
of the number of
hosts in the aggregation, which creates a field
_count which is then divided by 300
and placed into the field esp.