The groupBy() query function is used to group together events by one or more specified fields. This is similar to the GROUP BY method in SQL databases. Further, it can be used to execute aggregate functions on each group. The results are returned in the field parameter for each aggregate function. For example, the _count field if the count() is used.

When showing time series data, the timeChart() and bucket() functions are an extension of groupBy() that groups by time.

ParameterTypeRequiredDefaultDescription
field[string]true This specifies which field or fields to group.
function[Aggregate]falsecount(as=_count)This specifies which aggregate functions to use with each group. The default is count().
limitstringfalseGroupDefaultLimit (i.e., 20,000)This sets the limit for the number of group elements. The value, if specified, can be either a positive integer or "max", in which case the value of GroupMaxLimit (i.e. 200,000) is used.

The groupBy() function is limited in the number of groups it will handle. Such limit is set in the Humio configuration using GraphQL mutations bound by the GroupMaxLimit dynamic configuration.

The maximum value of the limit parameter for this dynamic configuration is 200,000 by default (adjustable by the dynamic configuration), which means it is possible to create up to 200k groups.

When the limit parameter is not supplied explicitly, its default value is determined by the GroupDefaultLimit dynamic configuration, which has a default value of 20,000.

GroupDefaultLimit cannot be larger than MAX_STATE_LIMIT.

Attention

As the limit parameter was previously bounded by MAX_STATE_LIMIT environment variable, if you have made any modifications to it, please set the dynamic configurations GroupMaxLimit and GroupDefaultLimit to the same value for a seamless upgrade.

Additionally, groupBy() has an internal memory limit determined by the dynamic configuration QueryMemoryLimit for non-live queries, and LiveQueryMemoryLimit for live queries. Both of these have a default value of 100MB. If either kind of limit is reached in a search query, additional groups won’t be created. The results returned will then be inconsistent because they were merged together randomly; which groups are kept will be random, as well. For such situations, you may want to use the top() function to narrow the list of groups.

One of the reasons for this limitation is because groupBy() is implemented to work entirely in memory, and cannot spill to disk. The limit helps to prevent searches from consuming too much memory. This method and limitation may be changed in future versions of Humio.

Examples

As an example of how to use this query function, suppose you have a Humio repository that is ingesting data from one or more web servers. Now suppose you want to get a list of HTTP status codes, the results of users accessing the server. For instance, HTTP code 404 for Not Found — for web pages not found. You could get a list of all of the HTTP codes returned and a count of each by executing a query like this:

Count different http status codes

humio
groupBy(field=status_code, function=count())
| sort(statuscode, order=asc)

In addition to the groupBy(), the results here are piped to the sort() function to sort the results from the lowest HTTP code to the highest. This query will look something like in figure here below:

groupBy() Example

Figure 256. groupBy() Example


With the groupBy() function, you can have Humio group by more than one field. You would just give the fields in an array, a comma-separated list within square-brackets. It would look something like this query in which events are grouped on HTTP method and status code:

humio
groupBy(field=[method, statuscode], function=count())
| sort([method, statuscode], order=asc)

The figure below shows you these results.

groupBy() Example

Figure 257. groupBy() Example


Although this query works, you may also want to show the total for each method, not just the total for each status code of each method. To do that, you'll have to adjust the query to look like this:

humio
[groupby(method, function=[count(as=method_total),
    groupby(statuscode, function=count(as=method_status_count))])]

In this query, a groupBy() is nested within another groupBy(). The results will look like in figure here:

groupBy() Example

Figure 258. groupBy() Example


These results still might not be as tidy as you like. You might do better with separate queries for each method or each status code — assuming you're interested in only a few specific ones. Another possibility might be to use an API to be able to assemble the results in a manner that you prefer.