The groupBy() query function is used to group together events by one or more specified fields. This is similar to the GROUP BY clause in SQL databases.

field[a] (array of strings, required)
    Specifies which field or fields to group by.

function (array of aggregate functions, optional[b], default: count(as=_count))
    Specifies which aggregate functions to use with each group. If several aggregators are listed for the function parameter, their outputs are combined using the rules described for stats().

limit (string, optional[b], default: 20,000)
    Sets the limit for the number of group elements. The default value is controlled by the GroupDefaultLimit dynamic configuration, and the maximum value by GroupMaxLimit. If the argument is max (limit=max), the value of GroupMaxLimit is used. The limit prioritizes the top-N series, where the top N are the series with the highest numerical value attributed to them by the subquery across all fields.

    Values:
        max    The value of GroupMaxLimit
    Maximum: 1,000,000
    Controlling variable: GroupMaxLimit (variable default: 1,000,000 group elements)

[a] The parameter name field can be omitted.

[b] Optional parameters use their default value unless explicitly set.
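To illustrate how several aggregators are combined under the stats() rules, here is a minimal sketch; the host and responsetime fields are assumed for illustration:

logscale
groupBy(host, function=[count(), avg(responsetime)])

Each group in the output carries both a _count field and an _avg field.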


groupBy() Function Operation

The groupBy() function has specific implementation and operational considerations, outlined below.

  • When groupBy() reaches its limit, each group may not have processed all relevant events. Consequently, when using count() as the subaggregate function, groupBy() produces a lower-bound estimate.

Series Selection in groupBy()

The selection is based on the aggregate numerical output across all specified functions and all time buckets, not the series identifiers themselves.

The limit prioritizes the top-N series, where the top N are the series with the highest numerical value attributed to them by the subquery across all fields.

Series Selection Process:

  • The selection is based on the numerical values produced by the subquery/function.

  • It is not based on the series names.

  • When multiple functions are used, the function considers all values produced.

For different examples of top N series selection, see Find Top N Value of Series - Example 1 and Find Top N Value of Series - Example 2.
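As a hedged sketch of how limit interacts with series selection (the cpu_load field is assumed for illustration), the following keeps only the ten series with the highest numerical values produced by the aggregators:

logscale
groupBy(host, function=[avg(cpu_load), max(cpu_load)], limit=10)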

Grouping in groupBy()

The groupBy() function can be used to execute aggregate functions on each group. The generated fields are grouped by the field or fields in the field parameter for each aggregate function. For example, by default the count() function is used, and the grouped count is returned in the _count field.

The default is to use count() as an aggregate function. However, if the goal is to only find the unique values of a field or combinations of multiple fields, then groupBy() can be provided with an empty list as its aggregate function.

When showing time series data, the timeChart() and bucket() functions are extensions of groupBy() that group by time.
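As a minimal time-series sketch (assuming a host field), the following uses timeChart() to group events by host within hourly buckets:

logscale
timeChart(host, function=count(), span=1h)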

Limits when using groupBy()

Warning

groupBy() may return incomplete results, limited to a restricted subset of groups and their related events, when the limit parameter is exceeded.

The groupBy() function is limited in the number of groups it will handle. This limit is set in the LogScale configuration using GraphQL mutations and is bounded by the GroupMaxLimit dynamic configuration.

Starting from version 1.127, the maximum value of the limit parameter is 1,000,000 by default (adjustable via the GroupMaxLimit dynamic configuration), which means it is possible to create up to one million groups.

When the limit parameter is not supplied explicitly, its default value is determined by the GroupDefaultLimit dynamic configuration, which has a default value of 20,000.

GroupDefaultLimit cannot be higher than GroupMaxLimit.
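For example, to request as many groups as the configuration allows, grouping here on a hypothetical url field:

logscale
groupBy(url, limit=max)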

Attention

The limit parameter is governed by the GroupMaxLimit and GroupDefaultLimit dynamic configurations; on upgrade, both are set to the same value to ensure a seamless upgrade.

Additionally, like all LogScale functions, groupBy() has an internal memory limit determined by the dynamic configuration QueryCoordinatorMemoryLimit. This means it is possible for the function to collect fewer than the specified limit number of groups if the total amount of data collected by the function exceeds this limit. The internal memory limit is in place to protect cluster health and to ensure sufficient memory for other queries to run.

Should the memory limit be reached, no more groups are added. This may lead to results that are inconsistent with the results shown prior to the limit being reached: groups that were present in prior results may be pushed out of the result set during merges between nodes, to keep memory usage within bounds. If a group is shown, the results for that group will be correct. In situations where you reach the memory limit, you may want to use the top() function to narrow the list of groups.
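For instance, rather than grouping directly on a high-cardinality field, a hedged alternative is to narrow to the most common values first; the host field is assumed for illustration:

logscale
top(host, limit=100)

top() returns only the 100 most common host values, keeping the number of groups well within the memory limit.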

One of the reasons for this limitation is that groupBy() is implemented to work entirely in memory and cannot spill to disk. The limit helps prevent searches from consuming too much memory. This method and limitation may change in future versions of LogScale.

groupBy() Syntax Examples

Since the most common use-case is to count the distinct values of a field, the default behavior is to use count() to aggregate the events. The simplest use-case of groupBy() is, therefore, to count the occurrence of the distinct values of a field:

logscale
groupBy(status_code)

Given input:

status_code ip
440 1.111.111.111
500 1.111.111.111
440 1.111.111.111
440 2.222.222.222

The output is:

status_code _count
440 3
500 1

If the task is to only find the unique values, groupBy() can be provided with an empty list as the aggregate function, implying that nothing is aggregated:

logscale
groupBy(status_code, function=[])

The output is:

status_code
440
500

Similarly, groupBy() can be used to find the unique combinations of field values present in the data:

logscale
groupBy([status_code, ip], function=[])

The output is:

status_code ip
440 1.111.111.111
500 1.111.111.111
440 2.222.222.222

A usage of the function such as the one above is also very useful for query-based parameters in dashboards; see Query Parameter for more explanation.

As an example of how to use this query function, suppose you have a LogScale repository that is ingesting data from one or more web servers. Now suppose you want a list of the HTTP status codes returned when users access the server; for instance, HTTP code 404 (Not Found) for web pages that were not found. You could get a list of all of the HTTP codes returned, and a count of each, by executing a query that counts the different HTTP status codes, like this:

logscale
groupBy(field=statuscode, function=count())
| sort(statuscode, order=asc)

Here the results of groupBy() are piped to the sort() function to sort them from the lowest HTTP code to the highest.

statuscode  _count
200         7485
301         2314
304         65
400         101
404         6425
405         2
408         9

With the groupBy() function, you can have LogScale group by more than one field. You would just give the fields in an array, a comma-separated list within square brackets. It would look something like this query, in which events are grouped on HTTP method and status code:

logscale
groupBy(field=[method, statuscode], function=count())
| sort([method, statuscode], order=asc)

This will produce the following results:

method   statuscode
CONNECT  301
GET      200
GET      301
GET      304
GET      400
GET      404
HEAD     200
HEAD     301
HEAD     400
HEAD     404
OPTIONS  200
OPTIONS  301
POST     200
POST     301
POST     400
POST     404

Although this query works, you may also want to show the total for each method, not just the total for each status code of each method. To do that, you'll have to adjust the query to look like this:

logscale
groupBy(method, function=[count(as=method_total), groupBy(statuscode, function=count(as=method_status_count))])

In this query, a groupBy() is nested within another groupBy(). The results will look like this:

method  method_total  statuscode  method_status_count
GET     14078         200         7326
GET     14078         301         1246
GET     14078         304         65
GET     14078         400         70
GET     14078         404         5371
HEAD    139           200         14
HEAD    139           301         22
HEAD    139           400         6
HEAD    139           404         97
POST    2005          200         14
POST    2005          301         1002
POST    2005          400         25
POST    2005          404         964

These results still might not be as tidy as you would like. You might do better with separate queries for each method or each status code, assuming you are interested in only a few specific ones. Another possibility is to use an API to assemble the results in the format you prefer.
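For example, a separate query restricted to a single method is simply a filter followed by a flat groupBy():

logscale
method=GET
| groupBy(statuscode, function=count())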

Instead of a single function, additional query functions can also be applied, allowing for calculations on the count obtained from groupBy(). The following query contains an embedded expression within groupBy(), as a demonstration of formatting and calculating the value within a groupBy() expression:

logscale
groupBy(host, function=[{count()
| esp:=_count/300}])

The query generates a count() of the events for each host in the aggregation, which creates a field _count; this value is then divided by 300 and placed into the field esp.
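An equivalent, if less efficient, formulation performs the calculation after the aggregation instead of embedding it:

logscale
groupBy(host, function=count())
| esp := _count/300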

groupBy() Examples


Hourly Data Events

Summarize events by providing a count of the number of data events per hour using the time:hour() function

Query
logscale
hr := time:hour(field="@ingesttimestamp")
| groupBy(hr)
Introduction

In this example, the time:hour() function is used with groupBy() to count the number of data events per hour.

Step-by-Step
  1. Starting with the source repository events.

  2. logscale
    hr := time:hour(field="@ingesttimestamp")

    Gets the hour (24-hour clock) of the values in the @ingesttimestamp and returns the results in a new field named hr.

  3. logscale
    | groupBy(hr)

    Groups the returned results by the hr field and provides a count of the number of data events per hour in a _count field.

  4. Event Result set.

Summary and Results

The query is used to count the number of data events per hour. The results can be plotted on a bar chart.
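Note that time:hour() groups events by hour of the day across the whole search interval. To count events per consecutive hour of real time instead, a bucket()-based sketch would be:

logscale
bucket(span=1h, function=count())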

Sort Timestamps

Sort fields based on aggregated field values using the groupBy() function

Query
logscale
timestamp := formatTime(format="%H:%M")
| groupBy([thread],
function=[{sort("timestamp")
| collect("timestamp")}])
Introduction

When using aggregation, you may want to sort on a field that is part of the aggregated set but not the main feature of the aggregated value. For example, sorting the values by their timestamp rather than the embedded value. To achieve this, use a function that sorts the field to be used as the sort field, and then use collect() so that the value from before the aggregation can be displayed in the generated event set. This query can be executed in the humio repository.

Step-by-Step
  1. Starting with the source repository events.

  2. logscale
    timestamp := formatTime(format="%H:%M")

    Creates a new field, timestamp, formatted as HH:MM.

  3. logscale
    | groupBy([thread],

    Groups the events by the name of the thread.

  4. logscale
    function=[{sort("timestamp")
    | collect("timestamp")}])

    Uses sort() combined with collect() as the method of aggregation. As an embedded expression for the function, this sorts the events on the timestamp field and then retrieves that field, which would otherwise be removed as part of the aggregation process.

  5. Event Result set.

Summary and Results

The result set will contain a list of the aggregated thread names sorted by the timestamp:

thread                         timestamp
BootstrapInfoJob               10:09
DataSynchJob                   10:09
Global event loop              10:10
LocalLivequeryMonitor          10:09
LogCollectorManifestUpdate     10:09
TransientChatter event loop    10:10
aggregate-alert-job            10:09
alert-job                      10:09
block-processing-monitor-job   10:09
bloom-scheduler                10:09
bucket-entity-config           10:09
bucket-overcommit-metrics-job  10:09
bucket-storage-download        10:09
bucket-storage-prefetch        10:09
chatter-runningqueries-logger  10:09
chatter-runningqueries-stats   10:09

Deduplicate Content by Field

Deduplicating content based on a specific field using the groupBy() function with tail()

Query
logscale
groupBy(field, function=tail(1))
Introduction

If you want to deduplicate events by a given field, for example to identify a unique list of events for further processing, you can use an aggregate function. In this example, the groupBy() function is used with tail() to keep the last value in a sequence of events.

Example incoming data might look like this:

@timestamp                user   status    ip_address
2025-11-06T10:00:00.000Z  alice  active    192.168.1.100
2025-11-06T10:15:00.000Z  bob    inactive  192.168.1.101
2025-11-06T10:30:00.000Z  alice  inactive  192.168.1.102
2025-11-06T10:45:00.000Z  bob    active    192.168.1.103
2025-11-06T11:00:00.000Z  alice  active    192.168.1.104

Step-by-Step
  1. Starting with the source repository events.

  2. logscale
    groupBy(field, function=tail(1))

    Groups all events by the specified field, and reduces each group using tail() to keep only the last event.

  3. Event Result set.

Summary and Results

The query is used to deduplicate events by a given field. This is useful if you want to create a unique list of events for further processing.
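For the sample data above, the field placeholder resolves to the user field, giving this concrete query:

logscale
groupBy(user, function=tail(1))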

Sample output from the incoming example data where field=user:

@timestamp                user   status  ip_address
2025-11-06T11:00:00.000Z  alice  active  192.168.1.104
2025-11-06T10:45:00.000Z  bob    active  192.168.1.103

Note that only the last event for each unique value in the user field is kept in the results, while earlier events for the same user are removed.

Calculate Events per Second by Host

Determine event rate for each host over a 5-minute period using an embedded expression within the groupBy() function

Query
logscale
groupBy(host, function=[{count() | esp:=_count/300}])
Introduction

In this example, the groupBy() function is used with an embedded expression to calculate the total event count and events per second for each host over a 5-minute period.

Example incoming data might look like this:

@timestamp            host     service   status  response_time
2025-08-06T10:00:00Z  server1  web       ok      120
2025-08-06T10:00:01Z  server2  database  ok      85
2025-08-06T10:00:02Z  server1  web       ok      95
2025-08-06T10:00:03Z  server3  cache     ok      45
2025-08-06T10:00:04Z  server2  database  error   250
2025-08-06T10:00:05Z  server1  web       ok      110
2025-08-06T10:00:06Z  server3  cache     ok      40
2025-08-06T10:00:07Z  server2  database  ok      90
2025-08-06T10:00:08Z  server1  web       error   300
2025-08-06T10:00:09Z  server3  cache     ok      42
Step-by-Step
  1. Starting with the source repository events.

  2. logscale
    groupBy(host, function=[{count() | esp:=_count/300}])

    Groups events by the host field and uses an embedded expression to count the number of events per host and calculate events per second in one operation.

    The count() function returns the count in a field named _count by default. The embedded expression then divides this value by 300 and stores the result in a new field named esp. This calculation provides the average events per second for each host over the time period.

    Using an embedded expression within the groupBy() function performs both the count and the calculation as part of the same aggregation. Both the original _count and the calculated esp field are included in the results.

  3. Event Result set.

Summary and Results

The query is used to analyze event frequency patterns by calculating both the total event count and the average events per second for each host.

Note that the query aggregates against both the original count and the count/300 as one aggregate set. Using an embedded expression is more efficient for larger event sets.

This query is useful, for example, to monitor system load distribution across hosts, identify hosts with unusual event rates, or establish baseline activity patterns for capacity planning.

Sample output from the incoming example data:

host     _count  esp
server1  4       0.013333
server2  3       0.010000
server3  3       0.010000

Note that the _count field shows the total number of events per host, and the esp field shows the calculated events per second (total events divided by 300 seconds).

This data is ideal for visualization using a Time Chart widget to show event rates over time. A Bar Chart widget could compare event rates across hosts, while a Gauge widget could show current event rates against predefined thresholds. Consider creating a dashboard that combines these visualizations with alerts for when event rates exceed normal ranges.

Filter Out Fields With No Value

Filter out fields with no values from search results using groupBy() with sort()

Query
logscale
method=GET
| groupBy(field=[method, statuscode], function=count(as=method_total))
| sort([method, statuscode], order=asc)
| FieldName!=""
Introduction

It is possible to filter out fields with no values in a given returned search result. In this example, all statuscode fields containing no value are filtered out from the final search result.

Example incoming data might look like this:

method  statuscode  method_total
GET     <no value>  10
GET     200         32492
GET     301         1
GET     304         113
GET     403         9
GET     404         132
Step-by-Step
  1. Starting with the source repository events.

  2. logscale
    method=GET

    Filters for all events with methods of the type GET.

  3. logscale
    | groupBy(field=[method, statuscode], function=count(as=method_total))

    Groups the returned results into a method field and a statuscode field, and counts the events into a new field named method_total.

  4. logscale
    | sort([method, statuscode], order=asc)

    Sorts the returned results in ascending order.

  5. logscale
    | FieldName!=""

    Excludes all events where the specified field does not contain a value.

  6. Event Result set.

Summary and Results

The query is used to filter out fields not containing any values from the returned search result.
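With the sample data above, the FieldName placeholder resolves to statuscode, giving this concrete query:

logscale
method=GET
| groupBy(field=[method, statuscode], function=count(as=method_total))
| sort([method, statuscode], order=asc)
| statuscode!=""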

Sample output from the incoming example data:

method  statuscode  method_total
GET     200         32492
GET     301         1
GET     304         113
GET     403         9
GET     404         132

Group Events by Single Field

Basic grouping of events by status_code field using the groupBy() function

Query
logscale
groupBy(status_code)
Introduction

In this example, groupBy() is used to group events by their status codes to analyze the distribution of different response statuses.

Example incoming data might look like this:

@timestamp     status_code  endpoint       response_time
1686837825000  200          /api/users     145
1686837826000  404          /api/products  89
1686837827000  200          /api/orders    167
1686837828000  500          /api/payment   890
1686837829000  200          /api/users     156
1686837830000  404          /api/items     78
1686837831000  200          /api/orders    178
1686837832000  500          /api/checkout  923
1686837833000  200          /api/products  134
1686837834000  404          /api/users     92
Step-by-Step
  1. Starting with the source repository events.

  2. logscale
    groupBy(status_code)

    Groups events by unique values in the status_code field. When used without any aggregate functions, groupBy() automatically creates a field named _count showing the number of events for each unique value.

    It is the same as: groupBy(status_code, function=count())

  3. Event Result set.

Summary and Results

The query is used to analyze the distribution of status codes across all events.

The _count field is automatically added to show the number of events in each group.

This query is useful, for example, to monitor system health, identify error patterns, or track the frequency of different response types in a service.

For other examples with groupBy(), see groupBy() Syntax Examples.

Sample output from the incoming example data:

status_code  _count
200          5
404          3
500          2

Group HTTP Methods and Status Codes Using Nested groupBy()

Analyze HTTP traffic patterns by method and status code using the groupBy() function

Query
logscale
groupBy(method, function=[count(as=method_total), groupBy(statuscode, function=count(as=method_status_count))])
Introduction

In this example, the groupBy() function is used to analyze HTTP traffic patterns by grouping requests first by HTTP method and then by status code within each method.

Example incoming data might look like this:

@timestamp            method  statuscode  path           bytes
2025-08-06T10:00:00Z  GET     200         /index.html    1024
2025-08-06T10:00:01Z  POST    201         /api/users     512
2025-08-06T10:00:02Z  GET     404         /missing.html  256
2025-08-06T10:00:03Z  GET     200         /about.html    768
2025-08-06T10:00:04Z  POST    400         /api/users     128
2025-08-06T10:00:05Z  PUT     200         /api/users/1   384
2025-08-06T10:00:06Z  GET     200         /contact.html  896
2025-08-06T10:00:07Z  DELETE  204         /api/users/2   0
2025-08-06T10:00:08Z  GET     500         /error.html    1024
2025-08-06T10:00:09Z  POST    201         /api/orders    756
Step-by-Step
  1. Starting with the source repository events.

  2. logscale
    groupBy(method, function=[count(as=method_total), groupBy(statuscode, function=count(as=method_status_count))])

    Groups events by the method field and performs two aggregations:

    • Counts total events for each HTTP method using count(), and returns the result in a new field named method_total.

    • Creates a nested grouping by statuscode within each method group, counting occurrences using count() and returns the result in a new field named method_status_count.

  3. Event Result set.

Summary and Results

The query is used to analyze HTTP traffic patterns by providing a hierarchical view of request methods and their associated status codes.

This query is useful, for example, to identify patterns in API usage, detect potential issues with specific HTTP methods, or monitor the distribution of success and error responses across different request types.

Sample output from the incoming example data:

method  method_total  statuscode  method_status_count
GET     5             200         3
GET     5             404         1
GET     5             500         1
POST    3             201         2
POST    3             400         1
PUT     1             200         1
DELETE  1             204         1

Note that the output shows the total count for each HTTP method in method_total and a breakdown of status codes and their counts within each method in method_status_count.

This data is well-suited for visualization using a Sankey diagram widget, which can effectively show the flow from HTTP methods to status codes.
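If the hierarchical rows need a stable order, the nested query can also be piped to sort(); a minimal variation:

logscale
groupBy(method, function=[count(as=method_total), groupBy(statuscode, function=count(as=method_status_count))])
| sort([method, statuscode], order=asc)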

List All EC2 Hosts With FirstSeen Data Within 14 Days

List all the EC2 hosts with FirstSeen data within 14 days using the groupBy() function with selectLast()

Query
logscale
#repo=sensor_metadata #data_source_name=aidmaster cloud.provider = "AWS_EC2_V2"
| groupBy([aid], function=(selectLast([event_platform, aid, ComputerName, AgentVersion, FirstSeen])), limit=max)
| FirstSeen := formatTime("%FT%T%z", field=FirstSeen)
| TimeDelta := now() - duration("14d")
Introduction

In this example, the groupBy() function is used with selectLast() to retrieve the latest information about AWS EC2 instances running CrowdStrike sensors, showing their platform, hostname, agent version, and when they were first seen, with a 14-day reference point for age comparison.

Example incoming data (raw data in sensor_metadata) might look like this:

@timestamp            aid       cloud.provider  event_platform  ComputerName  AgentVersion  FirstSeen
2025-05-20T10:00:00Z  1234abcd  AWS_EC2_V2      Windows         ec2-web-01    6.45.15678    2025-01-15T08:30:00Z
2025-05-21T11:00:00Z  1234abcd  AWS_EC2_V2      Windows         ec2-web-01    6.45.15679    2025-01-15T08:30:00Z
2025-05-22T12:00:00Z  5678efgh  AWS_EC2_V2      Linux           ec2-app-02    6.45.15678    2025-02-01T14:45:00Z
2025-05-23T13:00:00Z  5678efgh  AWS_EC2_V2      Linux           ec2-app-02    6.45.15679    2025-02-01T14:45:00Z
2025-05-24T14:00:00Z  90123ijk  AWS_EC2_V2      Windows         ec2-db-03     6.45.15678    2025-03-10T09:15:00Z
2025-05-25T15:00:00Z  90123ijk  AWS_EC2_V2      Windows         ec2-db-03     6.45.15679    2025-03-10T09:15:00Z
Step-by-Step
  1. Starting with the source repository events.

  2. logscale
    #repo=sensor_metadata #data_source_name=aidmaster cloud.provider = "AWS_EC2_V2"

    Searches in the sensor_metadata repository, and filters for #data_source_name fields containing the value aidmaster, looking for cloud.provider of the type AWS_EC2_V2 only.

  3. logscale
    | groupBy([aid], function=(selectLast([event_platform, aid, ComputerName, AgentVersion, FirstSeen])), limit=max)

    Groups results by the field aid (Agent ID). Then, for each unique group, selects the most recent values for the fields: event_platform, aid, ComputerName, AgentVersion, FirstSeen.

    Using the selectLast() within the groupBy() is what actually selects the most recent record for each group.

  4. logscale
    | FirstSeen := formatTime("%FT%T%z", field=FirstSeen)

    Formats the timestamp in the FirstSeen field into ISO 8601 format. The result is stored back in the FirstSeen field.

  5. logscale
    | TimeDelta := now() - duration("14d")

    Calculates the timestamp from 14 days ago and returns the result in a new field named TimeDelta. The calculation is done by subtracting a 14-day duration from the current time using duration().

    This new TimeDelta field, representing a timestamp from 14 days ago, can be used for filtering or comparing against the FirstSeen timestamps.

  6. Event Result set.

Summary and Results

The query is used to retrieve the latest information about AWS EC2 instances running CrowdStrike sensors, showing their platform, hostname, agent version, and when they were first seen, with a 14-day reference point for age comparison.

The query is useful, for example, for auditing EC2 instance coverage, identifying newly added EC2 instances within the last two weeks, monitoring sensor versions or identifying aging or outdated installations.

Sample output from the incoming example data:

aid       event_platform  ComputerName  AgentVersion  FirstSeen                 TimeDelta
1234abcd  Windows         ec2-web-01    6.45.15679    2025-01-15T08:30:00+0000  2025-05-12T13:06:56+0000
5678efgh  Linux           ec2-app-02    6.45.15679    2025-02-01T14:45:00+0000  2025-05-12T13:06:56+0000
90123ijk  Windows         ec2-db-03     6.45.15679    2025-03-10T09:15:00+0000  2025-05-12T13:06:56+0000

Each aid appears only once, with its most recent values. Note that the TimeDelta value is based on the current date at the time the query was run (Mon, 26 May 2025 13:06:56 GMT).
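To turn the 14-day reference point into an actual filter, one hedged extension is to compare the two values with test(); this sketch assumes FirstSeen still holds an epoch timestamp at that point in the pipeline, that is, it would replace the formatTime() step rather than follow it:

logscale
| TimeDelta := now() - duration("14d")
| test(FirstSeen > TimeDelta)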