Collects fields from multiple events into one event. It has a limit of 1Kb per key when used as part of a groupBy() operation. This limits the number of values you can index during the aggregation.

| Parameter | Type | Required | Default Value | Description |
|---|---|---|---|---|
| fields[a] | array of strings | required | | Names of the fields to keep. |
| limit | integer | optional[b] | 2000 | Limit to number of distinct values in collect. Minimum: 1. |
| multival | boolean | optional[b] | true | Collects the resulting value as multivalue (a single field value using separator). |
| separator | string | optional[b] | \n | Separator used for multiple values. |

[a] The parameter name fields can be omitted.

[b] Optional parameters use their default value unless explicitly set.
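
As a minimal sketch of how the parameters above fit together, the following query names the fields parameter explicitly and replaces the default newline separator. The url field is borrowed from the examples on this page, and the separator value is only an illustration:

```logscale
// Collect the values of url into a single field, joined by ", " instead of the default newline.
// The fields= name could be omitted, as noted in [a].
collect(fields=[url], separator=", ")
```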


The collect() function is limited in how much memory it can use while collecting data before the data is aggregated. The limit depends on whether collect() runs as a top-level function, in which case its limit is 10 MiB:

logscale
#type = humio #kind=logs
| collect(myField)

or whether it runs in a subquery, or as a sub-aggregator to another function, in which case its limit is 1 MiB:

logscale
#type=humio #kind=logs
| groupBy(myField, function=collect(myOtherField))

Warning

Collecting the @timestamp field currently only works when a single timestamp exists. You can work around this restriction by renaming or making another field and collecting that instead, for example:

logscale
timestamp := @timestamp
| collect(timestamp)

If you do not need more than a single value and you find that the @timestamp field does not have a value, consider using the selectLast() function or setting limit=1.
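
If only the most recent timestamp per group is needed, the selectLast() suggestion above might look like the following sketch. The client_ip grouping field is an assumption borrowed from the examples on this page:

```logscale
// Keep only the latest @timestamp for each client instead of collecting every value
groupBy(client_ip, function=selectLast([@timestamp]))
```

The limit=1 alternative would use collect([@timestamp], limit=1) in place of selectLast().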

collect() Examples

Collect and Group Events by Specified Field - Example 1

Collect and group events by specified field using collect() as part of a groupBy() operation

Query
logscale
groupBy(client_ip, function=session(maxpause=1m, collect([url])))
Introduction

In this example, the collect() function is used to collect the URLs visited by each client, where a client's session is considered ended after one minute of inactivity.

Step-by-Step
  1. Starting with the source repository events.

  2. logscale
    groupBy(client_ip, function=session(maxpause=1m, collect([url])))

    Groups the events by the client_ip field, splits each client's events into sessions that end after one minute of inactivity, and collects the url values seen in each session. A count of the events is returned in a _count field.

  3. Event Result set.

Summary and Results

The query is used to collect fields from multiple events into one event. This query analyzes user behavior by grouping events into sessions for each unique client IP address. It then collects all URLs accessed during each session. This analysis is valuable for understanding user engagement and for identifying potential security issues based on unusual browsing patterns. Collecting should be used on smaller data sets to create a list (or a set or map) when you actually need a list object explicitly (for example, in order to pass it on to some other API). Using collect() on larger data sets may cause out-of-memory errors, because it returns the entire data set.
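
To reduce the risk of running into the memory limit with a query like this, the limit parameter may be used to cap how many distinct values are collected per session. A hedged variant of the query above, where the value 50 is an arbitrary illustration rather than a recommendation:

```logscale
// Same grouping as above, but collect at most 50 distinct url values per session
groupBy(client_ip, function=session(maxpause=1m, collect([url], limit=50)))
```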

Collect and Group Events by Specified Field - Example 2

Collect and group events by specified field using collect() as part of a groupBy() operation

Query
logscale
LocalAddressIP4 = * RemoteAddressIP4 = * aip = *
| groupBy([LocalAddressIP4, RemoteAddressIP4], function=([count(aip, as=aipCount, distinct=true), collect([aip])]))
Introduction

In this example, the collect() function is used to collect fields from multiple events.

Step-by-Step
  1. Starting with the source repository events.

  2. logscale
    LocalAddressIP4 = * RemoteAddressIP4 = * aip = *

    Filters for all events where the fields LocalAddressIP4, RemoteAddressIP4 and aip are all present. The actual values in these fields do not matter; the query just checks for their existence.

  3. logscale
    | groupBy([LocalAddressIP4, RemoteAddressIP4], function=([count(aip, as=aipCount, distinct=true), collect([aip])]))

    Groups the returned results by the fields LocalAddressIP4 and RemoteAddressIP4, collects all the AIPs (Adaptive Internet Protocol) into a single field and performs a distinct count on the field aip. The count of the distinct AIP values is returned in a new field named aipCount.

  4. Event Result set.

Summary and Results

The query is used to collect fields from multiple events into one event. The query is useful for network connection analysis and for identifying potential threats. Collecting should be used on smaller data sets to create a list (or a set or map) when you actually need a list object explicitly (for example, in order to pass it on to some other API). Using collect() on larger data sets may cause out-of-memory errors, because it returns the entire data set.

Sample output might look like this:

| LocalAddressIP4 | RemoteAddressIP4 | aipCount | aip |
|---|---|---|---|
| 192.168.1.100 | 203.0.113.50 | 3 | [10.0.0.1, 10.0.0.2, 10.0.0.3] |
| 10.0.0.5 | 198.51.100.75 | 1 | [172.16.0.1] |
| 172.16.0.10 | 8.8.8.8 | 5 | [192.0.2.1, 192.0.2.2, 192.0.2.3, 192.0.2.4, 192.0.2.5] |

Sort Timestamps With groupBy()

Sorting fields based on aggregated field values

Query

Search Repository: humio

logscale
timestamp := formatTime(format="%H:%M")
| groupBy([thread],
function=[{sort("timestamp")
| collect("timestamp")}])
Introduction

When using aggregation, you may want to sort on a field that is part of the aggregated set but not the main feature of the aggregated value. For example, sorting the values by their timestamp rather than the embedded value. To achieve this, use a function that sorts on the field to be used as the sort field, and then use collect() so that the value from before the aggregation can be displayed in the generated event set. This query can be executed in the humio repository.

Step-by-Step
  1. Starting with the source repository events.

  2. logscale
    timestamp := formatTime(format="%H:%M")

    Creates a new field, timestamp, formatted as HH:MM.

  3. logscale
    | groupBy([thread],

    Groups the events by the name of the thread; the formatted timestamp is handled by the aggregate function in the next step.

  4. logscale
    function=[{sort("timestamp")
    | collect("timestamp")}])

    Uses sort() combined with collect() as the method of aggregation. As an embedded expression for the function, this sorts the events on the timestamp field and then retrieves the field, as it would otherwise be removed as part of the aggregation process.

  5. Event Result set.

Summary and Results

The result set will contain a list of the aggregated thread names sorted by the timestamp:

| thread | timestamp |
|---|---|
| BootstrapInfoJob | 10:09 |
| DataSynchJob | 10:09 |
| Global event loop | 10:10 |
| LocalLivequeryMonitor | 10:09 |
| LogCollectorManifestUpdate | 10:09 |
| TransientChatter event loop | 10:10 |
| aggregate-alert-job | 10:09 |
| alert-job | 10:09 |
| block-processing-monitor-job | 10:09 |
| bloom-scheduler | 10:09 |
| bucket-entity-config | 10:09 |
| bucket-overcommit-metrics-job | 10:09 |
| bucket-storage-download | 10:09 |
| bucket-storage-prefetch | 10:09 |
| chatter-runningqueries-logger | 10:09 |
| chatter-runningqueries-stats | 10:09 |