Collects a set of values from one or more fields. The collect() function outputs these values in one of two formats: either as a single concatenated string, or as multiple rows with individual values. The order of the output values is undefined.
The collect() function operates under two collection restrictions: a maximum number of values and a maximum memory allocation. The limit parameter defines the maximum number of values that can be collected. The memory limit depends on where the function is used: 10 MiB if collect() is run as a top-level function, and 1 MiB in all other contexts.
When the collect() function exceeds either limit, it returns partial results and displays a warning message that explains the situation.
Parameter | Type | Required | Default Value | Description |
---|---|---|---|---|
fields [a] | array of strings | required | | Names of the fields to collect values for. |
limit | integer | optional [b] | 2000 | Limit to number of distinct values to collect. Minimum: 1. |
multival | boolean | optional [b] | true | Whether to output values as a concatenation. true: output as a single concatenated string. false: output as multiple rows containing individual values. |
separator | string | optional [b] | \n | Separator used when concatenating values (when multival=true). |
[a] The argument name fields can be omitted.
[b] Optional parameters use their default value unless explicitly set.
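As a rough illustration of how these parameters combine, the two queries below are sketches that use only the parameters listed in the table. The field name myField and the limit of 100 are placeholders, not recommendations. The first query collects up to 100 distinct values and joins them into one comma-separated string:

```logscale
collect([myField], limit=100, separator=", ")
```

The second query outputs the values as individual rows instead of a concatenated string:

```logscale
collect([myField], multival=false)
```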
Omitted Argument Names
The argument name for fields can be omitted; the following forms of this function are equivalent:
collect(["value"])
and:
collect(fields=["value"])
These examples show basic structure only.
collect() Function Operation
The collect() function is limited in the memory it can use while collecting data, before the data is aggregated. The limit depends on whether collect() runs as a top-level function, in which case its limit is 10 MiB:
#type = humio #kind=logs
| collect(myField)
or whether it runs in a subquery, or as a sub-aggregator to another function, in which case its limit is 1 MiB:
#type=humio #kind=logs
| groupBy(myField, function=collect(myOtherField))
Warning
Collecting the @timestamp field currently only works when a single timestamp exists. You can work around this restriction by copying the value to another field and collecting that field instead, for example:
timestamp := @timestamp
| collect(timestamp)
If you find that the @timestamp field has no value in the result, and you do not need more than a single value, consider using the selectLast() function or setting limit=1.
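The single-value alternatives mentioned above might look like the following sketches. Both assume that one timestamp per result is sufficient. The array form of the selectLast() argument is an assumption modeled on how collect() takes its fields parameter, so check the selectLast() reference before relying on it.

```logscale
collect(@timestamp, limit=1)
```

```logscale
selectLast([@timestamp])
```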
collect() Examples
Collect and Group Events by Specified Field - Example 1
Collect and group events by specified field using collect() as part of a groupBy() operation
Query
groupBy(client_ip, function=session(maxpause=1m, collect([url])))
Introduction
In this example, the collect() function is used to collect the URLs visited by each visitor, where a visitor is considered inactive after one minute without events.
Step-by-Step
Starting with the source repository events.
- logscale
groupBy(client_ip, function=session(maxpause=1m, collect([url])))
Groups the events by the client_ip field, splits the events of each client into sessions that end after one minute of inactivity, and collects the url values seen in each session.
Event Result set.
Summary and Results
The query is used to collect field values from multiple events into one
event. This query analyzes user behavior by grouping events into sessions
for each unique client IP address. It then collects all URLs accessed
during each session. Collecting should be used on smaller data sets to
create a list (or a set or map) when you actually need that object
explicitly (for example, in order to pass it on to some other API). This
analysis is valuable for understanding user engagement and for identifying
potential security issues based on unusual browsing patterns. Using
collect() on larger data sets may cause out-of-memory problems, as it
returns the entire data set.
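If the number of URLs per session could be large, one way to keep the result small is to lower the limit parameter on the inner collect(). The query below is a sketch only: the value 50 is an arbitrary placeholder, and when the limit is exceeded collect() returns partial results with a warning.

```logscale
groupBy(client_ip, function=session(maxpause=1m, collect([url], limit=50)))
```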
Collect and Group Events by Specified Field - Example 2
Collect and group events by specified field using collect() as part of a groupBy() operation
Query
LocalAddressIP4 = * RemoteAddressIP4 = * aip = *
| groupBy([LocalAddressIP4, RemoteAddressIP4], function=([count(aip, as=aipCount, distinct=true), collect([aip])]))
Introduction
In this example, the collect() function is used to collect field values from multiple events.
Step-by-Step
Starting with the source repository events.
- logscale
LocalAddressIP4 = * RemoteAddressIP4 = * aip = *
Filters for all events where the fields LocalAddressIP4, RemoteAddressIP4 and aip are all present. The actual values in these fields do not matter; the query just checks for their existence.
- logscale
| groupBy([LocalAddressIP4, RemoteAddressIP4], function=([count(aip, as=aipCount, distinct=true), collect([aip])]))
Groups the returned results by the LocalAddressIP4 and RemoteAddressIP4 fields, collects all values of the aip field for each group, and counts the distinct aip values. The count is returned in a new field named aipCount.
Event Result set.
Summary and Results
The query is used to collect field values from multiple events into one
event. Collecting should be used on smaller data sets to create a list
(or a set or map) when you actually need that object explicitly (for
example, in order to pass it on to some other API). Using collect() on
larger data sets may cause out-of-memory problems, as it returns the
entire data set. The query is useful for network connection analysis and
for identifying potential threats.
Sample output might look like this:
LocalAddressIP4 | RemoteAddressIP4 | aipCount | aip |
---|---|---|---|
192.168.1.100 | 203.0.113.50 | 3 | [10.0.0.1, 10.0.0.2, 10.0.0.3] |
10.0.0.5 | 198.51.100.75 | 1 | [172.16.0.1] |
172.16.0.10 | 8.8.8.8 | 5 | [192.0.2.1, 192.0.2.2, 192.0.2.3, 192.0.2.4, 192.0.2.5] |