Collects fields from multiple events into one event. It has a
limit of 1Kb per key when used as part of a
groupBy() operation. This limits the number
of values you can index during the aggregation.
[b] Optional parameters use their default value unless explicitly set.
Omitted Argument Names
The argument name for fields can be omitted; the following forms of this function are equivalent:
logscale Syntax
collect(["value"])
and:
logscale Syntax
collect(fields=["value"])
These examples show basic structure only.
The collect() function is limited in the
memory it can use while collecting data before the data is aggregated.
The limit depends on whether
collect() runs as a top-level function,
in which case its limit is 10 MiB:
logscale
#type=humio #kind=logs | collect(myField)
or whether it runs in a subquery or as a sub-aggregator to
another function, in which case its limit is 1 MiB.
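As a minimal sketch of the sub-aggregator case, the query below runs collect() inside groupBy(), so the 1 MiB limit would apply. The field names myField and myOtherField are placeholders chosen for illustration, not fields from a real data set.

```logscale
// collect() runs as a sub-aggregator of groupBy(), not at the top level
#type=humio #kind=logs
| groupBy(myField, function=collect(myOtherField))
```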
Collecting the @timestamp field currently
only works when a single timestamp exists. You can work around
this restriction by copying the value into another field and
collecting that field instead, for example:
logscale
timestamp := @timestamp | collect(timestamp)
If you find that the @timestamp field has no
value and you do not need more than a single value, consider
using the selectLast() function or setting
limit=1.
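The two single-value options could be sketched as follows. These are two alternative queries, to be run separately; both are illustrative only and assume that a single @timestamp value is all that is needed.

```logscale
// Alternative 1: keep only the most recent value of the field
selectLast([@timestamp])

// Alternative 2 (a separate query): collect at most one value
collect(@timestamp, limit=1)
```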
The collect() function can be used to collect
fields from multiple events into one event as part of a
groupBy() operation. The
groupBy() function is used to group together
events by one or more specified fields. It is used to extract
additional aggregations from the data and then add a calculation to
it using the count() function. In this
example, the collect() function is used to
collect visitors, each visitor defined as non-active after one
minute.
Collects visitors (URLs), each visitor defined as non-active
after one minute, and returns the results in an array named
client_ip. A count of the events is
returned in a _count field.
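A query matching this description might look like the sketch below. It is reconstructed from the prose rather than copied from the original example: the url field name is an assumption, and the sketch covers only the session grouping and collection, not the event count.

```logscale
// Group by client IP, split each client's events into sessions that end
// after one minute of inactivity, and collect the URLs seen in each session
groupBy(client_ip, function=session(maxpause=1m, collect([url])))
```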
Event Result set.
Summary and Results
The query is used to collect fields from multiple events into
one event. This query analyzes user behavior by grouping events
into sessions for each unique client IP address. It then
collects all URLs accessed during each session. Collecting
should be used on smaller data sets to create a list (or a set
or map) when you actually need a list object
explicitly (for example, to pass it on to some other
API). This analysis is valuable for understanding user
engagement and identifying potential security issues based on
unusual browsing patterns. Using collect()
on larger data sets may cause out-of-memory errors, because it
returns the entire data set.
Collect and Group Events by Specified Field - Example 2
Collect and group events by specified field using collect() as part of a groupBy() operation
The collect() function can be used to collect
fields from multiple events into one event as part of a
groupBy() operation. The
groupBy() function is used to group together
events by one or more specified fields. It is used to extract
additional aggregations from the data and then add a calculation to
it using the count() function. In this
example, the collect() function is used to
collect fields from multiple events.
Step-by-Step
Starting with the source repository events.
logscale
LocalAddressIP4=* RemoteAddressIP4=* aip=*
Filters for all events where the fields
LocalAddressIP4,
RemoteAddressIP4 and
aip are all present. The actual values in
these fields do not matter; the query just checks for their
existence.
Groups the returned results in arrays named
LocalAddressIP4 and
RemoteAddressIP4, collects all the AIPs
(Adaptive Internet Protocol) into an array and performs a count
on the field aip. The count of the AIP
values is returned in a new field named
aipCount.
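The grouping step described above could be written as in the following sketch. It is assembled from the prose description, so the exact form of the original query is an assumption.

```logscale
// Keep events where all three fields exist, then group by the address pair,
// collecting the aip values and counting them into aipCount
LocalAddressIP4=* RemoteAddressIP4=* aip=*
| groupBy([LocalAddressIP4, RemoteAddressIP4], function=[collect([aip]), count(aip, as=aipCount)])
```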
Event Result set.
Summary and Results
The query is used to collect fields from multiple events into
one event. Collecting should be used on smaller data sets to
create a list (or a set or map) when you actually
need a list object explicitly (for example, to pass it
on to some other API). Using collect() on
larger data sets may cause out-of-memory errors, because it returns
the entire data set. The query is useful for network connection
analysis and for identifying potential threats.
When using aggregation, you may want to sort on a field that is
part of the aggregated set but not the main feature of the
aggregated value, for example, sorting the values by their
timestamp rather than the embedded value. To achieve this,
use a function that sorts on the field to be used as the sort
field, and then use collect() so that the
value from before the aggregation can be displayed in the generated
event set. This query can be executed in the humio
repository.
Step-by-Step
Starting with the source repository events.
logscale
timestamp:=formatTime(format="%H:%M")
Creates a new field,
timestamp formatted as
HH:MM.
logscale
|groupBy([thread],
Groups the events, first by the name of the thread and then the
formatted timestamp.
Uses sort() combined with
collect() as the method of aggregation. As
an embedded expression for the function, this sorts the
events on the timestamp
field and then retrieves the field, which would otherwise be
removed as part of the aggregation process.
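Put together, the steps above might form a query like the following sketch. The groupBy() step is shown only partially in this section, so the completed function argument here is reconstructed from the description and should be treated as an assumption.

```logscale
// Format the event time as HH:MM, then for each thread sort on that value
// and collect it so that it survives the aggregation
timestamp := formatTime(format="%H:%M")
| groupBy([thread], function=[{sort("timestamp") | collect("timestamp")}])
```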
Event Result set.
Summary and Results
The result set will contain a list of the aggregated thread
names sorted by the timestamp: