The groupBy() query function groups events by one or more specified fields. This is similar to the GROUP BY clause in SQL databases.

field[a] (array of strings, required)
    Specifies which field or fields to group by.

function (array of aggregate functions, optional[b], default: count(as=_count))
    Specifies which aggregate functions to use with each group. If several aggregators are listed for the function parameter, then their outputs are combined using the rules described for stats().

limit (string, optional[b], default: 20,000, maximum: 1,000,000)
    Sets the limit for the number of group elements. The default value is controlled by the GroupDefaultLimit dynamic configuration, and the maximum value by the GroupMaxLimit dynamic configuration. If the argument is max (limit=max), then the value of GroupMaxLimit is used. The limit prioritizes the top-N series, where the top N are the series with the highest numerical value attributed to them by the subquery across all fields.

[a] The parameter name field can be omitted.

[b] Optional parameters use their default value unless explicitly set.
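
As an illustration of footnote [a], because the parameter name field can be omitted, the following two queries are equivalent (host is simply an example field name):

logscale
groupBy(field=host, function=count())

logscale
groupBy(host, function=count())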

groupBy() Function Operation

The groupBy() function has specific implementation and operational considerations, outlined below.

Series Selection in groupBy()

The selection is based on the aggregate numerical output across all specified functions and all time buckets, not the series identifiers themselves.

The limit prioritizes the top-N series, where the top N are the series with the highest numerical value attributed to them by the subquery across all fields.

Series Selection Process:

  • The selection is based on the numerical values produced by the subquery/function.

  • It is not based on the series names.

  • When multiple functions are used, the function considers all values produced.

For different examples of top N series selection, see Find Top N Value of Series - Example 1 and Find Top N Value of Series - Example 2.
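
As a sketch of how this selection works in practice (host and bytes_sent are example fields, not ones used elsewhere in this section), the following query keeps only the five hosts whose summed values are largest, regardless of the host names themselves:

logscale
groupBy(host, function=sum(bytes_sent), limit=5)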

Grouping in groupBy()

The groupBy() function can be used to execute aggregate functions on each group. Each aggregate function is applied to the events grouped by the field or fields given in the field parameter. For example, by default the count() function is used, and the count for each group is returned in the _count field.

The default is to use count() as an aggregate function. However, if the goal is to only find the unique values of a field or combinations of multiple fields, then groupBy() can be provided with an empty list as its aggregate function.

When showing time series data, the timeChart() and bucket() functions are extensions of groupBy() that also group by time.
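
As a rough illustration of that relationship (host is an example field), the following query buckets events into one-hour spans and counts them per host, which is conceptually a groupBy() on host combined with a grouping on time:

logscale
timeChart(host, function=count(), span=1h)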

Limits when using groupBy()

Warning

The groupBy() function may return incomplete results, restricted to a subset of the groups and their related events, in cases where the limit parameter is exceeded.

The groupBy() function is limited in the number of groups it will handle. This limit is set in the LogScale configuration using GraphQL mutations and is bounded by the GroupMaxLimit dynamic configuration.

Starting from version 1.127, the maximum value of the limit parameter is 1,000,000 by default (adjustable via the GroupMaxLimit dynamic configuration), which means it is possible to create up to 1,000,000 groups.

When the limit parameter is not supplied explicitly, its default value is determined by the GroupDefaultLimit dynamic configuration, which has a default value of 20,000.

GroupDefaultLimit cannot be higher than GroupMaxLimit.

Attention

The limit parameter is governed by the GroupMaxLimit and GroupDefaultLimit dynamic configurations, which are initially set to the same value for a seamless upgrade.

Additionally, like all LogScale functions, groupBy() has an internal memory limit determined by the dynamic configuration QueryCoordinatorMemoryLimit. This means that it is possible for the function to collect fewer than the specified limit number of groups, if the total amount of data collected by the function exceeds this limit. The internal memory limit is in place to protect the cluster health and ensure sufficient memory for other queries to run. Should the memory limit be reached, no more groups are added. This may lead to results that are inconsistent with the results shown prior to the limit being reached. The groups that were present in prior results may be pushed out of the result set during merges between nodes, to ensure that the memory usage is kept within bounds. If a group is shown, the results for that group will be correct. In situations where you reach the memory limit, you may want to use the top() function to narrow the list of groups.

One of the reasons for this limitation is because groupBy() is implemented to work entirely in memory, and cannot spill to disk. The limit helps to prevent searches from consuming too much memory. This method and limitation may be changed in future versions of LogScale.
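
For example, if only the most frequent groups are of interest, a sketch along these lines (status_code is an example field) keeps the number of groups bounded without relying on groupBy() alone:

logscale
top(status_code, limit=100)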

groupBy() Syntax Examples

Since the most common use-case is to count the distinct values of a field, the default behavior is to use count() to aggregate the events. The simplest use-case of groupBy() is, therefore, to count the occurrence of the distinct values of a field:

logscale
groupBy(status_code)

Given input:

status_code ip
440 1.111.111.111
500 1.111.111.111
440 1.111.111.111
440 2.222.222.222

The output is:

status_code _count
440 3
500 1

If the task is only to find the unique values, groupBy() can be provided with an empty list as the aggregate function, implying that nothing will be aggregated:

logscale
groupBy(status_code, function=[])

The output is:

status_code
440
500

Similarly, groupBy() can be used to find the unique combinations of field values present in the data:

logscale
groupBy([status_code, ip], function=[])

The output is:

status_code ip
440 1.111.111.111
500 1.111.111.111
440 2.222.222.222

A usage of the function such as the one above is also very useful for query-based parameters in dashboards, see Query Parameter for more explanations.

As an example of how to use this query function, suppose you have a LogScale repository that is ingesting data from one or more web servers. Now suppose you want to get a list of the HTTP status codes returned when users access the server (for instance, HTTP code 404 for web pages that were not found). You could get a list of all the HTTP codes returned, and a count of each, by executing a query like this:

logscale
groupBy(field=statuscode, function=count())
| sort(statuscode, order=asc)

In addition to the groupBy(), the results here are piped to the sort() function to sort the results from the lowest HTTP code to the highest.

statuscode   _count
200          7485
301          2314
304          65
400          101
404          6425
405          2
408          9

With the groupBy() function, you can have LogScale group by more than one field. You would just give the fields in an array, a comma-separated list within square-brackets. It would look something like this query in which events are grouped on HTTP method and status code:

logscale
groupBy(field=[method, statuscode], function=count())
| sort([method, statuscode], order=asc)

This will produce these results:

method    statuscode
CONNECT   301
GET       200
GET       301
GET       304
GET       400
GET       404
HEAD      200
HEAD      301
HEAD      400
HEAD      404
OPTIONS   200
OPTIONS   301
POST      200
POST      301
POST      400
POST      404

Although this query works, you may also want to show the total for each method, not just the total for each status code of each method. To do that, you'll have to adjust the query to look like this:

logscale
[groupBy(method, function=[count(as=method_total), groupBy(statuscode, function=count(as=method_status_count))])]

In this query, a groupBy() is nested within another groupBy(). The results will look like this:

method   method_total   statuscode   method_status_count
GET      14078          200          7326
GET      14078          301          1246
GET      14078          304          65
GET      14078          400          70
GET      14078          404          5371
HEAD     139            200          14
HEAD     139            301          22
HEAD     139            400          6
HEAD     139            404          97
POST     2005           200          14
POST     2005           301          1002
POST     2005           400          25
POST     2055           404          964

These results still might not be as tidy as you like. You might do better with separate queries for each method or each status code — assuming you're interested in only a few specific ones. Another possibility might be to use an API to be able to assemble the results in a manner that you prefer.

Instead of a single function, additional query functions can also be applied, allowing for calculations on the count that is obtained from groupBy(). The following query contains an embedded expression within groupBy(), as a demonstration of formatting and calculating the value within a groupBy() expression:

logscale
groupBy(host, function=[{count()
| esp:=_count/300}])

The query generates a count() of the number of events for each host in the aggregation, which creates the field _count; that value is then divided by 300 and placed into the field esp.

groupBy() Examples

Aggregate Array Content

Query
logscale
array:contains("incidents[]", value="Cozy Bear")
| groupBy(host)
Introduction

Given events containing an incidents array:

Event 1

|--------------|-------------|
| host         | v1          |
| incidents[0] | Evil Bear   |
| incidents[1] | Cozy Bear   |
|--------------|-------------|

Event 2

|--------------|-------------|
| host         | v15         |
| incidents[0] | Fancy Fly   |
| incidents[1] | Tiny Cat    |
| incidents[2] | Cozy Bears  |
|--------------|-------------|

The query finds all the events where the incidents array contains the exact value Cozy Bear and groups them by which hosts were affected, giving the output event shown below.

Step-by-Step
  1. Starting with the source repository events.

  2. logscale
    array:contains("incidents[]", value="Cozy Bear")

    Filters for events where the incidents array contains an element exactly matching the text Cozy Bear.

  3. logscale
    | groupBy(host)

    Groups the result events extracted from the array by the host.

  4. Event Result set.

Summary and Results

The result is an aggregated count of the array elements matching Cozy Bear.

field    value
host     v1
_count   1

Alert Query for Parsers Issues

Reporting errors

Query
logscale
#type=humio #kind=logs
| loglevel=WARN
| class = c.h.d.ParserLimitingJob
| "Setting reject ingest for"
| groupBy(id, function=[count(), min(@timestamp), max(@timestamp)] )
| timeDiff:=_max-_min
| timeDiff > 300000 and _count > 10
Introduction

This alert query tries to balance reacting when there are problems with parsers, without being too restrictive.

Step-by-Step
  1. Starting with the source repository events.

  2. logscale
    #type=humio #kind=logs

    Filters on all logs across all hosts in the cluster.

  3. logscale
    | loglevel=WARN

    Filters for all events where the loglevel is equal to WARN.

  4. logscale
    | class = c.h.d.ParserLimitingJob

    Filters for events where the class field is equal to c.h.d.ParserLimitingJob among the logs having the loglevel value WARN.

  5. logscale
    | "Setting reject ingest for"

    Filters for events containing the string Setting reject ingest for. This is the error message generated when ingested events are rejected.

  6. logscale
    | groupBy(id, function=[count(), min(@timestamp), max(@timestamp)] )

    Groups the returned result by the field id, makes a count on the events and returns the minimum timestamp and maximum timestamp. This returns a new event set, with the fields id, _count, _min, and _max.

  7. logscale
    | timeDiff:=_max-_min

    Calculates the time difference between the maximum timestamp values and the minimum timestamp values and returns the result in a new field named timeDiff.

  8. logscale
    | timeDiff > 300000 and _count > 10

    Returns all events where the value of timeDiff is greater than 300000 and where there are more than 10 occurrences.

  9. Event Result set.

Summary and Results

This query is used to set up alerts for parser issues. Setting up alerts for parser issues allows you to proactively reach out to customers whose queries are being throttled and help them.

Analyze User Sessions Based on Click Activity

Analyzes user sessions based on users click activity using the session() function

Query
logscale
groupBy(cookie_id, function=session(maxpause=15m, count(as=clicks)))
| sort(clicks)
Introduction

In this example, the session() function is used to analyze user sessions based on users click activity. The session() function groups events by a given timespan.

Example incoming data might look like this:

timestamp           | cookie_id | action_type | page_url       | user_agent
2025-05-15 05:30:00 | user123   | pageview    | /home          | Mozilla/5.0 (Windows NT 10.0; Win64; x64)
2025-05-15 05:30:15 | user123   | click       | /products      | Mozilla/5.0 (Windows NT 10.0; Win64; x64)
2025-05-15 05:30:30 | user123   | click       | /product/item1 | Mozilla/5.0 (Windows NT 10.0; Win64; x64)
2025-05-15 05:31:00 | user123   | click       | /cart          | Mozilla/5.0 (Windows NT 10.0; Win64; x64)
2025-05-15 05:31:30 | user123   | click       | /checkout      | Mozilla/5.0 (Windows NT 10.0; Win64; x64)
2025-05-15 05:35:00 | user456   | pageview    | /home          | Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)
2025-05-15 05:35:30 | user456   | click       | /about         | Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)
2025-05-15 05:36:00 | user456   | click       | /contact       | Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)
2025-05-15 05:38:00 | user789   | pageview    | /home          | Mozilla/5.0 (iPhone; CPU iPhone OS 14_0)
2025-05-15 05:38:30 | user789   | click       | /products      | Mozilla/5.0 (iPhone; CPU iPhone OS 14_0)
Step-by-Step
  1. Starting with the source repository events.

  2. logscale
    groupBy(cookie_id, function=session(maxpause=15m, count(as=clicks)))

    Groups events by the field cookie_id (unique user identifier) and creates sessions with 15-minute inactivity timeout (the default value of the maxpause parameter), then makes a count of each event in a session returning the result in a new field named clicks.

  3. logscale
    | sort(clicks)

    Sorts the results by number of clicks (default is descending order).

  4. Event Result set.

Summary and Results

The query is used to analyze user sessions based on the users click activity. The query is useful, for example, to identify most/least active user sessions, detect potential automated behavior or just to understand user engagement levels.

Sample output from the incoming example data:

cookie_id   clicks
user123     5
user456     3
user789     2

Note that each row represents an event (either pageview or click).

Calculate Query Cost for All Users by Repository

Search across multiple repositories to calculate query costs for all users by repository using sort() and groupBy() functions

Query
logscale
#type=humio #kind=logs class=c.h.j.RunningQueriesLoggerJob message="Highest Cost query"
 | groupBy(repo, initiatingUser, totalLiveCost, totalStaticCost)
 | sort([totalLiveCost, totalStaticCost])
Introduction

In this example, the query uses sort() and groupBy() functions to find query costs. The query filters logs in humio repository that are tagged with kind equal to logs and then returns the events where the class field has values containing c.h.j.RunningQueriesLoggerJob, searching for the specific value Highest Cost query.

Example incoming data might look like this:

#type | #kind | class                         | message            | timestamp            | dataspace   | initiatingUser | totalLiveCost | totalStaticCost | deltaTotalCost | repo
humio | logs  | c.h.j.RunningQueriesLoggerJob | Highest Cost query | 2025-03-26T09:30:00Z | production  | john.doe       | 1500          | 800             | 2300           | security-logs
humio | logs  | c.h.j.RunningQueriesLoggerJob | Highest Cost query | 2025-03-26T09:31:00Z | development | jane.smith     | 2000          | 1200            | 3200           | app-logs
humio | logs  | c.h.j.RunningQueriesLoggerJob | Highest Cost query | 2025-03-26T09:32:00Z | staging     | bob.wilson     | 1000          | 500             | 1500           | infra-logs
humio | logs  | c.h.j.RunningQueriesLoggerJob | Highest Cost query | 2025-03-26T09:33:00Z | production  | john.doe       | 1800          | 900             | 2700           | security-logs
humio | logs  | c.h.j.RunningQueriesLoggerJob | Highest Cost query | 2025-03-26T09:34:00Z | development | jane.smith     | 2500          | 1300            | 3800           | app-logs
humio | logs  | c.h.j.RunningQueriesLoggerJob | Highest Cost query | 2025-03-26T09:35:00Z | staging     | alice.cooper   | 1200          | 600             | 1800           | infra-logs
Step-by-Step
  1. Starting with the source repository events.

  2. logscale
    #type=humio #kind=logs class=c.h.j.RunningQueriesLoggerJob message="Highest Cost query"

    Filters for Humio internal logs containing c.h.j.RunningQueriesLoggerJob in the class field and where the value in the message field is equal to Highest Cost query.

  3. logscale
    | groupBy(repo, initiatingUser, totalLiveCost, totalStaticCost)

    Groups the results by the repo field, the initiatingUser field, and by both cost types (the fields totalLiveCost, totalStaticCost), and returns a count in a field named _count.

  4. logscale
    | sort([totalLiveCost, totalStaticCost])

    Sorts the results by both the totalLiveCost field and the totalStaticCost field, in descending order by default.

  5. Event Result set.

Summary and Results

The query is used to search across multiple repositories and output query costs for all users by repository. The query returns the count in a field named _count. Use this query to focus on live and static costs separately.

Sample output from the incoming example data:

repo          | initiatingUser | totalLiveCost | totalStaticCost | _count
app-logs      | jane.smith     | 2000          | 1200            | 1
security-logs | john.doe       | 1500          | 800             | 1
infra-logs    | bob.wilson     | 1000          | 500             | 1

Calculate Relationship Between X And Y Variables - Example 3

Calculate the linear relationship between server load and each of several types of request types using the linReg() function with bucket() and groupBy()

Query
logscale
bucket(function=[ avg(server_load_pct, as=y), groupBy(request_type, function=count(as=x)) ])
| groupBy(request_type, function=linReg(x=x, y=y))
Introduction

In this example, the linReg() function is used to calculate the linear relationship between request_type (x variable) and server_load_pct (y variable). The example shows the relationship between server load and each of several types of HTTP request types across time.

Example incoming data might look like this:

@timestamp               | server_load_pct | request_type
2024-01-15T09:00:00.000Z | 45.2            | GET
2024-01-15T09:00:00.000Z | 45.2            | POST
2024-01-15T09:00:00.000Z | 45.2            | GET
2024-01-15T09:05:00.000Z | 52.8            | GET
2024-01-15T09:05:00.000Z | 52.8            | PUT
2024-01-15T09:05:00.000Z | 52.8            | POST
2024-01-15T09:10:00.000Z | 48.6            | GET
2024-01-15T09:10:00.000Z | 48.6            | GET
2024-01-15T09:10:00.000Z | 48.6            | DELETE
2024-01-15T09:15:00.000Z | 65.3            | POST
2024-01-15T09:15:00.000Z | 65.3            | POST
2024-01-15T09:15:00.000Z | 65.3            | GET
2024-01-15T09:20:00.000Z | 42.1            | GET
2024-01-15T09:20:00.000Z | 42.1            | PUT
2024-01-15T09:20:00.000Z | 42.1            | GET
Step-by-Step
  1. Starting with the source repository events.

  2. logscale
    bucket(function=[ avg(server_load_pct, as=y), groupBy(request_type, function=count(as=x)) ])

    Buckets the data points by time, then calculates the average server load for each time bucket returning the result in a field named y. It also groups the request types in a field named request_type and makes a count of requests by type in each time bucket returning the result in a field named x.

  3. logscale
    | groupBy(request_type, function=linReg(x=x, y=y))

    Correlates x with y, showing the relationship between the variables x and y for each HTTP request type, and outputs the results in fields named _slope (slope value), _intercept (intercept value), _r2 (adjusted R-squared value), and _n (number of data points). These four key values indicate relationship strength and reliability.

  4. Event Result set.

Summary and Results

The query is used to analyze how different HTTP request types affect server load. The analysis helps identify which HTTP request types have the strongest impact on server performance.

Sample output from the incoming example data:

request_type | _slope              | _intercept        | _r2                | _n
DELETE       | <no value>          | <no value>        | <no value>         | <no value>
GET          | -13.749999999999941 | 72.7999999999999  | 0.5941824574313592 | 5
POST         | 16.29999999999992   | 32.70000000000012 | 0.7196207242484238 | 3
PUT          | <no value>          | <no value>        | <no value>         | <no value>

_slope is the impact rate of request volume on server load.

_intercept is the baseline server load when there are no requests of a specific type.

_r2 is the statistical accuracy of the relationship.

_n is the total number of data points analyzed.

Collect and Group Events by Specified Field - Example 1

Collect and group events by specified field using collect() as part of a groupBy() operation

Query
logscale
groupBy(client_ip, function=session(maxpause=1m, collect([url])))
Introduction

In this example, the collect() function is used to collect visitors, each visitor defined as non-active after one minute.

Step-by-Step
  1. Starting with the source repository events.

  2. logscale
    groupBy(client_ip, function=session(maxpause=1m, collect([url])))

    Groups events by the field client_ip, creating a session for each visitor; a visitor is considered non-active after one minute without events. Within each session, the values of the url field are collected into an array. A count of the events in each session is returned in a _count field.

  3. Event Result set.

Summary and Results

The query is used to collect fields from multiple events into one event. This query analyzes user behavior by grouping events into sessions for each unique client IP address and then collecting all URLs accessed during each session. This analysis is valuable for understanding user engagement and identifying potential security issues based on unusual browsing patterns. Collecting should be used on smaller data sets to create a list (or set, or map) when you actually need a list object explicitly (for example, in order to pass it on to some other API). Using collect() on larger data sets may cause out-of-memory errors, as it returns the entire data set.

Collect and Group Events by Specified Field - Example 2

Collect and group events by specified field using collect() as part of a groupBy() operation

Query
logscale
LocalAddressIP4 = * RemoteAddressIP4 = * aip = *
| groupBy([LocalAddressIP4, RemoteAddressIP4], function=([count(aip, as=aipCount, distinct=true), collect([aip])]))
Introduction

In this example, the collect() function is used to collect fields from multiple events.

Step-by-Step
  1. Starting with the source repository events.

  2. logscale
    LocalAddressIP4 = * RemoteAddressIP4 = * aip = *

    Filters for all events where the fields LocalAddressIP4, RemoteAddressIP4 and aip are all present. The actual values in these fields do not matter; the query just checks for their existence.

  3. logscale
    | groupBy([LocalAddressIP4, RemoteAddressIP4], function=([count(aip, as=aipCount, distinct=true), collect([aip])]))

    Groups the returned results by the fields LocalAddressIP4 and RemoteAddressIP4, collects all the AIPs (Adaptive Internet Protocol) into an array, and performs a distinct count on the field aip. The count of the distinct AIP values is returned in a new field named aipCount.

  4. Event Result set.

Summary and Results

The query is used to collect fields from multiple events into one event. Collecting should be used on smaller data sets to create a list (or set, or map) when you actually need a list object explicitly (for example, in order to pass it on to some other API). Using collect() on larger data sets may cause out-of-memory errors, as it returns the entire data set. The query is useful for network connection analysis and for identifying potential threats.

Sample output might look like this:

LocalAddressIP4 | RemoteAddressIP4 | aipCount | aip
192.168.1.100   | 203.0.113.50     | 3        | [10.0.0.1, 10.0.0.2, 10.0.0.3]
10.0.0.5        | 198.51.100.75    | 1        | [172.16.0.1]
172.16.0.10     | 8.8.8.8          | 5        | [192.0.2.1, 192.0.2.2, 192.0.2.3, 192.0.2.4, 192.0.2.5]

Count Unique Visitors Based on Client IP Addresses

Count unique visitors based on client IP addresses using the session() function

Query
logscale
groupBy(client_ip, function=session(maxpause=15m))
| count()
Introduction

In this example, the session() function is used to count the unique visitors (each visitor defined as non-active for 15 minutes) of a site based on client IP addresses. The session() function groups events by a given timespan.

Example incoming data might look like this:

timestamp           | client_ip     | url        | status_code | user_agent
2025-05-15 05:30:00 | 192.168.1.100 | /login     | 200         | Mozilla/5.0 (Windows NT 10.0; Win64; x64)
2025-05-15 05:31:15 | 192.168.1.100 | /dashboard | 200         | Mozilla/5.0 (Windows NT 10.0; Win64; x64)
2025-05-15 05:32:30 | 192.168.1.100 | /reports   | 200         | Mozilla/5.0 (Windows NT 10.0; Win64; x64)
2025-05-15 05:48:00 | 192.168.1.100 | /login     | 200         | Mozilla/5.0 (Windows NT 10.0; Win64; x64)
2025-05-15 05:30:05 | 192.168.1.101 | /login     | 200         | Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)
2025-05-15 05:35:10 | 192.168.1.101 | /profile   | 200         | Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)
2025-05-15 05:40:00 | 192.168.1.102 | /login     | 200         | Mozilla/5.0 (iPhone; CPU iPhone OS 14_0)
2025-05-15 05:41:30 | 192.168.1.102 | /settings  | 200         | Mozilla/5.0 (iPhone; CPU iPhone OS 14_0)
2025-05-15 05:42:45 | 192.168.1.102 | /logout    | 200         | Mozilla/5.0 (iPhone; CPU iPhone OS 14_0)
Step-by-Step
  1. Starting with the source repository events.

  2. logscale
    groupBy(client_ip, function=session(maxpause=15m))

    Groups events by the field client_ip into sessions, where a session ends after 15 minutes of inactivity.

    The maxpause parameter defines the maximum pause between events within a session (15m in this example). Events further apart than the defined value become separate sessions. For example, if the same user returns to a site within 15 minutes, it is counted as part of the same session.

  3. logscale
    | count()

    Makes a count of the total number of unique sessions.

  4. Event Result set.

Summary and Results

The query is used to group events by client IP addresses into sessions of 15m, and then make a count of the total number of unique sessions (it returns the total count of sessions across all IP addresses). The query is useful, for example, for measuring unique website/application visitors and understanding real user engagement patterns, as well as for security monitoring and detecting unusual spikes in unique visitors.

Sample output from the incoming example data:

_count
4

The query counts 4 unique sessions total as the first IP address has activity that spans beyond the 15-minute session timeout, creating two distinct sessions.

If you make the count on the client_ip field: | count(client_ip), the query will return a more detailed result showing the session count per IP address:

client_ip       _count
192.168.1.100   2
192.168.1.101   1
192.168.1.102   1

Create a Pivot Table

Creating a view of LogScale activity

Query
logscale
groupBy([type,actor.user.id],function={groupBy(actor.user.id, function=max(@timestamp))})
|transpose(header=type)
|drop(column)
Introduction

The humio-audit repository contains audit events for the LogScale cluster. Reporting on this information can provide a wealth of information, but a useful summary can be created based on the activities, users and which the latest user of that particular operation.

Step-by-Step
  1. Starting with the source repository events.

  2. logscale
    groupBy([type,actor.user.id],function={groupBy(actor.user.id, function=max(@timestamp))})

    The first step to creating a pivot table is the base query that creates the initial summary of the information. This fragment uses a nested groupBy() aggregation. The embedded aggregation creates a group of the maximum access time for a given user, by using max() on @timestamp against actor.user.id. This creates a table of the last event by each user. The outer groupBy() then creates an aggregation of this maximum user time against the type field, which defines the operation performed.

    The result is a table of the last user and time for a specific operation; for example, the last time a query was executed. An example of this table can be seen below:

    type actor.user.id _max
    alert.clear-error 0O7WGPBX9YbvZbKOrBMd5fgH 1700546666592
    alert.create 0O7WGPBX9YbvZbKOrBMd5fgH 1699004139419
    alert.update 0O7WGPBX9YbvZbKOrBMd5fgH 1700546666676
    dashboard.create 0O7WGPBX9YbvZbKOrBMd5fgH 1698417330709
    dataspace.query 0O7WGPBX9YbvZbKOrBMd5fgH 1700721296197
  3. logscale
    |transpose(header=type)

    The transpose() function will convert individual columns into rows, switching the orientation. For example, the type column will now become the type row. However, there are no row titles, so by default the resulting table has a header row containing the column and row numbers, like this:

    column row[1] row[2] row[3] row[4] row[5]
    _max 1700546666592 1699004139419 1700546666676 1698417330709 1700722209214
    actor.user.id 0O7WGPBX9YbvZbKOrBMd5fgH 0O7WGPBX9YbvZbKOrBMd5fgH 0O7WGPBX9YbvZbKOrBMd5fgH 0O7WGPBX9YbvZbKOrBMd5fgH 0O7WGPBX9YbvZbKOrBMd5fgH
    type alert.clear-error alert.create alert.update dashboard.create dataspace.query

    However, the aggregate grouping, type could be used instead as a valid header for each column. To achieve that, use the header parameter to specify type as the column. The resulting table now looks like this:

    alert.clear-error alert.create alert.update column dashboard.create dataspace.query
    1700546666592 1699004139419 1700546666676 _max 1698417330709 1700722210073
    0O7WGPBX9YbvZbKOrBMd5fgH 0O7WGPBX9YbvZbKOrBMd5fgH 0O7WGPBX9YbvZbKOrBMd5fgH actor.user.id 0O7WGPBX9YbvZbKOrBMd5fgH 0O7WGPBX9YbvZbKOrBMd5fgH
  4. logscale
    |drop(column)

    The table created contains the summarized information pivoted around the user ID and last event time, further summarized by the type of the event. However, there is a field in the event stream, column, which was generated from the old row names before the table was pivoted.

    That field is not needed, and it can be removed by using drop() to remove column from the events.

  5. Event Result set.

Summary and Results

Pivoting an event set of data allows for the information to be displayed and summarized in a format that may make more logical sense as a display format. The final table will look like this:

alert.clear-error alert.create alert.update dashboard.create dataspace.query
1700546666592 1699004139419 1700546666676 1698417330709 1700722210073
0O7WGPBX9YbvZbKOrBMd5fgH 0O7WGPBX9YbvZbKOrBMd5fgH 0O7WGPBX9YbvZbKOrBMd5fgH 0O7WGPBX9YbvZbKOrBMd5fgH 0O7WGPBX9YbvZbKOrBMd5fgH

Deduplicate Content by Field

Deduplicating content based on a specific field

Query
logscale
groupBy(field, function=tail(1))
Introduction

If you want to deduplicate events by a given field, for example to identify a unique list of events for further processing, you can use an aggregate function. In this example, the groupBy() function is used with tail() to use the last value in a sequence of events.

Step-by-Step
  1. Starting with the source repository events.

  2. logscale
    groupBy(field, function=tail(1))

    Groups all events by the specified field and reduces the results using tail() to take only the last value.

  3. Event Result set.

Summary and Results

The query is used to deduplicate events by a given field. This is useful if you want to create a unique list of events for further processing.
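
As a concrete sketch of the generic query above (host is an example field, not one defined by this example), keeping only the most recent event per host would look like:

logscale
groupBy(host, function=tail(1))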

Detect All Occurrences of Event A Before Event B

Detect all occurrences of event A before event B (brute force attack) using the partition() function combined with groupBy()

Query
logscale
head()
| groupBy(
      key,
      function = partition(
          condition=test(status=="success"),
          split="after",
          [
              { status="failure" | count(as=failures) }, 
              range(@timestamp, as=timespan), 
              selectLast(status)
          ]
      )
  )
| failures >= 3
| status = "success"
Introduction

In this example, the partition() function is used with the groupBy() function to detect all occurrences of event A before event B (brute force attack).

The query will detect instances where there were 3 or more failed attempts followed by a successful attempt, regardless of the time between them.

Note that the partition() function must be used after an aggregator function to ensure event ordering. Also note that the events must be sorted in order by timestamp to prevent errors when running the query. It is possible to select any field to use as a timestamp.

Example incoming data might look like this:

@timestamp      key   status
1451606300200   c     failure
1451606300400   c     failure
1451606300600   c     failure
1451606301000   a     failure
1451606302000   a     failure
1451606302200   a     failure
1451606302300   a     failure
1451606302400   b     failure
1451606302500   a     failure
1451606302600   a     success
1451606303200   b     failure
1451606303300   c     success
1451606303400   b     failure
1451606304500   a     failure
1451606304600   a     failure
1451606304700   a     failure
1451606304800   a     success
Step-by-Step
  1. Starting with the source repository events.

  2. logscale
    head()

    Selects the oldest events ordered by time.

  3. logscale
    | groupBy(
          key,
          function = partition(
              condition=test(status=="success"),
              split="after",
              [
                  { status="failure" | count(as=failures) }, 
                  range(@timestamp, as=timespan), 
                  selectLast(status)
              ]
          )
      )

    Groups the events by a specified key (for example, a user ID or IP address), filters for successful events (filters for events that meet the defined condition for the field status that must contain the value success), then splits the data after each successful event. Notice how the condition is provided as a non-aggregate subquery.

    Furthermore, it filters all the failed attempts where the field status contains the value failure.

    Makes a count of all the failed attempts and returns the results in a field named failures, calculates the timespan of the failures, and selects the status of the last event. Calculating the timespan of the failure sequence is useful for analysis.

  4. logscale
    | failures >= 3

    Filters for partitions with 3 or more failures.

  5. logscale
    | status = "success"

    Filters for partitions containing the value success in the status field.

  6. Event Result set.

Summary and Results

The query is used to detect all occurrences of potential brute force attack patterns. It looks for instances where there were 3 or more failed attempts (event A) followed by a successful attempt (event B), regardless of the time between failures. The timespan between each attempt is reported, which could be used to identify brute force attacks.

Sample output from the incoming example data:

key   failures   timespan   status
a     5          1600       success
a     3          300        success
c     3          3100       success

Detect Event A Happening X Times Before Event B

Detect event A happening X times before event B (brute force attack) using the partition() function combined with groupBy()

Query
logscale
head()
| groupBy(
    key,
    function = partition(
        condition=test(status=="success"),
        split="after",
        [
            { status="failure" | count(as=failures) }, 
            range(@timestamp, as=timespan), 
            max(@timestamp), 
            selectLast(status)
        ]
    )
)
| failures >= 3
| status = "success"
Introduction

In this example, the partition() function is used with the groupBy() function to detect event A happening X times before event B (brute force attack).

The query will detect instances where there were 3 or more failed attempts followed by a successful attempt.

Note that the partition() function must be used after an aggregator function to ensure event ordering. Also note that the events must be sorted in order by timestamp to prevent errors when running the query. It is possible to select any field to use as a timestamp.

Example incoming data might look like this:

@timestamp      key   status
1451606300200   c     failure
1451606300400   c     failure
1451606300600   c     failure
1451606301000   a     failure
1451606302000   a     failure
1451606302200   a     failure
1451606302300   a     failure
1451606302400   b     failure
1451606302500   a     failure
1451606302600   a     success
1451606303200   b     failure
1451606303300   c     success
1451606303400   b     failure
1451606304500   a     <no value>
1451606304600   c     failure
1451606304700   c     failure
1451606304800   c     failure
Step-by-Step
  1. Starting with the source repository events.

  2. logscale
    head()

    Selects the oldest events ordered by time.

  3. logscale
    | groupBy(
        key,
        function = partition(
            condition=test(status=="success"),
            split="after",
            [
                { status="failure" | count(as=failures) }, 
                range(@timestamp, as=timespan), 
                max(@timestamp), 
                selectLast(status)
            ]
        )
    )

    Groups the events by a specified key (for example, a user ID or IP address), then splits the sequence of events after each successful event (where the condition status=="success").

    For each partition, it counts the events where status is failure and stores the count in the field failures, finds the range of timestamps in the partition, finds the newest timestamp, and finds the latest status to show whether the partition ended with a success.

  4. logscale
    | failures >= 3

    Filters for partitions that contained 3 or more failures.

  5. logscale
    | status = "success"

    Filters for partitions with the value success in the status field to ensure that the final status is a success.

  6. Event Result set.

Summary and Results

The query is used to detect instances where there are 3 or more failed attempts followed by a successful attempt. The query can be used to detect a brute force attack where an attacker tries multiple times before succeeding. Note that the effectiveness of this query depends on the nature of your data and the typical patterns in your system.

Sample output from the incoming example data:

key   failures   timespan   status
a     5          1600       success
a     3          300        success
c     3          3100       success

Detect Event A Happening X Times Before Event B Within a Specific Timespan

Detect event A happening X times before event B within a specific timespan using the slidingTimeWindow() function combined with groupBy()

Query
logscale
head()
| groupBy(
    key,
    function=slidingTimeWindow(
        [{status="failure" | count(as=failures)}, selectLast(status)],
        span=3s
    )
  )
| failures >= 3
| status = "success"
Introduction

In this example, the slidingTimeWindow() function is used with the groupBy() function to detect event A happening X times before event B within a specific timespan.

The query will detect instances where there are 3 or more failed attempts followed by a successful attempt, all occurring within a 3-second window.

Note that the slidingTimeWindow() function must be used after an aggregator function to ensure event ordering. Also note that the events must be sorted in order by timestamp to prevent errors when running the query. It is possible to select any field to use as a timestamp.

Example incoming data might look like this:

@timestamp      key   status
1451606300200   c     failure
1451606300400   c     failure
1451606300600   c     failure
1451606301000   a     failure
1451606302000   a     failure
1451606302200   a     failure
1451606302300   a     failure
1451606302400   b     failure
1451606302500   a     failure
1451606302600   a     success
1451606303200   b     failure
1451606303300   c     success
1451606303400   b     failure
1451606304500   a     failure
1451606304600   a     failure
1451606304700   a     failure
1451606304800   a     success
Step-by-Step
  1. Starting with the source repository events.

  2. logscale
    head()

    Selects the oldest events ordered by time.

  3. logscale
    | groupBy(
        key,
        function=slidingTimeWindow(
            [{status="failure" | count(as=failures)}, selectLast(status)],
            span=3s
        )
      )

    Groups the events by a specified key (for example, a user ID or IP address), then creates a sliding time window of 3 seconds (with a span of 3 seconds).

    Within each window, it counts the events where the field status contains the value failure, returning the result in a field named failures, and selects the status of the last event.

  4. logscale
    | failures >= 3

    Filters for windows with 3 or more failures.

  5. logscale
    | status = "success"

    Filters for partitions containing the value success in the status field.

  6. Event Result set.

Summary and Results

The query is used to detect event A happening X times before event B within a specific timespan. It looks for instances where there were 3 or more failed attempts followed by a successful attempt, all occurring within a 3-second window. Using a sliding time window of 3 seconds provides a more precise time constraint compared to the usage of partition() in Detect Event A Happening X Times Before Event B.

The query can be used to detect potential brute force attack patterns within a specific timeframe. Note that the effectiveness of this query depends on the nature of your data and the typical patterns in your system.

Sample output from the incoming example data:

key   failures   status
a     5          success
a     7          success

Filter Out Fields With No Value

Filter out fields with no values from search results

Query
logscale
method=GET
groupBy(field=[method, statuscode], function=count(as=method_total))
sort([method, statuscode], order=asc)
FieldName!=""
Introduction

It is possible to filter out fields with no values in a given returned search result. In this example, all statuscode fields containing no value are filtered out from the final search result.

Example incoming data might look like this:

method   statuscode   method_total
GET      <no value>   10
GET      200          32492
GET      301          1
GET      304          113
GET      403          9
GET      404          132

Step-by-Step
  1. Starting with the source repository events.

  2. logscale
    method=GET

    Filters for all events with methods of the type GET.

  3. logscale
    groupBy(field=[method, statuscode], function=count(as=method_total))

    Groups the returned results by the method and statuscode fields and makes a count of the events, returning the count in a new field named method_total.

  4. logscale
    sort([method, statuscode], order=asc)

    Sorts the returned results in ascending order.

  5. logscale
    FieldName!=""

    Excludes all events where the given field (statuscode in this example) does not contain a value.

  6. Event Result set.

Summary and Results

The query is used to filter out fields not containing any values from the returned search result.

Sample output from the incoming example data:

method   statuscode   method_total
GET      200          32492
GET      301          1
GET      304          113
GET      403          9
GET      404          132
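
As a concrete sketch of the generic FieldName placeholder in the query above, using the statuscode field from this example, the final filter line could be written as:

logscale
statuscode!=""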

Find Fields With Data in Class

Query

Search Repository: humio

logscale
wildcard(field=class,pattern="*Data*")
| groupBy(class)
Introduction

Find all events containing any Data string in their class, and count the occurrences for each class that is found. For example, it can be used to get a list of events that have items such as DataIngestRateMonitor, or LocalDatasource.

Step-by-Step
  1. Starting with the source repository events.

  2. logscale
    wildcard(field=class,pattern="*Data*")

    Searches the incoming data and lists all events whose class field contains the string Data (with any characters before or after it).

  3. logscale
    | groupBy(class)

    Takes the events extracted from the search and groups them by the class field.

  4. Event Result set.

Summary and Results

The result is an aggregated count of all events matching anything with Data (with one or more characters before or after), in the class field.

class                                       _count
c.h.c.c.ChatterDataMemoryStatusLoggerJob$   283
c.h.d.DataIngestRateMonitor$                7504
c.h.d.LocalDatasource$                      10352
c.h.d.q.EmptyIdleDatasourcesCleaner         3
c.h.e.e.Datasource$                         3947
c.h.e.e.Datasources$                        4
c.h.e.f.DataSnapshotOps$                    662
c.h.e.f.DataWithGlobal                      7254
c.h.j.CleanupDatasourceFilesJob             141
c.h.j.DataSyncJobImpl$                      46594
c.h.j.DatasourceRehashingJob$               32
c.h.k.ChatterDataDistributionKafka$         107

Find Fields With S3Bucket in Class

Query

Search Repository: humio

logscale
wildcard(field=class, pattern="*S3Bucket*", ignoreCase=true)
| groupBy(class)
Introduction

Find all events containing any S3Bucket item (and all before and after) in their class, and count the occurrences for each class that is found.

Step-by-Step
  1. Starting with the source repository events.

  2. logscale
    wildcard(field=class, pattern="*S3Bucket*", ignoreCase=true)

    Searches the incoming data and lists all events whose class field contains the string S3Bucket (with any characters before or after it, case-insensitive).

  3. logscale
    | groupBy(class)

    Takes the events extracted from the search and groups them by the class field.

  4. Event Result set.

Summary and Results

The result is an aggregated count of all events matching anything with S3Bucket, case-insensitive, in the class field.

class                                  _count
c.h.b.s.S3BucketStorageCleaningJob     197
c.h.b.s.S3BucketStorageFileUpLoader    2329
c.h.b.s.S3BucketStorageUploadJob       3869
Figure 189. Search S3Bucket With wildcard()


Find Matches in Array Given a Regular Expression - Example 1

Use regular expressions to search for and match specific patterns in flat arrays

Query
logscale
array:regex("incidents[]", regex="^Cozy Bear.*")
| groupBy(host)
Introduction

In this example, the regular expression is used to search for patterns where the value Cozy Bear appears in a certain position across arrays.

Example incoming data might look like this:

host   incidents[0]   incidents[1]   incidents[2]
v1     Evil Bear      Cozy Bear
v15    Fancy Fly      Tiny Cat       Cozy Bears
v22    Fancy Fly      Tiny Cat       Cold Bears
v4     Fancy Fly      Tiny Cat       Cozy Bearskins
v1     Evil Bear      Cozy Bears
Step-by-Step
  1. Starting with the source repository events.

  2. logscale
    array:regex("incidents[]", regex="^Cozy Bear.*")

    Searches the incidents array for values that start with Cozy Bear, finding all matches for that regular expression.

  3. logscale
    | groupBy(host)

    Groups the returned results by host.

  4. Event Result set.

Summary and Results

Queries using regular expressions like this are a quick way to search for and return results matching specific values in arrays. Regular expressions are useful when searching for different strings containing the same patterns, such as social security numbers, URLs, email addresses, and other strings that follow a specific pattern.

Sample output from the incoming example data:

host   _count
v1     2
v15    1
v4     1

Find Matches in Array Given a Regular Expression - Example 2

Use regular expressions to search for and match specific patterns ignoring case in flat arrays

Query
logscale
array:regex("responses[]", regex="bear$", flags="i")
Introduction

In this example, the regular expression is used to search for patterns where the value bear appears at the end of a value in an array element, ignoring the case.

Step-by-Step
  1. Starting with the source repository events.

  2. logscale
    array:regex("responses[]", regex="bear$", flags="i")

    Searches the responses array for values that end with bear, ignoring the case (due to the i flag).

  3. Event Result set.

Summary and Results

Queries using regular expressions like this are a quick way to search for and return results matching specific values in arrays. Regular expressions are useful when searching for different strings containing the same patterns, such as social security numbers, URLs, email addresses, and other strings that follow a specific pattern.

Find Minimum And Maximum Values of any Numerical Field in Session

Find minimum and maximum values of any numerical field in a session using the session() function

Query
logscale
groupBy(cookie_id, function=session([max(bet),min(bet)]))
Introduction

In this example, the session() function is used to find minimum and maximum values of the field bet in a session. The session() function groups events by a given timespan.

Example incoming data might look like this:

timestamp           | cookie_id | bet    | action_type | category
2025-05-15 05:30:00 | user123   | 25.99  | purchase    | electronics
2025-05-15 05:32:00 | user123   | 49.99  | purchase    | electronics
2025-05-15 05:34:00 | user123   | 15.99  | purchase    | accessories
2025-05-15 05:48:00 | user123   | 99.99  | purchase    | appliances
2025-05-15 05:49:00 | user123   | 150.00 | purchase    | furniture
2025-05-15 05:35:00 | user456   | 75.50  | purchase    | clothing
2025-05-15 05:37:00 | user456   | 199.99 | purchase    | appliances
2025-05-15 05:40:00 | user456   | 89.99  | purchase    | electronics
2025-05-15 05:30:00 | user789   | 10.99  | purchase    | books
2025-05-15 05:55:00 | user789   | 20.99  | purchase    | books
Step-by-Step
  1. Starting with the source repository events.

  2. logscale
    groupBy(cookie_id, function=session([max(bet),min(bet)]))

    Groups events by the field cookie_id (unique user identifier) and creates sessions with a 15-minute inactivity timeout (the default value of the maxpause parameter), then calculates the maximum and minimum values of the field bet for each session, returning the results in new fields named _max and _min.

  3. Event Result set.

Summary and Results

The query is used to analyze the likelihood (the bet) of the behavior within user sessions. This query is, for example, useful for identifying if the event was an attempt to hack the system.

Sample output from the incoming example data:

cookie_id   _max     _min
user123     49.99    15.99    // First session
user123     150.00   99.99    // Second session
user456     199.99   75.50    // Single session
user789     10.99    10.99    // First session
user789     20.99    20.99    // Second session

Note that each session shows its own min/max values.

Get List of Status Codes

Get list of status codes returned and a count of each for a given period using the groupBy() function with count()

Query
logscale
groupBy(field=status, function=count())
Introduction

In this example, the groupBy() function is used to get a list of status codes for logged events. For instance, the status code 200 is returned when the request is successful, and 404 when the page is not found.

Step-by-Step
  1. Starting with the source repository events.

  2. logscale
    groupBy(field=status, function=count())

    Groups events by the status field, and counts the number of events in each group.

    It is possible to enhance the query for more detailed analysis. For example, the following query groups by both the status and source fields, limits the result to 1000 groups, and sorts by count in descending order:

    groupBy([field=status, field=source], function=count(), limit=1000)
    | sort(_count, order=desc)

  3. Event Result set.

Summary and Results

The query is used to extract a list of status codes, each with a count of how many events have that status. The query is useful for summarizing and analyzing log data.

Sample output from the incoming example data:

status    _count
101       17
200       46183
204       3
307       1
400       2893
401       4
Failure   1
Success   8633

Hourly Data Events

Summarize events by providing a count of the number of data events per hour using the time:hour() function

Query
logscale
hr := time:hour(field="@ingesttimestamp")
|groupBy(hr)
Introduction

In this example, the time:hour() function is used with groupBy() to count data events per hour of the day.

Step-by-Step
  1. Starting with the source repository events.

  2. logscale
    hr := time:hour(field="@ingesttimestamp")

    Gets the hour (24-hour clock) of the values in the @ingesttimestamp and returns the results in a new field named hr.

  3. logscale
    |groupBy(hr)

    Groups the returned results by hr field and provides a count of the number of data events per hour in a _count field.

  4. Event Result set.

Summary and Results

The query is used to count the number of data events for each hour of the day. The results can be plotted onto a bar chart.
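
As a small optional extension (not part of the original example), the hour buckets can be sorted numerically before charting:

logscale
hr := time:hour(field="@ingesttimestamp")
| groupBy(hr)
| sort(hr, order=asc)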

Preview Content in a Lookup File With readFile() and Filter With !join()

Preview content in a lookup file in the search portion of a repo and filter for specific data with the !join() function

Query
logscale
readFile("host_names.csv")
| !join(query={groupBy(host_name)}, field=host_name, key=host_name, include=[host_name, id])
Introduction

In this example, the readFile() function is used to look up a host_names.csv file, and then filter for host names that do not send any logs.

Example incoming data might look like this:

|--------------------|
| host_name, host_id |
| DESKTOP-VSKPBK8, 1 |
| FINANCE, 2         |
| homer-xubuntu, 3   |
| logger, 4          |
| DESKTOP-1, 5       |
| DESKTOP-2, 6       |
| DESKTOP-3, 7       |
|--------------------|
Step-by-Step
  1. Starting with the source repository events.

  2. logscale
    readFile("host_names.csv")

    Displays the content of the .csv file.

  3. logscale
    | !join(query={groupBy(host_name)}, field=host_name, key=host_name, include=[host_name, id])

    Filters for host names that do not send any logs.

  4. Event Result set.

Summary and Results

The query is used to preview content in CSV Lookup Files, and then filter for host names that do not send any logs.

Sample output from the incoming example data:

host_id   host_name
5         DESKTOP-1
6         DESKTOP-2
7         DESKTOP-3

Search Across Multiple Structured Fields

Search across multiple structured fields using the transpose() function within groupBy()

Query
logscale
groupBy(@id, function=transpose())
| row[1] = /httpd/
| groupBy(column)
Introduction

By transposing the event set, the information can be viewed and summarized in a more human-readable form. In this example, the transpose() function is used within a groupBy() function to search across multiple structured fields in the HUMIO repository.

Example incoming data might look like this:

host       | @rawstring
MAIL01     | 2025-03-18T10:14:47.142Z MAIL01 httpd[61789]: 192.168.0.198 - - [2025-03-13:23:05:48 +0800] "GET /api/v2/products/search HTTP/1.1" 200 33456
LON-SRV01  | 2025-03-18T10:14:46.940Z LON-SRV01 httpd[60123]: 192.168.0.198 - - [2025-03-13:20:50:14 +0500] "GET /uploads/documents/terms.pdf HTTP/1.1" 401 36912
MAIL01     | 2025-03-18T10:14:46.691Z MAIL01 httpd[51234]: 192.168.0.198 - - [2025-03-13:12:50:16 -0300] "GET /downloads/mobile/app-v2.1.apk HTTP/1.1" 403 1234
SYD-SRV01  | 2025-03-18T10:14:46.542Z SYD-SRV01 httpd[45678]: 192.168.1.123 - - [2025-03-13:19:30:17 +0400] "GET /uploads/avatars/default.png HTTP/1.1" 404 61789
PROD-SQL01 | 2025-03-18T10:14:46.141Z PROD-SQL01 httpd[56789]: 192.168.1.245 - - [2025-03-13:17:30:38 +0200] "GET /uploads/avatars/default.png HTTP/1.1" 200 13456

Step-by-Step
  1. Starting with the source repository events.

  2. logscale
    groupBy(@id, function=transpose())

    Groups events by unique @id values, applies the transpose() function for each group, converting row values into column headers. A new row-based structure for each @id field is created.

    After using transpose(), the data might look like this:

    @id                                       | column    | row[1]
    xhCUZ4fQzxEbr3qoyzLIQqSE_0_520_1742292886 | @timezone | Z
    xhCUZ4fQzxEbr3qoyzLIQqSE_0_520_1742292886 | app       | httpd[56789]:
    xhCUZ4fQzxEbr3qoyzLIQqSE_0_520_1742292886 | host      | PROD-SQL01
    xhCUZ4fQzxEbr3qoyzLIQqSE_0_520_1742292886 | msg       | 192.168.1.245 - - [2025-03-13:17:30:38 +0200] "GET /uploads/avatars/default.png HTTP/1.1" 200 13456
    xhCUZ4fQzxEbr3qoyzLIQqSE_0_520_1742292886 | priority  | 34
    xhCUZ4fQzxEbr3qoyzLIQqSE_0_520_1742292886 | version   | 1

  3. logscale
    | row[1] = /httpd/

    Filters for events where the value of row[1] matches the regular expression /httpd/.

    After filtering, the data might look like this (@rawstring has been removed from the below for clarity):

    @id                                       | column | row[1]
    xhCUZ4fQzxEbr3qoyzLIQqSE_0_520_1742292886 | app    | httpd[56789]:
    xhCUZ4fQzxEbr3qoyzLIQqSE_0_528_1742292886 | app    | httpd[45678]:
    xhCUZ4fQzxEbr3qoyzLIQqSE_0_531_1742292886 | app    | httpd[51234]:
    xhCUZ4fQzxEbr3qoyzLIQqSE_0_536_1742292886 | app    | httpd[60123]:
    xhCUZ4fQzxEbr3qoyzLIQqSE_0_540_1742292887 | app    | httpd[61789]:
  4. logscale
    | groupBy(column)

    Groups results by the column field, showing which original fields contained the value httpd, and makes a count of matches per field, returning the counted results in a field named _count. The final groupBy(column) removes duplicate entries.

  5. Event Result set.

Summary and Results

The query is used to search across multiple structured fields in the HUMIO repository to show where httpd appears most often. It makes results more readable, identifies field patterns, and is very useful for statistical analysis.

Sample output from the incoming example data:

column       _count
@rawstring   5
app          5

Sort Timestamps With groupBy()

Sorting fields based on aggregated field values

Query

Search Repository: humio

logscale
timestamp := formatTime(format="%H:%M")
| groupBy([thread],
function=[{sort("timestamp")
| collect("timestamp")}])
Introduction

When using aggregation, you may want to sort on a field that is part of the aggregated set but not the main feature of the aggregated value, for example, sorting the values by their timestamp rather than the embedded value. To achieve this, use a function that sorts on the field to be used as the sort field, and then use collect() so that the value from before the aggregation can be displayed in the generated event set. This query can be executed in the humio repository.

Step-by-Step
  1. Starting with the source repository events.

  2. logscale
    timestamp := formatTime(format="%H:%M")

    Creates a new field, timestamp, formatted as HH:MM.

  3. logscale
    | groupBy([thread],

    Groups the events, first by the name of the thread and then the formatted timestamp.

  4. logscale
    function=[{sort("timestamp")
    | collect("timestamp")}])

    Uses sort() combined with collect() as the method of aggregation. As an embedded expression for the function, this will sort the events on the timestamp field and then retrieve that field, which would normally be removed as part of the aggregation process.

  5. Event Result set.

Summary and Results

The result set will contain a list of the aggregated thread names sorted by the timestamp:

thread                          timestamp
BootstrapInfoJob                10:09
DataSynchJob                    10:09
Global event loop               10:10
LocalLivequeryMonitor           10:09
LogCollectorManifestUpdate      10:09
TransientChatter event loop     10:10
aggregate-alert-job             10:09
alert-job                       10:09
block-processing-monitor-job    10:09
bloom-scheduler                 10:09
bucket-entity-config            10:09
bucket-overcommit-metrics-job   10:09
bucket-storage-download         10:09
bucket-storage-prefetch         10:09
chatter-runningqueries-logger   10:09
chatter-runningqueries-stats    10:09

Transpose a Basic Table

Transposing an event set effectively switches rows (each event) into columns (an event)

Query
logscale
groupBy([loglevel])
| transpose(header=loglevel)
| drop(column)
Introduction

By transposing the event set, the information can be viewed and summarized in a more human-readable form. Transposing also allows aggregated data to be viewed in a form where the values of an aggregated field become the columns. This can be used to summarize the information further by showing multiple rows of data. For example, in the humio repository the event information contains a list of errors, warnings, or informational events for activity within the cluster.

Step-by-Step
  1. Starting with the source repository events.

  2. logscale
    groupBy([loglevel])

    Aggregates the events by loglevel. This field will be either WARN, ERROR, or INFO depending on the level of the event log entry. The default function is used, which results in a count of the number of times an event of this type has been seen:

    loglevel _count
    ERROR 27424
    INFO 18840156
    WARN 2059898

    In this output, each event is a row in the table, each with two fields, loglevel and _count.

  3. logscale
    | transpose(header=loglevel)

    Transposing the events with transpose() will make each field in each event a row in the new stream of events; for example, the loglevel field with the value ERROR will become the field ERROR, swapping the rows for columns. By using the header parameter, transpose() uses the value of the aggregated field as the field name. The output will now be a table with a column (field) for each value and a single row with the count:

    ERROR INFO WARN column
    97159 63719404 5716733 _count
  4. logscale
    | drop(column)

    In the final output, the column field in the events is the one generated from the field names of the original events and it's not needed, so it can be removed by using the drop() function to remove the field from the event set.

  5. Event Result set.

Summary and Results

The transpose() function is a great way of reorganizing data into a format that is either more readable or more easily applied to other display formats as part of a widget. The final table looks like this:

ERROR INFO WARN
97159 63719404 5716733

However, the information as it is now formatted can more easily be applied to a variety of visualizations. For example, the data can be formatted as a bar chart, as we now have a list of fields and a value.

The difference is that without transpose(), the aggregate result set is a list of events with a field name and value. With transpose(), the result set is a single event with multiple fields, and this is interpreted by the bar chart as a series of values.