selfJoin() Query Function

This function is used to collate data from events that share a key. Often the groupby function can be used for this, but if there are too many keys (defaulting to 20000) then the result is imprecise since some random subset of keys is left out of the result once the limit is reached.

Say you really want to run a query like this:

humio
groupby(field=email_id, function=[collect(from), collect(to)]) | from=peter AND to=anders

If there are many email_ids then the state size is exhausted. Even if you rewrite it like this to:

humio
from=peter OR to=anders | groupby(field=email_id, function=[collect(from), collect(to)]) | from=peter AND to=anders

because either there are many emails from peter or to anders.

With selfJoin() you specify a join key (the field argument), and a series of tests (the where clauses); it will then essentially perform the above operation in a two-phase way so that only those log lines with an email_id for which there exists both a from=peter event and a to=anders event are passed into the groupby. This is done in a probabilistic fashion, by using a bloom filter.

The embedded groupby then does either collect or selectLast on the specified fields depending on the value of one of the select or collect parameter to the selfJoin function. As such, the above complex query can be done at scale using something like:

humio
selfJoin(field=email_id, where=[{from=peter},{to=anders}])

SelfJoin() limits the number of matching join keys to what is configured in MAX_STATE_LIMIT. Default is 20000. These limits apply to the number of join key values that satisfy the collective requirements of the where clauses. Say, in the example of emails it would be the number of emails from peter to anders.

SelfJoin() is probabilistic and the result can contain false positives.

matches

false positive rate

number of false positives

1000

0.00000%

0.0

10000

0.00029%

0.0

20000

0.00224%

0.4

25000

0.00434%

1.1

50000

0.03289%

16.4

If for example the where clauses (along with any preceding filtering) limit the matching IDs to 25,000 elements, then 1.1 of those will be false positives on average.

This function does two passes over the data and can therefore not be used in a live query unless in combination with beta:repeating().

Parameters

Name

Type

Required

Default

Description

where

[Filter]

Yes

The subquery to execute producing the values to join with.

field

[string]

Yes

Specifies which field in the event (log line) that must match the given column value.

select

[string]

No

Specifies columns to include. Default to none.

collect

[string]

No

Specifies columns to include. Default to none.

limit

number

No

20000

Specifies the maximum number of rows in the subquery 1..MAX_STATE_LIMIT

prefilter

boolean

No

false

Only pass values matching at least one of the where clauses into the embedded groupby.

postfilter

boolean

No

false

Re-run the AND of the where clauses after collating results. If all fields needed for satisfying the where clauses are provided as values for collect, this will eliminate false positives in the output.

The implied parameter is field.

Examples

Select emails from an eventstream where each email is split into one event per header, such as the IronPort email log. Here, we find all emails that are from peter to anders.

humio
selfJoin(email_id, where=[{from=*peter*}, {to=*anders*}], collect=[from,to])