This function is used to collate data from events that share a key. Often
the groupBy()
function can be used for this, but if
there are too many keys (defaulting to 100,000) then the result is
imprecise since some random subset of keys is left out of the result once
the limit is reached.
Parameter | Type | Required | Default Value | Description |
---|---|---|---|---|
collect | Array of strings | optional[a] | (none) | Specifies columns to include. |
field [b] | Array of strings | required | Specifies which field in the event (log line) that must match the given column value. | |
limit | number | optional[a] | 20000 | Specifies the maximum number of rows in the subquery |
Minimum | 1 | |||
Maximum | MAX_STATE_LIMIT | |||
postfilter | boolean | optional[a] | false | Re-run the and of the where clauses after collating results. If all fields needed for satisfying the where clauses are provided as values for collect, this will eliminate false positives in the output. |
prefilter | boolean | optional[a] | false | Only pass values matching at least one of the where clauses into the embedded groupBy() . |
select | Array of strings | optional[a] | (none) | Specifies columns to include. |
where | [Filter] | required | The subquery to execute producing the values to join with. | |
[a] Optional parameters use their default value unless explicitly set. |
Hide omitted argument names for this function
Omitted Argument NamesThe argument name for
field
can be omitted; the following forms of this function are equivalent:logscaleselfJoin("field",where="value")
and:
logscaleselfJoin(field="field",where="value")
These examples show basic structure only.
Say you want to run a query like this:
groupby(field=email_id, function=[collect(from), collect(to)])
| from=peter AND to=anders
If there are many email_ids then the state size is exhausted. Even if you rewrite it like this to:
from=peter OR to=anders
| groupby(field=email_id, function=[collect(from), collect(to)])
| from=peter AND to=anders
because either there are many emails from Peter or to Anders.
With selfJoin()
you specify a join key (the
field
argument), and a
series of tests (the where
clauses); it will then perform the above operation in a two-phase way so
that only those log lines with an email_id
for which there exists both a from=peter
event and a to=anders
event are passed
into the groupBy()
. This is done in a probabilistic
fashion, by using a bloom filter.
The embedded groupBy()
then uses either
collect()
or selectLast()
on the
specified fields depending on the value of one of the
select
or
collect
parameter to the
selfJoin()
function.
The above query can be simplied (and be made more efficient) by using the
selfJoin()
function:
selfJoin(field=email_id, where=[{from=peter},{to=anders}])
selfJoin()
limits the number of matching join keys to
what is configured in GroupMaxLimit
dynamic configuration.
Default is 20,000. These limits apply to the number of join key values
that satisfy the collective requirements of the
where
clauses. In this case,
the function would return the number of emails from Peter to Anders.
Important
selfJoin()
is probabilistic and the result can
contain false positives.
matches | false positive rate | number of false positives |
---|---|---|
1000 | 0.00000% | 0.0 |
10000 | 0.00029% | 0.0 |
20000 | 0.00224% | 0.4 |
25000 | 0.00434% | 1.1 |
50000 | 0.03289% | 16.4 |
If for example the where
clauses (along with any preceding filtering) limit the matching IDs to
25,000 elements, then 1.1 of those will be false positives on average.
Warning
The selfJoin()
does two passes over the data and
can therefore not run truly live.
selfJoin()
Examples
Select emails from an eventstream where each email is split into one
event per header, such as the IronPort
email log. Here, we find all emails that are from Peter to Anders.
selfJoin(email_id, where=[{from=*peter*}, {to=*anders*}], collect=[from,to])