This function is used to collate data from events that share a key. Often
the groupBy()
function can be used for this, but if
there are too many keys (defaulting to 100,000) then the result is
imprecise since some random subset of keys is left out of the result once
the limit is reached.
Parameter | Type | Required | Default | Description |
---|---|---|---|---|
collect | Array of strings | optional[a] | (none) | Specifies columns to include. |
field [b] | Array of strings | required | Specifies which field in the event (log line) that must match the given column value. | |
limit | number | optional[a] | 20000 | Specifies the maximum number of rows in the subquery |
Minimum | 1 | |||
Maximum | MAX_STATE_LIMIT | |||
postfilter | boolean | optional[a] | false | Re-run the and of the where clauses after collating results. If all fields needed for satisfying the where clauses are provided as values for collect, this will eliminate false positives in the output. |
prefilter | boolean | optional[a] | false | Only pass values matching at least one of the where clauses into the embedded groupBy() . |
select | Array of strings | optional[a] | (none) | Specifies columns to include. |
where | [Filter] | required | The subquery to execute producing the values to join with. | |
[a] Optional parameters use their default value unless explicitly set |
Omitted Argument NamesThe argument name for
field
can be omitted; the following forms of this function are equivalent:logscaleselfJoin("field",where="value")
and:
logscaleselfJoin(field="field",where="value")
These examples show basic structure only; full examples are provided below.
Say you want to run a query like this:
groupby(field=email_id, function=[collect(from), collect(to)])
| from=peter AND to=anders
If there are many email_ids then the state size is exhausted. Even if you rewrite it like this to:
from=peter OR to=anders
| groupby(field=email_id, function=[collect(from), collect(to)])
| from=peter AND to=anders
because either there are many emails from Peter or to Anders.
With selfJoin()
you specify a join key (the
field
argument), and a series of tests (the
where
clauses); it will then essentially perform the
above operation in a two-phase way so that only those log lines with an
email_id
for which there exists both a
from=peter
event and a
to=anders
event are passed into the
groupBy()
. This is done in a probabilistic fashion,
by using a bloom filter.
The embedded groupBy()
then does either
collect
or selectLast
on the specified
fields depending on the value of one of the select
or collect
parameter to the
selfJoin()
function. As such, the above complex query
can be done at scale using something like:
selfJoin(field=email_id, where=[{from=peter},{to=anders}])
selfJoin()
limits the number of matching join keys to
what is configured in
GroupMaxLimit
dynamic
configuration. Default is 20,000. These limits apply to the number of join
key values that satisfy the collective requirements of the
where
clauses. Say, in the example of
emails it would be the number of emails from Peter to Anders.
selfJoin()
is probabilistic and the result can
contain false positives.
matches | false positive rate | number of false positives |
---|---|---|
1000 | 0.00000% | 0.0 |
10000 | 0.00029% | 0.0 |
20000 | 0.00224% | 0.4 |
25000 | 0.00434% | 1.1 |
50000 | 0.03289% | 16.4 |
If for example the where
clauses (along
with any preceding filtering) limit the matching IDs to 25,000 elements,
then 1.1 of those will be false positives on average.
Warning
The selfJoin()
does two passes over the data and
can therefore not run truly live.
selfJoin()
Examples
Select emails from an eventstream where each email is split into one
event per header, such as the IronPort
email log. Here, we find all emails that are from Peter to Anders.
selfJoin(email_id, where=[{from=*peter*}, {to=*anders*}], collect=[from,to])