This function is used to collate data from events that share a
key. Often the groupBy()
function can be
used for this, but if there are too many keys (defaulting to
100,000) then the result is imprecise since some random subset
of keys is left out of the result once the limit is reached.
Parameter | Type | Required | Default Value | Description |
---|---|---|---|---|
collect | array of strings | optional[a] | (none) | Specifies columns to include. |
field [b] | array of strings | required | Specifies which field in the event (log line) that must match the given column value. | |
limit | number | optional[a] | 20000 | Specifies the maximum number of rows in the subquery |
Minimum | 1 | |||
Maximum | MAX_STATE_LIMIT | |||
postfilter | boolean | optional[a] | false | Re-run the and of the where clauses after collating results. If all fields needed for satisfying the where clauses are provided as values for collect, this will eliminate false positives in the output. |
prefilter | boolean | optional[a] | false | Only pass values matching at least one of the where clauses into the embedded groupBy() . |
select | array of strings | optional[a] | (none) | Specifies columns to include. |
where | [filter] | required | The subquery to execute producing the values to join with. | |
[a] Optional parameters use their default value unless explicitly set. |
Say you want to run a query like this:
groupBy(field=email_id, function=[collect(from), collect(to)])
| from=peter AND to=anders
If there are many email_ids then the state size is exhausted. Even if you rewrite it like this to:
from=peter OR to=anders
| groupBy(field=email_id, function=[collect(from), collect(to)])
| from=peter AND to=anders
because either there are many emails from Peter or to Anders.
With selfJoin()
you specify a join key (the
field
argument),
and a series of tests (the
where
clauses); it
will then perform the above operation in a two-phase way so that
only those log lines with an
email_id
for which there exists
both a from=peter
event and a
to=anders
event are passed into
the groupBy()
. This is done in a
probabilistic fashion, by using a bloom filter.
The embedded groupBy()
then uses either
collect()
or
selectLast()
on the specified fields
depending on the value of one of the
select
or
collect
parameter
to the selfJoin()
function.
The above query can be simplied (and be made more efficient) by
using the selfJoin()
function:
selfJoin(field=email_id, where=[{from=peter},{to=anders}])
selfJoin()
limits the number of matching
join keys to what is configured in GroupMaxLimit
dynamic configuration. Default is 20,000. These limits apply to
the number of join key values that satisfy the collective
requirements of the
where
clauses. In
this case, the function would return the number of emails from
Peter to Anders.
Important
selfJoin()
is probabilistic and the
result can contain false positives.
matches | false positive rate | number of false positives |
---|---|---|
1000 | 0.00000% | 0.0 |
10000 | 0.00029% | 0.0 |
20000 | 0.00224% | 0.4 |
25000 | 0.00434% | 1.1 |
50000 | 0.03289% | 16.4 |
If, for example, the
where
clauses
(along with any preceding filtering) limit the matching IDs to
25,000 elements, then 1.1 of those will be false positives on
average.
Warning
The selfJoin()
does two passes over the
data and can therefore not run truly live.
selfJoin()
Examples
Click
next to an example below to get the full details.Filter and Collect Values in Same Table
Retrieves all emails sent from one given person to another given person using the selfJoin()
function