This function is used to collate data from events that share a key. Often
the groupBy()
function can be used for this, but if
there are too many keys (defaulting to 100,000) then the result is
imprecise since some random subset of keys is left out of the result once
the limit is reached.
Parameter | Type | Required | Default Value | Description |
---|---|---|---|---|
collect | Array of strings | optional[a] | (none) | Specifies columns to include. |
field [b] | Array of strings | required | Specifies which field in the event (log line) that must match the given column value. | |
limit | number | optional[a] | 20000 | Specifies the maximum number of rows in the subquery |
Minimum | 1 | |||
Maximum | MAX_STATE_LIMIT | |||
postfilter | boolean | optional[a] | false | Re-run the and of the where clauses after collating results. If all fields needed for satisfying the where clauses are provided as values for collect, this will eliminate false positives in the output. |
prefilter | boolean | optional[a] | false | Only pass values matching at least one of the where clauses into the embedded groupBy() . |
select | Array of strings | optional[a] | (none) | Specifies columns to include. |
where | [Filter] | required | The subquery to execute producing the values to join with. | |
[a] Optional parameters use their default value unless explicitly set. |
Hide omitted argument names for this function
Omitted Argument NamesThe argument name for
field
can be omitted; the following forms of this function are equivalent:logscaleselfJoin("field",where="value")
and:
logscaleselfJoin(field="field",where="value")
These examples show basic structure only.
Say you want to run a query like this:
groupby(field=email_id, function=[collect(from), collect(to)])
| from=peter AND to=anders
If there are many email_ids then the state size is exhausted. Even if you rewrite it like this to:
from=peter OR to=anders
| groupby(field=email_id, function=[collect(from), collect(to)])
| from=peter AND to=anders
because either there are many emails from Peter or to Anders.
With selfJoin()
you specify a join key (the
field
argument), and a
series of tests (the where
clauses); it will then perform the above operation in a two-phase way so
that only those log lines with an email_id
for which there exists both a from=peter
event and a to=anders
event are passed
into the groupBy()
. This is done in a probabilistic
fashion, by using a bloom filter.
The embedded groupBy()
then uses either
collect()
or selectLast()
on the
specified fields depending on the value of one of the
select
or
collect
parameter to the
selfJoin()
function.
The above query can be simplied (and be made more efficient) by using the
selfJoin()
function:
selfJoin(field=email_id, where=[{from=peter},{to=anders}])
selfJoin()
limits the number of matching join keys to
what is configured in GroupMaxLimit
dynamic configuration.
Default is 20,000. These limits apply to the number of join key values
that satisfy the collective requirements of the
where
clauses. In this case,
the function would return the number of emails from Peter to Anders.
Important
selfJoin()
is probabilistic and the result can
contain false positives.
matches | false positive rate | number of false positives |
---|---|---|
1000 | 0.00000% | 0.0 |
10000 | 0.00029% | 0.0 |
20000 | 0.00224% | 0.4 |
25000 | 0.00434% | 1.1 |
50000 | 0.03289% | 16.4 |
If, for example, the where
clauses (along with any preceding filtering) limit the matching IDs to
25,000 elements, then 1.1 of those will be false positives on average.
Warning
The selfJoin()
does two passes over the data and
can therefore not run truly live.
selfJoin()
Examples
Filter and Collect Values in Same Table
Retrieves all emails sent from one given person to another given person using the selfJoin()
function
Query
selfJoin(email_id, where=[{from=*peter*}, {to=*anders*}], collect=[from,to])
Introduction
The selfJoin()
function is a join query that
matches events across the same event sets.
selfJoin()
joins an event set to itself and
allows you to combine events from the same table based on two
fields in the same event. In order to do this, the event set must
have a common field with a unique ID, a primary field, and a
secondary (or subquery) field that will be matched to each other.
In this example, emails are logged with one event for each header
(each email has its own ID) and the
selfJoin()
function is used to find and
collect all emails sent from one given person to another given
person. Notice, that this query does two passes over the data and,
therefore, cannot be used in a live query.
Step-by-Step
Starting with the source repository events.
- logscale
selfJoin(email_id, where=[{from=*peter*}, {to=*anders*}], collect=[from,to])
Finds and collects all the values in the emails_id field that correspond to mails from
Peter
toAnders
. Event Result set.
Summary and Results
The query is used to find and collect all emails sent from one
given person to another person. In general, the
selfJoin()
function is useful for narrowing
down a set of events in a fairly efficient manner, in cases
where the total set of events is too voluminous.