selfJoin()

`selfJoin()`

This function is used to collate data from events that share a key. Often the groupBy() function can be used for this, but if there are too many keys (defaulting to 100,000) then the result is imprecise since some random subset of keys is left out of the result once the limit is reached.

Parameter	Type	Required	Default Value	Description
`collect`	array of strings	optional^[a]	`(none)`	Specifies columns to include.
`field`^[b]	array of strings	required		Specifies which field in the event (log line) that must match the given column value.
`limit`	number	optional^[a]	`20000`	Specifies the maximum number of rows in the subquery
		Minimum	`1`
		Maximum	`MAX_STATE_LIMIT`
		Controlling Variables
		`GroupMaxLimit`		Variable default: `1,000,000 group elements`
		`MAX_STATE_LIMIT`
`postfilter`	boolean	optional^[a]	`false`	Re-run the `and` of the `where` clauses after collating results. If all fields needed for satisfying the `where` clauses are provided as values for collect, this will eliminate false positives in the output.
`prefilter`	boolean	optional^[a]	`false`	Only pass values matching at least one of the `where` clauses into the embedded `groupBy()`.
`select`	array of strings	optional^[a]	`(none)`	Specifies columns to include.
`where`	[filter]	required		The subquery to execute producing the values to join with.
^[a]Optional parameters use their default value unless explicitly set. ^[b]The parameter name `field` can be omitted.

Hide omitted argument names for this function

Show omitted argument names for this function

Omitted Argument Names
The argument name for field can be omitted; the following forms of this function are equivalent:
logscale Syntax
selfJoin(["value"],where="value")
and:
logscale Syntax
selfJoin(field=["value"],where="value")
These examples show basic structure only.

`selfJoin()` Function Operation

Say you want to run a query like this:

logscale

groupBy(field=email_id, function=[collect(from), collect(to)])
| from=peter AND to=anders

If there are many email_ids then the state size is exhausted. Even if you rewrite it like this to:

logscale

from=peter OR to=anders
| groupBy(field=email_id, function=[collect(from), collect(to)])
| from=peter AND to=anders

because either there are many emails from Peter or to Anders.

With selfJoin() you specify a join key (the field argument), and a series of tests (the where clauses); it will then perform the above operation in a two-phase way so that only those log lines with an email_id for which there exists both a from=peter event and a to=anders event are passed into the groupBy(). This is done in a probabilistic fashion, by using a bloom filter.

The embedded groupBy() then uses either collect() or selectLast() on the specified fields depending on the value of one of the select or collect parameter to the selfJoin() function.

The above query can be simplied (and be made more efficient) by using the selfJoin() function:

logscale

selfJoin(field=email_id, where=[{from=peter},{to=anders}])

selfJoin() limits the number of matching join keys to what is configured in GroupMaxLimit dynamic configuration. Default is 20,000. These limits apply to the number of join key values that satisfy the collective requirements of the where clauses. In this case, the function would return the number of emails from Peter to Anders.

Important

selfJoin() is probabilistic and the result can contain false positives.

matches	false positive rate	number of false positives
1000	0.00000%	0.0
10000	0.00029%	0.0
20000	0.00224%	0.4
25000	0.00434%	1.1
50000	0.03289%	16.4

If, for example, the where clauses (along with any preceding filtering) limit the matching IDs to 25,000 elements, then 1.1 of those will be false positives on average.

Warning

The selfJoin() does two passes over the data and can therefore not run truly live.

`selfJoin()` Examples

Click + next to an example below to get the full details.

Filter and Collect Values in Same Table

Retrieves all emails sent from one given person to another given person using the selfJoin() function

Query

logscale

selfJoin(email_id, where=[{from=*peter*}, {to=*anders*}], collect=[from,to])

Introduction

In this example, emails are logged with one event for each header (each email has its own ID) and the selfJoin() function is used to find and collect all emails sent from one given person to another given person. Notice, that this query does two passes over the data and, therefore, cannot be used in a live query.

Step-by-Step

Starting with the source repository events.
logscale
```
selfJoin(email_id, where=[{from=*peter*}, {to=*anders*}], collect=[from,to])
```
Finds and collects all the values in the emails_id field that correspond to mails from Peter to Anders.
Event Result set.

Summary and Results

The query is used to find and collect all emails sent from one given person to another person. In general, the selfJoin() function is useful for narrowing down a set of events in a fairly efficient manner, in cases where the total set of events is too voluminous.

Versions of this Page

Data Analysis Overview

LogScale User Interface

Manage Repositories and Views

Parse Data

Search Data

Write Queries

Automation

Query Language Syntax

Query Functions

Template Language

Keyboard Shortcuts

`selfJoin()`

`selfJoin()` Function Operation

Important

`selfJoin()` Examples

Filter and Collect Values in Same Table

Query

Introduction

Step-by-Step

Summary and Results

Similar Content

Related Functions

Related KB Articles

Related Query Examples

Enter search term

Data Analysis Overview

LogScale User Interface

Manage Repositories and Views

Parse Data

Search Data

Write Queries

Automation

Query Language Syntax

Query Functions

Template Language

Keyboard Shortcuts

selfJoin() Function Operation

Important

selfJoin() Examples

Filter and Collect Values in Same Table

Query

Introduction

Step-by-Step

Summary and Results

Enter search term

`selfJoin()`

`selfJoin()` Function Operation

`selfJoin()` Examples