Joins

LogScale supports the joining of queries using the join() function. It will return a combined results set from two queries it's given.

This function provides two join modes: inner and left. If you're familiar with SQL database systems, what would be the left table in SQL is referred to as the primary in LogScale, and the right table becomes the subquery.

Inner Joins

Inner Joins

Figure 301. Inner Joins


An inner join returns any intersection of matching values, based upon the criteria you provide, between the primary query and the subquery. This is illustrated in purple in Figure 1 here. Incidentally, the inner join is the join() function's default behavior.

As an example, suppose we're analyzing two repositories to see what kinds of web browsers are generating errors on our web servers. In this example, we want to return the intersection of two types of data: the status codes that return errors (anything 400 or higher) from the accesslog in the current repository, and the web browser type (the user agent) from the weblogs repository. To do this, we can use the following query:

logscale
#type=accesslog statuscode>=400
| join({useragent != ""}, field=sourceipaddress, key=browseripaddress, repo=weblogs)

The primary query here is the first line. It says to search the current working repository for logs of type accesslog, where statuscode >= 400.

Next, it pipes the primary query to the join() function. It passes in {useragent != ""} as the subquery argument to the join() function. This subquery returns any value for the useragent that is not an empty string.

To join the result sets of these two queries together we will need a way to match them. If our accesslog and weblogs both contain IP addresses, that would be a good option and is the one we would use in this example. However, in this example the two repositories use different field names for the IP address. The accesslog uses sourceipaddress, while the weblogs repository uses browseripaddress. This query defines that relationship by passing in the field used in the primary query, and the key that will be matched against in the subquery — i.e., field=sourceipaddress, and key=browseripaddress.

To define the external repository that the subquery will run against, the query here simply passes that repository's name in with the repo parameter, repo=weblogs.

As for the results, the primary query will find all accesslog entries with an error code. The subquery will find all entries in the weblogs repository with a value present for the useragent. However, being an inner join, the combined results will only be returned for entries where the IP addresses match in both queries.

Left Joins

Left Joins

Figure 302. Left Joins


The left join returns everything in the primary and anything that matches from the subquery. In other words, a query using a left join returns any intersection of matching values, based upon the criteria you provide, between the primary query and the subquery (like an inner join) plus all the values of the primary query. This is illustrated below.

As an example of a left join, suppose we want to get as much information as possible about people who are accessing hosts. The information we have may or may not contain the location of the user, but if it does, we need to include that in our query results.

Since location information is desirable, although optional, but all access logs are required a left join query would better fit our requirements than an inner join query. Here is what it would look like

logscale
#type=accesslog
| join({username=*}, field=host, key=hostname, repo=auditlogs, mode=left)
| table(timestamp, username, host, location)

In this example, we get all results from our current repository using #type=accesslog as the primary query. Our subquery (username=*) states that we wish to return all usernames. As in the previous example, we need a common point of data to map the relationship between our repositories. We'll use the host, again, defining the differing field names used in the two repositories for that data. We passed those into the join() function as field=host, key=hostname. We define our external repository with repo=auditlogs.

Since we want all the results from the primary query, even if there is no match with the subquery, we specify a left join using mode=left.

Finally, we display the results in a table by piping the results to table(timestamp, username, host, location).

Join Syntax

The join() function's signature looks like this

logscale
join({subquery}, field=arg1, key=arg2, repo=arg3, mode=inner|left)

The unnamed subquery argument, and the named field and key arguments, are required. The named repo, include, and mode parameters are optional. There are other options available. Please see the Join Query Functions reference page for more details.

First, let's consider a join query against a single repository. In that scenario, there is no need to define the optional repo parameter. The query will look like this

logscale
primary query | join({subquery}, field=arg1, key=arg2)

So to break down how the join() function is used

  • The primary query is any valid LogScale query. This provides the first set of results you wish to work with.

  • The arguments you pass to the join() function define what results should be combined and how.

  • The subquery is any valid LogScale query. This provides the second set of results you wish to combine with those from your primary query.

  • The field parameter's value (arg1) defines the field name in the primary query. This value will be used to match data in the subquery.

  • The key parameter's value (arg2) defines the field name in the subquery. This value will be used to match data in the primary query.

If the field names being used in field and key are identical, you can use the simplified syntax

logscale
primary query | field =~ join({subquery})

In the above example, the field argument would be the actual field name. For example, if both queries share a field named host that you wish to join on, you would use:

logscale
primary query | host =~ join({subquery})

Joining Two Repositories

In addition to joining two queries against a single repository, the join() function can also be used to return a result set from more than two repositories. To do so, use the optional repo parameter to define the repository (as arg3) you want the subquery to run against.

logscale
primary query
| join({subquery}, field=arg1, key=arg2, repo=arg3)

Note

Note that the user running the join() query must have permissions on both repositories.

Limitations of Live Joins

Joins are implemented by starting two queries: the main query and the subquery. For static queries, the subquery runs first and then the main query runs. For live queries, the historic part is done as in the static case, whereas the main query and subquery will run simultaneously for the live part.

LogScale's notions of liveness and joins are not compatible, meaning that the results of live queries using the join() function may become incorrect over time. Instead of running these queries fully live, they are internally repeated at an interval determined by the server, which can lead to a delay on new results that depend on the resources the query uses.

Important

You should avoid using join() functions in Alerts, as they fail to detect some matches and therefore would not trigger the alert as expected. Instead, try using Scheduled Searches.

Support for using join() functions in alerts may be removed in future versions.

Best Practices for Join Queries

LogScale recommends the following when running join() queries.

  • If your join() query key is constantly changing with live data, you should be aware that using live queries with the join() function has some significant limitations, the reasons are explained at Limitations of Live Joins and at join().

  • When debugging an inner join query, temporarily set the join mode to left (join(mode=left)). This will help you debug your query by showing all results matched or unmatched by the join clause.

  • First write the subquery alone to make sure the results are what you expect. Once they are, then you can confidently include it in a join.