Joins

Humio supports the joining of queries using the join() function. It will return a combined results set from two queries it’s given.

This function provides two join modes: inner and left. If you’re familiar with SQL database systems, what would be the left table in SQL is referred to as the primary in Humio, and the right table becomes the subquery.

Inner Joins

Figure 1, Inner Join

An inner join returns any intersection of matching values, based upon the criteria you provide, between the primary query and the subquery. This is illustrated in purple in Figure 1 here. Incidentally, the inner join is the join() function’s default behavior.

As an example, suppose we’re analyzing two repositories to see what kinds of web browsers are generating errors on our web servers. In this example, we want to return the intersection of two types of data: the status codes that return errors (anything 400 or higher) from the accesslog in the current repository, and the web browser type (the user agent) from the weblogs repository. To do this, we can use the following query:

humio
#type=accesslog statuscode>=400
| join({useragent != ""}, field=sourceipaddress, key=browseripaddress, repo=weblogs)

The primary query here is the first line. It says to search the current working repository for logs of type accesslog, where statuscode >= 400.

Next, it pipes the primary query to the join() function. It passes in {useragent != ""} as the subquery argument to the join() function. This subquery returns any value for the useragent that is not an empty string.

To join the result sets of these two queries together we will need a way to match them. If our accesslog and weblogs both contain IP addresses, that would be a good option and is the one we would use in this example. However, in this example the two repositories use different field names for the IP address. The accesslog uses sourceipaddress, while the weblogs repository uses browseripaddress. This query defines that relationship by passing in the field used in the primary query, and the key that will be matched against in the subquery—i.e., field=sourceipaddress, and key=browseripaddress.

To define the external repository that the subquery will run against, the query here simply passes that repository’s name in with the repo parameter, repo=weblogs.

As for the results, the primary query will find all accesslog entries with an error code. The subquery will find all entries in the weblogs repository with a value present for the useragent. However, being an inner join, the combined results will only be returned for entries where the IP addresses match in both queries.

Left Joins

Figure 2, Left Join

The left join returns everything in the primary and anything that matches from the subquery. In other words, a query using a left join returns any intersection of matching values, based upon the criteria you provide, between the primary query and the subquery (like an inner join) plus all the values of the primary query. This is illustrated below.

As an example of a left join, suppose we want to get as much information as possible about people who are accessing hosts. The information we have may or may not contain the location of the user, but if it does, we need to include that in our query results.

Since location information is desirable, although optional, but all access logs are required a left join query would better fit our requirements than an inner join query. Here is what it would look like

humio
#type=accesslog
| join({username=*}, field=host, key=hostname, repo=auditlogs, mode=left)
| table(timestamp, username, host, location)

In this example, we get all results from our current repository using #type=accesslog as the primary query. Our subquery (username=*) states that we wish to return all usernames. As in the previous example, we need a common point of data to map the relationship between our repositories. We’ll use the host, again, defining the differing field names used in the two repositories for that data. We passed those into the join() function as field=host, key=hostname. We define our external repository with repo=auditlogs.

Since we want all the results from the primary query, even if there is no match with the subquery, we specify a left join using mode=left.

Finally, we display the results in a table by piping the results to table(timestamp, username, host, location).

Syntax

The join() function’s signature looks like this

humio
join({subquery}, field=arg1, key=arg2, repo=arg3, mode=inner|left)

The unnamed subquery argument, and the named field and key arguments, are required. The named repo, include, and mode parameters are optional. There are other options available. Please see the Join Query Functions on join() reference page for more details.

First, let’s consider a join query against a single repository. In that scenario, there is no need to define the optional repo parameter. The query will look like this

humio
primary query | join({subquery}, field=arg1, key=arg2)

So to break down how the join() function is used

  • The primary query is any valid Humio query. This provides the first set of results you wish to work with.

  • The arguments you pass to the join() function define what results should be combined and how.

  • The subquery is any valid Humio query. This provides the second set of results you wish to combine with those from your primary query.

  • The field parameter’s value (arg1) defines the field name in the primary query. This value will be used to match data in the subquery.

  • The key parameter’s value (arg2) defines the field name in the subquery. This value will be used to match data in the primary query.

If the field names being used in field and key are identical, you can use the simplified syntax

humio
primary query | field =~ join({subquery})

In the above example, the field argument would be the actual field name. For example, if both queries share a field named host that you wish to join on, you would use:

humio
primary query | host =~ join({subquery})

Joining Two Repositories

In addition to joining two queries against a single repository, the join() function can also be used to return a result set from more than two repositories. To do so, use the optional repo parameter to define the repository (as arg3) you want the subquery to run against.

humio
primary query
| join({subquery}, field=arg1, key=arg2, repo=arg3)

Note

Note that the user running the join() query must have permissions on both repositories.

Limitations of Live Joins

Joins are implemented by starting two queries: the main query and the subquery. For static queries, the subquery runs first and then the main query runs. For live queries, the historic part is done as in the static case, whereas the main query and subquery will run simultaneously for the live part.

Because of this and the distributed nature of Humio, subqueries in live queries are not guaranteed to “see” events before the main query does. In practice this can be thought of as there being a small delay (typically in the order of 5-6 seconds) from an event being ingested to it hitting the subquery. This can lead to surprising results when making live joins on events that occur very close in time.

For example, a query that contains only the count() function will typically result in a larger count than the query below, which joins events to themselves:

humio
join({*}, field=id) | count()

This is because at the time the main query runs for a given event, the subquery has not seen that event yet and the join will therefore not emit the event. You can avoid this by not joining events that are ingested close in time, using static queries, or by using repeating queries().

Best Practices for Join Queries

Humio recommends the following when running join() queries

  • If your join() query key is constantly changing with live data, use live queries to keep the query updated.

  • When debugging an inner join query, temporarily set the join mode to left (join(mode=left)). This will help you debug your query by showing all results matched or unmatched by the join clause.

  • Another suggestion is to first write the subquery alone to make sure the results are what you expect. Once they are, then you can confidently include it in a join.