Join Operation and Optimization

When using join() and selfJoin():

  • Query Execution and Performance

    When executing a join that involves queries (i.e. using join()), LogScale executes the overall query inside out. That is, the sub-query is executed first in order to create the event dataset that is then used to match against the primary query.

    For best performance, the sub-query of the join operation should be significantly smaller than the primary query. This can be archived by doing a lot of filtering in the sub-query to restrict the search results. With a smaller sub-query, the number of lookups and matches against the query on the left will be reduced resulting in better performance.

  • Negating Joins

    Joins can be negated, so that the query expolicitly returns the rows that dont create a valid match. For example, when performing the inner join using the weblog and authentication repositories, the result did not include the weblgo entry that was missing a authenticated user. To get the rows that dont provide a match, use a negated join:

    logscale
    type=weblog
    | !join({type="authentication"},field=authuser,key=user,mode=inner)

    This creates an result set that includes weblog entries without a valid authenticated user:

    authuseripaddrtypeurl
    palvarez10.0.0.84weblogmap.html
    <no value>10.0.0.43weblogindex.html
  • Debugging Inner Joins

    When debugging an inner join query, temporarily set the join mode to left (join(mode=left)). The query will then output the full query set including rows that do not match the inner join clause.

  • Joins and Live Queries

    Using joins in live queries is not recommended. When a join is executed, each query pair (primary and sub query) are executed indepedently. The results may not be computed consistently across the two queries, with the queries are executed against the data being ingested which may have changed between executing each query.

    Because the datasets from the primary query and sub-query are inconsistent, the results of the join are also inconsistent.

    The result is that the two queries are run repeatedly rather than being executed as a typical live query which will be less efficient that within a typical live query execution.

    This execution style is different to a typical SQL database, where the set of rows in the query will be configured to be in repeatable read, so the set of rows is consistent each time.

    For this reason, using join() within an alert is not recommended. A warning will be produced if LogScale identifies that a query contains a join.