How-To: On Correlating Events

So you've learned to use LogScale - you can find those events of interest, you can aggregate metrics and display them in time charts and pie charts and whatnot, and you feel a certain level of mastery of the query language.

But then you run into a situation where you need to correlate two events.

Not just find one event, and not just compute some average or other aggregate value across a set of similar events, but to link two individual events together.

And you may feel that you've possibly run into the next stage of the learning curve.

And you may be right. This is at the very least an area where there is a lot of "it depends".

For these kinds of questions, there is usually an underlying decision about whether an actual join query is needed, or whether it can be solved by other means. There are some trade-offs, which I will come back to. For now, let's just note that for the first steps, we'll avoid actual join queries and see how far we can go without.

The simple cases — hidden aggregates

In certain use cases, what we're looking for is two events which are each "the most extreme" in some fashion. In those cases, it's still just a case of aggregating over events — just not with simple numerical aggregators over values.

An example of such a question: "I need to find the length of each session — from login to logout" (for instance in order to find the shortest).

In this case, if we were looking at just one session, for which we knew the session ID, we could find the session length like this:

logscale

sessionID=123456
| [min(@timestamp), max(@timestamp)]
| length := _max - _min

or equivalent:

logscale

sessionID=123456
| range(@timestamp, as=length)

Doing this for all sessions is a matter of grouping these aggregations by the session ID:

logscale

groupBy(sessionID, function=[min(@timestamp), max(@timestamp)])
| length := _max - _min

"Hang on," you say, "we want to explicitly look at login and logout events, not just any events which happen to have a sessionID field."

Right, sorry — let's ensure that we capture the right events. We do this by putting filters on the event type before the min() and max() functions — different filters, so they can't go before the groupBy(). This is what subqueries or "anonymous functions" are for:

logscale

groupBy(sessionID, function=[
  { eventType="Login" 
| min(@timestamp, as=loginTime) },
  { eventType="Logout" 
| max(@timestamp, as=logoutTime) }
])
| length :=logoutTime - loginTime

This is more precise and even lets us detect whether we see any logouts-before-logins for the same session ID.

The groupBy() function, in fact, is a very powerful tool for correlating events whenever there is some common key which can be used to "bring the events together".

How data flows

It may be worthwhile to look a bit more at that last query, to understand how data flows.

This diagram shows what happens:

Figure 12. Data Flows

With groupBy(),

it is always the same field(s) which is used as the group key.
once an event has been assigned to a group, all sub-aggregators see the event.
in cases like this, though, further filtering has been added in those aggregators.
the results from different sub-aggregators are combined.
This is like an "outer join": if one aggregator outputs a result and the other doesn't — say if a minimum is found but no maximum — then a row is still being output — in this case, one with a _min field but no _max.
If one aggregator outputs multiple rows, then each of those is combined with the output from the other aggregator to form an output row.

Annotating an event

But say you're not interested in a simple aggregate value. You're interested in whole events, with some fields added from some other event, the two kinds of events linked by some key field.

To make it concrete: Occasionally in some user session, we get an InterestingEvent and would like to know some context — say, the IP address of the client— but that context is only logged in connection with the Login event.

To a certain extent, we can still use the same technique. For the aggregator for the InterestingEvent case, we use tail() — which collects the last N events.

logscale

groupBy(sessionID, function=[
  { eventType="InterestingEvent" 
| tail(100) },
  { eventType="Login" 
| selectLast(clientIP) }
])
| eventType="InterestingEvent"

Here we collect up to 100 InterestingEvents per session. When using groupBy() to correlate events, there must be some limit to what we gather; whether that's a problem depends on the use case. If you must always find all relevant events, you may need an actual join query.

By the way, what's up with the final filter on eventType?

Well, what groupBy() constructs is an outer join, so it would still output rows for entities which were spawned but which never emitted any InterestingEvents — that is, where the first (tail()) aggregator had found nothing, but the second (selectLast()) had. Such rows would therefore only contain the fields sessionID and clientIP.

A more general way to ensure we get an inner join — or left or right, for that matter — is to simply count each kind of event, then filter on the counts:

logscale

groupBy(sessionID, function=[
  { eventType="InterestingEvent" 
| [tail(100), count(as=c1)] },
  { eventType="Login" 
| [selectLast(clientIP), count(as=c2)] }
])
| c1 > 0 
| c2 > 0

This can also be used to answer questions like "has event type X happened more than once in a single session?".

Common key, different fields

Let's try a slight variation: there is some ID we can correlate by, but groupBy() does not seem applicable because the ID happens to be in different fields in the two events we're interested in connecting.

More concretely, let's say our domain includes some tree-like structure where entities have parent-child relations; let's say the parent logs some Spawn event with myID and childID fields; and that the child logs some InterestingEvent event with a myID field.

We'd like to know, for any InterestingEvent event, what the parent ID was.

Obviously we can't simply group by myID, because the link we're interested in is between a childID in one event and a myID in another event.

We will still use groupBy(), but we need to establish a usable key first:

logscale

eventType match {
  Spawn => key := childID;
  InterestingEvent => key := myID;
}
| groupBy(key, function=[
  { eventType="InterestingEvent" 
| tail(100) },
  { eventType="Spawn" 
| selectLast(myID) 
| rename(myID, as=parentID) }
])
| eventType="InterestingEvent"

Finding outliers

Another common kind of query is: "Find actions whose response times were more than twice the average", or some variation like "...more than two standard deviations above the normal".

This can be solved using a conventional join — first find the relevant aggregate values (average and perhaps standard deviation) — but it can in many cases also be solved with a one-pass query.

First off, let's note that whichever events we'll find, they'll be those with the largest response times. So if our final report is a list of up to 200 instances of "slow actions", we can collect just the 200 slowest actions, and then check them against whichever threshold we compute.

So we're going to collect three things: the average time, the standard deviation, and the 200 slowest instances:

logscale

[
  avg(responsetime),
  stddev(responsetime),
  sort(responsetime, limit=200)
]
| threshold := _avg + 2 * _stddev
| test(responsetime > threshold)

Here we have an example of combining aggregators, one of which may output multiple rows — in which case each of those rows is combined with the result of the other aggregators (here _avg and _stddev values) to form a full output row.

This query can be extended to, for instance, different thresholds per action type, by putting the aggregation inside a groupBy(actionType).

What to know about join queries

Enough dodging already. I've so far shown a number of examples of how to combine information in different events using multiple aggregators and groupBy(), and stated that "sometimes you can do that" and that there are limits but that's sometimes OK.

Let's get to the fuller picture.

In particular, let's get join() into the picture.

But first...

The Stream and the State

Experienced users of LogScale may know that there is a qualitative difference between the first part of a query and the last part. And that the boundary goes exactly at the first aggregating function.

Before that function, we have the Stream.

From that stream, the first aggregator catches the State. (You can think of it as a kind of fish.)

And after that aggregator, we take the State and do further stuff.

The key point is that we start with something unmanageable, and we can't take all of it with us further. We need to distill it somehow, into something of a manageable size.

Now what bearing does this have on join queries?

It's what we could call the Ghostbuster Rule of log searching:

Ghostbuster Rule:You Can't Cross Streams.

That is: you can join a state and a state — all of the queries so far have done exactly that — and you can join a stream and a state. But you can't join two streams.

Now, back to joins.

How join works

The way join functions work in LogScale is remarkably simple: First you do one thing, then you do another.

More concretely: first you visit one Stream, to eventually leave with some State. Then you visit another Stream (obviously another one, because you can't visit the same stream twice), bringing your acquired State to guide your search there.

Scaling

This already tells us something about using join(): whichever part-query we do first must result in a State of sufficiently small size (and smaller is usually better).

Thus, if we want to correlate a rare kind of event with a common kind of event, we're better off searching for the rarer kind first.

And if neither of the two things we want to correlate are rare in themselves — say, we are a provider of a streaming video service and want to find users who suspiciously have watched both Star Trek and Star Wars, neither of which would in itself be a rare event as such — then things get further interesting.

Liveness

By the way, the "first do one thing, then do another" also tells us that live join queries are not generally possible — those are examples of trying to join two streams (unless special circumstances apply, at least).

Join examples

Getting back from theory to practice, let's look at how to write join queries.

As things stand, when using join() there are three main constraints:

The inner query is run first, which means that its result is a State; there are therefore limits on its size.
From the results of the inner query, only selected fields are carried over to the join result.
Inner join and left join modes are supported (the left component is the stream part); outer and right joins are not.

Let's revisit the parent-child relationship query, and let's assume that InterestingEvent is rare — but that we want to include all, even if many occur in the same session — while Spawn events are relatively common.

Then, putting the query part with the highest specificity (filtering-ness) inside join(), and making up a number of fields we want to capture, we get:

logscale

eventType="Spawn"
| select([childID, myID]) // The join key, and whatever we want to include
| join({ eventType="InterestingEvent" },
  field=childID, key=myID, // Specifying the join keys.
  include=[@timestamp, macguffin], // Fields to include from InterestingEvent
  max=100000 // We want all events, not just one per ID.
)

The potentially unsatisfying part of this query is that we'd perhaps like to know about "InterestingEvent"s even if we find no corresponding Spawn event — that is, we'd actually like a right join rather than an inner join.

Outlier detection as join query

Let's now revisit another query: the one where we found response times higher than a certain threshold.

Suppose we'd like to know exactly how many such instances there are — then our method of just finding the N worst won't suffice.

Starting with the variant where we have separate thresholds per actionType, a join query could look like this:

logscale

join(
  { groupBy(actionType, function=[avg(responsetime), stddev(responsetime)])
| threshold := _avg + 2 * _stddev
  }, // The output of this is a table with "actiontype" and "threshold" columns
  key=actionType,
  include=[threshold]
)
| test(responsetime > threshold)
| groupBy(actionType, function=count())

If we don't want separate thresholds, just one global one, we run into a problem: join() does not (at the moment at least) allow for there to be zero join keys.

That's not too much of a problem though — we just add a dummy key field. It may look a bit silly, but it gets the job done:

logscale

dummyKey := ""
| join(
  { [avg(responsetime), stddev(responsetime)]
    
| threshold := _avg + 2 * _stddev
    
| dummyKey := ""
  },
  key=dummyKey,
  include=[threshold]
]
| test(responsetime > threshold)
| count()

Rare correlation of individually common events

Now that we know a bit about what to do when one of the things to join is a) rare, b) has a fixed set of fields we're interested in carrying over to the join result, and c) is something it's OK we only get to see if it's correlated with one or more events in the other join party — how, then, to proceed when that isn't the case?

To illustrate one tool which may come to the rescue, let's look at a question where the problem is that neither party to the join is a sufficiently small result set that it fits into State constraints: finding streaming-video users which, suspiciously, have been watching both Star Wars and Star Trek.

What we can do in this case is to use two passes over the data (like in a join query), but to use the first pass merely to get a good idea about which join keys are in fact relevant.

Then, in the second pass, we gather the data with a groupBy() as shown earlier — with the flexibility that entails — but because we know a bit about the key set, we won't need to keep track of as many groups.

The new tool here is a function called selfJoinFilter(). Let's see it in action — in its idiomatic place, as a filter before a groupBy():

logscale

selfJoinFilter(userID,
  where=[{videoID="Star Trek"}, {videoID="Star Wars"}],
  prefilter=true
)
| groupBy(userID, function=[
  { videoID="Star Trek" 
| [count(as=c1), max(@timestamp, as=lastWatched1)] },
  { videoID="Star Wars" 
| [count(as=c2), max(@timestamp, as=lastWatched2)] }
  ])
| c1 > 0 
| c2 > 0

It should be noted that even this method begins to struggle if, for both types of events, there are many keys (here, users) which have that event type.

Some conclusions

That just about concludes our tour of the join-adjacent LogScale query landscape.

As far as correlating events go, here are some take-aways:

groupBy() is often your friend.
Pros:
- It is one-phase (and therefore usable in live queries).
- It is quite flexible, and can support all kinds of join modes.
Cons:
- It creates a group for all keys it encounters, and can therefore not take into account any knowledge of rare vs common event kinds.
  This can lead to pressure on the state size.
join() is also a good tool to know.
Pros:
- It is great when one of the join parties leads to a small result set.
Cons:
- It is two-phase (which among other things means it's not usable in live queries).
- It has certain constraints, and the two join parties are not treated symmetrically.
- Some join modes are not supported.
- When neither of the join parties lead to a small enough result set, there is no good way to use join.
Sometimes, selfJoinFilter() can be a help in reducing the state size.

Knowledge Base