How-To: On Correlating Events
So you've learned to use LogScale - you can find those events of interest, you can aggregate metrics and display them in time charts and pie charts and whatnot, and you feel a certain level of mastery of the query language.
But then you run into a situation where you need to correlate two events.
Not just find one event, and not just compute some average or other aggregate value across a set of similar events, but to link two individual events together.
And you may feel that you've run into the next stage of the learning curve.
And you may be right. This is at the very least an area where there is a lot of "it depends".
For these kinds of questions, there is usually an underlying decision about whether an actual join query is needed, or whether it can be solved by other means. There are some trade-offs, which I will come back to. For now, let's just note that for the first steps, we'll avoid actual join queries and see how far we can go without.
The simple cases — hidden aggregates
In certain use cases, what we're looking for is two events which are each "the most extreme" in some fashion. In those cases, it's still just a case of aggregating over events — just not with simple numerical aggregators over values.
An example of such a question: "I need to find the length of each session — from login to logout" (for instance in order to find the shortest).
In this case, if we were looking at just one session, for which we knew the session ID, we could find the session length like this:
sessionID=123456
| [min(@timestamp), max(@timestamp)]
| length := _max - _min
or, equivalently:
sessionID=123456
| range(@timestamp, as=length)
Doing this for all sessions is a matter of grouping these aggregations by the session ID:
groupBy(sessionID, function=[min(@timestamp), max(@timestamp)])
| length := _max - _min
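To then find the shortest session, one possible follow-up (a small sketch, not part of the original example) is to sort ascending on the computed length and keep only the top row:

groupBy(sessionID, function=[min(@timestamp), max(@timestamp)])
| length := _max - _min
| sort(length, order=asc, limit=1)   // keep only the shortest session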
"Hang on," you say, "we want to explicitly look at login and logout events, not just any events which happen to have a sessionID field."
Right, sorry — let's ensure that we capture the right events. We
do this by putting filters on the event type before the
min()
and max()
functions
— different filters, so they can't go before the
groupBy()
. This is what subqueries or "anonymous
functions" are for:
groupBy(sessionID, function=[
    { eventType="Login"
      | min(@timestamp, as=loginTime) },
    { eventType="Logout"
      | max(@timestamp, as=logoutTime) }
  ])
| length := logoutTime - loginTime
This is more precise and even lets us detect whether we see any logouts-before-logins for the same session ID.
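For instance, a minimal sketch of such a sanity check (reusing the loginTime and logoutTime fields from the query above) might look like this:

groupBy(sessionID, function=[
    { eventType="Login"
      | min(@timestamp, as=loginTime) },
    { eventType="Logout"
      | max(@timestamp, as=logoutTime) }
  ])
| test(logoutTime < loginTime)   // sessions whose last logout precedes their first login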
The groupBy() function, in fact, is a very powerful tool for correlating events whenever there is some common key which can be used to "bring the events together".
How data flows
It may be worthwhile to look a bit more at that last query, to understand how data flows.
This diagram shows what happens:
Figure 12. Data Flows
With groupBy():
It is always the same field(s) which is used as the group key.
Once an event has been assigned to a group, all sub-aggregators see the event.
In cases like this, though, further filtering has been added in those aggregators.
The results from different sub-aggregators are combined.
This is like an "outer join": if one aggregator outputs a result and the other doesn't — say, if a minimum is found but no maximum — then a row is still output, in this case one with a _min field but no _max.
If one aggregator outputs multiple rows, then each of those is combined with the output from the other aggregator to form an output row.
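To illustrate, here is a sketch (reusing the login/logout query from above) that surfaces the "outer" rows where only one side produced a result, by filtering on field presence:

groupBy(sessionID, function=[
    { eventType="Login"
      | min(@timestamp, as=loginTime) },
    { eventType="Logout"
      | max(@timestamp, as=logoutTime) }
  ])
| loginTime = *      // a login was found for this session ...
| logoutTime != *    // ... but no logout was recorded

Swapping the two presence checks would instead show sessions with a logout but no login.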
Annotating an event
But say you're not interested in a simple aggregate value. You're interested in whole events, with some fields added from some other event, the two kinds of events linked by some key field.
To make it concrete: Occasionally in some user session, we get an InterestingEvent and would like to know some context — say, the IP address of the client — but that context is only logged in connection with the Login event.
To a certain extent, we can still use the same technique. For the aggregator for the InterestingEvent case, we use tail() — which collects the last N events.
groupBy(sessionID, function=[
    { eventType="InterestingEvent"
      | tail(100) },
    { eventType="Login"
      | selectLast(clientIP) }
  ])
| eventType="InterestingEvent"
Here we collect up to 100 InterestingEvents per session. When using groupBy() to correlate events, there must be some limit to what we gather; whether that's a problem depends on the use case. If you must always find all relevant events, you may need an actual join query.
By the way, what's up with the final filter on eventType? Well, what groupBy() constructs is an outer join, so it would still output rows for sessions which saw a Login but never emitted any InterestingEvents — that is, where the first (tail()) aggregator had found nothing, but the second (selectLast()) had. Such rows would therefore only contain the fields sessionID and clientIP.
A more general way to ensure we get an inner join — or left or right, for that matter — is to simply count each kind of event, then filter on the counts:
groupBy(sessionID, function=[
    { eventType="InterestingEvent"
      | [tail(100), count(as=c1)] },
    { eventType="Login"
      | [selectLast(clientIP), count(as=c2)] }
  ])
| c1 > 0
| c2 > 0
This can also be used to answer questions like "has event type X happened more than once in a single session?".
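For instance, a sketch of that last question (with "X" standing in for whichever event type you are interested in) might look like this:

groupBy(sessionID, function=[
    { eventType="X"
      | count(as=xCount) }
  ])
| xCount > 1   // sessions in which event type X occurred more than once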
Common key, different fields
Let's try a slight variation: there is some ID we can correlate by, but groupBy() does not seem applicable because the ID happens to be in different fields in the two events we're interested in connecting.
More concretely, let's say our domain includes some tree-like structure where entities have parent-child relations; let's say the parent logs some Spawn event with myID and childID fields; and that the child logs some InterestingEvent event with a myID field.
We'd like to know, for any InterestingEvent event, what the parent ID was.
Obviously we can't simply group by myID, because the link we're interested in is between a childID in one event and a myID in another event.
We will still use groupBy(), but we need to establish a usable key first:
eventType match {
    Spawn => key := childID;
    InterestingEvent => key := myID;
  }
| groupBy(key, function=[
    { eventType="InterestingEvent"
      | tail(100) },
    { eventType="Spawn"
      | selectLast(myID)
      | rename(myID, as=parentID) }
  ])
| eventType="InterestingEvent"
Finding outliers
Another common kind of query is: "Find actions whose response times were more than twice the average", or some variation like "...more than two standard deviations above the normal".
This can be solved using a conventional join — first find the relevant aggregate values (average and perhaps standard deviation) — but it can in many cases also be solved with a one-pass query.
First off, let's note that whichever events we'll find, they'll be those with the largest response times. So if our final report is a list of up to 200 instances of "slow actions", we can collect just the 200 slowest actions, and then check them against whichever threshold we compute.
So we're going to collect three things: the average time, the standard deviation, and the 200 slowest instances:
[
    avg(responsetime),
    stddev(responsetime),
    sort(responsetime, limit=200)
  ]
| threshold := _avg + 2 * _stddev
| test(responsetime > threshold)
Here we have an example of combining aggregators, one of which may output multiple rows — in which case each of those rows is combined with the result of the other aggregators (here _avg and _stddev values) to form a full output row.
This query can be extended to, for instance, different thresholds per action type, by putting the aggregation inside a groupBy(actionType).
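A sketch of that per-action-type variant (reusing the responsetime field from above) might look like this:

groupBy(actionType, function=[
    avg(responsetime),
    stddev(responsetime),
    sort(responsetime, limit=200)
  ])
| threshold := _avg + 2 * _stddev
| test(responsetime > threshold)

The sub-aggregators then run once per actionType group, so each group gets its own _avg and _stddev, and hence its own threshold.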
What to know about join queries
Enough dodging already. I've so far shown a number of examples of how to combine information in different events using multiple aggregators and groupBy(), and stated that "sometimes you can do that" and that there are limits, but that's sometimes OK.
Let's get to the fuller picture.
In particular, let's get join() into the picture.
But first...
The Stream and the State
Experienced users of LogScale may know that there is a qualitative difference between the first part of a query and the last part. And that the boundary goes exactly at the first aggregating function.
Before that function, we have the Stream.
From that stream, the first aggregator catches the State. (You can think of it as a kind of fish.)
And after that aggregator, we take the State and do further stuff.
The key point is that we start with something unmanageable, and we can't take all of it with us further. We need to distill it somehow, into something of a manageable size.
Now what bearing does this have on join queries?
It's what we could call the Ghostbuster Rule of log searching:
Ghostbuster Rule: You Can't Cross Streams.
That is: you can join a state and a state — all of the queries so far have done exactly that — and you can join a stream and a state. But you can't join two streams.
Now, back to joins.
How join works
The way join functions work in LogScale is remarkably simple: First you do one thing, then you do another.
More concretely: first you visit one Stream, to eventually leave with some State. Then you visit another Stream (obviously another one, because you can't visit the same stream twice), bringing your acquired State to guide your search there.
Scaling
This already tells us something about using join(): whichever part-query we do first must result in a State of sufficiently small size (and smaller is usually better).
Thus, if we want to correlate a rare kind of event with a common kind of event, we're better off searching for the rarer kind first.
And if neither of the two things we want to correlate are rare in themselves — say, we are a provider of a streaming video service and want to find users who suspiciously have watched both Star Trek and Star Wars, neither of which would in itself be a rare event as such — then things get further interesting.
Liveness
By the way, the "first do one thing, then do another" also tells us that live join queries are not generally possible — those are examples of trying to join two streams (unless special circumstances apply, at least).
Join examples
Getting back from theory to practice, let's look at how to write join queries.
As things stand, when using join() there are three main constraints:
The inner query is run first, which means that its result is a State; there are therefore limits on its size.
From the results of the inner query, only selected fields are carried over to the join result.
Inner join and left join modes are supported (the left component is the stream part); outer and right joins are not.
Let's revisit the parent-child relationship query, and let's assume that InterestingEvent is rare — but that we want to include all, even if many occur in the same session — while Spawn events are relatively common.
Then, putting the query part with the highest specificity (filtering-ness) inside join(), and making up a number of fields we want to capture, we get:
eventType="Spawn"
| select([childID, myID]) // The join key, and whatever we want to include
| join({ eventType="InterestingEvent" },
field=childID, key=myID, // Specifying the join keys.
include=[@timestamp, macguffin], // Fields to include from InterestingEvent
max=100000 // We want all events, not just one per ID.
)
The potentially unsatisfying part of this query is that we'd perhaps like to know about "InterestingEvent"s even if we find no corresponding Spawn event — that is, we'd actually like a right join rather than an inner join.
Outlier detection as join query
Let's now revisit another query: the one where we found response times higher than a certain threshold.
Suppose we'd like to know exactly how many such instances there are — then our method of just finding the N worst won't suffice.
Starting with the variant where we have separate thresholds per actionType, a join query could look like this:
join(
    { groupBy(actionType, function=[avg(responsetime), stddev(responsetime)])
      | threshold := _avg + 2 * _stddev
    },                // The output of this is a table with "actionType" and "threshold" columns
    key=actionType,
    include=[threshold]
  )
| test(responsetime > threshold)
| groupBy(actionType, function=count())
If we don't want separate thresholds, just one global one, we run into a problem: join() does not (at the moment at least) allow for there to be zero join keys.
That's not too much of a problem though — we just add a dummy key field. It may look a bit silly, but it gets the job done:
dummyKey := ""
| join(
{ [avg(responsetime), stddev(responsetime)]
| threshold := _avg + 2 * _stddev
| dummyKey := ""
},
key=dummyKey,
include=[threshold]
]
| test(responsetime > threshold)
| count()
Rare correlation of individually common events
Now that we know a bit about what to do when one of the things to join is a) rare, b) has a fixed set of fields we're interested in carrying over to the join result, and c) only needs to appear in the output when it is correlated with one or more events in the other join party — how, then, do we proceed when that isn't the case?
To illustrate one tool which may come to the rescue, let's look at a question where the problem is that neither party to the join is a sufficiently small result set that it fits into State constraints: finding streaming-video users which, suspiciously, have been watching both Star Wars and Star Trek.
What we can do in this case is to use two passes over the data (like in a join query), but to use the first pass merely to get a good idea about which join keys are in fact relevant.
Then, in the second pass, we gather the data with a groupBy() as shown earlier — with the flexibility that entails — but because we know a bit about the key set, we won't need to keep track of as many groups.
The new tool here is a function called selfJoinFilter(). Let's see it in action — in its idiomatic place, as a filter before a groupBy():
selfJoinFilter(userID,
    where=[{videoID="Star Trek"}, {videoID="Star Wars"}],
    prefilter=true
  )
| groupBy(userID, function=[
    { videoID="Star Trek"
      | [count(as=c1), max(@timestamp, as=lastWatched1)] },
    { videoID="Star Wars"
      | [count(as=c2), max(@timestamp, as=lastWatched2)] }
  ])
| c1 > 0
| c2 > 0
It should be noted that even this method begins to struggle if, for both types of events, there are many keys (here, users) which have that event type.
Some conclusions
That just about concludes our tour of the join-adjacent LogScale query landscape.
As far as correlating events goes, here are some take-aways:
groupBy() is often your friend.
Pros:
It is one-phase (and therefore usable in live queries).
It is quite flexible, and can support all kinds of join modes.
Cons:
It creates a group for all keys it encounters, and can therefore not take into account any knowledge of rare vs common event kinds.
This can lead to pressure on the state size.
join() is also a good tool to know.
Pros:
It is great when one of the join parties leads to a small result set.
Cons:
It is two-phase (which among other things means it's not usable in live queries).
It has certain constraints, and the two join parties are not treated symmetrically.
Some join modes are not supported.
When neither of the join parties leads to a small enough result set, there is no good way to use join.
Sometimes, selfJoinFilter() can be a help in reducing the state size.