How-To: Using, Combining, and Building Aggregators

Using LogScale's query language is a skill that involves many areas. Two aspects are particularly important. One is knowing how to filter data effectively - this is often what makes queries run fast. The other is understanding aggregators and knowing how to use them well - this lets you express solutions to a broader range of problems.

In this article, you will learn about aggregators, starting with simple aggregation functions, then you will learn how to combine and build more complex aggregators.

What is an Aggregator?

First, let us define what an aggregator (and an aggregation) is. How does it differ from, for example, a filter or a transformation step in a query? An aggregation is something that describes, in some way, a whole set of events. Some aggregations are known from maths, especially statistics: the number of events, the sum of a certain field's values, the average of its values, or the 90th percentile of its values. These are all numeric aggregators: given a set of input events, they output a single number. In LogScale, these correspond to functions such as count(), sum(), avg(), and percentile().
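
For instance, a minimal numeric aggregation (using a hypothetical responseTime field) can be a query step on its own:

logscale
avg(responseTime, as=averageResponseTime)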

There are, however, more aggregators in addition to these prototypical ones. What they have in common is that they describe a whole set of input events. Another group of aggregators is the one which finds the extreme values in some form or other. This group includes:

  • min(), max() - which find the smallest/largest value of a field.

  • selectFromMin(), selectFromMax() - which find the event with the smallest/largest value of a field.

And also:

  • sort(), tail(), head() - head/tail are really "sort ascending/descending by timestamp".

  • table() - which is "sort() but gather only certain fields from the relevant events".

These really find the "top N events" but potentially for a rather large value of N. Sometimes though, they're used with the intention of collecting all events they see. A warning is then given if they run out of space while trying to do so.
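
For instance, a sketch of such a "top N" query (the responseTime field name is assumed) could be:

logscale
sort(responseTime, order=desc, limit=10)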

Related, there's an aggregator which collects (within limits) all the distinct values it sees for a single field: collect().
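
As a minimal sketch (the host field name is assumed), collecting the distinct values of a single field could look like:

logscale
collect([host])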

Composite Aggregators

But aggregators can be more complex; there's more to it than just a finite list of aggregating functions. Some functions are not complete by themselves, but always work together with others, like connectable building blocks. These are called composite aggregators. The next group of aggregators you'll learn about works in a similar way: they look at each event, determine what "group" it goes into, and then aggregate all events within each group. They have as a parameter the function that determines what kind of aggregation should be done within each group - this is called a "sub-aggregator". Some very useful functions belong to this group:

  • groupBy()

  • bucket()

  • timeChart()

They differ mainly in how they group: in the case of groupBy(), the values of one or more fields are used as the grouping key. In the case of bucket() and timeChart(), the groups are different time intervals, so that events belonging to the same "time bucket" are grouped together. The functions bucket() and timeChart() also support a "series" field as a separate grouping key; if such a key is provided, then events are grouped together if they belong to the same time bucket and have the same series-field value.

Here's a diagram depicting how that works for a groupBy() with max() as its sub-aggregator:

groupBy() with max() as its sub-aggregator
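
To make the diagram concrete (the host and responseTime field names are assumed), the same shape could be written as:

logscale
groupBy(host, function=max(responseTime))

A time-bucketed variant with a series key could similarly be written as, for example, timeChart(series=host, function=avg(responseTime)).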

Combining Aggregators Horizontally

But there are other ways to combine aggregators. A simple and useful one is to compute more than one aggregation over the same set of data. For instance, you may want the minimum, maximum and average value of a field, or an average and a 90th percentile, or you may want the sum of "transfer duration" and the sum of "bytes transferred" in order to compute an overall transfer speed. This is done with the [...] syntax: you list the relevant aggregators within square brackets. The examples just mentioned, for instance, could be expressed as:

logscale
[min(y), max(y), avg(y)]
[avg(y), percentile(y, percentiles=[90])]
[
  sum(transferDuration, as=totalDuration),
  sum(transferSize, as=totalSize)
]
| averageTransferSpeed := totalSize / totalDuration

Here's a diagram of the data flow:

Combining Aggregators Horizontally

Since the sub-results are combined as one row, you must be careful to avoid field name collisions. For instance, [min(x), min(y)] would be problematic, since both sub-aggregators output a field named _min, but the output row can only have one field called that.
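
One way to avoid such a collision is to name each output explicitly with the as parameter, for instance:

logscale
[min(x, as=minX), min(y, as=minY)]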

The [...] syntax is really just shorthand for the stats() function. Most of the time it's more convenient to use the shorthand syntax. For instance, stats([min(y), max(y)]) is the same as just [min(y), max(y)].

Many functions which take sub-aggregators - such as groupBy() and bucket() - accept not just one, but a list of sub-aggregators. These are combined in the same way as described above (unless the function in question says otherwise).
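
For instance (field names assumed), a groupBy() with a list of sub-aggregators might look like:

logscale
groupBy(host, function=[min(responseTime), max(responseTime), avg(responseTime)])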

Handling Multi-Row Outputs

The way "horizontal combination" of aggregators work is easy enough to understand when all of the sub-aggregators output exactly one row as a result. But do things work when this is not the case?

Zero-row outputs

First, what happens if one of the sub-aggregators does not output any rows? Then that sub-aggregator is just ignored.
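
For instance (the statusCode field name is assumed), if no events carry a statusCode field, the selectFromMax() sub-aggregator in the sketch below outputs no rows and is simply ignored; the result is then just the total count:

logscale
[
  count(as=total),
  selectFromMax(statusCode, include=@rawstring)
]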

More than one row of output

If an aggregator outputs more than one row, then each row is combined with the output of the other aggregators. This can be used, for instance, to enrich individual events with summary statistics about all of the events. For example:

You want to count how often each value of the field x occurs, and compute how large a percentage that is of all events:

logscale
[
  groupBy(x, function=count()),
  count(as=total)
]
| percent := 100 * _count / total

You want to know the 10 worst latencies, and also how many standard deviations they are from the average of all the measured latencies:

logscale
[
  sort(latency, order=desc, limit=10),
  avg(latency),
  stdDev(latency)
]
| stddevsAboveAverage := (latency - _avg) / _stdDev

This feature is usually used with only one of the aggregators being of the "can output more than one row" type. It is possible to combine multiple aggregators of this kind, in which case you get every combination of a row from one with a row from each of the others, but it is not recommended to do so. (This possibility may be restricted, or supported in a different way, in the future.)

Subqueries as Aggregators

There is one final way to compose aggregators from smaller building blocks. Suppose you want to count two different things - green screwdrivers and blue hammers, say. (We don't care about any other kind of tool.) Doing either in isolation is easy enough:

logscale
toolType="screwdriver" color="green" | count(as=greenScrewdrivers)

and:

logscale
toolType="hammer" color="blue" | count(as=blueHammers)

But how do you do both simultaneously?

You can't put either of the filters before the aggregator, because if you filter on color="green" you throw away anything with color="blue", and even if you filter on color="green" OR color="blue" you still need some way to specify which events go into which count.

The point is that you can build an aggregator from an aggregating function plus a filter, by placing them in a subquery (in curly braces). So you can get your two counts simultaneously like this:

logscale
[
  { toolType="screwdriver" color="green"
    | count(as=greenScrewdrivers)
  },
  { toolType="hammer" color="blue"
    | count(as=blueHammers)
  }
]

Counting different things is just a special (though often useful) case of this.

Here's how it works:

Subqueries as aggregators

You could also use this to count the percentage of tools which are blue hammers:

logscale
toolType = *
| [
    { toolType="hammer" color="blue"
      | count(as=blueHammers)
    },
    count(as = total)
  ]
| blueHammerPct := 100 * blueHammers / total

More Complex Subqueries

The above examples use subqueries to just put some filters before an aggregator. However, you can put any mix of filters and transformation steps before or after the aggregator. You can even have multiple aggregation steps.

Let's look at some examples. Suppose you have a set of events, and you want some statistics about the text in those events. The most basic statistic is the event count:

logscale
count(as=eventCount)

Next, you also want to know the total number of characters across all of the events. You add a subquery with a transformation step:

logscale
[
  count(as=eventCount),
  { len := length(@rawstring) | sum(len, as=totalEventLength) }
]

Actually, you also want to know something about the longest event: what is the maximal number of characters in a single event? Well, you compute the maximum as well as the sum:

logscale
[
  count(as=eventCount),
  { len := length(@rawstring)
    | [sum(len, as=totalEventLength), max(len, as=maxEventLength)]
  }
]

As it turns out, the maximum event length might look unusual. So you decide to include the actual text of the longest event using selectFromMax():

logscale
[
  count(as=eventCount),
  { len := length(@rawstring)
    | [
        sum(len, as=totalEventLength),
        max(len, as=maxEventLength),
        selectFromMax(len, include=@rawstring)
      ]
  }
]

Actually, the output column is named @rawstring, which doesn't really describe what its contents are. You could of course rename that column after the aggregator, but you can do it closer to the source. What you need to do is add a rename() step after the selectFromMax():

logscale
[
  count(as=eventCount),
  { len := length(@rawstring)
    | [
        sum(len, as=totalEventLength),
        max(len, as=maxEventLength),
        { selectFromMax(len, include=@rawstring)
          | rename(@rawstring, as=longestEvent)
        }
      ]
  }
]

At this point it should be clear that subqueries can provide flexibility, and that the collection of aggregator functions provided by LogScale really are building blocks which can be combined in many ways.

You've seen examples of how to:

  • Combine aggregators "horizontally" (getting multiple aggregation results in the same output row).

  • Filter before an aggregator, within a subquery.

  • Make transformations before an aggregator (the computation of the event length).

  • Make transformations after an aggregator (the renaming of the longestEvent output).

There are still other effects to cover, though.

Suppose that the event-size analysis above turned out not to be quite what you wanted: what you actually need is a word-level analysis. How many words are there in total? What's the longest one? How do you go from an event to a list of words? You can use a regex to extract all the words in each event:

logscale
@rawstring=/(?<word>\w+)/g

Then you can do the analysis, assuming that you still want the event count too, for instance to compute "average word count per event" statistics. Here's an example:

logscale
[
  count(as=eventCount),
  { @rawstring=/(?<word>\w+)/g
  | len := length(word)
  | [
      count(as=wordCount),
      sum(len, as=totalWordLength),
      max(len, as=maxWordLength),
      avg(len, as=averageWordLength),
      {selectFromMax(len, include=word) | rename(word, as=longestWord)}
    ]
  }
]

It's much like the event-level analysis above; the main difference is that you use a query step which turns one event into multiple items for the aggregation step to process. Another scenario in which this is relevant is if you have an array and want to aggregate across all of the array elements; you can do that either with split() before an aggregator, or using array:reduceAll().
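
As a sketch of the array case (assuming an array field named values[]), aggregating across all elements with array:reduceAll() could look roughly like:

logscale
array:reduceAll("values[]", var=x, function=max(x))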

Multiple Aggregation Steps

A final effect is using more than one aggregator within a subquery. Suppose, for instance, that you want to include in your word analysis the number of unique words - the size of the vocabulary. This can be achieved approximately using count(word, distinct=true), but if you want the precise count you can do it like this:

logscale
groupBy(word, function=[])
| count(as=uniqueWords)

Similarly, you can determine the most common word by first counting each, then selecting the top one:

logscale
groupBy(word)
| selectFromMax(_count, include=word)
| rename(word, as=mostCommonWord)

Or you could find out how many words only occur once in your input:

logscale
groupBy(word)
| _count = 1
| count(as=wordsOnlyUsedOnce)

You can combine it all into a single composite aggregator:

logscale
[
  count(as=eventCount),
  { len := length(@rawstring)
    | [
        sum(len, as=totalEventLength),
        max(len, as=maxEventLength),
        { selectFromMax(len, include=@rawstring)
          | rename(@rawstring, as=longestEvent)
        }
      ]
  },
  { @rawstring=/(?<word>\w+)/g
  | len := length(word)
  | [
      count(as=wordCount),
      sum(len, as=totalWordLength),
      max(len, as=maxWordLength),
      avg(len, as=averageWordLength),
      { selectFromMax(len, include=word)
        | rename(word, as=longestWord)
      },
      { groupBy(word)
      | [
          count(as=uniqueWords),
          { _count = 1 | count(as=wordsOnlyUsedOnce) },
          { selectFromMax(_count, include=[word, _count])
            | rename([[word, mostCommonWord],
                      [_count, mostCommonWordCount]])
          }
        ]
      }
    ]
  }
]

Conclusion

LogScale comes with a range of aggregator functions, both those known from statistics and others which deal not just with numbers but with more general values or whole events. But much of the real expressive power comes when you use those as building blocks for building your own composite aggregators.

You can collect more than one aggregation simultaneously using the [...] construct (or the stats() function); you can partition the input events into groups with groupBy(), bucket() or timeChart(). For certain advanced effects you can use aggregating subqueries, which enable you to combine aggregation steps with filtering and transformation steps, and even to put multiple aggregators in sequence.

While much of LogScale query writing is putting one query step after another in sequence, that is not the only way of combining features.