FAQ: How to handle ingest delays in aggregate alerts and scheduled searches

Note

Filter alerts handle ingest delay automatically, so the principles in this article do not apply to them.

Ingest delays

Events are produced in an external system some time before they are sent to LogScale, which means there is a delay coming from outside of LogScale. This ingest delay is the difference between event and ingest timestamps:

  • event timestamp: available in the @timestamp field; this is when the event is produced, that is, when it actually happens outside LogScale.

  • ingest timestamp: available in the @ingesttimestamp field; this is when the event arrives in LogScale.

When an event is ingested into LogScale, it takes some time before it is available for searches. This is the ingest delay inside LogScale. There is no field on events reflecting the time when a log is available for searching.

There are two metrics that are helpful when considering ingest delays outside and inside LogScale: external-ingest-delay and event-latency-repo. The external-ingest-delay metric shows the delay between log creation (@timestamp) and LogScale ingestion (@ingesttimestamp) for all events in a single repository within a 5-minute interval; in other words, the ingest delay outside of LogScale. If your triggers run on a view connected to multiple repositories, you need to aggregate the metric over all of the repositories.

The event-latency-repo metric shows ingest delay inside LogScale for a single repository, including time spent in parsers, updating live queries, and adding events to blocks for segment files. If you use this metric for triggers running on a view, you must aggregate the metric over all repositories to which the view is connected.
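
As a sketch, the internal ingest delay for a view spanning multiple repositories could be read from the humio-metrics repository like this, where the repository names repo-a and repo-b are hypothetical placeholders for the repositories the view connects:

logscale
name="event-latency-repo"
  | in(repo, values=["repo-a", "repo-b"])
  | max(max)

The same pattern applies to external-ingest-delay; taking max(max) gives the worst-case delay across the selected repositories.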

Queries and ingest delays

For automated queries in general, there is a tradeoff between waiting out the full ingest delay to ensure a complete and correct query result, and getting fast results by acting on a non-empty query result as soon as it is available, whether or not it is complete.

The UI allows you to configure whether the query should run on @ingesttimestamp (when the event arrives in LogScale) or on @timestamp (when the event happens outside LogScale). Read more about choosing the most appropriate timestamp for a trigger at Trigger properties.

Normally, LogScale queries run on @timestamp, so they will be affected by ingest delays both outside and inside of LogScale. If, for example, the combined ingest delay is at most 20 minutes, the last 20 minutes of the search interval can be missing events.

Queries running on @ingesttimestamp are only affected by the ingest delay inside LogScale, but otherwise face the same problem: the most recent part of the search interval can be missing events.
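
To gauge how much of your data is affected, here is a minimal sketch that counts events arriving more than 20 minutes (1,200,000 ms) after they were produced; it assumes @timestamp is parsed correctly, which is why events with parser errors are excluded first:

logscale
@error!=true
  | delay:=@ingesttimestamp - @timestamp
  | delay > 1200000
  | count()

A non-zero count over a representative interval means that a trigger running on @timestamp with less than 20 minutes of margin can miss events.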

Ingest delays in aggregate alerts

For aggregate alerts in particular, LogScale handles triggering in the face of ingest delays as follows:

  • The triggering mode of an aggregate alert is set through the GraphQL API: it can be either Immediate Mode or Complete Mode.

    • Immediate Mode means the alert triggers immediately, even on an incomplete result, but still waits up to 20 minutes for a result to trigger on; this ensures that events delayed by up to 20 minutes can still trigger the alert. With this mode, results may be incomplete.

    • Complete Mode means the alert waits out the internal ingest delay (no matter which timestamp is used), up to a maximum of 20 minutes, until there is a complete result to trigger on. When the query runs on @ingesttimestamp, this ensures complete results as long as the internal ingest delay stays below 20 minutes.

  • Depending on the timestamp that has been configured in the UI, LogScale chooses a triggering mode by default:

    • When the alert runs on @timestamp, the default is Immediate Mode.

    • When the alert runs on @ingesttimestamp, the default is Complete Mode.

    The default triggering strategy can be changed through the API by setting the triggerMode field in the createAggregateAlert() or updateAggregateAlert() GraphQL mutations.

Based on whether Complete Mode or Immediate Mode is selected and the timestamp used for the aggregate alert, ingest delay is handled as follows:

  • Immediate Mode, on either timestamp: the alert triggers as soon as there is a result, but LogScale still waits up to 20 minutes for delayed events; results may be incomplete.

  • Complete Mode on @timestamp: the alert waits out the internal ingest delay, up to 20 minutes, before triggering; events delayed outside LogScale can still be missing from the result.

  • Complete Mode on @ingesttimestamp: the alert waits out the internal ingest delay, up to 20 minutes; results are complete as long as the internal ingest delay stays below 20 minutes.

Calculate ingest delay for scheduled searches and aggregate alerts

If the scheduled search or aggregate alert is running on @timestamp, you need to know the maximum ingest delay to be able to set an accurate value for Delay run and Time window on a scheduled search, or Time window on an aggregate alert. The following process describes how to find the maximum ingest delay.

Start by finding the maximum ingest delay both inside LogScale (event-latency-repo) and outside LogScale (external-ingest-delay). To do this, run the following query in the humio-metrics repo using a search interval that covers a time period representative of the ingest delay of the cluster, changing the metric name to get the delay for both:

logscale
name="event-latency-repo"
  | repo=<REPOSITORY>
  | max(max)

where you replace <REPOSITORY> with the name of the repository.

Once you have these two values, add them together to get the total maximum ingest delay; you will use this later in the process. If your query runs on a view connected to multiple repositories, you need to compute the total maximum ingest delay for each of the repositories.
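
Both metrics can also be read in a single query; the following is a sketch that assumes the same humio-metrics field layout as the query above:

logscale
name=/^(external-ingest-delay|event-latency-repo)$/
  | repo=<REPOSITORY>
  | groupBy([name], function=max(max))

Adding the two resulting maximums together gives the total maximum ingest delay for that repository.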

Note

If you have multiple log sources in the same repository with different delays, and the alert query only searches some of the log sources, then running the query below might give a more precise result. You only need to use it if you don't want to use the value from the external-ingest-delay metric.

Take the filter part of your query and append the following:

logscale
| @error!=true
| delay:=@ingesttimestamp - @timestamp
| max(delay)

The @error!=true line excludes events with parser errors, since those could be errors parsing the timestamp, which would make @timestamp wrong and the calculation incorrect. For the aggregate function, use max(), avg(), or percentile() depending on which value you want to know (maximum delay, average delay, or a percentile of the delay).

So for example, take this query:

logscale
event_platform=win
  | event_simpleName=ProcessRollup2
  | FileName=cmd.exe
  | CommandLine=*whoami*
  | count(as=execution_count)
  | execution_count >= 5
  | sort(execution_count)
  | head(10)

and add the code snippet above to the filter part of the query, so that it is:

logscale
event_platform=win
| event_simpleName=ProcessRollup2
| FileName=cmd.exe
| CommandLine=*whoami*
| @error!=true
| delay:=@ingesttimestamp - @timestamp
| max(delay)
...

Run the query with this calculation over a long time period. Use the resulting value, which is the ingest delay outside LogScale, to guide your settings for Delay run and Time window on the scheduled search, or Time window on an aggregate alert. If you want all events, use the value produced by the query above (the maximum difference between @ingesttimestamp and @timestamp) plus the value of the event-latency-repo metric, which is the ingest delay inside LogScale; this also means that results are delayed by that much. For example, if the query reports a maximum external delay of 15 minutes and event-latency-repo peaks at 1 minute, the trigger needs a margin of at least 16 minutes to catch all events. If you only need, say, 99% of events, use percentile() instead of max().
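
For example, a sketch of the percentile variant of the delay calculation, reusing the filter-and-delay pattern from above:

logscale
| @error!=true
| delay:=@ingesttimestamp - @timestamp
| percentile(delay, percentiles=[99])

This reports the delay that 99% of events stay below, which can justify a shorter Delay run or Time window than the absolute maximum.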

For more information about scheduled searches, see Scheduled searches. For more information about aggregate alerts, see Aggregate alerts.