Field-Based Throttling
You can use field-based throttling if you want to throttle only certain results from your Alert.
For example, if you have an Alert that triggers when a machine is running out of disk space, you might want to throttle further messages for the same machine, while still receiving a message if another machine also starts running out of disk space within the throttle period. In that case, select Throttle only events with identical field values, and choose the field in your logs that contains the name of the machine.
Say that you have such an Alert, which is a search for a specific log event with a time window of 1 hour and a throttle period of 1 hour. At some point, machine1 runs out of disk space, which results in an event in the log, and the Alert triggers on this event. The Alert search will continue to run and find this event every time, but it will not trigger the Alert, since it is throttled. After some time, machine2 also runs out of disk space. The Alert search will now find both events, but will only trigger for machine2, since machine1 is throttled. After an hour, if machine1 is still out of disk space (and thus there are newer log events for this), the Alert will trigger again for machine1.
The field you throttle on must be present in the result of the query, not just in the events that are input to the query. If a result from the query does not contain the field, it is treated as if it had an empty value for the field.
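For example, if the machine name is in a field called host (a hypothetical field name used here for illustration), an aggregating Alert query can keep that field in its result by grouping on it. A minimal sketch, assuming the disk-space events carry a lowDiskSpace field:
lowDiskSpace=true
| groupBy(host)
Each group in the result then carries the host field, so the Alert can be throttled per machine as in the example above.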
When an Alert triggers, Humio stores the value of the throttle field in memory. To limit memory usage, there is a fixed limit on the number of values Humio stores per Alert. Thus, if you select a throttle field that can take on more values than the limit, your Alert might trigger more frequently than the given throttle period indicates.
Multiple Fields
It is only possible to throttle on a single field. If you need to throttle on multiple fields, you can add a new field to the Alert query that concatenates them.
For example, if your events have a service and a host field, and you want to throttle on the combination of these, you can add a new field in the Alert query by adding the following line to it:
| serviceathost := concat([service, host])
and then throttle on serviceathost.
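As a minimal sketch, assuming the Alert query searches for error events using a hypothetical loglevel field, the complete query could look like:
loglevel=ERROR
| serviceathost := concat([service, host])
with the Alert configured to throttle on serviceathost, so that each distinct combination of service and host is throttled independently.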
Relation between Throttle Period and Time Window
If your search finds specific events that you want to trigger the Alert on, for example specific errors, you should set the throttle period to match the time window of the search. If you set the throttle period higher than the time window, you might miss events; if you set it lower, you might get duplicate Alerts.
If your search involves an aggregate, you might want to set the time window larger than the throttle period in some cases, for example if you want to be notified every hour when there are more than 5 errors within a 4 hour search window. You probably do not want to set the time window smaller than the throttle period, as this means that some events will never be evaluated by the Alert. For Actions like email and Slack, which do not deduplicate notifications, you typically want a higher throttle period.
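As a minimal sketch of such an aggregating Alert query, assuming errors are marked with a hypothetical loglevel field, the query could count the errors and only produce a result when there are more than 5:
loglevel=ERROR
| count()
| _count > 5
Combined with a 4 hour time window and a 1 hour throttle period, this notifies you at most once per hour while the error count over the last 4 hours stays above 5.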
Errors & Warnings
If there is an error when running an Alert, the error will be logged and also set on the Alert, so that it can be seen on the Alerts overview page. If an Alert has multiple Actions attached and some of them fail to run, this will be logged, but no error will be set on the Alert; the Alert will be considered to have fired and will be throttled as normal. It is only considered an error if all Actions fail.
If there are warnings from running the Alert query, they are logged and also set as errors on the Alert. Many warnings are transient and will go away after some time, but some require user interaction, for instance a warning on too many groups in a groupBy() function invocation in the Alert query.
Some warnings will result in the Alert query only returning partial results, which may trigger the Alert when it should not have been triggered, or make the Alert return only some of the events it would otherwise have returned. There are usually a lot of warnings on Alert queries right after Humio starts up, for instance indicating that Humio is trying to catch up on ingested data. Because of this, the default behavior is to not fire an Alert if there are warnings from the Alert query, and instead wait for the warnings to go away.