Ingest latency (99th percentile). This latency is measured from the time an event is received by Humio until the digest phase has finished processing that event (running live searches and persisting it to disk).
This check will return a
WARN status if the latency exceeds the configuration parameter
HEALTH_CHECK__EVENT_LATENCY_P99__WARN_THRESHOLD_SEC (default: 30 seconds).
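The check logic can be sketched as follows. This is a hypothetical illustration, not Humio's actual implementation: the function name and the latency input are assumptions, but the environment variable and its default mirror the parameter described above.

```python
import os

# Default matches the documented value of
# HEALTH_CHECK__EVENT_LATENCY_P99__WARN_THRESHOLD_SEC.
DEFAULT_WARN_THRESHOLD_SEC = 30.0

def ingest_latency_status(p99_latency_sec):
    """Return "OK" or "WARN" for a measured p99 ingest latency (seconds).

    Hypothetical sketch of the check, not Humio's real code.
    """
    threshold = float(os.environ.get(
        "HEALTH_CHECK__EVENT_LATENCY_P99__WARN_THRESHOLD_SEC",
        DEFAULT_WARN_THRESHOLD_SEC,
    ))
    return "WARN" if p99_latency_sec > threshold else "OK"
```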
Humio is built for low ingest latency, often sub-second, and high latency is usually a sign that something is not working as expected.
There are a number of situations that will cause higher latency for a short period while Humio catches up with the ingest flow:
- When a node has just started.
- When a node takes over ingest from another (failed) node.
- When changing the digest partition schema.
In a number of situations, high latency is a symptom of something preventing the system from running as expected. Humio takes corrective measures in a number of different ways, but when ingest latency starts to rise, the following causes have historically been observed. (We have an ongoing effort to make Humio resilient to overload situations, improving both the corrective measures Humio takes by itself and the transparency of what is happening in the cluster. If you experience overload situations, we are very interested in working with you to fix the situation and improve the resiliency of Humio going forward.)
- It might be that the amount of ingest is bigger than what the cluster can cope with. Humio can handle a high amount of ingest, but for a given cluster there is always a breaking point. Overloading a Humio cluster will cause ingest latency to rise.
- Heavy historical queries. Humio has an efficient query engine, but some searches are inherently heavy, and running those over large data sets can use up so much CPU that ingest falls behind, especially if the system is nearing its maximum capacity. The Query Monitor on the administrative page can be used to find and disable problematic queries if this happens.
- Heavy live searches using too much CPU. Live searches sit on the critical path for ingest and add latency. If they are heavy, they might make the ingest fall behind. Live searches can, similarly to historical searches, be seen in the Query Monitor.
- Kafka too slow. Kafka is usually not the limiting factor in a Humio cluster, but it should be noted that Kafka sits in the critical path for ingest, and if Kafka is not dimensioned for the ingest load of a given cluster, ingest will fall behind.
Making Humio behave resiliently in overload situations is an ongoing effort. The following are some of the corrective measures Humio will take. Humio strives towards low ingest latency, at the expense of searches if necessary.
- Auto-sharding - If a given data source is falling behind, it will be split artificially into auto-shards until each shard is small enough to cope with the ingest load. This can happen if the ingest suddenly increases or a heavy live search is started.
- Quotas - Users can be assigned limited search quotas, so they cannot take all resources for an extended period.
- Cancelling heavy searches - In situations where Humio detects that ingest latency is rising, it will start to cancel heavy searches, starting with dashboard searches that have not been polled for a long time.
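As an illustration of the auto-sharding idea described above, the sketch below keeps splitting a data source until each shard's share of the ingest load fits its capacity. The doubling strategy, function name, and capacity figures are assumptions for illustration only, not Humio's actual algorithm.

```python
def autoshard_count(ingest_events_per_sec, per_shard_capacity):
    """Split a data source into enough shards that each one can keep up.

    Hypothetical sketch: double the shard count until each shard's share
    of the ingest load is at or below its capacity.
    """
    shards = 1
    while ingest_events_per_sec / shards > per_shard_capacity:
        shards *= 2
    return shards
```

For example, a data source ingesting 1000 events/second against an assumed per-shard capacity of 300 events/second would end up split into 4 shards.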
Number of failed HTTP status checks within the last 60 seconds.
This check will return a
WARN status if the count is 1 or more. Note that this check might count the same host multiple times within the same time interval.
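A minimal sketch of the counting behaviour described above (the class name and interface are hypothetical): failed checks are kept in a 60-second sliding window, and a single failure within the window is enough to trigger WARN.

```python
import time
from collections import deque

class FailedStatusCheckWindow:
    """Count failed HTTP status checks within a sliding time window.

    Hypothetical sketch; the same host may be recorded (and thus counted)
    multiple times within the window, matching the note above.
    """

    def __init__(self, window_sec=60.0):
        self.window_sec = window_sec
        self._failures = deque()  # timestamps of failed checks

    def record_failure(self, now=None):
        self._failures.append(time.time() if now is None else now)

    def status(self, now=None):
        now = time.time() if now is None else now
        # Drop failures that have aged out of the window.
        while self._failures and now - self._failures[0] > self.window_sec:
            self._failures.popleft()
        return "WARN" if len(self._failures) >= 1 else "OK"
```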
Nodes in Humio use HTTPS when running searches, and searches can only work if all (digest and storage) nodes can reach each other.
Note that when a node dies, both this health check and
missing-nodes-via-kafka will eventually fail, but this might not happen at the same time, as the checks run asynchronously.