Health Checks
Humio exposes information about its own health through the Health Check API. This feature is still in development and will continue to change; please check the documentation and release notes for updates.
The overall health of a Humio node can be in one of three states: OK, WARN, and DOWN.

- OK — All health checks are within normal operational parameters.
- WARN — At least one health check is in a WARN state, and an operator should look into why that is.
- DOWN — At least one health check is in a DOWN state. The node is down and not working. With dynamic load balancers, nodes that are down should be removed from the active set of nodes, as in the sketch below.
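As a rough illustration, a load-balancer health probe or monitoring script could poll the Health Check API and act on the overall state. The following Python sketch is a minimal example; the endpoint path (/api/v1/health), the port, and the shape of the JSON response (a top-level status field) are assumptions, so verify them against the API documentation for your version.

```python
import json
import urllib.request

# Hypothetical values for illustration; adjust to your own deployment.
BASE_URL = "http://localhost:8080"
HEALTH_PATH = "/api/v1/health"  # assumed path of the Health Check API


def node_state(base_url: str = BASE_URL) -> str:
    """Fetch the overall health state ("OK", "WARN" or "DOWN") of one node."""
    with urllib.request.urlopen(base_url + HEALTH_PATH, timeout=5) as resp:
        body = json.load(resp)
    # Assumed response shape: a JSON object with a top-level "status" field.
    return body.get("status", "DOWN")


def keep_in_rotation(state: str) -> bool:
    """Load-balancer decision: OK and WARN nodes keep serving, DOWN nodes are removed."""
    return state in ("OK", "WARN")


if __name__ == "__main__":
    state = node_state()
    print(f"state={state} keep_in_rotation={keep_in_rotation(state)}")
```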
Part of the overall health state is the uptime value. When a node has just started, there is a grace period where the overall state will always be WARN, even if some checks are DOWN. This gives the system time to become stable, but also indicates that things might fluctuate a bit. The grace period can be set with the configuration parameter HEALTH_CHECK_GRACE_PERIOD_SEC; the default value is 30.
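If you forward these states to an alerting system, the grace period can be mirrored on the consuming side so that the expected WARN state right after start-up does not page anyone. A minimal sketch, assuming the health payload exposes the node's uptime in seconds and that the monitoring script sees the same HEALTH_CHECK_GRACE_PERIOD_SEC value:

```python
import os

# Mirror the node's grace period; falls back to the documented default of 30.
GRACE_PERIOD_SEC = int(os.environ.get("HEALTH_CHECK_GRACE_PERIOD_SEC", "30"))


def should_alert(state: str, uptime_sec: float) -> bool:
    """Decide whether a reported health state should page an operator.

    While the node is inside its startup grace period the overall state is
    expected to be WARN, so nothing is alerted on until uptime exceeds it.
    """
    if uptime_sec < GRACE_PERIOD_SEC:
        return False
    return state in ("WARN", "DOWN")
```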
Determining if a situation is affecting the health of a node can be specific to each installation. Therefore, most checks will have configuration parameters for controlling thresholds. See the reference below.
Health Check Reference
backup-disk-usage

Percent used of the backup disk (only present when a backup disk is used). The check will be WARN if the used percentage is >= HEALTH_CHECK__BACKUP_DISK_USAGE__WARN_THRESHOLD_SEC (defaults to 90). Humio (and Kafka) will crash if the data directories run out of free disk space; recovering demands manual intervention and will quite possibly lose data in transit.
event-latency-p99

Ingest latency (99th percentile). This latency is measured from the time an event is received by Humio until the digest phase is done processing that event (running live searches and persisting to disk). This check will return a WARN status if the latency is higher than the configuration parameter HEALTH_CHECK__EVENT_LATENCY_P99__WARN_THRESHOLD_SEC (defaults to 30). Humio is built for low ingest latency, often sub-second, and high latency is usually a sign that something is not working as expected.

A number of situations will give higher latency for a short period because Humio needs to catch up with the ingest flow:

- When a node has just started.
- When a node takes over ingest from another (failed) node.
- When changing the digest partition schema.

In other situations, high latency is a symptom of something stopping the system from running as expected. Humio will take corrective measures in a number of different ways, but when ingest latency starts to rise, the following causes have historically led to high latency. (We have an ongoing effort to make Humio resilient to overload situations and are improving both the corrective measures Humio takes by itself and the transparency of what is happening in the cluster. If you experience overload situations, we are very interested in working with you to fix the situation and improve the resiliency of Humio going forward.)

- The amount of ingest might be bigger than what the cluster can cope with. Humio can handle a high amount of ingest, but for a given cluster there is always a breaking point. Overloading a Humio cluster will result in rising ingest latency.
- Heavy historical queries. Humio has an efficient query engine, but some searches are inherently heavy, and running them over large data sets can use so much CPU that ingest falls behind, especially if the system is nearing its maximum capacity. The Query Monitor in the administrative page can be used to find and disable problematic queries if this happens.
- Heavy live searches using too much CPU. Live searches sit on the critical path for ingest and add latency. If they are heavy, they might make ingest fall behind. Live searches can, like historical searches, be seen in the Query Monitor.
- Kafka too slow. Kafka is usually not the limiting factor in a Humio cluster, but it sits on the critical path for ingest, and if Kafka is not dimensioned for the ingest load of a given cluster, ingest will fall behind.

Making Humio resilient in overload situations is an ongoing effort. Humio will strive towards low ingest latency, at the expense of searches if necessary. The following are some of the corrective measures Humio will take:

- Auto-sharding: if a given data source is falling behind, it is split artificially into auto-shards until each shard is small enough to cope with the ingest load. This can happen if the ingest suddenly increases or a heavy live search is started.
- Quotas: users can be assigned limited search quotas, so they cannot take all resources for an extended period.
- Cancelling heavy searches: when Humio detects that ingest latency is rising, it will start to cancel heavy searches, starting with dashboard searches that have not been polled for a long time.
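Since short-lived latency spikes are expected while Humio catches up, an external monitor should typically alert only on sustained breaches of the event-latency-p99 threshold. The sketch below is an illustration only: `read_latency_sec` stands in for whatever function you write to fetch the current p99 latency for a node (for example from the Health Check API), and the threshold simply mirrors the default of HEALTH_CHECK__EVENT_LATENCY_P99__WARN_THRESHOLD_SEC.

```python
import time
from collections import deque
from typing import Callable

WARN_THRESHOLD_SEC = 30.0  # mirrors the default WARN threshold for this check
SUSTAINED_SAMPLES = 10     # consecutive breaching samples required before alerting


def watch(read_latency_sec: Callable[[], float],
          poll_interval_sec: float = 30.0) -> None:
    """Alert only on sustained high ingest latency.

    Short-lived spikes, such as catch-up after a restart or a digest
    reassignment, are ignored; only a full window of samples above the
    threshold triggers the alert.
    """
    window = deque(maxlen=SUSTAINED_SAMPLES)
    while True:
        window.append(read_latency_sec() >= WARN_THRESHOLD_SEC)
        if len(window) == SUSTAINED_SAMPLES and all(window):
            print("sustained high ingest latency; check the Query Monitor for heavy searches")
        time.sleep(poll_interval_sec)
```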
failing-http-status-checks

Number of failed HTTP status checks within the last 60 seconds. This check will return a WARN status if the count is 1 or more. Note that this check might count the same host multiple times within the same time interval. Nodes in Humio use HTTPS when running searches and can only work if all (digest and storage) nodes can reach each other. Note that when a node dies, both this health check and missing-nodes-via-kafka will eventually fail, but this might not happen at the same time, as the checks run asynchronously.
missing-nodes-via-kafka

Number of nodes not heard from via Kafka within the last 90 seconds. This check will return a WARN status if the node count is 1 or more. Kafka is used to ship ingest data around the cluster and to share global metadata. A node not present in Kafka means that it is unavailable. As long as the number of missing nodes is fewer than the replication factor, this should not affect Humio, but an operator should immediately take corrective measures to get all nodes back up and running.
primary-disk-usage

Percent used of the primary disk. The check will be WARN if the used percentage is >= HEALTH_CHECK__PRIMARY_DISK_USAGE__WARN_THRESHOLD_SEC (defaults to 90). Humio (and Kafka) will crash if the data directories run out of free disk space; recovering demands manual intervention and will quite possibly lose data in transit.
secondary-disk-usage

Percent used of the secondary disk (only present when a secondary disk is used). The check will be WARN if the used percentage is >= HEALTH_CHECK__SECONDARY_DISK_USAGE__WARN_THRESHOLD_SEC (defaults to 90). Humio (and Kafka) will crash if the data directories run out of free disk space; recovering demands manual intervention and will quite possibly lose data in transit.
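For reference, the percent-used calculation behind the three disk-usage checks can be reproduced locally, for example when deciding whether the default threshold of 90 fits your disks. A minimal sketch; the directory paths are placeholders for your own data and backup directories.

```python
import shutil

# Example paths; substitute your own Humio data and backup directories.
DISKS = {
    "primary": "/data/humio-data",
    "backup": "/backup/humio",
}


def disk_used_percent(path: str) -> float:
    """Percent of the filesystem holding `path` that is currently used."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def disk_state(path: str, warn_threshold_pct: float = 90.0) -> str:
    """Mirror the disk-usage checks: WARN at or above the threshold, else OK."""
    return "WARN" if disk_used_percent(path) >= warn_threshold_pct else "OK"


if __name__ == "__main__":
    for label, path in DISKS.items():
        print(f"{label}-disk-usage: {disk_state(path)} ({disk_used_percent(path):.1f}% used)")
```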