Health Checks

LogScale exposes information about its own health. This information is available through the Health Check API.

The overall health of a LogScale node can be in one of three states:

  • OK: All health checks are within normal operational parameters.

  • WARN: At least one health check is in a WARN state and an operator should investigate the cause.

  • DOWN: At least one health check is in a DOWN state. The node is down and not working. With dynamic load balancers, you should remove down nodes from the active set.
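The load balancer advice above can be sketched as a small filter. This is only an illustration: the function name `active_set` and the node names are hypothetical, and how the per-node state is obtained from the Health Check API is left out. Keeping WARN nodes in rotation is an assumption based on the state descriptions, which only call for removing DOWN nodes.

```python
def active_set(node_states):
    """Return the nodes that should stay in the load balancer's active set.

    node_states maps a node name to its overall health state
    ("OK", "WARN" or "DOWN"). Only DOWN nodes are removed; WARN nodes
    stay in rotation (an assumption, see the lead-in).
    """
    return sorted(node for node, state in node_states.items() if state != "DOWN")


# Hypothetical cluster with one node reporting DOWN.
print(active_set({"node-1": "OK", "node-2": "DOWN", "node-3": "WARN"}))
```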

The uptime value is part of the overall health state. When a node has just started, there is a grace period during which the overall state is always WARN, even if some checks are DOWN. This gives the cluster time to become stable and signals that check results might still fluctuate. You can set the grace period with the configuration parameter HEALTH_CHECK__GRACE_PERIOD_SEC. The default value is 30 seconds.
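A minimal sketch of how the overall state could be derived from the individual checks and the uptime, based only on the description above. The function name `overall_state` and its arguments are hypothetical, and treating an uptime exactly equal to the grace period as outside the grace period is an assumption. This is not LogScale's actual implementation.

```python
def overall_state(check_states, uptime_sec, grace_period_sec=30):
    """Combine individual check states into one overall node state.

    check_states maps a check name to "OK", "WARN" or "DOWN".
    grace_period_sec mirrors the default of HEALTH_CHECK__GRACE_PERIOD_SEC.
    """
    # During the grace period the overall state is always WARN,
    # even if some checks are DOWN.
    if uptime_sec < grace_period_sec:
        return "WARN"
    states = set(check_states.values())
    if "DOWN" in states:
        return "DOWN"
    if "WARN" in states:
        return "WARN"
    return "OK"
```

For example, a node with one DOWN check reports WARN at 10 seconds of uptime and DOWN once the 30-second grace period has passed.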

Determining if a situation is affecting the health of a node can be specific to each installation. Therefore, most checks have configuration parameters to control thresholds. See the reference below.

Health Check Reference

Health Check Name Description
cluster-time-skew Measures time skew between cluster nodes. This check returns WARN if time skew is greater than or equal to the value set in HEALTH_CHECK__CLUSTER_TIME_SKEW__WARN_THRESHOLD_MS (defaults to 15000 ms). LogScale will exhibit strange behavior if the clocks on the cluster nodes are out of sync. Always run NTP on LogScale clusters.
cluster-versions Checks if all nodes are running the same LogScale version. This check returns WARN if there are two or more reported versions in the cluster. The check is updated only once per minute. LogScale clusters should always run the same version on each node.
event-latency-p99

Ingest latency (99th percentile). This latency is measured from the time an event is received by LogScale until the digest phase has finished processing that event (running live searches and persisting to disk). This check returns a WARN status if the latency is higher than the configuration parameter HEALTH_CHECK__EVENT_LATENCY_P99__WARN_THRESHOLD_SEC (defaults to 30). LogScale is built for low ingest latency, often sub-second. High latency is usually a sign that something is not working as expected. Several situations can cause higher latency for a short period while LogScale catches up with the ingest flow:

  • When a node has just started.

  • When a node takes over ingest from another (failed) node.

  • When changing the digest partition schema.

In several situations, high latency is a symptom of something stopping the cluster from running as expected. LogScale will take corrective measures in different ways. However, when ingest latency starts to rise, the following is a list of causes that have historically led to high latency.

If you experience overload situations, contact support to resolve the issue and improve LogScale resiliency.

  • The amount of ingest might be more than the cluster can handle. LogScale can handle a high volume of ingest, but for a given cluster there is always a breaking point. Overloading a LogScale cluster will cause ingest latency to rise.

  • Heavy historical queries. LogScale has an efficient query engine, but some searches are inherently heavy. Running those over large data sets can use so much CPU that ingest falls behind, especially if the cluster is nearing its maximum capacity.

    If this happens, the Query Monitor on the administrative page can be used to find and disable problematic queries.

  • Heavy live searches using too much CPU. Live searches sit on the critical path for ingest and add latency. If these are heavy, they might make the ingest fall behind. Live searches can be seen in the Query Monitor, similarly to historical searches.

  • Kafka too slow. Kafka is usually not the limiting factor in a LogScale cluster. However, Kafka sits in the critical path for ingest. If Kafka is not sized for the ingest load of a given cluster, ingest will fall behind.

Improving LogScale's resiliency in overload situations is an ongoing effort.

LogScale takes the following corrective measures to attempt to lower latency. LogScale strives for low ingest latency, even at the expense of searches if necessary.

  • Auto-sharding: If a given data source is falling behind, it will be split artificially into auto-shards until each shard is small enough to cope with the ingest load. This can happen if the ingest suddenly increases or a heavy live search is started.

  • Quotas: Users can be assigned limited search quotas, so they cannot take all resources for an extended period.

  • Canceling heavy searches: When LogScale detects that ingest latency is increasing, it will start to cancel heavy searches, starting with dashboard searches that have not been polled for a long time.
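To make the event-latency-p99 check more concrete, here is a small sketch of a 99th percentile threshold test. It assumes a nearest-rank percentile over a list of recent latencies. How LogScale actually samples and aggregates latency is not described here, so the function names and the percentile method are illustrative assumptions.

```python
def p99(latencies_sec):
    """Nearest-rank 99th percentile of a non-empty list of latencies."""
    ordered = sorted(latencies_sec)
    # 1-based rank = ceil(0.99 * n), computed with integers to avoid
    # floating-point rounding surprises.
    rank = (99 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def event_latency_state(latencies_sec, warn_threshold_sec=30):
    """WARN if the p99 latency is higher than the threshold, else OK.

    warn_threshold_sec mirrors the default of
    HEALTH_CHECK__EVENT_LATENCY_P99__WARN_THRESHOLD_SEC.
    """
    return "WARN" if p99(latencies_sec) > warn_threshold_sec else "OK"
```

With 100 samples, a single slow event does not trip the check, while two events above the threshold do, which is why a percentile is less noisy than a maximum.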

failing-http-status-checks Number of failed HTTP status checks within the last 60 seconds. This check returns a WARN status if the count is 1 or more. Note that this check might count the same host multiple times within the same time interval. Nodes in LogScale use HTTPS when running searches, and searches can only work if all (digest and storage) nodes can reach each other. When a node dies, both this health check and missing-nodes-via-kafka will eventually fail. This might not happen at the same time because the checks run asynchronously.
global-topic-latency-median Monitors the latency on the global-events topic, which is the shared channel within the cluster for internal communication. This check returns a WARN status if the latency is higher than the configuration parameter HEALTH_CHECK__GLOBAL_TOPIC_LATENCY_P50__WARN_THRESHOLD_MSEC (defaults to 50). LogScale requires fairly low round trip latency on this specific topic partition. The target median latency is at most 5 ms. See the global-publish-wait-for-value metric for the underlying metric values.
missing-nodes-via-kafka Number of nodes not heard from through Kafka within the last 90 seconds. This check returns a WARN status if the node count is 1 or more. Kafka is used to ship ingest data around the cluster and for sharing global metadata. A node not present in Kafka is unavailable for processing data. As long as the number of missing nodes is lower than the replication factor, this should not affect LogScale. However, an operator should take corrective measures immediately to get all nodes back up and running. The CLUSTER_PING_TIMEOUT_SECONDS configuration parameter sets the time interval to wait for a response.
primary-disk-usage Percentage of the primary disk that is in use. This check returns WARN if the percentage used is greater than or equal to the value set in HEALTH_CHECK__PRIMARY_DISK_USAGE__WARN_THRESHOLD_SEC (defaults to 90). LogScale (and Kafka) will crash if data directories run out of free disk space. A crash demands manual intervention and will quite possibly result in loss of data in transit.
secondary-disk-usage Percentage of the secondary disk that is in use (only present when a secondary disk is used). This check returns WARN if the percentage used is greater than or equal to the value set in HEALTH_CHECK__SECONDARY_DISK_USAGE__WARN_THRESHOLD_SEC (defaults to 90). LogScale (and Kafka) will crash if data directories run out of free disk space. A crash demands manual intervention and will quite possibly result in loss of data in transit.
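The configuration parameters and defaults listed in this reference can be collected in one place, for example in an operator script that reports which thresholds would apply. The sketch below assumes the parameters are supplied as environment variables. The helper names `effective_thresholds` and `disk_usage_state` are hypothetical, and the script only mirrors the documented defaults; it does not read or change LogScale's own configuration.

```python
import os

# Parameter names and default values as listed in the reference above.
DEFAULTS = {
    "HEALTH_CHECK__GRACE_PERIOD_SEC": 30,
    "HEALTH_CHECK__CLUSTER_TIME_SKEW__WARN_THRESHOLD_MS": 15000,
    "HEALTH_CHECK__EVENT_LATENCY_P99__WARN_THRESHOLD_SEC": 30,
    "HEALTH_CHECK__GLOBAL_TOPIC_LATENCY_P50__WARN_THRESHOLD_MSEC": 50,
    "HEALTH_CHECK__PRIMARY_DISK_USAGE__WARN_THRESHOLD_SEC": 90,
    "HEALTH_CHECK__SECONDARY_DISK_USAGE__WARN_THRESHOLD_SEC": 90,
}


def effective_thresholds(env=None):
    """Return every parameter's value, falling back to the documented default."""
    env = os.environ if env is None else env
    return {name: int(env.get(name, default)) for name, default in DEFAULTS.items()}


def disk_usage_state(percent_used, threshold):
    """WARN when usage is greater than or equal to the threshold, else OK."""
    return "WARN" if percent_used >= threshold else "OK"


if __name__ == "__main__":
    for name, value in effective_thresholds().items():
        print(f"{name}={value}")
```

Note that the disk usage and time skew checks are described as inclusive (greater than or equal to), while the latency checks warn only when the value is higher than the threshold.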