Health Checks
LogScale exposes information about its own health. This information is available through the Health Check API.
The overall health of a LogScale node can be in one of three states:
- OK — All health checks are within normal operational parameters. 
- WARN — At least one health check is in a WARN state and an operator should investigate the cause. 
- DOWN — At least one health check is in a DOWN state. The node is down and not working. With dynamic load balancers, nodes that are down should be removed from the active set of nodes. 
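For example, an external probe such as a dynamic load balancer health check can query the Health Check API and map the overall state to an action. The following is a minimal sketch, assuming the node serves its health status as JSON at /api/v1/health with a top-level status field; the endpoint path and field names are assumptions and should be verified against the Health Check API documentation.

```python
import json
import sys
import urllib.request

NODE_URL = "http://localhost:8080"  # hypothetical node address


def overall_status(base_url: str) -> str:
    """Fetch the node's overall health state (OK, WARN, or DOWN)."""
    with urllib.request.urlopen(f"{base_url}/api/v1/health") as resp:
        body = json.load(resp)
    return body.get("status", "DOWN")


if __name__ == "__main__":
    status = overall_status(NODE_URL)
    if status == "DOWN":
        # Remove the node from the load balancer's active set.
        sys.exit(1)
    if status == "WARN":
        # Keep serving traffic, but an operator should investigate.
        print("node is in WARN state")
    sys.exit(0)
```

Exiting non-zero only on DOWN matches the guidance above: WARN nodes keep serving traffic while an operator investigates the cause.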
Part of the overall health state is the uptime value. When a node has just started, there is a grace period during which the overall state will always be WARN, even if some checks are DOWN. This gives the system time to become stable, and also indicates that things might fluctuate a bit. The grace period can be set with the configuration parameter HEALTH_CHECK__GRACE_PERIOD_SEC; the default value is 30.
  
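A monitoring script can take this grace period into account and avoid alerting on WARN while a node is still starting up. The sketch below assumes the health response includes the node's uptime in seconds next to the overall status; both field names are assumptions, not confirmed API fields.

```python
# Minimal sketch: suppress WARN alerts during the startup grace period.
# Assumes a health payload such as {"status": "WARN", "uptime": 12}; the
# "status" and "uptime" field names are assumptions to verify against the API.
GRACE_PERIOD_SEC = 30  # matches the default of HEALTH_CHECK__GRACE_PERIOD_SEC


def should_alert(health: dict) -> bool:
    """Return True if the reported state warrants operator attention."""
    status = health.get("status", "DOWN")
    uptime = health.get("uptime", 0)
    if status == "DOWN":
        return True
    # WARN is expected while the node is still within its grace period.
    return status == "WARN" and uptime >= GRACE_PERIOD_SEC
```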
Whether a given situation affects the health of a node can be specific to each installation. Therefore, most checks have configuration parameters to control their thresholds. See the reference below.
Health Check Reference
| Health Check Name | Description | 
|---|---|
| cluster-time-skew | Measures time skew between cluster nodes. This check returns WARN if the time skew is greater than or equal to the value set in HEALTH_CHECK__CLUSTER_TIME_SKEW__WARN_THRESHOLD_MS (defaults to 15000). LogScale will exhibit strange behavior if the cluster nodes are out of sync; always run NTP on a LogScale cluster. |
| cluster-versions | Checks if all nodes are running the same LogScale version. This check returns a WARN if there are two or more reported versions in the cluster. Only updated once per minute. LogScale clusters should always run the same version on each node. | 
| event-latency-p99 | Ingest latency (99th percentile). This latency is measured from the time an event is received by LogScale until the digest phase is done processing that event (running live searches and persisting to disk). This check returns a WARN status if the latency is higher than the value set in its warning threshold configuration parameter. If you experience overload situations, LogScale would be very interested in working with you to resolve them and improve the resiliency of LogScale going forward. LogScale takes corrective measures to attempt to lower latency and strives towards low ingest latency, at the expense of searches if necessary. |
| failing-http-status-checks | Number of failed HTTP status checks within the last 60 seconds. This check returns a WARN status if the count is 1 or more. Note that this check might count the same host multiple times within the same time interval. Nodes in LogScale use HTTPS when running searches, which can only work if all (digest and storage) nodes can reach each other. When a node dies, both this health check and missing-nodes-via-kafka will eventually fail, though not necessarily at the same time, as the checks run asynchronously. |
| global-topic-latency-median | Monitors the latency on the global-events topic, which is the shared channel within the cluster for internal communication. This check returns a WARN status if the latency is higher than the configuration parameter HEALTH_CHECK__GLOBAL_TOPIC_LATENCY_P50__WARN_THRESHOLD_MSEC (defaults to 50). LogScale requires fairly low round-trip latency on this specific topic partition; the target median latency is at most 5 ms. See the global-publish-wait-for-value metric for the underlying metric values. |
| missing-nodes-via-kafka | Number of nodes not heard from via Kafka within the last 90 seconds. This check returns a WARN status if the node count is 1 or more. Kafka is used to ship ingest data around the cluster and to share global metadata. A node not present in Kafka is unavailable for processing data. As long as the number of missing nodes is fewer than the replication factor, this should not affect LogScale, but an operator should take corrective measures immediately to get all nodes back up and running. CLUSTER_PING_TIMEOUT_SECONDS sets the time interval to wait for a response. |
| primary-disk-usage | Percent used of the primary disk. This check returns WARN if the percentage used is greater than or equal to the value set in HEALTH_CHECK__PRIMARY_DISK_USAGE__WARN_THRESHOLD_SEC (defaults to 90 percent). LogScale (and Kafka) will crash if data directories run out of free disk space; a crash demands manual intervention and will quite possibly result in loss of data in transit. |
| secondary-disk-usage | Percent used of the secondary disk (only present when a secondary disk is used). This check returns WARN if the percentage used is greater than or equal to the value set in HEALTH_CHECK__SECONDARY_DISK_USAGE__WARN_THRESHOLD_SEC (defaults to 90 percent). LogScale (and Kafka) will crash if data directories run out of free disk space; a crash demands manual intervention and will quite possibly result in loss of data in transit. |
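When troubleshooting a WARN or DOWN node, it can be useful to list exactly which of the checks above are failing. The sketch below assumes the health response carries the individual checks under a checks key, each with name and status fields; the actual response shape may differ and should be checked against the Health Check API documentation.

```python
import json
import urllib.request


def failing_checks(base_url: str) -> list:
    """Return every individual health check that is not currently OK."""
    with urllib.request.urlopen(f"{base_url}/api/v1/health") as resp:
        health = json.load(resp)
    # Assumed shape: {"status": "WARN", "checks": [{"name": ..., "status": ...}, ...]}
    return [c for c in health.get("checks", []) if c.get("status") != "OK"]


if __name__ == "__main__":
    for check in failing_checks("http://localhost:8080"):
        # e.g. "primary-disk-usage: WARN"
        print(f"{check.get('name')}: {check.get('status')}")
```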