Insights Hosts Dashboard
This dashboard is part of the Insights package. It shows information about each of your LogScale nodes, which can help you diagnose problems when a node misbehaves.
CPU Usage in Percent
This shows you how much CPU each LogScale node is using. Within a cluster, if each LogScale node has the same specifications, and the digest partitions are evenly distributed, you would expect each LogScale node to have about the same CPU usage.
If some nodes show particularly high usage, this may indicate a problem with LogScale or with your cluster setup.
CPU Usage: Thread Group Ticks
This widget shows the number of CPU ticks used by each LogScale thread group. Within a LogScale cluster, the number of threads is usually dictated by the number of cores on each node. Threads are then assigned to particular thread groups, whose names indicate the type of work LogScale is doing. This widget helps show how much CPU time is spent on each of these thread groups.
A common thread group to see consuming too many resources is the humio-pekko thread group (formerly humio-akka). This is the group responsible for handling network requests, so it may take up more time since it spends a lot of time idle, waiting for responses.
Another common thread group is the digester, which handles digesting all new data coming into LogScale.
JVM Garbage Collection Time
LogScale is built to run on the JVM, so the amount of Garbage Collection being done by the JVM needs to be monitored. If LogScale spends a lot of time in Garbage Collection, it consumes resources that would otherwise go to useful work, such as digesting new data or running queries.
If a particular node is doing most of the Garbage Collection, it may be worth restarting that node to see if that helps.
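To make "a lot of time" concrete, the sketch below shows the arithmetic behind such a GC-time view: the fraction of wall-clock time spent in GC between two metric samples. The function name and sample values are hypothetical, not real LogScale output.

```python
# Sketch: estimate the fraction of wall-clock time spent in garbage
# collection between two samples of a cumulative GC-time metric.
# All values here are made up for illustration.

def gc_time_fraction(gc_ms_start, gc_ms_end, wall_ms_start, wall_ms_end):
    """Fraction of elapsed wall-clock time spent in GC."""
    elapsed = wall_ms_end - wall_ms_start
    if elapsed <= 0:
        raise ValueError("non-positive sample interval")
    return (gc_ms_end - gc_ms_start) / elapsed

# 1.2 s of GC during a 60 s window -> 2% of time in GC.
print(f"{gc_time_fraction(5_000, 6_200, 0, 60_000):.0%}")  # -> "2%"
```

A node that consistently spends a high percentage of its time in GC is a candidate for the restart suggested above.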
Memory: System Usage Percentage
This widget shows the percentage of memory used for caching relative to each LogScale node's total memory. LogScale uses memory to cache the segment files involved in queries, which improves query performance.
Missing Nodes
This is a LogScale metric named missing-cluster-nodes. It is reported by each LogScale node and shows the number of nodes that the reporting node considers dead. A healthy cluster reports zero.
Node Shutdowns
This is a timechart showing which vHost (i.e., LogScale Node ID) has shut down in the given time range. If a node shutdown is unexpected, there may be ERROR logs explaining why.
Failed HTTP Checks
This is a node-level metric named failed-http-checks. Reported by each LogScale node, it counts the nodes that appear to be unreachable over HTTP.
Networking (Bytes per second)
For each LogScale node, this timechart shows the number of bytes per second transmitted and received by the node's network devices.
This can be useful in diagnosing network throughput on each LogScale node, especially if some nodes are slower than expected or if nodes are losing packets due to network issues.
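The per-second figures in a chart like this come from sampling cumulative byte counters. A minimal sketch of that calculation, with made-up counter values (not real LogScale data):

```python
# Sketch: derive a bytes-per-second rate from two readings of a
# monotonically increasing interface byte counter.

def bytes_per_second(counter_a, counter_b, seconds):
    """Rate between two samples of a cumulative byte counter."""
    if seconds <= 0:
        raise ValueError("non-positive sample interval")
    delta = counter_b - counter_a
    if delta < 0:
        raise ValueError("counter went backwards (reset or wraparound?)")
    return delta / seconds

print(bytes_per_second(1_000_000, 1_600_000, 10))  # 60000.0
```

A counter that decreases between samples usually means the interface counter was reset or wrapped, which itself is worth noticing when diagnosing network issues.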
Open File Descriptors
This is a LogScale metric which shows the number of current open file descriptors on each LogScale node.
LogScale needs to keep many files open, both for sockets and for actual files from the file system. The default limit on Linux systems is usually too low; see Increase Open File Limit in Linux for information on raising it.
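To see the limits that cap the open-descriptor count on a Linux/Unix host, a process can query its own soft and hard limits. A small sketch (this inspects the limits only; it is not part of LogScale):

```python
# Sketch: read this process's open-file-descriptor limits on Linux/Unix.
# LogScale's metric reports the *current* open count; these limits are
# the ceiling that count must stay under.
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"soft limit: {soft}, hard limit: {hard}")
```

If the soft limit is in the low thousands, it is likely too small for a busy LogScale node, and the referenced documentation page describes how to raise it.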
CPU Architecture
This table shows each LogScale node's CPU architecture and can be a useful reference. It tells you which processor each node has, along with the number of vCPUs, the threads per core, and the sizes of the L1-L3 caches.
Cluster Time Skew
This is a timechart for each LogScale node showing the largest time skew in milliseconds between this node and any other node in the cluster.
Keeping the time skew between LogScale nodes as low as possible is important as LogScale relies on system times being accurate for it to work as expected. To keep the time skew low between nodes, keep the nodes synced using something like NTP.
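The skew plotted here is the largest difference between any two node clocks at a given moment. A minimal sketch of that calculation, with hypothetical clock readings rather than real node data:

```python
# Sketch: largest pairwise clock skew among nodes, in milliseconds.
# The clock readings below are hypothetical epoch timestamps taken
# at (nominally) the same instant.

def max_skew_ms(node_clocks_ms):
    """Largest difference between any two node clocks, in ms."""
    clocks = list(node_clocks_ms)
    return max(clocks) - min(clocks)

clocks = {
    "node-1": 1_700_000_000_000,
    "node-2": 1_700_000_000_045,  # 45 ms ahead of node-1
    "node-3": 1_699_999_999_990,  # 10 ms behind node-1
}
print(max_skew_ms(clocks.values()))  # 55
```

With NTP keeping nodes synced, this value should normally stay in the low milliseconds.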
Logged Events
This is a timechart showing the number of LogScale logged events per LogScale node.
LogScale Versions
This timechart shows the LogScale versions that have been applied to the cluster in the past 24 hours. This can be useful for checking whether the time of an upgrade correlates with a change seen in another widget.
Primary Disk Usage
This shows a timechart of the Primary Local storage disk usage in percent. By default, LogScale limits disk usage to 85% to avoid disks reaching their maximum capacity, which could result in data loss.
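As a rough illustration of the 85% check, the sketch below computes a mount point's usage percentage and compares it against that default. The path and the hard-coded threshold are assumptions for illustration, not LogScale configuration:

```python
# Sketch: compute a mount point's disk usage percentage and compare it
# against LogScale's default 85% limit. Path and threshold are assumed.
import shutil

def disk_usage_percent(path):
    """Percentage of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100 * usage.used / usage.total

pct = disk_usage_percent("/")
print(f"{pct:.1f}% used; over 85% limit: {pct > 85}")
```

The same check applies to the secondary disk described in the next section, if one is configured.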
Secondary Disk Usage
If you have Secondary Storage configured, this timechart will show you the disk usage in percent of your secondary disk.