This dashboard is part of the Insights package. It will show you information related to each of your Humio nodes. When a node is having a problem, this can be useful in helping diagnose it.
This shows you how much CPU each Humio node is utilising. Within a cluster, if each Humio node has the same specifications, and the digest and storage partitions are evenly distributed, you would expect each Humio node to have about the same CPU usage.
If some nodes show noticeably higher usage than the rest, this may indicate a problem with Humio or with your cluster setup.
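As a rough illustration of that comparison, the following Python sketch flags nodes whose CPU usage sits well above the cluster median. The function name, the sample readings, and the 1.5x factor are hypothetical choices for this example and are not part of Humio. In practice the values would be read from this widget.

```python
from statistics import median


def find_cpu_outliers(cpu_by_node, factor=1.5):
    """Return the IDs of nodes whose CPU usage exceeds factor * cluster median.

    cpu_by_node maps a node ID to its CPU usage in percent.
    The factor is an arbitrary threshold chosen for illustration.
    """
    if not cpu_by_node:
        return []
    typical = median(cpu_by_node.values())
    return sorted(
        node for node, cpu in cpu_by_node.items() if cpu > factor * typical
    )


# Hypothetical readings for a three-node cluster.
usage = {"1": 32.0, "2": 35.0, "3": 81.0}
print(find_cpu_outliers(usage))
```

The median is used rather than the mean so that a single busy node does not raise the baseline it is compared against.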
This widget shows the number of CPU ticks used by each Humio thread group. Within a Humio cluster, the number of threads is usually dictated by the number of cores on each node, and those threads are divided among the thread groups. The name of each group should indicate the type of work it does, so this widget helps show how much CPU time each thread group is consuming.
A thread group that commonly appears to consume too many resources is the humio-akka thread group. This group is responsible for handling network requests, so it may account for more time because it spends much of it idle, waiting for responses.
Another common thread group is the digester, which handles digesting all new data coming into Humio.
Humio is built to run on the JVM. Therefore, we need to monitor the amount of Garbage Collection being done by the JVM. If Humio is spending a lot of time doing Garbage Collection, this could be consuming plenty of resources and thereby stop Humio from doing useful work such as digesting new data or running queries.
If there is a particular node doing much of the Garbage Collection, it could be worth restarting that node to see if it helps.
This widget shows you the percentage of memory being used for caching in comparison to each Humio node’s total memory allocation. Humio utilizes memory for caching segment files being used in queries for speed and performance.
This is a Humio metric named missing-cluster-nodes. It is reported by each Humio node and shows the number of nodes that the reporting node has marked as dead. A healthy system has zero of these.
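The "zero is healthy" rule can be sketched as a simple check. This is a minimal Python example under the assumption that the latest missing-cluster-nodes value for each reporting node has already been collected into a dictionary. The function name and the sample data are hypothetical.

```python
def nodes_reporting_missing(missing_by_reporter):
    """Return {reporter: count} for every node that sees at least one dead node.

    missing_by_reporter maps a reporting node ID to its latest
    missing-cluster-nodes value. An empty result means the cluster looks healthy.
    """
    return {
        node: count for node, count in missing_by_reporter.items() if count > 0
    }


# Hypothetical sample: node "2" believes one other node is dead.
latest = {"1": 0, "2": 1, "3": 0}
print(nodes_reporting_missing(latest))
```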
This is a timechart showing which vHost (i.e., Humio node ID) has shut down in the given time range. If a node shutdown is unexpected, there could be ERROR logs explaining why.
This is a Humio metric named failed-http-checks. It is the number of nodes that appear to be unreachable over HTTP, as reported by each Humio node.
For each Humio node, this timechart shows the number of bytes per second transmitted and received by the network devices on that node.
This can be useful in diagnosing network throughput on each Humio node, especially if some nodes are slower than expected or if nodes are losing packets due to network issues.
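To make the unit concrete, here is a small Python sketch of how a bytes-per-second rate can be derived from two readings of a cumulative byte counter. It assumes the underlying data is a monotonically increasing counter sampled at a known interval, which is an assumption for illustration and not a description of how Humio collects this metric. The function name is hypothetical.

```python
def bytes_per_second(prev_bytes, curr_bytes, interval_seconds):
    """Convert two cumulative byte-counter readings into a rate.

    Returns None when the counter went backwards, which typically
    means the counter was reset (for example after a reboot).
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    if curr_bytes < prev_bytes:
        return None
    return (curr_bytes - prev_bytes) / interval_seconds


# Hypothetical readings taken 60 seconds apart.
print(bytes_per_second(1_000_000, 4_000_000, 60))
```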
This is a Humio metric which shows the number of current open file descriptors on each Humio node.
Humio needs to be able to keep many files open, both for sockets and for actual files on the file system. The default limit on Linux systems is usually too low. See the documentation on increasing the Open File Limit in Linux for more information.
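For a quick look at the limit itself, the sketch below uses Python's standard resource module to read the soft and hard open-file limits and, on Linux, counts the descriptors currently open. Note that it inspects the Python process running the script, not the Humio process, so it only illustrates what the limit and the count mean. The function name is hypothetical.

```python
import os
import resource


def open_file_headroom():
    """Return (soft_limit, hard_limit, in_use) for the current process.

    in_use is None when /proc/self/fd is not available (non-Linux systems).
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    fd_dir = "/proc/self/fd"
    # On Linux, /proc/self/fd has one entry per open file descriptor.
    in_use = len(os.listdir(fd_dir)) if os.path.isdir(fd_dir) else None
    return soft, hard, in_use


soft, hard, in_use = open_file_headroom()
print(f"soft limit: {soft}, hard limit: {hard}, open descriptors: {in_use}")
```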
This table shows each Humio node’s CPU architecture and can be a useful reference. It tells you which processor each node has, along with the number of vCPUs, the threads per core, and the sizes of the L1 to L3 caches.
This is a timechart for each Humio node showing the largest time skew in milliseconds between this node and any other node in the cluster.
Keeping the time skew between Humio nodes as low as possible is important, because Humio relies on accurate system times to work as expected. To keep the skew low, keep the nodes' clocks synchronized using something like NTP.
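The "largest skew between this node and any other node" can be illustrated with a short Python sketch. It assumes you have a clock reading in milliseconds for every node taken at the same instant, which is a simplification for this example. The function name and the sample values are hypothetical.

```python
def largest_skew_ms(node, clock_ms_by_node):
    """Return the largest absolute clock difference, in milliseconds,
    between `node` and any other node in the mapping.

    Returns 0 when there are no other nodes to compare against.
    """
    own = clock_ms_by_node[node]
    return max(
        (abs(own - other)
         for name, other in clock_ms_by_node.items()
         if name != node),
        default=0,
    )


# Hypothetical clock readings in milliseconds.
clocks = {"1": 1000, "2": 1004, "3": 988}
for node_id in clocks:
    print(node_id, largest_skew_ms(node_id, clocks))
```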
This is a timechart showing the number of Humio logged events per Humio node.
This timechart shows the Humio versions that have been applied to the cluster in the past 24 hours. This can be useful for checking whether the time of an upgrade correlates with a change seen in another widget.
This shows a timechart of the Primary Local storage disk usage in percent. Humio by default limits disk usage to 85% to avoid disks reaching their maximum capacity. It is very important not to let a disk fill up, as this could result in data loss.
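To check a disk by hand against the same figure, a minimal Python sketch using the standard shutil.disk_usage call is shown here. The 85% value comes from the default mentioned above. The function names and the path are hypothetical, and this is not how Humio itself enforces the limit.

```python
import shutil


def disk_usage_percent(path):
    """Return the used share of the file system containing `path`, in percent."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return 100.0 * usage.used / usage.total


def exceeds_limit(percent_used, limit=85.0):
    """Return True when usage is above the limit (85% by default)."""
    return percent_used > limit


# "/" is a placeholder; point this at the Humio data directory instead.
pct = disk_usage_percent("/")
print(f"{pct:.1f}% used, above limit: {exceeds_limit(pct)}")
```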
If you have Secondary Storage configured, this timechart will show you the disk usage in percent of your secondary disk.