Replacing Hardware in a Cluster

If you need to replace a node in your LogScale cluster, for whatever reason, you have a number of different options.

Cluster Node Identity

A cluster node is identified in the cluster by its UUID. The UUID is automatically generated the first time a node is started, and stored in $HUMIO_DATA_DIR/cluster_membership.uuid. When moving or replacing a node, you can use this file to ensure a node rejoins the cluster with the same identity.
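
For example, the identity file can be inspected and backed up with ordinary shell tools. A minimal sketch, where the temporary directories merely stand in for your real $HUMIO_DATA_DIR and a backup location, and the UUID value is a made-up placeholder:

```shell
# Illustrative only: the temp dir simulates $HUMIO_DATA_DIR and the UUID
# value is a placeholder for the one your node generated on first start.
HUMIO_DATA_DIR=$(mktemp -d)
echo "example-uuid" > "$HUMIO_DATA_DIR/cluster_membership.uuid"

# Inspect the node's identity:
cat "$HUMIO_DATA_DIR/cluster_membership.uuid"

# Back it up outside the data directory before touching the hardware:
BACKUP_DIR=$(mktemp -d)
cp "$HUMIO_DATA_DIR/cluster_membership.uuid" "$BACKUP_DIR/cluster_membership.uuid.bak"
```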

If the node will continue to run on the same storage, meaning it keeps its data directory, all you need to do is ensure that the node is not a Digest Node before shutting it down:

  1. Navigate to the Cluster nodes page. Find the node that you need to remove from the list, select it and then click the Mark for eviction action at the bottom of the page.

  2. Shut down the LogScale process on the node.

    At this point the node will show as unavailable in the Cluster Management UI.

  3. Replace the hardware components.

  4. Start the LogScale process.

    Your node should rejoin the cluster after a short time, and you will see it become available in the Cluster Management UI.

If the node fails to come back, remove the node from the cluster completely using the Remove node action.

New Storage Target — Slow Recovery

You are moving a node to a different machine, or installing a new disk or SSD.

There are two requirements that must be fulfilled:

  • Check that your cluster has multiple replicas of data (Replication Factor >= 2) and that it is acceptable for the cluster to run with lower replication while the new hardware is being provisioned.

  • Make sure that the node does not contain any data for which it is the sole owner (this can occur if you have archive divergence).

    You can check this in the Cluster Management UI, indicated by red numbers in the Size column.

    If these requirements are met, the cluster can self-heal once the node reappears. The cluster will discover that the node is missing data it was expected to have and will start re-sending it.

  1. Make a copy of the Node UUID file.

    While you won't have to copy all the data on the node, you must make a backup of the Node UUID file.

    It is located in $HUMIO_DATA_DIR/cluster_membership.uuid; you will be copying it to the new data folder on the new storage target.

  2. Make a copy of the global snapshot file, so that you have a backup in case the copy in S3 is corrupt.

    It is located in $HUMIO_DATA_DIR/global-data-snapshot.json; you will be copying it to the new data folder on the new storage target.

  3. Shut down the LogScale process.

  4. Copy the Node UUID file from step 1 (and the global snapshot file from step 2) into the node's new data folder.

  5. Start the LogScale process using the new storage.

    Your node should rejoin the cluster after a short time, and you will see it become available in the Cluster Management UI.

    The other nodes will start re-sending the data that is missing, and the Too Low segment of the replication status in the header will initially be high, but will begin dropping as data is replicated.
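
The file copies in steps 1, 2 and 4 can be sketched as follows. The temporary directories only simulate the old and new data directories so that the sketch is runnable anywhere; the two file names are the ones given above, while the file contents are placeholders:

```shell
# Simulate the old and new data directories (illustrative only):
OLD_DATA=$(mktemp -d)    # stands in for $HUMIO_DATA_DIR on the old storage
NEW_DATA=$(mktemp -d)    # stands in for $HUMIO_DATA_DIR on the new storage
BACKUP=$(mktemp -d)      # somewhere safe to keep the backups

# Simulate the two files the node already has (placeholder contents):
echo "example-uuid" > "$OLD_DATA/cluster_membership.uuid"
echo "{}" > "$OLD_DATA/global-data-snapshot.json"

# Steps 1-2: back up the node identity and the global snapshot
# before shutting the process down.
cp "$OLD_DATA/cluster_membership.uuid" "$BACKUP/"
cp "$OLD_DATA/global-data-snapshot.json" "$BACKUP/"

# Step 4: after shutdown, restore the identity file into the new data
# folder so the node rejoins the cluster with the same UUID.
cp "$BACKUP/cluster_membership.uuid" "$NEW_DATA/"
```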

New Storage Target — Quick Recovery

If you are moving the node to a new storage target and have hard replication requirements, or your cluster is only storing data in one replica, you cannot use the procedure in New Storage Target — Slow Recovery.

To limit the downtime of your node, copy the node's data directory before shutting down the original node. This ensures you only have to copy the most recent data once the node is taken offline.

  1. Use rsync or similar to copy the data directory to the new storage (this includes the UUID File).

  2. Assign another node to any Digest Rules where this node is assigned.

    This can be done using LogScale's Cluster Management UI. You can read more about un-assigning digest rules in the section about removing a node.

  3. Shut down the LogScale process.

  4. Rerun rsync or similar to copy the most recent data to the new storage.

  5. Start the LogScale process. Your node should rejoin the cluster after a short time, and you will see the node become available in the Cluster Management UI.

Storage Malfunctions

If the configured storage malfunctions, and there is no replication or the node held data not found on other nodes, a different solution is required to remove the node. There are two options:

  1. Restore the node from backup if you have that enabled. See Bucket Storage.

  2. Forcibly remove the node from the cluster. Any data that was not stored in multiple replicas will be lost. See Forcing Removal.