Replacing Hardware in a Cluster

If you need to replace a node in your LogScale cluster, for whatever reason, you have a number of different options.

Cluster Node Identity

A cluster node is identified in the cluster by its UUID. The UUID is automatically generated the first time a node is started, and stored in $HUMIO_DATA_DIR/cluster_membership.uuid. When moving or replacing a node, you can use this file to ensure a node rejoins the cluster with the same identity.
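As a minimal sketch of preserving a node's identity across a rebuild (throwaway directories stand in for the real $HUMIO_DATA_DIR, and the UUID value is an example):

```shell
# Throwaway directories standing in for the real paths; in production,
# DATA_DIR would be $HUMIO_DATA_DIR and BACKUP a location that survives
# the hardware swap.
DATA_DIR=$(mktemp -d)
BACKUP=$(mktemp -d)
echo "3fa85f64-5717-4562-b3fc-2c963f66afa6" > "$DATA_DIR/cluster_membership.uuid"

# Before replacing the node: keep a copy of its identity file.
cp "$DATA_DIR/cluster_membership.uuid" "$BACKUP/"

# After provisioning the new data directory: restore the file before
# starting LogScale, so the node rejoins under the same identity.
NEW_DATA_DIR=$(mktemp -d)
cp "$BACKUP/cluster_membership.uuid" "$NEW_DATA_DIR/"
```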

This page of the documentation assumes a basic hardware and network setup. If you're using SANs, Blue-Green Deployment, or other advanced techniques, you can still use it as a reference when adapting these procedures to your configuration.

Same Storage Unit

If the node will continue to run on the same storage, meaning it keeps its data directory, all you need to do is ensure that the node is not a Digest Node before shutting it down:

  1. Assign another node to any Digest Rules where this node is assigned.

    This can be done using LogScale's Cluster Management UI. You can read more about un-assigning digest rules on the Adding & Removing Nodes documentation page.

  2. Shut down the LogScale process on the node.

    At this point the node will show as unavailable in the Cluster Management UI.

  3. Replace the hardware components.

  4. Start the LogScale process.

    Your node should rejoin the cluster after a short time, and you will see the node becoming available in the Cluster Management UI.

  5. Reassign the Digest Rules (if you unassigned any in Step 1).

New Storage Unit — Slow Recovery

You are moving a node to a different machine, or installing a new disk or SSD.

There are two requirements that must be fulfilled:

  • Confirm that your cluster keeps multiple replicas of its data (Replication Factor >= 2) and that it is acceptable for the cluster to run at lower replication while the new hardware is being provisioned.

  • Make sure that the node does not contain any data for which it is the sole owner (this can occur if you have archive divergence).

    You can check this in the Cluster Management UI, indicated by red numbers in the Size column.

    As long as no data exists solely on this node, the cluster can self-heal once the node reappears: the other nodes will discover that it is missing data it was expected to have and will start re-sending it.

  1. Make a copy of the Node UUID file.

    While you won't have to copy all the data on the node, you must make a backup of the Node UUID file.

    It is located in $HUMIO_DATA_DIR/cluster_membership.uuid; you will be copying it to the new data folder on the new storage unit.

  2. Make a copy of the global snapshot file, to have a backup in case the copy in S3 is corrupt.

    It is located in $HUMIO_DATA_DIR/global-data-snapshot.json; you will be copying it to the new data folder on the new storage unit.

  3. Assign another node to any Digest Rules where this node is assigned.

    This can be done using LogScale's Cluster Management UI. You can read more about un-assigning digest rules in the section about removing a node.

  4. Shut down the LogScale process.

  5. Copy the Node UUID file from step 1 into the node's data folder.

  6. Start the LogScale process using the new storage.

    Your node should rejoin the cluster after a short time, and you will see the node becoming available in the Cluster Management UI.

    The other nodes will start re-sending the data that is missing, and the Too Low segment of the replication status in the header will initially be high, but will begin dropping as data is replicated.

  7. Reassign the Digest Rules (if you unassigned any in Step 3).
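Steps 1, 2, and 5 above can be sketched as follows. Throwaway directories stand in for the old and new $HUMIO_DATA_DIR; only the two small metadata files are copied, since the cluster re-replicates the segment data on its own:

```shell
OLD_DATA=$(mktemp -d)   # stands in for the old $HUMIO_DATA_DIR
NEW_DATA=$(mktemp -d)   # data folder on the new storage unit
echo "node-uuid" > "$OLD_DATA/cluster_membership.uuid"
echo "{}"        > "$OLD_DATA/global-data-snapshot.json"
echo "segments"  > "$OLD_DATA/segment-1.data"   # deliberately NOT copied

# Copy only the identity and global snapshot files; the other nodes will
# re-send the missing segment data once this node rejoins the cluster.
for f in cluster_membership.uuid global-data-snapshot.json; do
  cp "$OLD_DATA/$f" "$NEW_DATA/$f"
done
```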

New Storage Unit — Quick Recovery

If you are moving the node to a new storage unit and have hard replication requirements, or your cluster is only storing data in one replica, you cannot use the procedure in New Storage Unit — Slow Recovery.

To limit your node's downtime, copy the node's data directory before shutting down the original node. This ensures you only have to copy the most recent data while the node is offline.

  1. Use rsync or similar to copy the data directory to the new storage (this includes the UUID File).

  2. Assign another node to any Digest Rules where this node is assigned.

    This can be done using LogScale's Cluster Management UI. You can read more about un-assigning digest rules in the section about removing a node.

  3. Reassign any archive rules to other cluster nodes.

    This can be done using LogScale's Cluster Management UI. You can read more about un-assigning archive rules in the section about removing a node.

  4. Shut down the LogScale process.

  5. Rerun rsync or similar to copy the most recent data to the new storage.

  6. Start the LogScale process. Your node should rejoin the cluster after a short time, and you will see the node become available in the Cluster Management UI.

  7. Reassign the Digest Rules and Archive Rules (if you unassigned any in Steps 2 and 3).

Storage Malfunctions

Storage has malfunctioned, and you are running with no replication or the node held data not found on other nodes.

In the case where storage cannot be recovered, there are two options:

  1. Restore the node from backup, if you have backups enabled. See Bucket Storage.

  2. Forcibly remove the node from the cluster. Any data that was not stored in multiple replicas will be lost. See Forcing Removal.