Data Storage, Buckets and Archiving

Data that is ingested into a repository is stored locally. To enable Falcon LogScale to store more data than fits on the primary disk, secondary storage and bucket storage can be used to extend the overall capacity. Falcon LogScale intelligently moves data between the storage tiers to keep the most recently used data on primary storage, with older, less recently used data stored on secondary and then bucket storage.

There are several methods and factors related to storing LogScale data that you might consider. Below are links to pages describing the different methods and related topics.

Data Retention

To avoid servers reaching their maximum storage capabilities, Falcon LogScale can be configured to expire (delete) data when reaching a given threshold, such as the compressed file sizes, uncompressed file sizes, or the age of data.

Secondary Storage

Active data is stored on local disks within each node of the Falcon LogScale cluster. Primary disks should be high-performance SSDs. For additional local storage, secondary storage (for example, a lower-performance SSD) can be used. Falcon LogScale will automatically move segment files to secondary storage once the primary disk reaches a configured limit.

Bucket Storage

To store larger volumes of data, bucket storage can be used. As with secondary storage, Falcon LogScale will move segments to solutions such as Amazon Bucket Storage or Google Bucket. Bucket storage also allows for deployment of nodes, expansion of an existing cluster, and maintenance of backups in case a node or a cluster crashes.

S3 Archiving

Ingested log data can be archived to Amazon S3. Archiving stores a copy of the ingested log data, but the archived data is not searchable by Falcon LogScale as it is when stored on bucket storage. Archived data can optionally be re-ingested or read by other software.

Before going any further, you should familiarize yourself with LogScale's storage rules, which are covered in the next section.

To monitor the data storage:

  • Data storage across individual nodes can be monitored using the Cluster nodes page.

  • To monitor the amount of data stored across the cluster and the effects of compression, see Cluster statistics.

  • For more detailed and historic information, use the humio/insights dashboard.

Storage Rules

In LogScale, data is distributed across the cluster nodes. Which nodes store what is chosen randomly. The only thing you as an operator can control is how big a portion is assigned to each node, and that multiple replicas are not stored on the same rack/machine/location (to ensure fault-tolerance).

Data is stored in units of segments, which are compressed files between 0.5GB and 1GB. For more information on segments and how data is stored and ingested, see Ingestion: Digest Phase.

See LogScale Multiple-byte Units for more information on how storage numbers are calculated.

Configuring Storage Rules

Data is distributed according to the cluster's storage rules. A Storage Rule is a relation between a storage partition and the set of nodes that should store all data written to that partition.

When a digest node completes a data segment file (the internal data unit in LogScale), the segment is assigned to a random storage partition. Here's an example configuration:

Example Storage Rules

Partition ID    Nodes
1               1, 2
2               3, 4
3               1, 2

In this example the cluster is configured with three storage partitions and four nodes. Nodes 1 and 2 will each store 2/3 of all data written to the cluster, while nodes 3 and 4 each store only 1/3. This is because nodes 1 and 2 store all data in partitions 1 and 3, while nodes 3 and 4 store only the data in partition 2.
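These fractions follow from counting partitions, because each completed segment lands in a random partition. The following is a minimal sketch of that arithmetic, assuming the example table is held in a plain dictionary and that partitions are chosen uniformly. The names `storage_rules` and `node_shares` are illustrative only and are not part of LogScale.

```python
from fractions import Fraction

# Hypothetical in-memory copy of the example table: partition ID -> node IDs.
storage_rules = {1: [1, 2], 2: [3, 4], 3: [1, 2]}


def node_shares(rules):
    """Return the fraction of all ingested data that each node stores,
    assuming segments are assigned to partitions uniformly at random."""
    per_partition = Fraction(1, len(rules))
    shares = {}
    for nodes in rules.values():
        for node in nodes:
            shares[node] = shares.get(node, Fraction(0)) + per_partition
    return shares


for node, share in sorted(node_shares(storage_rules).items()):
    print(f"node {node}: {share} of all data")
```

For the example rules, this reports 2/3 for nodes 1 and 2 and 1/3 for nodes 3 and 4.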

Replication Factor

Notice that in the example above there are the same number of nodes per partition. This is because we want a replication factor of 2, meaning that all data is stored on two nodes. If you had a partition with only one associated node, the replication factor would effectively be one for the entire cluster. This is because you cannot know which data goes into a given partition, and it does not make sense to say that a random subset of the data should be stored in only one copy.
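Put differently, the replication factor you can rely on is set by the partition with the fewest nodes. The following is a small sketch of that check, again assuming the rules are held in a dictionary. The helper name `effective_replication_factor` is hypothetical.

```python
def effective_replication_factor(rules):
    """Return the smallest number of distinct nodes assigned to any partition.

    Because segments land in partitions at random, this minimum is the only
    replication level that can be guaranteed for all data.
    """
    return min(len(set(nodes)) for nodes in rules.values())


balanced = {1: [1, 2], 2: [3, 4], 3: [1, 2]}
lopsided = {1: [1, 2], 2: [3], 3: [1, 2]}  # partition 2 has a single node

print(effective_replication_factor(balanced))
print(effective_replication_factor(lopsided))
```

The first rule set guarantees two copies of every segment. The second guarantees only one, even though most partitions list two nodes.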

If you want fault-tolerance, you should ensure your data is replicated across multiple nodes, physical servers, and geographical locations.

Storage Divergence

LogScale is capable of storing and searching across huge amounts of data. When nodes join or leave the cluster, data will usually need to be moved between nodes to ensure the replication factor is upheld and that no data is lost.

If your system contains very large amounts of data you cannot simply shuffle it around whenever a node leaves or enters the system. That is because moving terabytes or petabytes of data over the network can take a very long time and potentially impact system performance if done at the wrong time.

Data is stored in LogScale according to the cluster's Storage Rules, but when these rules are changed, for example when a storage node fails and is removed from the cluster, data is not automatically redistributed to match the new ruleset.

In other words the Storage Rules only apply to new data that is ingested. This means that data can end up being stored in fewer replicas than the configured replication factor. This is not necessarily a bad thing — it depends on how strict your replication requirements are. You can always redistribute it to match the current rules, but it is done as a separate step from changing rules.

At the top of the Cluster Node Management UI you can see the Storage Divergence indicated. This is, in effect, the amount of data that would need to be sent between nodes in order to make all the cluster's data conform to the current rules.
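One way to reason about this number is to add up, for every segment, the copies that the current rules call for but that do not yet exist on the target nodes. The sketch below illustrates that idea only; it is not a description of how LogScale computes the value. The `Segment` type, its fields, and `storage_divergence` are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    size_bytes: int
    partition: int
    stored_on: frozenset  # IDs of nodes that currently hold a copy


def storage_divergence(segments, rules):
    """Return the bytes that would have to be copied so that every segment
    is present on all nodes listed for its partition."""
    total = 0
    for seg in segments:
        wanted = set(rules[seg.partition])
        missing = wanted - seg.stored_on
        total += seg.size_bytes * len(missing)
    return total


rules = {1: [1, 2], 2: [3, 4]}
segments = [
    Segment(1_000_000_000, 1, frozenset({1, 2})),  # already conforms
    Segment(500_000_000, 2, frozenset({3})),       # one copy missing
    Segment(800_000_000, 1, frozenset({3, 4})),    # stored under older rules
]
print(storage_divergence(segments, rules))
```

In this toy data set, the second segment needs one more copy and the third needs two, so 2,100,000,000 bytes would have to move.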

Replication Changes Apply Only to New Data

Suppose you have a cluster and want to increase your replication factor to four instead of the current two replicas. This would require having four nodes in each storage rule — which sets the replication factor to four.

Important

This change will only apply to new data entering the system. All existing data will only be kept in two copies. The reason for this is that the increased replication factor would mean that all data in the entire cluster would have to be transmitted between nodes. In a cluster with a large amount of data, this might not be what you want.
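To get a feel for why this is not done automatically, the cost of bringing existing data up to the new replication factor can be estimated with simple arithmetic. The figures below (100 TB of data and an aggregate transfer rate of 1.25 GB/s) are made-up inputs, and both helper functions are hypothetical.

```python
def extra_transfer_bytes(unique_bytes, old_factor, new_factor):
    """Bytes that must be copied to raise every segment from old_factor
    to new_factor replicas. Lowering the factor needs no transfer."""
    return unique_bytes * max(new_factor - old_factor, 0)


def transfer_hours(total_bytes, bytes_per_second):
    """Hours needed to move total_bytes at a sustained aggregate rate."""
    return total_bytes / bytes_per_second / 3600


unique_data = 100 * 10**12   # assumed: 100 TB of data, counted once
rate = 1.25 * 10**9          # assumed: 1.25 GB/s across the cluster

extra = extra_transfer_bytes(unique_data, old_factor=2, new_factor=4)
print(f"{extra / 10**12:.0f} TB to copy, "
      f"about {transfer_hours(extra, rate):.1f} hours")
```

Going from two to four replicas means two additional copies of every existing segment, so the volume to transfer is twice the amount of data counted once.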

Redistribute Data for Storage Rules

If you want to make your effective data distribution match the current storage rules, you can use the Cluster Management UI. At the bottom of the Storage Rules Panel on the right-hand side of the screen, click Show Options. You will then be offered the option to Start Transfers.

If you click it, the Traffic column of the nodes will indicate the shuffling of data around the cluster. If you make a mistake, you can revert the storage rules and click Start Transfers again, which moves the data back to match the previous rules.