Buckets and Archive Storage
Data that comes into a repository is generally stored locally, on the server where LogScale and the repository are located. Eventually, you may accumulate and retain too much data, including old data that you don't often search. At a minimum, this will affect LogScale performance when searching more current data. If you run LogScale on a single server, you also risk losing part or all of the data should that server crash.
First, you might simply establish a system to delete old data. Another remedy is to run LogScale in a cluster of nodes, which generally improves performance and allows for redundancy of data. A simpler method is to make use of external storage, which may be used in addition to a cluster and as a part of it.
There are several methods and factors related to storing LogScale data that you might consider. Below are links to pages describing the different methods and related topics.
So that servers don't reach their maximum storage capacity, you can set LogScale to delete data based on compressed file size, uncompressed file size, and age of data. See the data retention page to read more.
Secondary storage is a way to keep the local drives from reaching capacity. When enabled, LogScale will move segment files to secondary storage once the primary disk reaches whatever limit you set.
Bucket storage is similar to secondary storage, but uses specialized storage from web service providers such as Amazon Bucket Storage and Google Cloud Bucket Storage. It is particularly useful in a cluster because it makes deployment of nodes easier and helps to maintain back-ups in case a node or a cluster crashes.
Ingested logs may be archived to Amazon S3. LogScale won't be able to search that data, but it may be accessed by any external system that integrates with S3.
Before going any further, you should familiarize yourself with LogScale's storage rules, which are covered in the next section.
In LogScale, data is distributed across the cluster nodes. Which nodes store what is chosen randomly. The only thing you as an operator can control is how big a portion is assigned to each node, and that multiple replicas are not stored on the same rack/machine/location (to ensure fault-tolerance).
Data is stored in units of segments, which are compressed files between 0.5GB and 1GB.
A cluster will divide data into partitions (or buckets); we cannot know exactly which partition a given data segment will be put in. Partitions are chosen randomly to spread the data evenly across nodes.
The number of partitions is configurable but that is not important initially — the default number is 48 partitions.
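As a rough illustration of why random assignment spreads data evenly, the sketch below simulates placing segments on partitions uniformly at random. The function name, the 1-based partition IDs, and the use of Python's `random` module are assumptions made for this example; it is not how LogScale itself selects a partition.

```python
import random
from collections import Counter

DEFAULT_PARTITIONS = 48  # default partition count mentioned above


def assign_segments(num_segments, num_partitions=DEFAULT_PARTITIONS, seed=None):
    """Assign each segment to a uniformly random partition.

    Returns a dict mapping partition ID (1-based, an assumption) to the
    number of segments it received.
    """
    rng = random.Random(seed)
    counts = Counter(rng.randrange(num_partitions) + 1 for _ in range(num_segments))
    # Partitions that received nothing still appear, with a count of 0.
    return {p: counts.get(p, 0) for p in range(1, num_partitions + 1)}


if __name__ == "__main__":
    counts = assign_segments(48_000, seed=1)
    print("fewest segments on a partition:", min(counts.values()))
    print("most segments on a partition:  ", max(counts.values()))
```

With many segments, the smallest and largest partitions end up close to each other in size, which is what lets the storage rules control each node's share of the data.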
Configuring Storage Rules
Data is distributed according to the cluster's storage rules. A Storage Rule is a relation between a storage partition and the set of nodes that should store all data written to that partition.
When a digest node completes a data segment file (the internal data unit in LogScale), the segment is assigned to a random storage partition. Here's an example configuration:
Example Storage Rules
In this example the cluster is configured with three storage partitions and four nodes. Nodes 1 and 2 will receive 2/3 of all data written to the cluster, while nodes 3 and 4 only store 1/3 of all data. This is because nodes 1 and 2 archive all data in two of the three partitions (one of them partition 3), while nodes 3 and 4 only archive the data in the remaining partition.
Notice that in the example above there are the same number of nodes per partition. This is because we want a replication factor of 2, meaning that all data is stored on two nodes. If you had a partition with only one associated node, the replication factor would effectively be one for the entire cluster. This is because you cannot know which data goes into a given partition, and it does not make sense to say that a random subset of the data should only be stored in one copy.
If you want fault-tolerance, you should ensure your data is replicated across multiple nodes, physical servers, and geographical locations.
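The arithmetic behind these shares can be sketched in a few lines. The rule set below is hypothetical: it matches the description (three partitions, four nodes, two nodes per partition), but the exact partition-to-node mapping and the function names are assumptions for illustration, not LogScale's API.

```python
from collections import defaultdict
from fractions import Fraction

# Hypothetical storage rules: partition ID -> nodes that store that partition.
STORAGE_RULES = {
    1: [1, 2],
    2: [3, 4],
    3: [1, 2],
}


def node_shares(rules):
    """Fraction of all data each node stores, assuming segments are
    spread uniformly across partitions."""
    per_partition = Fraction(1, len(rules))
    shares = defaultdict(Fraction)  # Fraction() is 0
    for nodes in rules.values():
        for node in set(nodes):
            shares[node] += per_partition
    return dict(shares)


def effective_replication_factor(rules):
    """The partition with the fewest nodes bounds the replication factor
    you can rely on for the whole cluster."""
    return min(len(set(nodes)) for nodes in rules.values())


if __name__ == "__main__":
    for node, share in sorted(node_shares(STORAGE_RULES).items()):
        print(f"node {node}: {share} of all data")
    print("effective replication factor:", effective_replication_factor(STORAGE_RULES))
```

Taking the minimum over partitions reflects the point made above: a single partition with one node lowers the guarantee for the entire cluster, because any segment may land there.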
UI for Storage Nodes
From your account profile menu, select the System Administration page → Cluster nodes tab, then check the storage displayed under Replication:
Figure 288. UI for Storage Nodes
The figure above shows a cluster of 3 nodes where each node is assigned to 2 archive partitions leading to a replication factor of 2.
There is also a tab for Digest Rules, and it is important to understand that the Digest Partitions and Storage Partitions are not related in any way. For example, a Digest Partition with ID=1 does not contain the same data as is written to the Storage Partition with the same ID.
LogScale is capable of storing and searching across huge amounts of data. When Cluster Nodes join or leave the cluster, data will usually need to be moved between nodes to ensure the replication factor is upheld and that no data is lost.
If your system contains very large amounts of data you cannot simply shuffle it around whenever a node leaves or enters the system. That is because moving terabytes or petabytes of data over the network can take a very long time and potentially impact system performance if done at the wrong time.
Data is stored in LogScale according to the cluster's Storage Rules, but when these rules are changed, for example when a storage node fails and is removed from the cluster, data is not automatically redistributed to match the new ruleset.
In other words the Storage Rules only apply to new data that is ingested. This means that data can end up being stored in fewer replicas than the configured replication factor. This is not necessarily a bad thing — it depends on how strict your replication requirements are. You can always redistribute it to match the current rules, but it is done as a separate step from changing rules.
At the top of the Cluster Node Management UI you can see the Storage Divergence indicated. This is in effect the amount of data that would need to be sent between nodes in order to make all the cluster's data conform to the current rules.
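One way to reason about this number is to sum, for every segment, the copies that are missing on the nodes its partition's rule names. The sketch below does that under stated assumptions: the `Segment` type, its field names, and the `storage_divergence` function are hypothetical, and the figure LogScale displays may be computed differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    size_bytes: int
    partition: int
    stored_on: frozenset  # IDs of nodes that currently hold a copy


def storage_divergence(segments, rules):
    """Bytes that would have to be copied so that every segment is stored
    on all nodes named by the current rule for its partition."""
    total = 0
    for seg in segments:
        wanted = set(rules[seg.partition])
        missing = wanted - seg.stored_on
        total += seg.size_bytes * len(missing)
    return total


if __name__ == "__main__":
    # A segment written under a two-node rule, checked against a four-node rule.
    old_segment = Segment(size_bytes=10**9, partition=1, stored_on=frozenset({1, 2}))
    new_rules = {1: [1, 2, 3, 4]}
    print(storage_divergence([old_segment], new_rules), "bytes to transfer")
```

In this model, raising the number of nodes in a rule from two to four means two extra copies of every existing segment in that partition, which is why a rule change on a large cluster can imply a very large transfer.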
Replication Changes Apply Only to New Data
Suppose you have a cluster and want to increase your replication factor to four instead of the current two replicas. This would require having four nodes in each storage rule — which sets the replication factor to four.
Note, this change will only apply to new data entering the system. All existing data will only be kept in two copies. The reason for this is that the increased replication factor would mean that all data in the entire cluster would have to be transmitted between nodes. In a cluster with a large amount of data, this might not be what you want.
A Node is Removed Uncleanly
If you remove a node from the cluster without first handing over the node's data to other nodes, there will be one less copy of whatever data it was holding. In this case the effective data distribution will diverge from the current rules, indicated by the Too Low segment of the replication bar in the Cluster Management UI.
Redistribute Data for Storage Rules
If you want to make your effective data distribution match the current storage rules, you can use the Cluster Management UI. At the bottom of the Storage Rules Panel on the right-hand side of the screen, click Show Options. There you will be offered the option to Start Transfers.
If you click it, the Traffic column of the nodes will show data being shuffled around the cluster. If you make a mistake, you can always revert the rule change and click Start Transfers again, which moves the data back to match the previous rules.
Storage Metrics within the LogScale UI - Cluster Stats
Figure 289. Cluster Stats
These metrics are applicable to both cluster and single node installations.
On the front page of LogScale, there is a Cluster Stats box that gives you information regarding how much data is in the cluster. These storage statistics are meant to represent the searchable data within LogScale. This means they include the compressed ingested data found within LogScale segment files. They also include LogScale's own system data, but not the duplicated or replicated data.
Storage Metrics within the LogScale UI - Cluster Nodes
Figure 290. Cluster Nodes
Under the Cluster Administration tab, you can see the list of LogScale nodes that display Size information: the green part of this bar is 'LogScale Data'; the darker grey is all your other data on that node; and the lighter grey is free space.
In this context, LogScale Data includes the compressed ingested data found within LogScale segment files. It also includes the duplicated or replicated data, but doesn't include LogScale's own system data.
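Because the Cluster Stats box and the Cluster Nodes bars count different things, their totals will usually not match. The sketch below makes the difference explicit. The `StorageBreakdown` type and its field names are hypothetical, and the Cluster Nodes figure is modelled here as the sum of the per-node bars across the whole cluster.

```python
from dataclasses import dataclass


@dataclass
class StorageBreakdown:
    ingested_compressed: int  # bytes of compressed segment data, counted once
    replica_copies: int       # bytes held in additional replicated copies
    system_data: int          # bytes of LogScale's own system data

    def cluster_stats(self):
        """Cluster Stats box: segment data plus system data, replicas excluded."""
        return self.ingested_compressed + self.system_data

    def cluster_nodes(self):
        """'LogScale Data' summed over all nodes: segment data plus replicas,
        system data excluded."""
        return self.ingested_compressed + self.replica_copies


if __name__ == "__main__":
    # Example figures only: 100 units of data with a replication factor of 2.
    breakdown = StorageBreakdown(ingested_compressed=100, replica_copies=100, system_data=5)
    print("Cluster Stats:", breakdown.cluster_stats())
    print("Cluster Nodes:", breakdown.cluster_nodes())
```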
Storage Metrics within the LogScale UI - Cluster Administration
Figure 291. Cluster Administration
On the Cluster Administration tab, at the top of the page under Replication, you can find information regarding your replication. Perfect means the total size of segment files that meet the replication factor. Low is the total size of the segment files that have fewer replicas than the replication factor. Absent means LogScale knows about these segment files but can't find them on any of the nodes.
The total size that is displayed within these boxes includes the compressed ingested data that is found within LogScale segment files. It doesn't include additional duplicated or replicated data. Nor does it include LogScale's own system data.
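The three categories can be expressed as a simple classification on the number of copies found for each segment. This is a minimal sketch, assuming each segment is described by a hypothetical `(size_bytes, replica_count)` pair. Treating segments with more copies than required as Perfect is also an assumption.

```python
def replication_summary(segments, replication_factor):
    """Total segment size per replication status.

    segments: iterable of (size_bytes, replica_count) pairs.
    Each segment's size is counted once, not once per replica.
    """
    summary = {"Perfect": 0, "Low": 0, "Absent": 0}
    for size_bytes, replica_count in segments:
        if replica_count == 0:
            status = "Absent"   # known to the cluster, but found on no node
        elif replica_count < replication_factor:
            status = "Low"      # fewer copies than the replication factor
        else:
            status = "Perfect"  # meets the replication factor
        summary[status] += size_bytes
    return summary


if __name__ == "__main__":
    example = [(100, 2), (50, 1), (25, 0)]
    print(replication_summary(example, replication_factor=2))
```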
Google Cloud Bucket Storage with Workload Identity
LogScale supports using Workload Identity for bucket storage and for exporting query results to a bucket, rather than an explicit service account for Google Cloud Storage access.
To enable it, use the following configurations for bucket storage and export, respectively.
With these options enabled, the container service account will be used for authentication rather than static keys. This configuration is recommended as the best and most secure practice, therefore it takes precedence over the usage of
The account applied for export requires the