Ingestion: Storage Phase

The final phase of ingestion is the storage of the data on disk so that it can be searched and queried by the standard search and query mechanisms. The storage phase within LogScale handles multiple processes related to the recording of data:

  • Store the incoming event data on disk, in the segment files built during Ingestion: Digest Phase

  • Replicate individual segment files between LogScale nodes

  • Cache data (including reloading data from bucket storage)

  • Run historical searches by executing queries on local and cached (reloaded) data

LogScale optimizes the storage of ingested data, organizing it by the timestamp of individual events and by the segment files the events are written to. The most recent data is stored on faster, local disks, and older information is stored on secondary or bucket storage:

  • LogScale storage tiering works with three different storage targets:

    • A primary local disk. Typically this should be an SSD or equivalent virtual storage device

    • An optional secondary local disk. This could be a larger, slower SSD, spinning drive, or array.

    • Bucket storage, such as Amazon S3 or Google Cloud Storage. When bucket storage is configured, LogScale will automatically store the data on both the local and bucket storage so that there are additional copies of the information

    LogScale manages the data on these disks by automatically moving the data between the different storage locations. Primary storage is used for incoming data. If a query is executed against older data that only exists on bucket storage, the relevant segments are retrieved from the bucket store and stored locally to perform the search.

  • LogScale can optionally set a retention period on the stored data. This automatically removes data that exceeds the configured retention period. Since log and security data is most useful when querying and processing recent events, this allows older information to be automatically removed, optimizing the speed and performance of queries against the active data.

  • LogScale uses compression to efficiently store as much information as possible, even accounting for the replication and retention of multiple copies of segment files.
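
To illustrate why compression is so effective here, the following is a minimal, self-contained Python sketch; it uses zlib purely for demonstration and says nothing about the compression algorithm LogScale actually uses:

    import zlib

    # Repetitive, structured log lines compress very well, which is how multiple copies of
    # segment files can be retained without a proportional increase in disk usage.
    events = "\n".join(
        f'2024-01-01T12:{i % 60:02d}:00Z host=web-{i % 4} level=INFO msg="request served" status=200'
        for i in range(10_000)
    ).encode()

    compressed = zlib.compress(events, 6)
    print(f"raw={len(events):,} bytes, compressed={len(compressed):,} bytes, "
          f"ratio={len(events) / len(compressed):.1f}x")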

A logical configuration on a LogScale node is shown below:

%%{init: {"flowchart": {"defaultRenderer": "elk"}} }%%
graph LR
    subgraph SN ["Storage Node"]
        PS["Primary Storage"]
        SS["Secondary Storage"]
    end
    BS["Bucket Storage"]
    style SS stroke-dasharray: 5 5;
    PS --OldEvents--> SS --OldEvents--> BS
    PS <--Query--> BS

Data is moved between these different storage tiers automatically, as summarized in the table below:

| When Full                | With no additional storage | With Local Secondary Disk | With Bucket Storage     |
|--------------------------|----------------------------|---------------------------|-------------------------|
| Primary Local Storage    | Data is expired            | Moved to secondary disk   | Moved to bucket storage |
| Secondary Local Storage  | Data is expired            | -                         | Moved to bucket storage |

Data is always ordered by the timestamp of the events, and a sketch of the expiry logic follows the examples below. For example:

  • If the primary local store fills up, and secondary storage is not configured, but bucket storage is configured, then:

    • Data will have been written to bucket storage when the segment files are completed

    • The oldest events on the local disk are removed.

  • If the primary local store fills up and neither secondary nor bucket storage is configured:

    • The oldest events on the local disk are removed.
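
A minimal Python sketch of this expiry order is shown below. The data structures and function are illustrative only, not LogScale internals; it assumes that, when bucket storage is configured, a segment is only dropped locally once it has been uploaded:

    from dataclasses import dataclass

    @dataclass
    class Segment:
        segment_id: str
        newest_event_ts: int     # epoch seconds of the newest event in the segment
        size_bytes: int
        in_bucket_storage: bool  # True once the segment has been written to bucket storage

    def free_local_space(local_segments, bucket_storage_configured, bytes_needed):
        """Drop the oldest local segments (by event timestamp) until enough space is freed."""
        freed = 0
        # Oldest events are always removed first.
        for seg in sorted(local_segments, key=lambda s: s.newest_event_ts):
            if freed >= bytes_needed:
                break
            # With bucket storage configured, only segments already uploaded are dropped locally;
            # without secondary or bucket storage, the oldest data is simply expired.
            if bucket_storage_configured and not seg.in_bucket_storage:
                continue
            local_segments.remove(seg)
            freed += seg.size_bytes
        return freed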

Bucket Storage

Bucket storage refers to any external object-based storage system, including Amazon S3, Google Cloud Storage and MinIO (used in on-premise data centers). These systems are designed to be highly available and cost efficient, and their use with LogScale is recommended.

There are two main reasons for using bucket storage:

  • Have data available in an external system. This is great for disaster recovery and robustness.

  • Supports storage of more data than would be possible, practical or economical using only local disks.

Using bucket storage configures LogScale so that the most recent (by timestamp) data is stored in segments on local storage, and older data is stored in bucket storage. The Global Database keeps a record of which segment covers which timespan of information, and where it is located:

%%{init: {"flowchart": {"defaultRenderer": "elk"}} }%%
block-beta
    columns 6
    Local:2 space Bucket:3
    B6["12:14:00"] B5["12:09:00"] B4["..."] B3["12:06:00"] B1["12:00:00"] B2["12:04:00"]
    style B4 fill:none,stroke:#fff;

For example, using a 16-node cluster with local 4TB disks, the maximum local storage capacity would be 64TB. Using bucket storage, only the most recent 64TB of data would be stored on the local disks, with a theoretically unlimited amount stored in bucket storage.

Using bucket storage, a LogScale cluster can be considered as a stateless interface to the underlying data. Cluster nodes can be seen as hot caches holding data that is also available in bucket storage. This also provides an advantage for high availability and resilience of the cluster; if all the data is automatically stored in bucket storage, then recovering a node, or even the whole cluster, is possible using the stored bucket data.

Bucket Storage Processes

Bucket storage implies some specific rules when handling data. The decision to use bucket storage can be made at the repository level, allowing for different repositories to make use of the additional capacity that bucket storage offers.

There are three changes that occur when Bucket Storage has been enabled for a repository:

  • Writing Segments

    %%{init: {"flowchart": {"defaultRenderer": "elk"}} }%%
    graph LR
        subgraph DN ["Digest"]
            D["Digest"]
        end
        subgraph SN ["Storage"]
            PS["Local Storage"]
        end
        BS["Bucket Storage"]
        D --Write Segment--> PS
        D --Write Segment--> BS

    When Bucket Storage has been configured, segments are written to both local storage and bucket storage. Since bucket storage implies additional network and time overhead, writing to both locations allows time for the segment to be written to bucket storage before the local copy of the data would be removed.

  • When Local Disk Fills Up

    Provided the segments have been written to bucket storage, the local copy is removed from each node on which it is stored. The segment will be marked in the Global Database as existing on the configured bucket storage.

  • During a Query

    Because data is organized into segment files according to timestamp, and queries are executed according to a time span, LogScale can determine whether the data being searched exists in segments stored on local or bucket storage. If the data is determined to be in a segment on bucket storage, LogScale automatically requests the segment from bucket storage so that the data can be queried. The recovery of data from bucket storage implies a delay during the query process, but allows searches over a far larger volume of data than local disks could hold. A sketch of this lookup is shown after this list.
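
Below is a sketch of how a query's time span might be mapped to local and bucket-resident segments; the record layout and function names are illustrative only, not the actual Global Database schema:

    from dataclasses import dataclass
    from typing import List, Tuple

    @dataclass
    class SegmentInfo:
        segment_id: str
        start_ts: int          # oldest event timestamp covered by the segment
        end_ts: int            # newest event timestamp covered by the segment
        on_local_disk: bool
        in_bucket_storage: bool

    def plan_query(segments: List[SegmentInfo], query_start: int, query_end: int) -> Tuple[list, list]:
        """Split the segments overlapping the query time span into local and to-be-fetched."""
        local, fetch_from_bucket = [], []
        for seg in segments:
            if seg.end_ts < query_start or seg.start_ts > query_end:
                continue  # segment does not overlap the queried time span
            if seg.on_local_disk:
                local.append(seg)
            elif seg.in_bucket_storage:
                # Must be downloaded (and cached locally) before it can be searched.
                fetch_from_bucket.append(seg)
        return local, fetch_from_bucket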

From a user perspective, all of this is transparent: users are unaware of the processing behind the scenes, or of the shuffling and marshalling of data between locations.

Data Retention

LogScale allows you to configure the rules for data retention. For each repository it is possible to specify:

  • Retention based on time. For example, set the retention to keep data for only 30 days

  • Retention based on original data size. For example, allow for 1TB of incoming data

  • Retention based on the storage size, i.e. the compressed data size of the stored segment files.

All three retention settings can be set in combination, and if any of these limits is reached, old data (according to the event timestamp) will be removed.

The data retention settings are completely independent of the management of data between different storage targets. Movement (or removal) of data between, say, primary and secondary storage is based on the available storage space and is designed to make the best use of it. It's possible for data to be removed due to retention regardless of the available storage space. A sketch of how the three retention limits might be combined is shown below.
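
The following simplified Python sketch combines the three limits; the function, parameters and default values are illustrative only, not LogScale configuration settings:

    def retention_exceeded(age_days, original_bytes, stored_bytes,
                           max_age_days=30,                       # e.g. keep data for 30 days
                           max_original_bytes=1_000_000_000_000,  # e.g. allow 1TB of incoming data
                           max_stored_bytes=None):                # limit on compressed segment size
        """Return True if any configured retention limit has been reached."""
        if max_age_days is not None and age_days > max_age_days:
            return True
        if max_original_bytes is not None and original_bytes > max_original_bytes:
            return True
        if max_stored_bytes is not None and stored_bytes > max_stored_bytes:
            return True
        return False

    # If any limit is reached, the oldest data (by event timestamp) is removed first.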

Retention Considerations

Retention settings can be challenging due to the inconsistent nature of log events. It is possible for data volumes to suddenly spike due to increased activity or a system failure, and this can cause systems to generate large volumes of logs compared to when systems are running normally.

It is possible for spikes and fluctuations in recent data to trigger the deletion of older data faster than might be expected when retention is based on size.

Choosing the right retention is about finding a balance between the different retention configurations that neither fills up disks nor stores too much in bucket storage (increasing costs):

  • Relying on incoming log volumes is subject to variations in the quantity of logs generated; a large spike might cause older data to be removed earlier than needed

  • Using the size of the stored data is complex, because the compression factor can change based on the complexity of the incoming data.

  • Using the retention time does not allow for consideration of the size of the incoming and stored data.

Advice on configuring retention differs depending on the deployment environment.

For Cloud Customers

For cloud customers, only retention based on time is supported. This ensures that the customer has a fixed expectation of how long data will be stored and can be queried. This also allows customers to manage retention for the data they need to keep, and to satisfy auditing requirements, without having to worry about storage capacity. We provide a number of dashboards to enable the customer to monitor their ingest usage.

For On-Prem Customers

For on-premise installations, retention settings are the preserve of the cluster administrators. Retention can be configured for each repository according to its needs and the available storage.

Retention Deletion Process

When data is deleted due to a retention limit being reached, only whole segment files can be deleted, and only once all of the events they contain exceed the retention limit. This may mean that some data which is past retention cannot yet be deleted because it shares a segment with data that is still within the configured timespan, and this in turn may cause LogScale to use more disk space than expected because the segments cannot be deleted.
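
For example, with a 30-day retention period a segment becomes eligible for deletion only once its newest event is more than 30 days old. A minimal sketch of that check is shown below (the function and field names are illustrative, not LogScale internals):

    import time

    def segment_deletable(newest_event_ts: float, retention_days: float, now: float = None) -> bool:
        """A whole segment can be deleted only when even its newest event exceeds retention."""
        now = time.time() if now is None else now
        return (now - newest_event_ts) > retention_days * 86_400

    # A segment whose oldest events are past retention but whose newest events are not
    # must be kept in full, which is why disk usage can exceed expectations.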

Deletion during retention can also be postponed if bucket storage has been configured. This allows for data to be marked for removal during the retention processing, and removed from search results, but for data to be recovered before being fully deleted. For more information, see Configuration.

LogScale has different measures to avoid physically filling up the disk. The process can create some difficulties when handling the combination of retention and disk space management. When using bucket storage it is possible to specify how full local disks are allowed to be (for example 80%). When this disk utilisation is reached, the cluster will start deleting local segment files, since the data exists in bucket storage. When disk usage hits 95%, the node will stop accepting new data.

LogScale makes a best effort attempt to manage the segment files within these limits, but in the event of a sudden spike in ingest volume it is possible for disk usage to go above the limits and consume all the available local storage. This can happen if the ingest volume outpaces the ability to upload the segment files to bucket storage so that they can be deleted. For more information on how LogScale manages this, see Caching with Bucket Storage.

Caching with Bucket Storage

When running a cluster using bucket storage, LogScale uses two parameters to configure the approach for deciding which data to expire from the local disks:

  • Data will expire when disks become more full than the configured limit (default 80%). Configured via the LOCAL_STORAGE_PERCENTAGE configuration variable.

  • A second parameter configures how much of the newest data LogScale should try to keep local. This can be configured according to how many days of data LogScale should try to keep on local disks. Use the LOCAL_STORAGE_MIN_AGE_DAYS configuration variable.

When deciding which data to remove, LogScale will try to keep the data from the configured number of most recent days. After that, a Least Recently Used (LRU) strategy (using the event timestamp) is applied.
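
A simplified sketch of this decision is shown below. LOCAL_STORAGE_PERCENTAGE and LOCAL_STORAGE_MIN_AGE_DAYS are the configuration variables described above, but the surrounding structures, function and eviction details are illustrative only:

    import time

    LOCAL_STORAGE_PERCENTAGE = 80    # start evicting local copies above this disk usage (%)
    LOCAL_STORAGE_MIN_AGE_DAYS = 7   # try to keep at least this many days of recent data local

    def choose_evictions(segments, disk_usage_pct, now=None):
        """Pick bucket-backed segments to evict from local disk, oldest event data first."""
        now = time.time() if now is None else now
        if disk_usage_pct <= LOCAL_STORAGE_PERCENTAGE:
            return []
        min_age_cutoff = now - LOCAL_STORAGE_MIN_AGE_DAYS * 86_400
        candidates = [
            s for s in segments
            if s["in_bucket_storage"]                  # never evict the only copy of a segment
            and s["newest_event_ts"] < min_age_cutoff  # keep the most recent days local
        ]
        # Least recently used first, approximated here by the event timestamp as described above.
        return sorted(candidates, key=lambda s: s["newest_event_ts"])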

Important

The number of days to keep on local storage is set globally, not by repository.

As a general approach, LogScale should be configured so that it avoids keeping data locally for more time than there is actually disk space for. LogScale will always try to remove data that is over the storage limit.

When configuring your cluster, it is a good idea to have more disk space available than required for the retention period. This allows LogScale to use the remainder of the disk space for the LRU cache of data that needs to be stored or loaded from bucket storage during search. When data is fetched from bucket storage, it is often searched multiple times until the user has finished the use case. For this to work well, data should be kept local in the LRU cache.

Data Replication and High Availability

LogScale replicates data between nodes in the cluster by copying the segment files between nodes. The data is replicated for the purpose of high availability. More copies means more resilience, since a node can be lost without losing access to the data.

To achieve this, LogScale uses a configuration parameter to set the replication factor. This sets the number of copies of data that will be created when a segment file is stored. Higher numbers increase redundancy and increase the number of nodes that can fail before the data is lost.

Replication can also help to improve performance. With multiple nodes holding a copy of the same segment data, a search can be performed on the same data simultaneously across multiple nodes.

If LogScale is configured to use bucket storage, then the overall resilience is increased because a copy of every segment file is automatically stored in bucket storage.

The recommendations for improving redundancy within LogScale are:

  • Set a replication factor of at least 2, and preferably 3.

  • Enable bucket storage

  • If deployed within a cloud environment, use fast ephemeral storage and configure bucket storage; ephemeral disks can be lost but the data should always be available in bucket storage.

Keep in mind that during ingest, LogScale relies heavily on Kafka's high availability and durability. There is always a possibility that LogScale will lose data if Kafka loses data during the ingestion process.

To further enhance reliability within a cloud environment like AWS, LogScale supports explicitly stating the cloud zones to use when replicating data. With this setting, LogScale will replicate data across datacenters and availability zones so that information is not lost if a single zone within the cloud environment fails. This feature relies on Kafka being configured with the same multi-zone configuration. Using this model requires at least 3 zones (datacenters) to survive losing a datacenter, as a quorum of nodes is required to be online. For more information on configuring Kafka, see Kafka Architecture Documentation.

LogScale also supports different replication factor configurations for digest and storage:

  • The replication factor for digest configures how many replicas to keep for the mini segments on digest nodes before they are merged into full segments.

  • The replication factor for storage specifies the number of replicas for segments on storage nodes.

Higher replication factors for digest cause LogScale to be more robust to failures on digest nodes that hold the most current (unmerged) segment data. Typically there is much more data on the storage nodes that hold the merged segment files, and therefore it is more expensive to keep the replication factor for storage high.

If you are using bucket storage, you can reduce the replication factor for storage nodes because the segment files will be automatically stored within the bucket store.

LogScale attempts to keep the replication factor constant and to distribute the data evenly across the cluster. If a node goes offline, the other machines in the cluster will start replicating the under-replicated segments after a configured time interval. For example, with a replication factor of two, two machines in the cluster can fail in succession, provided the data was replicated to another node in the cluster before the second node failed.
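
As a rough sketch of what the replication factor means in practice, the following illustrative Python snippet assigns each segment to a set of nodes; the round-robin placement is a simplification, not LogScale's actual distribution algorithm:

    def place_segments(segment_ids, nodes, replication_factor):
        """Assign each segment to `replication_factor` distinct nodes, spread round-robin."""
        placement = {}
        for i, seg in enumerate(segment_ids):
            placement[seg] = [nodes[(i + r) % len(nodes)] for r in range(replication_factor)]
        return placement

    nodes = ["node-1", "node-2", "node-3", "node-4"]
    print(place_segments(["seg-a", "seg-b", "seg-c"], nodes, replication_factor=2))
    # With a replication factor of 2, any single node can be lost without losing access to a
    # segment; bucket storage adds a further copy outside the cluster.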