LogScale Collector Sizing Guide

The numbers in this guide are based on measurements and experience from running the LogScale Collector in production. However, the size your LogScale Collector instances actually need depends on your workloads, and we recommend testing to determine it.

Minimum Resource Recommendations

In the case where the LogScale Collector is used on a laptop or desktop to gather system logs, the requirements are modest and the service running in the background should not be noticeable.

An example of such a setup could consist of:

  • The System and Application channels from the Windows Event Log source

  • Log files from your VPN

  • A cmdsource measuring the system's resource usage

In a scenario like this we recommend these resources as a minimum:

| Resource | Recommendation |
| --- | --- |
| Memory | 4 GB |
| Disk | 4 GB |

Note

These numbers are conservative to account for peak buffer and queue usage. During normal operation with a working network connection, the actual memory consumption in a scenario like the one above would be below 100 MB.

Scaling

Generally speaking, the concurrency model behind the LogScale Collector automatically takes advantage of the system's CPU resources.

Source Throughput

Each source has different performance characteristics. The throughput numbers are based on measurements, but they will vary depending on your actual workload.

| Source | Throughput | Notes |
| --- | --- | --- |
| File | 154 MB/s/vCPU | Throughput of the file source is bound by disk and/or network I/O. This measurement was done with AWS io1 disks (64000 IOPS). |
| Journald | 32 MB/s | |
| Syslog (TCP) | 100 MB/s/vCPU | The vCPUs are only utilized when multiple TCP connections are sending data to the LogScale Collector. |
| Syslog (UDP) | 26 MB/s | The throughput is with UDP packets of size 1472 bytes. |
| Windows Event Logs | 5 MB/s | Measured average of around 3000 events/s. Currently the WinEventLog source does not scale automatically with the number of vCPUs. To improve throughput, isolate high-load channels to their own source in the configuration. |

1 vCPU = 1 ARM physical CPU or 0.5 Intel physical CPU with hyper-threading.

Sink Workers

In some high throughput scenarios the LogScale ingestion endpoint can be a bottleneck, meaning that the measured throughput of a LogScale Collector deployment is lower than expected given the table above.

In those cases it can be beneficial to increase the number of concurrent requests a sink is using to ship logs towards the LogScale ingestion endpoint.

The default number of concurrent requests per sink is 4. It can be increased in the configuration using workers:

```yaml
sinks:
  my_sink:
    type: humio
    url: <..>
    token: <..>
    # Increases number of concurrent connections to LogScale to 8
    workers: 8
```

It should only be necessary to increase the number of workers when the bottleneck is the number of parallel requests. This can happen when an expensive parser is being used, causing the ingest requests to take longer.

The throughput of a sink is constrained by the time per request in the following function: maxBatchSize * workers * 1/timePerRequest. If the machine running the LogScale Collector is not the bottleneck, and LogScale has the capacity to process more requests in parallel, then the number of workers should be increased.

Note

Each worker keeps an internal buffer, starting at 16 MB per worker, which it uses to serialize requests. Therefore, increasing the number of workers also puts additional memory pressure on the LogScale Collector. If a larger pool of workers is specified than necessary, the LogScale Collector will use more memory than necessary.
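
As a rough illustration of the trade-off described above, the following Python sketch evaluates the throughput function for a given worker count, together with the minimum buffer memory those workers hold. The function and parameter names are hypothetical and not part of the LogScale Collector. The 16 MB values are the figures quoted in this guide, and since buffers only start at 16 MB, the memory figure is a lower bound.

```python
def sink_estimate(workers, time_per_request_s, max_batch_mb=16.0, buffer_mb=16.0):
    """Return (throughput ceiling in MB/s, minimum worker buffer memory in MB).

    Implements maxBatchSize * workers * 1/timePerRequest from the text.
    All names here are illustrative assumptions, not Collector settings.
    """
    if workers < 1 or time_per_request_s <= 0:
        raise ValueError("workers must be >= 1 and time_per_request_s must be > 0")
    throughput = max_batch_mb * workers / time_per_request_s
    min_buffer = workers * buffer_mb  # each worker starts with one buffer
    return throughput, min_buffer


# Compare the default of 4 workers with 8 workers at a 600 ms response time.
for workers in (4, 8):
    throughput, buffers = sink_estimate(workers, time_per_request_s=0.6)
    print(f"{workers} workers: up to {throughput:.1f} MB/s, "
          f"at least {buffers:.0f} MB of worker buffers")
```

Doubling the workers doubles both the theoretical ceiling and the buffer memory, which is why the worker count should be raised only when parallel requests are the actual bottleneck.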

Sink Workers Example

How many workers to use in any situation depends on the response time per request of the LogScale server, which in turn depends on the parser used, whether requests are going to an on-prem or SaaS solution, the server configuration, and so on.

| Description | Value |
| --- | --- |
| Goal | 11 TB/day = 139 MB/s |
| Measured server response time | 600 ms |

Using the default and recommended batchSize of 16 MB, the theoretical limit per worker in this example is: 1/0.600 s * 16 MB = 26.67 MB/s.

Thus, the number of workers should be: 139/26.67 = 5.2, rounded up to 6 workers.

This calculation is based on the assumption that data can be read fast enough from the source.
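
The worker calculation above can be sketched in Python as follows. The function name and parameters are hypothetical and chosen for illustration; the inputs are the goal and response time from this example.

```python
import math


def workers_needed(target_mb_per_s, response_time_s, batch_size_mb=16.0):
    """Smallest worker count whose combined limit reaches the target rate.

    Assumes each worker sends one full batch per request, so a single
    worker tops out at batch_size_mb / response_time_s.
    """
    per_worker_mb_per_s = batch_size_mb / response_time_s
    return math.ceil(target_mb_per_s / per_worker_mb_per_s)


# Goal of 139 MB/s with a measured server response time of 600 ms.
print(workers_needed(139, 0.600))
```

Re-running the function with a freshly measured response time is a quick way to check whether an existing workers setting still matches the target rate.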

Memory

The memory requirement is linearly proportional to the number of sinks in the configuration plus a constant baseline requirement of 1 GB.

It should not be necessary to increase the default memory queue size. The purpose of the memory queue is to ensure that data is always readily available to the sink, so that the LogScale Collector can always be actively ingesting. Increasing the queue size does not increase the throughput of the sink. If the throughput of the sink is lower than the rate at which data is being collected, the queue will eventually fill up.

The default queue size per sink is 1 GB and can be increased (or decreased) in the configuration:

```yaml
sinks:
  my_sink:
    type: humio
    token: <..>
    url: <..>
    # Increases queue size to 2 GiB
    queue:
      type: memory
      maxLimitInMB: 2048

  another_sink:
    type: humio
    token: <..>
    url: <..>
```

The configuration above therefore has a total memory requirement of 1 GB (baseline) + 2 GB (my_sink) + 1 GB (another_sink) = 4 GB.
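
The same sum can be expressed as a small Python sketch. The constant and function names are hypothetical; the 1 GB baseline and 1 GB default queue size are the figures from this guide, expressed here as 1024 MB.

```python
BASELINE_MB = 1024        # constant baseline requirement from this guide
DEFAULT_QUEUE_MB = 1024   # default memory queue size per sink


def total_memory_mb(sink_queue_limits_mb):
    """Estimate the memory requirement for a list of sinks.

    Each entry is that sink's queue limit in MB, or None when the
    sink uses the default queue size.
    """
    queues = [DEFAULT_QUEUE_MB if q is None else q for q in sink_queue_limits_mb]
    return BASELINE_MB + sum(queues)


# my_sink has a 2048 MB queue, another_sink uses the default.
total = total_memory_mb([2048, None])
print(f"{total} MB = {total / 1024:.0f} GB")
```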

Back-filling

A running LogScale Collector that can deliver logs continuously to LogScale would not normally use the resources listed above. However, some situations can cause log data to pile up, for instance if a machine is without an internet connection for a while but still generates logs.

In such a scenario the LogScale Collector will back-fill the log data when an active internet connection is re-established. The internal memory buffers will fill up for efficient log shipping, and the utilization of the queue could reach 100% (by default, this limit is 1 GB per sink).

In addition, if the LogScale Collector is unable to deliver the logs to the server fast enough or not at all, a large amount of memory could potentially be used.

For instance, if the LogScale Collector is tasked with back-filling 1000 large files, data will potentially be read into the system faster than it can be delivered to the LogScale server, and in such an example the memory usage would rise to: 1 GB (baseline) + 1 GB (sink) + 1000 * 16 MB (internal buffer per file, one batch size) = 18 GB.
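
A minimal Python sketch of this worst-case estimate is shown below. The function and parameter names are hypothetical; the defaults are the baseline, queue, and per-file buffer figures quoted above, with 1 GB taken as 1024 MB for the baseline and queue.

```python
def backfill_peak_memory_mb(open_files, sinks=1, baseline_mb=1024,
                            queue_mb=1024, batch_mb=16):
    """Worst-case memory in MB while back-filling many files at once.

    Assumes one batch-sized internal buffer per file being read, plus a
    completely full memory queue for every sink.
    """
    return baseline_mb + sinks * queue_mb + open_files * batch_mb


peak = backfill_peak_memory_mb(open_files=1000)
print(f"{peak} MB, roughly {peak / 1000:.0f} GB")
```

The per-file term dominates quickly, so the number of files read concurrently matters far more than the number of sinks in this scenario.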

Disk

Disk size is only relevant if the disk queue is used. Whether the disk queue makes sense depends on the deployment setup.

For instance, the disk queue is unnecessary if the LogScale Collector is able to read the data back from a source after an interruption. This is the case for these sources: Windows Event Logs, journald and file sources. All of these use a bookmarking system to keep track of how far data has been read and processed.

So, essentially, the disk queue only makes sense for sources where such a bookkeeping system is impossible, which at the moment is only the syslog source.

When using the disk queue, it is usually sufficient to keep 10 minutes' worth of data. So, if data flowing through a LogScale Collector deployment is averaging 40 MB/s, you should provision at least 24 GB of disk space (40 MB/s * 60 s/min * 10 min = 24,000 MB).
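
The disk sizing rule above can be sketched in Python as follows. The function name and the retention_minutes parameter are hypothetical; the 10-minute default is the rule of thumb from this guide, and 1 GB is taken as 1000 MB to match the 24 GB figure.

```python
def disk_queue_size_gb(avg_mb_per_s, retention_minutes=10):
    """Disk space in GB needed to hold retention_minutes of data."""
    return avg_mb_per_s * 60 * retention_minutes / 1000


print(disk_queue_size_gb(40))
```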

Example Deployments

Make sure your LogScale deployment is provisioned accordingly and meets the requirements for the ingestion amount. See Installing LogScale.

Large Syslog (TCP) Deployment - 10 TB/Day
  • 10 TB/Day = 121.4 MB/s

  • (121.4 MB/s) / (100 MB/s/vCPU) = 1.21 vCPUs, rounded up to 2 vCPUs

  • Recommended m6i.xlarge with 4 vCPUs to account for spikes in traffic and possible backpressure from network

Table: Large Syslog Source

| Software | Instances | EC2 Instance Type / vCPU | Memory | Storage |
| --- | --- | --- | --- | --- |
| LogScale Collector | 1 | m6i.xlarge / 4 | 16 GB | gp2 |

Medium Windows Event Logs Deployment - 1 TB/Day
  • By isolating the ForwardedEvents channel to its own source in the configuration, it is possible to get a throughput of roughly 10 MB/s on an instance.

  • 1 TB/Day = 12.14 MB/s

  • (12.14 MB/s) / (10 MB/s/instance) = 1.2 instances, rounded up to 2

Table: Medium Windows Event Source

| Software | Instances | EC2 Instance Type / vCPU | Memory | Storage |
| --- | --- | --- | --- | --- |
| LogScale Collector | 2 | m6i.large / 2 | 16 GB | gp2 |

Large File Source Deployment - 100 TB/Day
  • 100 TB/Day = 1214 MB/s

  • (1214 MB/s) / (154 MB/s/vCPU) = 7.9 vCPUs, rounded up to 8.

  • Since 1214 MB/s is more than the max throughput of AWS io1 volumes of 1000 MB/s, we go with two instances.

Table: Large File Source

| Software | Instances | EC2 Instance Type / vCPU | Memory | Storage |
| --- | --- | --- | --- | --- |
| LogScale Collector | 2 | m6i.xlarge / 4 | 16 GB | io2 |
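
The arithmetic in the three example deployments can be reproduced with the Python sketch below. The helper names are hypothetical. The conversion assumes 1 TB = 1024 * 1024 MB, which is the convention that matches the MB/s figures in this section, and the per-unit throughput values are taken from the source throughput table and the Windows Event Logs example.

```python
import math

SECONDS_PER_DAY = 86_400
MB_PER_TB = 1024 * 1024  # assumption: binary units, matching the figures above


def tb_per_day_to_mb_per_s(tb_per_day):
    """Convert a daily ingest volume in TB to an average rate in MB/s."""
    return tb_per_day * MB_PER_TB / SECONDS_PER_DAY


def units_needed(tb_per_day, mb_per_s_per_unit):
    """Number of vCPUs (or instances) needed, rounded up to a whole unit."""
    return math.ceil(tb_per_day_to_mb_per_s(tb_per_day) / mb_per_s_per_unit)


examples = [
    ("Syslog (TCP), 10 TB/day", 10, 100, "vCPUs"),
    ("Windows Event Logs, 1 TB/day", 1, 10, "instances"),
    ("File source, 100 TB/day", 100, 154, "vCPUs"),
]
for name, tb_per_day, per_unit, unit in examples:
    rate = tb_per_day_to_mb_per_s(tb_per_day)
    print(f"{name}: {rate:.1f} MB/s, {units_needed(tb_per_day, per_unit)} {unit}")
```

These figures are averages. The recommendations above add headroom for traffic spikes and backpressure, which is why the instance sizes are larger than the computed minimum.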