Data Ingestion Guidelines

In LogScale, everything starts with ingestion. The method of ingestion can impact how logs are presented to LogScale and the requirements of the parser. Therefore, a package can often be dependent on a particular method of ingestion.

The parser included in a package should aim to be as agnostic as possible to the method of ingestion, but this is not always practical or possible. Packages in the marketplace need to clearly specify their recommended ingest mechanism.

Preferred Ingest Methods

Different systems generate logs in many different ways and have different options for getting logs into external systems. The general guidance below applies in most cases and can help you choose the best option for ingesting logs into LogScale. In some instances there may be good reasons to deviate from these recommendations, based on peculiarities of the system generating the logs or other factors.

In general, three mechanisms are in common use. They are summarised below, ordered by simplicity of setup and ongoing management.

  1. Push Logs Directly to LogScale

    The system we want logs from can send the logs directly to LogScale.

  2. Push with a Log Collector

    The system we want logs from can send them to some customer managed location for staging, and the logs are then pushed to LogScale via a LogScale Log Collector.

  3. Pull From Remote System and Push with LogScale Log Collector

    The system we want logs from only provides logs when they are requested (e.g. via an API). The logs need to be actively pulled to a customer managed location for staging, and are then pushed to LogScale via a LogScale Log Collector. This is often needed when collecting logs from SaaS solutions.

Push Logs Directly to LogScale

In general, if a system has the built-in ability to push logs to an external log management platform, the two options are usually a Splunk HEC (HTTP Event Collector) interface or an Elastic Bulk Import API.

Pushing the logs is ideal as it removes the need for any additional systems between the source and LogScale. This makes it easier to deploy and manage the ingestion of logs to LogScale.

LogScale supports ingest APIs that are broadly compatible with, and work in the same way as, Splunk HEC and Elastic Bulk. This means that if a product can push logs to Splunk HEC or Elastic Bulk, it can nearly always be configured with the relevant LogScale service details and used to ingest into LogScale. See the Ingestion Endpoints documentation for details of all LogScale ingestion endpoints.

When considering which API to use for systems that can push logs, the options available in the source system are the main constraint. From a performance perspective, the LogScale HEC interface outperforms Elastic Bulk and should be used where available.
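As a sketch of what a direct HEC-style push can look like (the endpoint path, host, and token here are assumptions; check your LogScale instance's documentation for the exact values), a minimal sender in Python might be:

```python
import json
import urllib.request

# Hypothetical values -- substitute your LogScale host and ingest token.
LOGSCALE_URL = "https://example.logscale.host/api/v1/ingest/hec"
INGEST_TOKEN = "YOUR-INGEST-TOKEN"

def build_hec_event(message: str, source: str = "myapp") -> bytes:
    """Serialise a single log line into a Splunk-HEC-style event body."""
    event = {
        "event": message,
        "source": source,
        "sourcetype": "custom",
    }
    return json.dumps(event).encode("utf-8")

def send_event(message: str) -> int:
    """POST one event; only an HTTP 200 response means it was ingested."""
    req = urllib.request.Request(
        LOGSCALE_URL,
        data=build_hec_event(message),
        headers={
            "Authorization": f"Bearer {INGEST_TOKEN}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status
```

The same event body can usually be produced by the source system's own Splunk HEC output; the point of the sketch is the shape of the request, not this particular script.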

Push with a Log Collector

Many systems have the ability to write logs to a local destination (often in syslog or similar formats). To transport these logs to LogScale, a log collector is required. Though many collectors are available, a package should use the LogScale Log Collector unless there is a compelling reason not to.

The LogScale Log Collector supports a wide range of host operating systems and collection configuration possibilities. See Falcon LogScale Collector for more details.
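As an illustration only (the field names and sink type below are assumptions, not the authoritative schema; consult the Falcon LogScale Collector documentation for the exact format), a minimal collector configuration that tails a file and ships it to LogScale might look like:

```yaml
# Illustrative sketch only -- field names and values are assumptions,
# not the authoritative Falcon LogScale Collector schema.
sources:
  example_app:
    type: file
    include: /var/log/example-app/*.log
    sink: logscale
sinks:
  logscale:
    type: humio            # assumed sink type
    url: https://example.logscale.host
    token: ${INGEST_TOKEN}
```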

If the LogScale Log Collector does not meet your needs or cannot be used (e.g. in serverless architectures), see the Recommendations for Using Your Own Collector for more information, and reach out to us if you still want to proceed with a different collector than the LogScale Log Collector.

Pull From Remote System and Push with LogScale Log Collector

If a system cannot push logs to either LogScale itself or a LogScale Log Collector, then an additional component is required to pull the logs from the remote system. This usually requires a custom script to authenticate to the remote system and pull the logs from an API.

A LogScale Log Collector is then used to send the logs to LogScale. See Push with a Log Collector for more information.
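A minimal sketch of this pull-and-stage pattern, assuming a hypothetical SaaS API (`API_URL`, `API_KEY`, and the JSON response shape are all illustrative, not a real vendor's interface):

```python
import json
import pathlib
import urllib.request

# Hypothetical values -- adapt to the vendor's actual API.
API_URL = "https://api.example-saas.com/v1/logs"
API_KEY = "YOUR-API-KEY"

def fetch_logs(since: str) -> list:
    """Pull logs newer than `since` from the remote API (assumed JSON list)."""
    req = urllib.request.Request(
        f"{API_URL}?since={since}",
        headers={"Authorization": f"Bearer {API_KEY}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def stage_logs(events: list, staging_file: pathlib.Path) -> int:
    """Append events as one JSON object per line, for the Log Collector to tail."""
    with staging_file.open("a", encoding="utf-8") as f:
        for event in events:
            f.write(json.dumps(event) + "\n")
    return len(events)
```

The staging file is then picked up by a LogScale Log Collector file source, exactly as in the push scenario above.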

Due to the complexity of managing additional components, this mechanism of collecting logs is generally the least preferred, but is sometimes the only option available.

Recommendations for Using Your Own Collector

When sending logs to LogScale, it may be tempting to employ custom scripting implemented either as a serverless function (e.g. AWS Lambda) or as a server-based script running on a managed machine. We strongly recommend running a LogScale Log Collector on a managed machine if you have the option to do so, instead of building your own log collector mechanism, but we also recognize that this is not always viable.

If you cannot use the LogScale Log Collector for some reason, we've gathered a list of different concerns that log collectors generally need to deal with.

  • Making sure logs are ingested

When sending logs to the LogScale API, the logs should only be considered ingested when an HTTP 200 OK response is received back. Other responses may occur for a variety of reasons: a proxy may return a redirect that needs to be followed, a request may be corrupt, or LogScale itself may be overloaded or unreachable.

If LogScale cannot receive data for some reason, it's reasonable for a collector to slow down its sending and retry any logs that don't make it in, to avoid dropping data. To achieve this, a collector needs to be able to keep a buffer of logs to send. If logs are being read from a file, that includes tracking which logs have been read from the file and which have not.

  • Being observable

When logs are missing by the time they should have reached LogScale, it can be hard to tell whether the program that generated them simply didn't behave as expected and never produced them, or whether they were dropped somewhere along the way. This makes it important that a log collector is observable, so you can tell if it's doing something unexpected.

    For example, since log collectors often read from log files while those files are also being written to, concurrency can easily become a source of logs being sent in a corrupted state or getting dropped completely.

  • Transportation

When sending logs to LogScale, it's important to follow the guidelines for request sizes and frequency of requests, and to make sure that transportation is happening securely and cost efficiently, with solid encryption and compression. See Ingest via API Best Practices for more information.

  • Cost

Given the concerns so far, writing a robust log collector is not necessarily easy, and can require non-trivial development time to get right. But cost is not just about creating and maintaining a collector; it's also about running it. Here, creating a serverless function to use as a collector can be alluring, because for certain workloads it can be cheaper to run than a full virtual machine. But this depends a lot on how often data needs to be moved.

The cost of serverless functions scales very well with work that happens in bursts, so if the logs arrive in bursts too, it may make sense. But if the logs arrive in a constant stream, having a full virtual machine running continuously may be cheaper.

Additionally, some of the concerns listed here require keeping some form of state to implement robustly. This can make them a bit more tricky to implement in a serverless function, since serverless functions are ephemeral. On the other hand, serverless functions can scale better if the amount of logs suddenly spikes and the system that's pulling the logs can't keep up.
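The "making sure logs are ingested" concern above can be sketched as a retry loop with exponential backoff, where only an HTTP 200 lets a batch leave the buffer. Here `send_fn` is a hypothetical stand-in for whatever performs the actual HTTP POST and returns the status code:

```python
import time

def send_with_retry(send_fn, batch, max_attempts: int = 5,
                    initial_delay: float = 1.0) -> bool:
    """Send a batch, retrying with exponential backoff until HTTP 200.

    `send_fn(batch)` is a hypothetical stand-in for the actual HTTP POST;
    it must return the response status code.
    """
    delay = initial_delay
    for _ in range(max_attempts):
        if send_fn(batch) == 200:
            return True  # only now is the batch safe to drop from the buffer
        time.sleep(delay)
        delay = min(delay * 2, 30.0)  # back off, capped at 30 seconds
    return False  # keep the batch buffered and try again later
```

A real collector would also persist the buffer (and, for file sources, the read offset) so that a crash between attempts doesn't lose data.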