Data Ingestion Overview

Data ingestion and its setup are an essential step in using and configuring LogScale. How an organization approaches both is determined by its needs and its tools, platforms, applications, and the data itself. Whichever option is selected, the process is designed to incorporate log data as quickly as possible and with as few blockers as possible, using several key tools and components. LogScale then queues and stores the data, ready to be queried. After installing LogScale, logs are brought into a centralized repository with a user-selected method.

Logs are ingested according to a user's needs and the data itself, and the method used for ingest depends on factors such as operating system (OS), log format, source(s), and more. The options for ingest include:

  • AWS S3 Bucket(s)

  • Azure Feed(s)

  • Falcon LogScale Collector and Other Log Shippers

  • LogScale Ingest Tokens

  • LogScale API

  • Falcon Data Replicator (FDR) Feeds

Note

It's important to note that LogScale is designed primarily for live data; historical, and therefore static, data has a different set of considerations and requirements: see Backfilling Data.

Ingest Methods

LogScale supports a variety of ingest methods and associated data, including the following:

  • Falcon LogScale Collector and Other Log Shippers

  • AWS S3 Bucket(s)

  • Azure Feed(s)

  • LogScale Ingest Tokens

  • LogScale API

  • Falcon Data Replicator (FDR) Feeds

Falcon LogScale Collector and Other Log Shippers

Log shippers are system tools that gather data from a server and send it to LogScale for analysis. They are built to support seamless data transfer and account for common problems that impact reliability and consistent performance. A user's application writes logs to a log file; the log shipper then reads and pre-processes the data before shipping it using one of LogScale's Ingest APIs.
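To make the read-and-ship step concrete, here is a minimal Python sketch of how a shipper might batch log lines into an ingest request. The host name, token value, and endpoint path are hypothetical placeholders; the payload shape loosely mirrors an unstructured ingest body, but verify the exact format against the Ingest API reference before relying on it.

```python
import json

# Hypothetical values -- substitute your own host and ingest token.
INGEST_URL = "https://example.logscale.host/api/v1/ingest/humio-unstructured"
INGEST_TOKEN = "YOUR-INGEST-TOKEN"

def build_batch(lines, fields=None):
    """Shape raw log lines into an unstructured-ingest payload body."""
    event = {"messages": lines}
    if fields:
        event["fields"] = fields  # static fields attached to every message
    return json.dumps([event])

def build_headers(token):
    """Ingest requests authenticate with the ingest token as a bearer token."""
    return {"Authorization": f"Bearer {token}",
            "Content-Type": "application/json"}

# A shipper tails the log file, batches lines, and POSTs the payload:
lines = ["2024-05-01T12:00:00Z app started",
         "2024-05-01T12:00:01Z user login"]
payload = build_batch(lines, fields={"host": "web-01"})
headers = build_headers(INGEST_TOKEN)
# ...then POST `payload` with `headers` to INGEST_URL (urllib, requests, etc.)
```

Batching lines rather than sending one request per event is what lets shippers stay reliable under load.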

The Falcon LogScale Collector is capable of supporting multiple data sources, where a data source is a point from which the data itself is collected. Each data source offers different capabilities, all of which are geared towards the unique needs of users and their respective organizations.

Falcon LogScale Collector currently supports the following inputs or data sources:

  • Collecting Events from Files

  • Windows Events

  • Syslog Receiver

  • Exec Input

  • SystemD Logs on Linux

  • macOS Unified Logs

Third-Party Log Shippers

Log shippers allow users to transfer log files and metrics reliably. The benefits include retransmission if the transfer fails, and batched messages.

Note

Third-party log shippers are generally recommended ONLY when a user's data is not supported by Falcon LogScale Collector, or the user's toolset already includes another tool that performs similar tasks.

Ingest API

LogScale's Ingest API can be useful in a variety of cases where there are challenges or restrictions to a user's capabilities. Use cases are defined more fully within the documentation.

Elasticsearch Bulk API

Elasticsearch's bulk API makes it possible to perform many index/delete operations in a single API call, greatly increasing the indexing speed.
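As a sketch of what a bulk request body looks like, the helper below builds the newline-delimited format the Bulk API expects: one action line followed by one document line per event. The index name and documents are illustrative; the actual endpoint URL and authentication depend on your deployment.

```python
import json

def bulk_body(index, docs):
    """Build an Elasticsearch-style _bulk request body: one JSON action
    line followed by one JSON document line per event, newline-delimited."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"  # the body must end with a newline

body = bulk_body("weblogs", [{"msg": "app started"}, {"msg": "user login"}])
# POST `body` to the ES-compatible bulk endpoint with
# Content-Type: application/x-ndjson and your ingest token for auth.
```

Because many operations travel in one request, the per-call overhead is paid once per batch instead of once per event.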

Note

Ingest API and Elasticsearch Bulk API are generally recommended ONLY when a user is attempting to integrate logging with one of their own internal tools or software, and Falcon LogScale Collector is therefore redundant.

Amazon Web Service (AWS) Simple Storage Service (S3) Bucket(s)

AWS log data represents a powerful opportunity to gain insight into an organization's data. LogScale allows users to ingest and manage AWS log types from S3 buckets via Amazon Simple Queue Service (SQS), then generate alarms, alerts, and queries.

Common AWS Services
  • AWS VPC flow

  • CloudTrail

  • CloudWatch

Logs from AWS sources are written to S3 buckets, and LogScale is notified of new data via SQS. Ingest via the SQS queue continues at scale, with an ingest schedule that reflects the number of incoming messages. This configuration does have some latency, dependent on factors from both the originator of the event log and from the user. If reconfiguration is necessary, scaling will be reset accordingly, which is also a consideration for latency.
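To illustrate the SQS side of this flow, the sketch below unpacks a notification announcing new S3 objects. The message body here is a made-up example; real S3 event notifications use a similar "Records" shape with bucket name and object key, but confirm the exact schema against the AWS documentation for your notification configuration.

```python
import json

# Hypothetical SQS message body announcing new objects in an S3 bucket.
notification = json.dumps({
    "Records": [
        {"s3": {"bucket": {"name": "org-logs"},
                "object": {"key": "cloudtrail/2024/05/01/events.json.gz"}}}
    ]
})

def objects_to_fetch(message_body):
    """Extract (bucket, key) pairs from an SQS-delivered S3 notification,
    so the consumer knows which objects to download and ingest."""
    records = json.loads(message_body).get("Records", [])
    return [(r["s3"]["bucket"]["name"], r["s3"]["object"]["key"])
            for r in records]
```

Polling the queue and fetching only the announced objects is what lets ingest scale with the number of incoming messages rather than with bucket size.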

Prerequisites
  • AWS access and knowledge proficiency

  • A log source that is properly configured according to the documentation

  • Access to LogScale

  • Appropriate permissions

For more information, see the documentation here: Ingest Data from AWS S3

Azure Feed(s)

LogScale continuously polls the Azure Event Hub, ingests data, and scales the ingest process based on the number of partitions configured in the event hub. LogScale users can ingest and manage logs from Azure Event Hubs and then leverage the result with queries, alerts, and alarms. Latency does occur between an event occurring and its general availability, introduced both by Azure Monitor itself and by the user's configuration.

Prerequisites

The prerequisites for this configuration, including required access and permissions, are covered in the documentation linked below.

For more information, see the documentation here: Ingest Data from Azure Event Hubs

LogScale Ingest Tokens

Ingest tokens are unique strings that work in conjunction with endpoints to identify a repository and allow users to send data to that repository with appropriate authentication.

The ingest token allows LogScale to identify the repository that the data will be ingested into, and the parser that will be used to extract the fields and data from the original log files.
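Conceptually, the token is the key the server uses to look up where data should land and how it should be parsed. The sketch below is purely illustrative of that server-side resolution; the token values, repository, and parser names are all made up.

```python
# Illustrative only: how an ingest token might resolve to a repository and
# parser on the server side. Token values and names are hypothetical.
TOKEN_REGISTRY = {
    "tok-web-frontend": {"repository": "weblogs", "parser": "accesslog"},
    "tok-web-backend":  {"repository": "weblogs", "parser": "json"},
}

def resolve(token):
    """Return the (repository, parser) pair an ingest token is bound to,
    or reject the request if the token is unknown."""
    entry = TOKEN_REGISTRY.get(token)
    if entry is None:
        raise PermissionError("unknown ingest token")
    return entry["repository"], entry["parser"]
```

Note that both tokens above route to the same repository with different parsers, which is exactly why a repository can hold multiple tokens.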

Important

Ingest tokens are tied to a repository, not a user. This provides a better way of managing access control and is more convenient for most use cases. For example, if a user leaves the organization or project, you do not need to re-provision all agents that send data with a new token. You also do not have to create fake user accounts. However, it's important to remember that because ingest tokens are tied to a repository and not a user, ingest tokens can't be used to query LogScale, read data, or log in.

Repositories can also have multiple ingest tokens. This helps route data and associate parsers, and since ingest tokens are tied to the repository rather than a specific user, they offer better access management.

Ingest tokens can also be used for different ingest methods. For more information, see Third-Party Log Shippers and Ingest API.

To create a Repository Ingest Token, see Generate a Repository Ingest Token.

LogScale Ingest API

The LogScale Ingest API is an opportunity to ingest data with unique requirements from unique situations. Users can think of this as an alternative to Falcon LogScale Collector, with several use cases:

  • When Falcon LogScale Collector is not currently supported, either due to system or platform requirements

  • When users don't have control over the formatting of event messaging

  • When users require backward compatibility with Splunk tools/scripts/collectors

  • When users have existing OpenTelemetry feeds to ingest into LogScale

  • When users need compatibility with tools that use Elasticsearch's Bulk API

Ingest APIs, ingest tokens, and HTTP endpoints work in conjunction with one another to provide a user with secure data that can be used confidently and consistently. Together they allow data to be ingested from other data sources or tools, directly from the output of other databases, or as part of a customer's custom application. The Ingest API can be used directly and/or through one of the LogScale-provided APIs or software libraries.
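For events where each record carries its own timestamp and fields, a structured payload is the natural shape. The sketch below assembles one; the tag and attribute names are illustrative, and the exact body layout should be checked against the Ingest API reference rather than taken from this sketch.

```python
import json

def structured_payload(tags, events):
    """Shape events for a structured ingest request: each event carries
    its own timestamp and attribute fields, grouped under shared tags."""
    return json.dumps([{"tags": tags, "events": events}])

# Illustrative tags and attributes for a custom application's events.
payload = structured_payload(
    tags={"host": "web-01", "source": "checkout"},
    events=[{
        "timestamp": "2024-05-01T12:00:00Z",
        "attributes": {"action": "purchase", "user": "alice"},
    }],
)
# POST `payload` to the structured ingest endpoint with the ingest token.
```

Structured ingest skips parsing entirely, since the fields arrive already extracted, which suits output generated by the customer's own software.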

For more information about ingest tokens, see the documentation here: Ingest Tokens

See the Ingest API reference page for more information. For a list of supported software, see the Software Libraries section in the Appendix.

Falcon Data Replicator (FDR) Feeds

FDR feeds are a service that provides events and third-party data in JSON format using Amazon S3 and Simple Queue Service (SQS). Unlike the other options offered, FDR feeds perform regular data transfers rather than supporting a continuous data stream. FDR feeds can also be used to store and access data with unique requirements, send Falcon data to other tools, and more.

Considerations

There are several considerations to examine when ingesting data using FDR feeds, including:

  • Users must have a subscription to both Falcon Data Replicator and Falcon Insight XDR

  • FDR feeds do not include all CrowdStrike API data and do not replicate all CrowdStrike data

  • FDR is a data source that transfers data on a regular basis; there is no real-time option, which might not be sufficient for a user's needs.

Users who employ an FDR feed can send compressed, batched data on a scheduled basis to an S3 bucket, and are then notified using an SQS queue. Batch sizes can vary, and batch data more than 7 days old is automatically deleted from S3 buckets provided by CrowdStrike (customer-owned buckets aren't subject to this retention period). The data a user can retrieve is therefore dependent on these factors and the retention period.
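The 7-day window on CrowdStrike-provided buckets can be sketched as a simple retention check over batch timestamps. The timestamps below are made-up examples; a real consumer would derive them from the SQS notifications or object metadata for its feed.

```python
from datetime import datetime, timedelta, timezone

# CrowdStrike-provided buckets delete batch data older than 7 days.
RETENTION = timedelta(days=7)

def still_available(batch_timestamps, now):
    """Return the batch timestamps still inside the retention window of
    a CrowdStrike-provided S3 bucket (customer buckets are exempt)."""
    return [ts for ts in batch_timestamps if now - ts < RETENTION]

now = datetime(2024, 5, 10, tzinfo=timezone.utc)
batches = [datetime(2024, 5, 9, tzinfo=timezone.utc),   # 1 day old: kept
           datetime(2024, 5, 1, tzinfo=timezone.utc)]   # 9 days old: deleted
```

A consumer that falls more than a week behind on a CrowdStrike-provided bucket therefore loses data permanently, which is the main operational risk to plan around.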

For more information, see the documentation here: Ingesting FDR Data into a Repository

Parsing Data

Logs that are sent to LogScale for ingestion must be parsed before being stored in a repository, which is achieved with the use of parsers. Parsers enrich the incoming data and are composed of both a script and parser settings. The script is written in CrowdStrike Query Language (CQL) and defines the transformations required to turn incoming events into searchable, or more easily searchable, events.

Parser settings are intended to configure and adjust certain actions, like assigning events to a particular datasource (Event Tags) or removing certain fields to optimize operational costs. When data is parsed, it is put on a Kafka ingest queue and an acknowledgement is returned in the response to a client.

Parsers take data and extract fields that are then stored along with the original text. Data can be structured or unstructured, and users must specify what parser is to be used and in which repository the data is to be stored.
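The extract-and-keep-original behavior can be sketched in a few lines. This is a conceptual illustration in Python, not CQL: the line format, field names, and regular expression are all assumptions chosen for the example, with only @rawstring and @timestamp echoing the field names the document discusses.

```python
import re

# Assumed line format: "<timestamp> <level> <message>" -- illustrative only.
LINE = re.compile(r"^(?P<ts>\S+) (?P<level>\w+) (?P<msg>.*)$")

def parse(raw):
    """Sketch of what a parser does: extract fields from the raw text and
    store them alongside the original line, kept as @rawstring."""
    event = {"@rawstring": raw}
    m = LINE.match(raw)
    if m:
        event["@timestamp"] = m.group("ts")
        event["level"] = m.group("level")
        event["message"] = m.group("msg")
    return event

event = parse("2024-05-01T12:00:00Z INFO app started")
```

Even when extraction fails, the event survives with its @rawstring intact, so no data is lost to an unexpected format.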

Parser Types

There are three options for a parser:

  • Pre-built parsers

  • Built-in parsers

  • Custom parsers

Pre-built Parsers

A number of parsers are built into LogScale and automatically available; these include parsers for syslog, key-value, and JSON formats. For ingesting third-party data, LogScale supports a number of packages that include parsers, dashboards, and corresponding widgets and alerts.

These packages provide integration with third-party systems and enable you to ingest and parse data without having to write a custom parser. Those parsers normalize ingested data to the CrowdStrike Parsing Standard (CPS, commonly referred to as Pa-Sta), a common set of fields that allows data from different third-party systems, such as firewalls or security systems, to be normalized so that it can be queried in a consistent and common form. These packages and their respective parsers can be found in LogScale's Package Marketplace, where the available options are organized first by vendor, then by application or platform. These packages and their parsers have been built by CrowdStrike to reliably ingest data into LogScale for quick and easy processing.

If the standard parsers or integrated parsers do not support the data that is being ingested, a custom parser can be written. Custom parsers use the CrowdStrike Query Language, the same language used for writing queries, so data can be extracted and manipulated into the desired format as it is ingested.

For more information, see the documentation here: Pre-built Parsers from the Marketplace and Package Marketplace

Built-in Parsers

Built-in parsers are provided by LogScale to parse common log formats, especially widely-used formats like accesslog, which is used for web servers like Apache and Nginx. Each parser is available via the UI; to access them, visit Repository > Parsers. To see how they work, test data can be used to show the result of a user's selection. Note that it is important to check the supported regular expression and timestamp formats to ensure best results.

Built-in parsers are also a great place to start before creating a custom parser. Many built-in parsers will meet the required need, and if not, they can provide guidance and context when creating a new parser: users can copy or clone an existing parser, then update it with customizations to meet their specific needs.

LogScale provides the following parsers:

  • accesslog

  • audit-log

  • corelight-es

  • corelight-json

  • json

  • json-for-action

  • kv

  • kv-generic

  • kv-millis

  • serilog-jsonformatter

  • syslog

  • syslog-utc

  • zeek-json

Each parser is designed to account for both common and unique factors associated with parsing data and common tools within a LogScale customer's toolset. The JSON and kv (key-value) parsers are particularly practical solutions for a variety of different log files, because the log file output of many tools and software aligns with these formats. The syslog parser will handle the bulk of the system-level and some application-level content on Linux.
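To show why kv-style parsing is so broadly applicable, here is a conceptual Python sketch of splitting key=value pairs out of a line. This illustrates the idea only; it is not the built-in kv parser's implementation, and the sample line and field names are made up.

```python
def parse_kv(raw):
    """Sketch of kv-style parsing: split key=value pairs out of a log
    line into fields, keeping the original text as @rawstring."""
    event = {"@rawstring": raw}
    for token in raw.split():
        if "=" in token:
            key, _, value = token.partition("=")
            event[key] = value.strip('"')  # drop surrounding quotes
    return event

event = parse_kv('action=login user="alice" status=ok')
```

Because the field names travel inside the data itself, the same logic works unchanged across tools that emit key=value output.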

For more information, see the documentation here: Built-in Parsers

Custom Parsers

Custom parsers are a recommended solution for users who have not found a suitable option within the pre-built and built-in parser options already available via LogScale. Each parser needs a parser script and appropriate settings to ensure the user's desired result. Users should also keep in mind the level of transformation required for the data to be usable.

Parser Script Goals

The goal for a parser script is to:

  • Extract the correct timestamp from the event

  • Set the fields you want to use frequently in your searches

Because the timestamp is used to find results within a certain time frame, ensuring the timestamp is present as part of the parser script is essential. This can be achieved by assigning the timestamp within your data to the @timestamp field. In certain cases, the timestamp is automatically assigned to the field @collect.timestamp; this happens when a user ingests data using Falcon LogScale Collector.
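The assignment step can be sketched as converting the event's own timestamp string into @timestamp. This is a conceptual Python illustration, not CQL; the source field name ("time") and the choice of epoch milliseconds are assumptions for the example.

```python
from datetime import datetime

def assign_timestamp(event, ts_field="time"):
    """Copy the event's own timestamp into @timestamp (epoch
    milliseconds here), so time-range searches can find the event.
    The source field name is illustrative."""
    iso = event[ts_field].replace("Z", "+00:00")  # make the string ISO-parseable
    parsed = datetime.fromisoformat(iso)
    event["@timestamp"] = int(parsed.timestamp() * 1000)
    return event

event = assign_timestamp({"time": "2024-05-01T12:00:00Z", "msg": "started"})
```

If this step is skipped, the event is still stored, but it lands at whatever time it was ingested rather than when it actually happened.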

Another field that should be considered is @rawstring. Because @rawstring contains the original text of the event, having it present allows users to view an event in its original form and in its entirety. @rawstring also doesn't have a required format, making it easier to parse incoming data without ever referring to the field explicitly in the script. Many functions used for parsing will default to using @rawstring if no field is specified.

Setting fields in the parser is an optional step, but one that every user should consider carefully. Because fields can also be extracted at search time, the parser doesn't have to set every field you want to use- however, searching on fields that have been set by the parser is easier when thinking about it in terms of writing queries, and also provides better performance for search speed. The individual fields necessary for a parser will depend on the needs of the user.

For more information, see the documentation here: Parse Data, Built-in Parsers, CrowdStrike Parsing Standard (CPS) 1.1, and Custom Parsers.