Ingestion

Ingesting data in LogScale takes each raw log line from the client logs and parses the content before the information is written into a repository. LogScale can be queried using the raw log lines, but the log lines can also be parsed to extract key information, for example the source IP address or an error message. Once the data has been extracted into individual fields, the data can be queried using these specific fields to filter, correlate or summarize data.

Immutability of Data

Once ingested, the data in LogScale is immutable. Data in a repository can only be deleted under certain conditions and with specific elevated privileges:

  • By time — Data is automatically purged at the end of the designated retention period. See Data Retention.

  • By manual deletion of the repository — A user with sufficient permissions can delete an entire repository. See Delete a Repository or View.

  • By API — A user with specific privileges and administrative power over a repository can use the Redact API to remove specific data. See Redact Events API.

All of the above actions can only be performed by authorized users with the specific mentioned permissions tied to specific repositories.

Parsing the data also allows log files from different systems to be normalized into the same structure (for example, Apache HTTP Server, Microsoft IIS, and NGINX access logs into a common format), and allows the information to be augmented or formatted for easier processing.

Ingestion processes each incoming log line by parsing it to extract key information, and then creating an event from the parsed data.

Parsing Log Data

Parsing and ingestion of data converts each line from each log into a single discrete event.

Looking again at the original NGINX HTTP log line:

accesslog
47.29.201.179 - - [28/Feb/2019:13:17:10 +0000] "GET /?p=1 HTTP/2.0" 200 5316 "https://domain1.com/?p=1" "Mozilla/5.0 (Windows NT 6.1)"

The structure of the data contains a lot of information that we can parse and extract:

Value                              Description
47.29.201.179                      IP Address
-                                  Auth
-                                  Username
[28/Feb/2019:13:17:10 +0000]       Timestamp
GET /?p=1 HTTP/2.0                 HTTP Request
200                                HTTP Response Code
5316                               Response Size
https://domain1.com/?p=1           Referrer
Mozilla/5.0 (Windows NT 6.1)       Client
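As an illustration of the kind of extraction a parser performs, the line can be broken apart with a regular expression using named capture groups. This is a hypothetical Python sketch (the pattern and group names are illustrative), not LogScale's actual parser:

```python
import re

# Hypothetical pattern for an NGINX combined access log line; the group
# names are illustrative, not LogScale's built-in field names.
PATTERN = re.compile(
    r'(?P<ip>\S+) (?P<auth>\S+) (?P<user>\S+) '
    r'\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) HTTP/(?P<version>[\d.]+)" '
    r'(?P<status>\d+) (?P<size>\d+) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

line = ('47.29.201.179 - - [28/Feb/2019:13:17:10 +0000] '
        '"GET /?p=1 HTTP/2.0" 200 5316 '
        '"https://domain1.com/?p=1" "Mozilla/5.0 (Windows NT 6.1)"')

event = PATTERN.match(line).groupdict()
# The raw line is always kept alongside the parsed fields.
event["@rawstring"] = line
```

Each named group becomes a field of the resulting event, while the original line is retained untouched.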

Events

The LogScale parser can process this raw log line and create an event, a data structure that contains a number of distinct fields:

Field Value
@rawstring 47.29.201.179 - - [28/Feb/2019:13:17:10 +0000] ...
@timestamp 28/Feb/2019:13:17:10 +0000
method GET
version 2.0
status 200
size 5316
url https://domain1.com/?p=1
user-agent Mozilla/5.0 (Windows NT 6.1)

An event is the smallest fragment of data in LogScale. Events are the basis of all storage and queries. Collections of events are stored in repositories. When querying data, the query is executed on a sequence of those events across a given time range.

Each event may have a different set of fields (a different schema), and this is valid. For example, the HTTP log example contains no authentication or user information, but other lines in the log could contain it. LogScale does not use or require a fixed schema for storing the data, and you do not need to define the data structure, validation, or indexes before the data can be ingested.

A single repository may therefore contain log data from multiple sources, in different formats and with different event structures. This flexibility enables you to query multiple log files simultaneously, and the query language provides a powerful mechanism for filtering and formatting the data.
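Because no fixed schema is enforced, events with different field sets can live in the same repository and be filtered by whatever fields they happen to carry. A minimal Python sketch (the field names and values are hypothetical):

```python
# Events from different sources, each with its own set of fields.
events = [
    {"status": "200", "url": "/?p=1", "#type": "nginx"},
    {"status": "404", "url": "/missing", "#type": "nginx"},
    {"level": "ERROR", "message": "disk full", "#type": "syslog"},
]

# Filtering considers only events that actually carry the field;
# events without it are simply not matched.
errors_4xx = [e for e in events if e.get("status", "").startswith("4")]
```

The syslog-style event has no status field at all, yet it coexists with the HTTP events and is silently skipped by the filter.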

LogScale also stores the original text of each log line in @rawstring. The parser may extract specific fields of information without handling every format variation; to avoid losing data because of those differences, the original log line is stored and can be queried and processed when searching.

Types of Event Field

There are typically three types of fields in any given event:

  • Metadata fields contain additional information about the event such as the time when the data was ingested, or the source file or host of the data.

  • Tag fields provide a powerful mechanism for classifying data that also influences search performance.

  • User fields contain the information provided and/or parsed from the original source data.

Metadata Fields

Each event has some metadata attached to it on ingestion; all events will have an @id, @timestamp, @timezone, and @rawstring field.

Metadata fields start with the @ symbol.

The two most important are @timestamp and @rawstring, which are described in detail below.

Tag Fields

Tag fields start with the # character.

Tag fields are used to define how events have been parsed and how they are physically stored and distributed. Tags can influence the speed and performance of queries by controlling how LogScale distributes data across the hosts in the cluster.

Users can associate custom tags as part of the parsing and ingestion process, but their use is usually very limited. For example, the built-in tag #repo contains the repository name, and #type stores the name of the parser used to process the original log file.

User Fields

Any field that is not a tag or metadata field is a user field. User fields are extracted at ingest by a parser, using character-separated value parsing, JSON parsing, or regular expressions to identify key parts of each line.
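The three extraction styles can be sketched in Python (the sample lines and field names here are hypothetical, chosen only to illustrate each technique):

```python
import json
import re

# Character-separated values: split on a delimiter.
csv_line = "2019-02-28,GET,200"
date, method, status = csv_line.split(",")

# JSON: parse the line directly into fields.
json_line = '{"method": "POST", "status": 201}'
json_fields = json.loads(json_line)

# Regular expression: named capture groups become fields.
raw_line = "status=500 path=/api"
regex_fields = re.match(
    r"status=(?P<status>\d+) path=(?P<path>\S+)", raw_line
).groupdict()
```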

LogScale represents the original text of the event in the @rawstring attribute.

By keeping the original data, nothing is thrown away during ingestion. In fact, the role of parsing is to extract and augment information rather than remove or simplify it. With the original text available, you can run free-text searches across all logs and extract fields after the content has been located and filtered. Parts of the data you did not even know would be important can be identified and selected at the point of querying, eliminating the need to understand or account for every variant of log file structure.
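Keeping @rawstring means a field that was never parsed at ingest can still be pulled out at query time: first filter by free text, then extract. A hypothetical Python sketch (the log lines and the session field are invented for illustration):

```python
import re

# Raw lines kept verbatim at ingest; nothing about "session" was parsed.
rawstrings = [
    "login ok user=alice session=42",
    "logout user=bob",
    "login ok user=carol session=77",
]

# Step 1: free-text filter across all raw lines.
hits = [line for line in rawstrings if "login ok" in line]

# Step 2: extract a field only discovered to matter at query time.
sessions = [re.search(r"session=(\d+)", line).group(1) for line in hits]
```

The session field did not exist as a parsed field, yet it is fully recoverable from the stored raw text.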

You can read more about free-text search and extracting fields in the search documentation.

The timestamp of an event is represented in the @timestamp field. This field defines where the event is stored in LogScale's database and is what defines whether an event is included in search results when searching a time range.

The timestamp needs special treatment when parsing log lines during ingestion.

The timestamp of when an event was ingested is represented in the @ingesttimestamp field. The value is milliseconds-since-epoch. Searches can restrict the data they search using this timestamp.

Within the UI or through an API, the time span can be configured as part of the search criteria. Alternatively, a query can be written to explicitly cover a range, for example using:

logscale
@ingesttimestamp > X AND @ingesttimestamp < Y
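The same range restriction can be sketched in Python, treating @ingesttimestamp as milliseconds since the epoch (the events and the bounds standing in for X and Y are hypothetical):

```python
# Events carrying their ingest time in milliseconds since epoch.
events = [
    {"@ingesttimestamp": 1551360000000, "msg": "older"},
    {"@ingesttimestamp": 1551363600000, "msg": "in range"},
    {"@ingesttimestamp": 1551367200000, "msg": "newer"},
]

lower = 1551362000000   # stands in for X
upper = 1551366000000   # stands in for Y

in_range = [e for e in events
            if lower < e["@ingesttimestamp"] < upper]
```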
Field #repo

All events have a special #repo tag that denotes the repository that the event is stored in. This is useful in cross-repository searches when using views.

Field #type

The type field is the name of the parser used to ingest the data. A single repository can have multiple parsers so that the repository can ingest different types and formats of data.