Understanding Data Flows

To understand the different components and layout of the LogScale cluster, it can be useful to think of the different flows of data in the system.

Ingesting Data Flow

    graph LR;
      L1[Log shipper]
      IN[[Ingest Nodes]]
      KQ[Kafka Queue]
      DN[Digest Nodes]
      SN[Storage Nodes]
      BS((Bucket Storage))
      L1 --Ingest Data--> IN
      IN --> KQ
      KQ --> DN
      DN --Merged Segments--> SN
      SN <--Segments--> BS

When ingesting data from an external source, the flow of events into LogScale is as follows:

  1. A log shipper, such as the Falcon Log Collector, or a raw API request sends data to the ingest nodes (see the API sketch below).

  2. Ingest nodes parse the incoming data, identifying fields or ignoring information. Parsers can be customized to enable parsing and organizing different input formats of data, for example raw text, JSON, or XML, into explicit fields within LogScale. For flexibility, parsing uses the same language that is used when querying data.

    The data is compressed ready for submission to the Kafka queue; because the data is already compressed, LogScale configures the corresponding Kafka queue not to compress it again.

  3. Parsed data is sent to a Kafka queue for processing.

  4. Digest nodes take the incoming data from the Kafka queue, then organize and distribute the data to the storage nodes, storing the fragments of data in segments.

  5. Storage nodes store the data on the local disk and, optionally, also store the data in bucket storage. The use of bucket storage enables large volumes of data to be stored, with older data available only in bucket storage. This optimizes access to recent data on faster local SSD or ephemeral storage.

Incoming events are always ingested, parsed, and digested in the order in which they are sent to Falcon LogScale. Events are not reprioritized at any time, even if an individual event contains an older timestamp.
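
As a minimal sketch of step 1, the fragment below sends structured events to the cluster with a raw API request instead of a log shipper. The hostname, ingest token, and endpoint path are illustrative assumptions only; check the LogScale API reference for the exact ingest endpoints available in your deployment.

    # Sketch: submitting events to the ingest nodes with a raw API request.
    import time
    import requests

    LOGSCALE_URL = "https://logscale.example.com"  # hypothetical cluster address
    INGEST_TOKEN = "YOUR-INGEST-TOKEN"             # placeholder repository ingest token

    # Structured events carry a timestamp and explicit fields, reducing the
    # parsing work the ingest nodes need to do before queuing the data.
    payload = [
        {
            "tags": {"host": "web-01", "source": "checkout-service"},
            "events": [
                {
                    "timestamp": int(time.time() * 1000),  # milliseconds since epoch
                    "attributes": {"status": 500, "message": "payment backend timeout"},
                }
            ],
        }
    ]

    response = requests.post(
        f"{LOGSCALE_URL}/api/v1/ingest/humio-structured",  # assumed ingest endpoint
        headers={"Authorization": f"Bearer {INGEST_TOKEN}"},
        json=payload,
        timeout=10,
    )
    response.raise_for_status()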

Querying Data

    graph LR;
      C1[Client]
      C2[Client]
      subgraph "LoadBalancer"
        direction TB
        LB[LB Node]
        LBB[LB Node]
        LBC[LB Node]
      end
      subgraph "LogScale"
        QCN[Query Coordination Nodes]
        SN[Storage Nodes]
      end
      BS((Bucket Storage))
      C1 & C2 --Query Requests--> LoadBalancer
      LoadBalancer --> QCN
      QCN --Internal Query Requests--> SN
      SN <--> BS

When querying data, whether through the UI, an API, or an internal process, the data must be retrieved from the storage nodes, processed, and, if necessary, formatted and summarized:

  1. All queries start with a time period, since all data is stored with a timestamp. The time is either explicit or relative, for example the last 5 minutes or the previous year. The time period identifies whether the data to be queried is available locally on the storage nodes or requires data stored in longer-term bucket storage.

  2. The query is parsed using LQL. Because LQL includes syntax and functions for extracting data from key/value pairs or JSON, or by using regular expressions, this process might include accessing all the raw data and creating fields on the fly, as well as using stored field data.

  3. Data is first filtered, reducing the number of events, then formatted and optionally aggregated. Because this process runs across multiple nodes, the data can be filtered, formatted, and aggregated through a map/reduce process across all the events (see the query sketch after this list).

  4. When aggregating the data, the query coordination nodes collect the information and reassemble the event data as a new series of events.
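
The fragment below is a minimal sketch of this flow from a client's perspective: an LQL string that filters events before aggregating them is submitted over HTTP with a relative time span. The hostname, repository name, token, endpoint path, query string, and response format are illustrative assumptions.

    # Sketch: submitting a query that filters first, then aggregates.
    import requests

    LOGSCALE_URL = "https://logscale.example.com"  # hypothetical cluster address
    API_TOKEN = "YOUR-API-TOKEN"                   # placeholder API token
    REPOSITORY = "web-logs"                        # hypothetical repository name

    query = {
        # Filter first (reducing the number of events), then aggregate:
        "queryString": "#source=checkout-service status>=500 | count(as=errors)",
        "start": "5m",   # relative time span: the last 5 minutes
        "end": "now",
    }

    response = requests.post(
        f"{LOGSCALE_URL}/api/v1/repositories/{REPOSITORY}/query",  # assumed query endpoint
        headers={
            "Authorization": f"Bearer {API_TOKEN}",
            "Accept": "application/json",
        },
        json=query,
        timeout=30,
    )
    response.raise_for_status()

    # The result shape is assumed here to be a JSON array of rows.
    for row in response.json():
        print(row)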

Live Queries

    graph LR;
      LS[Log shipper]
      IN[[Ingest Nodes]]
      KQ[Kafka Queue]
      SE[Save Events to Storage]
      PE[Push Events to Live Queries]
      L1[Live Query 1]
      L2[Live Query 2]
      L3[Live Query 3]
      L4[Live Query 4]
      LS-->IN
      IN-->KQ
      KQ-->PE
      KQ-->SE
      PE-->L1
      PE-->L2
      PE-->L3
      PE-->L4

Live queries are used by the alerting mechanism within LogScale to provide instant notification when a query matches. This is useful when looking for security issues or identifying failures and faults, and it ensures that matching results are identified as soon as possible. If LogScale relied on querying the data after it had been ingested, there could be a delay of seconds before the matching events were identified.

When a live query (or live search) is created, it is managed by a query coordinator that polls the incoming events; events are sent to the query coordinators in parallel, at the same time as they are sent to the digest process.

  1. A log shipper, such as the Falcon Log Collector, or a raw API request sends data to the cluster, where the request is processed by an ingest node.

  2. Ingest nodes parse the incoming data, identifying fields or ignoring information. Parsers can be customized to enable parsing and organizing different input formats of data, for example raw text, JSON, or XML, into explicit fields within LogScale. For flexibility, parsing uses the same language that is used when querying data.

  3. Parsed data is sent to a Kafka queue for processing.

  4. Events are sent to both the Digest Nodes and the Live Queries.

  5. Each Live Query is executed on the incoming events.

  6. Each Query Coordinator polls the live query and processes the result, whether it is part of a Dashboard or an Alert (see the polling sketch after this list).
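
The fragment below sketches the client-side equivalent of this polling pattern: a live query job is created and then polled repeatedly for updated results. The queryjobs endpoint, the isLive flag, and the response fields are assumptions about the general shape of the query job API, alongside the usual placeholder hostname, repository, and token; check the API reference for the exact contract.

    # Sketch: creating a live query job and polling it for updated results.
    import time
    import requests

    LOGSCALE_URL = "https://logscale.example.com"  # hypothetical cluster address
    API_TOKEN = "YOUR-API-TOKEN"                   # placeholder API token
    REPOSITORY = "web-logs"                        # hypothetical repository name
    HEADERS = {"Authorization": f"Bearer {API_TOKEN}"}

    # Create a live query job: matching events are pushed into the job's result
    # as they are ingested, instead of waiting for a later scheduled search.
    job = requests.post(
        f"{LOGSCALE_URL}/api/v1/repositories/{REPOSITORY}/queryjobs",  # assumed endpoint
        headers=HEADERS,
        json={"queryString": "status>=500 | count(as=errors)", "start": "1m", "isLive": True},
        timeout=10,
    )
    job.raise_for_status()
    job_id = job.json()["id"]  # the "id" field is assumed

    # Poll the job, much as a query coordinator polls a live query before handing
    # the result to a dashboard widget or an alert.
    for _ in range(5):
        result = requests.get(
            f"{LOGSCALE_URL}/api/v1/repositories/{REPOSITORY}/queryjobs/{job_id}",
            headers=HEADERS,
            timeout=10,
        )
        result.raise_for_status()
        print(result.json().get("events", []))  # "events" field is assumed
        time.sleep(2)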

Viewing Dashboards

    graph LR;
      C1[Client]
      C2[Client]
      subgraph "LoadBalancer"
        direction TB
        LB[LB Node]
        LBB[LB Node]
        LBC[LB Node]
      end
      subgraph "LogScale"
        QCN[Query Coordination Nodes]
        SN[Storage Nodes]
        UI[UI/API Nodes]
      end
      BS((Bucket Storage))
      C1 & C2 --Query Requests--> LBB
      LoadBalancer --> UI
      UI <--> QCN
      QCN --Internal Query Requests--> SN
      SN <--> BS

Dashboards are composed of individual widgets that display query data in a visual format, such as a bar graph, gauge or simple table. A dashboard is a collection of these widgets that shows a time-synchronized view of the information.

Their operation is a modified version of the Querying Data and Live Queries flows described above.

Dashboards can use either traditional queries or live queries to execute and return their results; live queries provide the most up-to-date information. Viewing a dashboard is an extension of the query process.

When viewing a dashboard:

  1. Each widget is associated with an explicit query that returns data in the right format for display by the widget.

  2. The query for each widget is executed using any supplied user arguments and the same, synchronized time span (see the sketch after this list).
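
The fragment below sketches this behaviour: each widget runs its own LQL query, but every query is executed with the same, synchronized time span. The widget names, queries, endpoint path, hostname, repository, and token are illustrative assumptions.

    # Sketch: running each widget's query with one shared time span.
    import requests

    LOGSCALE_URL = "https://logscale.example.com"  # hypothetical cluster address
    API_TOKEN = "YOUR-API-TOKEN"                   # placeholder API token
    REPOSITORY = "web-logs"                        # hypothetical repository name
    HEADERS = {"Authorization": f"Bearer {API_TOKEN}"}

    # One shared, synchronized time span for every widget on the dashboard.
    TIME_SPAN = {"start": "24h", "end": "now"}

    # Each widget has its own query, shaped to match its visualization.
    widgets = {
        "Errors over time": "status>=500 | timeChart(span=1h)",
        "Top hosts": "groupBy(#host, function=count())",
        "Total requests": "count()",
    }

    for name, query_string in widgets.items():
        response = requests.post(
            f"{LOGSCALE_URL}/api/v1/repositories/{REPOSITORY}/query",  # assumed endpoint
            headers=HEADERS,
            json={"queryString": query_string, **TIME_SPAN},
            timeout=30,
        )
        response.raise_for_status()
        print(name, response.json())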