LogScale Internal Architecture

LogScale is a log management product. The primary goal for LogScale is to ingest and support searching large volumes of timestamped data, typically from text-based logs and analytics data.

This guide to the LogScale architecture provides successive levels of detail about how LogScale works. The first level is an overview that gives you a good grounding in the main components, themes and capabilities. The next level goes into more detail about the specific processes, and the last level provides a detailed description of how LogScale stores and retrieves data.

This guide is split into three broad sections:

  • LogScale Logical Architecture

    Describes the logical architecture of LogScale and how its components process requests, ingest and store data, and run queries and return their results.

  • LogScale Operational Architecture

    Describes the operational architecture of LogScale, including the processes and steps involved in storing, indexing and searching data, and how LogScale parses, ingests, and assembles information so that it can be processed efficiently.

  • LogScale Physical Architecture

    Describes the physical components of a typical LogScale cluster and how they work together to support the LogScale operational and logical features.

To start, read the Design Principles to understand why LogScale was developed and the problems it was designed to solve.

Each section contains a deeper level of description. If you want a more detailed view of what is going on under the hood within LogScale, work through to the deeper sections. For an overview, read only the high-level content.

Design Principles

Before looking at the architecture of LogScale, it is important to understand the base principles that drove the architecture and its design decisions:

  • Scalability

    Log everything

    • Efficient ingestion

    • Hardware efficient

    • Cost efficient

    • Cloud or on-premise deployment

  • Flexible search

    Ask anything

    • Index Free

    • Expressive query language

    • Composable query language

    • Hash filters

  • Real time

    Experience data in real time using an efficient streaming architecture

    • Data is ingested in real time

    • Data is queried during ingestion to enable live searches

    • Data is stored according to its age to allow for live, active, and long-term queries

The original design and intention for LogScale included the following factors:

  • LogScale is designed so that you can ask questions of the data without a fixed schema having been defined when the cluster is initiated or when the data is ingested.

  • This design also implies no fixed or predetermined indexes for the data.

  • The original use case for LogScale was collecting logs from different applications, for example microservices, web servers, routers, databases, queues, and load balancers. These systems all have wildly different logging formats, outputs and levels of detail, and this is the primary problem that LogScale was designed to solve.

The nature of querying this type of information is that you don't know the questions or queries that you are likely to ask of the data until the information has been ingested. Because the formats differ, and because you may want to correlate information across multiple logs to identify faults or performance issues, creating a predefined structure and index would be complex. Such a structure and index would also slow down ingestion, making it difficult to query the data live as it is ingested.

By removing the requirement for a highly structured and indexed system, it is possible to run queries that serve a wide range of use cases. For example, how many errors have occurred on a specific node, how many have occurred in a specific application across multiple nodes, or how many orders have been raised by a customer.
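The three example questions above can be illustrated with a minimal Python sketch. It is not LogScale's implementation or query language; the event data, the key=value log format, and the helper names extract_fields and search are all hypothetical. The sketch only shows the idea of extracting fields at query time from raw lines instead of relying on a schema or index defined in advance.

```python
import re

# Hypothetical raw events: (epoch seconds, source node, unparsed log line).
# Nothing about the line format is declared before the data is "ingested".
EVENTS = [
    (1700000000, "web-1", "level=error app=checkout msg=timeout"),
    (1700000005, "web-2", "level=error app=checkout msg=timeout"),
    (1700000010, "web-1", "level=info app=checkout msg=ok"),
    (1700000015, "orders-db", "order_id=17 customer=acme total=30"),
    (1700000020, "orders-db", "order_id=18 customer=acme total=12"),
    (1700000025, "web-1", "level=error app=search msg=oom"),
]


def extract_fields(line):
    """Pull key=value pairs out of a raw line at query time."""
    return dict(re.findall(r"(\w+)=(\S+)", line))


def search(events, predicate):
    """Scan every event and yield those matching an ad hoc predicate."""
    for timestamp, source, line in events:
        fields = extract_fields(line)
        if predicate(source, fields):
            yield timestamp, source, fields


# Errors on a specific node.
errors_on_web1 = list(
    search(EVENTS, lambda src, f: src == "web-1" and f.get("level") == "error")
)

# Errors in a specific application, across all nodes.
checkout_errors = list(
    search(EVENTS, lambda src, f: f.get("app") == "checkout" and f.get("level") == "error")
)

# Orders raised by a customer.
acme_orders = list(
    search(EVENTS, lambda src, f: "order_id" in f and f.get("customer") == "acme")
)
```

Each question is answered by the same scan with a different predicate, so a new question does not require a change to how the data was stored.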

Aligning all of the stored data with a timestamp provides a single point of correlation across all the logged data:

  • By cross-matching data across applications and logs by timestamp, events can be identified easily.

  • Data can be located easily by the timestamp.

  • Data can be returned in order of event timestamp, and the same timestamp determines which tier of physical storage holds the data.

    Local disks can be used for recent data, and slower solutions, such as bucket storage (Amazon S3, for example), can be used for older, longer-term storage.

  • Because data is ordered this way, the querying system is designed to provide results as soon as possible based on the data immediately available, and then continue searching older data from slower bucket storage.

The result is that storage can be managed efficiently and results returned quickly; you get instant results for recent matches and, over time, the full result set for a given timespan.
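The progressive behavior described above can be sketched in Python as a generator that searches the fastest tier first and yields each tier's matches as soon as they are available. This is an illustration under stated assumptions, not LogScale's implementation: the tier names, the sample events, and the search_tiers function are hypothetical, and real tiers would be disk and object storage instead of in-memory lists.

```python
from typing import Iterator, List, Tuple

Event = Tuple[int, str]  # (epoch seconds, message)

# Hypothetical storage tiers: recent data on local disk, older data in a bucket.
LOCAL_DISK: List[Event] = [
    (1700000300, "error: disk full"),
    (1700000200, "info: started"),
]
BUCKET: List[Event] = [
    (1699990000, "error: timeout"),
    (1699980000, "info: deploy"),
]


def search_tiers(
    tiers: List[Tuple[str, List[Event]]],
    start: int,
    end: int,
    needle: str,
) -> Iterator[Tuple[str, List[Event]]]:
    """Yield (tier name, matches) one tier at a time, fastest tier first.

    Matches within a tier are returned newest first, so a caller can show
    recent results immediately and append older ones as they arrive.
    """
    for name, events in tiers:
        matches = sorted(
            (e for e in events if start <= e[0] <= end and needle in e[1]),
            reverse=True,
        )
        yield name, matches


# Collect the batches in the order they would be delivered to a caller.
batches = list(
    search_tiers(
        [("local", LOCAL_DISK), ("bucket", BUCKET)],
        start=1699980000,
        end=1700000300,
        needle="error",
    )
)
```

Because the function is a generator, a caller that iterates over it receives the local-disk matches before the slower bucket tier has been read at all.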

LogScale has evolved from its original DevOps use cases and has been shown to be just as effective for other forms of information, such as security and threat hunting, where data from multiple logs and events is brought together and the event data is then correlated.

To understand how LogScale achieves this, let's start by looking at the LogScale Logical Architecture.