Design Principles

Before looking at the architecture of LogScale, it is important to understand the base principles that drove its design and the decisions behind it:

  • Scalability

    Log everything

    • Efficient ingestion

    • Hardware efficient

    • Cost efficient

    • Cloud or Self-Hosted deployment

  • Flexible search

    Ask anything

    • Index Free

    • Expressive query language

    • Composable query language

    • Hash filters

  • Real time

    Experience data in real time using an efficient streaming architecture

    • Data is ingested in real time

    • Data is queried during ingestion to enable live searches (see the example after this list)

    • Data is stored according to its age to allow for live, active, and long-term queries
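
To make the live-search principle concrete, here is a minimal sketch of a LogScale query that could back a live dashboard or alert; run as a live query, it is evaluated against events as they arrive, so the result updates during ingestion. The loglevel field name is an assumption about how the incoming events are parsed, not part of any fixed schema.

    // Count errors per minute; as a live query, the chart
    // updates continuously while new events are ingested.
    loglevel=ERROR
    | timeChart(span=1m)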

The original design and intention for LogScale included the following factors:

  • LogScale is designed to answer questions about data without a fixed schema having to be defined for the data when the cluster is initiated or when the data is ingested.

  • This design also means there are no fixed or predetermined indexes on the data.

  • The original use case for LogScale was collecting disparate application logs, for example from microservices, web servers, routers, databases, queues, and load balancers. These systems all have wildly different logging formats, outputs, and levels of detail, and it is this problem that LogScale primarily aims to solve.

The nature of querying this type of information is that you do not know what questions you are likely to ask of the data until the information has been ingested. The differences in formats, and the need to correlate information across multiple logs to identify faults or performance issues, mean that creating a predefined structure and index would be complex. Such a structure and index would also slow down ingestion, making it difficult to query the data live as it is ingested.

By removing the requirement for a highly structured and indexed system, it is possible to query the data across a wide range of use cases: for example, how many errors have occurred on a specific node, or across a specific application running on multiple nodes, or how many orders have been raised by a customer.
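
As an illustration, the questions above could be phrased as LogScale queries along the following lines. The field names (loglevel, host, app, customer_id) and their values are hypothetical; they depend entirely on what the ingested events contain, since no schema or index fixes them in advance.

    // How many errors have occurred on a specific node?
    loglevel=ERROR host=node-17
    | count()

    // Errors for one application, broken down across its nodes:
    loglevel=ERROR app=checkout
    | groupBy(host)

    // How many orders has a given customer raised?
    "order created" customer_id=12345
    | count()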

Aligning all of the stored data on a timestamp provides a single point of correlation across all the logged data:

  • Cross-matching data across applications and logs by timestamp makes related events easy to identify.

  • Data can be located easily by timestamp.

  • Data can be returned in timestamp order, and the timestamp of the events also determines the tier of physical storage used to hold the data.

    Local disks can be used for recent data, while slower solutions, such as bucket storage (Amazon S3, for example), can be used for older, long-term storage.

  • Because data is ordered this way, the querying system is designed to provide results as soon as possible from the data immediately available, and then to continue searching older data in slower bucket storage.

The result is that data storage can be managed efficiently and queried quickly: you get instant results for recent matches, and over time the full result set for a given timespan.
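
A sketch of how this looks from the query side: the same query is used whether the matching events are minutes or months old. With a wide time window (set in the UI, or via the start and end parameters of the query API), matches from recent, locally stored segments stream back immediately, while older segments are read from bucket storage and merged into the result as they arrive. The service field name below is a hypothetical example.

    // Recent matches return first; older, bucket-stored
    // segments follow as they are scanned.
    loglevel=ERROR service=payments
    | sort(@timestamp, order=desc, limit=200)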

LogScale has evolved beyond its original focus on DevOps use cases and has proven just as effective for other forms of information, such as security monitoring and threat hunting, by concentrating the data from multiple logs and events and then correlating the event data.

To understand how LogScale achieves this, let's start by looking at the LogScale Logical Architecture.