Index-Free Data Storage

A key principle of the LogScale architecture is the index-free approach, as initially described in Design Principles. The key driver is the ability to ask new questions of the data without having to predetermine its structure or organize it with indexes to provide efficient access.

When considering a typical structured datastore architecture there are some expectations:

  • Fixed structure organized by fields and tables

  • Fixed relationships between tables using key fields

  • Indexing on fields to improve lookup performance

For example, a bank account model is shown below:

graph TD
  subgraph BankAccount
    AC["AccountNumber"]
    D["TransactionDate"]
    A["Amount"]
  end
  subgraph AccountIndex
    AI["Account Number"]
  end
  subgraph Query
    Q["SELECT SUM(Amount) FROM BankAccount GROUP BY AccountNumber"]
  end
  AC --> AccountIndex
  AI --> Q
  A --> Q
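The structured model can be sketched with Python's built-in sqlite3 module. This is only an illustration of the fixed-schema, indexed approach being contrasted here; the table, index and column names follow the diagram, while the sample rows and the in-memory database are assumptions made for the example.

```python
import sqlite3

# In-memory database used purely for illustration.
conn = sqlite3.connect(":memory:")

# Fixed structure: every row must fit these three fields.
conn.execute(
    "CREATE TABLE BankAccount ("
    " AccountNumber TEXT NOT NULL,"
    " TransactionDate TEXT NOT NULL,"
    " Amount REAL NOT NULL)"
)

# The index has to be chosen up front, based on the expected queries.
conn.execute("CREATE INDEX AccountIndex ON BankAccount (AccountNumber)")

# Hypothetical sample rows.
rows = [
    ("ACC-1", "2024-01-01", 100.0),
    ("ACC-1", "2024-01-02", -25.0),
    ("ACC-2", "2024-01-01", 50.0),
]
conn.executemany("INSERT INTO BankAccount VALUES (?, ?, ?)", rows)

# The query from the diagram, with the grouping column added to the output.
totals = dict(
    conn.execute(
        "SELECT AccountNumber, SUM(Amount) FROM BankAccount GROUP BY AccountNumber"
    ).fetchall()
)

# A row that does not match the fixed structure is rejected outright.
try:
    conn.execute(
        "INSERT INTO BankAccount VALUES (?, ?, ?, ?)",
        ("ACC-3", "2024-01-03", 10.0, "unexpected extra field"),
    )
    rejected = False
except sqlite3.Error:
    rejected = True

print(totals, rejected)
```

The rejected insert is the point of the sketch: data that does not fit the predetermined structure cannot be stored without changing the schema first.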

This approach limits the types of data that can be stored in a structured datastore. LogScale is designed to address these limitations:

  • What happens when the format of the incoming data does not match the structure?

    LogScale is designed to store log files from any application or service, and those lines may not even include a field structure. Different log files may have different structures, and therefore different fields and expectations. Not every log line will include an IP address, for example, so creating an index on that field will not benefit searching for those records.

  • Non-indexed data has to be scanned

    In a structured datastore, if the data being queried is not part of an index, every row has to be scanned. This is a time-consuming (and expensive) process.

  • What about data outside the expected structure?

    If you parsed the incoming data and stored only the fields that you could identify, a structured datastore would throw away any information it couldn't store. In log files, this could include an error message or system dump that would provide additional information during diagnosis of an issue. LogScale stores the original log line.

  • How do you process already stored, unstructured data?

    Structured datastores are not designed to re-process or parse the raw data. LogScale can efficiently parse the raw log lines when the data is queried, enabling searches on data that is not stored in a structured or fixed field. In addition, LogScale can create new fields and structures during the query on the stored event data. This allows for complex formatting, summarizing and recombination of information without needing to re-ingest.
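The idea of parsing raw lines at query time can be sketched in a few lines of Python. This is a conceptual illustration only, not LogScale's query engine or query language; the sample log lines, the `rawstring` key and the `extract_fields` and `failed_logins_by_ip` helpers are all hypothetical.

```python
import re
from collections import Counter

# Raw events are kept exactly as received; no schema is imposed at write time.
raw_events = [
    "2024-05-01T10:00:00Z sshd[311]: Failed password for root from 10.0.0.5 port 22",
    "2024-05-01T10:00:03Z GET /index.html 200 src=10.0.0.7",
    "2024-05-01T10:00:09Z kernel: Out of memory: Killed process 4242 (java)",
    "2024-05-01T10:00:12Z sshd[311]: Failed password for admin from 10.0.0.5 port 22",
]

# Simple dotted-quad pattern; good enough for the illustration.
IPV4 = re.compile(r"\b(?P<ip>\d{1,3}(?:\.\d{1,3}){3})\b")


def extract_fields(line):
    """Parse one raw line at query time; lines without an IP get no 'ip' field."""
    fields = {"rawstring": line}
    match = IPV4.search(line)
    if match:
        fields["ip"] = match.group("ip")
    return fields


def failed_logins_by_ip(events):
    """Free-text filter first, then derive a field and aggregate on it."""
    counts = Counter()
    for line in events:
        if "Failed password" not in line:
            continue
        fields = extract_fields(line)
        if "ip" in fields:
            counts[fields["ip"]] += 1
    return counts


result = failed_logins_by_ip(raw_events)
print(dict(result))
```

Because the original lines are retained, a different question (for example, counting out-of-memory kills) only needs a different function over the same stored data, with no re-ingestion.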

Building and using indexes in a data store works well when the data is highly structured and the types of questions that will be asked are already understood, as the data schema and indexes can be optimized to answer those questions.

LogScale is built for handling large volumes of data efficiently. Log management is very write-heavy, and the stored data is not queried often. When designing for a write-heavy workload, the speed of data ingestion is critical.

In addition to these principles for ingesting, storing and preparing the data to be searched, LogScale also creates metadata for the incoming information to assist with searching, categorizing and organizing the data:

  • Time

    All events within LogScale are stored with a timestamp, either based on the time of the original event (from the log file), or the time the event was ingested by LogScale, or both. LogScale stores the start and end time of every segment and will only search segments that overlap the specified time interval. This efficiently limits the number of segments that need to be processed.

  • Tags

    Events can be stored with a tag. Internally this is referred to as a datasource and often matches the original event log type, for example syslog or weblog. These can be set when ingesting the data and provide another method for efficiently identifying the segment files that need to be searched.

  • Hashfilters

    Hashfilters heavily optimize free-text and regular expression searches, as well as searches specifying equality on a field value.
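How these three kinds of metadata narrow a search can be sketched as follows. This is a simplified model under stated assumptions, not LogScale's storage format: the `Segment` class, the `token_hash` and `search` helpers and the sample data are hypothetical, and an exact set of token hashes stands in for a hash filter, whose real structure is not described here.

```python
import hashlib
from dataclasses import dataclass, field


def token_hash(token: str) -> int:
    """Hash a token to a 64-bit integer (stand-in for a hash filter entry)."""
    digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


@dataclass
class Segment:
    tag: str      # e.g. "syslog" or "weblog"
    start: int    # timestamp of the earliest event in the segment
    end: int      # timestamp of the latest event in the segment
    events: list  # list of (timestamp, raw_line) tuples
    token_hashes: set = field(default_factory=set)

    def __post_init__(self):
        # Built once when the segment is written; consulted at search time.
        for _, line in self.events:
            for token in line.split():
                self.token_hashes.add(token_hash(token))


def search(segments, tag, start, end, token):
    """Return matching lines and the number of segments actually scanned."""
    wanted = token_hash(token)
    hits, scanned = [], 0
    for seg in segments:
        if seg.tag != tag:                      # tag metadata
            continue
        if seg.end < start or seg.start > end:  # time metadata
            continue
        if wanted not in seg.token_hashes:      # hash filter stand-in
            continue
        scanned += 1
        for ts, line in seg.events:
            if start <= ts <= end and token in line.split():
                hits.append(line)
    return hits, scanned


segments = [
    Segment("syslog", 0, 99, [(10, "disk full on /var"), (50, "login ok alice")]),
    Segment("syslog", 100, 199, [(120, "login failed bob"), (150, "disk ok")]),
    Segment("weblog", 100, 199, [(130, "GET /login failed")]),
    Segment("syslog", 200, 299, [(210, "cron started")]),
]

hits, scanned = search(segments, "syslog", 100, 299, "failed")
print(hits, scanned)
```

In this sketch only one of the four segments is scanned: one is excluded by time, one by tag, and one because the search token cannot be present. No per-field index on the events themselves is required.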

For more information on how search works, see Search Architecture.