Datasources

LogScale creates datasources from the combination of tags applied to data during ingestion and the segments written to storage. The LogScale parser can define these tags during parsing.

A datasource is created for each unique combination of tag values. For example, if you have two tag fields, each with eight (8) different values, there can be up to 64 distinct datasources.
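As a rough rule of thumb, the potential number of datasources is the product of the number of distinct values for each tag field. The snippet below is a minimal sketch (the cardinalities are assumed example values, not LogScale output) that reproduces the 8 x 8 example:

```python
from math import prod

# Assumed example cardinalities: two tag fields with eight distinct values each.
tag_cardinalities = {"#host": 8, "#source": 8}

# Every observed combination of tag values becomes its own datasource.
potential_datasources = prod(tag_cardinalities.values())
print(potential_datasources)  # 8 * 8 = 64
```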

A single datasource is handled by a single CPU thread within the system, so the rate of data ingest for each datasource must be monitored carefully to ensure the most efficient ingest rate. The CPU assigned to a datasource is responsible for compressing and merging the segments for that datasource, and a single CPU can typically process about 190GB/day. It is therefore vital to optimize the number of datasources for the ingest rate so that the work is distributed evenly across the available CPUs.
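For capacity planning, the figures above can be turned into a back-of-the-envelope estimate. The sketch below assumes a hypothetical total daily ingest and uses the approximate 190GB/day per-CPU figure quoted above; it is an illustration, not a sizing tool:

```python
import math

TOTAL_INGEST_GB_PER_DAY = 2_000   # assumed cluster-wide ingest, for illustration only
PER_CPU_GB_PER_DAY = 190          # approximate throughput of one digest CPU thread

# Each datasource is digested by a single CPU thread, so ingest needs to be
# spread over at least this many datasources for the work to be shared evenly.
min_datasources = math.ceil(TOTAL_INGEST_GB_PER_DAY / PER_CPU_GB_PER_DAY)
print(f"Spread ingest over at least {min_datasources} datasources")  # 11
```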

The datasource is an important structure as it controls and influences both ingestion and searching.

A datasource defines the segment files that are used to store a tag combination. For example, in the diagram below:

[Diagram: four tag combinations (#host=server1 #source=http.log, #host=server2 #source=http.log, #host=server2 #source=loadbalance, #host=server3 #source=loadbalance), each forming its own datasource with its own series of segments laid out along a time axis.]

Each of the tag combinations is an individual datasource, and each datasource has its own series of segments and timespan.

Datasources affect both the ingestion and searching of data:

  • During ingestion, tags affect how data flows through the Kafka phase (see Ingestion: Kafka Phase) and the digest phase (see Ingestion: Digest Phase).

  • When searching, the datasource is used to limit the segments needed to return the search results. If the query includes a specific tag, the segments searched can be limited to the matching datasource for the selected tags.
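For example, the following minimal sketch (toy data modelled on the diagram above, not LogScale internals) shows how a tag predicate narrows the set of segments that have to be scanned:

```python
# Map each tag combination (datasource) to the segments it owns.
datasources = {
    ("#host=server1", "#source=http.log"):    ["segA1", "segA2", "segA3"],
    ("#host=server2", "#source=http.log"):    ["segB1", "segB2"],
    ("#host=server2", "#source=loadbalance"): ["segC1", "segC2", "segC3"],
}

def segments_to_search(query_tags):
    """Return only the segments of datasources matching every query tag."""
    return [
        segment
        for tags, segments in datasources.items()
        if all(tag in tags for tag in query_tags)
        for segment in segments
    ]

print(segments_to_search(["#host=server2"]))  # 5 of the 8 segments
print(segments_to_search([]))                 # no tag filter: all 8 segments
```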

Because datasources affect both ingest and search in this way, you should choose your tags carefully to maximize ingest and query speed. The basic principles can be categorized as follows:

  • A higher number of tags creates a higher number of datasources, which increases the number of mini-segments and segments in general. This increases the storage size on disk and the memory required by the Global Database to store the segment map.

  • A lower number of tags increases the number of segments that need to be accessed when searching, which may reduce performance.

Events are processed from the Kafka ingest queue by an individual digest node, which is then responsible for writing the data for a given datasource. With a low number of datasources, a single datasource can receive a high volume of ingest, and this can lead to performance delays during ingest, as seen in the figure below:

[Diagram: Kafka ingest queue partitions feeding digest partition workers. Each partition is consumed by a single digest partition worker, and each worker writes the datasources assigned to it: two workers on Digest Node 1 and one worker on Digest Node 2, each handling three datasources.]

Increasing the number of tags, or choosing different fields to use as tags, may help distribute the data more evenly across the digest nodes and therefore alleviate contention during ingest.
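The toy model below illustrates the effect. The partitioning scheme (hashing the tag combination onto a partition) is an assumption for illustration, not LogScale's actual partitioner, but it shows why more tag combinations spread the ingest load across more ingest queue partitions:

```python
from collections import Counter

NUM_PARTITIONS = 4

def partition_for(tag_combo: tuple) -> int:
    # Assumed scheme: events with the same tag combination land on the same partition.
    return hash(tag_combo) % NUM_PARTITIONS

# One tag combination: every event lands on a single partition.
coarse = [("#host=server1",)] * 1000

# Sixteen tag combinations: events spread across the partitions.
fine = [("#host=server1", f"#source=app{i % 16}") for i in range(1000)]

print(Counter(partition_for(c) for c in coarse))  # all 1000 events on one partition
print(Counter(partition_for(c) for c in fine))    # load spread over the partitions
```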

Optimizing Tags and Datasources

Choosing tags, and the datasources they create, is both a repository-level and a cluster-wide consideration:

  • At the repository level:

    • Avoid choosing tags that have a large range of values (high cardinality); a wider range of values leads to a higher number of datasources. This, in turn, increases the number of mini-segments and segments, the size of storage on disk, and the memory required to manage the segments within the Global Database.

    • To catch cases where a tag has too many values (for example, a high-cardinality field accidentally specified as a tag), LogScale monitors the number of datasources created from the tag combinations and limits the combinations during ingest. If the limit is reached, LogScale adds the #tooManyTagValueCombination=true and #error=true fields to the ingested data. The limit is configurable for each repository and for the whole cluster.

    • For each repository, the key consideration is the ingest data flow to each datasource, which should target a range between 100KB/s and 2MB/s. Above this upper limit, LogScale will start to automatically shard the ingested data.

    • Searching, and the number of datasources created through tag combinations, can be influenced by tag grouping, i.e. hashing the source values into a limited set of distinct values that can still be matched against exact queries. For more information, see How-To: Using Tag Grouping; a minimal sketch of the idea appears after this list.

  • At the cluster level:

    • The number of datasources across all repositories has an impact on the memory requirements for each node in the cluster, because each datasource increases the metadata used to map the segments and datasources. A higher number of datasources leads to higher memory requirements, which reduces the memory available for caching and accessing data.

    • The default maximum number of datasources for each repository can be set at the cluster level to prevent individual repositories from creating too many datasources and degrading cluster operation.
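As a final illustration, the sketch below models the idea behind tag grouping referenced above: instead of using a high-cardinality value directly as a tag, the value is hashed into a small, fixed number of groups, which caps the number of datasources that tag can create while still allowing exact-match queries to be routed to the right group. The group count and the #client_ip field are assumptions for illustration; see How-To: Using Tag Grouping for the actual feature.

```python
import hashlib

NUM_GROUPS = 32  # assumed example group count, not a LogScale default

def tag_group(value: str) -> int:
    """Hash a raw tag value into one of a fixed number of groups."""
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_GROUPS

# A hypothetical high-cardinality field (#client_ip) would otherwise create one
# datasource per distinct address; grouping caps it at NUM_GROUPS datasources.
for ip in ("10.0.0.1", "10.0.0.2", "192.168.1.77"):
    print(ip, "-> group", tag_group(ip))

# An exact-match query for a specific value can hash the queried value the same
# way and search only the segments of the matching group's datasource.
```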