Best Practice: Tags and Datasources

Note

For more information on how LogScale creates and uses datasources, see Datasources; for information on the effects during ingest, see Tag Fields and Datasources.

When choosing the tags to be created during parsing and ingest:

  • A tag field should be easy for a user to use at the start of their query (the first field in their query)

  • A tag field should be usable in more than one search

  • Choose a tag and field value that will assist with the distribution of data across segments and also enable effective searching and selection. For example, when storing security logs, suitable tags might be the log source, entry type and host, as these are the most likely fields to be used when searching and filtering for data (see the example query after this list).

  • Avoid using fields or values that have a high cardinality. For example, don't use a tag field with more than 1,000 unique values, as this will increase the number of segments and the memory required to manage them.

  • The system default for the maximum number of datasources is 10,000 per repository. Because of the management overhead, this limit implies a notional maximum of between 10 and 20 repositories in the cluster.

  • When a tag field has a large number of distinct values, make use of tag grouping to hash those values into a fixed number of groups. Tag groups create a distinct hash value for each group of tag values, improving searches for specific values.
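
For example, a minimal query sketch for the security logs described above could place the tag filters first, so that LogScale only needs to open the matching datasources, before applying ordinary field filters. The tag values and field names used here (auditd, authentication, web-01, result, user) are illustrative assumptions, not values required by LogScale:

  #source=auditd #eventType=authentication #host=web-01
  | result=failure
  | groupBy(user, function=count())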

Examples of recommended tag fields and types of data include:

  • #host

  • #source

  • #environment

  • #applicationName

  • #serviceType

  • #eventType

The number of datasources for a repository can be monitored using the Data sources page or the /api/v1/repositories/$REPOSITORY_NAME/max-datasources REST API endpoint.

Identifying Too Many Datasources

If the number of datasources created reaches the limit for the repository (default 10,000), LogScale will add two fields to the data during ingest:

Field                          Value
#tooManyTagValueCombination    true
#error                         true
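
Events that were ingested after the limit was reached can then be found by filtering on these fields. As a minimal sketch, the following query counts the flagged events:

  #error=true #tooManyTagValueCombination=true
  | count()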

Auto-sharding

LogScale automatically shards data by monitoring the ingest rate for a given datasource (default 2MB/s, or 190GB/day) and then splitting the incoming data to optimize for that ingest rate. Sharding operates on each datasource and can be configured for each datasource combination. For more information on auto-sharding, see Configure Auto-Sharding for High-Volume Data Sources.