Best Practice: Tags and Datasources
Note
For more information on how LogScale creates and uses datasources see Datasources; for information on the effects during ingest see Tag Fields and Datasources.
When choosing the tags to be created during parsing and ingest:
A tag field should be easy for a user to use at the start of their query (the first field in their query)
A tag field should be able to be used in more than one search
Choose a tag and field value that will assist with the distribution of data across segments and also enable effective searching and selection. For example, when storing security logs, suitable tags might be the log source, entry type and host, as these are the most likely fields to be used when searching and filtering for data.
Avoid using fields or values that have a high cardinality. For example, don't use a tag field with more than 1000 unique values, as this will increase the number of segments and the memory required to manage them.
The system default for the maximum number of datasources is 10,000 per repository. Because of the management overhead, this limit implies a notional maximum of between 10 and 20 repositories in the cluster.
When there is a higher number of tag values, make use of tag groupings, which hash the distinct values into a fixed number of tag groups. Tag groups create a distinct hash value for groups of tags, improving searches for specific values.
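The cardinality and grouping guidance above can be checked against a sample of events before committing to a parser design. The following is a minimal sketch, not LogScale's implementation: the helper names, the sample events, and the use of CRC32 as the hash are all assumptions made for illustration, and it assumes that each distinct combination of tag values corresponds to one datasource.

```python
import zlib


def tag_cardinality(events, tag_fields):
    """Count the distinct values seen for each candidate tag field."""
    seen = {field: set() for field in tag_fields}
    for event in events:
        for field in tag_fields:
            if field in event:
                seen[field].add(event[field])
    return {field: len(values) for field, values in seen.items()}


def distinct_combinations(events, tag_fields):
    """Count distinct combinations of tag values (assumed: one datasource each)."""
    return len({tuple(event.get(f) for f in tag_fields) for event in events})


def tag_group(value, groups=16):
    """Map a tag value to one of a fixed number of groups.

    CRC32 is used only because it is stable across runs; the hash that
    LogScale applies internally is not shown here.
    """
    return zlib.crc32(value.encode("utf-8")) % groups


# Hypothetical sample: 3 hosts, 1 source, and a high-cardinality client_ip field.
events = [
    {"#host": f"web-{i % 3}", "#source": "nginx",
     "client_ip": f"10.0.{i // 256}.{i % 256}"}
    for i in range(2000)
]

cardinality = tag_cardinality(events, ["#host", "#source", "client_ip"])
low = distinct_combinations(events, ["#host", "#source"])
high = distinct_combinations(events, ["#host", "#source", "client_ip"])
grouped = {tag_group(e["client_ip"]) for e in events}

print(cardinality)   # client_ip exceeds the 1000-value guideline
print(low, high)     # combinations without and with client_ip as a tag
print(len(grouped))  # never more than the 16 groups requested
```

In this sample, tagging on client_ip multiplies the number of tag combinations, while hashing the same values into groups caps the count at the chosen number of groups.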
Examples of recommended tag fields and types of data include:
#host
#source
#environment
#applicationName
#serviceType
#eventType
The number of datasources for a repository can be monitored using the Data sources page or using the /api/v1/repositories/$REPOSITORY_NAME/max-datasources REST API endpoint.
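As a rough illustration of calling the endpoint above, the sketch below only builds the HTTP request and does not send it. The base URL, repository name, and token are placeholders. The use of GET and of a bearer token in the Authorization header are assumptions, and the response format is not covered here, so check the API documentation for your LogScale version.

```python
import urllib.request
from urllib.parse import quote


def max_datasources_request(base_url, repository, token):
    """Build (but do not send) a request for the max-datasources endpoint."""
    path = f"/api/v1/repositories/{quote(repository, safe='')}/max-datasources"
    return urllib.request.Request(
        base_url.rstrip("/") + path,
        headers={"Authorization": f"Bearer {token}"},  # assumed auth scheme
        method="GET",                                   # assumed HTTP method
    )


# Hypothetical values, for illustration only.
req = max_datasources_request(
    "https://logscale.example.com", "security-logs", "YOUR_API_TOKEN"
)
print(req.get_method(), req.full_url)
```

Passing the request to urllib.request.urlopen would send it; that step is left out so the sketch can be run without a live cluster.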
Identifying Too Many Datasources
If the number of datasources created reaches the limit for the repository (default 10,000), during ingest LogScale will add two fields to the data:
field | value |
---|---|
#tooManyTagValueCombination | true |
#error | true |
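If events are exported or post-processed outside LogScale, the two marker fields in the table above can be used to spot data that arrived after the limit was reached. This is a minimal sketch that assumes events are available as Python dictionaries with string field values; the function names and sample events are hypothetical.

```python
def hit_datasource_limit(event):
    """Return True if the event carries the too-many-datasources marker."""
    marker = event.get("#tooManyTagValueCombination", "")
    return str(marker).lower() == "true"


def count_limit_hits(events):
    """Count events that were ingested after the datasource limit was reached."""
    return sum(1 for event in events if hit_datasource_limit(event))


# Hypothetical sample events.
sample = [
    {"#host": "web-1", "message": "ok"},
    {"#host": "web-2", "#tooManyTagValueCombination": "true", "#error": "true"},
    {"#host": "web-3", "#error": "true"},  # a different error, not the limit
]

flagged = count_limit_hits(sample)
print(flagged)
```

The check keys on #tooManyTagValueCombination rather than #error, since #error on its own may be set for unrelated reasons.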
Auto-sharding
LogScale will automatically shard data by monitoring the ingest rate for a given datasource (default 2MB/s, or 190GB/day) and then sharding the incoming data to optimize for that ingest rate. The sharding operates across each datasource and can be configured for each datasource combination. For more information on auto-sharding, see Configure Auto-Sharding for High-Volume Data Sources.