Data Sources
Part of our Foundational Concepts series:
Previous Concept: Views
Next Concept: Use Case: Log Management
A Data Source represents one stream of events from a log source — for example, the syslog from a server, or the access log from a web server.
But how does LogScale decide which Data Source a set of events belongs to? It does so by inspecting the Event Tags: a Data Source is the set of Events that share the same combination of Event Tags. LogScale thus divides each Repository into one or more Data Sources based on its tags.
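Because tags define Data Source boundaries, filtering on tags (the `#`-prefixed fields) lets LogScale restrict a search to only the matching Data Sources. A minimal sketch of such a query — the tag names `#type` and `#host` are illustrative, and the tags in your repository depend on how data was ingested:

```
// Only Data Sources whose tags match are read from disk.
#type=accesslog #host=webserver-1
| count()
```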
LogScale creates Data Sources automatically when it encounters a new combination of Tags. Users cannot create Data Sources directly.
LogScale represents each Data Source internally as a dedicated directory within the Repository directory. Each data source is stored separately on disk. Restricting searches to a data source means LogScale does not have to traverse all data — which makes searches much faster.
Data Sources of a Repository
On the settings page of a repository, you can see the list of Data Sources that have been created during ingest.
Deleting a Data Source
Data Sources are the smallest unit of data that you can delete. To delete individual Events within a Data Source, use the Redact Events API. You can also set Data Retention on the repository so that events are deleted when they reach a certain age.
Note
We recommend that you do not create more than 1,000 separate tags or combinations of tags. If you need more combinations, use attributes on individual events to differentiate them, and select on those attributes instead.
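To sketch the difference: a tag is part of the Data Source identity, while an attribute is just a field on the event and does not create new Data Sources. The field names below (`#type`, `userID`) are illustrative:

```
// Filtering on a tag: LogScale reads only the matching Data Sources.
#type=syslog

// Filtering on an event attribute: evaluated against the events themselves,
// so high-cardinality values (such as user IDs) belong here, not in tags.
#type=syslog userID=1234
```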
Advanced Topics
For most use cases you can ignore the following section.
For performance reasons, the amount of data that flows into each Data Source is limited to approximately 5 MB/s on average; the exact limit depends on how much a single CPU core can digest. If a Data Source receives more data than this for a while, LogScale turns on auto-sharding, which adds a synthetic value for the tag #humioAutoShard to split the stream into multiple Data Sources. This process is fully managed by LogScale.
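To check whether auto-sharding is active on a stream, one approach — assuming the synthetic #humioAutoShard tag is queryable like any other tag — is to group events by it:

```
// Multiple rows in the result suggest the stream has been auto-sharded.
groupBy(#humioAutoShard)
```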
For optimal performance a Data Source should receive 1 KB/s or more on average. If you know you have many Data Sources in a repository that are slow in this respect, you can get better performance by turning on "Tag Grouping" on the tags that identify the slow streams.
Each Data Source requires Java heap for buffering while building the next block of data to be persisted; this amounts to roughly 5 MB per Data Source. If you have 1,000 Data Sources in total across all repositories on your LogScale server, you will need at least 5 GB of heap for buffering, on top of the heap used for other purposes. In a clustered environment, only the Data Sources being digested on a given node need buffer heap on that node, so adding servers allows the cluster to accommodate more Data Sources.