A Data Source represents one stream of events from a log source. An example is the syslog from a server, or the access log from a web server.
But how does Humio decide which Data Source a set of events belongs to? It inspects the Event Tags. A Data Source is therefore a set of Events that all have the same Event Tags. Humio divides each Repository into Data Sources based on the tags.
Humio creates Data Sources automatically when it encounters a new combination of Tags. Users cannot create Data Sources directly.
Humio represents each Data Source internally as a dedicated directory within the Repository directory, so each Data Source is stored separately on disk. Restricting a search to a Data Source means Humio does not have to traverse all data, which makes searches much faster.
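The grouping described above can be illustrated with a minimal Python sketch. It is not Humio's implementation; the event layout, the function names, and the tag names (#host, #type) are assumptions made for the example. It only shows the idea that events with an identical combination of tags end up in the same Data Source.

```python
from collections import defaultdict


def data_source_key(tags):
    """Return a hashable, order-independent key for a combination of tags."""
    return frozenset(tags.items())


def group_into_data_sources(events):
    """Group events so that each group holds events with identical tags."""
    sources = defaultdict(list)
    for event in events:
        sources[data_source_key(event["tags"])].append(event)
    return dict(sources)


# Hypothetical events: the first two carry the same tags (in a different
# order), so they belong to the same Data Source.
events = [
    {"tags": {"#host": "web-1", "#type": "accesslog"}, "message": "GET /"},
    {"tags": {"#type": "accesslog", "#host": "web-1"}, "message": "GET /about"},
    {"tags": {"#host": "db-1", "#type": "syslog"}, "message": "started"},
]

sources = group_into_data_sources(events)
print(len(sources))
```

A search restricted to one tag combination then only needs to read the matching group, which mirrors why restricting a search to a Data Source avoids traversing all data.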
Data Sources of a Repository
On the settings page of a repository you can see the list of Data Sources that have been created during ingest.
Deleting a Data Source
A Data Source is the smallest unit of data that you can delete as a whole. To delete individual Events within a Data Source, use the Redact Events API. You can also set Data Retention on the repository so that events are deleted when they reach a certain age.
We recommend that you do not create more than 1,000 separate tags, or combinations of tags. If you need more combinations we recommend that you use attributes on individual events to differentiate them and select them separately.
For most use cases you can ignore the following section.
For performance reasons the amount of data that flows into each Data Source is limited to approximately 5 MB/s on average. The exact amount depends on how much a single CPU core can digest. If a Data Source receives more data than that for a while, Humio turns on auto sharding. This adds a synthetic tag value for the tag #humioAutoShard to split the stream into multiple Data Sources. This process is fully managed by Humio.
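To make the idea of auto sharding concrete, here is a hedged Python sketch. Humio manages this process itself and its actual algorithm is not described here; the round-robin assignment, the function names, and the exact byte limit are assumptions for illustration only. The sketch shows how a synthetic value for #humioAutoShard turns one tag combination into several.

```python
import math

# Approximate per-Data-Source limit taken from the text (about 5 MB/s).
LIMIT_BYTES_PER_SEC = 5 * 1000 * 1000


def shards_needed(rate_bytes_per_sec, limit=LIMIT_BYTES_PER_SEC):
    """Number of Data Sources needed to keep each one at or below the limit."""
    return max(1, math.ceil(rate_bytes_per_sec / limit))


def with_shard_tag(tags, event_index, shard_count):
    """Return a copy of tags with a synthetic shard value added.

    Assumption: events are spread round-robin across shards. Humio's real
    assignment strategy may differ.
    """
    sharded = dict(tags)
    if shard_count > 1:
        sharded["#humioAutoShard"] = str(event_index % shard_count)
    return sharded


count = shards_needed(12 * 1000 * 1000)  # a stream averaging 12 MB/s
tagged = with_shard_tag({"#host": "web-1"}, event_index=4, shard_count=count)
print(count, tagged)
```

Because the shard value is part of the tags, each distinct value yields a separate Data Source, so the load is spread without any change on the sending side.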
For optimal performance a Data Source should receive 1 KB/s or more on average. If you know you have many Data Sources in a repository that are slow in this respect, you can get better performance by turning on "Tag Grouping" on the tags that identify the slow streams.
Each Data Source requires Java heap for buffering while building the next block of data to be persisted. This amounts to roughly 5 MB each. If you have 1,000 Data Sources (across all repositories, in total) on your Humio server, you will need at least 5 GB of heap for that, on top of the other heap being used. In a clustered environment, only the share of Data Sources that are being "digested" on a node needs heap for buffers on that node, so adding servers lets the cluster accommodate more Data Sources.
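The heap arithmetic can be written out as a small Python sketch. The 5 MB figure comes from the text; the function names and the assumption that digest work is spread evenly across nodes are illustrative and not part of Humio.

```python
import math

# Rough buffer heap per Data Source, as stated in the text.
HEAP_PER_DATA_SOURCE_MB = 5


def buffer_heap_mb(data_sources_digested_on_node):
    """Estimated buffer heap (MB) for the Data Sources digested on one node."""
    return data_sources_digested_on_node * HEAP_PER_DATA_SOURCE_MB


def per_node_buffer_heap_mb(total_data_sources, nodes):
    """Estimated buffer heap per node, assuming an even spread of digest work."""
    share = math.ceil(total_data_sources / nodes)
    return buffer_heap_mb(share)


single_server = buffer_heap_mb(1000)              # 5000 MB, roughly 5 GB
four_nodes = per_node_buffer_heap_mb(1000, 4)     # 250 Data Sources per node
print(single_server, four_nodes)
```

This estimate covers buffers only; the other heap used by the server comes on top of it.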