A Data Source represents one stream of events from a log source, such as the syslog from a server or the access log from a web server.
But how does Humio decide which Data Source a set of events belongs to? This is determined by inspecting the events' Tags. A Data Source is therefore the set of Events that share the same combination of tags, and Humio divides each Repository into one or more Data Sources based on those tags.
Humio creates Data Sources automatically when it encounters a new combination of Tags. Users cannot create Data Sources directly.
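The partitioning can be pictured as grouping events by their tag combination. The following is a minimal illustrative sketch, not Humio's implementation; the tag names, values, and event shape are invented for the example:

```python
# Hypothetical sketch: events partition into Data Sources by their unique
# combination of tags. Tags and event contents here are invented examples.
from collections import defaultdict

events = [
    {"tags": {"#type": "syslog", "#host": "web-1"}, "rawstring": "sshd: session opened"},
    {"tags": {"#type": "accesslog", "#host": "web-1"}, "rawstring": "GET /index.html 200"},
    {"tags": {"#type": "syslog", "#host": "web-1"}, "rawstring": "sshd: session closed"},
]

data_sources = defaultdict(list)
for event in events:
    # Each unique set of tag key/value pairs identifies one Data Source.
    key = tuple(sorted(event["tags"].items()))
    data_sources[key].append(event)

print(len(data_sources))  # 2 distinct tag combinations -> 2 Data Sources
```

The two syslog events share a tag combination and land in the same Data Source; the access-log event forms a second one.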
Humio represents each Data Source internally as a dedicated directory within the Repository directory, so each Data Source is stored separately on disk. Restricting a search to a Data Source means Humio does not have to traverse all the data in the repository, which makes searches much faster.
On the settings page of a repository you can see the list of Data Sources that have been created during ingest.
Data Sources are the smallest unit of data that you can delete. You can delete individual Events in a Data Source by using the Delete Events API. You can also set Retention on the repository to delete events when they reach a certain age.
For most use cases you can ignore the following section.
For performance reasons, the amount of data that flows into each Data Source is limited to approximately 5 MB/s on average; the exact limit depends on how much a single CPU core can digest. If a Data Source receives more data than this for a sustained period, Humio turns on auto-sharding: it adds a synthetic value for the tag
#humioAutoShard to split the stream into multiple Data Sources. This process is fully managed by Humio.
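Conceptually, auto-sharding spreads one overloaded tag combination across several Data Sources by attaching an extra synthetic tag value. A minimal sketch of the idea follows; the shard count and hash function are assumptions for illustration, since Humio chooses and manages these internally:

```python
# Hypothetical sketch of auto-sharding: a synthetic #humioAutoShard tag value
# splits one hot stream into several Data Sources. The shard count and the
# use of CRC32 are assumptions for illustration only.
import zlib

NUM_SHARDS = 4  # assumed; Humio manages this automatically


def shard_tags(tags: dict, event_key: bytes) -> dict:
    """Return the event's tags with a synthetic shard tag value added."""
    sharded = dict(tags)
    sharded["#humioAutoShard"] = str(zlib.crc32(event_key) % NUM_SHARDS)
    return sharded


tags = {"#type": "accesslog", "#host": "web-1"}
print(shard_tags(tags, b"event-1234"))
```

Events with the same original tags now map to up to NUM_SHARDS distinct tag combinations, each backed by its own Data Source.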
For optimal performance a Data Source should receive 1 KB/s or more on average. If a repository contains many Data Sources that fall below this rate, you can improve performance by turning on “Tag Grouping” for the tags that identify the slow streams.
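Tag Grouping works in the opposite direction of auto-sharding: many low-volume tag values are hashed into a small number of buckets so that they share Data Sources. The sketch below shows the idea only; the bucket count and hash function are assumptions, not Humio's actual configuration:

```python
# Hypothetical sketch of Tag Grouping: rather than one Data Source per raw
# tag value, values hash into a few buckets so low-volume streams share a
# Data Source. Bucket count and CRC32 are assumptions for illustration.
import zlib

NUM_BUCKETS = 8  # assumed


def grouped_tag_value(raw_value: str) -> str:
    """Map a raw tag value to one of NUM_BUCKETS group identifiers."""
    return str(zlib.crc32(raw_value.encode()) % NUM_BUCKETS)


# A thousand distinct #host values collapse into at most 8 groups:
hosts = [f"host-{i}" for i in range(1000)]
groups = {grouped_tag_value(h) for h in hosts}
print(len(groups))  # at most 8
```

Queries can still filter on the original tag value as a normal field; only the on-disk partitioning becomes coarser.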
Each Data Source requires Java heap for buffering while building the next block of data to be persisted; this amounts to roughly 5 MB per Data Source. If you have 1,000 Data Sources in total across all repositories on your Humio server, you will need at least 5 GB of heap for these buffers on top of the heap used for other purposes. In a clustered environment, only the Data Sources currently being digested on a node need buffer heap on that node, so adding servers allows the cluster to accommodate more Data Sources.
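The buffer-heap figure above is simple arithmetic and can be worked out directly; the helper name below is ours, not part of Humio:

```python
# Back-of-the-envelope estimate of buffer heap, using the ~5 MB per Data
# Source figure from the text. The function name is ours, for illustration.
BUFFER_MB_PER_DATA_SOURCE = 5


def buffer_heap_gb(data_sources_on_node: int) -> float:
    """Approximate heap (GB) needed for Data Source write buffers."""
    return data_sources_on_node * BUFFER_MB_PER_DATA_SOURCE / 1024


print(round(buffer_heap_gb(1000), 2))  # ~4.88 GB, i.e. the "at least 5 GB" guidance
```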