Ingest Flow

 

When data arrives at Humio, it needs to be processed. The journey data takes from arriving at a Humio node until it is presented in search results and saved to disk is called the ingest flow.

If you are planning a large system or tuning the performance of your Humio cluster, it can help to understand the flow of data. If you understand the different phases of the ingest flow, you can ensure that the right machines have the optimal hardware configuration.

In this section we’ll explain the different ingest phases and how nodes participate.

Parse, Digest, Store

There are three phases incoming data goes through:

Diagram: Data → Parse → Digest → Store
  • Parse: Receiving messages over the wire and processing them with parsers.

  • Digest: Building segment files and buffering data for real-time queries.

  • Store: Replicating the segment files to designated storage nodes.

These phases may be handled by different nodes in a Humio cluster, but any node can take part in any combination of the three phases.

The Parse Phase

Diagram: Data → Parse → Digest → Store (Parse phase highlighted)

When a system sends data (logs) to Humio over one of the Ingest APIs or through an ingest listener, the cluster node that receives the request is called the arrival node. The arrival node parses the incoming data (using the configured parsers) and puts the result (called events) on Humio's humio-ingest Kafka topic.
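For example, shipping a log line from a script might look like the sketch below. It assumes the humio-unstructured ingest endpoint and an ingest token; the exact URL, token, and payload shape depend on which Ingest API you use, so treat them as placeholders and check your cluster's API documentation.

```python
# Minimal sketch: POST a log line to a Humio ingest endpoint.
# URL, token, and payload shape are illustrative placeholders.
import requests

HUMIO_URL = "https://humio.example.com/api/v1/ingest/humio-unstructured"
INGEST_TOKEN = "YOUR-INGEST-TOKEN"  # placeholder

payload = [
    {
        "fields": {"host": "webserver-1"},  # metadata attached to every message
        "messages": ["2023-05-01T12:00:00Z ERROR Failed to connect to database"],
    }
]

resp = requests.post(
    HUMIO_URL,
    json=payload,
    headers={"Authorization": f"Bearer {INGEST_TOKEN}"},
)
resp.raise_for_status()  # the arrival node parses the data and queues the events
```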

If you are not familiar with Kafka, you can think of a topic as a set of independent queues.
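As a rough mental model only (not Humio's actual partitioning logic), you can picture the arrival node spreading events across those independent queues by hashing a key:

```python
# Conceptual sketch: spreading events across the partitions of an ingest
# topic by hashing a key. Humio's real partitioner may work differently.
import hashlib

NUM_PARTITIONS = 24  # illustrative partition count

def choose_partition(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a key (e.g. a data source id) to one of the independent queues."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

print(choose_partition("webserver-1"))  # events from this source land on the same queue
```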

Diagram: an external service sends data to an Arrival Node in the cluster; the Arrival Node writes the resulting events to the partitions (Partition #1, #2, … #N) of the humio-ingest Kafka topic.

The events are now ready to be processed by a Digest Node.

The Digest Phase

Diagram: Data → Parse → Digest → Store (Digest phase highlighted)

After the events are placed on the humio-ingest topic, a Digest Node will grab them off the topic as soon as possible. A topic in Kafka is configured with a number of partitions (parallel, independent queues), and each such Kafka partition is consumed by a digest node. A single node can consume multiple partitions, and exactly which node handles which digest partition is defined in the cluster's Digest Rules. Note that while only a single digest node consumes a Kafka partition at a time, Digest Rules allow you to specify multiple nodes per partition. The extra nodes act as fallbacks in case the primary digest node goes offline.
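The mapping itself is maintained by Humio, but conceptually a digest rule set boils down to an ordered list of nodes per partition: the first live node does the work, the rest are fallbacks. A hypothetical sketch (the names and structure are illustrative, not Humio's configuration format):

```python
# Illustrative only: digest rules as an ordered node list per Kafka partition.
# The first node that is alive consumes the partition; the rest are fallbacks.
DIGEST_RULES = {
    0: ["digest-node-1", "digest-node-2"],
    1: ["digest-node-2", "digest-node-3"],
    2: ["digest-node-3", "digest-node-1"],
}

def assign_consumer(partition: int, live_nodes: set[str]) -> str:
    """Pick the first node in the rule that is currently online."""
    for node in DIGEST_RULES[partition]:
        if node in live_nodes:
            return node
    raise RuntimeError(f"no live digest node for partition {partition}")

print(assign_consumer(1, live_nodes={"digest-node-1", "digest-node-3"}))
# -> 'digest-node-3' (digest-node-2 is offline, so the fallback takes over)
```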

Constructing Segment Files

Digest nodes are responsible for buffering new events and writing segment files containing the received events.

Events are written into minisegments, which are moderately sized segment files. These files are written to the digest node’s local disk, and replicated onto all the digest nodes responsible for the digest partition the events originated from, as specified in the Digest Rules. While the segment is in this phase, queries against it will be executed on the digest nodes that have a copy of the segment.

When enough minisegments have been written, they are merged into larger segment files and passed on to Storage Nodes in the Store Phase.
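A conceptual sketch of that buffering and merging step follows; the thresholds are made up, and the real segment sizes and triggers are internal to Humio.

```python
# Conceptual sketch: buffer events into minisegments, then merge minisegments
# into a full segment once enough have accumulated. Sizes are made up.
EVENTS_PER_MINISEGMENT = 10_000
MINISEGMENTS_PER_SEGMENT = 16

buffer: list[dict] = []
minisegments: list[list[dict]] = []

def on_event(event: dict) -> list[dict] | None:
    """Returns a completed (merged) segment when one is ready, else None."""
    buffer.append(event)
    if len(buffer) >= EVENTS_PER_MINISEGMENT:
        minisegments.append(buffer.copy())   # write a minisegment to local disk
        buffer.clear()                       # (and replicate it to the other digest nodes)
    if len(minisegments) >= MINISEGMENTS_PER_SEGMENT:
        segment = [e for mini in minisegments for e in mini]  # merge into one large segment
        minisegments.clear()
        return segment                       # hand the completed segment to the store phase
    return None
```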

Real-Time Query Results

Digest nodes also process the Real-Time part of search results. Whenever a new event is pulled off the humio-ingest topic, the digest node examines it and updates the result of any matching live searches that are currently running. This is what makes results appear instantly after events arrive in Humio.
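Conceptually, a live search keeps a running, incrementally updated result: for each event pulled off the queue, the digest node checks it against the search and updates the result in place. A minimal sketch of that idea (not Humio's query engine):

```python
# Conceptual sketch: a live search as a predicate plus an incrementally
# updated result. Each new event from the ingest queue updates it in place.
from typing import Callable

class LiveSearch:
    def __init__(self, predicate: Callable[[dict], bool]):
        self.predicate = predicate
        self.count = 0  # running result, e.g. a count() over matching events

    def on_event(self, event: dict) -> None:
        if self.predicate(event):
            self.count += 1

errors = LiveSearch(lambda e: e.get("level") == "ERROR")
for event in [{"level": "INFO"}, {"level": "ERROR"}, {"level": "ERROR"}]:
    errors.on_event(event)   # called as each event is pulled off humio-ingest
print(errors.count)          # -> 2, available immediately, before any segment is written
```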

The Store Phase

Diagram: Data → Parse → Digest → Store (Store phase highlighted)

The final phase of the ingest flow is moving segment files to storage nodes. Once a segment file has been completed in the digest phase, it is moved to the storage nodes. Your cluster’s Storage Rules define which nodes receive completed segments for storage, and how many copies of the segments should be stored. Queries against the completed segments will execute on the storage nodes that have a copy of the relevant segment.
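As a hypothetical sketch (names and structure are illustrative, not Humio's configuration format), a storage rule amounts to a set of eligible nodes plus a replica count per segment:

```python
# Illustrative only: pick storage nodes for a completed segment.
# A storage rule amounts to "which nodes may hold copies" plus a replica count.
STORAGE_NODES = ["storage-node-1", "storage-node-2", "storage-node-3", "storage-node-4"]
REPLICATION_FACTOR = 2  # how many copies of each segment to keep

def storage_targets(segment_id: int) -> list[str]:
    """Spread replicas of a segment across the eligible storage nodes."""
    start = segment_id % len(STORAGE_NODES)
    return [STORAGE_NODES[(start + i) % len(STORAGE_NODES)]
            for i in range(REPLICATION_FACTOR)]

print(storage_targets(41))  # -> ['storage-node-2', 'storage-node-3']
```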

Detailed Flow Diagram

Now that we have covered all the phases, let’s put the pieces together and give you a more detailed diagram of the complete ingest flow:

Diagram: a Data Producer and a Log System send data to the Arrival Nodes (#1 … #N), where it is parsed; the Arrival Nodes place the events on the partitions (#1 … #M) of the humio-ingest Kafka topic; Digest Nodes (#1 … #P) consume the partitions, process real-time query results, and build segment files; completed segments are passed on to the Storage Nodes (#1 … #Q).

The diagram shows a more detailed view of the ingestion process with two external systems sending data to Humio. The incoming data is first parsed by one of the Arrival Nodes, then put on the ingest queue for a Digest Node. The Digest Node writes the data to segment files, and finally the segment files are sent to Storage Nodes to be saved to disk.

 
