Ingestion: Kafka Phase

As incoming data has been received and parsed the last step done by the ingestion layer is to put the data on a Kafka queue. The queue serves as the durable storage for the incoming data so that it can be processed and digested by LogScale.

Kafka was chosen because Kafka is:

  • Designed to process large volumes of data in a robust way.

  • Horizontally scalable

  • Supports a durable queue mechanism

  • Highly configurable

  • Distributed with a built-in replication factor to spread data across multiple nodes

The response from Kafka as the durable storage mechanism is used to support acknowledge to clients when sending data:

  • When the Kafka queue indicates that the data has been accepted, the client gets a positive acknowledgement that the data has been accepted.

  • If data could not be put on the Kafka queue, an error is returned to the client.

The digest layer reads the data from the ingest queue and store it in LogScale's internal storage format, segment files.

For more information on how Kafka handles durability, see Kafka Semantics.

The distributed cluster model with built-in replication in Kafka allows you to configure how many nodes can be lost without losing data. Kafka can also be configured to set the number of nodes that should have received the data before acknowledging successful receipt. LogScale by default configures Kafka with the following settings:

  • Replication factor of 3 — i.e. data is copied to a minimum of three hosts

  • Two in-sync replicas — at least two nodes must have acknowledged receipt of the data before returning success to the client

  • Acknowledge all messages — messages on the queue must have been accepted and stored

Using this configuration, LogScale ensures that data accepted during ingestion has been reliably received and queue for digest, and ensures at least once semantics for the incoming data.

In the worst case, data may be received multiple times. For example, a log shipper sends data, loses the network connection before receiving a response. Without an acknowledgement, the log shipper would resend the data, leading to LogScale storing the data twice.