Ingestion: Kafka Phase

Once incoming data has been received and parsed, the last step performed by the ingestion layer is to place the data on a Kafka queue. The queue serves as durable storage for the incoming data so that it can be processed and digested by LogScale.

Kafka was chosen because it is:

Designed to process large volumes of data in a robust way
Horizontally scalable
Able to support a durable queue mechanism
Highly configurable
Distributed, with a built-in replication factor to spread data across multiple nodes

The response from Kafka, as the durable storage mechanism, is used to acknowledge data sent by clients:

When the Kafka queue indicates that the data has been accepted, the client gets a positive acknowledgement that the data has been accepted.
If data could not be put on the Kafka queue, an error is returned to the client.
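
The acknowledgement rule above can be sketched in a few lines. This is a minimal illustration, not LogScale's implementation: the `enqueue` callable, the `QueueUnavailable` exception, and the HTTP-style status codes are all hypothetical names chosen for the example.

```python
from typing import Callable, Tuple


class QueueUnavailable(Exception):
    """Hypothetical error raised when the queue cannot store the data."""


def handle_ingest_request(
    payload: bytes, enqueue: Callable[[bytes], None]
) -> Tuple[int, str]:
    """Return an (status, message) pair for one ingest request.

    `enqueue` is assumed to block until the queue confirms the write,
    and to raise QueueUnavailable if the write cannot be confirmed.
    """
    try:
        enqueue(payload)
    except QueueUnavailable as exc:
        # The data was not durably queued, so the client must be told.
        return 503, f"ingest failed: {exc}"
    # Only after the queue has accepted the data is the client acknowledged.
    return 200, "accepted"
```

The point of the sketch is the ordering: the positive response is produced only after the queue write has been confirmed, never before.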

The digest layer reads the data from the ingest queue and stores it in LogScale's internal storage format, segment files.

For more information on how Kafka handles durability, see Kafka Semantics.

The distributed cluster model with built-in replication in Kafka allows you to configure how many nodes can be lost without losing data. You can also configure how many nodes must have received the data before Kafka acknowledges successful receipt. By default, LogScale configures Kafka with the following settings:

Replication factor of 3 — i.e. data is copied to a minimum of three hosts
Two in-sync replicas — at least two nodes must have acknowledged receipt of the data before returning success to the client
Acknowledge all messages — messages on the queue must have been accepted and stored
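
To make the effect of these three settings concrete, the following sketch models them in plain Python. It is a simplified model under stated assumptions rather than a Kafka client: the class and function names are hypothetical, and the comments name the standard Kafka settings (`min.insync.replicas` and the producer `acks` setting) that the fields stand in for.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DurabilitySettings:
    replication_factor: int = 3   # copies of each partition across the cluster
    min_insync_replicas: int = 2  # stands in for Kafka's min.insync.replicas
    acks: str = "all"             # stands in for the Kafka producer acks setting


def write_accepted(in_sync_replicas: int, settings: DurabilitySettings) -> bool:
    """Decide whether a write would be acknowledged in this simplified model."""
    if settings.acks != "all":
        # Without acks=all, only the partition leader needs to be available.
        return in_sync_replicas >= 1
    return in_sync_replicas >= settings.min_insync_replicas


def nodes_lost_while_still_writable(settings: DurabilitySettings) -> int:
    """How many replicas can be down before writes start being rejected."""
    return settings.replication_factor - settings.min_insync_replicas


def guaranteed_copies_of_acked_data(settings: DurabilitySettings) -> int:
    """Minimum number of replicas holding any acknowledged message."""
    return settings.min_insync_replicas


settings = DurabilitySettings()
for lost in range(settings.replication_factor + 1):
    in_sync = settings.replication_factor - lost
    print(f"{lost} node(s) down: write accepted = {write_accepted(in_sync, settings)}")
```

In this model, one node can be down and writes are still acknowledged. With two nodes down, the write is rejected, so the client receives an error instead of an acknowledgement for data held on a single node.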

Using this configuration, LogScale ensures that data accepted during ingestion has been reliably received and queued for digest, which provides at-least-once semantics for the incoming data.

In the worst case, data may be received multiple times. For example, a log shipper sends data but loses the network connection before receiving a response. Without an acknowledgement, the log shipper resends the data, which leads to LogScale storing the data twice.
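
The duplicate scenario can be reproduced with a small simulation. This is an illustrative sketch only: `IngestEndpoint` and `ship` are hypothetical stand-ins for the ingest layer and a log shipper, and the lost response is modelled by a flag rather than a real network failure.

```python
from typing import List


class IngestEndpoint:
    """Hypothetical stand-in for the ingest layer; stores every event it receives."""

    def __init__(self) -> None:
        self.stored: List[str] = []

    def receive(self, event: str, drop_response: bool = False) -> bool:
        # The event is stored before the response is sent back.
        self.stored.append(event)
        # Returning False models a response lost on the network.
        return not drop_response


def ship(
    endpoint: IngestEndpoint,
    event: str,
    lose_first_response: bool,
    max_attempts: int = 3,
) -> int:
    """Send an event, retrying until acknowledged. Returns the attempts used."""
    for attempt in range(1, max_attempts + 1):
        drop = lose_first_response and attempt == 1
        if endpoint.receive(event, drop_response=drop):
            return attempt
    raise RuntimeError("event was never acknowledged")


endpoint = IngestEndpoint()
attempts = ship(endpoint, "login user=alice", lose_first_response=True)
print(attempts, endpoint.stored)
```

Here the shipper needs two attempts and the endpoint ends up holding the same event twice. Nothing is lost, which is the guarantee at-least-once delivery makes, at the cost of possible duplicates.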
