LogScale High-Availability and Disaster Recovery (HA/DR) Implementation

LogScale's High-Availability and Disaster Recovery implementation achieves data resilience by combining Kafka with file system storage: Kafka acts as the commit log for recent and in-progress work, while replicated segment files on local disks or in bucket storage provide durable long-term storage. The system offers data redundancy with or without bucket storage and maintains data integrity during node failures by coordinating event processing, ingest, and search.

This document describes how LogScale manages data replication, event processing, and storage across deployment configurations with and without bucket storage. It covers the role of Kafka as temporary storage for in-flight data, the replication of segment files, and how data integrity is maintained during node failures. It also explains how ingest, digest, and search behave during normal operation and failure scenarios, how the achievable level of data redundancy and disaster recovery depends on whether bucket storage is used, and what is required operationally to keep the system available.

LogScale uses a combination of replicated files and data in Kafka topics to ensure that no data is lost. LogScale relies on Kafka as the commit log for recent and in-progress work, and on local file systems (or Bucket Storage) as the trusted, persistent long-term storage.

When events or requests arrive on the HTTP API, the 200 OK response is issued only after Kafka has acknowledged all writes that stem from them. This delegates the responsibility for not losing incoming changes to Kafka, which is a proven solution for this task.
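The pattern is the standard "acknowledge only after the commit log accepts the write". The sketch below illustrates the idea with a Kafka producer configured with acks=all; it is not LogScale's actual ingest code, and the topic name and handler shape are assumptions for illustration only.

```java
import java.util.Properties;
import java.util.concurrent.Future;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

public class IngestHandlerSketch {
    private final KafkaProducer<byte[], byte[]> producer;

    public IngestHandlerSketch(String bootstrapServers) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        // Require acknowledgement from all in-sync replicas before the write counts as durable.
        props.put("acks", "all");
        props.put("key.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
        this.producer = new KafkaProducer<>(props);
    }

    /** Returns an HTTP status code: 200 only after Kafka has acknowledged the write. */
    public int handleIngestRequest(byte[] eventPayload) {
        ProducerRecord<byte[], byte[]> record =
                new ProducerRecord<>("ingest-events", eventPayload); // topic name is hypothetical
        Future<RecordMetadata> ack = producer.send(record);
        try {
            ack.get(); // block until Kafka acknowledges the write (or the send fails)
            return 200; // safe to tell the client the data has been accepted
        } catch (Exception e) {
            return 503; // Kafka did not confirm the write; the client should retry
        }
    }
}
```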

When the digest nodes construct segment files from incoming events, they record the offsets of the Kafka partition the events came from as part of the metadata in Global. Segment files are then replicated to other nodes in the cluster. For each partition, the cluster calculates the offset below which all events are contained in properly replicated files, and then tells Kafka that it may delete events older than that offset.
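Conceptually, the bookkeeping amounts to tracking, per partition, the highest offset fully covered by replicated segment files and then advancing Kafka's deletion point to it. The sketch below shows the idea using the Kafka Admin API; the data structures and method names are illustrative assumptions, not LogScale internals.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.RecordsToDelete;
import org.apache.kafka.common.TopicPartition;

public class RetentionSketch {
    /**
     * For each ingest partition, find the highest offset below which every event is
     * already contained in a fully replicated segment file, then let Kafka drop
     * everything older than that.
     */
    public static void releaseReplicatedOffsets(
            Admin admin,
            Map<TopicPartition, List<Long>> replicatedUpToOffsets) throws Exception {
        Map<TopicPartition, RecordsToDelete> toDelete = new HashMap<>();
        for (Map.Entry<TopicPartition, List<Long>> entry : replicatedUpToOffsets.entrySet()) {
            // The safe offset is the minimum over what each required replica reports:
            // nothing newer than this is guaranteed to live in replicated segment files yet.
            long safeOffset = entry.getValue().stream().mapToLong(Long::longValue).min().orElse(0L);
            toDelete.put(entry.getKey(), RecordsToDelete.beforeOffset(safeOffset));
        }
        // Tell Kafka it may delete records before the safe offset on each partition.
        admin.deleteRecords(toDelete).all().get();
    }
}
```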

LogScale can be configured to be resilient against data loss in more than one way depending on the hosting environment.

In all cases the hosting environment must ensure that Kafka has sufficient replicas and fail-over nodes to support the desired resilience.
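On the Kafka side, this resilience comes from the topic replication factor together with min.insync.replicas. The sketch below shows one way to create such a topic with the Kafka Admin API; the topic name, partition count, and exact values are assumptions and should be taken from the Kafka configuration guidance for your environment.

```java
import java.util.Collections;
import java.util.Map;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.NewTopic;

public class KafkaResilienceSketch {
    /**
     * Create an ingest topic with enough replicas to tolerate a broker failure.
     * Topic name, partition count, and settings are illustrative assumptions.
     */
    public static void createReplicatedIngestTopic(Admin admin) throws Exception {
        NewTopic topic = new NewTopic("ingest-events", 24, (short) 3)
                .configs(Map.of(
                        // Require at least 2 in-sync replicas so a write survives one broker loss.
                        "min.insync.replicas", "2"));
        admin.createTopics(Collections.singletonList(topic)).all().get();
    }
}
```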

Without Bucket Storage

LogScale is able to retain all ingested events in the case of a single node loss if the digest and storage replication factors are set to 2 or higher. Events may be deleted from Kafka once the resulting segment files have been replicated to reach the configured factors.
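Expressed as a simple invariant (illustrative only, not LogScale internals), events may leave Kafka only once the corresponding segment files have enough copies, and surviving a single node loss requires a factor of at least 2:

```java
public class SegmentDeletionGateSketch {
    /**
     * Without bucket storage, events backing a segment may only be released from Kafka
     * once the segment file exists on at least the configured number of nodes.
     * Names and shape are assumptions for illustration.
     */
    static boolean mayReleaseFromKafka(int segmentReplicaCount, int storageReplicationFactor) {
        return segmentReplicaCount >= storageReplicationFactor;
    }

    /** A single node loss is only survivable when the configured factors are at least 2. */
    static boolean toleratesSingleNodeLoss(int digestReplicationFactor, int storageReplicationFactor) {
        return digestReplicationFactor >= 2 && storageReplicationFactor >= 2;
    }
}
```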

Using Bucket Storage

With bucket storage enabled on top of ephemeral disks, LogScale is able to retain all ingested events even when only a single node is assigned to each partition: LogScale deletes ingested events from Kafka only once the resulting segment files have been uploaded to the bucket and the remote checksum of the uploaded file has been verified.
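The ordering is what provides the guarantee: upload the segment, verify the remote copy, and only then allow the corresponding Kafka offsets to be released. Below is a minimal sketch of that ordering, using a hypothetical ObjectStorageClient interface rather than LogScale's actual bucket storage code.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.util.HexFormat;

public class BucketUploadSketch {

    /** Hypothetical object-storage client; stands in for the real S3/GCS/etc. APIs. */
    interface ObjectStorageClient {
        void upload(String key, Path localFile) throws Exception;
        /** Returns the checksum the bucket reports for the stored object. */
        String remoteChecksum(String key) throws Exception;
    }

    /**
     * Upload a finished segment file and verify the remote checksum before the
     * caller is allowed to release the corresponding events from Kafka.
     */
    public static boolean uploadAndVerify(ObjectStorageClient bucket, String key, Path segmentFile)
            throws Exception {
        bucket.upload(key, segmentFile);

        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        String localChecksum = HexFormat.of().formatHex(
                digest.digest(Files.readAllBytes(segmentFile)));

        // Only report success (and thus allow Kafka deletion) if the bucket holds
        // a byte-identical copy of the segment file.
        return localChecksum.equalsIgnoreCase(bucket.remoteChecksum(key));
    }
}
```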

Ingest

Ingest can run as long as at least one LogScale node is reachable and that node is able to publish the incoming events to Kafka.
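From the sender's side this means that one reachable node is enough; a common client pattern is simply to try nodes in turn until one accepts the request, as in the hedged sketch below (the endpoint path, node list, and payload format are placeholders, not a specific LogScale API).

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.List;

public class IngestFailoverSketch {
    private static final HttpClient CLIENT = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(5))
            .build();

    /**
     * Try each node in turn and stop at the first 200 OK. Any single reachable node
     * that can publish to Kafka is enough for ingest to succeed.
     * The endpoint path is a placeholder, not a specific LogScale API.
     */
    public static boolean sendToAnyNode(List<String> nodeBaseUrls, String jsonPayload) {
        for (String baseUrl : nodeBaseUrls) {
            try {
                HttpRequest request = HttpRequest.newBuilder()
                        .uri(URI.create(baseUrl + "/ingest"))
                        .header("Content-Type", "application/json")
                        .POST(HttpRequest.BodyPublishers.ofString(jsonPayload))
                        .build();
                HttpResponse<String> response =
                        CLIENT.send(request, HttpResponse.BodyHandlers.ofString());
                if (response.statusCode() == 200) {
                    return true; // this node accepted the events and Kafka acknowledged the write
                }
            } catch (Exception e) {
                // Node unreachable or unable to publish to Kafka; try the next one.
            }
        }
        return false; // no node could accept the events; the caller should retry later
    }
}
```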