Resilience

Overview

The Axual Kafka platform ensures a reliable service with little to no downtime. Barring major infrastructure failures, a properly configured Kafka cluster remains available at all times, making it suitable for mission-critical workloads.

Axual Platform

Axual Self-Service is not considered critical, because it is only involved in configuring the Kafka clusters and contains no business data. In the rare event of Self-Service downtime, Kafka and connected applications remain completely unaffected.

Geo Replication

If the resilience of a single Kafka cluster or its underlying infrastructure is not considered sufficient, additional Kafka clusters can be used as a backup. At Axual, a set of in-sync Kafka clusters is called an "Instance": each cluster holds the same business data under identical topic names, allowing applications to switch from one cluster to another rapidly. The mechanism behind this geo replication is the Axual Distributor, which synchronizes messages, consumer offsets and schemas with low latency.
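
A simple illustration of why identical topic names matter for fast failover: assuming an application reads its Kafka connection settings from a Kubernetes ConfigMap (all names below are hypothetical, not Axual specifics), switching to the backup cluster only requires repointing bootstrap.servers, while the topic names stay the same.

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: kafka-client-config                      # hypothetical client configuration
    data:
      bootstrap.servers: kafka-a.example.com:9092    # repoint to the backup cluster on failover
      topic: orders                                  # identical topic name on both clusters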

Kubernetes mechanisms

The mechanisms that make the Axual Platform resilient against service disruptions are all part of Kubernetes. By spreading workloads over multiple Worker Nodes, managed by multiple Controller Nodes, Kubernetes provides a fail-safe architecture.

Pods

Within Kubernetes, most workloads run in Pods, which form the backbone of a Service. These Pods are managed by Deployments or StatefulSets, which ensure that the required number of Pods is in a Running state. If too few Pods are healthy, Kubernetes automatically spins up replacements.
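
As a sketch, a Deployment like the one below asks Kubernetes to keep three replicas running at all times; the name axual-example and the container image are placeholders, not actual Axual components:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: axual-example              # hypothetical workload
    spec:
      replicas: 3                      # Kubernetes keeps 3 Pods in a Running state
      selector:
        matchLabels:
          app: axual-example
      template:
        metadata:
          labels:
            app: axual-example
        spec:
          containers:
            - name: app
              image: example/app:1.0   # placeholder image

If one of these Pods crashes or its Node disappears, the Deployment controller immediately schedules a replacement to restore the desired count.
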
Pods themselves can be built to deal with disruptions and should define meaningful readiness probes that tell the Kubernetes API whether a Pod is running and ready to receive traffic.
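
A minimal readiness probe sketch for the container spec above, assuming the application exposes an HTTP health endpoint (the path /ready and port 8080 are placeholders):

    containers:
      - name: app
        image: example/app:1.0       # placeholder image
        readinessProbe:
          httpGet:
            path: /ready             # hypothetical health endpoint
            port: 8080
          initialDelaySeconds: 10    # give the application time to start
          periodSeconds: 5           # check every 5 seconds
          failureThreshold: 3        # mark the Pod unready after 3 consecutive failures

While a Pod is unready, Kubernetes removes it from the Service endpoints so no traffic is routed to it.
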
To divide Pods over multiple Nodes, (Anti-)Affinity rules or Pod Topology Spread Constraints can be used.
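
For example, the following Pod Topology Spread Constraint, placed in the Pod spec of the Deployment above, spreads the placeholder axual-example Pods evenly over Nodes in different zones:

    spec:
      topologySpreadConstraints:
        - maxSkew: 1                               # zones may differ by at most one Pod
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule         # refuse placements that would break the spread
          labelSelector:
            matchLabels:
              app: axual-example                   # placeholder label
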
When upgrading the underlying Kubernetes Nodes, defining a proper Pod Disruption Budget ensures that a Service remains available throughout.
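
A PodDisruptionBudget sketch for the same placeholder workload, allowing voluntary disruptions such as a Node drain during an upgrade only while at least two Pods remain available:

    apiVersion: policy/v1
    kind: PodDisruptionBudget
    metadata:
      name: axual-example-pdb      # hypothetical name
    spec:
      minAvailable: 2              # never voluntarily evict below 2 running Pods
      selector:
        matchLabels:
          app: axual-example       # placeholder label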

Data

For workloads that persist data, the data should be stored in multiple locations (Availability Zones, data centers or racks) to avoid data loss when one of those locations is lost. For example, Kafka topic data can be spread over multiple zones, identified by the topology.kubernetes.io/zone node label.
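
As an illustration, assuming the Kafka cluster is managed by the Strimzi operator (one common way to run Kafka on Kubernetes; the fields below are Strimzi's, not necessarily how Axual deploys Kafka), rack awareness maps each broker to its zone so that partition replicas end up in different zones:

    apiVersion: kafka.strimzi.io/v1beta2
    kind: Kafka
    metadata:
      name: example-cluster                          # hypothetical cluster name
    spec:
      kafka:
        replicas: 3                                  # e.g. one broker per zone, assuming three zones
        listeners:
          - name: plain
            port: 9092
            type: internal
            tls: false
        rack:
          topologyKey: topology.kubernetes.io/zone   # sets broker.rack from the Node's zone label
        config:
          default.replication.factor: 3              # one replica of each partition per zone
          min.insync.replicas: 2                     # keep accepting writes after losing one zone
        storage:
          type: persistent-claim
          size: 100Gi
      zookeeper:
        replicas: 3
        storage:
          type: persistent-claim
          size: 10Gi

With a setup like this, losing an entire Availability Zone still leaves two in-sync replicas of every partition, so no data is lost.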