docs: High availability explanation page #940

Open. Wants to merge 2 commits into base: main
44 changes: 44 additions & 0 deletions docs/src/snap/explanation/high-availability.md
@@ -0,0 +1,44 @@
# High Availability

Contributor comment: Nit: High availability


High availability (HA) is a core feature of {{ product }}, ensuring that
a Kubernetes cluster remains operational and resilient, even when nodes or
critical components encounter failures. This capability is crucial for
maintaining continuous service for applications and workloads running in
production environments.

HA is automatically enabled in {{ product }} for clusters with three or
more nodes, regardless of the deployment method. By distributing key components
across multiple nodes, HA reduces the risk of downtime and service
interruptions, offering built-in redundancy and fault tolerance.

## Key Components of a Highly Available Cluster

Contributor comment: Nit: Key components of a highly available cluster


A highly available Kubernetes cluster exhibits the following characteristics:

### 1. **Multiple Nodes for Redundancy**

Contributor comment: Nit: Multiple nodes for redundancy


Having multiple nodes in the cluster ensures workload distribution and
redundancy. If one node fails, workloads can be rescheduled on other available
nodes without disrupting services. This node-level redundancy minimizes the
impact of hardware or network failures.

Contributor comment on lines +21 to +22: Can you add a sentence or two explaining if rescheduling needs manual intervention or if it happens automatically?

### 2. **Control Plane Redundancy**

Contributor comment: Nit: Control plane redundancy


The control plane manages the cluster’s state and operations. For high
availability, the control plane components—such as the API server, scheduler,
and controller-manager—are distributed across multiple nodes. This prevents a
single point of failure from rendering the cluster inoperable.

### 3. **Highly Available Datastore**

Contributor comment: Nit: Highly available datastore


By default, {{ product }} uses **dqlite** to manage the Kubernetes
cluster state. Dqlite leverages the Raft consensus algorithm for leader
election and voting, ensuring reliable data replication and failover
capabilities. When a leader node fails, a new leader is elected seamlessly
without administrative intervention. This mechanism allows the cluster to
remain operational even in the event of node failures. More details on
replication and leader elections can be found in
the [dqlite replication documentation][dqlite-replication].
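
The majority rule behind Raft-style leader election can be illustrated with a little arithmetic. The following is a minimal sketch, not dqlite's implementation: it assumes every node counts as a voting member (dqlite's actual role assignment may differ), and the function names `quorum` and `tolerated_failures` are hypothetical, introduced only for this example.

```python
def quorum(voters: int) -> int:
    """Smallest number of voters that forms a strict majority."""
    if voters < 1:
        raise ValueError("a cluster needs at least one voter")
    return voters // 2 + 1


def tolerated_failures(voters: int) -> int:
    """Number of voters that can fail while a majority is still reachable."""
    return voters - quorum(voters)


if __name__ == "__main__":
    # Assumption: all nodes are voters; real clusters may assign other roles.
    for n in (1, 2, 3, 5):
        print(
            f"{n} voters: quorum={quorum(n)}, "
            f"tolerated failures={tolerated_failures(n)}"
        )
```

Under this assumption, two voters tolerate no failures while three tolerate one, which is consistent with HA being enabled only for clusters with three or more nodes.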

<!-- LINKS -->
[dqlite-replication]: https://dqlite.io/docs/explanation/replication
1 change: 1 addition & 0 deletions docs/src/snap/explanation/index.md
@@ -17,6 +17,7 @@ channels
clustering
ingress
epa
high-availability
security
cis
```