Implement HA control plane with Load Balancer (3-3 topology)
Some checks failed
Deploy Cluster / Terraform (push) Failing after 10s
Deploy Cluster / Ansible (push) Has been skipped

Major changes:
- Terraform: Scale to 3 control planes (cx23) + 3 workers (cx33)
- Terraform: Add Hetzner Load Balancer (lb11) for Kubernetes API
- Terraform: Add kube_api_lb_ip output
- Ansible: Add community.network collection to requirements
- Ansible: Update inventory to include LB endpoint
- Ansible: Configure secondary CPs and workers to join via LB
- Ansible: Add k3s_join_endpoint variable for HA joins
- Workflow: Add imports for cp-2, cp-3, and worker-3
- Docs: Update STABLE_BASELINE.md with HA topology and phase gates

Topology:
- 3 control planes (cx23 - 2 vCPU, 8GB RAM each)
- 3 workers (cx33 - 4 vCPU, 16GB RAM each)
- 1 Load Balancer (lb11) routing to all 3 control planes on port 6443
- Workers and secondary CPs join via LB endpoint for HA

Cost impact: +~€26/month (2 extra CPs + 1 extra worker + LB)
This commit is contained in:
2026-03-23 02:39:39 +00:00
parent 8b4a445b37
commit ff31cb4e74
10 changed files with 89 additions and 21 deletions

View File

@@ -4,8 +4,9 @@ This document defines the current engineering target for this repository.
## Topology
- 1 control plane
- 2 workers
- 3 control planes (HA etcd cluster)
- 3 workers
- Hetzner Load Balancer for Kubernetes API
- private Hetzner network
- Tailscale operator access
@@ -13,6 +14,8 @@ This document defines the current engineering target for this repository.
- Terraform infrastructure bootstrap
- Ansible k3s bootstrap with external cloud provider
- **HA control plane (3 nodes with etcd quorum)**
- **Hetzner Load Balancer for Kubernetes API**
- **Hetzner CCM deployed via Ansible (before workers join)**
- **Hetzner CSI for persistent volumes (via Flux)**
- Flux core reconciliation
@@ -26,7 +29,6 @@ This document defines the current engineering target for this repository.
## Out of Scope
- HA control plane
- public ingress or DNS
- public TLS
- app workloads
@@ -35,21 +37,28 @@ This document defines the current engineering target for this repository.
## Phase Gates
1. Terraform apply completes for the default topology.
2. k3s server bootstrap completes with external cloud provider enabled.
3. **CCM deployed via Ansible before workers join** (fixes uninitialized taint issue).
4. Workers join successfully and all nodes show proper `providerID`.
5. Flux source and infrastructure reconciliation are healthy.
6. **CSI deploys and creates `hcloud-volumes` StorageClass**.
7. **PVC provisioning tested and working** (validated with test pod).
8. External Secrets sync required secrets.
9. Tailscale private access works.
10. Terraform destroy succeeds cleanly or via workflow retry.
1. Terraform apply completes for HA topology (3 CP, 3 workers, 1 LB).
2. Load Balancer is healthy with all 3 control plane targets.
3. Primary control plane bootstraps with `--cluster-init`.
4. Secondary control planes join via Load Balancer endpoint.
5. **CCM deployed via Ansible before workers join** (fixes uninitialized taint issue).
6. Workers join successfully via Load Balancer and all nodes show proper `providerID`.
7. etcd reports 3 healthy members.
8. Flux source and infrastructure reconciliation are healthy.
9. **CSI deploys and creates `hcloud-volumes` StorageClass**.
10. **PVC provisioning tested and working**.
11. External Secrets sync required secrets.
12. Tailscale private access works.
13. Terraform destroy succeeds cleanly or via workflow retry.
## Success Criteria
**ACHIEVED** - Two consecutive fresh rebuilds passed all phase gates with no manual fixes:
**ACHIEVED** - HA Cluster with CCM/CSI:
- Build 1: Initial CCM/CSI deployment and validation (2026-03-23)
- Build 2: Full destroy/rebuild cycle successful (2026-03-23)
The platform is now stable with cloud provider integration and persistent volume support.
🔄 **IN PROGRESS** - HA Control Plane Validation:
- Build 3: Deploy 3-3 topology with Load Balancer
- Build 4: Destroy/rebuild to validate HA configuration
Success requires two consecutive HA rebuilds passing all phase gates with no manual fixes.