HetznerTerra/STABLE_BASELINE.md
MichaelFisher1997 ff31cb4e74
Implement HA control plane with Load Balancer (3-3 topology)
Major changes:
- Terraform: Scale to 3 control planes (cx23) + 3 workers (cx33)
- Terraform: Add Hetzner Load Balancer (lb11) for Kubernetes API
- Terraform: Add kube_api_lb_ip output
- Ansible: Add community.network collection to requirements
- Ansible: Update inventory to include LB endpoint
- Ansible: Configure secondary CPs and workers to join via LB
- Ansible: Add k3s_join_endpoint variable for HA joins
- Workflow: Add imports for cp-2, cp-3, and worker-3
- Docs: Update STABLE_BASELINE.md with HA topology and phase gates

Topology:
- 3 control planes (cx23 - 2 vCPU, 8GB RAM each)
- 3 workers (cx33 - 4 vCPU, 16GB RAM each)
- 1 Load Balancer (lb11) routing to all 3 control planes on port 6443
- Workers and secondary CPs join via LB endpoint for HA

Cost impact: +~€26/month (2 extra CPs + 1 extra worker + LB)
2026-03-23 02:39:39 +00:00

# Stable Private-Only Baseline
This document defines the current engineering target for this repository.
## Topology
- 3 control planes (HA etcd cluster)
- 3 workers
- Hetzner Load Balancer for Kubernetes API
- private Hetzner network
- Tailscale operator access
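
The choice of three control planes follows from etcd's majority rule. As a minimal sketch of that arithmetic (the function names are illustrative and not part of this repository): a cluster of `n` voting members needs `n // 2 + 1` of them to agree, so three members can lose one and keep quorum, while two members can lose none.

```python
def quorum(members: int) -> int:
    """Smallest majority of an etcd (Raft) cluster with `members` voters."""
    if members < 1:
        raise ValueError("a cluster needs at least one member")
    return members // 2 + 1


def fault_tolerance(members: int) -> int:
    """Number of members that can fail while the cluster keeps quorum."""
    return members - quorum(members)


if __name__ == "__main__":
    # Compare the single-node, two-node, and three-node cases.
    for n in (1, 2, 3):
        print(f"{n} members: quorum={quorum(n)}, "
              f"tolerated failures={fault_tolerance(n)}")
```

This is also why a 2-node control plane is not a useful intermediate step: it raises the quorum size without adding any failure tolerance.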
## In Scope
- Terraform infrastructure bootstrap
- Ansible k3s bootstrap with external cloud provider
- **HA control plane (3 nodes with etcd quorum)**
- **Hetzner Load Balancer for Kubernetes API**
- **Hetzner CCM deployed via Ansible (before workers join)**
- **Hetzner CSI for persistent volumes (via Flux)**
- Flux core reconciliation
- External Secrets Operator with Doppler
- Tailscale private access
- Persistent volume provisioning validated
## Deferred for Later Phases
- Observability stack (the Helm release is complex and needs separate debugging)
## Out of Scope
- public ingress or DNS
- public TLS
- app workloads
- DR / backup strategy
- upgrade strategy
## Phase Gates
1. Terraform apply completes for HA topology (3 CP, 3 workers, 1 LB).
2. Load Balancer is healthy with all 3 control plane targets.
3. Primary control plane bootstraps with `--cluster-init`.
4. Secondary control planes join via Load Balancer endpoint.
5. **CCM deployed via Ansible before workers join** (fixes the issue of nodes keeping the `node.cloudprovider.kubernetes.io/uninitialized` taint).
6. Workers join successfully via Load Balancer and all nodes show proper `providerID`.
7. etcd reports 3 healthy members.
8. Flux source and infrastructure reconciliation are healthy.
9. **CSI deploys and creates `hcloud-volumes` StorageClass**.
10. **PVC provisioning tested and working**.
11. External Secrets sync required secrets.
12. Tailscale private access works.
13. `terraform destroy` succeeds, either on the first run or after a workflow retry.
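
Gates 5 and 6 can be checked mechanically. The following is a rough sketch, not the repository's actual validation step: it assumes the node list comes from `kubectl get nodes -o json` and that the Hetzner CCM writes provider IDs with an `hcloud://` prefix. The function name `node_problems` and the sample data are hypothetical.

```python
import json

# Taint that Kubernetes puts on nodes waiting for a cloud controller manager.
UNINITIALIZED_TAINT = "node.cloudprovider.kubernetes.io/uninitialized"
# Assumed providerID prefix for Hetzner Cloud servers.
PROVIDER_PREFIX = "hcloud://"


def node_problems(node_list: dict) -> list:
    """Return one message per failed providerID or taint check."""
    problems = []
    for node in node_list.get("items", []):
        name = node["metadata"]["name"]
        spec = node.get("spec", {})
        provider_id = spec.get("providerID", "")
        if not provider_id.startswith(PROVIDER_PREFIX):
            problems.append(f"{name}: missing or unexpected providerID {provider_id!r}")
        # `taints` is absent (or null) when a node has no taints at all.
        taint_keys = {t.get("key") for t in spec.get("taints") or []}
        if UNINITIALIZED_TAINT in taint_keys:
            problems.append(f"{name}: still carries the uninitialized taint")
    return problems


if __name__ == "__main__":
    # Hypothetical sample shaped like `kubectl get nodes -o json` output.
    sample = json.loads("""
    {"items": [
      {"metadata": {"name": "cp-1"}, "spec": {"providerID": "hcloud://1001"}},
      {"metadata": {"name": "worker-1"},
       "spec": {"taints": [{"key": "node.cloudprovider.kubernetes.io/uninitialized",
                            "effect": "NoSchedule"}]}}
    ]}
    """)
    for line in node_problems(sample):
        print(line)
```

An empty result means every node has been initialized by the CCM. In a workflow, the same function could read the real `kubectl` output and fail the job when the list is non-empty.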
## Success Criteria
**ACHIEVED** - HA Cluster with CCM/CSI:
- Build 1: Initial CCM/CSI deployment and validation (2026-03-23)
- Build 2: Full destroy/rebuild cycle successful (2026-03-23)
🔄 **IN PROGRESS** - HA Control Plane Validation:
- Build 3: Deploy 3-3 topology with Load Balancer
- Build 4: Destroy/rebuild to validate HA configuration
Success requires two consecutive HA rebuilds passing all phase gates with no manual fixes.
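
One way to make this criterion unambiguous is to write it as a check over build records. This sketch assumes "consecutive" means the two most recent builds and counts the 13 phase gates listed above. The `Build` record and its fields are hypothetical, not something the workflow currently emits.

```python
from dataclasses import dataclass

TOTAL_GATES = 13  # number of phase gates in this document


@dataclass
class Build:
    name: str
    gates_passed: int   # how many phase gates passed
    manual_fixes: bool  # True if anyone intervened by hand


def is_clean(build: Build) -> bool:
    """A build is clean when every gate passed and nothing was fixed manually."""
    return build.gates_passed == TOTAL_GATES and not build.manual_fixes


def baseline_achieved(builds: list, required: int = 2) -> bool:
    """True if the most recent `required` builds are all clean."""
    if len(builds) < required:
        return False
    return all(is_clean(b) for b in builds[-required:])


if __name__ == "__main__":
    history = [
        Build("build-3", gates_passed=13, manual_fixes=False),
        Build("build-4", gates_passed=13, manual_fixes=False),
    ]
    print(baseline_achieved(history))
```

Under this reading, a single failed or hand-patched build resets the count, so the next two builds must both be clean.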