# Stable Private-Only Baseline
This document defines the current engineering target for this repository.
## Topology
- 3 control planes (HA etcd cluster)
- 5 workers
- kube-vip API VIP (`10.27.27.40`)
- private Proxmox/LAN network (`10.27.27.0/24`)
- Tailscale operator access and service exposure
- Rancher exposed through Tailscale (`rancher.silverside-gopher.ts.net`)
- Grafana exposed through Tailscale (`grafana.silverside-gopher.ts.net`)
- Prometheus exposed through Tailscale (`prometheus.silverside-gopher.ts.net:9090`)
- `apps` Kustomization suspended by default
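The VIP above implies the API server certificate and node networking are aligned with kube-vip. A minimal sketch of a k3s server config consistent with this topology — the VIP and subnet come from this document; the node IP, interface, and disabled components are illustrative assumptions:

```yaml
# /etc/rancher/k3s/config.yaml on the first control plane (sketch)
cluster-init: true        # first node bootstraps the embedded etcd cluster
tls-san:
  - 10.27.27.40           # kube-vip API VIP must be a SAN in the API cert
node-ip: 10.27.27.41      # assumed node address on the private 10.27.27.0/24 LAN
disable:
  - servicelb             # assumed: kube-vip owns the VIP, avoid conflicts
  - traefik               # assumed: baseline is private-only, no public ingress
```

Secondary control planes would use the same `tls-san` but join via `server: https://10.27.27.40:6443` instead of `cluster-init`.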
## In Scope
- Terraform infrastructure bootstrap
- Ansible k3s bootstrap on Ubuntu cloud-init VMs
- **HA control plane (3 nodes with etcd quorum)**
- **kube-vip for Kubernetes API HA**
- **NFS-backed persistent volumes via `nfs-subdir-external-provisioner`**
- Flux core reconciliation
- External Secrets Operator with Doppler
- Tailscale private access and smoke-check validation
- cert-manager
- Rancher and rancher-backup
- Rancher backup/restore validation
- Observability stack (Grafana, Prometheus, Loki, Promtail)
- Persistent volume provisioning validated
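The NFS-backed storage item corresponds to a StorageClass created by the `nfs-subdir-external-provisioner` chart. A sketch of the expected shape — the class name `flash-nfs` is from this document; the provisioner string matches the chart default, and the parameters are assumptions:

```yaml
# Shape of the StorageClass the provisioner is expected to create (sketch)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: flash-nfs
provisioner: cluster.local/nfs-subdir-external-provisioner  # chart default
parameters:
  archiveOnDelete: "false"   # assumed: discard data when PVCs are deleted
reclaimPolicy: Delete
```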
## Deferred for Later Phases
- app workloads in `apps/`
## Out of Scope
- public ingress or DNS
- public TLS
- app workloads
- cross-region / multi-cluster disaster recovery strategy
- upgrade strategy
## Phase Gates
1. Terraform apply completes for HA topology (3 CP, 5 workers, 1 VIP).
2. Primary control plane bootstraps with `--cluster-init`.
3. kube-vip advertises `10.27.27.40:6443` from the control-plane set.
4. Secondary control planes join via the kube-vip endpoint.
5. Workers join successfully via the kube-vip endpoint.
6. etcd reports 3 healthy members.
7. Flux source and infrastructure reconciliation are healthy.
8. **NFS provisioner deploys and creates the `flash-nfs` StorageClass**.
9. **PVC provisioning is tested and working**.
10. External Secrets syncs the required secrets.
11. Tailscale private access works for Rancher, Grafana, and Prometheus.
12. CI smoke checks pass for Tailscale DNS resolution, `tailscale ping`, and HTTP reachability.
13. A fresh Rancher backup can be created and restored successfully.
14. Terraform destroy succeeds cleanly or via workflow retry.
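The PVC provisioning gate can be exercised with a throwaway claim against the `flash-nfs` class. A minimal sketch — the claim name and size are illustrative; only the StorageClass name comes from this document:

```yaml
# Throwaway claim to validate the PVC provisioning gate (sketch)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: flash-nfs-smoke    # illustrative name
spec:
  storageClassName: flash-nfs
  accessModes:
    - ReadWriteMany        # NFS-backed volumes support shared access
  resources:
    requests:
      storage: 1Gi         # illustrative size
```

If the provisioner is healthy, the claim should reach `Bound` shortly after creation; it can be deleted once checked.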
## Success Criteria
Success requires two consecutive HA rebuilds passing all phase gates with no manual fixes, no manual `kubectl` patching, and no manual Tailscale proxy recreation.
## Validated Drills
- 2026-04-18: live Rancher backup/restore drill succeeded on the current cluster.
- A fresh one-time backup was created, restored back onto the same cluster, and post-restore validation confirmed:
- all nodes remained `Ready`
- Flux infrastructure stayed healthy
- Rancher backup/restore resources reported `Completed`
- Rancher, Grafana, and Prometheus remained reachable through the Tailscale smoke checks
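The one-time backup in this drill corresponds to a `Backup` custom resource from rancher-backup. A hedged sketch — the resource name is illustrative, the resource-set name is the default shipped with the operator, and the B2 storage location is assumed to be configured at the operator level rather than inline:

```yaml
# One-time Rancher backup (sketch; storage location assumed to be
# configured on the rancher-backup operator, targeting B2)
apiVersion: resources.cattle.io/v1
kind: Backup
metadata:
  name: drill-one-time     # illustrative name
spec:
  resourceSetName: rancher-resource-set  # default set from rancher-backup
```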