# Stable Private-Only Baseline

This document defines the current engineering target for this repository.

## Topology

- 3 control planes (HA etcd cluster)
- 5 workers
- kube-vip API VIP (`10.27.27.40`)
- private Proxmox/LAN network (`10.27.27.0/24`)
- Tailscale operator access and service exposure
- Rancher exposed through Tailscale (`rancher.silverside-gopher.ts.net`)
- Grafana exposed through Tailscale (`grafana.silverside-gopher.ts.net`)
- Prometheus exposed through Tailscale (`prometheus.silverside-gopher.ts.net:9090`)
- `apps` Kustomization suspended by default
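
The addresses and node counts in the list above can be sanity-checked before any infrastructure is touched. The following is a minimal sketch, not part of the repository: it assumes only the VIP, LAN CIDR, and control-plane count listed above, and the function names `vip_in_lan` and `etcd_fault_tolerance` are hypothetical.

```python
import ipaddress

# Values copied from the Topology list above.
LAN_CIDR = "10.27.27.0/24"
API_VIP = "10.27.27.40"
CONTROL_PLANES = 3


def vip_in_lan(vip: str, cidr: str) -> bool:
    """Return True when the VIP is a usable host address inside the LAN."""
    network = ipaddress.ip_network(cidr)
    address = ipaddress.ip_address(vip)
    # Exclude the network and broadcast addresses, which cannot host a VIP.
    return address in network and address not in (
        network.network_address,
        network.broadcast_address,
    )


def etcd_fault_tolerance(members: int) -> int:
    """Number of members that can fail while etcd still has quorum."""
    quorum = members // 2 + 1
    return members - quorum


if __name__ == "__main__":
    print(vip_in_lan(API_VIP, LAN_CIDR))         # True
    print(etcd_fault_tolerance(CONTROL_PLANES))  # 1
```

With three control planes, quorum is two members, so the cluster tolerates the loss of one control plane.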

## In Scope

- Terraform infrastructure bootstrap
- Ansible k3s bootstrap on Ubuntu cloud-init VMs
- **HA control plane (3 nodes with etcd quorum)**
- **kube-vip for Kubernetes API HA**
- **NFS-backed persistent volumes via `nfs-subdir-external-provisioner`**
- Flux core reconciliation
- External Secrets Operator with Doppler
- Tailscale private access and smoke-check validation
- cert-manager
- Rancher and rancher-backup
- Rancher backup/restore validation
- Observability stack (Grafana, Prometheus, Loki, Promtail)
- Persistent volume provisioning validated

## Deferred for Later Phases

- app workloads in `apps/`

## Out of Scope

- public ingress or DNS
- public TLS
- app workloads
- cross-region / multi-cluster disaster recovery strategy
- upgrade strategy

## Phase Gates

1. Terraform apply completes for HA topology (3 CP, 5 workers, 1 VIP).
2. Primary control plane bootstraps with `--cluster-init`.
3. kube-vip advertises `10.27.27.40:6443` from the control-plane set.
4. Secondary control planes join via the kube-vip endpoint.
5. Workers join successfully via the kube-vip endpoint.
6. etcd reports 3 healthy members.
7. Flux source and infrastructure reconciliation are healthy.
8. **NFS provisioner deploys and creates `flash-nfs` StorageClass**.
9. **PVC provisioning tested and working**.
10. External Secrets sync required secrets.
11. Tailscale private access works for Rancher, Grafana, and Prometheus.
12. CI smoke checks pass for Tailscale DNS resolution, `tailscale ping`, and HTTP reachability.
13. A fresh Rancher backup can be created and restored successfully.
14. Terraform destroy succeeds cleanly or via workflow retry.
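
The DNS and HTTP parts of the smoke-check gate can be illustrated with a small probe loop. This is a hedged sketch rather than the repository's CI workflow: the hostnames and the Prometheus port come from the Topology section, while the `http://` scheme, the `TARGETS` mapping, and the function names are assumptions made for illustration. The `tailscale ping` step is left out because it depends on the local Tailscale CLI.

```python
import socket
import urllib.error
import urllib.parse
import urllib.request
from typing import Callable, Dict

# Hostnames and the Prometheus port come from the Topology section.
# The http:// scheme is an assumption of this sketch, not taken from the repo.
TARGETS: Dict[str, str] = {
    "rancher": "http://rancher.silverside-gopher.ts.net",
    "grafana": "http://grafana.silverside-gopher.ts.net",
    "prometheus": "http://prometheus.silverside-gopher.ts.net:9090",
}


def resolves(host: str) -> bool:
    """DNS check: True when the name resolves to at least one address."""
    try:
        return len(socket.getaddrinfo(host, None)) > 0
    except socket.gaierror:
        return False


def http_reachable(url: str, timeout: float = 10.0) -> bool:
    """HTTP check: True when the server returns any HTTP response."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # An HTTP error status still proves that the service answered.
        return True
    except OSError:
        # Covers URLError, connection failures, and timeouts.
        return False


def smoke_check(
    targets: Dict[str, str],
    dns_probe: Callable[[str], bool] = resolves,
    http_probe: Callable[[str], bool] = http_reachable,
) -> Dict[str, bool]:
    """Run DNS then HTTP probes per target; HTTP is skipped if DNS fails."""
    results: Dict[str, bool] = {}
    for name, url in targets.items():
        host = urllib.parse.urlsplit(url).hostname
        results[name] = bool(host and dns_probe(host) and http_probe(url))
    return results
```

Calling `smoke_check(TARGETS)` from a machine on the tailnet returns one boolean per service. The probes are injectable so the loop can be exercised without network access.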

## Success Criteria

Success requires two consecutive HA rebuilds passing all phase gates with no manual fixes, no manual `kubectl` patching, and no manual Tailscale proxy recreation.
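
The criterion above can be expressed as a streak check over rebuild results. The sketch below is illustrative only: the `Rebuild` record and `baseline_met` function are hypothetical, and it assumes "two consecutive" means the two most recent rebuilds.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Rebuild:
    """Outcome of one full HA rebuild (hypothetical record shape)."""

    gates_passed: bool         # every phase gate passed
    manual_interventions: int  # manual fixes, kubectl patches, proxy recreations


def baseline_met(history: Iterable[Rebuild], required: int = 2) -> bool:
    """True when the most recent `required` rebuilds were all clean.

    `history` is ordered oldest first. Any failed or manually patched
    rebuild resets the streak to zero.
    """
    streak = 0
    for rebuild in history:
        clean = rebuild.gates_passed and rebuild.manual_interventions == 0
        streak = streak + 1 if clean else 0
    return streak >= required
```

For example, a clean rebuild followed by one that needed a manual `kubectl` patch and then another clean rebuild does not meet the baseline, because the streak restarts after the patched run.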

## Validated Drills

- 2026-04-18: live Rancher backup/restore drill succeeded on the current cluster.
- A fresh one-time backup was created, restored back onto the same cluster, and post-restore validation confirmed:
  - all nodes remained `Ready`
  - Flux infrastructure stayed healthy
  - Rancher backup/restore resources reported `Completed`
  - Rancher, Grafana, and Prometheus remained reachable through the Tailscale smoke checks