# Stable Private-Only Baseline This document defines the current engineering target for this repository. ## Topology - 3 control planes (HA etcd cluster) - 5 workers - kube-vip API VIP (`10.27.27.40`) - private Proxmox/LAN network (`10.27.27.0/24`) - Tailscale operator access and service exposure - Rancher exposed through Tailscale (`rancher.silverside-gopher.ts.net`) - Grafana exposed through Tailscale (`grafana.silverside-gopher.ts.net`) - Prometheus exposed through Tailscale (`prometheus.silverside-gopher.ts.net:9090`) - `apps` Kustomization suspended by default ## In Scope - Terraform infrastructure bootstrap - Ansible k3s bootstrap on Ubuntu cloud-init VMs - **HA control plane (3 nodes with etcd quorum)** - **kube-vip for Kubernetes API HA** - **NFS-backed persistent volumes via `nfs-subdir-external-provisioner`** - Flux core reconciliation - External Secrets Operator with Doppler - Tailscale private access and smoke-check validation - cert-manager - Rancher and rancher-backup - Rancher backup/restore validation - Observability stack (Grafana, Prometheus, Loki, Promtail) - Persistent volume provisioning validated ## Deferred for Later Phases - app workloads in `apps/` ## Out of Scope - public ingress or DNS - public TLS - app workloads - cross-region / multi-cluster disaster recovery strategy - upgrade strategy ## Phase Gates 1. Terraform apply completes for HA topology (3 CP, 5 workers, 1 VIP). 2. Primary control plane bootstraps with `--cluster-init`. 3. kube-vip advertises `10.27.27.40:6443` from the control-plane set. 4. Secondary control planes join via the kube-vip endpoint. 5. Workers join successfully via the kube-vip endpoint. 7. etcd reports 3 healthy members. 8. Flux source and infrastructure reconciliation are healthy. 9. **NFS provisioner deploys and creates `flash-nfs` StorageClass**. 10. **PVC provisioning tested and working**. 11. External Secrets sync required secrets. 12. Tailscale private access works for Rancher, Grafana, and Prometheus. 13. CI smoke checks pass for Tailscale DNS resolution, `tailscale ping`, and HTTP reachability. 14. A fresh Rancher backup can be created and restored successfully. 15. Terraform destroy succeeds cleanly or via workflow retry. ## Success Criteria Success requires two consecutive HA rebuilds passing all phase gates with no manual fixes, no manual `kubectl` patching, and no manual Tailscale proxy recreation. ## Validated Drills - 2026-04-18: live Rancher backup/restore drill succeeded on the current cluster. - A fresh one-time backup was created, restored back onto the same cluster, and post-restore validation confirmed: - all nodes remained `Ready` - Flux infrastructure stayed healthy - Rancher backup/restore resources reported `Completed` - Rancher, Grafana, and Prometheus remained reachable through the Tailscale smoke checks