# Stable Private-Only Baseline

This document defines the current engineering target for this repository.

## Topology

- 3 control planes (HA etcd cluster)
- 3 workers
- Hetzner Load Balancer for Kubernetes API
- private Hetzner network
- Tailscale operator access and service exposure
- Rancher exposed through Tailscale (`rancher.silverside-gopher.ts.net`)
- Grafana exposed through Tailscale (`grafana.silverside-gopher.ts.net`)
- Prometheus exposed through Tailscale (`prometheus.silverside-gopher.ts.net:9090`)
- `apps` Kustomization suspended by default

## In Scope

- Terraform infrastructure bootstrap
- Ansible k3s bootstrap with external cloud provider
- **HA control plane (3 nodes with etcd quorum)**
- **Hetzner Load Balancer for Kubernetes API**
- **Hetzner CCM deployed via Ansible (before workers join)**
- **Hetzner CSI for persistent volumes (via Flux)**
- Flux core reconciliation
- External Secrets Operator with Doppler
- Tailscale private access and smoke-check validation
- cert-manager
- Rancher and rancher-backup
- Observability stack (Grafana, Prometheus, Loki, Promtail)
- Persistent volume provisioning validated

## Deferred for Later Phases

- app workloads in `apps/`

## Out of Scope

- public ingress or DNS
- public TLS
- app workloads
- DR / backup strategy
- upgrade strategy

## Phase Gates

1. Terraform apply completes for HA topology (3 CP, 3 workers, 1 LB).
2. Load Balancer is healthy with all 3 control plane targets.
3. Primary control plane bootstraps with `--cluster-init`.
4. Secondary control planes join via Load Balancer endpoint.
5. **CCM deployed via Ansible before workers join** (fixes uninitialized taint issue).
6. Workers join successfully via Load Balancer and all nodes show proper `providerID`.
7. etcd reports 3 healthy members.
8. Flux source and infrastructure reconciliation are healthy.
9. **CSI deploys and creates `hcloud-volumes` StorageClass**.
10. **PVC provisioning tested and working**.
11. External Secrets sync required secrets.
12. Tailscale private access works for Rancher, Grafana, and Prometheus.
13. CI smoke checks pass for Tailscale DNS resolution, `tailscale ping`, and HTTP reachability.
14. Terraform destroy succeeds cleanly or via workflow retry.

## Success Criteria

Success requires two consecutive HA rebuilds passing all phase gates with no manual fixes, no manual `kubectl` patching, and no manual Tailscale proxy recreation.
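
Phase gates 5 and 6 (CCM in place before workers join, every node showing a proper `providerID`) lend themselves to an automated check. The following is a minimal sketch, not this repository's actual tooling: the function name `check_nodes` is hypothetical, and it assumes the input is the parsed output of `kubectl get nodes -o json`, that the Hetzner CCM writes provider IDs with the `hcloud://` prefix, and that the expected node count is 6 (3 control planes plus 3 workers, per the topology).

```python
# Taint the kubelet applies when started with an external cloud provider;
# the cloud controller manager removes it once the node is initialized.
UNINITIALIZED_TAINT = "node.cloudprovider.kubernetes.io/uninitialized"

# Assumption: 3 control planes + 3 workers, as listed under Topology.
EXPECTED_NODES = 6


def check_nodes(node_list, expected=EXPECTED_NODES, provider_prefix="hcloud://"):
    """Return a list of problems found; an empty list means the gate passes.

    node_list is the dict obtained by parsing `kubectl get nodes -o json`.
    """
    problems = []
    items = node_list.get("items", [])

    if len(items) != expected:
        problems.append(f"expected {expected} nodes, found {len(items)}")

    for node in items:
        name = node["metadata"]["name"]
        spec = node.get("spec", {})

        # Gate 6: every node should carry a providerID set by the CCM.
        provider_id = spec.get("providerID", "")
        if not provider_id.startswith(provider_prefix):
            problems.append(
                f"{name}: missing or unexpected providerID {provider_id!r}"
            )

        # Gate 5: the uninitialized taint should be gone on every node.
        taints = spec.get("taints") or []
        if any(t.get("key") == UNINITIALIZED_TAINT for t in taints):
            problems.append(f"{name}: still carries the uninitialized taint")

    return problems
```

In a CI job, one plausible use is to pass `json.loads(...)` of the `kubectl` output to `check_nodes` and fail the run when the returned list is non-empty, which keeps the "no manual `kubectl` patching" criterion verifiable rather than anecdotal.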