2026-03-20 02:24:37 +00:00
# Stable Private-Only Baseline
This document defines the current engineering target for this repository.
## Topology
- 3 control plane nodes (HA etcd cluster)
- 3 workers
- Hetzner Load Balancer for Kubernetes API
- private Hetzner network
- Tailscale operator access and service exposure
- Rancher exposed through Tailscale (`rancher.silverside-gopher.ts.net`)
- Grafana exposed through Tailscale (`grafana.silverside-gopher.ts.net`)
- Prometheus exposed through Tailscale (`prometheus.silverside-gopher.ts.net:9090`)
- `apps` Kustomization suspended by default
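
The choice of three control plane nodes follows from etcd's majority rule. As a minimal sketch of that arithmetic (the function names are illustrative and not part of this repository):

```python
def etcd_quorum(members: int) -> int:
    """Smallest number of members that still forms a majority."""
    if members < 1:
        raise ValueError("a cluster needs at least one member")
    return members // 2 + 1


def failures_tolerated(members: int) -> int:
    """Members that can be lost while a majority remains."""
    return members - etcd_quorum(members)


if __name__ == "__main__":
    # Compare a single-node, the 3-node baseline, and a 4-node layout.
    for n in (1, 3, 4):
        print(f"{n} members: quorum={etcd_quorum(n)}, "
              f"tolerates={failures_tolerated(n)} failure(s)")
```

With three members the quorum is two, so the cluster survives the loss of one control plane node. A fourth member raises the quorum to three without raising the number of tolerated failures, which is why odd member counts are the usual choice.
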
## In Scope
- Terraform infrastructure bootstrap
- Ansible k3s bootstrap with external cloud provider
- **HA control plane (3 nodes with etcd quorum)**
- **Hetzner Load Balancer for Kubernetes API**
- **Hetzner CCM deployed via Ansible (before workers join)**
- **Hetzner CSI for persistent volumes (via Flux)**
- Flux core reconciliation
- External Secrets Operator with Doppler
- Tailscale private access and smoke-check validation
- cert-manager
- Rancher and rancher-backup
- Rancher backup/restore validation
- Observability stack (Grafana, Prometheus, Loki, Promtail)
- Persistent volume provisioning validated
## Deferred for Later Phases
- app workloads in `apps/`
## Out of Scope
- public ingress or DNS
- public TLS
- app workloads
- cross-region / multi-cluster disaster recovery strategy
- upgrade strategy
## Phase Gates
1. Terraform apply completes for HA topology (3 CP, 3 workers, 1 LB).
2. Load Balancer is healthy with all 3 control plane targets.
3. Primary control plane bootstraps with `--cluster-init`.
4. Secondary control planes join via Load Balancer endpoint.
5. **CCM deployed via Ansible before workers join** (fixes the uninitialized-taint issue).
6. Workers join successfully via Load Balancer and all nodes show a proper `providerID`.
7. etcd reports 3 healthy members.
8. Flux source and infrastructure reconciliation are healthy.
9. **CSI deploys and creates `hcloud-volumes` StorageClass**.
10. **PVC provisioning tested and working**.
11. External Secrets sync required secrets.
12. Tailscale private access works for Rancher, Grafana, and Prometheus.
13. CI smoke checks pass for Tailscale DNS resolution, `tailscale ping`, and HTTP reachability.
14. A fresh Rancher backup can be created and restored successfully.
15. Terraform destroy succeeds cleanly or via workflow retry.
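
Gates 5 and 6 can be checked mechanically from node objects. The following is a rough sketch, assuming that the Hetzner CCM writes `providerID` values beginning with `hcloud://` and that uninitialized nodes carry the standard `node.cloudprovider.kubernetes.io/uninitialized` taint. The function name and the sample node names are hypothetical.

```python
UNINITIALIZED_TAINT = "node.cloudprovider.kubernetes.io/uninitialized"
PROVIDER_PREFIX = "hcloud://"  # assumption: providerID format set by the Hetzner CCM


def nodes_failing_gate(node_list: dict) -> dict:
    """Map node name -> list of problems, given a parsed Kubernetes NodeList."""
    problems = {}
    for node in node_list.get("items", []):
        name = node["metadata"]["name"]
        spec = node.get("spec", {})
        issues = []
        provider_id = spec.get("providerID", "")
        if not provider_id.startswith(PROVIDER_PREFIX):
            issues.append(f"unexpected providerID {provider_id!r}")
        taints = spec.get("taints") or []
        if any(t.get("key") == UNINITIALIZED_TAINT for t in taints):
            issues.append("still carries the uninitialized taint")
        if issues:
            problems[name] = issues
    return problems


# Hypothetical sample: one initialized control plane, one uninitialized worker.
SAMPLE = {
    "items": [
        {"metadata": {"name": "cp-1"}, "spec": {"providerID": "hcloud://1001"}},
        {
            "metadata": {"name": "worker-1"},
            "spec": {
                "taints": [
                    {"key": UNINITIALIZED_TAINT, "value": "true", "effect": "NoSchedule"}
                ]
            },
        },
    ]
}

if __name__ == "__main__":
    for node_name, issues in nodes_failing_gate(SAMPLE).items():
        print(node_name, "->", "; ".join(issues))
```

In practice the input would come from parsing the output of `kubectl get nodes -o json`; an empty result means every node passed both checks.
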
## Success Criteria
Success requires two consecutive HA rebuilds passing all phase gates with no manual fixes, no manual `kubectl` patching, and no manual Tailscale proxy recreation.
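
One possible way to encode this criterion is sketched below. It assumes that each rebuild is recorded with a count of passed gates and a count of manual interventions, and that "consecutive" refers to the most recent runs. The `RebuildRun` record and `baseline_met` function are hypothetical and do not exist in this repository.

```python
from dataclasses import dataclass
from typing import Sequence

TOTAL_GATES = 15  # number of phase gates listed in this document


@dataclass(frozen=True)
class RebuildRun:
    """Hypothetical record of one full HA rebuild."""
    gates_passed: int
    manual_interventions: int = 0  # manual fixes, kubectl patches, proxy recreations

    @property
    def clean(self) -> bool:
        return self.gates_passed == TOTAL_GATES and self.manual_interventions == 0


def baseline_met(runs: Sequence[RebuildRun], required: int = 2) -> bool:
    """True when the most recent `required` runs are all clean."""
    if len(runs) < required:
        return False
    return all(run.clean for run in runs[-required:])


if __name__ == "__main__":
    history = [RebuildRun(15, manual_interventions=1), RebuildRun(15), RebuildRun(15)]
    print(baseline_met(history))
```

A single manual intervention in either of the last two runs resets the count, which matches the intent that the rebuild be fully hands-off.
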
## Validated Drills
- 2026-04-18: live Rancher backup/restore drill succeeded on the current cluster.
  - A fresh one-time backup was created, restored back onto the same cluster, and post-restore validation confirmed:
    - all nodes remained `Ready`
    - Flux infrastructure stayed healthy
    - Rancher backup/restore resources reported `Completed`
    - Rancher, Grafana, and Prometheus remained reachable through the Tailscale smoke checks
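
The DNS and HTTP parts of such a smoke check could look roughly like the sketch below. It is not the repository's CI script: the function names are illustrative, the probes are injectable so the logic can be exercised offline, and it does not cover `tailscale ping`.

```python
import socket
import urllib.error
import urllib.parse
import urllib.request
from typing import Callable, Dict


def resolves(host: str) -> bool:
    """True if the hostname resolves to at least one address."""
    try:
        socket.getaddrinfo(host, None)
        return True
    except socket.gaierror:
        return False


def http_reachable(url: str, timeout: float = 5.0) -> bool:
    """True if the server answers with any status below 500."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status < 500
    except urllib.error.HTTPError as exc:
        # 4xx (for example a login redirect target) still proves reachability.
        return exc.code < 500
    except (urllib.error.URLError, OSError):
        return False


def smoke_check(
    targets: Dict[str, str],
    resolve: Callable[[str], bool] = resolves,
    fetch: Callable[[str], bool] = http_reachable,
) -> Dict[str, bool]:
    """Run DNS then HTTP checks for each named URL; skip HTTP if DNS fails."""
    results = {}
    for name, url in targets.items():
        host = urllib.parse.urlsplit(url).hostname
        results[name] = bool(host) and resolve(host) and fetch(url)
    return results
```

A caller on the tailnet might pass something like `{"prometheus": "http://prometheus.silverside-gopher.ts.net:9090"}`; the URL scheme here is an assumption, since this document only names the hosts and the Prometheus port.
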