fix: add tailnet smoke checks and move Tailscale operator to stable

Add a post-deploy smoke test that validates Tailscale DNS, proxy readiness,
reachability, and service responses for Rancher, Grafana, and Prometheus.
Move the operator to the stable Helm repo/version and align the baseline docs
with the current HA private-only architecture.
2026-04-18 19:59:13 +00:00
parent 60f466ab98
commit 7385c2263e
7 changed files with 132 additions and 49 deletions
+14 -15
@@ -8,8 +8,11 @@ This document defines the current engineering target for this repository.
- 3 workers
- Hetzner Load Balancer for Kubernetes API
- private Hetzner network
- Tailscale operator access
- Rancher UI exposed only through Tailscale (`rancher.silverside-gopher.ts.net`)
- Tailscale operator access and service exposure
- Rancher exposed through Tailscale (`rancher.silverside-gopher.ts.net`)
- Grafana exposed through Tailscale (`grafana.silverside-gopher.ts.net`)
- Prometheus exposed through Tailscale (`prometheus.silverside-gopher.ts.net:9090`)
- `apps` Kustomization suspended by default
## In Scope
@@ -21,12 +24,15 @@ This document defines the current engineering target for this repository.
- **Hetzner CSI for persistent volumes (via Flux)**
- Flux core reconciliation
- External Secrets Operator with Doppler
- Tailscale private access
- Tailscale private access and smoke-check validation
- cert-manager
- Rancher and rancher-backup
- Observability stack (Grafana, Prometheus, Loki, Promtail)
- Persistent volume provisioning validated
## Deferred for Later Phases
- Observability stack (deferred - complex helm release needs separate debugging)
- app workloads in `apps/`
## Out of Scope
@@ -49,17 +55,10 @@ This document defines the current engineering target for this repository.
9. **CSI deploys and creates `hcloud-volumes` StorageClass**.
10. **PVC provisioning tested and working**.
11. External Secrets sync required secrets.
12. Tailscale private access works, including Rancher UI access.
13. Terraform destroy succeeds cleanly or via workflow retry.
12. Tailscale private access works for Rancher, Grafana, and Prometheus.
13. CI smoke checks pass for Tailscale DNS resolution, `tailscale ping`, and HTTP reachability.
14. Terraform destroy succeeds cleanly or via workflow retry.
## Success Criteria
**ACHIEVED** - HA Cluster with CCM/CSI:
- Build 1: Initial CCM/CSI deployment and validation (2026-03-23)
- Build 2: Full destroy/rebuild cycle successful (2026-03-23)
🔄 **IN PROGRESS** - HA Control Plane Validation:
- Build 3: Deploy 3-3 topology with Load Balancer
- Build 4: Destroy/rebuild to validate HA configuration
Success requires two consecutive HA rebuilds passing all phase gates with no manual fixes.
Success requires two consecutive HA rebuilds passing all phase gates with no manual fixes, no manual `kubectl` patching, and no manual Tailscale proxy recreation.
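
The smoke checks named in the phase gates (Tailscale DNS resolution, `tailscale ping`, and HTTP reachability for Rancher, Grafana, and Prometheus) could be sketched roughly as below. This is a minimal illustration, not the workflow script from this commit. The hostnames and the Prometheus port come from the baseline doc; the URL schemes, the root paths, the timeouts, and every function name are assumptions. Proxy readiness is not covered here, since that would need a Kubernetes API query.

```python
import socket
import subprocess
import urllib.error
import urllib.request
from typing import Callable, Iterable, List, Tuple
from urllib.parse import urlsplit

# Hostnames and the Prometheus port come from the baseline doc.
# The URL schemes and root paths are assumptions.
TARGETS: List[Tuple[str, str]] = [
    ("rancher", "https://rancher.silverside-gopher.ts.net/"),
    ("grafana", "http://grafana.silverside-gopher.ts.net/"),
    ("prometheus", "http://prometheus.silverside-gopher.ts.net:9090/"),
]


def dns_resolves(host: str) -> bool:
    """Return True if the host resolves to at least one address."""
    try:
        return len(socket.getaddrinfo(host, None)) > 0
    except socket.gaierror:
        return False


def tailscale_ping(host: str) -> bool:
    """Return True if `tailscale ping <host>` exits with status 0."""
    try:
        result = subprocess.run(
            ["tailscale", "ping", host],
            capture_output=True,
            timeout=60,  # assumed upper bound for the whole command
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


def http_ok(url: str) -> bool:
    """Return True if the URL answers with a status below 400."""
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            return response.status < 400
    except (urllib.error.URLError, OSError):
        return False


def run_smoke_checks(
    targets: Iterable[Tuple[str, str]],
    resolve: Callable[[str], bool] = dns_resolves,
    ping: Callable[[str], bool] = tailscale_ping,
    fetch: Callable[[str], bool] = http_ok,
) -> List[str]:
    """Run DNS, ping, and HTTP checks in order; return failure messages."""
    failures: List[str] = []
    for name, url in targets:
        host = urlsplit(url).hostname
        if not resolve(host):
            failures.append(f"{name}: DNS resolution failed for {host}")
            continue  # later checks are meaningless without DNS
        if not ping(host):
            failures.append(f"{name}: tailscale ping failed for {host}")
            continue
        if not fetch(url):
            failures.append(f"{name}: HTTP check failed for {url}")
    return failures

# Example use from a CI step (requires a tailnet-connected runner):
#   failures = run_smoke_checks(TARGETS)
#   raise SystemExit("\n".join(failures) if failures else 0)
```

The checks are passed in as parameters so the ordering logic can be exercised without a tailnet. A real CI job would more likely retry each check for a while, since Tailscale proxies can take time to become reachable after a rebuild.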