7385c2263e
Add a post-deploy smoke test that validates Tailscale DNS, proxy readiness, reachability, and service responses for Rancher, Grafana, and Prometheus. Move the operator to the stable Helm repo/version and align the baseline docs with the current HA private-only architecture.
Stable Private-Only Baseline
This document defines the current engineering target for this repository.
Topology
- 3 control planes (HA etcd cluster)
- 3 workers
- Hetzner Load Balancer for Kubernetes API
- private Hetzner network
- Tailscale operator access and service exposure
- Rancher exposed through Tailscale (`rancher.silverside-gopher.ts.net`)
- Grafana exposed through Tailscale (`grafana.silverside-gopher.ts.net`)
- Prometheus exposed through Tailscale (`prometheus.silverside-gopher.ts.net:9090`)
- `apps` Kustomization suspended by default
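
The three-node control plane follows from etcd's majority rule: a cluster of n voting members needs floor(n/2) + 1 of them to agree, so three members can lose one and still make progress. A minimal sketch of that arithmetic (the function names are illustrative and not part of this repository):

```python
def etcd_quorum(members: int) -> int:
    """Smallest majority of an etcd cluster with `members` voting members."""
    if members < 1:
        raise ValueError("an etcd cluster needs at least one member")
    return members // 2 + 1


def failure_tolerance(members: int) -> int:
    """Number of members that can be lost while a majority still remains."""
    return members - etcd_quorum(members)
```

With these definitions a fourth control plane raises the quorum to three without raising the failure tolerance, which is why odd member counts are the usual choice.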
In Scope
- Terraform infrastructure bootstrap
- Ansible k3s bootstrap with external cloud provider
- HA control plane (3 nodes with etcd quorum)
- Hetzner Load Balancer for Kubernetes API
- Hetzner CCM deployed via Ansible (before workers join)
- Hetzner CSI for persistent volumes (via Flux)
- Flux core reconciliation
- External Secrets Operator with Doppler
- Tailscale private access and smoke-check validation
- cert-manager
- Rancher and rancher-backup
- Observability stack (Grafana, Prometheus, Loki, Promtail)
- Persistent volume provisioning validated
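
Because k3s runs with an external cloud provider, a node only counts as initialized once the CCM has set its `providerID` and removed the cloud-provider taint. The sketch below shows one way to check that from the output of `kubectl get nodes -o json`. It assumes the Hetzner CCM writes provider IDs with an `hcloud://` prefix; the function name is hypothetical.

```python
import json

# Taint the kubelet applies when started with an external cloud provider.
UNINITIALIZED_TAINT = "node.cloudprovider.kubernetes.io/uninitialized"
# Assumption: provider IDs written by the Hetzner CCM start with this scheme.
PROVIDER_PREFIX = "hcloud://"


def unready_nodes(node_list_json: str) -> dict[str, list[str]]:
    """Map node name to the problems found in `kubectl get nodes -o json` output."""
    problems: dict[str, list[str]] = {}
    for node in json.loads(node_list_json).get("items", []):
        name = node["metadata"]["name"]
        spec = node.get("spec", {})
        found = []
        if not spec.get("providerID", "").startswith(PROVIDER_PREFIX):
            found.append("missing or unexpected providerID")
        # `taints` may be absent or null on an untainted node.
        if any(t.get("key") == UNINITIALIZED_TAINT for t in spec.get("taints") or []):
            found.append("still carries the uninitialized taint")
        if found:
            problems[name] = found
    return problems
```

An empty result means every node has been picked up by the CCM; any entry names the node and what is still missing.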
Deferred for Later Phases
- app workloads in `apps/`
Out of Scope
- public ingress or DNS
- public TLS
- app workloads
- DR / backup strategy
- upgrade strategy
Phase Gates
- Terraform apply completes for HA topology (3 CP, 3 workers, 1 LB).
- Load Balancer is healthy with all 3 control plane targets.
- Primary control plane bootstraps with `--cluster-init`.
- Secondary control planes join via Load Balancer endpoint.
- CCM deployed via Ansible before workers join (fixes the `node.cloudprovider.kubernetes.io/uninitialized` taint issue).
- Workers join successfully via Load Balancer and all nodes show proper `providerID`.
- etcd reports 3 healthy members.
- Flux source and infrastructure reconciliation are healthy.
- CSI deploys and creates `hcloud-volumes` StorageClass.
- PVC provisioning tested and working.
- External Secrets sync required secrets.
- Tailscale private access works for Rancher, Grafana, and Prometheus.
- CI smoke checks pass for Tailscale DNS resolution, `tailscale ping`, and HTTP reachability.
- Terraform destroy succeeds cleanly or via workflow retry.
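
The smoke-check gate can be pictured as three ordered probes per service: DNS resolution, `tailscale ping`, then an HTTP request. The sketch below is an illustration, not the repository's actual CI script. The hostnames come from the Topology section; the `http://` scheme, the health paths, the `tailscale ping` flags, and all function names are assumptions.

```python
import socket
import subprocess
import urllib.request
from typing import Callable

# Hostnames are from the Topology section. Schemes and paths are assumptions.
TARGETS: dict[str, tuple[str, str]] = {
    "rancher": ("rancher.silverside-gopher.ts.net",
                "http://rancher.silverside-gopher.ts.net/ping"),
    "grafana": ("grafana.silverside-gopher.ts.net",
                "http://grafana.silverside-gopher.ts.net/api/health"),
    "prometheus": ("prometheus.silverside-gopher.ts.net",
                   "http://prometheus.silverside-gopher.ts.net:9090/-/ready"),
}

Check = Callable[[str], bool]


def resolves(host: str) -> bool:
    """True if the hostname resolves through the system resolver."""
    try:
        socket.getaddrinfo(host, None)
        return True
    except socket.gaierror:
        return False


def tailscale_pings(host: str) -> bool:
    """True if one `tailscale ping` succeeds (assumed to exit non-zero on no reply)."""
    try:
        result = subprocess.run(
            ["tailscale", "ping", "--c", "1", "--timeout", "5s", host],
            capture_output=True, timeout=15, check=False,
        )
    except (FileNotFoundError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


def http_ok(url: str) -> bool:
    """True if the URL answers with a 2xx or 3xx status."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return 200 <= resp.status < 400
    except OSError:  # covers URLError, HTTPError, and socket timeouts
        return False


def run_smoke_checks(targets: dict[str, tuple[str, str]],
                     resolve: Check = resolves,
                     ping: Check = tailscale_pings,
                     fetch: Check = http_ok) -> list[str]:
    """Run the probes in order per service; return one message per failing service."""
    failures = []
    for name, (host, url) in targets.items():
        if not resolve(host):
            failures.append(f"{name}: DNS resolution failed for {host}")
        elif not ping(host):
            failures.append(f"{name}: tailscale ping failed for {host}")
        elif not fetch(url):
            failures.append(f"{name}: HTTP check failed for {url}")
    return failures
```

Stopping at the first failing probe for each service keeps the report pointed at the lowest broken layer. Passing the probes in as parameters lets the ordering logic be exercised without a tailnet.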
Success Criteria
Success requires two consecutive HA rebuilds passing all phase gates with no manual fixes, no manual kubectl patching, and no manual Tailscale proxy recreation.
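
As a rough illustration, the criterion can be expressed as a check over rebuild history. The sketch below reads "two consecutive" as the two most recent rebuilds, which is an interpretation; the record type and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RebuildResult:
    gates_passed: bool          # every phase gate passed
    manual_interventions: int   # kubectl patches, proxy recreation, other manual fixes


def baseline_met(history: list[RebuildResult], required: int = 2) -> bool:
    """True when the most recent `required` rebuilds all passed without manual work."""
    if len(history) < required:
        return False
    recent = history[-required:]
    return all(r.gates_passed and r.manual_interventions == 0 for r in recent)
```

A single rebuild that needed a manual fix resets the count, so the baseline is only met again after two further clean runs.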