fix: add tailnet smoke checks and move Tailscale operator to stable
Deploy Cluster / Terraform (push) Successful in 49s
Deploy Cluster / Ansible (push) Successful in 5m55s

Add a post-deploy smoke test that validates Tailscale DNS, proxy readiness,
reachability, and service responses for Rancher, Grafana, and Prometheus.
Move the operator to the stable Helm repo/version and align the baseline docs
with the current HA private-only architecture.
2026-04-18 19:59:13 +00:00
parent 60f466ab98
commit 7385c2263e
7 changed files with 132 additions and 49 deletions
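The smoke test itself is not shown in the excerpt below. As a hedged sketch of the shape such a post-deploy check could take (the function names, example hostnames, and stubbed probe are illustrative assumptions, not the repo's actual script):

```shell
# Illustrative sketch only, not the repository's smoke-test script.
# probe_with_curl: fetch a URL and print its HTTP status code.
probe_with_curl() {
  curl -sk -o /dev/null -m 10 -w '%{http_code}' "$1"
}

# run_smoke: run the given probe against each URL; all must answer 200.
run_smoke() {
  local probe=$1 url code rc=0
  shift
  for url in "$@"; do
    code=$("$probe" "$url")
    if [ "$code" = "200" ]; then
      echo "OK   $url"
    else
      echo "FAIL $url ($code)"
      rc=1
    fi
  done
  return $rc
}

# Real usage would be something like:
#   run_smoke probe_with_curl https://rancher.silverside-gopher.ts.net/ ...
# Demo with a stub probe so this runs without a tailnet:
stub_probe() { echo 200; }   # always reports HTTP 200
run_smoke stub_probe https://rancher.example.ts.net/ https://grafana.example.ts.net/
```

The probe is injected as a parameter so the check loop can be exercised without tailnet access.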
@@ -7,18 +7,11 @@ Production-ready Kubernetes cluster on Hetzner Cloud using Terraform and Ansible
 | Component | Details |
 |-----------|---------|
 | **Control Plane** | 3x CX23 (HA) |
-| **Workers** | 4x CX33 |
-| **Total Cost** | €28.93/mo |
+| **Workers** | 3x CX33 |
 | **K8s** | k3s (latest, HA) |
 | **Addons** | Hetzner CCM + CSI + Prometheus + Grafana + Loki |
-| **Access** | SSH/API and Rancher UI restricted to Tailnet |
-| **Bootstrap** | Terraform + Ansible |
-### Cluster Resources
-- 22 vCPU total (6 CP + 16 workers)
-- 44 GB RAM total (12 CP + 32 workers)
-- 440 GB SSD storage
-- 140 TB bandwidth allocation
+| **Access** | SSH/API and private services restricted to Tailnet |
+| **Bootstrap** | Terraform + Ansible + Flux |
 ## Prerequisites
@@ -143,15 +136,14 @@ export KUBECONFIG=$(pwd)/outputs/kubeconfig
 kubectl get nodes
 ```
-Kubeconfig endpoint is rewritten to the primary control-plane tailnet hostname (`k8s-cluster-cp-1.<your-tailnet>`).
+Use `scripts/refresh-kubeconfig.sh <cp1-public-ip>` to refresh kubeconfig against the primary control-plane public IP after rebuilds.
 ## Gitea CI/CD
 This repository includes Gitea workflows for:
-- **terraform-plan**: Runs on PRs, shows planned changes
-- **terraform-apply**: Runs on main branch after merge
-- **ansible-deploy**: Runs after terraform apply
+- **deploy**: End-to-end Terraform + Ansible + Flux bootstrap + restore + health checks
+- **destroy**: Cluster teardown with backup-aware cleanup
+- **dashboards**: Fast workflow that updates Grafana datasources/dashboards only
 ### Required Gitea Secrets
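The `scripts/refresh-kubeconfig.sh <cp1-public-ip>` helper mentioned above is not shown in this diff. A minimal sketch of what such a script might do, assuming a standard kubeconfig whose `server:` field must be pointed at the rebuilt control plane (the demo path and IPs are placeholders):

```shell
# Not the repo's actual script: a minimal sketch of the endpoint rewrite.
# $1: kubeconfig path, $2: new control-plane endpoint (IP or hostname).
refresh_kubeconfig() {
  sed -i.bak "s|server: https://[^:]*:6443|server: https://$2:6443|" "$1"
}

# Demo against a throwaway kubeconfig:
cat > /tmp/kubeconfig.demo <<'EOF'
apiVersion: v1
kind: Config
clusters:
- cluster:
    server: https://10.0.1.1:6443
  name: default
EOF
refresh_kubeconfig /tmp/kubeconfig.demo 203.0.113.10
grep 'server:' /tmp/kubeconfig.demo
```

`sed -i.bak` keeps a backup of the previous endpoint, which is convenient when flipping between the public IP and the tailnet hostname.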
@@ -181,13 +173,13 @@ This repo uses Flux for continuous reconciliation after Terraform + Ansible boot
 ### Stable private-only baseline
-The current default target is a deliberately simplified baseline:
+The current default target is the HA private baseline:
-- `1` control plane node
-- `2` worker nodes
+- `3` control plane nodes
+- `3` worker nodes
 - private Hetzner network only
-- Tailscale for operator access
-- Flux-managed core addons only
+- Tailscale for operator and service access
+- Flux-managed platform addons with `apps` suspended by default
 Detailed phase gates and success criteria live in `STABLE_BASELINE.md`.
@@ -232,31 +224,30 @@ Terraform/bootstrap secrets remain in Gitea Actions secrets and are not managed
 ### Current addon status
 - Core infrastructure addons are Flux-managed from `infrastructure/addons/`.
-- Active Flux addons for stable baseline: `addon-tailscale-operator`, `addon-tailscale-proxyclass`, `addon-external-secrets`.
-- Deferred addons: `addon-ccm`, `addon-csi`, `addon-observability`, `addon-observability-content` (to be added after baseline is stable).
-- Ansible is limited to cluster bootstrap, private-access setup, and prerequisite secret creation for Flux-managed addons.
+- Active Flux addons for the current baseline: `addon-ccm`, `addon-csi`, `addon-cert-manager`, `addon-external-secrets`, `addon-tailscale-operator`, `addon-tailscale-proxyclass`, `addon-observability`, `addon-observability-content`, `addon-rancher`, `addon-rancher-config`, `addon-rancher-backup`, `addon-rancher-backup-config`.
+- `apps` remains suspended until workload rollout is explicitly enabled.
+- Ansible is limited to cluster bootstrap, prerequisite secret creation, pre-proxy Tailscale cleanup, and kubeconfig finalization.
 - Weave GitOps / Flux UI is no longer deployed; use Rancher or the `flux` CLI for Flux operations.
 ### Rancher access
-- Rancher is private-only and exposed through Tailscale at `https://rancher.silverside-gopher.ts.net/dashboard/`.
+- Rancher is private-only and exposed through Tailscale at `https://rancher.silverside-gopher.ts.net/`.
 - The public Hetzner load balancer path is not used for Rancher.
-- Rancher uses the CNPG-backed PostgreSQL cluster in `cnpg-cluster`.
+- Rancher stores state in embedded etcd; no external database is used.
 ### Stable baseline acceptance
 A rebuild is considered successful only when all of the following pass without manual intervention:
-- Terraform create succeeds for the default `1` control plane and `2` workers.
+- Terraform create succeeds for the default `3` control planes and `3` workers.
 - Ansible bootstrap succeeds end-to-end.
 - All nodes become `Ready`.
 - Flux core reconciliation is healthy.
 - External Secrets Operator is ready.
 - Tailscale operator is ready.
+- Tailnet smoke checks pass for Rancher, Grafana, and Prometheus.
 - Terraform destroy succeeds cleanly or succeeds after workflow retries.
-_Note: Observability stack (Grafana/Prometheus) is deferred and will be added once the core platform baseline is stable._
 ## Observability Stack
 Flux deploys a lightweight observability stack in the `observability` namespace:
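The "All nodes become `Ready`" acceptance gate above can be sketched as a small check over `kubectl get nodes` output. The node names and versions in the sample are illustrative for the default 3 control-plane + 3 worker layout:

```shell
# count_ready: read `kubectl get nodes` output on stdin and succeed only
# when exactly $1 nodes report STATUS=Ready.
count_ready() {
  local expected=$1 ready
  ready=$(awk 'NR > 1 && $2 == "Ready" { n++ } END { print n+0 }')
  [ "$ready" -eq "$expected" ]
}

# Sample output (illustrative); real usage: kubectl get nodes | count_ready 6
sample=$(cat <<'EOF'
NAME                   STATUS   ROLES                       AGE   VERSION
k8s-cluster-cp-1       Ready    control-plane,etcd,master   10m   v1.30.0+k3s1
k8s-cluster-cp-2       Ready    control-plane,etcd,master   9m    v1.30.0+k3s1
k8s-cluster-cp-3       Ready    control-plane,etcd,master   9m    v1.30.0+k3s1
k8s-cluster-worker-1   Ready    <none>                      8m    v1.30.0+k3s1
k8s-cluster-worker-2   Ready    <none>                      8m    v1.30.0+k3s1
k8s-cluster-worker-3   Ready    <none>                      8m    v1.30.0+k3s1
EOF
)
echo "$sample" | count_ready 6 && echo "all 6 nodes Ready"
```

Skipping the header row (`NR > 1`) keeps the `STATUS` column header from being miscounted.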
@@ -301,9 +292,11 @@ Grafana password: value of `GRAFANA_ADMIN_PASSWORD` secret (or the generated val
 export KUBECONFIG=$(pwd)/outputs/kubeconfig
 kubectl -n tailscale-system get pods
-kubectl -n observability get svc kube-prometheus-stack-grafana kube-prometheus-stack-prometheus
-kubectl -n observability describe svc kube-prometheus-stack-grafana | grep TailscaleProxyReady
-kubectl -n observability describe svc kube-prometheus-stack-prometheus | grep TailscaleProxyReady
+kubectl -n cattle-system get svc rancher-tailscale
+kubectl -n observability get svc grafana-tailscale prometheus-tailscale
+kubectl -n cattle-system describe svc rancher-tailscale | grep TailscaleProxyReady
+kubectl -n observability describe svc grafana-tailscale | grep TailscaleProxyReady
+kubectl -n observability describe svc prometheus-tailscale | grep TailscaleProxyReady
 ```
 If `TailscaleProxyReady=False`, check:
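For reference, the `describe | grep` checks above look for the readiness condition the Tailscale operator publishes on each proxied Service. An abridged, illustrative sample of such output and a parse of it (field layout is an assumption about the `describe` rendering):

```shell
# Illustrative, abridged `kubectl describe svc` output; real output has
# many more fields and may format conditions differently.
sample=$(cat <<'EOF'
Name:              rancher-tailscale
Namespace:         cattle-system
Conditions:
  Type                 Status
  TailscaleProxyReady  True
EOF
)
# Pull the Status column for the TailscaleProxyReady condition.
status=$(echo "$sample" | awk '$1 == "TailscaleProxyReady" { print $2 }')
echo "proxy ready: $status"
```

A `True` here means the operator has brought up the proxy pod and the tailnet hostname should resolve; `False` points at the troubleshooting list above.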