STABLE_BASELINE.md

# Stable Private-Only Baseline

This document defines the current engineering target for this repository.

## Topology

- 3 control planes (HA etcd cluster)
- 3 workers
- Hetzner Load Balancer for Kubernetes API
- private Hetzner network
- Tailscale operator access and service exposure
- Rancher exposed through Tailscale (`rancher.silverside-gopher.ts.net`)
- Grafana exposed through Tailscale (`grafana.silverside-gopher.ts.net`)
- Prometheus exposed through Tailscale (`prometheus.silverside-gopher.ts.net:9090`)
- `apps` Kustomization suspended by default
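
The Tailscale endpoints listed above lend themselves to a small smoke check. The following is a minimal sketch, not the repository's actual CI job: it assumes the `tailscale` CLI is on `PATH`, treats any HTTP response as "reachable", and leaves the URL scheme to the caller because this document does not state it. All function names are hypothetical.

```python
import socket
import subprocess
import urllib.error
import urllib.request
from typing import Callable, Dict, List, NamedTuple, Optional

# Endpoints copied from the Topology list above.
ENDPOINTS = [
    "rancher.silverside-gopher.ts.net",
    "grafana.silverside-gopher.ts.net",
    "prometheus.silverside-gopher.ts.net:9090",
]


class Endpoint(NamedTuple):
    host: str
    port: Optional[int]


def parse_endpoint(raw: str) -> Endpoint:
    """Split 'host' or 'host:port' into its parts."""
    host, sep, port = raw.partition(":")
    return Endpoint(host, int(port) if sep else None)


def url_for(ep: Endpoint, scheme: str) -> str:
    """Build a probe URL; the scheme is an assumption supplied by the caller."""
    netloc = ep.host if ep.port is None else f"{ep.host}:{ep.port}"
    return f"{scheme}://{netloc}/"


def dns_resolves(ep: Endpoint) -> bool:
    """True if the host name resolves through the local resolver."""
    try:
        socket.getaddrinfo(ep.host, None)
        return True
    except OSError:
        return False


def tailscale_ping(ep: Endpoint) -> bool:
    """Run `tailscale ping <host>`; exit code 0 is treated as success."""
    try:
        result = subprocess.run(
            ["tailscale", "ping", ep.host],
            capture_output=True, timeout=60, check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


def http_reachable(ep: Endpoint, scheme: str = "http") -> bool:
    """Any HTTP response (even an error status) counts as reachable here."""
    try:
        with urllib.request.urlopen(url_for(ep, scheme), timeout=10):
            return True
    except urllib.error.HTTPError:
        return True  # the server answered, so the tailnet path works
    except (urllib.error.URLError, OSError):
        return False


Check = Callable[[Endpoint], bool]

DEFAULT_CHECKS: Dict[str, Check] = {
    "dns": dns_resolves,
    "tailscale_ping": tailscale_ping,
    "http": http_reachable,
}


def run_smoke_checks(
    raw_endpoints: List[str], checks: Dict[str, Check]
) -> Dict[str, Dict[str, bool]]:
    """Run every check against every endpoint and collect the results."""
    report: Dict[str, Dict[str, bool]] = {}
    for raw in raw_endpoints:
        ep = parse_endpoint(raw)
        report[raw] = {name: check(ep) for name, check in checks.items()}
    return report
```

From a machine on the tailnet, `run_smoke_checks(ENDPOINTS, DEFAULT_CHECKS)` returns one pass/fail map per endpoint. Passing the checks in as a dictionary keeps the probing logic testable without network access.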

## In Scope

- Terraform infrastructure bootstrap
- Ansible k3s bootstrap with external cloud provider
- **HA control plane (3 nodes with etcd quorum)**
- **Hetzner Load Balancer for Kubernetes API**
- **Hetzner CCM deployed via Ansible (before workers join)**
- **Hetzner CSI for persistent volumes (via Flux)**
- Flux core reconciliation
- External Secrets Operator with Doppler
- Tailscale private access and smoke-check validation
- cert-manager
- Rancher and rancher-backup
- Rancher backup/restore validation
- Observability stack (Grafana, Prometheus, Loki, Promtail)
- Persistent volume provisioning validated
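
Persistent volume provisioning can be exercised with a throwaway claim against the `hcloud-volumes` StorageClass named in the phase gates. The sketch below only builds the manifest; the claim name, namespace, and `10Gi` size are assumptions chosen for illustration, not values taken from this repository.

```python
import json
from typing import Any, Dict


def smoke_test_pvc(
    name: str = "csi-smoke-test",   # hypothetical claim name
    namespace: str = "default",     # hypothetical namespace
    size: str = "10Gi",             # assumed request size
) -> Dict[str, Any]:
    """Build a PersistentVolumeClaim manifest for the hcloud-volumes class."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "accessModes": ["ReadWriteOnce"],
            "storageClassName": "hcloud-volumes",
            "resources": {"requests": {"storage": size}},
        },
    }


def to_kubectl_input(manifest: Dict[str, Any]) -> str:
    """Serialize the manifest as JSON, which `kubectl apply -f -` accepts."""
    return json.dumps(manifest, indent=2)
```

Printing `to_kubectl_input(smoke_test_pvc())` and piping it into `kubectl apply -f -` creates the claim. Depending on the StorageClass's volume binding mode, the claim may stay `Pending` until a pod actually mounts it, so a complete test would also schedule a consumer pod.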

## Deferred for Later Phases

- app workloads in `apps/`

## Out of Scope

- public ingress or DNS
- public TLS
- app workloads (deferred to a later phase; see above)
- cross-region / multi-cluster disaster recovery strategy
- upgrade strategy

## Phase Gates

1. Terraform apply completes for HA topology (3 CP, 3 workers, 1 LB).
2. Load Balancer is healthy with all 3 control plane targets.
3. Primary control plane bootstraps with `--cluster-init`.
4. Secondary control planes join via Load Balancer endpoint.
5. **CCM deployed via Ansible before workers join** (prevents nodes from staying tainted with `node.cloudprovider.kubernetes.io/uninitialized`).
6. Workers join successfully via Load Balancer and all nodes show proper `providerID`.
7. etcd reports 3 healthy members.
8. Flux source and infrastructure reconciliation are healthy.
9. **CSI deploys and creates `hcloud-volumes` StorageClass**.
10. **PVC provisioning tested and working**.
11. External Secrets sync required secrets.
12. Tailscale private access works for Rancher, Grafana, and Prometheus.
13. CI smoke checks pass for Tailscale DNS resolution, `tailscale ping`, and HTTP reachability.
14. A fresh Rancher backup can be created and restored successfully.
15. Terraform destroy succeeds cleanly or via workflow retry.
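
Gates 5 and 6 can be checked mechanically from the output of `kubectl get nodes -o json`. The sketch below is one possible check, not the repository's implementation. It assumes that a "proper" `providerID` starts with `hcloud://` and that an initialized node no longer carries the `node.cloudprovider.kubernetes.io/uninitialized` taint; the function name is hypothetical.

```python
from typing import Any, Dict, List

UNINITIALIZED_TAINT = "node.cloudprovider.kubernetes.io/uninitialized"
PROVIDER_ID_PREFIX = "hcloud://"  # assumption: the prefix set by the Hetzner CCM


def uninitialized_nodes(node_list: Dict[str, Any]) -> Dict[str, List[str]]:
    """Map node name -> list of problems; an empty result means the gate passes.

    `node_list` is the parsed JSON from `kubectl get nodes -o json`.
    """
    problems: Dict[str, List[str]] = {}
    for node in node_list.get("items", []):
        name = node["metadata"]["name"]
        spec = node.get("spec", {})
        issues: List[str] = []
        if not spec.get("providerID", "").startswith(PROVIDER_ID_PREFIX):
            issues.append("missing or unexpected providerID")
        taints = spec.get("taints") or []
        if any(t.get("key") == UNINITIALIZED_TAINT for t in taints):
            issues.append("still carries the uninitialized taint")
        if issues:
            problems[name] = issues
    return problems
```

A CI step could pipe the `kubectl` output into `json.load(sys.stdin)`, call `uninitialized_nodes`, and fail the job when the returned dictionary is non-empty.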

## Success Criteria

Success requires two consecutive HA rebuilds passing all phase gates with no manual fixes, no manual `kubectl` patching, and no manual Tailscale proxy recreation.
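
The criterion above can be expressed as a small predicate over rebuild records. This is a sketch under stated assumptions: the `Rebuild` record shape is hypothetical, and "two consecutive" is read here as "the two most recent rebuilds", which is one possible interpretation.

```python
from dataclasses import dataclass, field
from typing import List

TOTAL_GATES = 15  # number of phase gates listed in this document


@dataclass
class Rebuild:
    """Hypothetical record of one HA rebuild."""
    gates_passed: List[int]  # numbers of the phase gates that passed
    manual_interventions: List[str] = field(default_factory=list)

    def clean(self) -> bool:
        """All gates passed and nothing was fixed by hand."""
        all_gates = set(range(1, TOTAL_GATES + 1))
        return all_gates <= set(self.gates_passed) and not self.manual_interventions


def baseline_is_stable(history: List[Rebuild]) -> bool:
    """True when the two most recent rebuilds are both clean."""
    return len(history) >= 2 and all(r.clean() for r in history[-2:])
```

Manual `kubectl` patching or Tailscale proxy recreation would be recorded as entries in `manual_interventions`, so a single hand-applied fix resets the count of consecutive clean rebuilds.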

## Validated Drills

- 2026-04-18: live Rancher backup/restore drill succeeded on the current cluster.
- A fresh one-time backup was created and restored onto the same cluster. Post-restore validation confirmed:
  - all nodes remained `Ready`
  - Flux infrastructure stayed healthy
  - Rancher backup/restore resources reported `Completed`
  - Rancher, Grafana, and Prometheus remained reachable through the Tailscale smoke checks