Stable Private-Only Baseline

This document defines the current engineering target for this repository.

Topology

  • 3 control planes (HA etcd cluster)
  • 5 workers
  • kube-vip API VIP (10.27.27.40)
  • private Proxmox/LAN network (10.27.27.0/24)
  • Tailscale operator access and service exposure
  • Rancher exposed through Tailscale (rancher.silverside-gopher.ts.net)
  • Grafana exposed through Tailscale (grafana.silverside-gopher.ts.net)
  • Prometheus exposed through Tailscale (prometheus.silverside-gopher.ts.net:9090)
  • apps Kustomization suspended by default
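
The API VIP above is typically advertised by kube-vip running as a static pod on each control plane. A minimal sketch, assuming ARP (layer-2) mode on the 10.27.27.0/24 LAN; the interface name, image tag, and manifest path are assumptions, not values from this repo:

```yaml
# Sketch: kube-vip static pod advertising the API VIP (ARP mode assumed).
# For k3s, static pod manifests live under the agent's pod-manifests directory.
apiVersion: v1
kind: Pod
metadata:
  name: kube-vip
  namespace: kube-system
spec:
  hostNetwork: true
  containers:
    - name: kube-vip
      image: ghcr.io/kube-vip/kube-vip:v0.8.0   # tag is an assumption
      args: ["manager"]
      env:
        - name: address
          value: "10.27.27.40"       # API VIP from the topology above
        - name: port
          value: "6443"
        - name: vip_interface
          value: "eth0"              # assumed LAN interface name
        - name: vip_arp
          value: "true"              # layer-2 ARP mode on 10.27.27.0/24
        - name: cp_enable
          value: "true"              # control-plane load balancing
        - name: vip_leaderelection
          value: "true"              # one leader holds the VIP at a time
      securityContext:
        capabilities:
          add: ["NET_ADMIN", "NET_RAW"]
```

With leader election enabled, exactly one control plane answers ARP for 10.27.27.40 at a time, and the VIP fails over if that node goes down.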

In Scope

  • Terraform infrastructure bootstrap
  • Ansible k3s bootstrap on Ubuntu cloud-init VMs
  • HA control plane (3 nodes with etcd quorum)
  • kube-vip for Kubernetes API HA
  • NFS-backed persistent volumes via nfs-subdir-external-provisioner
  • Flux core reconciliation
  • External Secrets Operator with Doppler
  • Tailscale private access and smoke-check validation
  • cert-manager
  • Rancher and rancher-backup
  • Rancher backup/restore validation
  • Observability stack (Grafana, Prometheus, Loki, Promtail)
  • Persistent volume provisioning validation
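
The k3s bootstrap items above hinge on two details: the first control plane initializes the embedded etcd cluster, and the VIP must be a SAN on the API server certificate. A minimal sketch of the first control plane's k3s config, assuming the default config path; everything here except the VIP is an assumption, not taken from this repo:

```yaml
# Sketch: /etc/rancher/k3s/config.yaml on the first control plane.
cluster-init: true          # start a new embedded-etcd cluster
tls-san:
  - "10.27.27.40"           # kube-vip API VIP must be a SAN on the API cert
```

Secondary control planes and workers would instead set `server: https://10.27.27.40:6443` (plus the cluster token) so they join through the kube-vip endpoint rather than a single node's IP.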

Deferred for Later Phases

  • app workloads in apps/

Out of Scope

  • public ingress or DNS
  • public TLS
  • app workloads
  • cross-region / multi-cluster disaster recovery strategy
  • upgrade strategy

Phase Gates

  1. Terraform apply completes for the HA topology (3 control planes, 5 workers, 1 VIP).
  2. Primary control plane bootstraps with --cluster-init.
  3. kube-vip advertises 10.27.27.40:6443 from the control-plane set.
  4. Secondary control planes join via the kube-vip endpoint.
  5. Workers join successfully via the kube-vip endpoint.
  6. etcd reports 3 healthy members.
  7. Flux source and infrastructure reconciliation are healthy.
  8. NFS provisioner deploys and creates flash-nfs StorageClass.
  9. PVC provisioning tested and working.
  10. External Secrets Operator syncs the required secrets from Doppler.
  11. Tailscale private access works for Rancher, Grafana, and Prometheus.
  12. CI smoke checks pass for Tailscale DNS resolution, tailscale ping, and HTTP reachability.
  13. A fresh Rancher backup can be created and restored successfully.
  14. Terraform destroy succeeds cleanly, either on the first run or via a workflow retry.
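
Gates 8–9 can be exercised with a throwaway claim against the flash-nfs StorageClass; a minimal sketch, where the claim name, namespace, size, and access mode are assumptions:

```yaml
# Sketch: throwaway PVC to validate dynamic NFS provisioning (gate 9).
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nfs-smoke-test            # arbitrary name for the drill
  namespace: default
spec:
  storageClassName: flash-nfs     # StorageClass created in gate 8
  accessModes: ["ReadWriteMany"]  # NFS supports RWX; assumed mode
  resources:
    requests:
      storage: 1Gi
```

Apply it, confirm the claim reaches Bound (which proves the provisioner created a backing PV on the NFS export), then delete it.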

Success Criteria

Success requires two consecutive HA rebuilds passing all phase gates with no manual fixes, no manual kubectl patching, and no manual Tailscale proxy recreation.

Validated Drills

  • 2026-04-18: live Rancher backup/restore drill succeeded on the current cluster.
  • A fresh one-time backup was created, restored back onto the same cluster, and post-restore validation confirmed:
    • all nodes remained Ready
    • Flux infrastructure stayed healthy
    • Rancher backup/restore resources reported Completed
    • Rancher, Grafana, and Prometheus remained reachable through the Tailscale smoke checks
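
The drill above corresponds to the rancher-backup operator's Backup and Restore resources. A minimal sketch, assuming the operator's default resource set and an operator-level storage location (the B2 flow mentioned earlier); the names are placeholders:

```yaml
# Sketch: one-time backup with the rancher-backup operator.
apiVersion: resources.cattle.io/v1
kind: Backup
metadata:
  name: drill-backup                       # arbitrary name
spec:
  resourceSetName: rancher-resource-set    # default set shipped with the operator
---
# Sketch: restore the file the Backup produced.
apiVersion: resources.cattle.io/v1
kind: Restore
metadata:
  name: drill-restore
spec:
  backupFilename: "<backup-file-from-status>"  # placeholder; read from the Backup's status
```

Omitting `spec.schedule` on the Backup makes it a one-time backup; both resources report Completed in their status when the drill succeeds, matching the validation above.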