feat: migrate cluster baseline from Hetzner to Proxmox

Replace Hetzner infrastructure and cloud-provider assumptions with Proxmox
VM clones, kube-vip API HA, and NFS-backed storage. Update bootstrap,
Flux addons, CI workflows, and docs to target the new private Proxmox
baseline while preserving the existing Tailscale, Doppler, Flux, Rancher,
and B2 backup flows.
commit b1dae28aa5 (parent 6c6b9d20ca)
2026-04-22 03:02:13 +00:00
40 changed files with 577 additions and 784 deletions
@@ -9,7 +9,9 @@ Repository guide for OpenCode sessions in this repo.
## Current Baseline
-- HA private cluster: 3 control planes, 3 workers.
+- HA private cluster: 3 control planes, 5 workers on Proxmox.
+- Proxmox clones come from template `9000` on node `flex`; API VIP is `10.27.27.40` via kube-vip.
+- Storage is `nfs-subdir-external-provisioner` backed by `10.27.27.22:/TheFlash/k8s-nfs` with StorageClass `flash-nfs`.
- Tailscale is the private access path for Rancher and shared services.
- Rancher, Grafana, and Prometheus are exposed through Tailscale; Flux UI / Weave GitOps is removed.
- `apps/` is suspended by default.
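A quick post-rebuild sanity check against this baseline (a sketch; the expected node counts come from the bullets above, and the `grep` pattern is an assumption about the kube-vip pod names):

```bash
kubectl get nodes -o wide                        # expect 3 control planes and 5 workers
kubectl get storageclass flash-nfs               # the NFS-backed class should exist
kubectl -n kube-system get pods | grep kube-vip  # VIP holders on the control planes
curl -k https://10.27.27.40:6443/version         # API server answers on the kube-vip VIP
```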
@@ -20,8 +22,8 @@ Repository guide for OpenCode sessions in this repo.
- Terraform: `terraform -chdir=terraform fmt -recursive`, `terraform -chdir=terraform validate`, `terraform -chdir=terraform plan -var-file=../terraform.tfvars`, `terraform -chdir=terraform apply -var-file=../terraform.tfvars`
- Ansible: `ansible-galaxy collection install -r ansible/requirements.yml`, `cd ansible && python3 generate_inventory.py`, `ansible-playbook -i ansible/inventory.ini ansible/site.yml --syntax-check`, `ansible-playbook ansible/site.yml`
- Flux/Kustomize: `kubectl kustomize infrastructure/addons/<addon>`, `kubectl kustomize clusters/prod/flux-system`
-- Kubeconfig refresh: `scripts/refresh-kubeconfig.sh <cp1-public-ip>`
-- Tailnet smoke check: `ssh root@<cp1-ip> 'bash -s' < scripts/smoke-check-tailnet-services.sh`
+- Kubeconfig refresh: `scripts/refresh-kubeconfig.sh <cp1-ip>`
+- Tailnet smoke check: `ssh ubuntu@<cp1-ip> 'bash -s' < scripts/smoke-check-tailnet-services.sh`
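For example, after a rebuild (the address is a placeholder, not the real cp1):

```bash
scripts/refresh-kubeconfig.sh 10.27.27.41                                  # placeholder cp1 IP
ssh ubuntu@10.27.27.41 'bash -s' < scripts/smoke-check-tailnet-services.sh
```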
## Workflow Rules
@@ -31,12 +33,14 @@ Repository guide for OpenCode sessions in this repo.
- CI deploy order is Terraform -> Ansible -> Flux bootstrap -> Rancher restore -> health checks.
- One object per Kubernetes YAML file; keep filenames kebab-case.
- If `kubectl` points at `localhost:8080` after a rebuild, refresh kubeconfig from the primary control-plane IP.
+- Bootstrap assumptions that matter: SSH user is `ubuntu`, NIC is `ens18`, API join endpoint is the kube-vip address.
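A sketch of the recovery path, assuming a kubeadm-based bootstrap (the address is a placeholder):

```bash
# kubectl falling back to localhost:8080 means no kubeconfig is loaded.
scripts/refresh-kubeconfig.sh 10.27.27.41      # placeholder cp1 IP
kubectl get nodes
# The join endpoint should be the kube-vip address:
kubectl -n kube-system get cm kubeadm-config -o yaml | grep controlPlaneEndpoint
```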
## Repo-Specific Gotchas
- `rancher-backup` uses a postRenderer to swap the broken hook image to `rancher/kubectl:v1.34.0`. Do not put S3 config in HelmRelease values; it belongs in the Backup CR (see the sketch after this list).
- Tailscale cleanup runs only before the service proxies exist: it removes stale offline `rancher`/`grafana`/`prometheus`/`flux` devices and must not run again afterward, or it would delete live proxies.
- Keep the Tailscale operator on the stable Helm repo `https://pkgs.tailscale.com/helmcharts` at `1.96.5` unless you have a reason to change it.
+- The repo no longer uses a cloud controller manager. If you see `providerID` or Hetzner-specific logic, it is stale.
- Current private URLs:
- Rancher: `https://rancher.silverside-gopher.ts.net/`
- Grafana: `http://grafana.silverside-gopher.ts.net/`
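Where the S3 settings live, per the first gotcha above: on the Backup CR itself. A minimal sketch, assuming the stock rancher-backup CRD; the bucket, endpoint, region, and secret names are placeholders, not this repo's real values:

```bash
kubectl apply -f - <<'EOF'
apiVersion: resources.cattle.io/v1
kind: Backup
metadata:
  name: rancher-nightly
spec:
  resourceSetName: rancher-resource-set        # default set shipped with the chart
  schedule: "0 3 * * *"                        # nightly at 03:00
  retentionCount: 7
  storageLocation:
    s3:
      bucketName: example-rancher-backups      # placeholder
      folder: rancher
      region: us-west-002                      # placeholder B2 region
      endpoint: s3.us-west-002.backblazeb2.com # placeholder B2 endpoint
      credentialSecretName: b2-creds           # placeholder secret
      credentialSecretNamespace: cattle-resources-system
EOF
```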