feat: migrate cluster baseline from Hetzner to Proxmox

Replace Hetzner infrastructure and cloud-provider assumptions with Proxmox
VM clones, kube-vip API HA, and NFS-backed storage. Update bootstrap,
Flux addons, CI workflows, and docs to target the new private Proxmox
baseline while preserving the existing Tailscale, Doppler, Flux, Rancher,
and B2 backup flows.
commit b1dae28aa5 (parent 6c6b9d20ca)
2026-04-22 03:02:13 +00:00
40 changed files with 577 additions and 784 deletions
@@ -9,7 +9,9 @@ Repository guide for OpenCode sessions in this repo.
## Current Baseline
-- HA private cluster: 3 control planes, 3 workers.
+- HA private cluster: 3 control planes, 5 workers on Proxmox.
+- Proxmox clones come from template `9000` on node `flex`; API VIP is `10.27.27.40` via kube-vip.
+- Storage is `nfs-subdir-external-provisioner` backed by `10.27.27.22:/TheFlash/k8s-nfs` with StorageClass `flash-nfs`.
- Tailscale is the private access path for Rancher and shared services.
- Rancher, Grafana, and Prometheus are exposed through Tailscale; Flux UI / Weave GitOps is removed.
- `apps/` is suspended by default.
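A quick post-rebuild sanity check against this baseline (a sketch; the expected node counts come from the bullets above, and the `grep` pattern is an assumption about the kube-vip pod names):

```bash
kubectl get nodes -o wide                        # expect 3 control planes and 5 workers
kubectl get storageclass flash-nfs               # the NFS-backed class should exist
kubectl -n kube-system get pods | grep kube-vip  # VIP holders on the control planes
curl -k https://10.27.27.40:6443/version         # API server answers on the kube-vip VIP
```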
@@ -20,8 +22,8 @@ Repository guide for OpenCode sessions in this repo.
- Terraform: `terraform -chdir=terraform fmt -recursive`, `terraform -chdir=terraform validate`, `terraform -chdir=terraform plan -var-file=../terraform.tfvars`, `terraform -chdir=terraform apply -var-file=../terraform.tfvars`
- Ansible: `ansible-galaxy collection install -r ansible/requirements.yml`, `cd ansible && python3 generate_inventory.py`, `ansible-playbook -i ansible/inventory.ini ansible/site.yml --syntax-check`, `ansible-playbook ansible/site.yml`
- Flux/Kustomize: `kubectl kustomize infrastructure/addons/<addon>`, `kubectl kustomize clusters/prod/flux-system`
-- Kubeconfig refresh: `scripts/refresh-kubeconfig.sh <cp1-public-ip>`
-- Tailnet smoke check: `ssh root@<cp1-ip> 'bash -s' < scripts/smoke-check-tailnet-services.sh`
+- Kubeconfig refresh: `scripts/refresh-kubeconfig.sh <cp1-ip>`
+- Tailnet smoke check: `ssh ubuntu@<cp1-ip> 'bash -s' < scripts/smoke-check-tailnet-services.sh`
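For example, after a rebuild (the address is a placeholder, not the real cp1):

```bash
scripts/refresh-kubeconfig.sh 10.27.27.41                                  # placeholder cp1 IP
ssh ubuntu@10.27.27.41 'bash -s' < scripts/smoke-check-tailnet-services.sh
```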
## Workflow Rules
@@ -31,12 +33,14 @@ Repository guide for OpenCode sessions in this repo.
- CI deploy order is Terraform -> Ansible -> Flux bootstrap -> Rancher restore -> health checks.
- One object per Kubernetes YAML file; keep filenames kebab-case.
- If `kubectl` points at `localhost:8080` after a rebuild, refresh kubeconfig from the primary control-plane IP.
+- Bootstrap assumptions that matter: SSH user is `ubuntu`, NIC is `ens18`, API join endpoint is the kube-vip address.
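A sketch of the recovery path, assuming a kubeadm-based bootstrap (the address is a placeholder):

```bash
# kubectl falling back to localhost:8080 means no kubeconfig is loaded.
scripts/refresh-kubeconfig.sh 10.27.27.41      # placeholder cp1 IP
kubectl get nodes
# The join endpoint should be the kube-vip address:
kubectl -n kube-system get cm kubeadm-config -o yaml | grep controlPlaneEndpoint
```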
## Repo-Specific Gotchas
- `rancher-backup` uses a postRenderer to swap the broken hook image to `rancher/kubectl:v1.34.0`. Do not put S3 config in HelmRelease values; it belongs in the Backup CR (see the sketch after this list).
- Tailscale cleanup runs only before the service proxies exist: it removes stale offline `rancher`/`grafana`/`prometheus`/`flux` devices and must not run again afterward, or it would delete live proxies.
- Keep the Tailscale operator on the stable Helm repo `https://pkgs.tailscale.com/helmcharts` at `1.96.5` unless you have a reason to change it.
+- The repo no longer uses a cloud controller manager. If you see `providerID` or Hetzner-specific logic, it is stale.
- Current private URLs:
- Rancher: `https://rancher.silverside-gopher.ts.net/`
- Grafana: `http://grafana.silverside-gopher.ts.net/`
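Where the S3 settings live, per the first gotcha above: on the Backup CR itself. A minimal sketch, assuming the stock rancher-backup CRD; the bucket, endpoint, region, and secret names are placeholders, not this repo's real values:

```bash
kubectl apply -f - <<'EOF'
apiVersion: resources.cattle.io/v1
kind: Backup
metadata:
  name: rancher-nightly
spec:
  resourceSetName: rancher-resource-set        # default set shipped with the chart
  schedule: "0 3 * * *"                        # nightly at 03:00
  retentionCount: 7
  storageLocation:
    s3:
      bucketName: example-rancher-backups      # placeholder
      folder: rancher
      region: us-west-002                      # placeholder B2 region
      endpoint: s3.us-west-002.backblazeb2.com # placeholder B2 endpoint
      credentialSecretName: b2-creds           # placeholder secret
      credentialSecretNamespace: cattle-resources-system
EOF
```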