
AGENTS.md

Repository guide for OpenCode sessions.

Read First

  • Trust manifests and workflows over prose when they conflict.
  • Highest-value sources:
    • Terraform: terraform/main.tf, terraform/variables.tf
    • Ansible: ansible/site.yml
    • Flux: clusters/prod/flux-system/, infrastructure/addons/kustomization.yaml
    • CI: .gitea/workflows/deploy.yml, .gitea/workflows/destroy.yml
    • Docs: README.md, STABLE_BASELINE.md
    • Scripts: scripts/refresh-kubeconfig.sh, scripts/smoke-check-tailnet-services.sh

Current Baseline

  • HA private cluster: 3 control planes, 5 workers on Proxmox.
  • Proxmox clones come from template 9000 on node flex; API VIP is 10.27.27.40 via kube-vip.
  • Storage is nfs-subdir-external-provisioner backed by 10.27.27.22:/TheFlash/k8s-nfs with StorageClass flash-nfs.
  • Tailscale is the private access path for Rancher and shared services.
  • Rancher, Grafana, and Prometheus are exposed through Tailscale; Flux UI / Weave GitOps is removed.
  • apps/ is suspended by default.
  • Rancher stores state in embedded etcd; backup/restore uses rancher-backup to B2.
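
The baseline above can be spot-checked from a machine with a working kubeconfig. A minimal sketch; the helper names are illustrative and not part of this repo:

```shell
# Illustrative helpers for spot-checking the current baseline.
# Assumes a working kubeconfig and tailnet/LAN reachability to the VIP.

check_api_vip() {
  # The kube-vip API VIP (10.27.27.40) should answer on the API port.
  curl -ks --max-time 5 https://10.27.27.40:6443/version >/dev/null \
    && echo "API VIP reachable" || echo "API VIP unreachable"
}

check_storage() {
  # flash-nfs should exist and point at the NFS subdir provisioner.
  kubectl get storageclass flash-nfs -o wide
}

check_nodes() {
  # Expect 8 Ready nodes total: 3 control planes + 5 workers.
  kubectl get nodes --no-headers | awk '$2 == "Ready"' | wc -l
}
```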

Common Commands

  • Terraform:
    • terraform -chdir=terraform fmt -recursive
    • terraform -chdir=terraform validate
    • terraform -chdir=terraform plan -var-file=../terraform.tfvars
    • terraform -chdir=terraform apply -var-file=../terraform.tfvars
  • Ansible:
    • ansible-galaxy collection install -r ansible/requirements.yml
    • cd ansible && python3 generate_inventory.py
    • ansible-playbook -i ansible/inventory.ini ansible/site.yml --syntax-check
    • ansible-playbook ansible/site.yml
  • Flux/Kustomize: kubectl kustomize infrastructure/addons/<addon>, kubectl kustomize clusters/prod/flux-system
  • Kubeconfig refresh: scripts/refresh-kubeconfig.sh <cp1-ip>
  • Tailnet smoke check: ssh ubuntu@<cp1-ip> 'bash -s' < scripts/smoke-check-tailnet-services.sh
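
The validation commands above can be grouped into per-directory helpers, matching the rule of validating only what you edited. A sketch, assuming the repo root as CWD and the tools on PATH; the function names and the addon directory glob are assumptions, not repo conventions:

```shell
# Sketch: per-directory validation helpers for this repo.
# Assumes terraform, ansible, and kubectl are installed.
set -euo pipefail

validate_terraform() {
  terraform -chdir=terraform fmt -recursive
  terraform -chdir=terraform validate
}

validate_ansible() {
  ansible-galaxy collection install -r ansible/requirements.yml
  (cd ansible && python3 generate_inventory.py)
  ansible-playbook -i ansible/inventory.ini ansible/site.yml --syntax-check
}

validate_flux() {
  # Render each addon plus the Flux entry point without applying anything.
  # Assumes each addon is a directory under infrastructure/addons/.
  for dir in infrastructure/addons/*/ clusters/prod/flux-system; do
    kubectl kustomize "$dir" >/dev/null
  done
}

# Call only the validator for the directory you edited.
```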

Workflow Rules

  • Keep diffs small and validate only the directory you edited.
  • Update manifests and docs together when behavior changes.
  • Use set -euo pipefail in workflow shell blocks.
  • CI deploy order is Terraform -> Ansible -> Flux bootstrap -> Rancher restore -> health checks.
  • One object per Kubernetes YAML file; keep filenames kebab-case.
  • If kubectl points at localhost:8080 after a rebuild, refresh kubeconfig from the primary control-plane IP.
  • Bootstrap assumptions that matter: SSH user is ubuntu, NIC is ens18, API join endpoint is the kube-vip address.
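
The CI deploy order above can be sketched as an ordered stage list. Stage names here are illustrative shorthand, not the actual job names in .gitea/workflows/deploy.yml:

```shell
# Sketch of the deploy pipeline ordering; each run_stage call stands in
# for the real job (terraform apply, ansible-playbook, flux bootstrap, ...).
set -euo pipefail

stages=(terraform ansible flux-bootstrap rancher-restore health-checks)

run_stage() {
  # Placeholder: a real workflow job would invoke its tool here.
  echo "running: $1"
}

for stage in "${stages[@]}"; do
  run_stage "$stage"
done
```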

Repo-Specific Gotchas

  • rancher-backup uses a postRenderer to swap the broken hook image to rancher/kubectl:v1.34.0; do not put S3 config in HelmRelease values. Put it in the Backup CR.
  • Tailscale cleanup runs only before the service proxies exist: it removes stale offline rancher/grafana/prometheus/flux devices, and it must stop before live proxies come up so they are not deleted.
  • Keep the Tailscale operator on the stable Helm repo https://pkgs.tailscale.com/helmcharts at 1.96.5 unless you have a reason to change it.
  • The repo no longer uses a cloud controller manager. If you see providerID or Hetzner-specific logic, it is stale.
  • Current private URLs:
    • Rancher: https://rancher.silverside-gopher.ts.net/
    • Grafana: http://grafana.silverside-gopher.ts.net/
    • Prometheus: http://prometheus.silverside-gopher.ts.net:9090/
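
To illustrate the rancher-backup rule above: S3/B2 settings belong in the Backup CR, not in HelmRelease values. A hedged sketch only; bucket, endpoint, and secret names below are placeholders, and the real values live in this repo's manifests and secret stores:

```shell
# Sketch: apply a rancher-backup Backup CR that carries the B2/S3 config.
# All names are placeholders; check the real manifests before copying.
create_rancher_backup() {
  kubectl apply -f - <<'EOF'
apiVersion: resources.cattle.io/v1
kind: Backup
metadata:
  name: rancher-b2-backup
spec:
  resourceSetName: rancher-resource-set            # operator default name
  storageLocation:
    s3:
      bucketName: example-rancher-backups          # placeholder
      endpoint: s3.us-west-000.backblazeb2.com     # placeholder B2 endpoint
      credentialSecretName: b2-creds               # placeholder
      credentialSecretNamespace: cattle-resources-system
EOF
}
```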

Secrets

  • Runtime secrets live in Doppler + External Secrets.
  • Bootstrap and CI secrets stay in Gitea; never commit secrets, kubeconfigs, or private keys.
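
The Doppler + External Secrets flow can be sketched as an ExternalSecret that syncs one Doppler-managed value into a cluster Secret. Everything here is a placeholder (store name, secret keys, target name); it shows the shape, not this repo's actual objects:

```shell
# Sketch: an ExternalSecret pulling a Doppler-managed value via a
# ClusterSecretStore. Names and keys are placeholders.
apply_example_external_secret() {
  kubectl apply -f - <<'EOF'
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: example-app-secret
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: doppler              # placeholder store name
  target:
    name: example-app-secret   # resulting Kubernetes Secret
  data:
    - secretKey: API_TOKEN
      remoteRef:
        key: API_TOKEN         # placeholder Doppler secret name
EOF
}
```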