AGENTS.md

Compact repo guidance for OpenCode sessions. Trust executable sources over docs when they conflict.

Read First

  • Highest-value sources: .gitea/workflows/deploy.yml, .gitea/workflows/destroy.yml, terraform/main.tf, terraform/variables.tf, terraform/servers.tf, ansible/site.yml, ansible/inventory.tmpl, clusters/prod/flux-system/, infrastructure/addons/kustomization.yaml.
  • STABLE_BASELINE.md still contains stale Rancher backup/restore references; current workflows and addon manifests do not deploy or restore rancher-backup.

Baseline

  • Proxmox HA K3s cluster: 3 control planes and 5 workers (VMIDs 200-202 and 210-214; node counts are meant to flex), cloned from template VMID 9000 on datastore Flash.
  • API HA is kube-vip at 10.27.27.40; control planes are 10.27.27.30-32, workers are 10.27.27.41-45.
  • SSH user is ubuntu; Ansible derives the flannel iface from ansible_default_ipv4.interface with eth0 fallback, so do not hard-code ens18.
  • Storage is raw-manifest nfs-subdir-external-provisioner using 10.27.27.239:/TheFlash/k8s-nfs and default StorageClass flash-nfs.
  • Tailscale is the private access path. Rancher, Grafana, and Prometheus are exposed only through Tailscale services.
  • apps is intentionally suspended in clusters/prod/flux-system/kustomization-apps.yaml.
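
The flannel interface derivation above can be sketched as an Ansible variable default (the variable name `flannel_iface` is illustrative; check the role for the real name):

```yaml
# Illustrative Ansible vars fragment, not the actual role default.
# Derive the flannel interface from the default route instead of
# hard-coding ens18; fall back to eth0 when no default route is found.
flannel_iface: "{{ ansible_default_ipv4.interface | default('eth0') }}"
```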

Commands

  • Terraform: terraform -chdir=terraform fmt -recursive, terraform -chdir=terraform validate, terraform -chdir=terraform plan -var-file=../terraform.tfvars, terraform -chdir=terraform apply -var-file=../terraform.tfvars.
  • Ansible setup: ansible-galaxy collection install -r ansible/requirements.yml, then from ansible/ run python3 generate_inventory.py and ansible-playbook site.yml --syntax-check.
  • Flux/Kustomize checks: kubectl kustomize infrastructure/addons/<addon>, kubectl kustomize infrastructure/addons, kubectl kustomize clusters/prod/flux-system.
  • Kubeconfig refresh: scripts/refresh-kubeconfig.sh <cp1-ip>; use this if local kubectl falls back to localhost:8080 after rebuilds.
  • Tailnet smoke check from cp1: ssh ubuntu@<cp1-ip> 'bash -s' < scripts/smoke-check-tailnet-services.sh.
  • Fast Grafana content iteration uses .gitea/workflows/dashboards.yml and ansible/dashboards.yml, not a full cluster rebuild.

Deploy Flow

  • Pushes to main run Gitea CI: Terraform fmt/init/validate/plan/apply, Proxmox cleanup/retry, Ansible bootstrap, Flux bootstrap, addon gates, Rancher gate, observability image seeding, health checks, tailnet smoke checks.
  • Deploy and destroy workflows share concurrency.group: prod-cluster; destroy only requires workflow input confirm: destroy and has no backup gate.
  • Keep set -euo pipefail in workflow shell blocks.
  • Terraform retry cleanup has hard-coded target VMIDs/names in .gitea/workflows/deploy.yml; update it when changing node counts, names, or VMIDs.
  • Fresh VMs have unreliable registry/chart egress, so critical images are prepared by skopeo on the runner and imported with k3s ctr; update the workflow archive lists when adding bootstrap-time images.
  • CI applies clusters/prod/flux-system/gotk-components.yaml directly and then patches Flux controller deployments inline; changes only in gotk-controller-cp1-patches.yaml do not affect CI bootstrap.
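
The concurrency, pipefail, and image-seeding conventions above can be sketched in one workflow fragment (step names and the image are illustrative; the real lists live in .gitea/workflows/deploy.yml):

```yaml
# Illustrative fragment, not the actual .gitea/workflows/deploy.yml.
name: deploy
concurrency:
  group: prod-cluster            # shared with destroy.yml so runs serialize
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - name: Seed bootstrap images      # fresh VMs have flaky registry egress
        run: |
          set -euo pipefail
          # hypothetical image; keep in sync with the workflow archive lists
          skopeo copy docker://rancher/rancher:v2.13.3 oci-archive:rancher.tar
          scp rancher.tar ubuntu@cp1:/tmp/
          ssh ubuntu@cp1 'sudo k3s ctr images import /tmp/rancher.tar'
```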

GitOps Addons

  • Vendored charts are intentional: infrastructure/charts/{cert-manager,traefik,kube-prometheus-stack,tailscale-operator,rancher}. Do not restore remote HelmRepository objects unless cluster-side chart fetch reliability is intentionally changed.
  • External Secrets and Loki/Promtail use Flux OCIRepository; Rancher, Tailscale, cert-manager, Traefik, and kube-prometheus-stack use GitRepository chart paths.
  • Use fully qualified helmchart.source.toolkit.fluxcd.io/... in scripts; K3s also has helmcharts.helm.cattle.io, so helmchart/... can target the wrong resource.
  • doppler-bootstrap only creates the external-secrets namespace and Doppler token secret. The deploy workflow creates ClusterSecretStore/doppler-hetznerterra after ESO CRDs and webhook endpoints exist.
  • The checked-in infrastructure/addons/external-secrets/clustersecretstore-doppler-hetznerterra.yaml is not included by that addon kustomization; do not assume Flux applies it.
  • Keep Kubernetes manifests one object per file with kebab-case filenames.
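
A vendored-chart HelmRelease looks roughly like this (names, namespaces, and the chart path are illustrative; check the actual addon manifests):

```yaml
# Illustrative HelmRelease for a vendored chart; real names/paths may differ.
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: cert-manager
  namespace: cert-manager
spec:
  interval: 10m
  chart:
    spec:
      chart: ./infrastructure/charts/cert-manager  # path inside the Git repo
      sourceRef:
        kind: GitRepository
        name: flux-system
        namespace: flux-system
```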

Gotchas

  • Rancher chart 2.13.3 requires Kubernetes <1.35.0-0; K3s latest can break Rancher. Role defaults pin v1.34.6+k3s1; do not reintroduce a generated-inventory k3s_version=latest override.
  • The repo no longer uses a cloud controller manager; any providerID, Hetzner CCM/CSI, or Hetzner firewall/load-balancer logic you find is stale.
  • Tailscale cleanup must only remove stale offline reserved hostnames before live service proxies exist; do not delete active rancher, grafana, prometheus, or flux devices.
  • Proxmox endpoint should be the base URL, for example https://100.105.0.115:8006/; provider/workflow code strips /api2/json when needed.
  • Current private URLs: Rancher https://rancher.silverside-gopher.ts.net/, Grafana http://grafana.silverside-gopher.ts.net/, Prometheus http://prometheus.silverside-gopher.ts.net:9090/.
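
The version pin above lives in role defaults; a sketch (file location assumed, variable name matches the generated-inventory override mentioned above):

```yaml
# Illustrative Ansible role defaults fragment.
# Rancher 2.13.3 requires Kubernetes <1.35.0-0, so do not use "latest".
k3s_version: v1.34.6+k3s1
```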

Secrets

  • Runtime secrets are Doppler + External Secrets; Terraform/bootstrap/CI secrets stay in Gitea Actions secrets.
  • Never commit secrets, kubeconfigs, private keys, terraform.tfvars, or generated outputs/ artifacts.
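
A .gitignore sketch covering those artifacts (patterns are illustrative; verify against the repo's actual ignore rules):

```gitignore
# Illustrative entries; confirm against the repo's real .gitignore.
terraform.tfvars
*.kubeconfig
*.pem
id_rsa*
outputs/
```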