AGENTS.md

Repository guide for agentic contributors.

Scope

  • Infrastructure repo for a Hetzner + k3s + Flux stack running Rancher.
  • Primary areas: terraform/, ansible/, clusters/, infrastructure/, apps/, .gitea/workflows/.
  • Treat README.md and STABLE_BASELINE.md as user-facing context, but prefer current manifests and workflows as source of truth.
  • Keep changes small and reviewable; prefer the narrowest file set that solves the task.

Architecture

  • Terraform provisions Hetzner servers, network, firewall, load balancer, SSH keys.
  • Ansible bootstraps OS, installs k3s (with external cloud provider), deploys Hetzner CCM, Tailscale, Doppler token.
  • Flux reconciles all cluster addons from this repo after Ansible hands off.
  • Rancher stores state in embedded etcd (NOT an external DB). Backup/restore uses the rancher-backup operator to B2.
  • cert-manager is required — Tailscale LoadBalancer does L4 TCP passthrough, so Rancher serves its own TLS.
  • Secrets flow: Doppler → ClusterSecretStore (doppler-hetznerterra) → ExternalSecret resources → k8s Secrets.
  • Rancher is reachable only over Tailscale at https://rancher.silverside-gopher.ts.net/.
  • Grafana, Prometheus, and the Flux UI are also exposed via dedicated Tailscale LoadBalancer services at http://grafana.silverside-gopher.ts.net/, http://prometheus.silverside-gopher.ts.net:9090/, and http://flux.silverside-gopher.ts.net:9001/.
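
The secrets flow above can be sketched as an ExternalSecret that references the Doppler store. This is an illustrative fragment, not a manifest from this repo: only the ClusterSecretStore name (doppler-hetznerterra) comes from the architecture notes; the resource name, namespace, and key names are hypothetical placeholders.

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: example-credentials        # hypothetical name
  namespace: example               # hypothetical namespace
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: doppler-hetznerterra     # the store actually used in this repo
  target:
    name: example-credentials      # resulting k8s Secret
  data:
    - secretKey: password          # key inside the k8s Secret
      remoteRef:
        key: EXAMPLE_PASSWORD      # Doppler secret name (hypothetical)
```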

Important Files

  • terraform/main.tf — provider and version pins
  • terraform/variables.tf — input surface and defaults
  • terraform/firewall.tf — firewall rules (tailnet CIDR, internal cluster ports)
  • ansible/site.yml — ordered bootstrap playbook (roles: common → k3s-server → ccm → k3s-agent → doppler → tailscale-cleanup)
  • ansible/generate_inventory.py — renders ansible/inventory.ini from Terraform outputs via Jinja2
  • clusters/prod/flux-system/ — Flux GitRepository and top-level Kustomization resources
  • infrastructure/addons/kustomization.yaml — root addon graph with dependency ordering
  • infrastructure/addons/<addon>/ — each addon is a self-contained dir with its own kustomization.yaml
  • .gitea/workflows/deploy.yml — canonical CI: terraform → ansible → flux bootstrap → B2 restore → health checks
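
An addon directory's kustomization.yaml typically just lists the objects in that directory. A minimal sketch, with all filenames hypothetical (check the repo's existing addons for the real naming patterns):

```yaml
# infrastructure/addons/<addon>/kustomization.yaml (illustrative)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - namespace.yaml
  - helmrelease-example.yaml
  - example-externalsecret.yaml
```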

Build / Validate / Test

Terraform

  • Format: terraform -chdir=terraform fmt -recursive
  • Check formatting: terraform -chdir=terraform fmt -check -recursive
  • Validate: terraform -chdir=terraform validate
  • Plan (full): terraform -chdir=terraform plan -var-file=../terraform.tfvars
  • Plan one resource: terraform -chdir=terraform plan -var-file=../terraform.tfvars -target=hcloud_server.control_plane[0]
  • Apply: terraform -chdir=terraform apply -var-file=../terraform.tfvars
  • State inspection: terraform -chdir=terraform state list or terraform -chdir=terraform state show <address>

Ansible

  • Install collections: ansible-galaxy collection install -r ansible/requirements.yml
  • Generate inventory: cd ansible && python3 generate_inventory.py (requires Terraform outputs)
  • Syntax check: ansible-playbook -i ansible/inventory.ini ansible/site.yml --syntax-check
  • Dry-run one host: ansible-playbook -i ansible/inventory.ini ansible/site.yml --check --diff -l 'control_plane[0]' (quoted so the brackets are not shell-globbed)
  • Full bootstrap: ansible-playbook ansible/site.yml
  • Targeted: ansible-playbook ansible/site.yml -t upgrade (or -t reset)
  • Dashboards only: ansible-playbook ansible/dashboards.yml

Python

  • Syntax check: python3 -m py_compile ansible/generate_inventory.py
  • Run: cd ansible && python3 generate_inventory.py

Kubernetes / Flux manifests

  • Render single addon: kubectl kustomize infrastructure/addons/<addon>
  • Render cluster bootstrap: kubectl kustomize clusters/prod/flux-system
  • Validate only the directory you edited, not the whole repo.

Kubeconfig refresh

  • Preferred: scripts/refresh-kubeconfig.sh <cp1-public-ip>
  • Manual: ssh -i ~/.ssh/infra root@<cp1-ip> "cat /etc/rancher/k3s/k3s.yaml" | sed 's/127.0.0.1/<cp1-ip>/g' > outputs/kubeconfig

Code Style

General

  • Match existing style in adjacent files. No new tools/frameworks unless the repo already uses them.
  • Prefer ASCII. Keep diffs minimal. No unrelated cleanup.
  • No comments unless the logic is non-obvious.

Terraform / HCL

  • 2-space indent. terraform {} block first, then providers, locals, variables, resources, outputs.
  • snake_case for variables, locals, resources. Descriptions on all variables/outputs.
  • sensitive = true on secrets. Run terraform fmt instead of hand-formatting.
  • Use locals for reused or non-trivial logic. Explicit depends_on only when required.

Ansible / YAML

  • 2-space YAML indent. Descriptive task names in sentence case.
  • Idempotent tasks: changed_when: false and failed_when: false for probes.
  • command/shell only when no dedicated module fits. shell only for pipes/redirection/heredocs.
  • when guards and default(...) filters over duplicated tasks.
  • Role names and filenames: kebab-case. Variables: snake_case.
  • Multi-line shell in workflows: set -e or set -euo pipefail for fail-fast.
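
The probe/guard pattern above might look like the following sketch. Both task names and the check command are hypothetical, not taken from this repo's roles:

```yaml
- name: Check whether k3s is already installed   # hypothetical probe
  ansible.builtin.command: k3s --version
  register: k3s_check
  changed_when: false   # read-only probe never reports "changed"
  failed_when: false    # absence is handled by the guard below, not a failure

- name: Install k3s
  # shell is justified here because the installer uses a pipe
  ansible.builtin.shell: curl -sfL https://get.k3s.io | sh -
  when: k3s_check.rc != 0
```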

Kubernetes / Flux YAML

  • One object per file. Kebab-case filenames matching repo patterns: helmrelease-*.yaml, kustomization-*.yaml, *-externalsecret.yaml.
  • Addon manifests live in infrastructure/addons/<addon>/ with a kustomization.yaml.
  • Flux graph objects in clusters/prod/flux-system/.
  • Each addon gets a kustomization-<addon>.yaml entry in infrastructure/addons/ with dependsOn for ordering.
  • Quote strings with :, *, cron expressions, or shell-sensitive chars.
  • Preserve existing labels/annotations unless the change specifically needs them.
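
A kustomization-<addon>.yaml graph entry with dependsOn ordering might look like this sketch; the addon name, path, and dependency are hypothetical, and only the flux-system GitRepository source follows the repo layout described above:

```yaml
# infrastructure/addons/kustomization-example.yaml (illustrative)
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: example-addon
  namespace: flux-system
spec:
  interval: 10m
  path: ./infrastructure/addons/example-addon
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system
  dependsOn:
    - name: cert-manager   # hypothetical; order addons explicitly here
```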

Python

  • PEP 8. Imports ordered: stdlib, third-party, local. snake_case for functions/variables.
  • Scripts small and explicit. Exit non-zero on failure. Clear subprocess error handling.

Known Issues & Workarounds

  • The rancher-backup post-install job (rancher-backup-patch-sa) uses a postRenderer in the HelmRelease to replace the broken rancher/kuberlr-kubectl image with rancher/kubectl. Do NOT set an s3 block in HelmRelease values; put S3 config in the Backup CR instead.
  • B2 ExternalSecret must use key names accessKey and secretKey (not aws_access_key_id/aws_secret_access_key).
  • Stale Tailscale devices: After cluster rebuild, delete stale offline rancher devices before booting. The tailscale-cleanup Ansible role handles this via the Tailscale API.
  • Restricted B2 keys: b2_authorize_account may return allowed.bucketId: null. CI falls back to b2_list_buckets to resolve bucket ID by name.

Secrets / Security

  • Never commit tokens, passwords, kubeconfigs, private keys, or generated secrets.
  • Runtime secrets via Gitea secrets (CI), Doppler, or External Secrets Operator.
  • terraform.tfvars and outputs/ are gitignored. Never print secret values in logs or commits.

CI Pipeline (.gitea/workflows/deploy.yml)

  1. Terraform: fmt check → init → validate → import existing servers → plan → apply (main only)
  2. Ansible: install deps → generate inventory → run site.yml with extra vars (secrets injected from Gitea)
  3. Flux bootstrap: install kubectl/flux → rewrite kubeconfig → apply CRDs → apply graph → wait for addons
  4. Rancher wait: wait for Rancher and backup operator to be ready
  5. B2 restore: authorize B2 → find latest backup → create Restore CR → poll until ready
  6. Health checks: nodes, Flux objects, pods, storage class

Editing Practices

  • Read target file and adjacent patterns before editing.
  • Run the narrowest validation command after edits.
  • If you make a live-cluster workaround, also update the declarative manifests so Flux can own it.
  • Changes spanning Terraform + Ansible + Flux: update and verify each layer separately.
  • Check git status before and after changes.

Cursor / Copilot Rules

  • No .cursor/rules/, .cursorrules, or .github/copilot-instructions.md files exist.
  • If added later, mirror their guidance here and treat them as authoritative.