AGENTS.md

Guide for agentic contributors working in this repository.

Scope

  • Infrastructure repo for a Hetzner + k3s + Flux stack running Rancher.
  • Primary areas: terraform/, ansible/, clusters/, infrastructure/, apps/, .gitea/workflows/.
  • Treat README.md and STABLE_BASELINE.md as user-facing context, but prefer the current manifests and workflows as the source of truth.
  • Keep changes small and reviewable; prefer the narrowest file set that solves the task.

Architecture

  • Terraform provisions Hetzner servers, network, firewall, load balancer, SSH keys.
  • Ansible bootstraps OS, installs k3s (with external cloud provider), deploys Hetzner CCM, Tailscale, Doppler token.
  • Flux reconciles all cluster addons from this repo after Ansible hands off.
  • Rancher stores state in embedded etcd (NOT an external DB). Backup/restore uses the rancher-backup operator to B2.
  • cert-manager is required: the Tailscale LoadBalancer does L4 TCP passthrough, so Rancher must serve its own TLS.
  • Secrets flow: Doppler → ClusterSecretStore (doppler-hetznerterra) → ExternalSecret resources → k8s Secrets.
  • Rancher is reachable only over Tailscale at https://rancher.silverside-gopher.ts.net/.
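The secrets flow above can be sketched as an ExternalSecret wired to the repo's ClusterSecretStore. This is a minimal sketch, not a manifest from this repo — the resource name, namespace, and Doppler key are hypothetical; only the store name `doppler-hetznerterra` comes from this document:

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: example-credentials          # hypothetical name
  namespace: example                 # hypothetical namespace
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: doppler-hetznerterra       # the store this repo defines
  target:
    name: example-credentials        # resulting Kubernetes Secret
  data:
    - secretKey: API_TOKEN           # key in the rendered Secret
      remoteRef:
        key: API_TOKEN               # hypothetical Doppler secret name
```

The External Secrets Operator reconciles this into a plain `Secret` that workloads can mount or reference as env vars.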

Important Files

  • terraform/main.tf — provider and version pins
  • terraform/variables.tf — input surface and defaults
  • terraform/firewall.tf — firewall rules (tailnet CIDR, internal cluster ports)
  • ansible/site.yml — ordered bootstrap playbook (roles: common → k3s-server → ccm → k3s-agent → private-access → doppler → tailscale-cleanup)
  • ansible/generate_inventory.py — renders ansible/inventory.ini from Terraform outputs via Jinja2
  • clusters/prod/flux-system/ — Flux GitRepository and top-level Kustomization resources
  • infrastructure/addons/kustomization.yaml — root addon graph with dependency ordering
  • infrastructure/addons/<addon>/ — each addon is a self-contained dir with its own kustomization.yaml
  • .gitea/workflows/deploy.yml — canonical CI: terraform → ansible → flux bootstrap → rancher fix → B2 restore

Build / Validate / Test

Terraform

  • Format: terraform -chdir=terraform fmt -recursive
  • Check formatting: terraform -chdir=terraform fmt -check -recursive
  • Validate: terraform -chdir=terraform validate
  • Plan (full): terraform -chdir=terraform plan -var-file=../terraform.tfvars
  • Plan one resource: terraform -chdir=terraform plan -var-file=../terraform.tfvars -target='hcloud_server.control_plane[0]' (quote the address so the shell does not glob the brackets)
  • Apply: terraform -chdir=terraform apply -var-file=../terraform.tfvars
  • State inspection: terraform -chdir=terraform state list / terraform -chdir=terraform state show <address>

Ansible

  • Install collections: ansible-galaxy collection install -r ansible/requirements.yml
  • Generate inventory: cd ansible && python3 generate_inventory.py (requires Terraform outputs)
  • Syntax check: ansible-playbook -i ansible/inventory.ini ansible/site.yml --syntax-check
  • Dry-run one host: ansible-playbook -i ansible/inventory.ini ansible/site.yml --check --diff -l 'control_plane[0]'
  • Full bootstrap: ansible-playbook ansible/site.yml
  • Targeted: ansible-playbook ansible/site.yml -t upgrade or -t reset
  • Dashboards only: ansible-playbook ansible/dashboards.yml

Python

  • Syntax check: python3 -m py_compile ansible/generate_inventory.py
  • Run: cd ansible && python3 generate_inventory.py
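For orientation, the inventory-generation step can be sketched in stdlib-only Python. This is a simplified illustration, not the real script: generate_inventory.py renders via Jinja2 from `terraform output -json`, and the output key names used here are hypothetical:

```python
import json


def render_inventory(tf_outputs: dict) -> str:
    """Render a minimal inventory.ini from parsed Terraform outputs.

    Sketch only: the real ansible/generate_inventory.py uses Jinja2
    templates, and the output keys below are hypothetical.
    """
    lines = ["[control_plane]"]
    for ip in tf_outputs["control_plane_ips"]["value"]:
        lines.append(ip)
    lines.append("")
    lines.append("[agents]")
    for ip in tf_outputs["agent_ips"]["value"]:
        lines.append(ip)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # In the real flow, this dict would come from:
    #   json.loads(subprocess.check_output(["terraform", "output", "-json"]))
    sample = {
        "control_plane_ips": {"value": ["10.0.1.1"]},
        "agent_ips": {"value": ["10.0.1.2", "10.0.1.3"]},
    }
    print(render_inventory(sample))
```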

Kubernetes / Flux manifests

  • Render single addon: kubectl kustomize infrastructure/addons/<addon>
  • Render cluster bootstrap: kubectl kustomize clusters/prod/flux-system
  • Validate only the directory you edited, not the whole repo.

Kubeconfig refresh

  • Preferred: scripts/refresh-kubeconfig.sh <cp1-public-ip>
  • Manual: ssh -i ~/.ssh/infra root@<cp1-ip> "cat /etc/rancher/k3s/k3s.yaml" | sed 's/127.0.0.1/<cp1-ip>/g' > outputs/kubeconfig

Code Style

General

  • Match existing style in adjacent files. No new tools/frameworks unless the repo already uses them.
  • Prefer ASCII. Keep diffs minimal. No unrelated cleanup.
  • No comments unless the logic is non-obvious.

Terraform / HCL

  • 2-space indent. terraform {} block first, then providers, locals, variables, resources, outputs.
  • snake_case for variables, locals, resources. Descriptions on all variables/outputs.
  • sensitive = true on secrets. Run terraform fmt instead of hand-formatting.
  • Use locals for reused or non-trivial logic. Explicit depends_on only when required.
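The HCL conventions above, in one sketch (all names hypothetical, not taken from this repo's variables.tf):

```hcl
variable "hcloud_token" {
  description = "Hetzner Cloud API token used by the hcloud provider."
  type        = string
  sensitive   = true
}

variable "control_plane_count" {
  description = "Number of control-plane servers to provision."
  type        = number
  default     = 1
}

locals {
  # Reused or non-trivial logic belongs in locals.
  control_plane_names = [for i in range(var.control_plane_count) : "cp-${i + 1}"]
}
```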

Ansible / YAML

  • 2-space YAML indent. Descriptive task names in sentence case.
  • Idempotent tasks: changed_when: false and failed_when: false for probes.
  • Use command/shell only when no dedicated module fits; prefer shell over command only for pipes, redirection, or heredocs.
  • Prefer when guards and default(...) filters over duplicating tasks.
  • Role names and filenames: kebab-case. Variables: snake_case.
  • Multi-line shell in workflows: set -e or set -euo pipefail for fail-fast.
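The probe pattern above looks like this in practice — a hedged sketch with hypothetical task names and paths, not tasks from this repo's roles:

```yaml
- name: Check whether the k3s binary is already installed
  ansible.builtin.stat:
    path: /usr/local/bin/k3s
  register: k3s_binary

- name: Probe k3s node readiness
  ansible.builtin.command: k3s kubectl get nodes
  register: k3s_nodes
  changed_when: false    # a read-only probe never reports "changed"
  failed_when: false     # downstream tasks decide what a failure means
```

Later tasks can then gate on `k3s_binary.stat.exists` or `k3s_nodes.rc` with `when:` guards instead of duplicating work.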

Kubernetes / Flux YAML

  • One object per file. Kebab-case filenames matching repo patterns: helmrelease-*.yaml, kustomization-*.yaml, *-externalsecret.yaml.
  • Addon manifests live in infrastructure/addons/<addon>/ with a kustomization.yaml.
  • Flux graph objects in clusters/prod/flux-system/.
  • Each addon gets a kustomization-<addon>.yaml entry in infrastructure/addons/ with dependsOn for ordering.
  • Quote strings containing :, *, cron expressions, or other shell- or YAML-sensitive characters.
  • Preserve existing labels/annotations unless the change specifically needs them.
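A sketch of a kustomization-<addon>.yaml entry with dependsOn ordering — the addon name and interval are hypothetical; only the flux-system source and the infrastructure/addons/ layout come from this document:

```yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: example-addon              # hypothetical addon
  namespace: flux-system
spec:
  interval: 10m
  path: ./infrastructure/addons/example-addon
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system
  dependsOn:
    - name: cert-manager           # ordering via the root addon graph
```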

Python

  • PEP 8. Imports ordered: stdlib, third-party, local. snake_case for functions/variables.
  • Scripts small and explicit. Exit non-zero on failure. Clear subprocess error handling.
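The "exit non-zero with clear subprocess error handling" convention can be sketched as a small helper (the helper name is hypothetical, not from this repo):

```python
import subprocess
import sys


def run_checked(cmd: list[str]) -> str:
    """Run a command; on failure, print its stderr and exit non-zero."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        print(f"error: {' '.join(cmd)} failed:\n{result.stderr}", file=sys.stderr)
        sys.exit(result.returncode)
    return result.stdout


if __name__ == "__main__":
    # Self-contained demo: invoke the current interpreter as a subprocess.
    print(run_checked([sys.executable, "-c", "print('ok')"]))
```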

Known Issues & Workarounds

  • rancher-backup post-install job (rancher-backup-patch-sa) fails because rancher/kuberlr-kubectl can't download kubectl. CI patches the SA and deletes the failed job. Do NOT set s3 block in HelmRelease values — put S3 config in the Backup CR instead.
  • B2 ExternalSecret must use key names accessKey and secretKey (not aws_access_key_id/aws_secret_access_key).
  • Stale Tailscale devices: After cluster rebuild, delete stale offline rancher devices before booting. The tailscale-cleanup Ansible role handles this via the Tailscale API.
  • Restricted B2 keys: b2_authorize_account may return allowed.bucketId: null. CI falls back to b2_list_buckets to resolve bucket ID by name.

Secrets / Security

  • Never commit tokens, passwords, kubeconfigs, private keys, or generated secrets.
  • Runtime secrets via Gitea secrets (CI), Doppler, or External Secrets Operator.
  • terraform.tfvars and outputs/ are gitignored. Never print secret values in logs or commits.

CI Pipeline (.gitea/workflows/deploy.yml)

  1. Terraform: fmt check → init → validate → import existing servers → plan → apply (main only)
  2. Ansible: install deps → generate inventory → run site.yml with extra vars (secrets injected from Gitea)
  3. Flux bootstrap: install kubectl/flux → rewrite kubeconfig → apply CRDs → apply graph → wait for addons
  4. Rancher post-install: wait for Rancher/backup operator → patch SA → clean failed jobs → force reconcile
  5. B2 restore: authorize B2 → find latest backup → create Restore CR → poll until ready
  6. Health checks: nodes, Flux objects, pods, storage class

Editing Practices

  • Read target file and adjacent patterns before editing.
  • Run the narrowest validation command after edits.
  • If you apply a live-cluster workaround, also update the declarative manifests so Flux can own it.
  • Changes spanning Terraform + Ansible + Flux: update and verify each layer separately.
  • Check git status before and after changes.

Cursor / Copilot Rules

  • No .cursor/rules/, .cursorrules, or .github/copilot-instructions.md files exist.
  • If added later, mirror their guidance here and treat them as authoritative.