# AGENTS.md

Repository guide for agentic contributors working in this repo.

## Scope

- Infrastructure repo for a Hetzner + k3s + Flux stack running Rancher.
- Primary areas: `terraform/`, `ansible/`, `clusters/`, `infrastructure/`, `apps/`, `.gitea/workflows/`.
- Treat `README.md` and `STABLE_BASELINE.md` as user-facing context, but prefer current manifests and workflows as the source of truth.
- Keep changes small and reviewable; prefer the narrowest file set that solves the task.

## Architecture

- **Terraform** provisions Hetzner servers, network, firewall, load balancer, and SSH keys.
- **Ansible** bootstraps the OS, installs k3s (with the external cloud provider), and deploys the Hetzner CCM, Tailscale, and the Doppler token.
- **Flux** reconciles all cluster addons from this repo after Ansible hands off.
- **Rancher** stores state in embedded etcd (NOT an external DB). Backup/restore uses the `rancher-backup` operator to B2.
- **cert-manager** is required — the Tailscale LoadBalancer does L4 TCP passthrough, so Rancher serves its own TLS.
- **Secrets flow**: Doppler → `ClusterSecretStore` (doppler-hetznerterra) → `ExternalSecret` resources → k8s Secrets.
- Rancher is reachable only over Tailscale at `https://rancher.silverside-gopher.ts.net/`.
- Grafana, Prometheus, and the Flux UI are also exposed via dedicated Tailscale LoadBalancer services at `http://grafana.silverside-gopher.ts.net/`, `http://prometheus.silverside-gopher.ts.net/`, and `http://flux.silverside-gopher.ts.net:9001/`.
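The secrets flow above can be sketched as a minimal `ExternalSecret`. The `ClusterSecretStore` name matches this repo; every other name (resource, namespace, target secret, key names) is an illustrative placeholder, not taken from the repo:

```yaml
# Sketch of the Doppler → k8s Secrets flow; names other than the
# store are hypothetical placeholders.
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: example-credentials        # hypothetical name
  namespace: example-ns            # hypothetical namespace
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: doppler-hetznerterra     # store defined in this repo
  target:
    name: example-credentials      # resulting k8s Secret
  data:
    - secretKey: accessKey         # key inside the k8s Secret
      remoteRef:
        key: EXAMPLE_ACCESS_KEY    # Doppler secret name (placeholder)
```

ESO reconciles this into a plain k8s `Secret`, so workloads reference `example-credentials` like any other secret.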
## Important Files

- `terraform/main.tf` — provider and version pins
- `terraform/variables.tf` — input surface and defaults
- `terraform/firewall.tf` — firewall rules (tailnet CIDR, internal cluster ports)
- `ansible/site.yml` — ordered bootstrap playbook (roles: common → k3s-server → ccm → k3s-agent → doppler → tailscale-cleanup)
- `ansible/generate_inventory.py` — renders `ansible/inventory.ini` from Terraform outputs via Jinja2
- `clusters/prod/flux-system/` — Flux GitRepository and top-level Kustomization resources
- `infrastructure/addons/kustomization.yaml` — root addon graph with dependency ordering
- `infrastructure/addons/<addon>/` — each addon is a self-contained dir with its own `kustomization.yaml`
- `.gitea/workflows/deploy.yml` — canonical CI: terraform → ansible → flux bootstrap → B2 restore → health checks

## Build / Validate / Test

### Terraform

- Format: `terraform -chdir=terraform fmt -recursive`
- Check formatting: `terraform -chdir=terraform fmt -check -recursive`
- Validate: `terraform -chdir=terraform validate`
- Plan (full): `terraform -chdir=terraform plan -var-file=../terraform.tfvars`
- Plan one resource: `terraform -chdir=terraform plan -var-file=../terraform.tfvars -target=hcloud_server.control_plane[0]`
- Apply: `terraform -chdir=terraform apply -var-file=../terraform.tfvars`
- State inspection: `terraform -chdir=terraform state list` / `terraform -chdir=terraform state show <resource>`

### Ansible

- Install collections: `ansible-galaxy collection install -r ansible/requirements.yml`
- Generate inventory: `cd ansible && python3 generate_inventory.py` (requires Terraform outputs)
- Syntax check: `ansible-playbook -i ansible/inventory.ini ansible/site.yml --syntax-check`
- Dry-run one host: `ansible-playbook -i ansible/inventory.ini ansible/site.yml --check --diff -l control_plane[0]`
- Full bootstrap: `ansible-playbook ansible/site.yml`
- Targeted: `ansible-playbook ansible/site.yml -t upgrade` or `-t reset`
- Dashboards only: `ansible-playbook ansible/dashboards.yml`

### Python

- Syntax check: `python3 -m py_compile ansible/generate_inventory.py`
- Run: `cd ansible && python3 generate_inventory.py`

### Kubernetes / Flux manifests

- Render a single addon: `kubectl kustomize infrastructure/addons/<addon>/`
- Render cluster bootstrap: `kubectl kustomize clusters/prod/flux-system`
- Validate only the directory you edited, not the whole repo.

### Kubeconfig refresh

- Preferred: `scripts/refresh-kubeconfig.sh <server-ip>`
- Manual: `ssh -i ~/.ssh/infra root@<server-ip> "cat /etc/rancher/k3s/k3s.yaml" | sed 's/127.0.0.1/<server-ip>/g' > outputs/kubeconfig`

## Code Style

### General

- Match existing style in adjacent files. No new tools/frameworks unless the repo already uses them.
- Prefer ASCII. Keep diffs minimal. No unrelated cleanup.
- No comments unless the logic is non-obvious.

### Terraform / HCL

- 2-space indent. `terraform {}` block first, then providers, locals, variables, resources, outputs.
- `snake_case` for variables, locals, and resources. Descriptions on all variables/outputs.
- `sensitive = true` on secrets. Run `terraform fmt` instead of hand-formatting.
- Use `locals` for reused or non-trivial logic. Explicit `depends_on` only when required.

### Ansible / YAML

- 2-space YAML indent. Descriptive task names in sentence case.
- Idempotent tasks: `changed_when: false` and `failed_when: false` for probes.
- `command`/`shell` only when no dedicated module fits.
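A minimal sketch of the probe pattern above; the task names, command, and registered variable are illustrative, not taken from this repo's roles:

```yaml
# Hypothetical probe-and-guard pair following the style rules above.
- name: Check whether k3s is already installed
  ansible.builtin.command: which k3s
  register: k3s_check
  changed_when: false    # a probe never reports "changed"
  failed_when: false     # a missing binary is handled by the guard below

- name: Install k3s
  ansible.builtin.include_tasks: install.yml
  when: k3s_check.rc != 0
```

The probe stays idempotent and the `when` guard replaces a duplicated "install if missing" task.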
- `shell` only for pipes/redirection/heredocs.
- `when` guards and `default(...)` filters over duplicated tasks.
- Role names and filenames: kebab-case. Variables: snake_case.
- Multi-line shell in workflows: `set -e` or `set -euo pipefail` for fail-fast behavior.

### Kubernetes / Flux YAML

- One object per file. Kebab-case filenames matching repo patterns: `helmrelease-*.yaml`, `kustomization-*.yaml`, `*-externalsecret.yaml`.
- Addon manifests live in `infrastructure/addons/<addon>/` with a `kustomization.yaml`.
- Flux graph objects go in `clusters/prod/flux-system/`.
- Each addon gets a `kustomization-<addon>.yaml` entry in `infrastructure/addons/` with `dependsOn` for ordering.
- Quote strings containing `:`, `*`, cron expressions, or shell-sensitive characters.
- Preserve existing labels/annotations unless the change specifically needs them.

### Python

- PEP 8. Imports ordered: stdlib, third-party, local. `snake_case` for functions/variables.
- Keep scripts small and explicit. Exit non-zero on failure. Handle subprocess errors clearly.

## Known Issues & Workarounds

- **rancher-backup post-install job** (`rancher-backup-patch-sa`): the HelmRelease uses a postRenderer to replace the broken `rancher/kuberlr-kubectl` image with `rancher/kubectl`. Do NOT set the `s3` block in HelmRelease values — put S3 config in the Backup CR instead.
- **B2 ExternalSecret** must use the key names `accessKey` and `secretKey` (not `aws_access_key_id`/`aws_secret_access_key`).
- **Stale Tailscale devices**: after a cluster rebuild, delete stale offline `rancher` devices before booting. The `tailscale-cleanup` Ansible role handles this via the Tailscale API.
- **Restricted B2 keys**: `b2_authorize_account` may return `allowed.bucketId: null`. CI falls back to `b2_list_buckets` to resolve the bucket ID by name.

## Secrets / Security

- Never commit tokens, passwords, kubeconfigs, private keys, or generated secrets.
- Runtime secrets come from Gitea secrets (CI), Doppler, or the External Secrets Operator.
- `terraform.tfvars` and `outputs/` are gitignored. Never print secret values in logs or commits.

## CI Pipeline (`.gitea/workflows/deploy.yml`)

1. Terraform: fmt check → init → validate → import existing servers → plan → apply (main only)
2. Ansible: install deps → generate inventory → run site.yml with extra vars (secrets injected from Gitea)
3. Flux bootstrap: install kubectl/flux → rewrite kubeconfig → apply CRDs → apply graph → wait for addons
4. Rancher wait: wait for Rancher and the backup operator to be ready
5. B2 restore: authorize B2 → find latest backup → create Restore CR → poll until ready
6. Health checks: nodes, Flux objects, pods, storage class

## Editing Practices

- Read the target file and adjacent patterns before editing.
- Run the narrowest validation command after edits.
- If you make a live-cluster workaround, also update the declarative manifests so Flux can own it.
- Changes spanning Terraform + Ansible + Flux: update and verify each layer separately.
- Check `git status` before and after changes.

## Cursor / Copilot Rules

- No `.cursor/rules/`, `.cursorrules`, or `.github/copilot-instructions.md` files exist.
- If added later, mirror their guidance here and treat them as authoritative.