2026-03-28 22:13:37 +00:00
# AGENTS.md
Repository guide for agentic contributors working in this repo.
## Scope
- Infrastructure repo for a Hetzner + k3s + Flux stack running Rancher.
- Primary areas: `terraform/`, `ansible/`, `clusters/`, `infrastructure/`, `apps/`, `.gitea/workflows/`.
- Treat `README.md` and `STABLE_BASELINE.md` as user-facing context, but prefer current manifests and workflows as source of truth.
- Keep changes small and reviewable; prefer the narrowest file set that solves the task.
## Architecture
- **Terraform** provisions Hetzner servers, network, firewall, load balancer, SSH keys.
- **Ansible** bootstraps OS, installs k3s (with external cloud provider), deploys Hetzner CCM, Tailscale, Doppler token.
- **Flux** reconciles all cluster addons from this repo after Ansible hands off.
- **Rancher** stores state in embedded etcd (NOT an external DB). Backup/restore uses the `rancher-backup` operator to B2.
- **cert-manager** is required — Tailscale LoadBalancer does L4 TCP passthrough, so Rancher serves its own TLS.
- **Secrets flow**: Doppler → `ClusterSecretStore` (doppler-hetznerterra) → `ExternalSecret` resources → k8s Secrets.
- Rancher is reachable only over Tailscale at `https://rancher.silverside-gopher.ts.net/`.
## Important Files
- `terraform/main.tf` — provider and version pins
- `terraform/variables.tf` — input surface and defaults
- `terraform/firewall.tf` — firewall rules (tailnet CIDR, internal cluster ports)
- `ansible/site.yml` — ordered bootstrap playbook (roles: common → k3s-server → ccm → k3s-agent → private-access → doppler → tailscale-cleanup)
- `ansible/generate_inventory.py` — renders `ansible/inventory.ini` from Terraform outputs via Jinja2
- `clusters/prod/flux-system/` — Flux GitRepository and top-level Kustomization resources
- `infrastructure/addons/kustomization.yaml` — root addon graph with dependency ordering
- `infrastructure/addons/<addon>/` — each addon is a self-contained dir with its own `kustomization.yaml`
- `.gitea/workflows/deploy.yml` — canonical CI: terraform → ansible → flux bootstrap → rancher fix → B2 restore
## Build / Validate / Test
### Terraform
- Format: `terraform -chdir=terraform fmt -recursive`
- Check formatting: `terraform -chdir=terraform fmt -check -recursive`
- Validate: `terraform -chdir=terraform validate`
- Plan (full): `terraform -chdir=terraform plan -var-file=../terraform.tfvars`
- Plan one resource: `terraform -chdir=terraform plan -var-file=../terraform.tfvars -target='hcloud_server.control_plane[0]'` (quote the address so the shell does not glob the brackets)
- Apply: `terraform -chdir=terraform apply -var-file=../terraform.tfvars`
- State inspection: `terraform -chdir=terraform state list` / `terraform -chdir=terraform state show <address>`
### Ansible
- Install collections: `ansible-galaxy collection install -r ansible/requirements.yml`
- Generate inventory: `cd ansible && python3 generate_inventory.py` (requires Terraform outputs)
- Syntax check: `ansible-playbook -i ansible/inventory.ini ansible/site.yml --syntax-check`
- Dry-run one host: `ansible-playbook -i ansible/inventory.ini ansible/site.yml --check --diff -l 'control_plane[0]'` (quote the pattern so the shell does not glob the brackets)
- Full bootstrap: `ansible-playbook ansible/site.yml`
- Targeted: `ansible-playbook ansible/site.yml -t upgrade` or `-t reset`
- Dashboards only: `ansible-playbook ansible/dashboards.yml`
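
The inventory generation step above turns Terraform outputs into `ansible/inventory.ini`. The sketch below illustrates the general shape of that transformation only; it is not the contents of `generate_inventory.py`. It uses plain string formatting in place of Jinja2 so it runs with the standard library alone, and the output names `control_plane_ips` and `worker_ips`, the `workers` group, and the host aliases are all hypothetical.

```python
def render_inventory(tf_outputs: dict) -> str:
    """Render an INI inventory from `terraform output -json`-shaped data."""
    # `terraform output -json` wraps every output as {"value": ..., ...}.
    control_plane = tf_outputs["control_plane_ips"]["value"]  # hypothetical output name
    workers = tf_outputs["worker_ips"]["value"]  # hypothetical output name

    lines = ["[control_plane]"]
    lines += [f"cp{i} ansible_host={ip}" for i, ip in enumerate(control_plane, start=1)]
    lines += ["", "[workers]"]  # hypothetical group name
    lines += [f"worker{i} ansible_host={ip}" for i, ip in enumerate(workers, start=1)]
    return "\n".join(lines) + "\n"
```

Keeping the rendering in a pure function like this makes it easy to check against sample data without a live Terraform state.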
### Python
- Syntax check: `python3 -m py_compile ansible/generate_inventory.py`
- Run: `cd ansible && python3 generate_inventory.py`
### Kubernetes / Flux manifests
- Render single addon: `kubectl kustomize infrastructure/addons/<addon>`
- Render cluster bootstrap: `kubectl kustomize clusters/prod/flux-system`
- Validate only the directory you edited, not the whole repo.
### Kubeconfig refresh
- Preferred: `scripts/refresh-kubeconfig.sh <cp1-public-ip>`
- Manual: `ssh -i ~/.ssh/infra root@<cp1-ip> "cat /etc/rancher/k3s/k3s.yaml" | sed 's/127.0.0.1/<cp1-ip>/g' > outputs/kubeconfig`
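
The manual command above rewrites the loopback address that k3s writes into its kubeconfig. Note that the dots in the `sed` pattern are unescaped and so match any character. As a minimal sketch of the same rewrite with a literal match, assuming the file contains the k3s default `server: https://127.0.0.1:6443` (the function name `rewrite_server` is hypothetical and not part of `scripts/refresh-kubeconfig.sh`):

```python
def rewrite_server(kubeconfig_text: str, host: str) -> str:
    """Point a k3s kubeconfig at `host` instead of the loopback address."""
    # str.replace matches literally, so only the exact loopback URL prefix changes.
    return kubeconfig_text.replace("https://127.0.0.1:", f"https://{host}:")
```

Write the result to `outputs/kubeconfig`, which is gitignored, and never echo its contents into logs.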
## Code Style
### General
- Match existing style in adjacent files. No new tools/frameworks unless the repo already uses them.
- Prefer ASCII. Keep diffs minimal. No unrelated cleanup.
- No comments unless the logic is non-obvious.
### Terraform / HCL
- 2-space indent. `terraform {}` block first, then providers, locals, variables, resources, outputs.
- `snake_case` for variables, locals, resources. Descriptions on all variables/outputs.
- `sensitive = true` on secrets. Run `terraform fmt` instead of hand-formatting.
- Use `locals` for reused or non-trivial logic. Explicit `depends_on` only when required.
### Ansible / YAML
- 2-space YAML indent. Descriptive task names in sentence case.
- Keep tasks idempotent. On read-only probes, set `changed_when: false`, and `failed_when: false` where a failing probe is an expected outcome.
- `command`/`shell` only when no dedicated module fits. `shell` only for pipes/redirection/heredocs.
- `when` guards and `default(...)` filters over duplicated tasks.
- Role names and filenames: kebab-case. Variables: snake_case.
- Multi-line shell in workflows: `set -e` or `set -euo pipefail` for fail-fast.
### Kubernetes / Flux YAML
- One object per file. Kebab-case filenames matching repo patterns: `helmrelease-*.yaml`, `kustomization-*.yaml`, `*-externalsecret.yaml`.
- Addon manifests live in `infrastructure/addons/<addon>/` with a `kustomization.yaml`.
- Flux graph objects in `clusters/prod/flux-system/` .
- Each addon gets a `kustomization-<addon>.yaml` entry in `infrastructure/addons/` with `dependsOn` for ordering.
- Quote strings with `:`, `*`, cron expressions, or shell-sensitive chars.
- Preserve existing labels/annotations unless the change specifically requires modifying them.
### Python
- PEP 8. Imports ordered: stdlib, third-party, local. `snake_case` for functions/variables.
- Scripts small and explicit. Exit non-zero on failure. Clear subprocess error handling.
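
As a minimal sketch of these Python conventions (not code taken from this repo), the helper below surfaces subprocess failures with a clear message and maps them to a non-zero return code. The names `run` and `main` and the placeholder command are illustrative assumptions.

```python
import subprocess
import sys


def run(cmd: list[str]) -> str:
    """Run a command and return its stdout, raising on a non-zero exit."""
    result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    if result.returncode != 0:
        raise RuntimeError(
            f"{cmd[0]} exited {result.returncode}: {result.stderr.strip()}"
        )
    return result.stdout


def main() -> int:
    """Return 0 on success and 1 on any command failure."""
    try:
        # Placeholder command; a real script would call its own tool here.
        output = run([sys.executable, "-c", "print('ok')"])
    except (OSError, RuntimeError) as exc:
        print(f"error: {exc}", file=sys.stderr)
        return 1
    print(output, end="")
    return 0
```

A script would typically end with `sys.exit(main())` under the usual `if __name__ == "__main__":` guard so that failures propagate to CI as a non-zero exit status.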
## Known Issues & Workarounds
- **rancher-backup post-install job** (`rancher-backup-patch-sa`) fails because `rancher/kuberlr-kubectl` can't download kubectl. CI patches the SA and deletes the failed job. Do NOT set the `s3` block in HelmRelease values; put S3 config in the Backup CR instead.
- **B2 ExternalSecret** must use key names `accessKey` and `secretKey` (not `aws_access_key_id`/`aws_secret_access_key`).
- **Stale Tailscale devices**: After cluster rebuild, delete stale offline `rancher` devices before booting. The `tailscale-cleanup` Ansible role handles this via the Tailscale API.
- **Restricted B2 keys**: `b2_authorize_account` may return `allowed.bucketId: null`. CI falls back to `b2_list_buckets` to resolve bucket ID by name.
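
The restricted-key fallback above can be sketched as follows. This is an illustration of the logic, not the CI implementation: the function name `resolve_bucket_id` is hypothetical, and the `list_buckets` callable is an assumed stand-in for a `b2_list_buckets` call that returns the response's list of bucket objects.

```python
from typing import Callable, Optional


def resolve_bucket_id(
    auth_response: dict,
    bucket_name: str,
    list_buckets: Callable[[str], list],
) -> Optional[str]:
    """Return the bucket ID from the auth response, else look it up by name."""
    allowed = auth_response.get("allowed") or {}
    bucket_id = allowed.get("bucketId")
    if bucket_id:
        return bucket_id
    # Fallback: query b2_list_buckets (injected as a callable) and match by name.
    for bucket in list_buckets(bucket_name):
        if bucket.get("bucketName") == bucket_name:
            return bucket.get("bucketId")
    return None
```

Injecting the lookup as a callable keeps the decision logic testable without network access or real B2 credentials.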
## Secrets / Security
- Never commit tokens, passwords, kubeconfigs, private keys, or generated secrets.
- Runtime secrets via Gitea secrets (CI), Doppler, or External Secrets Operator.
- `terraform.tfvars` and `outputs/` are gitignored. Never print secret values in logs or commits.
## CI Pipeline (`.gitea/workflows/deploy.yml`)
1. Terraform: fmt check → init → validate → import existing servers → plan → apply (main only)
2. Ansible: install deps → generate inventory → run site.yml with extra vars (secrets injected from Gitea)
3. Flux bootstrap: install kubectl/flux → rewrite kubeconfig → apply CRDs → apply graph → wait for addons
4. Rancher post-install: wait for Rancher/backup operator → patch SA → clean failed jobs → force reconcile
5. B2 restore: authorize B2 → find latest backup → create Restore CR → poll until ready
6. Health checks: nodes, Flux objects, pods, storage class
## Editing Practices
- Read target file and adjacent patterns before editing.
- Run the narrowest validation command after edits.
- If you make a live-cluster workaround, also update the declarative manifests so Flux can own it.
- Changes spanning Terraform + Ansible + Flux: update and verify each layer separately.
- Check `git status` before and after changes.
## Cursor / Copilot Rules
- No `.cursor/rules/`, `.cursorrules`, or `.github/copilot-instructions.md` files exist.
- If added later, mirror their guidance here and treat them as authoritative.