# AGENTS.md

Repository guide for agentic contributors working in this repo.
## Scope

- Infrastructure repo for a Hetzner + k3s + Flux stack running Rancher.
- Primary areas: `terraform/`, `ansible/`, `clusters/`, `infrastructure/`, `apps/`, `.gitea/workflows/`.
- Treat `README.md` and `STABLE_BASELINE.md` as user-facing context, but prefer current manifests and workflows as the source of truth.
- Keep changes small and reviewable; prefer the narrowest file set that solves the task.
## Architecture

- Terraform provisions Hetzner servers, network, firewall, load balancer, and SSH keys.
- Ansible bootstraps the OS, installs k3s (with the external cloud provider), and deploys Hetzner CCM, Tailscale, and the Doppler token.
- Flux reconciles all cluster addons from this repo after Ansible hands off.
- Rancher stores state in embedded etcd (NOT an external DB). Backup/restore uses the `rancher-backup` operator to B2.
- cert-manager is required: the Tailscale LoadBalancer does L4 TCP passthrough, so Rancher serves its own TLS.
- Secrets flow: Doppler → `ClusterSecretStore` (`doppler-hetznerterra`) → `ExternalSecret` resources → k8s Secrets.
- Rancher is reachable only over Tailscale at `https://rancher.silverside-gopher.ts.net/`.
- Grafana, Prometheus, and the Flux UI are also exposed via dedicated Tailscale LoadBalancer services at `http://grafana.silverside-gopher.ts.net/`, `http://prometheus.silverside-gopher.ts.net:9090/`, and `http://flux.silverside-gopher.ts.net:9001/`.
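The secrets flow above can be sketched as an `ExternalSecret` manifest. This is a hypothetical example, not copied from the repo: only the store name `doppler-hetznerterra` comes from this guide; every `example-*` name and the Doppler key are illustrative placeholders.

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: example-credentials        # illustrative name, not from this repo
  namespace: example-namespace     # illustrative namespace
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: doppler-hetznerterra     # the store named in this guide
  target:
    name: example-credentials      # the resulting k8s Secret
  data:
    - secretKey: token             # key inside the k8s Secret
      remoteRef:
        key: EXAMPLE_TOKEN         # Doppler secret name (illustrative)
```

The External Secrets Operator reconciles this into a plain k8s Secret, which is why manifests in this repo never embed secret material directly.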
## Important Files

- `terraform/main.tf` — provider and version pins
- `terraform/variables.tf` — input surface and defaults
- `terraform/firewall.tf` — firewall rules (tailnet CIDR, internal cluster ports)
- `ansible/site.yml` — ordered bootstrap playbook (roles: common → k3s-server → ccm → k3s-agent → doppler → tailscale-cleanup)
- `ansible/generate_inventory.py` — renders `ansible/inventory.ini` from Terraform outputs via Jinja2
- `clusters/prod/flux-system/` — Flux GitRepository and top-level Kustomization resources
- `infrastructure/addons/kustomization.yaml` — root addon graph with dependency ordering
- `infrastructure/addons/<addon>/` — each addon is a self-contained dir with its own `kustomization.yaml`
- `.gitea/workflows/deploy.yml` — canonical CI: terraform → ansible → flux bootstrap → B2 restore → health checks
## Build / Validate / Test

### Terraform

- Format: `terraform -chdir=terraform fmt -recursive`
- Check formatting: `terraform -chdir=terraform fmt -check -recursive`
- Validate: `terraform -chdir=terraform validate`
- Plan (full): `terraform -chdir=terraform plan -var-file=../terraform.tfvars`
- Plan one resource: `terraform -chdir=terraform plan -var-file=../terraform.tfvars -target=hcloud_server.control_plane[0]`
- Apply: `terraform -chdir=terraform apply -var-file=../terraform.tfvars`
- State inspection: `terraform -chdir=terraform state list` / `terraform -chdir=terraform state show <address>`
### Ansible

- Install collections: `ansible-galaxy collection install -r ansible/requirements.yml`
- Generate inventory: `cd ansible && python3 generate_inventory.py` (requires Terraform outputs)
- Syntax check: `ansible-playbook -i ansible/inventory.ini ansible/site.yml --syntax-check`
- Dry-run one host: `ansible-playbook -i ansible/inventory.ini ansible/site.yml --check --diff -l control_plane[0]`
- Full bootstrap: `ansible-playbook ansible/site.yml`
- Targeted: `ansible-playbook ansible/site.yml -t upgrade` or `-t reset`
- Dashboards only: `ansible-playbook ansible/dashboards.yml`
### Python

- Syntax check: `python3 -m py_compile ansible/generate_inventory.py`
- Run: `cd ansible && python3 generate_inventory.py`
### Kubernetes / Flux manifests

- Render a single addon: `kubectl kustomize infrastructure/addons/<addon>`
- Render cluster bootstrap: `kubectl kustomize clusters/prod/flux-system`
- Validate only the directory you edited, not the whole repo.
### Kubeconfig refresh

- Preferred: `scripts/refresh-kubeconfig.sh <cp1-public-ip>`
- Manual: `ssh -i ~/.ssh/infra root@<cp1-ip> "cat /etc/rancher/k3s/k3s.yaml" | sed 's/127.0.0.1/<cp1-ip>/g' > outputs/kubeconfig`
## Code Style

### General

- Match the existing style in adjacent files. No new tools/frameworks unless the repo already uses them.
- Prefer ASCII. Keep diffs minimal. No unrelated cleanup.
- No comments unless the logic is non-obvious.
### Terraform / HCL

- 2-space indent.
- `terraform {}` block first, then providers, locals, variables, resources, outputs.
- `snake_case` for variables, locals, and resources.
- Descriptions on all variables/outputs. `sensitive = true` on secrets.
- Run `terraform fmt` instead of hand-formatting.
- Use `locals` for reused or non-trivial logic. Explicit `depends_on` only when required.
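A minimal sketch of these conventions, assuming hypothetical names (`hcloud_token`, `cluster_name` are illustrative, not taken from this repo's variable surface):

```hcl
# Illustrative only: names are hypothetical, not from terraform/variables.tf.
variable "hcloud_token" {
  description = "Hetzner Cloud API token used by the hcloud provider."
  type        = string
  sensitive   = true # never echoed in plan output
}

variable "cluster_name" {
  description = "Short name used to label and prefix cluster resources."
  type        = string
}

locals {
  # Reused, non-trivial logic belongs in locals rather than inline expressions.
  common_labels = {
    managed_by = "terraform"
    cluster    = var.cluster_name
  }
}
```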
### Ansible / YAML

- 2-space YAML indent. Descriptive task names in sentence case.
- Idempotent tasks: `changed_when: false` and `failed_when: false` for probes.
- `command`/`shell` only when no dedicated module fits. `shell` only for pipes/redirection/heredocs.
- `when` guards and `default(...)` filters over duplicated tasks.
- Role names and filenames: kebab-case. Variables: snake_case.
- Multi-line shell in workflows: `set -e` or `set -euo pipefail` for fail-fast.
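The probe pattern above might look like this sketch; the task names and register variable are illustrative, not taken from the repo's roles:

```yaml
# Hypothetical probe: read-only check, so it never reports "changed" or "failed".
- name: Check whether k3s is already installed
  ansible.builtin.command: k3s --version
  register: k3s_version_probe
  changed_when: false   # a probe never changes state
  failed_when: false    # a missing binary is handled by the next task, not a failure

# shell (not command) because the installer needs a pipe.
- name: Install k3s when missing
  ansible.builtin.shell: curl -sfL https://get.k3s.io | sh -
  when: k3s_version_probe.rc != 0
```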
### Kubernetes / Flux YAML

- One object per file. Kebab-case filenames matching repo patterns: `helmrelease-*.yaml`, `kustomization-*.yaml`, `*-externalsecret.yaml`.
- Addon manifests live in `infrastructure/addons/<addon>/` with a `kustomization.yaml`.
- Flux graph objects in `clusters/prod/flux-system/`.
- Each addon gets a `kustomization-<addon>.yaml` entry in `infrastructure/addons/` with `dependsOn` for ordering.
- Quote strings containing `:`, `*`, cron expressions, or shell-sensitive chars.
- Preserve existing labels/annotations unless the change specifically needs them.
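A `kustomization-<addon>.yaml` entry with `dependsOn` could be sketched as follows; the addon name, path, interval, and the dependency on cert-manager are illustrative, so match them to the actual graph in `infrastructure/addons/`:

```yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: example-addon          # illustrative addon name
  namespace: flux-system
spec:
  interval: 10m
  path: ./infrastructure/addons/example-addon
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system
  dependsOn:
    - name: cert-manager       # reconcile only after cert-manager is ready
```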
### Python

- PEP 8. Imports ordered: stdlib, third-party, local. `snake_case` for functions/variables.
- Keep scripts small and explicit. Exit non-zero on failure. Clear subprocess error handling.
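A small sketch of these conventions (function names are illustrative, not taken from `ansible/generate_inventory.py`):

```python
"""Sketch of the script style above: small, explicit, non-zero exit on failure."""
import subprocess
import sys


def run_checked(cmd: list[str]) -> str:
    """Run a command and return stdout; report clearly and re-raise on failure."""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    except subprocess.CalledProcessError as exc:
        print(f"command failed ({exc.returncode}): {' '.join(cmd)}", file=sys.stderr)
        raise
    return result.stdout


def main() -> int:
    try:
        out = run_checked(["python3", "--version"])  # illustrative probe command
    except subprocess.CalledProcessError:
        return 1  # exit non-zero on failure
    print(out.strip())
    return 0


if __name__ == "__main__":
    sys.exit(main())
```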
## Known Issues & Workarounds

- The rancher-backup post-install job (`rancher-backup-patch-sa`) uses a postRenderer in the HelmRelease to replace the broken `rancher/kuberlr-kubectl` image with `rancher/kubectl`. Do NOT set the `s3` block in HelmRelease values — put S3 config in the Backup CR instead.
- The B2 ExternalSecret must use key names `accessKey` and `secretKey` (not `aws_access_key_id`/`aws_secret_access_key`).
- Stale Tailscale devices: after a cluster rebuild, delete stale offline `rancher` devices before booting. The `tailscale-cleanup` Ansible role handles this via the Tailscale API.
- Restricted B2 keys: `b2_authorize_account` may return `allowed.bucketId: null`. CI falls back to `b2_list_buckets` to resolve the bucket ID by name.
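"S3 config in the Backup CR instead" could be sketched as below. This is a hedged guess at the rancher-backup operator's `Backup` CRD shape, not copied from this repo: the bucket, endpoint, and secret names are illustrative, so verify every field against the installed CRD before relying on it.

```yaml
apiVersion: resources.cattle.io/v1
kind: Backup
metadata:
  name: example-b2-backup                          # illustrative name
spec:
  resourceSetName: rancher-resource-set            # default rancher-backup resource set
  storageLocation:
    s3:
      credentialSecretName: example-b2-credentials # Secret with accessKey/secretKey
      credentialSecretNamespace: cattle-resources-system
      bucketName: example-bucket                   # illustrative bucket
      endpoint: s3.us-west-004.backblazeb2.com     # illustrative B2 endpoint
```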
## Secrets / Security

- Never commit tokens, passwords, kubeconfigs, private keys, or generated secrets.
- Runtime secrets come via Gitea secrets (CI), Doppler, or the External Secrets Operator.
- `terraform.tfvars` and `outputs/` are gitignored. Never print secret values in logs or commits.
## CI Pipeline (`.gitea/workflows/deploy.yml`)

- Terraform: fmt check → init → validate → import existing servers → plan → apply (main only)
- Ansible: install deps → generate inventory → run `site.yml` with extra vars (secrets injected from Gitea)
- Flux bootstrap: install kubectl/flux → rewrite kubeconfig → apply CRDs → apply graph → wait for addons
- Rancher wait: wait for Rancher and the backup operator to be ready
- B2 restore: authorize B2 → find latest backup → create Restore CR → poll until ready
- Health checks: nodes, Flux objects, pods, storage class
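The "poll until ready" steps above follow a generic retry shape. This is a hypothetical helper, not the workflow's actual code; the function name and arguments are illustrative:

```shell
# poll_until ATTEMPTS DELAY CMD...: rerun CMD until it succeeds or attempts run out.
poll_until() {
  local attempts=$1 delay=$2
  shift 2
  for _ in $(seq 1 "$attempts"); do
    if "$@"; then
      return 0            # command succeeded: we're ready
    fi
    sleep "$delay"        # wait before the next probe
  done
  return 1                # never became ready
}

# Usage sketch: in CI the probed command would be a kubectl/flux readiness check.
poll_until 3 1 true && echo "ready"
```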
## Editing Practices

- Read the target file and adjacent patterns before editing.
- Run the narrowest validation command after edits.
- If you make a live-cluster workaround, also update the declarative manifests so Flux can own it.
- For changes spanning Terraform + Ansible + Flux: update and verify each layer separately.
- Check `git status` before and after changes.
## Cursor / Copilot Rules

- No `.cursor/rules/`, `.cursorrules`, or `.github/copilot-instructions.md` files exist.
- If added later, mirror their guidance here and treat them as authoritative.