From 569d741751f97e9cb92cc4a0d100a38875b8a8e6 Mon Sep 17 00:00:00 2001
From: MichaelFisher1997 <contact@michaelfisher.tech>
Date: Tue, 31 Mar 2026 02:46:55 +0000
Subject: [PATCH] push

---
 AGENTS.md | 169 +++++++++++++++++++++++++-----------------------------
 1 file changed, 78 insertions(+), 91 deletions(-)
diff --git a/AGENTS.md b/AGENTS.md
index 42b6f63..69cbcbd 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -4,153 +4,140 @@ Repository guide for agentic contributors working in this repo.
 
 ## Scope
 
-- This is an infrastructure repository for a Hetzner + k3s + Flux stack.
-- Primary areas: `terraform/`, `ansible/`, `clusters/`, `infrastructure/`, `.gitea/workflows/`.
-- Treat `README.md` and `STABLE_BASELINE.md` as user-facing context, but prefer the repo's current manifests and workflows as the source of truth.
+- Infrastructure repo for a Hetzner + k3s + Flux stack running Rancher.
+- Primary areas: `terraform/`, `ansible/`, `clusters/`, `infrastructure/`, `apps/`, `.gitea/workflows/`.
+- Treat `README.md` and `STABLE_BASELINE.md` as user-facing context, but prefer current manifests and workflows as source of truth.
 - Keep changes small and reviewable; prefer the narrowest file set that solves the task.
 
-## Current Tooling
+## Architecture
 
-- Terraform for cloud infra and state-backed provisioning.
-- Ansible for bootstrap, OS prep, k3s install, and pre-Flux prerequisites.
-- Flux/Kustomize for cluster and addon reconciliation.
-- Python for inventory generation (`ansible/generate_inventory.py`).
+- **Terraform** provisions Hetzner servers, network, firewall, load balancer, SSH keys.
+- **Ansible** bootstraps OS, installs k3s (with external cloud provider), deploys Hetzner CCM, Tailscale, Doppler token.
+- **Flux** reconciles all cluster addons from this repo after Ansible hands off.
+- **Rancher** stores state in embedded etcd (NOT an external DB). Backup/restore uses the `rancher-backup` operator to B2.
+- **cert-manager** is required — Tailscale LoadBalancer does L4 TCP passthrough, so Rancher serves its own TLS.
+- **Secrets flow**: Doppler → `ClusterSecretStore` (doppler-hetznerterra) → `ExternalSecret` resources → k8s Secrets.
+- Rancher is reachable only over Tailscale at `https://rancher.silverside-gopher.ts.net/`.
 
 ## Important Files
 
-- `terraform/main.tf` - provider and version pins.
-- `terraform/variables.tf` - input surface and defaults.
-- `terraform/*.tf` - Hetzner network, firewall, servers, SSH, outputs.
-- `ansible/site.yml` - ordered bootstrap playbook.
-- `ansible/generate_inventory.py` - renders `ansible/inventory.ini` from Terraform outputs.
-- `clusters/prod/flux-system/` - Flux source and top-level reconciliation graph.
-- `infrastructure/addons/<addon>/` - Flux-managed addon manifests.
-- `.gitea/workflows/*.yml` - CI/CD entry points and the best reference for expected commands.
+- `terraform/main.tf` — provider and version pins
+- `terraform/variables.tf` — input surface and defaults
+- `terraform/firewall.tf` — firewall rules (tailnet CIDR, internal cluster ports)
+- `ansible/site.yml` — ordered bootstrap playbook (roles: common → k3s-server → ccm → k3s-agent → private-access → doppler → tailscale-cleanup)
+- `ansible/generate_inventory.py` — renders `ansible/inventory.ini` from Terraform outputs via Jinja2
+- `clusters/prod/flux-system/` — Flux GitRepository and top-level Kustomization resources
+- `infrastructure/addons/kustomization.yaml` — root addon graph with dependency ordering
+- `infrastructure/addons/<addon>/` — each addon is a self-contained dir with its own `kustomization.yaml`
+- `.gitea/workflows/deploy.yml` — canonical CI: terraform → ansible → flux bootstrap → rancher fix → B2 restore
 
 ## Build / Validate / Test
 
 ### Terraform
 
-- Format all Terraform: `terraform -chdir=terraform fmt -recursive`
+- Format: `terraform -chdir=terraform fmt -recursive`
 - Check formatting: `terraform -chdir=terraform fmt -check -recursive`
-- Validate config: `terraform -chdir=terraform validate`
-- Full plan: `terraform -chdir=terraform plan -var-file=../terraform.tfvars`
-- Apply: `terraform -chdir=terraform apply -var-file=../terraform.tfvars`
-- Destroy: `terraform -chdir=terraform destroy -var-file=../terraform.tfvars`
-
-### Terraform, single-target / focused checks
-
+- Validate: `terraform -chdir=terraform validate`
+- Plan (full): `terraform -chdir=terraform plan -var-file=../terraform.tfvars`
 - Plan one resource: `terraform -chdir=terraform plan -var-file=../terraform.tfvars -target=hcloud_server.control_plane[0]`
-- Import/check existing state: use `terraform state list` and `terraform state show <address>` before editing imports.
-- If you touch only Terraform formatting, run `terraform fmt -check -recursive` first.
+- Apply: `terraform -chdir=terraform apply -var-file=../terraform.tfvars`
+- State inspection: `terraform -chdir=terraform state list` / `terraform -chdir=terraform state show <address>`
 
 ### Ansible
 
 - Install collections: `ansible-galaxy collection install -r ansible/requirements.yml`
-- Generate inventory: `cd ansible && python3 generate_inventory.py`
+- Generate inventory: `cd ansible && python3 generate_inventory.py` (requires Terraform outputs)
 - Syntax check: `ansible-playbook -i ansible/inventory.ini ansible/site.yml --syntax-check`
 - Dry-run one host: `ansible-playbook -i ansible/inventory.ini ansible/site.yml --check --diff -l control_plane[0]`
-- Run the bootstrap playbook: `ansible-playbook ansible/site.yml`
-- Targeted maintenance: `ansible-playbook ansible/site.yml -t upgrade` or `-t reset`
+- Full bootstrap: `ansible-playbook ansible/site.yml`
+- Targeted: `ansible-playbook ansible/site.yml -t upgrade` or `-t reset`
 - Dashboards only: `ansible-playbook ansible/dashboards.yml`
 
 ### Python
 
-- Syntax check the inventory generator: `python3 -m py_compile ansible/generate_inventory.py`
-- If you modify the script, run it after Terraform outputs exist: `cd ansible && python3 generate_inventory.py`.
+- Syntax check: `python3 -m py_compile ansible/generate_inventory.py`
+- Run: `cd ansible && python3 generate_inventory.py`
 
 ### Kubernetes / Flux manifests
 
-- Render a single addon: `kubectl kustomize infrastructure/addons/<addon>`
-- Render cluster bootstrap objects: `kubectl kustomize clusters/prod/flux-system`
-- Prefer validating the exact directory you edited, not the whole repo, unless the change is cross-cutting.
-- For Flux changes, verify the relevant `Kustomization`/`HelmRelease`/`ExternalSecret` manifests render cleanly before committing.
+- Render single addon: `kubectl kustomize infrastructure/addons/<addon>`
+- Render cluster bootstrap: `kubectl kustomize clusters/prod/flux-system`
+- Validate only the directory you edited, not the whole repo.
 
 ### Kubeconfig refresh
 
-After a full cluster rebuild, the kubeconfig goes stale (new certs, new IPs). Refresh it with:
-- `scripts/refresh-kubeconfig.sh <cp1-public-ip>` (preferred)
-- Or manually: `ssh -i ~/.ssh/infra root@<cp1-ip> "cat /etc/rancher/k3s/k3s.yaml" | sed 's/127.0.0.1/<cp1-ip>/g' > outputs/kubeconfig`
-- The Ansible `site.yml` Finalize step also rewrites the server address to the public IP during bootstrap.
+- Preferred: `scripts/refresh-kubeconfig.sh <cp1-public-ip>`
+- Manual: `ssh -i ~/.ssh/infra root@<cp1-ip> "cat /etc/rancher/k3s/k3s.yaml" | sed 's/127\.0\.0\.1/<cp1-ip>/g' > outputs/kubeconfig`
 
 ## Code Style
 
 ### General
 
-- Match the existing style in adjacent files.
-- Prefer ASCII unless the file already uses Unicode or a Unicode character is necessary.
-- Do not introduce new tools, frameworks, or abstractions unless the repo already uses them.
-- Keep diffs minimal and avoid unrelated cleanup.
+- Match existing style in adjacent files. No new tools/frameworks unless the repo already uses them.
+- Prefer ASCII. Keep diffs minimal. No unrelated cleanup.
+- No comments unless the logic is non-obvious.
 
 ### Terraform / HCL
 
-- Use 2-space indentation.
-- Keep `terraform {}` blocks first, then providers, locals, variables, resources, and outputs in a logical order.
-- Name variables, locals, and resources in `snake_case`.
-- Keep descriptions on variables and outputs.
-- Mark sensitive values with `sensitive = true`.
-- Use aligned `=` formatting when practical; run `terraform fmt` instead of hand-formatting.
-- Prefer explicit `depends_on` only when required.
-- Keep logic in `locals` if it is reused or non-trivial.
+- 2-space indent. `terraform {}` block first, then providers, locals, variables, resources, outputs.
+- `snake_case` for variables, locals, resources. Descriptions on all variables/outputs.
+- `sensitive = true` on secrets. Run `terraform fmt` instead of hand-formatting.
+- Use `locals` for reused or non-trivial logic. Explicit `depends_on` only when required.
 
 ### Ansible / YAML
 
-- Use 2-space YAML indentation.
-- Use descriptive task names in sentence case (e.g. `Install k3s server`).
-- Keep tasks idempotent; use `changed_when: false` and `failed_when: false` for probes and checks.
-- Use `command`/`shell` only when a dedicated module is not a better fit.
-- Use `shell` only when you need pipes, redirection, heredocs, or shell expansion.
-- Prefer `when` guards and `default(...)` filters over duplicating tasks.
-- Keep role names and file names kebab-case; keep variables snake_case.
-- For multi-line shell snippets in workflows or tasks, use `set -e` or `set -euo pipefail` when the command sequence should fail fast.
+- 2-space YAML indent. Descriptive task names in sentence case.
+- Idempotent tasks: `changed_when: false` and `failed_when: false` for probes.
+- `command`/`shell` only when no dedicated module fits. `shell` only for pipes/redirection/heredocs.
+- `when` guards and `default(...)` filters over duplicated tasks.
+- Role names and filenames: kebab-case. Variables: snake_case.
+- Multi-line shell in workflows: `set -e` or `set -euo pipefail` for fail-fast.
 
 ### Kubernetes / Flux YAML
 
-- Keep one Kubernetes object per file unless the repo already groups a small set of tightly related objects.
-- Use kebab-case filenames that match the repo pattern (`helmrelease-*.yaml`, `kustomization-*.yaml`, `*-externalsecret.yaml`).
-- Keep addon manifests under `infrastructure/addons/<addon>/` with a nested `kustomization.yaml`.
-- Keep Flux graph objects in `clusters/prod/flux-system/`.
-- Quote strings that contain `:`, `*`, cron expressions, or shell-sensitive characters.
+- One object per file. Kebab-case filenames matching repo patterns: `helmrelease-*.yaml`, `kustomization-*.yaml`, `*-externalsecret.yaml`.
+- Addon manifests live in `infrastructure/addons/<addon>/` with a `kustomization.yaml`.
+- Flux graph objects in `clusters/prod/flux-system/`.
+- Each addon gets a `kustomization-<addon>.yaml` entry in `infrastructure/addons/` with `dependsOn` for ordering.
+- Quote strings with `:`, `*`, cron expressions, or shell-sensitive chars.
 - Preserve existing labels/annotations unless the change specifically needs them.
 
 ### Python
 
-- Follow PEP 8 style and keep imports ordered: stdlib, third-party, local.
-- Use `snake_case` for functions and variables.
-- Keep scripts small and explicit; exit non-zero on failure.
-- Prefer clear subprocess error handling over silent failures.
+- PEP 8. Imports ordered: stdlib, third-party, local. `snake_case` for functions/variables.
+- Scripts small and explicit. Exit non-zero on failure. Clear subprocess error handling.
 
-## Editing Practices
+## Known Issues & Workarounds
 
-- Read the target file and adjacent patterns before editing.
-- Preserve user changes; do not overwrite unrelated diffs.
-- Prefer `apply_patch` for small single-file edits.
-- Use scripting only when it is cleaner than repeated manual edits.
-- Keep comments minimal and only add them for non-obvious logic.
+- **rancher-backup post-install job** (`rancher-backup-patch-sa`) fails because `rancher/kuberlr-kubectl` can't download kubectl. CI patches the SA and deletes the failed job. Do NOT set `s3` block in HelmRelease values — put S3 config in the Backup CR instead.
+- **B2 ExternalSecret** must use key names `accessKey` and `secretKey` (not `aws_access_key_id`/`aws_secret_access_key`).
+- **Stale Tailscale devices**: After cluster rebuild, delete stale offline `rancher` devices before booting. The `tailscale-cleanup` Ansible role handles this via the Tailscale API.
+- **Restricted B2 keys**: `b2_authorize_account` may return `allowed.bucketId: null`. CI falls back to `b2_list_buckets` to resolve bucket ID by name.
 
 ## Secrets / Security
 
 - Never commit tokens, passwords, kubeconfigs, private keys, or generated secrets.
-- Use Gitea secrets, Doppler, or External Secrets for runtime secrets.
-- Avoid printing secret values in logs, comments, or commit messages.
-- If you must inspect a secret locally, only verify shape/length or compare values indirectly.
+- Runtime secrets via Gitea secrets (CI), Doppler, or External Secrets Operator.
+- `terraform.tfvars` and `outputs/` are gitignored. Never print secret values in logs or commits.
 
-## Workflow Expectations
+## CI Pipeline (`.gitea/workflows/deploy.yml`)
 
-- Read the target file and nearby patterns before editing.
-- Check `git status` before and after your changes.
-- Run the narrowest relevant validation command after edits.
+1. Terraform: fmt check → init → validate → import existing servers → plan → apply (main only)
+2. Ansible: install deps → generate inventory → run site.yml with extra vars (secrets injected from Gitea)
+3. Flux bootstrap: install kubectl/flux → rewrite kubeconfig → apply CRDs → apply graph → wait for addons
+4. Rancher post-install: wait for Rancher/backup operator → patch SA → clean failed jobs → force reconcile
+5. B2 restore: authorize B2 → find latest backup → create Restore CR → poll until ready
+6. Health checks: nodes, Flux objects, pods, storage class
+
+## Editing Practices
+
+- Read target file and adjacent patterns before editing.
+- Run the narrowest validation command after edits.
 - If you make a live-cluster workaround, also update the declarative manifests so Flux can own it.
-- Do not overwrite user changes you did not make.
-- If a change spans Terraform + Ansible + Flux, update and verify each layer separately.
-
-## CI / Workflow Notes
-
-- CI currently uses `.gitea/workflows/deploy.yml`, `.gitea/workflows/destroy.yml`, and `.gitea/workflows/dashboards.yml` as the canonical automation references.
-- The workflows run `terraform fmt -check -recursive`, `terraform validate`, Terraform plan/apply, Ansible bootstrap, and targeted Flux bootstrap steps.
-- If you change workflow behavior, keep the repo docs and the workflow commands in sync.
+- Changes spanning Terraform + Ansible + Flux: update and verify each layer separately.
+- Check `git status` before and after changes.
 
 ## Cursor / Copilot Rules
 
-- No `.cursor/rules/`, `.cursorrules`, or `.github/copilot-instructions.md` files were present when this file was created.
-- If those files are added later, mirror their guidance here and treat them as authoritative.
+- No `.cursor/rules/`, `.cursorrules`, or `.github/copilot-instructions.md` files exist.
+- If added later, mirror their guidance here and treat them as authoritative.