From 6c6b9d20cad859f61ff2907926aad111522f3195 Mon Sep 17 00:00:00 2001
From: MichaelFisher1997
Date: Wed, 22 Apr 2026 01:14:21 +0000
Subject: [PATCH] update README

---
 AGENTS.md | 164 +++++++++++------------------------------------------
 1 file changed, 34 insertions(+), 130 deletions(-)

diff --git a/AGENTS.md b/AGENTS.md
index c4d07f5..92e1329 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -1,144 +1,48 @@
 # AGENTS.md
 
-Repository guide for agentic contributors working in this repo.
+Repository guide for OpenCode sessions in this repo.
 
-## Scope
+## Read First
 
-- Infrastructure repo for a Hetzner + k3s + Flux stack running Rancher.
-- Primary areas: `terraform/`, `ansible/`, `clusters/`, `infrastructure/`, `apps/`, `.gitea/workflows/`.
-- Treat `README.md` and `STABLE_BASELINE.md` as user-facing context, but prefer current manifests and workflows as source of truth.
-- Keep changes small and reviewable; prefer the narrowest file set that solves the task.
+- Trust manifests and workflows over prose when they conflict.
+- Highest-value sources: `terraform/main.tf`, `terraform/variables.tf`, `ansible/site.yml`, `clusters/prod/flux-system/`, `infrastructure/addons/kustomization.yaml`, `.gitea/workflows/deploy.yml`, `.gitea/workflows/destroy.yml`, `README.md`, `STABLE_BASELINE.md`, `scripts/refresh-kubeconfig.sh`, `scripts/smoke-check-tailnet-services.sh`.
 
-## Architecture
+## Current Baseline
 
-- **Terraform** provisions Hetzner servers, network, firewall, load balancer, SSH keys.
-- **Ansible** bootstraps OS, installs k3s (with external cloud provider), deploys Hetzner CCM, Tailscale, Doppler token.
-- **Flux** reconciles all cluster addons from this repo after Ansible hands off.
-- **Rancher** stores state in embedded etcd (NOT an external DB). Backup/restore uses the `rancher-backup` operator to B2.
-- **cert-manager** is required — Tailscale LoadBalancer does L4 TCP passthrough, so Rancher serves its own TLS.
-- **Secrets flow**: Doppler → `ClusterSecretStore` (doppler-hetznerterra) → `ExternalSecret` resources → k8s Secrets.
-- Rancher is reachable only over Tailscale at `https://rancher.silverside-gopher.ts.net/`.
-- Grafana and Prometheus are exposed via dedicated Tailscale LoadBalancer services at `http://grafana.silverside-gopher.ts.net/` and `http://prometheus.silverside-gopher.ts.net:9090/`.
+- HA private cluster: 3 control planes, 3 workers.
+- Tailscale is the private access path for Rancher and shared services.
+- Rancher, Grafana, and Prometheus are exposed through Tailscale; Flux UI / Weave GitOps is removed.
+- `apps/` is suspended by default.
+- Rancher stores state in embedded etcd; backup/restore uses `rancher-backup` to B2.
 
-## Important Files
+## Common Commands
 
-- `terraform/main.tf` — provider and version pins
-- `terraform/variables.tf` — input surface and defaults
-- `terraform/firewall.tf` — firewall rules (tailnet CIDR, internal cluster ports)
-- `ansible/site.yml` — ordered bootstrap playbook (roles: common → k3s-server → ccm → k3s-agent → doppler → tailscale-cleanup)
-- `ansible/generate_inventory.py` — renders `ansible/inventory.ini` from Terraform outputs via Jinja2
-- `clusters/prod/flux-system/` — Flux GitRepository and top-level Kustomization resources
-- `infrastructure/addons/kustomization.yaml` — root addon graph with dependency ordering
-- `infrastructure/addons//` — each addon is a self-contained dir with its own `kustomization.yaml`
-- `.gitea/workflows/deploy.yml` — canonical CI: terraform → ansible → flux bootstrap → B2 restore → health checks
+- Terraform: `terraform -chdir=terraform fmt -recursive`, `terraform -chdir=terraform validate`, `terraform -chdir=terraform plan -var-file=../terraform.tfvars`, `terraform -chdir=terraform apply -var-file=../terraform.tfvars`
+- Ansible: `ansible-galaxy collection install -r ansible/requirements.yml`, `cd ansible && python3 generate_inventory.py`, `ansible-playbook -i ansible/inventory.ini ansible/site.yml --syntax-check`, `ansible-playbook ansible/site.yml`
+- Flux/Kustomize: `kubectl kustomize infrastructure/addons/`, `kubectl kustomize clusters/prod/flux-system`
+- Kubeconfig refresh: `scripts/refresh-kubeconfig.sh `
+- Tailnet smoke check: `ssh root@ 'bash -s' < scripts/smoke-check-tailnet-services.sh`
 
-## Build / Validate / Test
+## Workflow Rules
 
-### Terraform
+- Keep diffs small and validate only the directory you edited.
+- Update manifests and docs together when behavior changes.
+- Use `set -euo pipefail` in workflow shell blocks.
+- CI deploy order is Terraform -> Ansible -> Flux bootstrap -> Rancher restore -> health checks.
+- One object per Kubernetes YAML file; keep filenames kebab-case.
+- If `kubectl` points at `localhost:8080` after a rebuild, refresh kubeconfig from the primary control-plane IP.
 
-- Format: `terraform -chdir=terraform fmt -recursive`
-- Check formatting: `terraform -chdir=terraform fmt -check -recursive`
-- Validate: `terraform -chdir=terraform validate`
-- Plan (full): `terraform -chdir=terraform plan -var-file=../terraform.tfvars`
-- Plan one resource: `terraform -chdir=terraform plan -var-file=../terraform.tfvars -target=hcloud_server.control_plane[0]`
-- Apply: `terraform -chdir=terraform apply -var-file=../terraform.tfvars`
-- State inspection: `terraform -chdir=terraform state list` / `terraform state show `
+## Repo-Specific Gotchas
 
-### Ansible
+- `rancher-backup` uses a postRenderer to swap the broken hook image to `rancher/kubectl:v1.34.0`; do not put S3 config in HelmRelease values. Put it in the Backup CR.
+- Tailscale cleanup only runs before service proxies exist; it removes stale offline `rancher`/`grafana`/`prometheus`/`flux` devices, then must stop so live proxies are not deleted.
+- Keep the Tailscale operator on the stable Helm repo `https://pkgs.tailscale.com/helmcharts` at `1.96.5` unless you have a reason to change it.
+- Current private URLs:
+  - Rancher: `https://rancher.silverside-gopher.ts.net/`
+  - Grafana: `http://grafana.silverside-gopher.ts.net/`
+  - Prometheus: `http://prometheus.silverside-gopher.ts.net:9090/`
 
-- Install collections: `ansible-galaxy collection install -r ansible/requirements.yml`
-- Generate inventory: `cd ansible && python3 generate_inventory.py` (requires Terraform outputs)
-- Syntax check: `ansible-playbook -i ansible/inventory.ini ansible/site.yml --syntax-check`
-- Dry-run one host: `ansible-playbook -i ansible/inventory.ini ansible/site.yml --check --diff -l control_plane[0]`
-- Full bootstrap: `ansible-playbook ansible/site.yml`
-- Targeted: `ansible-playbook ansible/site.yml -t upgrade` or `-t reset`
-- Dashboards only: `ansible-playbook ansible/dashboards.yml`
+## Secrets
 
-### Python
-
-- Syntax check: `python3 -m py_compile ansible/generate_inventory.py`
-- Run: `cd ansible && python3 generate_inventory.py`
-
-### Kubernetes / Flux manifests
-
-- Render single addon: `kubectl kustomize infrastructure/addons/`
-- Render cluster bootstrap: `kubectl kustomize clusters/prod/flux-system`
-- Validate only the directory you edited, not the whole repo.
-
-### Kubeconfig refresh
-
-- Preferred: `scripts/refresh-kubeconfig.sh `
-- Manual: `ssh -i ~/.ssh/infra root@ "cat /etc/rancher/k3s/k3s.yaml" | sed 's/127.0.0.1//g' > outputs/kubeconfig`
-
-## Code Style
-
-### General
-
-- Match existing style in adjacent files. No new tools/frameworks unless the repo already uses them.
-- Prefer ASCII. Keep diffs minimal. No unrelated cleanup.
-- No comments unless the logic is non-obvious.
-
-### Terraform / HCL
-
-- 2-space indent. `terraform {}` block first, then providers, locals, variables, resources, outputs.
-- `snake_case` for variables, locals, resources. Descriptions on all variables/outputs.
-- `sensitive = true` on secrets. Run `terraform fmt` instead of hand-formatting.
-- Use `locals` for reused or non-trivial logic. Explicit `depends_on` only when required.
-
-### Ansible / YAML
-
-- 2-space YAML indent. Descriptive task names in sentence case.
-- Idempotent tasks: `changed_when: false` and `failed_when: false` for probes.
-- `command`/`shell` only when no dedicated module fits. `shell` only for pipes/redirection/heredocs.
-- `when` guards and `default(...)` filters over duplicated tasks.
-- Role names and filenames: kebab-case. Variables: snake_case.
-- Multi-line shell in workflows: `set -e` or `set -euo pipefail` for fail-fast.
-
-### Kubernetes / Flux YAML
-
-- One object per file. Kebab-case filenames matching repo patterns: `helmrelease-*.yaml`, `kustomization-*.yaml`, `*-externalsecret.yaml`.
-- Addon manifests live in `infrastructure/addons//` with a `kustomization.yaml`.
-- Flux graph objects in `clusters/prod/flux-system/`.
-- Each addon gets a `kustomization-.yaml` entry in `infrastructure/addons/` with `dependsOn` for ordering.
-- Quote strings with `:`, `*`, cron expressions, or shell-sensitive chars.
-- Preserve existing labels/annotations unless the change specifically needs them.
-
-### Python
-
-- PEP 8. Imports ordered: stdlib, third-party, local. `snake_case` for functions/variables.
-- Scripts small and explicit. Exit non-zero on failure. Clear subprocess error handling.
-
-## Known Issues & Workarounds
-
-- **rancher-backup post-install job** (`rancher-backup-patch-sa`) uses a postRenderer in the HelmRelease to replace the broken `rancher/kuberlr-kubectl` image with `rancher/kubectl`. Do NOT set `s3` block in HelmRelease values — put S3 config in the Backup CR instead.
-- **B2 ExternalSecret** must use key names `accessKey` and `secretKey` (not `aws_access_key_id`/`aws_secret_access_key`).
-- **Stale Tailscale devices**: Before service proxies exist, `tailscale-cleanup` removes stale offline `rancher`/`grafana`/`prometheus`/`flux` devices via the Tailscale API. Once the cluster has Tailscale services, cleanup is skipped so live proxy nodes are not deleted out from under the operator.
-- **Restricted B2 keys**: `b2_authorize_account` may return `allowed.bucketId: null`. CI falls back to `b2_list_buckets` to resolve bucket ID by name.
-
-## Secrets / Security
-
-- Never commit tokens, passwords, kubeconfigs, private keys, or generated secrets.
-- Runtime secrets via Gitea secrets (CI), Doppler, or External Secrets Operator.
-- `terraform.tfvars` and `outputs/` are gitignored. Never print secret values in logs or commits.
-
-## CI Pipeline (`.gitea/workflows/deploy.yml`)
-
-1. Terraform: fmt check → init → validate → import existing servers → plan → apply (main only)
-2. Ansible: install deps → generate inventory → run site.yml with extra vars (secrets injected from Gitea)
-3. Flux bootstrap: install kubectl/flux → rewrite kubeconfig → apply CRDs → apply graph → wait for addons
-4. Rancher wait: wait for Rancher and backup operator to be ready
-5. B2 restore: authorize B2 → find latest backup → create Restore CR → poll until ready
-6. Health checks: nodes, Flux objects, pods, storage class, and Tailscale URL smoke tests from `control_plane[0]`
-
-## Editing Practices
-
-- Read target file and adjacent patterns before editing.
-- Run the narrowest validation command after edits.
-- If you make a live-cluster workaround, also update the declarative manifests so Flux can own it.
-- Changes spanning Terraform + Ansible + Flux: update and verify each layer separately.
-- Check `git status` before and after changes.
-
-## Cursor / Copilot Rules
-
-- No `.cursor/rules/`, `.cursorrules`, or `.github/copilot-instructions.md` files exist.
-- If added later, mirror their guidance here and treat them as authoritative.
+- Runtime secrets live in Doppler + External Secrets.
+- Bootstrap and CI secrets stay in Gitea; never commit secrets, kubeconfigs, or private keys.
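
Review note: the patch tells contributors to use `set -euo pipefail` in workflow shell blocks and gives the CI deploy order as Terraform -> Ansible -> Flux bootstrap -> Rancher restore -> health checks. A minimal sketch of how those two rules could combine in a single fail-fast shell block follows. The function names `run_stage` and `run_pipeline` and the echoed placeholder bodies are hypothetical and are not taken from `.gitea/workflows/deploy.yml`.

```shell
#!/usr/bin/env bash
# Fail fast: exit on any error, on unset variables, and on failures inside pipelines.
set -euo pipefail

# Hypothetical stand-in for a real stage; a real workflow would call
# terraform, ansible-playbook, kubectl and so on here.
run_stage() {
  echo "stage: $1"
}

# Run the stages in the documented deploy order. Because of `set -e`,
# a failing stage stops the script before later stages start.
run_pipeline() {
  local stage
  for stage in terraform ansible flux-bootstrap rancher-restore health-checks; do
    run_stage "$stage"
  done
}

run_pipeline
```

Keeping each stage behind one function makes it easier to validate only the layer that changed, which matches the "validate only the directory you edited" rule in the same patch.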