update README

2026-04-22 01:14:21 +00:00
parent c3a2f25c94
commit 6c6b9d20ca
1 changed file with 34 additions and 130 deletions
@@ -1,144 +1,48 @@
 # AGENTS.md
-Repository guide for agentic contributors working in this repo.
+Repository guide for OpenCode sessions in this repo.
-## Scope
+## Read First
- Infrastructure repo for a Hetzner + k3s + Flux stack running Rancher.
+- Trust manifests and workflows over prose when they conflict.
- Primary areas: `terraform/`, `ansible/`, `clusters/`, `infrastructure/`, `apps/`, `.gitea/workflows/`.
+- Highest-value sources: `terraform/main.tf`, `terraform/variables.tf`, `ansible/site.yml`, `clusters/prod/flux-system/`, `infrastructure/addons/kustomization.yaml`, `.gitea/workflows/deploy.yml`, `.gitea/workflows/destroy.yml`, `README.md`, `STABLE_BASELINE.md`, `scripts/refresh-kubeconfig.sh`, `scripts/smoke-check-tailnet-services.sh`.
 - Treat `README.md` and `STABLE_BASELINE.md` as user-facing context, but prefer current manifests and workflows as source of truth.
 - Keep changes small and reviewable; prefer the narrowest file set that solves the task.
-## Architecture
+## Current Baseline
- **Terraform** provisions Hetzner servers, network, firewall, load balancer, SSH keys.
+- HA private cluster: 3 control planes, 3 workers.
- **Ansible** bootstraps OS, installs k3s (with external cloud provider), deploys Hetzner CCM, Tailscale, Doppler token.
+- Tailscale is the private access path for Rancher and shared services.
- **Flux** reconciles all cluster addons from this repo after Ansible hands off.
+- Rancher, Grafana, and Prometheus are exposed through Tailscale; Flux UI / Weave GitOps is removed.
- **Rancher** stores state in embedded etcd (NOT an external DB). Backup/restore uses the `rancher-backup` operator to B2.
+- `apps/` is suspended by default.
- **cert-manager** is required — Tailscale LoadBalancer does L4 TCP passthrough, so Rancher serves its own TLS.
+- Rancher stores state in embedded etcd; backup/restore uses `rancher-backup` to B2.
 - **Secrets flow**: Doppler → `ClusterSecretStore` (doppler-hetznerterra) → `ExternalSecret` resources → k8s Secrets.
 - Rancher is reachable only over Tailscale at `https://rancher.silverside-gopher.ts.net/`.
 - Grafana and Prometheus are exposed via dedicated Tailscale LoadBalancer services at `http://grafana.silverside-gopher.ts.net/` and `http://prometheus.silverside-gopher.ts.net:9090/`.
-## Important Files
+## Common Commands
- `terraform/main.tf` — provider and version pins
+- Terraform: `terraform -chdir=terraform fmt -recursive`, `terraform -chdir=terraform validate`, `terraform -chdir=terraform plan -var-file=../terraform.tfvars`, `terraform -chdir=terraform apply -var-file=../terraform.tfvars`
- `terraform/variables.tf` — input surface and defaults
+- Ansible: `ansible-galaxy collection install -r ansible/requirements.yml`, `cd ansible && python3 generate_inventory.py`, `ansible-playbook -i ansible/inventory.ini ansible/site.yml --syntax-check`, `ansible-playbook ansible/site.yml`
- `terraform/firewall.tf` — firewall rules (tailnet CIDR, internal cluster ports)
+- Flux/Kustomize: `kubectl kustomize infrastructure/addons/<addon>`, `kubectl kustomize clusters/prod/flux-system`
- `ansible/site.yml` — ordered bootstrap playbook (roles: common → k3s-server → ccm → k3s-agent → doppler → tailscale-cleanup)
+- Kubeconfig refresh: `scripts/refresh-kubeconfig.sh <cp1-public-ip>`
- `ansible/generate_inventory.py` — renders `ansible/inventory.ini` from Terraform outputs via Jinja2
+- Tailnet smoke check: `ssh root@<cp1-ip> 'bash -s' < scripts/smoke-check-tailnet-services.sh`
 - `clusters/prod/flux-system/` — Flux GitRepository and top-level Kustomization resources
 - `infrastructure/addons/kustomization.yaml` — root addon graph with dependency ordering
 - `infrastructure/addons/<addon>/` — each addon is a self-contained dir with its own `kustomization.yaml`
 - `.gitea/workflows/deploy.yml` — canonical CI: terraform → ansible → flux bootstrap → B2 restore → health checks
-## Build / Validate / Test
+## Workflow Rules
-### Terraform
+- Keep diffs small and validate only the directory you edited.
 - Update manifests and docs together when behavior changes.
 - Use `set -euo pipefail` in workflow shell blocks.
 - CI deploy order is Terraform -> Ansible -> Flux bootstrap -> Rancher restore -> health checks.
 - One object per Kubernetes YAML file; keep filenames kebab-case.
 - If `kubectl` points at `localhost:8080` after a rebuild, refresh kubeconfig from the primary control-plane IP.
- Format: `terraform -chdir=terraform fmt -recursive`
+## Repo-Specific Gotchas
 - Check formatting: `terraform -chdir=terraform fmt -check -recursive`
 - Validate: `terraform -chdir=terraform validate`
 - Plan (full): `terraform -chdir=terraform plan -var-file=../terraform.tfvars`
 - Plan one resource: `terraform -chdir=terraform plan -var-file=../terraform.tfvars -target='hcloud_server.control_plane[0]'` (quote the address so the shell does not glob the brackets)
 - Apply: `terraform -chdir=terraform apply -var-file=../terraform.tfvars`
 - State inspection: `terraform -chdir=terraform state list` / `terraform -chdir=terraform state show <address>`
-### Ansible
+- `rancher-backup` uses a postRenderer to swap the broken hook image to `rancher/kubectl:v1.34.0`; do not put S3 config in HelmRelease values. Put it in the Backup CR.
 - Tailscale cleanup only runs before service proxies exist; it removes stale offline `rancher`/`grafana`/`prometheus`/`flux` devices, then must stop so live proxies are not deleted.
 - Keep the Tailscale operator on the stable Helm repo `https://pkgs.tailscale.com/helmcharts` at `1.96.5` unless you have a reason to change it.
 - Current private URLs:
  - Rancher: `https://rancher.silverside-gopher.ts.net/`
  - Grafana: `http://grafana.silverside-gopher.ts.net/`
  - Prometheus: `http://prometheus.silverside-gopher.ts.net:9090/`
- Install collections: `ansible-galaxy collection install -r ansible/requirements.yml`
+## Secrets
 - Generate inventory: `cd ansible && python3 generate_inventory.py` (requires Terraform outputs)
 - Syntax check: `ansible-playbook -i ansible/inventory.ini ansible/site.yml --syntax-check`
 - Dry-run one host: `ansible-playbook -i ansible/inventory.ini ansible/site.yml --check --diff -l 'control_plane[0]'`
 - Full bootstrap: `ansible-playbook ansible/site.yml`
 - Targeted: `ansible-playbook ansible/site.yml -t upgrade` or `-t reset`
 - Dashboards only: `ansible-playbook ansible/dashboards.yml`
-### Python
+- Runtime secrets live in Doppler + External Secrets.
-
+- Bootstrap and CI secrets stay in Gitea; never commit secrets, kubeconfigs, or private keys.
 - Syntax check: `python3 -m py_compile ansible/generate_inventory.py`
 - Run: `cd ansible && python3 generate_inventory.py`
 ### Kubernetes / Flux manifests
 - Render single addon: `kubectl kustomize infrastructure/addons/<addon>`
 - Render cluster bootstrap: `kubectl kustomize clusters/prod/flux-system`
 - Validate only the directory you edited, not the whole repo.
 ### Kubeconfig refresh
 - Preferred: `scripts/refresh-kubeconfig.sh <cp1-public-ip>`
 - Manual: `ssh -i ~/.ssh/infra root@<cp1-ip> "cat /etc/rancher/k3s/k3s.yaml" | sed 's/127\.0\.0\.1/<cp1-ip>/g' > outputs/kubeconfig`
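The manual step above amounts to replacing the k3s loopback address with the control-plane IP. A minimal Python sketch of that rewrite follows, assuming the kubeconfig text has already been fetched; the function name and the sample IP `203.0.113.10` are hypothetical, and this is not the content of `scripts/refresh-kubeconfig.sh`.

```python
def rewrite_kubeconfig_server(kubeconfig_text: str, control_plane_ip: str) -> str:
    """Point a k3s kubeconfig at the control plane instead of 127.0.0.1."""
    # Plain string replacement: only the literal loopback address is touched.
    return kubeconfig_text.replace("127.0.0.1", control_plane_ip)


# Hypothetical, trimmed kubeconfig used only to illustrate the rewrite.
sample = (
    "clusters:\n"
    "- cluster:\n"
    "    server: https://127.0.0.1:6443\n"
    "  name: default\n"
)
print(rewrite_kubeconfig_server(sample, "203.0.113.10"))
```

Using a literal replacement avoids the regex pitfall where an unescaped `.` matches any character.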
 ## Code Style
 ### General
 - Match existing style in adjacent files. No new tools/frameworks unless the repo already uses them.
 - Prefer ASCII. Keep diffs minimal. No unrelated cleanup.
 - No comments unless the logic is non-obvious.
 ### Terraform / HCL
 - 2-space indent. `terraform {}` block first, then providers, locals, variables, resources, outputs.
 - `snake_case` for variables, locals, resources. Descriptions on all variables/outputs.
 - `sensitive = true` on secrets. Run `terraform fmt` instead of hand-formatting.
 - Use `locals` for reused or non-trivial logic. Explicit `depends_on` only when required.
 ### Ansible / YAML
 - 2-space YAML indent. Descriptive task names in sentence case.
 - Idempotent tasks: `changed_when: false` and `failed_when: false` for probes.
 - `command`/`shell` only when no dedicated module fits. `shell` only for pipes/redirection/heredocs.
 - `when` guards and `default(...)` filters over duplicated tasks.
 - Role names and filenames: kebab-case. Variables: snake_case.
 - Multi-line shell in workflows: `set -e` or `set -euo pipefail` for fail-fast.
 ### Kubernetes / Flux YAML
 - One object per file. Kebab-case filenames matching repo patterns: `helmrelease-*.yaml`, `kustomization-*.yaml`, `*-externalsecret.yaml`.
 - Addon manifests live in `infrastructure/addons/<addon>/` with a `kustomization.yaml`.
 - Flux graph objects in `clusters/prod/flux-system/`.
 - Each addon gets a `kustomization-<addon>.yaml` entry in `infrastructure/addons/` with `dependsOn` for ordering.
 - Quote strings with `:`, `*`, cron expressions, or shell-sensitive chars.
 - Preserve existing labels/annotations unless the change specifically requires modifying them.
 ### Python
 - PEP 8. Imports ordered: stdlib, third-party, local. `snake_case` for functions/variables.
 - Scripts small and explicit. Exit non-zero on failure. Clear subprocess error handling.
 ## Known Issues & Workarounds
 - **rancher-backup post-install job** (`rancher-backup-patch-sa`) uses a postRenderer in the HelmRelease to replace the broken `rancher/kuberlr-kubectl` image with `rancher/kubectl`. Do NOT set `s3` block in HelmRelease values — put S3 config in the Backup CR instead.
 - **B2 ExternalSecret** must use key names `accessKey` and `secretKey` (not `aws_access_key_id`/`aws_secret_access_key`).
 - **Stale Tailscale devices**: Before service proxies exist, `tailscale-cleanup` removes stale offline `rancher`/`grafana`/`prometheus`/`flux` devices via the Tailscale API. Once the cluster has Tailscale services, cleanup is skipped so live proxy nodes are not deleted out from under the operator.
 - **Restricted B2 keys**: `b2_authorize_account` may return `allowed.bucketId: null`. CI falls back to `b2_list_buckets` to resolve bucket ID by name.
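The restricted-key fallback above can be sketched in Python as follows. This is an illustration of the lookup logic only, not the code in `.gitea/workflows/deploy.yml`; the function name, the bucket name `rancher-backups`, and the sample responses are hypothetical, and the responses are trimmed to the fields the sketch reads (`allowed.bucketId`, `buckets[].bucketName`, `buckets[].bucketId`).

```python
from typing import Callable, Optional


def resolve_bucket_id(
    auth_response: dict,
    bucket_name: str,
    list_buckets: Callable[[], dict],
) -> Optional[str]:
    """Use allowed.bucketId when present, otherwise look the bucket up by name."""
    allowed = auth_response.get("allowed") or {}
    bucket_id = allowed.get("bucketId")
    if bucket_id:
        return bucket_id
    # Fallback: scan a b2_list_buckets-style response for a matching name.
    for bucket in list_buckets().get("buckets", []):
        if bucket.get("bucketName") == bucket_name:
            return bucket.get("bucketId")
    return None


# Hypothetical responses; real ones carry more fields.
auth = {"accountId": "abc123", "allowed": {"bucketId": None, "bucketName": None}}
listing = {"buckets": [{"bucketId": "b-001", "bucketName": "rancher-backups"}]}
print(resolve_bucket_id(auth, "rancher-backups", lambda: listing))
```

Passing the listing call as a callable means the second API request is made only when the authorize response does not already carry a bucket ID.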
 ## Secrets / Security
 - Never commit tokens, passwords, kubeconfigs, private keys, or generated secrets.
 - Runtime secrets via Gitea secrets (CI), Doppler, or External Secrets Operator.
 - `terraform.tfvars` and `outputs/` are gitignored. Never print secret values in logs or commits.
 ## CI Pipeline (`.gitea/workflows/deploy.yml`)
 1. Terraform: fmt check → init → validate → import existing servers → plan → apply (main only)
 2. Ansible: install deps → generate inventory → run site.yml with extra vars (secrets injected from Gitea)
 3. Flux bootstrap: install kubectl/flux → rewrite kubeconfig → apply CRDs → apply graph → wait for addons
 4. Rancher wait: wait for Rancher and backup operator to be ready
 5. B2 restore: authorize B2 → find latest backup → create Restore CR → poll until ready
 6. Health checks: nodes, Flux objects, pods, storage class, and Tailscale URL smoke tests from `control_plane[0]`
 ## Editing Practices
 - Read target file and adjacent patterns before editing.
 - Run the narrowest validation command after edits.
 - If you make a live-cluster workaround, also update the declarative manifests so Flux can own it.
 - Changes spanning Terraform + Ansible + Flux: update and verify each layer separately.
 - Check `git status` before and after changes.
 ## Cursor / Copilot Rules
 - No `.cursor/rules/`, `.cursorrules`, or `.github/copilot-instructions.md` files exist.
 - If added later, mirror their guidance here and treat them as authoritative.