update README
@@ -1,144 +1,48 @@
|
|||||||
# AGENTS.md
|
# AGENTS.md
|
||||||
|
|
||||||
Repository guide for agentic contributors working in this repo.
|
Repository guide for OpenCode sessions in this repo.
|
||||||
|
|
||||||
## Scope
|
## Read First
|
||||||
|
|
||||||
- Infrastructure repo for a Hetzner + k3s + Flux stack running Rancher.
|
- Trust manifests and workflows over prose when they conflict.
|
||||||
- Primary areas: `terraform/`, `ansible/`, `clusters/`, `infrastructure/`, `apps/`, `.gitea/workflows/`.
|
- Highest-value sources: `terraform/main.tf`, `terraform/variables.tf`, `ansible/site.yml`, `clusters/prod/flux-system/`, `infrastructure/addons/kustomization.yaml`, `.gitea/workflows/deploy.yml`, `.gitea/workflows/destroy.yml`, `README.md`, `STABLE_BASELINE.md`, `scripts/refresh-kubeconfig.sh`, `scripts/smoke-check-tailnet-services.sh`.
|
||||||
- Treat `README.md` and `STABLE_BASELINE.md` as user-facing context, but prefer current manifests and workflows as source of truth.
|
|
||||||
- Keep changes small and reviewable; prefer the narrowest file set that solves the task.
|
|
||||||
|
|
||||||
## Architecture
|
## Current Baseline
|
||||||
|
|
||||||
- **Terraform** provisions Hetzner servers, network, firewall, load balancer, SSH keys.
|
- HA private cluster: 3 control planes, 3 workers.
|
||||||
- **Ansible** bootstraps OS, installs k3s (with external cloud provider), deploys Hetzner CCM, Tailscale, Doppler token.
|
- Tailscale is the private access path for Rancher and shared services.
|
||||||
- **Flux** reconciles all cluster addons from this repo after Ansible hands off.
|
- Rancher, Grafana, and Prometheus are exposed through Tailscale; Flux UI / Weave GitOps is removed.
|
||||||
- **Rancher** stores state in embedded etcd (NOT an external DB). Backup/restore uses the `rancher-backup` operator to B2.
|
- `apps/` is suspended by default.
|
||||||
- **cert-manager** is required — Tailscale LoadBalancer does L4 TCP passthrough, so Rancher serves its own TLS.
|
- Rancher stores state in embedded etcd; backup/restore uses `rancher-backup` to B2.
|
||||||
- **Secrets flow**: Doppler → `ClusterSecretStore` (doppler-hetznerterra) → `ExternalSecret` resources → k8s Secrets.
|
|
||||||
- Rancher is reachable only over Tailscale at `https://rancher.silverside-gopher.ts.net/`.
|
|
||||||
- Grafana and Prometheus are exposed via dedicated Tailscale LoadBalancer services at `http://grafana.silverside-gopher.ts.net/` and `http://prometheus.silverside-gopher.ts.net:9090/`.
|
|
||||||
|
|
||||||
## Important Files
|
## Common Commands
|
||||||
|
|
||||||
- `terraform/main.tf` — provider and version pins
|
- Terraform: `terraform -chdir=terraform fmt -recursive`, `terraform -chdir=terraform validate`, `terraform -chdir=terraform plan -var-file=../terraform.tfvars`, `terraform -chdir=terraform apply -var-file=../terraform.tfvars`
|
||||||
- `terraform/variables.tf` — input surface and defaults
|
- Ansible: `ansible-galaxy collection install -r ansible/requirements.yml`, `cd ansible && python3 generate_inventory.py`, `ansible-playbook -i ansible/inventory.ini ansible/site.yml --syntax-check`, `ansible-playbook ansible/site.yml`
|
||||||
- `terraform/firewall.tf` — firewall rules (tailnet CIDR, internal cluster ports)
|
- Flux/Kustomize: `kubectl kustomize infrastructure/addons/<addon>`, `kubectl kustomize clusters/prod/flux-system`
|
||||||
- `ansible/site.yml` — ordered bootstrap playbook (roles: common → k3s-server → ccm → k3s-agent → doppler → tailscale-cleanup)
|
- Kubeconfig refresh: `scripts/refresh-kubeconfig.sh <cp1-public-ip>`
|
||||||
- `ansible/generate_inventory.py` — renders `ansible/inventory.ini` from Terraform outputs via Jinja2
|
- Tailnet smoke check: `ssh root@<cp1-ip> 'bash -s' < scripts/smoke-check-tailnet-services.sh`
|
||||||
- `clusters/prod/flux-system/` — Flux GitRepository and top-level Kustomization resources
|
|
||||||
- `infrastructure/addons/kustomization.yaml` — root addon graph with dependency ordering
|
|
||||||
- `infrastructure/addons/<addon>/` — each addon is a self-contained dir with its own `kustomization.yaml`
|
|
||||||
- `.gitea/workflows/deploy.yml` — canonical CI: terraform → ansible → flux bootstrap → B2 restore → health checks
|
|
||||||
|
|
||||||
## Build / Validate / Test
|
## Workflow Rules
|
||||||
|
|
||||||
### Terraform
|
- Keep diffs small and validate only the directory you edited.
|
||||||
|
- Update manifests and docs together when behavior changes.
|
||||||
|
- Use `set -euo pipefail` in workflow shell blocks.
|
||||||
|
- CI deploy order is Terraform -> Ansible -> Flux bootstrap -> Rancher restore -> health checks.
|
||||||
|
- One object per Kubernetes YAML file; keep filenames kebab-case.
|
||||||
|
- If `kubectl` points at `localhost:8080` after a rebuild, refresh kubeconfig from the primary control-plane IP.
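The `set -euo pipefail` rule above matters most for pipelines, where a failing producer is otherwise masked by the last command. The following is a minimal sketch of that behavior, not taken from the repo's workflows; the variable names are illustrative only.

```shell
# Without pipefail, the pipeline's status is that of the last command (true).
without=$(bash -c 'false | true; echo $?')

# With pipefail, the failing producer (false) determines the pipeline status.
with=$(bash -c 'set -o pipefail; false | true; echo $?')

# With the full strict mode, the shell exits at the failing pipeline,
# so the echo after it never runs.
if bash -c 'set -euo pipefail; false | true; echo "not reached"'; then
  strict="continued"
else
  strict="stopped"
fi

echo "without pipefail: $without"
echo "with pipefail:    $with"
echo "strict mode:      $strict"
```

In a workflow shell block this means a failed `terraform` or `kubectl` call feeding a pipe stops the step instead of letting later commands run against bad input.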
|
||||||
|
|
||||||
- Format: `terraform -chdir=terraform fmt -recursive`
|
## Repo-Specific Gotchas
|
||||||
- Check formatting: `terraform -chdir=terraform fmt -check -recursive`
|
|
||||||
- Validate: `terraform -chdir=terraform validate`
|
|
||||||
- Plan (full): `terraform -chdir=terraform plan -var-file=../terraform.tfvars`
|
|
||||||
- Plan one resource: `terraform -chdir=terraform plan -var-file=../terraform.tfvars -target='hcloud_server.control_plane[0]'`
|
|
||||||
- Apply: `terraform -chdir=terraform apply -var-file=../terraform.tfvars`
|
|
||||||
- State inspection: `terraform -chdir=terraform state list` / `terraform -chdir=terraform state show <address>`
|
|
||||||
|
|
||||||
### Ansible
|
- `rancher-backup` uses a postRenderer to swap the broken hook image to `rancher/kubectl:v1.34.0`; do not put S3 config in HelmRelease values. Put it in the Backup CR.
|
||||||
|
- Tailscale cleanup only runs before service proxies exist; it removes stale offline `rancher`/`grafana`/`prometheus`/`flux` devices, then must stop so live proxies are not deleted.
|
||||||
|
- Keep the Tailscale operator on the stable Helm repo `https://pkgs.tailscale.com/helmcharts` at `1.96.5` unless you have a reason to change it.
|
||||||
|
- Current private URLs:
|
||||||
|
- Rancher: `https://rancher.silverside-gopher.ts.net/`
|
||||||
|
- Grafana: `http://grafana.silverside-gopher.ts.net/`
|
||||||
|
- Prometheus: `http://prometheus.silverside-gopher.ts.net:9090/`
|
||||||
|
|
||||||
- Install collections: `ansible-galaxy collection install -r ansible/requirements.yml`
|
## Secrets
|
||||||
- Generate inventory: `cd ansible && python3 generate_inventory.py` (requires Terraform outputs)
|
|
||||||
- Syntax check: `ansible-playbook -i ansible/inventory.ini ansible/site.yml --syntax-check`
|
|
||||||
- Dry-run one host: `ansible-playbook -i ansible/inventory.ini ansible/site.yml --check --diff -l 'control_plane[0]'`
|
|
||||||
- Full bootstrap: `ansible-playbook ansible/site.yml`
|
|
||||||
- Targeted: `ansible-playbook ansible/site.yml -t upgrade` or `-t reset`
|
|
||||||
- Dashboards only: `ansible-playbook ansible/dashboards.yml`
|
|
||||||
|
|
||||||
### Python
|
- Runtime secrets live in Doppler + External Secrets.
|
||||||
|
- Bootstrap and CI secrets stay in Gitea; never commit secrets, kubeconfigs, or private keys.
|
||||||
- Syntax check: `python3 -m py_compile ansible/generate_inventory.py`
|
|
||||||
- Run: `cd ansible && python3 generate_inventory.py`
|
|
||||||
|
|
||||||
### Kubernetes / Flux manifests
|
|
||||||
|
|
||||||
- Render single addon: `kubectl kustomize infrastructure/addons/<addon>`
|
|
||||||
- Render cluster bootstrap: `kubectl kustomize clusters/prod/flux-system`
|
|
||||||
- Validate only the directory you edited, not the whole repo.
|
|
||||||
|
|
||||||
### Kubeconfig refresh
|
|
||||||
|
|
||||||
- Preferred: `scripts/refresh-kubeconfig.sh <cp1-public-ip>`
|
|
||||||
- Manual: `ssh -i ~/.ssh/infra root@<cp1-ip> "cat /etc/rancher/k3s/k3s.yaml" | sed 's/127\.0\.0\.1/<cp1-ip>/g' > outputs/kubeconfig`
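The manual rewrite step can be isolated into a small filter, which makes it easy to check before pointing it at a real kubeconfig. This is a sketch under assumptions: the function name `rewrite_kubeconfig_server` is hypothetical, and `203.0.113.10` is a documentation-range placeholder rather than a real control-plane address.

```shell
# Rewrite the loopback API endpoint in a k3s kubeconfig read from stdin.
# The dots are escaped so the pattern matches only the literal 127.0.0.1.
rewrite_kubeconfig_server() {
  cp_ip="$1"
  sed "s/127\.0\.0\.1/${cp_ip}/g"
}

# Example: a single kubeconfig line piped through the filter.
printf 'server: https://127.0.0.1:6443\n' | rewrite_kubeconfig_server 203.0.113.10
```

In practice the input would come from the `ssh ... "cat /etc/rancher/k3s/k3s.yaml"` command above, with the output redirected to `outputs/kubeconfig`.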
|
|
||||||
|
|
||||||
## Code Style
|
|
||||||
|
|
||||||
### General
|
|
||||||
|
|
||||||
- Match existing style in adjacent files. No new tools/frameworks unless the repo already uses them.
|
|
||||||
- Prefer ASCII. Keep diffs minimal. No unrelated cleanup.
|
|
||||||
- No comments unless the logic is non-obvious.
|
|
||||||
|
|
||||||
### Terraform / HCL
|
|
||||||
|
|
||||||
- 2-space indent. `terraform {}` block first, then providers, locals, variables, resources, outputs.
|
|
||||||
- `snake_case` for variables, locals, resources. Descriptions on all variables/outputs.
|
|
||||||
- `sensitive = true` on secrets. Run `terraform fmt` instead of hand-formatting.
|
|
||||||
- Use `locals` for reused or non-trivial logic. Explicit `depends_on` only when required.
|
|
||||||
|
|
||||||
### Ansible / YAML
|
|
||||||
|
|
||||||
- 2-space YAML indent. Descriptive task names in sentence case.
|
|
||||||
- Idempotent tasks: `changed_when: false` and `failed_when: false` for probes.
|
|
||||||
- `command`/`shell` only when no dedicated module fits. `shell` only for pipes/redirection/heredocs.
|
|
||||||
- `when` guards and `default(...)` filters over duplicated tasks.
|
|
||||||
- Role names and filenames: kebab-case. Variables: snake_case.
|
|
||||||
- Multi-line shell in workflows: `set -e` or `set -euo pipefail` for fail-fast.
|
|
||||||
|
|
||||||
### Kubernetes / Flux YAML
|
|
||||||
|
|
||||||
- One object per file. Kebab-case filenames matching repo patterns: `helmrelease-*.yaml`, `kustomization-*.yaml`, `*-externalsecret.yaml`.
|
|
||||||
- Addon manifests live in `infrastructure/addons/<addon>/` with a `kustomization.yaml`.
|
|
||||||
- Flux graph objects in `clusters/prod/flux-system/`.
|
|
||||||
- Each addon gets a `kustomization-<addon>.yaml` entry in `infrastructure/addons/` with `dependsOn` for ordering.
|
|
||||||
- Quote strings with `:`, `*`, cron expressions, or shell-sensitive chars.
|
|
||||||
- Preserve existing labels/annotations unless the change specifically requires altering them.
|
|
||||||
|
|
||||||
### Python
|
|
||||||
|
|
||||||
- PEP 8. Imports ordered: stdlib, third-party, local. `snake_case` for functions/variables.
|
|
||||||
- Scripts small and explicit. Exit non-zero on failure. Clear subprocess error handling.
|
|
||||||
|
|
||||||
## Known Issues & Workarounds
|
|
||||||
|
|
||||||
- **rancher-backup post-install job** (`rancher-backup-patch-sa`) uses a postRenderer in the HelmRelease to replace the broken `rancher/kuberlr-kubectl` image with `rancher/kubectl`. Do NOT set `s3` block in HelmRelease values — put S3 config in the Backup CR instead.
|
|
||||||
- **B2 ExternalSecret** must use key names `accessKey` and `secretKey` (not `aws_access_key_id`/`aws_secret_access_key`).
|
|
||||||
- **Stale Tailscale devices**: Before service proxies exist, `tailscale-cleanup` removes stale offline `rancher`/`grafana`/`prometheus`/`flux` devices via the Tailscale API. Once the cluster has Tailscale services, cleanup is skipped so live proxy nodes are not deleted out from under the operator.
|
|
||||||
- **Restricted B2 keys**: `b2_authorize_account` may return `allowed.bucketId: null`. CI falls back to `b2_list_buckets` to resolve bucket ID by name.
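The stale-device rule above comes down to two guards: skip everything once service proxies exist, and otherwise select only offline devices with the known service names. The sketch below illustrates that selection logic only. The function name, the `name status` input format, and the handling of `-N` suffixed hostnames are assumptions; the actual role talks to the Tailscale API and may work differently.

```shell
# Print the devices that would be removed.
# $1: "true" if Tailscale service proxies already exist in the cluster.
# stdin: one "hostname status" pair per line (hypothetical format).
select_stale_devices() {
  proxies_exist="$1"

  # Once live proxies exist, cleanup must not delete anything.
  if [ "$proxies_exist" = "true" ]; then
    return 0
  fi

  while read -r name status; do
    case "$name" in
      # Assumption: duplicates may carry a "-N" suffix, e.g. flux-1.
      rancher|rancher-*|grafana|grafana-*|prometheus|prometheus-*|flux|flux-*)
        if [ "$status" = "offline" ]; then
          echo "$name"
        fi
        ;;
    esac
  done
  return 0
}

# Example: only the offline service devices are selected.
printf 'rancher offline\ngrafana online\nflux-1 offline\nother offline\n' \
  | select_stale_devices false
```

Keeping the proxy check as the first guard mirrors the ordering described above: the decision to skip is made before any device is considered for deletion.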
|
|
||||||
|
|
||||||
## Secrets / Security
|
|
||||||
|
|
||||||
- Never commit tokens, passwords, kubeconfigs, private keys, or generated secrets.
|
|
||||||
- Runtime secrets via Gitea secrets (CI), Doppler, or External Secrets Operator.
|
|
||||||
- `terraform.tfvars` and `outputs/` are gitignored. Never print secret values in logs or commits.
|
|
||||||
|
|
||||||
## CI Pipeline (`.gitea/workflows/deploy.yml`)
|
|
||||||
|
|
||||||
1. Terraform: fmt check → init → validate → import existing servers → plan → apply (main only)
|
|
||||||
2. Ansible: install deps → generate inventory → run site.yml with extra vars (secrets injected from Gitea)
|
|
||||||
3. Flux bootstrap: install kubectl/flux → rewrite kubeconfig → apply CRDs → apply graph → wait for addons
|
|
||||||
4. Rancher wait: wait for Rancher and backup operator to be ready
|
|
||||||
5. B2 restore: authorize B2 → find latest backup → create Restore CR → poll until ready
|
|
||||||
6. Health checks: nodes, Flux objects, pods, storage class, and Tailscale URL smoke tests from `control_plane[0]`
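The stage order above is strictly sequential, and a failed stage should prevent later ones from running. The following dry-run sketch shows that control flow only. The stage labels are shorthand for the six steps, `run_stage` and `run_pipeline` are hypothetical helpers, and the optional argument exists purely to simulate a failure.

```shell
# Dry-run stage runner: prints the stage name and fails on request.
# $1: stage name, $2: name of the stage that should fail (may be empty).
run_stage() {
  echo "==> $1"
  if [ "$1" = "$2" ]; then
    return 1
  fi
  return 0
}

# Run the stages in CI order and stop at the first failure.
# $1 (optional): stage to fail at, for demonstration.
run_pipeline() {
  fail_at="${1:-}"
  for stage in terraform ansible flux-bootstrap rancher-wait b2-restore health-checks; do
    if ! run_stage "$stage" "$fail_at"; then
      echo "stage failed: $stage" >&2
      return 1
    fi
  done
  return 0
}

# Example: all six stages run in order.
run_pipeline
```

Calling `run_pipeline flux-bootstrap` stops after the third stage, so the Rancher wait, B2 restore, and health checks never start.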
|
|
||||||
|
|
||||||
## Editing Practices
|
|
||||||
|
|
||||||
- Read target file and adjacent patterns before editing.
|
|
||||||
- Run the narrowest validation command after edits.
|
|
||||||
- If you make a live-cluster workaround, also update the declarative manifests so Flux can own it.
|
|
||||||
- Changes spanning Terraform + Ansible + Flux: update and verify each layer separately.
|
|
||||||
- Check `git status` before and after changes.
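"Run the narrowest validation command" can be made mechanical by mapping an edited path to the matching check from this guide. The helper below is a sketch under assumptions: `validation_cmd_for` is a hypothetical name, it only prints the command instead of running it, and the example addon directory `cert-manager` is assumed for illustration.

```shell
# Print the narrowest validation command for a repo-relative path.
validation_cmd_for() {
  case "$1" in
    terraform/*)
      echo "terraform -chdir=terraform validate" ;;
    ansible/*.py)
      echo "python3 -m py_compile $1" ;;
    ansible/*)
      echo "ansible-playbook -i ansible/inventory.ini ansible/site.yml --syntax-check" ;;
    infrastructure/addons/*/*)
      # Extract the addon directory name from the path.
      addon=${1#infrastructure/addons/}
      addon=${addon%%/*}
      echo "kubectl kustomize infrastructure/addons/${addon}" ;;
    clusters/prod/flux-system/*)
      echo "kubectl kustomize clusters/prod/flux-system" ;;
    *)
      echo "no dedicated check for $1" ;;
  esac
}

# Example: an edited addon manifest maps to a single-addon render.
validation_cmd_for infrastructure/addons/cert-manager/kustomization.yaml
```

Feeding it the output of `git status --porcelain` or `git diff --name-only` would give one command per touched area, which keeps validation scoped to the directories actually edited.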
|
|
||||||
|
|
||||||
## Cursor / Copilot Rules
|
|
||||||
|
|
||||||
- No `.cursor/rules/`, `.cursorrules`, or `.github/copilot-instructions.md` files exist.
|
|
||||||
- If added later, mirror their guidance here and treat them as authoritative.
|
|
||||||
|
|||||||