cleanup: Remove obsolete port-forwarding, deferred Traefik files, and CI workaround
All checks were successful
Deploy Cluster / Terraform (push) Successful in 2m21s
Deploy Cluster / Ansible (push) Successful in 13m9s

- Remove ansible/roles/private-access/ (replaced by Tailscale LB services)
- Remove deferred observability ingress/traefik files (replaced by direct Tailscale LBs)
- Remove orphaned kustomization-traefik-config.yaml (no backing directory)
- Simplify CI: remove SA patch + job deletion workaround for rancher-backup
  (now handled by postRenderer in HelmRelease)
- Update AGENTS.md to reflect current architecture
2026-04-02 01:21:23 +00:00
parent 0d339b3163
commit ceefcc3b29
8 changed files with 6 additions and 195 deletions


@@ -18,18 +18,19 @@ Repository guide for agentic contributors working in this repo.
- **cert-manager** is required — Tailscale LoadBalancer does L4 TCP passthrough, so Rancher serves its own TLS.
- **Secrets flow**: Doppler → `ClusterSecretStore` (doppler-hetznerterra) → `ExternalSecret` resources → k8s Secrets.
- Rancher is reachable only over Tailscale at `https://rancher.silverside-gopher.ts.net/`.
+- Grafana, Prometheus, and Flux UI are also exposed via dedicated Tailscale LoadBalancer services at `http://grafana.silverside-gopher.ts.net/`, `http://prometheus.silverside-gopher.ts.net/`, `http://flux.silverside-gopher.ts.net:9001/`.
## Important Files
- `terraform/main.tf` — provider and version pins
- `terraform/variables.tf` — input surface and defaults
- `terraform/firewall.tf` — firewall rules (tailnet CIDR, internal cluster ports)
-- `ansible/site.yml` — ordered bootstrap playbook (roles: common → k3s-server → ccm → k3s-agent → private-access → doppler → tailscale-cleanup)
+- `ansible/site.yml` — ordered bootstrap playbook (roles: common → k3s-server → ccm → k3s-agent → doppler → tailscale-cleanup)
- `ansible/generate_inventory.py` — renders `ansible/inventory.ini` from Terraform outputs via Jinja2
- `clusters/prod/flux-system/` — Flux GitRepository and top-level Kustomization resources
- `infrastructure/addons/kustomization.yaml` — root addon graph with dependency ordering
- `infrastructure/addons/<addon>/` — each addon is a self-contained dir with its own `kustomization.yaml`
-- `.gitea/workflows/deploy.yml` — canonical CI: terraform → ansible → flux bootstrap → rancher fix → B2 restore
+- `.gitea/workflows/deploy.yml` — canonical CI: terraform → ansible → flux bootstrap → B2 restore → health checks
## Build / Validate / Test
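The Doppler → `ClusterSecretStore` → `ExternalSecret` flow documented above could be sketched roughly as follows (resource names, namespace, and remote key names are illustrative; only the `doppler-hetznerterra` store name comes from this repo):

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: b2-credentials            # hypothetical resource name
  namespace: cattle-resources-system
spec:
  secretStoreRef:
    name: doppler-hetznerterra    # ClusterSecretStore backed by Doppler
    kind: ClusterSecretStore
  target:
    name: b2-credentials          # resulting k8s Secret
  data:
    # Key names must be accessKey/secretKey (see Known Issues below),
    # not aws_access_key_id/aws_secret_access_key.
    - secretKey: accessKey
      remoteRef:
        key: B2_ACCESS_KEY        # hypothetical Doppler secret name
    - secretKey: secretKey
      remoteRef:
        key: B2_SECRET_KEY        # hypothetical Doppler secret name
```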
@@ -109,7 +110,7 @@ Repository guide for agentic contributors working in this repo.
## Known Issues & Workarounds
-- **rancher-backup post-install job** (`rancher-backup-patch-sa`) fails because `rancher/kuberlr-kubectl` can't download kubectl. CI patches the SA and deletes the failed job. Do NOT set `s3` block in HelmRelease values — put S3 config in the Backup CR instead.
+- **rancher-backup post-install job** (`rancher-backup-patch-sa`) uses a postRenderer in the HelmRelease to replace the broken `rancher/kuberlr-kubectl` image with `rancher/kubectl`. Do NOT set `s3` block in HelmRelease values — put S3 config in the Backup CR instead.
- **B2 ExternalSecret** must use key names `accessKey` and `secretKey` (not `aws_access_key_id`/`aws_secret_access_key`).
- **Stale Tailscale devices**: After cluster rebuild, delete stale offline `rancher` devices before booting. The `tailscale-cleanup` Ansible role handles this via the Tailscale API.
- **Restricted B2 keys**: `b2_authorize_account` may return `allowed.bucketId: null`. CI falls back to `b2_list_buckets` to resolve bucket ID by name.
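The postRenderer fix mentioned above could take roughly this shape in the Flux HelmRelease, using the kustomize `images` transformer (chart details and the tag are illustrative, not taken from this repo):

```yaml
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: rancher-backup
  namespace: cattle-resources-system
spec:
  # ...chart, interval, and values elided...
  postRenderers:
    - kustomize:
        images:
          # Swap the broken image for one that ships a working kubectl
          - name: rancher/kuberlr-kubectl
            newName: rancher/kubectl
            newTag: v1.29.0        # hypothetical tag
```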
@@ -125,7 +126,7 @@ Repository guide for agentic contributors working in this repo.
1. Terraform: fmt check → init → validate → import existing servers → plan → apply (main only)
2. Ansible: install deps → generate inventory → run site.yml with extra vars (secrets injected from Gitea)
3. Flux bootstrap: install kubectl/flux → rewrite kubeconfig → apply CRDs → apply graph → wait for addons
-4. Rancher post-install: wait for Rancher/backup operator → patch SA → clean failed jobs → force reconcile
+4. Rancher wait: wait for Rancher and backup operator to be ready
5. B2 restore: authorize B2 → find latest backup → create Restore CR → poll until ready
6. Health checks: nodes, Flux objects, pods, storage class
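The Restore CR created in step 5 might look something like this, with S3 config on the CR rather than in HelmRelease values (bucket, endpoint, and secret names are illustrative):

```yaml
apiVersion: resources.cattle.io/v1
kind: Restore
metadata:
  name: restore-from-b2            # hypothetical name
spec:
  backupFilename: <latest-backup>.tar.gz   # resolved by CI from the B2 listing
  storageLocation:
    s3:
      credentialSecretName: b2-credentials        # hypothetical Secret name
      credentialSecretNamespace: cattle-resources-system
      bucketName: my-backup-bucket                # hypothetical bucket
      endpoint: s3.us-west-000.backblazeb2.com    # hypothetical B2 endpoint
```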