docs
@@ -1,52 +1,57 @@
# AGENTS.md

Repository guide for OpenCode sessions in this repo.
Compact repo guidance for OpenCode sessions. Trust executable sources over docs when they conflict.

## Read First

- Trust manifests and workflows over prose when they conflict.
- Highest-value sources: `terraform/main.tf`, `terraform/variables.tf`, `ansible/site.yml`, `clusters/prod/flux-system/`, `infrastructure/addons/kustomization.yaml`, `.gitea/workflows/deploy.yml`, `.gitea/workflows/destroy.yml`, `README.md`, `STABLE_BASELINE.md`, `scripts/refresh-kubeconfig.sh`, `scripts/smoke-check-tailnet-services.sh`.
- Highest-value sources: `.gitea/workflows/deploy.yml`, `.gitea/workflows/destroy.yml`, `terraform/main.tf`, `terraform/variables.tf`, `terraform/servers.tf`, `ansible/site.yml`, `ansible/inventory.tmpl`, `clusters/prod/flux-system/`, `infrastructure/addons/kustomization.yaml`.
- `STABLE_BASELINE.md` still contains stale Rancher backup/restore references; current workflows and addon manifests do not deploy or restore `rancher-backup`.

## Current Baseline
## Baseline

- HA private cluster: 3 control planes, 5 workers on Proxmox.
- Proxmox clones come from template `9000` on node `flex`; API VIP is `10.27.27.40` via kube-vip.
- Storage is `nfs-subdir-external-provisioner` backed by `10.27.27.239:/TheFlash/k8s-nfs` with StorageClass `flash-nfs`.
- Tailscale is the private access path for Rancher and shared services.
- Rancher, Grafana, and Prometheus are exposed through Tailscale; Flux UI / Weave GitOps is removed.
- `apps/` is suspended by default.
- Rancher stores state in embedded etcd; backup/restore uses `rancher-backup` to B2.
- Proxmox HA K3s cluster: 3 control planes, 5 workers, VMIDs `200-202` and `210-214`, node `flex`, template VMID `9000`, datastore `Flash`.
- API HA is kube-vip at `10.27.27.40`; control planes are `10.27.27.30-32`, workers are `10.27.27.41-45`.
- SSH user is `ubuntu`; Ansible derives the flannel iface from `ansible_default_ipv4.interface` with `eth0` fallback, so do not hard-code `ens18`.
- Storage is raw-manifest `nfs-subdir-external-provisioner` using `10.27.27.239:/TheFlash/k8s-nfs` and default StorageClass `flash-nfs`.
- Tailscale is the private access path. Rancher, Grafana, and Prometheus are exposed only through Tailscale services.
- `apps` is intentionally suspended in `clusters/prod/flux-system/kustomization-apps.yaml`.
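
The node layout above can be written out as a small table for quick reference. This is a minimal sketch, assuming VMIDs and IPs pair up in ascending order within each role; the bullets only give the ranges, so `terraform/servers.tf` remains the authority. `node_table` is a hypothetical helper, not a script in this repo.

```shell
#!/usr/bin/env bash
set -eu

# Print "role vmid ip" rows for the baseline layout.
# Assumption: within each role, VMIDs and IPs are paired in ascending order.
node_table() {
  for i in 0 1 2; do
    printf 'control-plane %d 10.27.27.%d\n' "$((200 + i))" "$((30 + i))"
  done
  for i in 0 1 2 3 4; do
    printf 'worker %d 10.27.27.%d\n' "$((210 + i))" "$((41 + i))"
  done
}

node_table
```

The kube-vip address `10.27.27.40` is deliberately absent from the table because it is the API VIP, not a node.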

## Common Commands
## Commands

- Terraform: `terraform -chdir=terraform fmt -recursive`, `terraform -chdir=terraform validate`, `terraform -chdir=terraform plan -var-file=../terraform.tfvars`, `terraform -chdir=terraform apply -var-file=../terraform.tfvars`
- Ansible: `ansible-galaxy collection install -r ansible/requirements.yml`, `cd ansible && python3 generate_inventory.py`, `ansible-playbook -i ansible/inventory.ini ansible/site.yml --syntax-check`, `ansible-playbook ansible/site.yml`
- Flux/Kustomize: `kubectl kustomize infrastructure/addons/<addon>`, `kubectl kustomize clusters/prod/flux-system`
- Kubeconfig refresh: `scripts/refresh-kubeconfig.sh <cp1-ip>`
- Tailnet smoke check: `ssh ubuntu@<cp1-ip> 'bash -s' < scripts/smoke-check-tailnet-services.sh`
- Terraform: `terraform -chdir=terraform fmt -recursive`, `terraform -chdir=terraform validate`, `terraform -chdir=terraform plan -var-file=../terraform.tfvars`, `terraform -chdir=terraform apply -var-file=../terraform.tfvars`.
- Ansible setup: `ansible-galaxy collection install -r ansible/requirements.yml`, then from `ansible/` run `python3 generate_inventory.py` and `ansible-playbook site.yml --syntax-check`.
- Flux/Kustomize checks: `kubectl kustomize infrastructure/addons/<addon>`, `kubectl kustomize infrastructure/addons`, `kubectl kustomize clusters/prod/flux-system`.
- Kubeconfig refresh: `scripts/refresh-kubeconfig.sh <cp1-ip>`; use this if local `kubectl` falls back to `localhost:8080` after rebuilds.
- Tailnet smoke check from cp1: `ssh ubuntu@<cp1-ip> 'bash -s' < scripts/smoke-check-tailnet-services.sh`.
- Fast Grafana content iteration uses `.gitea/workflows/dashboards.yml` and `ansible/dashboards.yml`, not a full cluster rebuild.
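
One way to pick among the checks listed above is to key them off the path that changed. The sketch below only prints the matching commands; it does not run them. The path-to-command mapping and the `checks_for_path` name are illustrative assumptions, not something defined in this repo.

```shell
#!/usr/bin/env bash
set -eu

# Print the validation commands that match a changed repo path.
# The mapping is an assumption built from the command list in this guide.
checks_for_path() {
  case "$1" in
    terraform/*)
      echo "terraform -chdir=terraform fmt -recursive"
      echo "terraform -chdir=terraform validate"
      ;;
    ansible/*)
      echo "(cd ansible && python3 generate_inventory.py && ansible-playbook site.yml --syntax-check)"
      ;;
    infrastructure/addons/*/*)
      # Render only the addon directory that contains the changed file.
      addon=${1#infrastructure/addons/}
      echo "kubectl kustomize infrastructure/addons/${addon%%/*}"
      ;;
    infrastructure/addons/*)
      echo "kubectl kustomize infrastructure/addons"
      ;;
    clusters/prod/flux-system/*)
      echo "kubectl kustomize clusters/prod/flux-system"
      ;;
    *)
      echo "# no targeted check for $1"
      ;;
  esac
}

checks_for_path "${1:-infrastructure/addons/kustomization.yaml}"
```

Printing instead of executing keeps the sketch safe to run on a machine without `terraform`, `ansible-playbook`, or `kubectl` installed.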

## Workflow Rules
## Deploy Flow

- Keep diffs small and validate only the directory you edited.
- Update manifests and docs together when behavior changes.
- Use `set -euo pipefail` in workflow shell blocks.
- CI deploy order is Terraform -> Ansible -> Flux bootstrap -> Rancher restore -> health checks.
- One object per Kubernetes YAML file; keep filenames kebab-case.
- If `kubectl` points at `localhost:8080` after a rebuild, refresh kubeconfig from the primary control-plane IP.
- Bootstrap assumptions that matter: SSH user is `ubuntu`, NIC is `ens18`, API join endpoint is the kube-vip address.
- Pushes to `main` run Gitea CI: Terraform fmt/init/validate/plan/apply, Proxmox cleanup/retry, Ansible bootstrap, Flux bootstrap, addon gates, Rancher gate, observability image seeding, health checks, tailnet smoke checks.
- Deploy and destroy workflows share `concurrency.group: prod-cluster`; destroy only requires workflow input `confirm: destroy` and has no backup gate.
- Keep `set -euo pipefail` in workflow shell blocks.
- Terraform retry cleanup has hard-coded target VMIDs/names in `.gitea/workflows/deploy.yml`; update it when changing node counts, names, or VMIDs.
- Fresh VMs have unreliable registry/chart egress, so critical images are prepared by `skopeo` on the runner and imported with `k3s ctr`; update the workflow archive lists when adding bootstrap-time images.
- CI applies `clusters/prod/flux-system/gotk-components.yaml` directly and then patches Flux controller deployments inline; changes only in `gotk-controller-cp1-patches.yaml` do not affect CI bootstrap.
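
The `set -euo pipefail` rule interacts with retry logic, because a failing command normally aborts the block. The sketch below shows one way a retry loop can coexist with strict mode. `retry` and `flaky` are hypothetical names for illustration; this is not the retry code in `.gitea/workflows/deploy.yml`.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Run a command up to $1 times, sleeping $2 seconds between failed attempts.
# Hypothetical helper, not taken from the deploy workflow.
retry() {
  local attempts=$1 delay=$2 n=1
  shift 2
  # A command used as an `until` condition may fail without tripping `set -e`.
  until "$@"; do
    if [ "$n" -ge "$attempts" ]; then
      echo "failed after $n attempts: $*" >&2
      return 1
    fi
    n=$((n + 1))
    sleep "$delay"
  done
}

# Stand-in for an unreliable step: succeeds on its third call.
calls=0
flaky() { calls=$((calls + 1)); [ "$calls" -ge 3 ]; }

retry 5 0 flaky
echo "succeeded after $calls calls"
```

Because `retry` returns non-zero once attempts run out, an unguarded call still stops the block under `set -e`, which is usually the desired behavior in CI.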

## Repo-Specific Gotchas
## GitOps Addons

- `rancher-backup` uses a postRenderer to swap the broken hook image to `rancher/kubectl:v1.34.0`; do not put S3 config in HelmRelease values. Put it in the Backup CR.
- Tailscale cleanup only runs before service proxies exist; it removes stale offline `rancher`/`grafana`/`prometheus`/`flux` devices, then must stop so live proxies are not deleted.
- Keep the vendored Tailscale operator chart at `infrastructure/charts/tailscale-operator` pinned to `1.96.5`; do not restore the remote HelmRepository unless cluster-side chart fetches are reliable.
- The repo no longer uses a cloud controller manager. If you see `providerID` or Hetzner-specific logic, it is stale.
- Current private URLs:
  - Rancher: `https://rancher.silverside-gopher.ts.net/`
  - Grafana: `http://grafana.silverside-gopher.ts.net/`
  - Prometheus: `http://prometheus.silverside-gopher.ts.net:9090/`
- Vendored charts are intentional: `infrastructure/charts/{cert-manager,traefik,kube-prometheus-stack,tailscale-operator,rancher}`. Do not restore remote `HelmRepository` objects unless cluster-side chart fetch reliability is intentionally changed.
- External Secrets and Loki/Promtail use Flux `OCIRepository`; Rancher, Tailscale, cert-manager, Traefik, and kube-prometheus-stack use `GitRepository` chart paths.
- Use fully qualified `helmchart.source.toolkit.fluxcd.io/...` in scripts; K3s also has `helmcharts.helm.cattle.io`, so `helmchart/...` can target the wrong resource.
- `doppler-bootstrap` only creates the `external-secrets` namespace and Doppler token secret. The deploy workflow creates `ClusterSecretStore/doppler-hetznerterra` after ESO CRDs and webhook endpoints exist.
- The checked-in `infrastructure/addons/external-secrets/clustersecretstore-doppler-hetznerterra.yaml` is not included by that addon kustomization; do not assume Flux applies it.
- Keep Kubernetes manifests one object per file with kebab-case filenames.
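
The kebab-case filename rule above can be checked mechanically. This is a minimal sketch: the exact pattern (lowercase alphanumerics joined by single hyphens, `.yaml` or `.yml` extension) is an assumption about what the rule means, and `is_kebab_yaml` is a hypothetical helper. It does not check the one-object-per-file half of the rule.

```shell
#!/usr/bin/env bash
set -eu

# Succeed when the basename is lowercase kebab-case with a YAML extension.
# Assumed pattern; adjust if the repo allows other characters.
is_kebab_yaml() {
  name=$(basename "$1")
  printf '%s\n' "$name" | grep -Eq '^[a-z0-9]+(-[a-z0-9]+)*\.ya?ml$'
}

# Report every argument that does not match; print nothing when all pass.
for f in "$@"; do
  is_kebab_yaml "$f" || echo "not kebab-case: $f"
done
```

Pass manifest paths as arguments, for example the files under `infrastructure/addons/`.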

## Gotchas

- Rancher chart `2.13.3` requires Kubernetes `<1.35.0-0`; K3s `latest` can break Rancher. Role defaults pin `v1.34.6+k3s1`; do not reintroduce a generated-inventory `k3s_version=latest` override.
- The repo no longer uses a cloud controller manager. `providerID`, Hetzner CCM/CSI, or Hetzner firewall/load-balancer logic is stale.
- Tailscale cleanup must only remove stale offline reserved hostnames before live service proxies exist; do not delete active `rancher`, `grafana`, `prometheus`, or `flux` devices.
- Proxmox endpoint should be the base URL, for example `https://100.105.0.115:8006/`; provider/workflow code strips `/api2/json` when needed.
- Current private URLs: Rancher `https://rancher.silverside-gopher.ts.net/`, Grafana `http://grafana.silverside-gopher.ts.net/`, Prometheus `http://prometheus.silverside-gopher.ts.net:9090/`.
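
The Rancher version constraint in the first gotcha can be guarded with a small check before a K3s version is changed. This is a sketch under stated assumptions: `k3s_version_ok` is a hypothetical helper, it compares only the major and minor numbers, and it treats any non-numeric value such as `latest` as a failure. It is not part of the Ansible role.

```shell
#!/usr/bin/env bash
set -eu

# Succeed when a K3s version such as v1.34.6+k3s1 is below Kubernetes 1.35.
# Channel names like "latest" are rejected because they cannot be compared.
k3s_version_ok() {
  version=${1#v}          # drop the leading "v"
  version=${version%%+*}  # drop the "+k3s1" build suffix
  case "$version" in
    [0-9]*.[0-9]*.[0-9]*) ;;
    *) return 1 ;;
  esac
  major=${version%%.*}
  rest=${version#*.}
  minor=${rest%%.*}
  [ "$major" -eq 1 ] && [ "$minor" -lt 35 ]
}

if k3s_version_ok "${1:-v1.34.6+k3s1}"; then
  echo "ok for Rancher chart 2.13.3"
else
  echo "blocked: Rancher chart 2.13.3 needs Kubernetes <1.35.0-0" >&2
  exit 1
fi
```

With no argument the script checks the pinned default `v1.34.6+k3s1`.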

## Secrets

- Runtime secrets live in Doppler + External Secrets.
- Bootstrap and CI secrets stay in Gitea; never commit secrets, kubeconfigs, or private keys.
- Runtime secrets are Doppler + External Secrets; Terraform/bootstrap/CI secrets stay in Gitea Actions secrets.
- Never commit secrets, kubeconfigs, private keys, `terraform.tfvars`, or generated `outputs/` artifacts.
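
The never-commit rule above can be backed by a path check run before committing. In this minimal sketch only `terraform.tfvars` and `outputs/` come from the rule itself; the kubeconfig and private-key filename patterns are illustrative guesses, and `is_forbidden_path` is a hypothetical helper rather than an existing hook in this repo.

```shell
#!/usr/bin/env bash
set -eu

# Succeed when a path looks like something that must stay out of Git.
# Only terraform.tfvars and outputs/ are named in the guide; the kubeconfig
# and key patterns below are assumptions.
is_forbidden_path() {
  case "$1" in
    terraform.tfvars|*/terraform.tfvars) return 0 ;;
    outputs/*|*/outputs/*) return 0 ;;
    kubeconfig|*/kubeconfig|*.kubeconfig) return 0 ;;
    *.pem|*.key|id_rsa|*/id_rsa|id_ed25519|*/id_ed25519) return 0 ;;
    *) return 1 ;;
  esac
}

status=0
for path in "$@"; do
  if is_forbidden_path "$path"; then
    echo "refusing to commit: $path" >&2
    status=1
  fi
done
# With `set -e`, a non-zero status here makes the script fail.
[ "$status" -eq 0 ]
```

One possible wiring is to feed it the output of `git diff --cached --name-only`. The patterns are intentionally narrow so that `scripts/refresh-kubeconfig.sh` is not flagged.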