# AGENTS.md

Compact repo guidance for OpenCode sessions. Trust executable sources over docs when they conflict.
## Read First

- Highest-value sources: `.gitea/workflows/deploy.yml`, `.gitea/workflows/destroy.yml`, `terraform/main.tf`, `terraform/variables.tf`, `terraform/servers.tf`, `ansible/site.yml`, `ansible/inventory.tmpl`, `clusters/prod/flux-system/`, `infrastructure/addons/kustomization.yaml`.
- `STABLE_BASELINE.md` still contains stale Rancher backup/restore references; current workflows and addon manifests do not deploy or restore `rancher-backup`.
## Baseline

- Proxmox HA K3s cluster: 3 control planes, 5 workers, VMIDs `200-202` and `210-214`, node `flex`, template VMID `9000`, datastore `Flash`.
- API HA is kube-vip at `10.27.27.40`; control planes are `10.27.27.30-32`, workers are `10.27.27.41-45`.
- SSH user is `ubuntu`; Ansible derives the flannel iface from `ansible_default_ipv4.interface` with `eth0` fallback, so do not hard-code `ens18`.
- Storage is raw-manifest `nfs-subdir-external-provisioner` using `10.27.27.239:/TheFlash/k8s-nfs` and default StorageClass `flash-nfs`.
- Tailscale is the private access path. Rancher, Grafana, and Prometheus are exposed only through Tailscale services.
- `apps` is intentionally suspended in `clusters/prod/flux-system/kustomization-apps.yaml`.
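The address/VMID layout above can be sketched in plain bash. This is only an illustration of the baseline numbering; the `cp*`/`wk*` hostnames are placeholders, not names the repo uses.

```shell
#!/usr/bin/env bash
# Sketch of the baseline layout: control planes at VMIDs 200-202 /
# IPs 10.27.27.30-32, workers at VMIDs 210-214 / IPs 10.27.27.41-45.
set -euo pipefail

VIP="10.27.27.40"   # kube-vip API VIP

# Control planes
for i in 0 1 2; do
  printf 'cp%d  vmid=%d  ip=10.27.27.%d\n' "$((i + 1))" "$((200 + i))" "$((30 + i))"
done

# Workers
for i in 0 1 2 3 4; do
  printf 'wk%d  vmid=%d  ip=10.27.27.%d\n' "$((i + 1))" "$((210 + i))" "$((41 + i))"
done

echo "api-vip=${VIP}"
```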
## Commands

- Terraform: `terraform -chdir=terraform fmt -recursive`, `terraform -chdir=terraform validate`, `terraform -chdir=terraform plan -var-file=../terraform.tfvars`, `terraform -chdir=terraform apply -var-file=../terraform.tfvars`.
- Ansible setup: `ansible-galaxy collection install -r ansible/requirements.yml`, then from `ansible/` run `python3 generate_inventory.py` and `ansible-playbook site.yml --syntax-check`.
- Flux/Kustomize checks: `kubectl kustomize infrastructure/addons/<addon>`, `kubectl kustomize infrastructure/addons`, `kubectl kustomize clusters/prod/flux-system`.
- Kubeconfig refresh: `scripts/refresh-kubeconfig.sh <cp1-ip>`; use this if local `kubectl` falls back to `localhost:8080` after rebuilds.
- Tailnet smoke check from cp1: `ssh ubuntu@<cp1-ip> 'bash -s' < scripts/smoke-check-tailnet-services.sh`.
- Fast Grafana content iteration uses `.gitea/workflows/dashboards.yml` and `ansible/dashboards.yml`, not a full cluster rebuild.
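A hypothetical local preflight wrapper chaining the validation commands above. The `run` helper and the exact check list are illustrative, not part of the repo; each check is skipped when its CLI is not installed.

```shell
#!/usr/bin/env bash
# Illustrative preflight: run each repo check if its tool exists,
# report OK/FAIL/SKIP, and never abort mid-list.
set -euo pipefail

run() {
  local tool="$1"; shift
  if ! command -v "$tool" >/dev/null 2>&1; then
    echo "SKIP (missing $tool): $*"
    return 0
  fi
  if "$@"; then
    echo "OK:   $*"
  else
    echo "FAIL: $*"
  fi
}

run terraform terraform -chdir=terraform validate
run ansible-playbook ansible-playbook ansible/site.yml --syntax-check
run kubectl kubectl kustomize infrastructure/addons
run kubectl kubectl kustomize clusters/prod/flux-system
```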
## Deploy Flow

- Pushes to `main` run Gitea CI: Terraform fmt/init/validate/plan/apply, Proxmox cleanup/retry, Ansible bootstrap, Flux bootstrap, addon gates, Rancher gate, observability image seeding, health checks, tailnet smoke checks.
- Deploy and destroy workflows share `concurrency.group: prod-cluster`; destroy only requires workflow input `confirm: destroy` and has no backup gate.
- Keep `set -euo pipefail` in workflow shell blocks.
- Terraform retry cleanup has hard-coded target VMIDs/names in `.gitea/workflows/deploy.yml`; update it when changing node counts, names, or VMIDs.
- Fresh VMs have unreliable registry/chart egress, so critical images are prepared by `skopeo` on the runner and imported with `k3s ctr`; update the workflow archive lists when adding bootstrap-time images.
- CI applies `clusters/prod/flux-system/gotk-components.yaml` directly and then patches Flux controller deployments inline; changes only in `gotk-controller-cp1-patches.yaml` do not affect CI bootstrap.
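A minimal sketch of the `skopeo`-to-`k3s ctr` seeding pattern described above. The image name, archive path, and node address are placeholders, not the workflow's real archive lists; the `seed` dry-run helper is purely illustrative.

```shell
#!/usr/bin/env bash
# Illustrative image-seeding sketch: fetch once on the runner with skopeo,
# then import into the node's k3s containerd store.
set -euo pipefail

IMAGE="docker.io/library/busybox:1.36"   # placeholder image
ARCHIVE="/tmp/seed-image.tar"            # placeholder archive path
NODE="ubuntu@10.27.27.30"                # cp1 from the baseline

# DRY_RUN=1 (the default here) only prints each command.
seed() {
  if [ "${DRY_RUN:-1}" = "1" ]; then echo "would run: $*"; else "$@"; fi
}

# Runner side: pull once, avoiding flaky egress from fresh VMs.
seed skopeo copy "docker://${IMAGE}" "docker-archive:${ARCHIVE}"

# Node side: copy the archive over and import it.
seed scp "${ARCHIVE}" "${NODE}:/tmp/"
seed ssh "${NODE}" "sudo k3s ctr images import /tmp/seed-image.tar"
```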
## GitOps Addons

- Vendored charts are intentional: `infrastructure/charts/{cert-manager,traefik,kube-prometheus-stack,tailscale-operator,rancher}`. Do not restore remote `HelmRepository` objects unless cluster-side chart fetch reliability is intentionally changed.
- External Secrets and Loki/Promtail use Flux `OCIRepository`; Rancher, Tailscale, cert-manager, Traefik, and kube-prometheus-stack use `GitRepository` chart paths.
- Use fully qualified `helmchart.source.toolkit.fluxcd.io/...` in scripts; K3s also has `helmcharts.helm.cattle.io`, so `helmchart/...` can target the wrong resource.
- `doppler-bootstrap` only creates the `external-secrets` namespace and Doppler token secret. The deploy workflow creates `ClusterSecretStore/doppler-hetznerterra` after ESO CRDs and webhook endpoints exist.
- The checked-in `infrastructure/addons/external-secrets/clustersecretstore-doppler-hetznerterra.yaml` is not included by that addon kustomization; do not assume Flux applies it.
- Keep Kubernetes manifests one object per file with kebab-case filenames.
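The fully-qualified-kind rule above can be shown with a small guarded snippet; this is a sketch, and it only queries a cluster when `kubectl` is actually available.

```shell
#!/usr/bin/env bash
# On K3s, both source.toolkit.fluxcd.io and helm.cattle.io define a
# "helmchart"-shaped resource, so the short name is ambiguous: always
# spell out the Flux API group in scripts.
set -euo pipefail

FLUX_KIND="helmchart.source.toolkit.fluxcd.io"

if command -v kubectl >/dev/null 2>&1; then
  kubectl get "${FLUX_KIND}" -A || true   # unambiguous query
  # Avoid: kubectl get helmchart -A      # may hit helmcharts.helm.cattle.io
else
  echo "kubectl not installed; would run: kubectl get ${FLUX_KIND} -A"
fi
```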
## Gotchas

- Rancher chart `2.13.3` requires Kubernetes `<1.35.0-0`; K3s `latest` can break Rancher. Role defaults pin `v1.34.6+k3s1`; do not reintroduce a generated-inventory `k3s_version=latest` override.
- The repo no longer uses a cloud controller manager. Any remaining `providerID`, Hetzner CCM/CSI, or Hetzner firewall/load-balancer logic is stale.
- Tailscale cleanup must only remove stale offline reserved hostnames before live service proxies exist; do not delete active `rancher`, `grafana`, `prometheus`, or `flux` devices.
- Proxmox endpoint should be the base URL, for example `https://100.105.0.115:8006/`; provider/workflow code strips `/api2/json` when needed.
- Current private URLs: Rancher `https://rancher.silverside-gopher.ts.net/`, Grafana `http://grafana.silverside-gopher.ts.net/`, Prometheus `http://prometheus.silverside-gopher.ts.net:9090/`.
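The endpoint normalization mentioned in the Proxmox gotcha lives in provider/workflow code; a minimal bash equivalent of that stripping might look like this (the `normalize_endpoint` helper is illustrative, not a repo function):

```shell
#!/usr/bin/env bash
# Sketch: reduce a Proxmox endpoint to its base URL by dropping a
# trailing slash and an /api2/json suffix if present.
set -euo pipefail

normalize_endpoint() {
  local url="${1%/}"                 # drop a trailing slash
  printf '%s\n' "${url%/api2/json}"  # strip the API suffix if present
}

normalize_endpoint "https://100.105.0.115:8006/"
normalize_endpoint "https://100.105.0.115:8006/api2/json"
# both print: https://100.105.0.115:8006
```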
## Secrets

- Runtime secrets are Doppler + External Secrets; Terraform/bootstrap/CI secrets stay in Gitea Actions secrets.
- Never commit secrets, kubeconfigs, private keys, `terraform.tfvars`, or generated `outputs/` artifacts.
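A hypothetical pre-commit guard for the never-commit list above; the patterns are illustrative examples, not the repo's actual policy or hook.

```shell
#!/usr/bin/env bash
# Illustrative guard: reject candidate paths that should never be
# committed (secrets, kubeconfigs, private keys, tfvars, outputs).
set -euo pipefail

is_forbidden() {
  case "$1" in
    terraform.tfvars|*kubeconfig*|*id_rsa*|*id_ed25519*|outputs/*) return 0 ;;
    *) return 1 ;;
  esac
}

for f in "$@"; do
  if is_forbidden "$f"; then
    echo "refusing to commit: $f" >&2
    exit 1
  fi
done
```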