# AGENTS.md

Repository guide for OpenCode sessions.
## Read First

- Trust manifests and workflows over prose when they conflict.
- Highest-value sources: `terraform/main.tf`, `terraform/variables.tf`, `ansible/site.yml`, `clusters/prod/flux-system/`, `infrastructure/addons/kustomization.yaml`, `.gitea/workflows/deploy.yml`, `.gitea/workflows/destroy.yml`, `README.md`, `STABLE_BASELINE.md`, `scripts/refresh-kubeconfig.sh`, `scripts/smoke-check-tailnet-services.sh`.
## Current Baseline
- HA private cluster: 3 control planes, 5 workers on Proxmox.
- Proxmox clones come from template `9000` on node `flex`; API VIP is `10.27.27.40` via kube-vip.
- Storage is `nfs-subdir-external-provisioner` backed by `10.27.27.22:/TheFlash/k8s-nfs` with StorageClass `flash-nfs`.
- Tailscale is the private access path for Rancher and shared services.
- Rancher, Grafana, and Prometheus are exposed through Tailscale; Flux UI / Weave GitOps is removed.
- `apps/` is suspended by default.
- Rancher stores state in embedded etcd; backup/restore uses `rancher-backup` to B2.
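The baseline facts above can be captured as shell constants for scripting against the cluster. This is a hedged sketch, not a repo script: the variable names and the `kubectl` probes are illustrative, and the standard `6443` API port is an assumption.

```shell
#!/usr/bin/env bash
# Baseline facts from this section as shell constants. The kubectl
# checks are an illustrative sketch and only run when a cluster is
# actually reachable from this shell.
set -euo pipefail

API_VIP="10.27.27.40"            # kube-vip address
NFS_SERVER="10.27.27.22"
NFS_EXPORT="/TheFlash/k8s-nfs"
STORAGE_CLASS="flash-nfs"

if command -v kubectl >/dev/null 2>&1 && kubectl cluster-info >/dev/null 2>&1; then
  # Expect the NFS-backed StorageClass and 3 control planes + 5 workers.
  kubectl get storageclass "$STORAGE_CLASS"
  [ "$(kubectl get nodes --no-headers | wc -l)" -eq 8 ]
fi
echo "API endpoint: https://${API_VIP}:6443"
```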
## Common Commands
- Terraform: `terraform -chdir=terraform fmt -recursive`, `terraform -chdir=terraform validate`, `terraform -chdir=terraform plan -var-file=../terraform.tfvars`, `terraform -chdir=terraform apply -var-file=../terraform.tfvars`
- Ansible: `ansible-galaxy collection install -r ansible/requirements.yml`, `cd ansible && python3 generate_inventory.py`, `ansible-playbook -i ansible/inventory.ini ansible/site.yml --syntax-check`, `ansible-playbook ansible/site.yml`
- Flux/Kustomize: `kubectl kustomize infrastructure/addons/<addon>`, `kubectl kustomize clusters/prod/flux-system`
- Kubeconfig refresh: `scripts/refresh-kubeconfig.sh <cp1-ip>`
- Tailnet smoke check: `ssh ubuntu@<cp1-ip> 'bash -s' < scripts/smoke-check-tailnet-services.sh`
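Before reaching for the refresh script, it can help to detect the stale-kubeconfig symptom up front. This is a hypothetical helper, not part of `scripts/refresh-kubeconfig.sh`; the function names are illustrative:

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check: spot the localhost:8080 symptom before
# running cluster commands, then point at the repo's refresh script.
set -euo pipefail

kubeconfig_server() {
  # Extract the first cluster server URL from a kubeconfig file.
  awk '/^ *server:/ { print $2; exit }' "$1"
}

needs_refresh() {
  # kubectl falls back to localhost:8080 when no usable kubeconfig is set.
  local server="$1"
  [[ -z "$server" || "$server" == *"localhost:8080"* ]]
}

server="$(kubeconfig_server "${KUBECONFIG:-$HOME/.kube/config}" 2>/dev/null || true)"
if needs_refresh "$server"; then
  echo "kubeconfig stale; run: scripts/refresh-kubeconfig.sh <cp1-ip>"
else
  echo "kubeconfig points at $server"
fi
```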
## Workflow Rules
- Keep diffs small and validate only the directory you edited.
- Update manifests and docs together when behavior changes.
- Use `set -euo pipefail` in workflow shell blocks.
- CI deploy order is Terraform -> Ansible -> Flux bootstrap -> Rancher restore -> health checks.
- One object per Kubernetes YAML file; keep filenames kebab-case.
- If `kubectl` points at `localhost:8080` after a rebuild, refresh kubeconfig from the primary control-plane IP.
- Bootstrap assumptions that matter: SSH user is `ubuntu`, NIC is `ens18`, API join endpoint is the kube-vip address.
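The deploy order above can be sketched as a workflow shell block. The function names and bodies here are placeholders, not the contents of `.gitea/workflows/deploy.yml`:

```shell
#!/usr/bin/env bash
# Illustrative skeleton of the CI deploy order; real steps live in
# .gitea/workflows/deploy.yml. Each function body is a placeholder.
set -euo pipefail

run_terraform()   { echo "terraform";       }
run_ansible()     { echo "ansible";         }
bootstrap_flux()  { echo "flux-bootstrap";  }
restore_rancher() { echo "rancher-restore"; }
health_checks()   { echo "health-checks";   }

# With set -euo pipefail, a failure in any step aborts the whole run,
# so later steps never execute against a half-deployed cluster.
run_terraform
run_ansible
bootstrap_flux
restore_rancher
health_checks
```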
## Repo-Specific Gotchas
- `rancher-backup` uses a postRenderer to swap the broken hook image to `rancher/kubectl:v1.34.0`; do not put S3 config in HelmRelease values, put it in the Backup CR.
- Tailscale cleanup only runs before service proxies exist; it removes stale offline `rancher`/`grafana`/`prometheus`/`flux` devices, then must stop so live proxies are not deleted.
- Keep the Tailscale operator on the stable Helm repo `https://pkgs.tailscale.com/helmcharts` at `1.96.5` unless you have a reason to change it.
- The repo no longer uses a cloud controller manager. If you see `providerID` or Hetzner-specific logic, it is stale.
- Current private URLs:
  - Rancher: `https://rancher.silverside-gopher.ts.net/`
  - Grafana: `http://grafana.silverside-gopher.ts.net/`
  - Prometheus: `http://prometheus.silverside-gopher.ts.net:9090/`
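A quick reachability pass over the URLs above can be sketched as follows. The real check is `scripts/smoke-check-tailnet-services.sh`; this standalone sketch only probes that each endpoint responds, and the `curl` flags are illustrative:

```shell
#!/usr/bin/env bash
# Hedged sketch of a tailnet reachability check; the canonical logic
# lives in scripts/smoke-check-tailnet-services.sh. Requires tailnet
# access for the probes to succeed.
set -euo pipefail

urls=(
  "https://rancher.silverside-gopher.ts.net/"
  "http://grafana.silverside-gopher.ts.net/"
  "http://prometheus.silverside-gopher.ts.net:9090/"
)

check_url() {
  # -f fails on HTTP errors; -k tolerates Rancher's cert inside the tailnet.
  curl -skf --max-time 5 -o /dev/null "$1"
}

for url in "${urls[@]}"; do
  if check_url "$url"; then echo "OK   $url"; else echo "FAIL $url"; fi
done
```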
## Secrets
- Runtime secrets live in Doppler + External Secrets.
- Bootstrap and CI secrets stay in Gitea; never commit secrets, kubeconfigs, or private keys.