# AGENTS.md

Repository guide for agentic contributors working in this repo.
## Scope

- Infrastructure repo for a Hetzner + k3s + Flux stack running Rancher.
- Primary areas: `terraform/`, `ansible/`, `clusters/`, `infrastructure/`, `apps/`, `.gitea/workflows/`.
- Treat `README.md` and `STABLE_BASELINE.md` as user-facing context, but prefer current manifests and workflows as the source of truth.
- Keep changes small and reviewable; prefer the narrowest file set that solves the task.
## Architecture

- Terraform provisions Hetzner servers, network, firewall, load balancer, and SSH keys.
- Ansible bootstraps the OS, installs k3s (with the external cloud provider), and deploys Hetzner CCM, Tailscale, and the Doppler token.
- Flux reconciles all cluster addons from this repo after Ansible hands off.
- Rancher stores state in embedded etcd (NOT an external DB). Backup/restore uses the `rancher-backup` operator to B2.
- cert-manager is required: the Tailscale LoadBalancer does L4 TCP passthrough, so Rancher serves its own TLS.
- Secrets flow: Doppler → `ClusterSecretStore` (`doppler-hetznerterra`) → `ExternalSecret` resources → k8s Secrets.
- Rancher is reachable only over Tailscale at `https://rancher.silverside-gopher.ts.net/`.
- Grafana, Prometheus, and the Flux UI are also exposed via dedicated Tailscale LoadBalancer services at `http://grafana.silverside-gopher.ts.net/`, `http://prometheus.silverside-gopher.ts.net:9090/`, and `http://flux.silverside-gopher.ts.net:9001/`.
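The secrets flow above can be sketched as an `ExternalSecret` manifest. This is a hypothetical example, not copied from the repo: only the store name `doppler-hetznerterra` comes from this guide; every `example-*` name and the Doppler key are illustrative placeholders.

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: example-credentials        # illustrative name, not from this repo
  namespace: example-namespace     # illustrative namespace
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: doppler-hetznerterra     # the store named in this guide
  target:
    name: example-credentials      # the resulting k8s Secret
  data:
    - secretKey: token             # key inside the k8s Secret
      remoteRef:
        key: EXAMPLE_TOKEN         # Doppler secret name (illustrative)
```

The External Secrets Operator reconciles this into a plain k8s Secret, which is why manifests in this repo never embed secret material directly.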
## Important Files

- `terraform/main.tf` — provider and version pins
- `terraform/variables.tf` — input surface and defaults
- `terraform/firewall.tf` — firewall rules (tailnet CIDR, internal cluster ports)
- `ansible/site.yml` — ordered bootstrap playbook (roles: common → k3s-server → ccm → k3s-agent → doppler → tailscale-cleanup)
- `ansible/generate_inventory.py` — renders `ansible/inventory.ini` from Terraform outputs via Jinja2
- `clusters/prod/flux-system/` — Flux GitRepository and top-level Kustomization resources
- `infrastructure/addons/kustomization.yaml` — root addon graph with dependency ordering
- `infrastructure/addons/<addon>/` — each addon is a self-contained dir with its own `kustomization.yaml`
- `.gitea/workflows/deploy.yml` — canonical CI: terraform → ansible → flux bootstrap → B2 restore → health checks
## Build / Validate / Test

### Terraform

- Format: `terraform -chdir=terraform fmt -recursive`
- Check formatting: `terraform -chdir=terraform fmt -check -recursive`
- Validate: `terraform -chdir=terraform validate`
- Plan (full): `terraform -chdir=terraform plan -var-file=../terraform.tfvars`
- Plan one resource: `terraform -chdir=terraform plan -var-file=../terraform.tfvars -target=hcloud_server.control_plane[0]`
- Apply: `terraform -chdir=terraform apply -var-file=../terraform.tfvars`
- State inspection: `terraform -chdir=terraform state list` / `terraform -chdir=terraform state show <address>`
### Ansible

- Install collections: `ansible-galaxy collection install -r ansible/requirements.yml`
- Generate inventory: `cd ansible && python3 generate_inventory.py` (requires Terraform outputs)
- Syntax check: `ansible-playbook -i ansible/inventory.ini ansible/site.yml --syntax-check`
- Dry-run one host: `ansible-playbook -i ansible/inventory.ini ansible/site.yml --check --diff -l control_plane[0]`
- Full bootstrap: `ansible-playbook ansible/site.yml`
- Targeted: `ansible-playbook ansible/site.yml -t upgrade` or `-t reset`
- Dashboards only: `ansible-playbook ansible/dashboards.yml`
### Python

- Syntax check: `python3 -m py_compile ansible/generate_inventory.py`
- Run: `cd ansible && python3 generate_inventory.py`
### Kubernetes / Flux manifests

- Render a single addon: `kubectl kustomize infrastructure/addons/<addon>`
- Render cluster bootstrap: `kubectl kustomize clusters/prod/flux-system`
- Validate only the directory you edited, not the whole repo.
### Kubeconfig refresh

- Preferred: `scripts/refresh-kubeconfig.sh <cp1-public-ip>`
- Manual: `ssh -i ~/.ssh/infra root@<cp1-ip> "cat /etc/rancher/k3s/k3s.yaml" | sed 's/127.0.0.1/<cp1-ip>/g' > outputs/kubeconfig`
## Code Style

### General

- Match the existing style in adjacent files. No new tools/frameworks unless the repo already uses them.
- Prefer ASCII. Keep diffs minimal. No unrelated cleanup.
- No comments unless the logic is non-obvious.
### Terraform / HCL

- 2-space indent.
- `terraform {}` block first, then providers, locals, variables, resources, outputs.
- `snake_case` for variables, locals, and resources.
- Descriptions on all variables/outputs. `sensitive = true` on secrets.
- Run `terraform fmt` instead of hand-formatting.
- Use `locals` for reused or non-trivial logic. Explicit `depends_on` only when required.
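A minimal sketch of these conventions, assuming hypothetical names (`hcloud_token`, `cluster_name` are illustrative, not taken from this repo's variable surface):

```hcl
# Illustrative only: names are hypothetical, not from terraform/variables.tf.
variable "hcloud_token" {
  description = "Hetzner Cloud API token used by the hcloud provider."
  type        = string
  sensitive   = true # never echoed in plan output
}

variable "cluster_name" {
  description = "Short name used to label and prefix cluster resources."
  type        = string
}

locals {
  # Reused, non-trivial logic belongs in locals rather than inline expressions.
  common_labels = {
    managed_by = "terraform"
    cluster    = var.cluster_name
  }
}
```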
### Ansible / YAML

- 2-space YAML indent. Descriptive task names in sentence case.
- Idempotent tasks: `changed_when: false` and `failed_when: false` for probes.
- `command`/`shell` only when no dedicated module fits. `shell` only for pipes/redirection/heredocs.
- `when` guards and `default(...)` filters over duplicated tasks.
- Role names and filenames: kebab-case. Variables: snake_case.
- Multi-line shell in workflows: `set -e` or `set -euo pipefail` for fail-fast.
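The probe pattern above might look like this sketch; the task names and register variable are illustrative, not taken from the repo's roles:

```yaml
# Hypothetical probe: read-only check, so it never reports "changed" or "failed".
- name: Check whether k3s is already installed
  ansible.builtin.command: k3s --version
  register: k3s_version_probe
  changed_when: false   # a probe never changes state
  failed_when: false    # a missing binary is handled by the next task, not a failure

# shell (not command) because the installer needs a pipe.
- name: Install k3s when missing
  ansible.builtin.shell: curl -sfL https://get.k3s.io | sh -
  when: k3s_version_probe.rc != 0
```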
### Kubernetes / Flux YAML

- One object per file. Kebab-case filenames matching repo patterns: `helmrelease-*.yaml`, `kustomization-*.yaml`, `*-externalsecret.yaml`.
- Addon manifests live in `infrastructure/addons/<addon>/` with a `kustomization.yaml`.
- Flux graph objects in `clusters/prod/flux-system/`.
- Each addon gets a `kustomization-<addon>.yaml` entry in `infrastructure/addons/` with `dependsOn` for ordering.
- Quote strings containing `:`, `*`, cron expressions, or shell-sensitive chars.
- Preserve existing labels/annotations unless the change specifically needs them.
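A `kustomization-<addon>.yaml` entry with `dependsOn` could be sketched as follows; the addon name, path, interval, and the dependency on cert-manager are illustrative, so match them to the actual graph in `infrastructure/addons/`:

```yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: example-addon          # illustrative addon name
  namespace: flux-system
spec:
  interval: 10m
  path: ./infrastructure/addons/example-addon
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system
  dependsOn:
    - name: cert-manager       # reconcile only after cert-manager is ready
```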
### Python

- PEP 8. Imports ordered: stdlib, third-party, local. `snake_case` for functions/variables.
- Keep scripts small and explicit. Exit non-zero on failure. Clear subprocess error handling.
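A small sketch of these conventions (function names are illustrative, not taken from `ansible/generate_inventory.py`):

```python
"""Sketch of the script style above: small, explicit, non-zero exit on failure."""
import subprocess
import sys


def run_checked(cmd: list[str]) -> str:
    """Run a command and return stdout; report clearly and re-raise on failure."""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    except subprocess.CalledProcessError as exc:
        print(f"command failed ({exc.returncode}): {' '.join(cmd)}", file=sys.stderr)
        raise
    return result.stdout


def main() -> int:
    try:
        out = run_checked(["python3", "--version"])  # illustrative probe command
    except subprocess.CalledProcessError:
        return 1  # exit non-zero on failure
    print(out.strip())
    return 0


if __name__ == "__main__":
    sys.exit(main())
```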
## Known Issues & Workarounds

- The rancher-backup post-install job (`rancher-backup-patch-sa`) uses a postRenderer in the HelmRelease to replace the broken `rancher/kuberlr-kubectl` image with `rancher/kubectl`. Do NOT set the `s3` block in HelmRelease values — put S3 config in the Backup CR instead.
- The B2 ExternalSecret must use key names `accessKey` and `secretKey` (not `aws_access_key_id`/`aws_secret_access_key`).
- Stale Tailscale devices: after a cluster rebuild, delete stale offline `rancher` devices before booting. The `tailscale-cleanup` Ansible role handles this via the Tailscale API.
- Restricted B2 keys: `b2_authorize_account` may return `allowed.bucketId: null`. CI falls back to `b2_list_buckets` to resolve the bucket ID by name.
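"S3 config in the Backup CR instead" could be sketched as below. This is a hedged guess at the rancher-backup operator's `Backup` CRD shape, not copied from this repo: the bucket, endpoint, and secret names are illustrative, so verify every field against the installed CRD before relying on it.

```yaml
apiVersion: resources.cattle.io/v1
kind: Backup
metadata:
  name: example-b2-backup                          # illustrative name
spec:
  resourceSetName: rancher-resource-set            # default rancher-backup resource set
  storageLocation:
    s3:
      credentialSecretName: example-b2-credentials # Secret with accessKey/secretKey
      credentialSecretNamespace: cattle-resources-system
      bucketName: example-bucket                   # illustrative bucket
      endpoint: s3.us-west-004.backblazeb2.com     # illustrative B2 endpoint
```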
## Secrets / Security

- Never commit tokens, passwords, kubeconfigs, private keys, or generated secrets.
- Runtime secrets come via Gitea secrets (CI), Doppler, or the External Secrets Operator.
- `terraform.tfvars` and `outputs/` are gitignored. Never print secret values in logs or commits.
## CI Pipeline (`.gitea/workflows/deploy.yml`)

- Terraform: fmt check → init → validate → import existing servers → plan → apply (main only)
- Ansible: install deps → generate inventory → run `site.yml` with extra vars (secrets injected from Gitea)
- Flux bootstrap: install kubectl/flux → rewrite kubeconfig → apply CRDs → apply graph → wait for addons
- Rancher wait: wait for Rancher and the backup operator to be ready
- B2 restore: authorize B2 → find latest backup → create Restore CR → poll until ready
- Health checks: nodes, Flux objects, pods, storage class
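The "poll until ready" steps above follow a generic retry shape. This is a hypothetical helper, not the workflow's actual code; the function name and arguments are illustrative:

```shell
# poll_until ATTEMPTS DELAY CMD...: rerun CMD until it succeeds or attempts run out.
poll_until() {
  local attempts=$1 delay=$2
  shift 2
  for _ in $(seq 1 "$attempts"); do
    if "$@"; then
      return 0            # command succeeded: we're ready
    fi
    sleep "$delay"        # wait before the next probe
  done
  return 1                # never became ready
}

# Usage sketch: in CI the probed command would be a kubectl/flux readiness check.
poll_until 3 1 true && echo "ready"
```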
## Editing Practices

- Read the target file and adjacent patterns before editing.
- Run the narrowest validation command after edits.
- If you make a live-cluster workaround, also update the declarative manifests so Flux can own it.
- For changes spanning Terraform + Ansible + Flux: update and verify each layer separately.
- Check `git status` before and after changes.
## Cursor / Copilot Rules

- No `.cursor/rules/`, `.cursorrules`, or `.github/copilot-instructions.md` files exist.
- If added later, mirror their guidance here and treat them as authoritative.