115 Commits

Author SHA1 Message Date
micqdf e9327b0c61 fix: stop preloading observability images everywhere
Deploy Cluster / Terraform (push) Successful in 34s
Deploy Cluster / Ansible (push) Failing after 54m12s
2026-05-01 07:52:35 +00:00
micqdf cf49f8bf03 fix: make observability image seeding best effort
Deploy Cluster / Terraform (push) Successful in 33s
Deploy Cluster / Ansible (push) Failing after 1h9m9s
2026-04-30 21:02:20 +00:00
micqdf d57e8c8fe8 fix: reset tailscale helm release directly
Deploy Cluster / Terraform (push) Successful in 32s
Deploy Cluster / Ansible (push) Failing after 33m39s
2026-04-30 20:25:48 +00:00
micqdf 93a2a42917 fix: simplify tailscale operator health gate
Deploy Cluster / Terraform (push) Successful in 33s
Deploy Cluster / Ansible (push) Failing after 40m18s
2026-04-30 19:34:33 +00:00
micqdf 5cf68771dd fix: wait longer for flux health reconciles
Deploy Cluster / Terraform (push) Successful in 31s
Deploy Cluster / Ansible (push) Failing after 41m42s
2026-04-30 17:26:16 +00:00
micqdf 6d6e3e8371 fix: import runner image archives during prepull
Deploy Cluster / Terraform (push) Successful in 32s
Deploy Cluster / Ansible (push) Failing after 46m46s
2026-04-30 09:08:44 +00:00
micqdf 353a408dac fix: support bullseye pip in gitea runner
Deploy Cluster / Terraform (push) Successful in 31s
Deploy Cluster / Ansible (push) Failing after 18m1s
2026-04-30 07:57:55 +00:00
micqdf b3612083ad fix: disable terraform wrapper in gitea workflows
Deploy Cluster / Terraform (push) Successful in 31s
Deploy Cluster / Ansible (push) Failing after 25s
2026-04-30 07:54:48 +00:00
micqdf 8c0dbd997d fix: simplify terraform deploy job lifecycle
Deploy Cluster / Terraform (push) Failing after 32s
Deploy Cluster / Ansible (push) Has been skipped
2026-04-30 07:52:49 +00:00
micqdf 3a975a323c fix: remove deploy pr comment post hook
Deploy Cluster / Terraform (push) Failing after 33s
Deploy Cluster / Ansible (push) Has been skipped
2026-04-30 07:49:15 +00:00
micqdf d126de4dc4 fix: align terraform ci version with provider lock
Deploy Cluster / Terraform (push) Failing after 37s
Deploy Cluster / Ansible (push) Has been skipped
2026-04-30 07:45:10 +00:00
micqdf a33a993867 fix: harden cluster rebuild determinism
Deploy Grafana Content / Grafana Content (push) Failing after 1m14s
Deploy Cluster / Terraform (push) Failing after 4m59s
Deploy Cluster / Ansible (push) Has been skipped
2026-04-30 07:36:27 +00:00
micqdf f52e657f9f docs 2026-04-30 07:03:21 +00:00
micqdf f49b08f50c fix: reinstall k3s on version drift
Deploy Cluster / Terraform (push) Successful in 28s
Deploy Cluster / Ansible (push) Failing after 33m40s
2026-04-30 06:03:53 +00:00
micqdf 327bb860b7 fix: pin k3s below rancher limit
Deploy Cluster / Terraform (push) Successful in 29s
Deploy Cluster / Ansible (push) Failing after 35m0s
2026-04-30 05:23:37 +00:00
micqdf fd5451a5ef fix: wait for ssh before gathering facts
Deploy Cluster / Terraform (push) Successful in 30s
Deploy Cluster / Ansible (push) Failing after 1h13m38s
2026-04-30 03:44:13 +00:00
micqdf 7333cb2780 test
Deploy Cluster / Terraform (push) Successful in 2m5s
Deploy Cluster / Ansible (push) Failing after 35m41s
2026-04-30 02:59:47 +00:00
micqdf feecf97cd5 test
Deploy Cluster / Terraform (push) Successful in 2m13s
Deploy Cluster / Ansible (push) Failing after 1m43s
2026-04-30 02:43:30 +00:00
micqdf b5bcec2663 fix: use kubeconfig for observability reset
Deploy Cluster / Terraform (push) Successful in 31s
Deploy Cluster / Ansible (push) Successful in 33m59s
2026-04-27 02:28:38 +00:00
micqdf 0ad56405ee fix: seed grafana observability images
Deploy Cluster / Terraform (push) Successful in 32s
Deploy Cluster / Ansible (push) Failing after 31m26s
2026-04-27 01:50:41 +00:00
micqdf d050e8962a fix: seed cert-manager images before flux
Deploy Cluster / Terraform (push) Successful in 32s
Deploy Cluster / Ansible (push) Failing after 1h25m21s
2026-04-27 00:04:19 +00:00
micqdf d925eeac3f fix: remove Rancher backup workflow
Deploy Cluster / Terraform (push) Successful in 1m33s
Deploy Cluster / Ansible (push) Failing after 54m21s
2026-04-26 22:13:20 +00:00
micqdf 2bde45e106 fix: allow intentional destroy without backup
Deploy Cluster / Terraform (push) Successful in 31s
Deploy Cluster / Ansible (push) Has been cancelled
2026-04-26 22:01:39 +00:00
micqdf 50752ca4b0 fix: allow initial deploy without Rancher backup
Deploy Cluster / Terraform (push) Successful in 30s
Deploy Cluster / Ansible (push) Successful in 22m37s
2026-04-26 21:27:57 +00:00
micqdf a2ed9555c0 fix: vendor critical bootstrap charts
Deploy Cluster / Terraform (push) Successful in 30s
Deploy Cluster / Ansible (push) Failing after 20m0s
2026-04-26 21:01:01 +00:00
micqdf 14462dd870 fix: avoid resetting healthy observability
Deploy Cluster / Terraform (push) Successful in 33s
Deploy Cluster / Ansible (push) Successful in 23m12s
2026-04-26 20:25:42 +00:00
micqdf 0625eee297 fix: uninstall failed observability upgrades
Deploy Cluster / Terraform (push) Successful in 31s
Deploy Cluster / Ansible (push) Failing after 42m47s
2026-04-26 18:46:07 +00:00
micqdf 2dc4ab6329 fix: make observability image seeding non-fatal
Deploy Cluster / Terraform (push) Successful in 29s
Deploy Cluster / Ansible (push) Failing after 46m33s
2026-04-26 12:34:02 +00:00
micqdf bbec0dfff4 fix: skip traefik in observability seeding
Deploy Cluster / Terraform (push) Successful in 31s
Deploy Cluster / Ansible (push) Failing after 18m5s
2026-04-26 12:06:41 +00:00
micqdf 6de826e030 fix: allow cached OCI chart artifacts
Deploy Cluster / Terraform (push) Successful in 29s
Deploy Cluster / Ansible (push) Failing after 18m47s
2026-04-26 11:44:24 +00:00
micqdf bdba2b7af2 fix: defer observability image seeding
Deploy Cluster / Terraform (push) Successful in 34s
Deploy Cluster / Ansible (push) Failing after 23m53s
2026-04-26 11:13:22 +00:00
micqdf 499a3462e7 fix: seed observability dependencies
Deploy Cluster / Terraform (push) Successful in 31s
Deploy Cluster / Ansible (push) Has been cancelled
2026-04-26 10:32:25 +00:00
micqdf daf6ccd0e4 fix: retry bootstrap image imports
Deploy Cluster / Terraform (push) Successful in 33s
Deploy Cluster / Ansible (push) Failing after 42m31s
2026-04-26 09:43:31 +00:00
micqdf a6a630000a fix: vendor Tailscale operator chart
Deploy Cluster / Terraform (push) Successful in 37s
Deploy Cluster / Ansible (push) Failing after 23m49s
2026-04-26 09:17:44 +00:00
micqdf ff9e58d44f fix: remove NFS chart fetch dependency
Deploy Cluster / Terraform (push) Successful in 1m37s
Deploy Cluster / Ansible (push) Has been cancelled
2026-04-26 07:48:11 +00:00
micqdf 8b94e4dd06 fix: import bootstrap images from runner
Deploy Cluster / Terraform (push) Successful in 1m40s
Deploy Cluster / Ansible (push) Has been cancelled
2026-04-26 06:13:37 +00:00
micqdf 547a29e000 fix: require kube-vip image archive
Deploy Cluster / Terraform (push) Successful in 1m46s
Deploy Cluster / Ansible (push) Has been cancelled
2026-04-26 05:04:39 +00:00
micqdf 760f0482d4 fix: pass Proxmox delete params in query
Deploy Cluster / Terraform (push) Successful in 1m48s
Deploy Cluster / Ansible (push) Failing after 22m31s
2026-04-26 04:32:01 +00:00
micqdf 440e268e4f fix: seed kube-vip image from runner
Deploy Cluster / Terraform (push) Failing after 1m56s
Deploy Cluster / Ansible (push) Has been skipped
2026-04-26 04:28:21 +00:00
micqdf 24851f5a9b fix: retry transient Proxmox apply failures
Deploy Cluster / Terraform (push) Successful in 1m39s
Deploy Cluster / Ansible (push) Failing after 22m17s
2026-04-26 04:02:14 +00:00
micqdf ded8efe7fb test
Deploy Cluster / Terraform (push) Failing after 1m39s
Deploy Cluster / Ansible (push) Has been skipped
2026-04-26 03:52:54 +00:00
micqdf c10646d228 fix: harden tailnet smoke script
Deploy Cluster / Terraform (push) Successful in 31s
Deploy Cluster / Ansible (push) Successful in 15m29s
2026-04-26 03:09:18 +00:00
micqdf 50d97209e6 fix: ignore Rancher Turtles cleanup hook pod
Deploy Cluster / Terraform (push) Successful in 30s
Deploy Cluster / Ansible (push) Successful in 14m41s
2026-04-26 02:33:21 +00:00
micqdf 46b2ff7d19 fix: harden final health checks
Deploy Cluster / Terraform (push) Successful in 31s
Deploy Cluster / Ansible (push) Failing after 17m50s
2026-04-26 02:14:02 +00:00
micqdf a4f1d179e9 fix: use Rancher registry for webhook image
Deploy Cluster / Terraform (push) Successful in 32s
Deploy Cluster / Ansible (push) Failing after 26m36s
2026-04-26 01:35:16 +00:00
micqdf 9879de5a86 fix: stop pre-pulling Rancher child images
Deploy Cluster / Terraform (push) Successful in 35s
Deploy Cluster / Ansible (push) Failing after 11m1s
2026-04-26 00:57:49 +00:00
micqdf 195e9bce25 fix: parallelize Rancher child image warmup
Deploy Cluster / Terraform (push) Successful in 35s
Deploy Cluster / Ansible (push) Failing after 23m46s
2026-04-26 00:02:12 +00:00
micqdf 4796606432 fix: warm Rancher child images on all nodes
Deploy Cluster / Terraform (push) Successful in 32s
Deploy Cluster / Ansible (push) Has been cancelled
2026-04-25 23:30:20 +00:00
micqdf b1eab6a0fa fix: vendor Rancher chart for bootstrap
Deploy Cluster / Terraform (push) Successful in 31s
Deploy Cluster / Ansible (push) Has been cancelled
2026-04-25 23:08:26 +00:00
micqdf f3c96b65d2 fix: shorten Rancher chart retry windows
Deploy Cluster / Terraform (push) Successful in 34s
Deploy Cluster / Ansible (push) Failing after 25m40s
2026-04-25 22:30:07 +00:00
micqdf c7a375758f fix: retry Rancher chart pulls during waits
Deploy Cluster / Terraform (push) Successful in 31s
Deploy Cluster / Ansible (push) Has been cancelled
2026-04-25 22:03:09 +00:00
micqdf d0be48b65c fix: gate Tailscale addon on Helm release
Deploy Cluster / Terraform (push) Successful in 32s
Deploy Cluster / Ansible (push) Failing after 36m36s
2026-04-25 21:21:34 +00:00
micqdf 40647318b4 fix: tolerate cached Helm repository artifacts
Deploy Cluster / Terraform (push) Successful in 32s
Deploy Cluster / Ansible (push) Failing after 29m36s
2026-04-25 20:44:03 +00:00
micqdf cdb26904d2 fix: retry Tailscale chart pulls during bootstrap
Deploy Cluster / Terraform (push) Successful in 32s
Deploy Cluster / Ansible (push) Failing after 27m40s
2026-04-25 20:11:43 +00:00
micqdf 3c06e046c2 fix: warm External Secrets image before install
Deploy Cluster / Terraform (push) Successful in 32s
Deploy Cluster / Ansible (push) Failing after 21m10s
2026-04-25 19:46:21 +00:00
micqdf 17f1815e7f fix: use CRI pulls for Flux image warmup
Deploy Cluster / Terraform (push) Successful in 30s
Deploy Cluster / Ansible (push) Failing after 15m3s
2026-04-25 19:28:29 +00:00
micqdf 66e86e55ea fix: require Flux image warmup before bootstrap
Deploy Cluster / Terraform (push) Successful in 31s
Deploy Cluster / Ansible (push) Failing after 23m13s
2026-04-25 19:02:32 +00:00
micqdf 43df412243 fix: handle missing Proxmox VM config during cleanup
Deploy Cluster / Terraform (push) Successful in 1m41s
Deploy Cluster / Ansible (push) Failing after 44m51s
2026-04-25 17:40:51 +00:00
micqdf 383ef9e9ac fix: clean orphan Proxmox cloud-init volumes
Deploy Cluster / Terraform (push) Failing after 19s
Deploy Cluster / Ansible (push) Has been skipped
2026-04-25 17:38:57 +00:00
micqdf 18abc5073b fix: keep concurrent Terraform apply
Deploy Cluster / Terraform (push) Failing after 1m28s
Deploy Cluster / Ansible (push) Has been skipped
2026-04-25 17:30:59 +00:00
micqdf f8da2594ca fix: serialize Proxmox VM apply
Deploy Cluster / Ansible (push) Has been cancelled
Deploy Cluster / Terraform (push) Has been cancelled
2026-04-25 17:27:59 +00:00
micqdf e0359f0097 tes
Deploy Cluster / Terraform (push) Failing after 1m26s
Deploy Cluster / Ansible (push) Has been skipped
2026-04-25 17:22:12 +00:00
micqdf 003333a061 fix: make health checks observe Flux readiness
Deploy Cluster / Terraform (push) Successful in 28s
Deploy Cluster / Ansible (push) Successful in 11m14s
2026-04-25 03:52:43 +00:00
micqdf a6071c504b fix: point Promtail at Loki service
Deploy Cluster / Terraform (push) Successful in 28s
Deploy Cluster / Ansible (push) Has been cancelled
2026-04-25 03:43:23 +00:00
micqdf 08123457f1 fix: ignore stale install hook pods in health check
Deploy Cluster / Terraform (push) Successful in 29s
Deploy Cluster / Ansible (push) Has been cancelled
2026-04-25 03:41:00 +00:00
micqdf 757d88ed52 fix: use cached Promtail images when available
Deploy Cluster / Terraform (push) Successful in 29s
Deploy Cluster / Ansible (push) Failing after 13m15s
2026-04-25 03:25:44 +00:00
micqdf 15defc686f fix: allow slow Promtail image pulls
Deploy Cluster / Terraform (push) Successful in 28s
Deploy Cluster / Ansible (push) Has been cancelled
2026-04-25 03:10:47 +00:00
micqdf abb7578328 fix: run post-deploy checks with bash
Deploy Cluster / Terraform (push) Successful in 28s
Deploy Cluster / Ansible (push) Failing after 12m17s
2026-04-25 02:42:54 +00:00
micqdf bc87a7ca43 fix: avoid immutable observability PVC changes
Deploy Cluster / Terraform (push) Successful in 28s
Deploy Cluster / Ansible (push) Failing after 10m47s
2026-04-25 02:25:40 +00:00
micqdf 045880bdd6 fix: ignore stale Rancher helm operation pods
Deploy Cluster / Terraform (push) Successful in 28s
Deploy Cluster / Ansible (push) Has been cancelled
2026-04-25 02:23:30 +00:00
micqdf bfcf57bcc5 fix: enforce post-deploy health checks
Deploy Cluster / Terraform (push) Successful in 29s
Deploy Cluster / Ansible (push) Has been cancelled
2026-04-25 02:22:16 +00:00
micqdf 7e3ebec95b fix: wait for Rancher resources before rollout checks
Deploy Cluster / Terraform (push) Successful in 29s
Deploy Cluster / Ansible (push) Successful in 17m31s
2026-04-25 01:54:21 +00:00
micqdf 0c31c3b1d5 fix: fail fast on stalled Flux Helm releases
Deploy Cluster / Terraform (push) Successful in 30s
Deploy Cluster / Ansible (push) Failing after 10m33s
2026-04-25 01:40:42 +00:00
micqdf 5523feb563 fix: wait for Rancher Flux resources before rollout
Deploy Cluster / Terraform (push) Successful in 27s
Deploy Cluster / Ansible (push) Failing after 39m43s
2026-04-25 00:59:16 +00:00
micqdf cafa2fa0b3 fix: reset stalled bootstrap Helm releases
Deploy Cluster / Terraform (push) Successful in 27s
Deploy Cluster / Ansible (push) Failing after 9m5s
2026-04-25 00:48:33 +00:00
micqdf a7fd4c0b97 fix: wait on actual ESO deployment names
Deploy Cluster / Terraform (push) Successful in 30s
Deploy Cluster / Ansible (push) Failing after 38m19s
2026-04-25 00:07:48 +00:00
micqdf e56a3a6c38 fix: wait for ESO webhook before ClusterSecretStore
Deploy Cluster / Terraform (push) Successful in 29s
Deploy Cluster / Ansible (push) Failing after 10m13s
2026-04-24 23:13:03 +00:00
micqdf 7b2eca07ab fix: pull external-secrets chart from OCI
Deploy Cluster / Terraform (push) Successful in 30s
Deploy Cluster / Ansible (push) Failing after 9m41s
2026-04-24 15:24:58 +00:00
micqdf 347ca041ba fix: reduce rerun bootstrap pre-pull delays
Deploy Cluster / Terraform (push) Successful in 28s
Deploy Cluster / Ansible (push) Failing after 39m26s
2026-04-24 12:09:34 +00:00
micqdf 3f52bad854 fix: make Ansible reruns faster and idempotent
Deploy Cluster / Terraform (push) Successful in 28s
Deploy Cluster / Ansible (push) Has been cancelled
2026-04-24 11:44:11 +00:00
micqdf c89c31adea fix: clean up Ansible bootstrap warnings
Deploy Cluster / Terraform (push) Successful in 27s
Deploy Cluster / Ansible (push) Has been cancelled
2026-04-24 11:07:13 +00:00
micqdf 68b293efe4 fix: qualify Flux HelmChart bootstrap resources
Deploy Cluster / Terraform (push) Successful in 27s
Deploy Cluster / Ansible (push) Has been cancelled
2026-04-24 10:47:13 +00:00
micqdf 1f465cc0c1 fix: force reconcile bootstrap Helm charts
Deploy Cluster / Terraform (push) Successful in 30s
Deploy Cluster / Ansible (push) Failing after 15m37s
2026-04-24 10:17:49 +00:00
micqdf 6e22bd26b3 fix: wait directly on ESO Helm readiness
Deploy Cluster / Terraform (push) Successful in 27s
Deploy Cluster / Ansible (push) Failing after 47m9s
2026-04-23 22:09:45 +00:00
micqdf 869880c152 fix: wait for ESO resources before CRD conditions
Deploy Cluster / Terraform (push) Successful in 31s
Deploy Cluster / Ansible (push) Failing after 31m14s
2026-04-23 21:17:44 +00:00
micqdf 31e95eb227 fix: pre-pull Flux controllers before bootstrap rollout
Deploy Cluster / Terraform (push) Successful in 28s
Deploy Cluster / Ansible (push) Failing after 16m39s
2026-04-23 20:36:57 +00:00
micqdf 12675417bd fix: use correct namespace and deployment name for ESO rollout check
Deploy Cluster / Terraform (push) Successful in 1m36s
Deploy Cluster / Ansible (push) Failing after 40m40s
The ESO deployment is named external-secrets-external-secrets in the
external-secrets namespace, not external-secrets in kube-system.
2026-04-23 19:00:15 +00:00
micqdf 8e081ddfda fix: wait on ESO deployment directly instead of Flux Kustomization status
Deploy Cluster / Terraform (push) Successful in 29s
Deploy Cluster / Ansible (push) Failing after 19m8s
The addon-external-secrets Flux Kustomization was timing out during bootstrap
because image pulls on fresh Proxmox VMs are slow. The critical dependency is
the ESO deployment being available for the Doppler ClusterSecretStore. Replace
the Kustomization readiness check with direct checks for ESO CRD establishment
and deployment rollout, which are the actual prerequisites for the next step.
2026-04-23 07:32:19 +00:00
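A minimal sketch of the direct readiness checks this change describes, using the ESO resource names noted in the commit above (exact names and timeouts are assumptions, not the workflow's literal commands):

```bash
# Wait for the ESO CRDs to be Established, then for the controller rollout,
# instead of waiting on the Flux Kustomization status.
kubectl wait crd/clustersecretstores.external-secrets.io \
  --for=condition=Established --timeout=120s
kubectl wait crd/externalsecrets.external-secrets.io \
  --for=condition=Established --timeout=120s
kubectl -n external-secrets rollout status \
  deploy/external-secrets-external-secrets --timeout=600s
```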
micqdf 4b7517c9c5 fix: health-check external-secrets addon via HelmRelease only
Deploy Cluster / Terraform (push) Successful in 27s
Deploy Cluster / Ansible (push) Failing after 17m22s
The external-secrets Kustomization was still using wait=true, which makes Flux
hold the addon in a failed state when the HelmRepository has transient fetch
errors even though the HelmRelease and runtime controller deployments are
healthy. Switch it to an explicit HelmRelease health check like the other
helm-backed addons.
2026-04-23 07:11:21 +00:00
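A hedged sketch of the pattern this commit describes, with `wait` disabled and an explicit HelmRelease health check; the addon path and object names are assumptions, not the repo's exact manifest:

```yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: addon-external-secrets
  namespace: flux-system
spec:
  interval: 10m
  path: ./infrastructure/addons/external-secrets
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system
  wait: false
  healthChecks:
    # Readiness now reflects the deployed HelmRelease, not repository fetches.
    - apiVersion: helm.toolkit.fluxcd.io/v2
      kind: HelmRelease
      name: external-secrets
      namespace: external-secrets
```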
micqdf f9bc53723f fix: make image pre-pull roles fully best effort
Deploy Cluster / Terraform (push) Successful in 27s
Deploy Cluster / Ansible (push) Failing after 22m46s
The pre-pull roles were still blocking the playbook because they retried until
success and exhausted their retry budget during registry TLS timeouts. Keep the
image pulls as opportunistic cache warmers, but never let them fail the
bootstrap; log any missed images instead.
2026-04-23 06:41:21 +00:00
micqdf ee6417c18e fix: pre-pull core bootstrap images on cp1 before Flux bootstrap
Deploy Cluster / Terraform (push) Successful in 28s
Deploy Cluster / Ansible (push) Has been cancelled
Fresh clusters were repeatedly timing out while kubelet pulled the pause image,
k3s packaged component images, and Flux controller images onto the first
control plane. Pre-pull the core control-plane bootstrap images into
containerd on cp-1 so Flux and packaged addons start from a warm cache instead
of racing registry TLS timeouts.
2026-04-23 05:55:14 +00:00
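An illustrative sketch of the warm-cache idea, pulling a few bootstrap-critical images into k3s's embedded containerd on cp-1; the image list and tags are placeholders rather than the role's actual list:

```bash
# Best-effort pre-pull so kubelet and Flux start from a warm local cache.
for img in \
  rancher/mirrored-pause:3.6 \
  ghcr.io/fluxcd/source-controller:v1.3.0 \
  ghcr.io/fluxcd/kustomize-controller:v1.3.0; do
  k3s crictl pull "$img" || echo "warn: pre-pull of $img failed (best effort)"
done
```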
micqdf 1156dc0203 fix: pre-pull kube-vip images before waiting for VIP
Deploy Cluster / Terraform (push) Successful in 29s
Deploy Cluster / Ansible (push) Failing after 43m31s
The primary control plane was stalling because kubelet still had to pull both
the Rancher pause image and the kube-vip image before the DaemonSet pod could
become Ready. Pre-pull those images into containerd, extend the readiness wait,
and emit pod diagnostics if kube-vip still does not come up.
2026-04-23 03:55:52 +00:00
micqdf 4151027e01 fix: clean stale Tailscale node devices before bootstrap
Deploy Cluster / Terraform (push) Successful in 1m40s
Deploy Cluster / Ansible (push) Failing after 14m30s
Run the Tailscale cleanup role against the cluster hostnames before any node
reconnects to the tailnet. This removes stale offline cp/worker devices from
previous rebuilds so replacement VMs can reclaim their original hostnames
instead of getting -1 suffixes.
2026-04-23 03:25:17 +00:00
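A hedged sketch of the kind of cleanup this role performs against the Tailscale API; the tailnet name, hostname pattern, and offline filtering are assumptions, and the real role may differ:

```bash
TAILNET="example.ts.net"   # assumed tailnet name
curl -fsSL -u "${TS_API_KEY}:" \
  "https://api.tailscale.com/api/v2/tailnet/${TAILNET}/devices" |
  jq -r '.devices[] | select(.hostname | test("^k8s-cluster-(cp|worker)-")) | .id' |
  while read -r device_id; do
    # A real implementation should also confirm the device is offline
    # (e.g. via lastSeen) before deleting it, so live nodes are never removed.
    curl -fsSL -X DELETE -u "${TS_API_KEY}:" \
      "https://api.tailscale.com/api/v2/device/${device_id}"
  done
```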
micqdf 9269e9df1b docs: add guide for deploying app repos to the cluster
Deploy Cluster / Terraform (push) Successful in 1m36s
Deploy Cluster / Ansible (push) Has been cancelled
Document the recommended two-repo model for application delivery, including
Flux attachment objects, Doppler/ExternalSecret wiring, Tailscale service
exposure, and the steps for enabling the suspended apps layer.
2026-04-23 02:43:00 +00:00
micqdf d9374bc209 fix: remove duplicate wait keys from helm addon kustomizations
Deploy Cluster / Terraform (push) Successful in 29s
Deploy Cluster / Ansible (push) Has been cancelled
The repo-only Kustomization healthCheck change accidentally left the original
wait:true keys in the Rancher and Rancher backup Kustomizations, which broke
the infrastructure kustomize build. Remove the duplicate keys so Flux can
apply the HelmRelease-only health checks cleanly.
2026-04-23 02:20:57 +00:00
micqdf c570a476b5 fix: make helm-based addon kustomizations health-check HelmReleases only
Deploy Cluster / Terraform (push) Successful in 29s
Deploy Cluster / Ansible (push) Has been cancelled
These addon Kustomizations were using wait=true, which made Flux treat transient
HelmRepository fetch timeouts as addon failures even when the HelmRelease and
runtime workloads were healthy. Switch the affected Kustomizations to explicit
HelmRelease healthChecks so readiness reflects the actual deployed platform
state instead of repository fetch flakiness.
2026-04-23 02:15:45 +00:00
micqdf a7f11ccf94 fix: give Rancher more time to pass startup probe during upgrades
Deploy Cluster / Terraform (push) Successful in 28s
Deploy Cluster / Ansible (push) Successful in 18m59s
Rancher needs longer than the chart default 2-minute startup probe budget on
this cluster while it restores local catalogs and finishes API startup. Extend
the startup probe failure threshold so Helm upgrades can complete instead of
restarting the new pod before it becomes ready.
2026-04-23 01:44:25 +00:00
micqdf a7d540ca65 fix: stop forcing Flux releases during deploy bootstrap
Deploy Cluster / Terraform (push) Successful in 32s
Deploy Cluster / Ansible (push) Successful in 21m12s
Remove the HelmRelease reset/force annotations from the deploy workflow now
that the cluster can converge on its own. The runtime waits remain, but CI no
longer re-triggers Rancher and NFS churn on every bootstrap attempt.
2026-04-23 00:35:31 +00:00
micqdf 098bd98876 fix: wait on Rancher and storage runtime objects during bootstrap
Deploy Cluster / Terraform (push) Successful in 26s
Deploy Cluster / Ansible (push) Failing after 25m19s
Flux can leave HelmRelease and Kustomization conditions stale after transient
chart fetch or image pull failures even when the underlying workloads recover.
Switch the deploy workflow to wait on the concrete runtime resources we care
about: the NFS provisioner deployment and StorageClass, Rancher deployment,
webhook, cert-manager issuer/certificate, and the rancher-backup deployment.
2026-04-22 18:41:09 +00:00
micqdf 55d7b8201e fix: make Rancher image pre-pull best effort and disable managed SUC
Deploy Cluster / Terraform (push) Successful in 27s
Deploy Cluster / Ansible (push) Failing after 32m19s
Docker Hub TLS handshakes are too flaky to make pre-pulling a hard bootstrap
requirement. Treat image pre-pull as opportunistic and disable Rancher's
managed system-upgrade-controller feature so that image is removed from the
critical install path while Rancher and its webhook converge.
2026-04-22 11:33:13 +00:00
micqdf 9c0523e880 fix: pre-pull Rancher images and reset Rancher release during bootstrap
Deploy Cluster / Terraform (push) Successful in 28s
Deploy Cluster / Ansible (push) Failing after 27m30s
Rancher installs were stalling on transient Docker Hub TLS handshake timeouts
for rancher shell, webhook, and system-upgrade-controller images. Pre-pull the
required images onto all nodes after k3s comes up, extend the Rancher HelmRelease
timeout, and reset/force the Rancher HelmRelease before waiting on addon-rancher
so bootstrap can recover from stale failed remediation state.
2026-04-22 11:00:54 +00:00
micqdf 8372d562ad fix: reset and force nfs helmrelease during bootstrap
Deploy Cluster / Terraform (push) Successful in 29s
Deploy Cluster / Ansible (push) Failing after 20m22s
When the NFS storage HelmRelease has already entered a failed remediation state,
a plain reconcile request is not enough to clear the stale failure counters.
Send requestedAt, resetAt, and forceAt together so helm-controller retries the
release cleanly before the workflow waits on addon-nfs-storage.
2026-04-22 10:35:32 +00:00
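A minimal sketch of the annotation trio the commit refers to, sent together so helm-controller retries the release cleanly; the release name and namespace are assumptions:

```bash
TS="$(date +%s)"
kubectl -n flux-system annotate helmrelease nfs-subdir-external-provisioner \
  reconcile.fluxcd.io/requestedAt="$TS" \
  reconcile.fluxcd.io/resetAt="$TS" \
  reconcile.fluxcd.io/forceAt="$TS" \
  --overwrite
```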
micqdf 1bb11dfe3a fix: force nfs storage reconcile during flux bootstrap
Deploy Cluster / Terraform (push) Successful in 27s
Deploy Cluster / Ansible (push) Failing after 19m0s
The NFS HelmRelease can remain in a failed state from an earlier bootstrap
attempt even after the backing NFS export is corrected and the pod becomes
healthy. Request a fresh reconcile of the HelmRelease and addon kustomization
before waiting on addon-nfs-storage so the bootstrap step can observe the
recovered state.
2026-04-22 10:08:20 +00:00
micqdf 624cd5aab6 fix: point NFS provisioner at active Proxmox host export
Deploy Cluster / Terraform (push) Successful in 27s
Deploy Cluster / Ansible (push) Failing after 18m51s
The cluster nodes can reach the exported NFS path on 10.27.27.239, not
10.27.27.22. Update the storage addon and repo note so the NFS provisioner
mounts the live export and Flux health checks can converge.
2026-04-22 09:46:01 +00:00
micqdf 71bdc6a709 fix: extend Flux bootstrap timeouts on fresh clusters
Deploy Cluster / Terraform (push) Successful in 26s
Deploy Cluster / Ansible (push) Failing after 18m44s
Fresh Proxmox clusters need longer for the Flux controller rollouts and first
GitRepository/Kustomization reconciliations, especially while images are still
being pulled onto the control plane. Increase the bootstrap wait windows so CI
does not fail while the controllers are still converging.
2026-04-22 08:36:27 +00:00
micqdf 714f20417b fix: tolerate control-plane taint when pinning Flux to cp1
Deploy Cluster / Terraform (push) Successful in 28s
Deploy Cluster / Ansible (push) Failing after 10m19s
Flux bootstrap patches the controllers onto k8s-cluster-cp-1, but the
control-plane node is tainted NoSchedule. Add the matching toleration in both
the checked-in patch manifest and the bootstrap workflow so the controllers can
actually schedule and roll out on cp-1.
2026-04-22 05:05:15 +00:00
micqdf c32bec34bc fix: quote kube-vip readiness jsonpath in bootstrap role
Deploy Cluster / Terraform (push) Successful in 27s
Deploy Cluster / Ansible (push) Failing after 10m11s
The local kube-vip readiness probe used an unquoted jsonpath predicate,
which made kubectl treat Ready as an identifier instead of a string. Use a
quoted jsonpath via shell so bootstrap can detect the primary kube-vip pod
properly before waiting on the API VIP.
2026-04-22 04:41:48 +00:00
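A hedged sketch of the quoting fix: the `Ready` predicate is double-quoted inside a single-quoted jsonpath so the shell passes it through intact. The label selector and namespace are assumptions:

```bash
status="$(kubectl -n kube-system get pod \
  -l app.kubernetes.io/name=kube-vip \
  --field-selector spec.nodeName=k8s-cluster-cp-1 \
  -o jsonpath='{.items[0].status.conditions[?(@.type=="Ready")].status}')"
[ "$status" = "True" ] && echo "kube-vip ready on cp-1"
```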
micqdf 6519a7673d fix: wait for kube-vip on primary node during bootstrap
Deploy Cluster / Terraform (push) Successful in 28s
Deploy Cluster / Ansible (push) Failing after 9m11s
The kube-vip DaemonSet is applied before the secondary control planes join,
so waiting for a full DaemonSet rollout blocks bootstrap on nodes that do not
exist in the cluster yet. Wait only for the primary node's kube-vip pod and
then verify the VIP is reachable on 6443.
2026-04-22 04:29:29 +00:00
micqdf d1c31cdb91 fix: rely on k3s service readiness instead of installer exit code
Deploy Cluster / Terraform (push) Successful in 27s
Deploy Cluster / Ansible (push) Failing after 8m9s
The k3s install script can return non-zero while systemd is still bringing the
service up, especially on worker agents. Do not fail immediately on the
installer command; wait for the service to become active and only emit
install diagnostics if the later readiness check fails.
2026-04-22 04:14:31 +00:00
micqdf b3e88712bd fix: derive cluster network interface from host facts
Deploy Cluster / Terraform (push) Successful in 28s
Deploy Cluster / Ansible (push) Failing after 12m32s
The Proxmox Ubuntu clones are exposing their primary NIC as eth0, not ens18.
Use ansible_default_ipv4.interface for k3s flannel and kube-vip so bootstrap
tracks the actual interface name instead of a guessed template default.
2026-04-22 03:50:03 +00:00
micqdf 06366ee5e6 fix: accept cloud-init exit code 2 after first boot
Deploy Cluster / Terraform (push) Successful in 28s
Deploy Cluster / Ansible (push) Failing after 6m2s
Ubuntu cloud-init returns exit code 2 for some completed boots even when the
status output is 'done'. Treat that as a successful wait state so Ansible can
continue into the package install phase instead of aborting early.
2026-04-22 03:40:55 +00:00
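A hedged Ansible sketch of the tolerance described above; the exact task in the role may be worded differently:

```yaml
- name: Wait for cloud-init to finish first boot
  ansible.builtin.command: cloud-init status --wait
  register: cloud_init_wait
  changed_when: false
  # Ubuntu's cloud-init can exit 2 on a completed (but degraded) boot,
  # so accept it alongside 0 instead of aborting the bootstrap.
  failed_when: cloud_init_wait.rc not in [0, 2]
```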
micqdf 9a2d213114 fix: wait for cloud-init before package install during bootstrap
Deploy Cluster / Terraform (push) Successful in 29s
Deploy Cluster / Ansible (push) Failing after 2m36s
Fresh Ubuntu cloud-init clones still hold apt and dpkg locks during first boot,
which caused the Ansible common role to fail before the control plane could
finish bootstrap. Wait for cloud-init, increase apt lock timeouts, and skip the
final kubeconfig rewrite when no kubeconfig was fetched yet.
2026-04-22 03:34:53 +00:00
micqdf 9482a0f551 fix: skip clone storage override for linked Proxmox clones
Deploy Cluster / Terraform (push) Successful in 1m43s
Deploy Cluster / Ansible (push) Failing after 6m24s
The bpg/proxmox provider rejects clone.datastore_id when creating linked
clones. Only pass the target datastore when full clones are enabled so the
linked-clone baseline can provision from template 9000 successfully.
2026-04-22 03:22:50 +00:00
micqdf 5c53b8e06e fix: normalize Proxmox endpoint and stop dashboards self-trigger
Deploy Cluster / Terraform (push) Failing after 53s
Deploy Cluster / Ansible (push) Has been skipped
Accept Proxmox API endpoints with or without /api2/json in CI and local
tfvars, and avoid running the dashboards workflow just because its own
workflow file changed during platform migrations.
2026-04-22 03:13:22 +00:00
micqdf b1dae28aa5 feat: migrate cluster baseline from Hetzner to Proxmox
Deploy Cluster / Terraform (push) Failing after 52s
Deploy Cluster / Ansible (push) Has been skipped
Deploy Grafana Content / Grafana Content (push) Failing after 1m37s
Replace Hetzner infrastructure and cloud-provider assumptions with Proxmox
VM clones, kube-vip API HA, and NFS-backed storage. Update bootstrap,
Flux addons, CI workflows, and docs to target the new private Proxmox
baseline while preserving the existing Tailscale, Doppler, Flux, Rancher,
and B2 backup flows.
2026-04-22 03:02:13 +00:00
600 changed files with 170541 additions and 1675 deletions
+13 -24
View File
@@ -7,22 +7,28 @@ on:
paths: paths:
- "ansible/dashboards.yml" - "ansible/dashboards.yml"
- "ansible/roles/observability-content/**" - "ansible/roles/observability-content/**"
- ".gitea/workflows/dashboards.yml"
workflow_dispatch: workflow_dispatch:
concurrency:
group: prod-cluster
cancel-in-progress: false
env: env:
TF_VERSION: "1.7.0" TF_VERSION: "1.14.9"
TF_VAR_hcloud_token: ${{ secrets.HCLOUD_TOKEN }}
TF_VAR_s3_access_key: ${{ secrets.S3_ACCESS_KEY }} TF_VAR_s3_access_key: ${{ secrets.S3_ACCESS_KEY }}
TF_VAR_s3_secret_key: ${{ secrets.S3_SECRET_KEY }} TF_VAR_s3_secret_key: ${{ secrets.S3_SECRET_KEY }}
TF_VAR_s3_endpoint: ${{ secrets.S3_ENDPOINT }} TF_VAR_s3_endpoint: ${{ secrets.S3_ENDPOINT }}
TF_VAR_s3_bucket: ${{ secrets.S3_BUCKET }} TF_VAR_s3_bucket: ${{ secrets.S3_BUCKET }}
TF_VAR_tailscale_tailnet: ${{ secrets.TAILSCALE_TAILNET }} TF_VAR_tailscale_tailnet: ${{ secrets.TAILSCALE_TAILNET }}
TF_VAR_proxmox_endpoint: ${{ secrets.PROXMOX_ENDPOINT }}
TF_VAR_proxmox_api_token_id: ${{ secrets.PROXMOX_API_TOKEN_ID }}
TF_VAR_proxmox_api_token_secret: ${{ secrets.PROXMOX_API_TOKEN_SECRET }}
TF_VAR_proxmox_insecure: "true"
jobs: jobs:
dashboards: dashboards:
name: Grafana Content name: Grafana Content
runs-on: ubuntu-latest runs-on: ubuntu-22.04
steps: steps:
- name: Checkout - name: Checkout
uses: actions/checkout@v4 uses: actions/checkout@v4
@@ -31,6 +37,7 @@ jobs:
uses: hashicorp/setup-terraform@v3 uses: hashicorp/setup-terraform@v3
with: with:
terraform_version: ${{ env.TF_VERSION }} terraform_version: ${{ env.TF_VERSION }}
terraform_wrapper: false
- name: Setup SSH Keys - name: Setup SSH Keys
run: | run: |
@@ -44,6 +51,7 @@ jobs:
working-directory: terraform working-directory: terraform
run: | run: |
terraform init \ terraform init \
-lockfile=readonly \
-backend-config="endpoint=${{ secrets.S3_ENDPOINT }}" \ -backend-config="endpoint=${{ secrets.S3_ENDPOINT }}" \
-backend-config="bucket=${{ secrets.S3_BUCKET }}" \ -backend-config="bucket=${{ secrets.S3_BUCKET }}" \
-backend-config="region=auto" \ -backend-config="region=auto" \
@@ -51,29 +59,10 @@ jobs:
-backend-config="secret_key=${{ secrets.S3_SECRET_KEY }}" \ -backend-config="secret_key=${{ secrets.S3_SECRET_KEY }}" \
-backend-config="skip_requesting_account_id=true" -backend-config="skip_requesting_account_id=true"
- name: Detect runner egress IP
run: |
RUNNER_IP=$(curl -fsSL https://api.ipify.org)
echo "RUNNER_CIDR=[\"${RUNNER_IP}/32\"]" >> "$GITHUB_ENV"
echo "Runner egress IP: ${RUNNER_IP}"
- name: Open SSH/API for current runner CIDR
working-directory: terraform
run: |
terraform apply \
-refresh=false \
-target=hcloud_firewall.cluster \
-var="hcloud_token=${{ secrets.HCLOUD_TOKEN }}" \
-var="ssh_public_key=$HOME/.ssh/id_ed25519.pub" \
-var="ssh_private_key=$HOME/.ssh/id_ed25519" \
-var="allowed_ssh_ips=${RUNNER_CIDR}" \
-var="allowed_api_ips=${RUNNER_CIDR}" \
-auto-approve
- name: Install Python Dependencies - name: Install Python Dependencies
run: | run: |
apt-get update && apt-get install -y python3-pip apt-get update && apt-get install -y python3-pip
pip3 install --break-system-packages ansible kubernetes jinja2 pyyaml pip3 install ansible==8.7.0 kubernetes==26.1.0 jinja2==3.1.5 pyyaml==6.0.2
- name: Install Ansible Collections - name: Install Ansible Collections
run: ansible-galaxy collection install -r ansible/requirements.yml run: ansible-galaxy collection install -r ansible/requirements.yml
File diff suppressed because it is too large
+44 -126
View File
@@ -8,109 +8,28 @@ on:
required: true required: true
default: '' default: ''
concurrency:
group: prod-cluster
cancel-in-progress: false
env: env:
TF_VERSION: "1.7.0" TF_VERSION: "1.14.9"
TF_VAR_hcloud_token: ${{ secrets.HCLOUD_TOKEN }}
TF_VAR_s3_access_key: ${{ secrets.S3_ACCESS_KEY }} TF_VAR_s3_access_key: ${{ secrets.S3_ACCESS_KEY }}
TF_VAR_s3_secret_key: ${{ secrets.S3_SECRET_KEY }} TF_VAR_s3_secret_key: ${{ secrets.S3_SECRET_KEY }}
TF_VAR_s3_endpoint: ${{ secrets.S3_ENDPOINT }} TF_VAR_s3_endpoint: ${{ secrets.S3_ENDPOINT }}
TF_VAR_s3_bucket: ${{ secrets.S3_BUCKET }} TF_VAR_s3_bucket: ${{ secrets.S3_BUCKET }}
TF_VAR_tailscale_tailnet: ${{ secrets.TAILSCALE_TAILNET }} TF_VAR_tailscale_tailnet: ${{ secrets.TAILSCALE_TAILNET }}
B2_ACCOUNT_ID: ${{ secrets.B2_ACCOUNT_ID }} TF_VAR_proxmox_endpoint: ${{ secrets.PROXMOX_ENDPOINT }}
B2_APPLICATION_KEY: ${{ secrets.B2_APPLICATION_KEY }} TF_VAR_proxmox_api_token_id: ${{ secrets.PROXMOX_API_TOKEN_ID }}
TF_VAR_proxmox_api_token_secret: ${{ secrets.PROXMOX_API_TOKEN_SECRET }}
TF_VAR_proxmox_insecure: "true"
jobs: jobs:
pre-destroy-backup:
name: Pre-Destroy Backup
runs-on: ubuntu-latest
if: github.event.inputs.confirm == 'destroy'
environment: destroy
steps:
- name: Checkout
uses: actions/checkout@v4
- name: Setup Terraform
uses: hashicorp/setup-terraform@v3
with:
terraform_version: ${{ env.TF_VERSION }}
- name: Terraform Init
working-directory: terraform
run: |
terraform init \
-backend-config="endpoint=${{ secrets.S3_ENDPOINT }}" \
-backend-config="bucket=${{ secrets.S3_BUCKET }}" \
-backend-config="region=auto" \
-backend-config="access_key=${{ secrets.S3_ACCESS_KEY }}" \
-backend-config="secret_key=${{ secrets.S3_SECRET_KEY }}" \
-backend-config="skip_requesting_account_id=true"
- name: Setup SSH Keys
run: |
mkdir -p ~/.ssh
echo "${{ secrets.SSH_PRIVATE_KEY }}" > ~/.ssh/id_ed25519
chmod 600 ~/.ssh/id_ed25519
echo "${{ secrets.SSH_PUBLIC_KEY }}" > ~/.ssh/id_ed25519.pub
chmod 644 ~/.ssh/id_ed25519.pub
- name: Get Control Plane IP
id: cp_ip
working-directory: terraform
run: |
PRIMARY_IP=$(terraform output -raw primary_control_plane_ip)
echo "PRIMARY_IP=${PRIMARY_IP}" >> "$GITHUB_ENV"
- name: Pre-Destroy pg_dump to B2
run: |
set +e
echo "Attempting pre-destroy backup to B2..."
ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null root@${PRIMARY_IP} << 'EOF'
set -e
# Check if kubectl is available and cluster is up
if ! command -v kubectl &> /dev/null; then
echo "kubectl not found, skipping pre-destroy backup"
exit 0
fi
# Check if we can reach the cluster
if ! kubectl cluster-info &> /dev/null; then
echo "Cannot reach cluster, skipping pre-destroy backup"
exit 0
fi
# Check if CNP is deployed
if ! kubectl get namespace cnpg-cluster &> /dev/null; then
echo "CNP namespace not found, skipping pre-destroy backup"
exit 0
fi
# Run backup using the pgdump image directly
BACKUP_FILE="rancher-backup-$(date +%Y%m%d-%H%M%S).sql.gz"
B2_ACCOUNT_ID="$(cat /etc/kubernetes/secret/b2_account_id 2>/dev/null || echo '')"
B2_APPLICATION_KEY="$(cat /etc/kubernetes/secret/b2_application_key 2>/dev/null || echo '')"
if [ -z "$B2_ACCOUNT_ID" ] || [ -z "$B2_APPLICATION_KEY" ]; then
echo "B2 credentials not found in secret, skipping pre-destroy backup"
exit 0
fi
kubectl run pgdump-manual --image=ghcr.io/cloudnative-pg/pgbackrest:latest --restart=Never \
-n cnpg-cluster --dry-run=client -o yaml | \
kubectl apply -f -
echo "Waiting for backup job to complete..."
kubectl wait --for=condition=complete job/pgdump-manual -n cnpg-cluster --timeout=300s || true
kubectl logs job/pgdump-manual -n cnpg-cluster || true
kubectl delete job pgdump-manual -n cnpg-cluster --ignore-not-found=true || true
EOF
echo "Pre-destroy backup step completed (failure is non-fatal)"
destroy: destroy:
name: Destroy Cluster name: Destroy Cluster
runs-on: ubuntu-latest runs-on: ubuntu-22.04
if: github.event.inputs.confirm == 'destroy' if: github.event.inputs.confirm == 'destroy'
environment: destroy environment: destroy
needs: pre-destroy-backup
steps: steps:
- name: Checkout - name: Checkout
uses: actions/checkout@v4 uses: actions/checkout@v4
@@ -119,17 +38,7 @@ jobs:
uses: hashicorp/setup-terraform@v3 uses: hashicorp/setup-terraform@v3
with: with:
terraform_version: ${{ env.TF_VERSION }} terraform_version: ${{ env.TF_VERSION }}
terraform_wrapper: false
- name: Terraform Init
working-directory: terraform
run: |
terraform init \
-backend-config="endpoint=${{ secrets.S3_ENDPOINT }}" \
-backend-config="bucket=${{ secrets.S3_BUCKET }}" \
-backend-config="region=auto" \
-backend-config="access_key=${{ secrets.S3_ACCESS_KEY }}" \
-backend-config="secret_key=${{ secrets.S3_SECRET_KEY }}" \
-backend-config="skip_requesting_account_id=true"
- name: Setup SSH Keys - name: Setup SSH Keys
run: | run: |
@@ -139,10 +48,30 @@ jobs:
echo "${{ secrets.SSH_PUBLIC_KEY }}" > ~/.ssh/id_ed25519.pub echo "${{ secrets.SSH_PUBLIC_KEY }}" > ~/.ssh/id_ed25519.pub
chmod 644 ~/.ssh/id_ed25519.pub chmod 644 ~/.ssh/id_ed25519.pub
- name: Install jq - name: Terraform Init
working-directory: terraform
run: | run: |
apt-get update terraform init \
apt-get install -y jq -lockfile=readonly \
-backend-config="endpoint=${{ secrets.S3_ENDPOINT }}" \
-backend-config="bucket=${{ secrets.S3_BUCKET }}" \
-backend-config="region=auto" \
-backend-config="access_key=${{ secrets.S3_ACCESS_KEY }}" \
-backend-config="secret_key=${{ secrets.S3_SECRET_KEY }}" \
-backend-config="skip_requesting_account_id=true"
- name: Save Proxmox target list
run: |
mkdir -p outputs
if ! terraform -chdir=terraform output -json proxmox_target_vms > outputs/proxmox_target_vms.json; then
terraform -chdir=terraform plan \
-refresh=false \
-var="ssh_public_key=$HOME/.ssh/id_ed25519.pub" \
-var="ssh_private_key=$HOME/.ssh/id_ed25519" \
-out=cleanup.tfplan \
-no-color || true
printf '[]' > outputs/proxmox_target_vms.json
fi
- name: Terraform Destroy - name: Terraform Destroy
id: destroy id: destroy
@@ -152,7 +81,7 @@ jobs:
for attempt in 1 2 3; do for attempt in 1 2 3; do
echo "Terraform destroy attempt ${attempt}/3" echo "Terraform destroy attempt ${attempt}/3"
terraform destroy \ terraform destroy \
-var="hcloud_token=${{ secrets.HCLOUD_TOKEN }}" \ -parallelism=2 \
-var="ssh_public_key=$HOME/.ssh/id_ed25519.pub" \ -var="ssh_public_key=$HOME/.ssh/id_ed25519.pub" \
-var="ssh_private_key=$HOME/.ssh/id_ed25519" \ -var="ssh_private_key=$HOME/.ssh/id_ed25519" \
-auto-approve -auto-approve
@@ -164,32 +93,21 @@ jobs:
echo "Terraform destroy failed with exit code ${rc}; retrying in 30s" echo "Terraform destroy failed with exit code ${rc}; retrying in 30s"
sleep 30 sleep 30
terraform refresh \ terraform refresh \
-var="hcloud_token=${{ secrets.HCLOUD_TOKEN }}" \
-var="ssh_public_key=$HOME/.ssh/id_ed25519.pub" \ -var="ssh_public_key=$HOME/.ssh/id_ed25519.pub" \
-var="ssh_private_key=$HOME/.ssh/id_ed25519" || true -var="ssh_private_key=$HOME/.ssh/id_ed25519" || true
fi fi
done done
exit "$rc" exit "$rc"
- name: Hetzner destroy diagnostics - name: Verify Proxmox target VMs removed
if: failure() && steps.destroy.outcome == 'failure' if: success()
env:
HCLOUD_TOKEN: ${{ secrets.HCLOUD_TOKEN }}
run: | run: |
set +e python3 scripts/proxmox-rebuild-cleanup.py --mode post-destroy --targets-file outputs/proxmox_target_vms.json
echo "== Terraform state list ==" if [ -f terraform/cleanup.tfplan ]; then
terraform -chdir=terraform state list || true python3 scripts/proxmox-rebuild-cleanup.py --mode post-destroy --terraform-dir terraform --plan cleanup.tfplan
network_id=$(terraform -chdir=terraform state show hcloud_network.cluster 2>/dev/null | awk '/^id *=/ {gsub(/"/, "", $3); print $3; exit}')
if [ -z "$network_id" ]; then
network_id="11988935"
fi fi
echo "== Hetzner network ==" - name: Terraform state diagnostics
curl -fsSL -H "Authorization: Bearer ${HCLOUD_TOKEN}" "https://api.hetzner.cloud/v1/networks/${network_id}" | jq . || true if: failure() && steps.destroy.outcome == 'failure'
run: |
echo "== Hetzner servers attached to network ==" terraform -chdir=terraform state list || true
curl -fsSL -H "Authorization: Bearer ${HCLOUD_TOKEN}" "https://api.hetzner.cloud/v1/servers" | jq --argjson id "$network_id" '.servers[] | select(any(.private_net[]?; .network == $id)) | {id, name, private_net}' || true
echo "== Hetzner load balancers attached to network =="
curl -fsSL -H "Authorization: Bearer ${HCLOUD_TOKEN}" "https://api.hetzner.cloud/v1/load_balancers" | jq --argjson id "$network_id" '.load_balancers[] | select(any(.private_net[]?; .network == $id)) | {id, name, private_net}' || true
-1
View File
@@ -3,7 +3,6 @@
*.tfstate.* *.tfstate.*
*.tfstate.backup *.tfstate.backup
.terraform/ .terraform/
.terraform.lock.hcl
terraform.tfvars terraform.tfvars
crash.log crash.log
override.tf override.tf
+41 -32
View File
@@ -1,48 +1,57 @@
# AGENTS.md # AGENTS.md
Repository guide for OpenCode sessions in this repo. Compact repo guidance for OpenCode sessions. Trust executable sources over docs when they conflict.
## Read First ## Read First
- Trust manifests and workflows over prose when they conflict. - Highest-value sources: `.gitea/workflows/deploy.yml`, `.gitea/workflows/destroy.yml`, `terraform/main.tf`, `terraform/variables.tf`, `terraform/servers.tf`, `ansible/site.yml`, `ansible/inventory.tmpl`, `clusters/prod/flux-system/`, `infrastructure/addons/kustomization.yaml`.
- Highest-value sources: `terraform/main.tf`, `terraform/variables.tf`, `ansible/site.yml`, `clusters/prod/flux-system/`, `infrastructure/addons/kustomization.yaml`, `.gitea/workflows/deploy.yml`, `.gitea/workflows/destroy.yml`, `README.md`, `STABLE_BASELINE.md`, `scripts/refresh-kubeconfig.sh`, `scripts/smoke-check-tailnet-services.sh`. - `STABLE_BASELINE.md` still contains stale Rancher backup/restore references; current workflows and addon manifests do not deploy or restore `rancher-backup`.
## Current Baseline ## Baseline
- HA private cluster: 3 control planes, 3 workers. - Proxmox HA K3s cluster: 3 control planes, 5 workers, VMIDs `200-202` and `210-214`, node `flex`, template VMID `9000`, datastore `Flash`.
- Tailscale is the private access path for Rancher and shared services. - API HA is kube-vip at `10.27.27.40`; control planes are `10.27.27.30-32`, workers are `10.27.27.41-45`.
- Rancher, Grafana, and Prometheus are exposed through Tailscale; Flux UI / Weave GitOps is removed. - SSH user is `ubuntu`; Ansible derives the flannel iface from `ansible_default_ipv4.interface` with `eth0` fallback, so do not hard-code `ens18`.
- `apps/` is suspended by default. - Storage is raw-manifest `nfs-subdir-external-provisioner` using `10.27.27.239:/TheFlash/k8s-nfs` and default StorageClass `flash-nfs`.
- Rancher stores state in embedded etcd; backup/restore uses `rancher-backup` to B2. - Tailscale is the private access path. Rancher, Grafana, and Prometheus are exposed only through Tailscale services.
- `apps` is intentionally suspended in `clusters/prod/flux-system/kustomization-apps.yaml`.
## Common Commands ## Commands
- Terraform: `terraform -chdir=terraform fmt -recursive`, `terraform -chdir=terraform validate`, `terraform -chdir=terraform plan -var-file=../terraform.tfvars`, `terraform -chdir=terraform apply -var-file=../terraform.tfvars` - Terraform: `terraform -chdir=terraform fmt -recursive`, `terraform -chdir=terraform validate`, `terraform -chdir=terraform plan -var-file=../terraform.tfvars`, `terraform -chdir=terraform apply -var-file=../terraform.tfvars`.
- Ansible: `ansible-galaxy collection install -r ansible/requirements.yml`, `cd ansible && python3 generate_inventory.py`, `ansible-playbook -i ansible/inventory.ini ansible/site.yml --syntax-check`, `ansible-playbook ansible/site.yml` - Ansible setup: `ansible-galaxy collection install -r ansible/requirements.yml`, then from `ansible/` run `python3 generate_inventory.py` and `ansible-playbook site.yml --syntax-check`.
- Flux/Kustomize: `kubectl kustomize infrastructure/addons/<addon>`, `kubectl kustomize clusters/prod/flux-system` - Flux/Kustomize checks: `kubectl kustomize infrastructure/addons/<addon>`, `kubectl kustomize infrastructure/addons`, `kubectl kustomize clusters/prod/flux-system`.
- Kubeconfig refresh: `scripts/refresh-kubeconfig.sh <cp1-public-ip>` - Kubeconfig refresh: `scripts/refresh-kubeconfig.sh <cp1-ip>`; use this if local `kubectl` falls back to `localhost:8080` after rebuilds.
- Tailnet smoke check: `ssh root@<cp1-ip> 'bash -s' < scripts/smoke-check-tailnet-services.sh` - Tailnet smoke check from cp1: `ssh ubuntu@<cp1-ip> 'bash -s' < scripts/smoke-check-tailnet-services.sh`.
- Fast Grafana content iteration uses `.gitea/workflows/dashboards.yml` and `ansible/dashboards.yml`, not a full cluster rebuild.
## Workflow Rules ## Deploy Flow
- Keep diffs small and validate only the directory you edited. - Pushes to `main` run Gitea CI: Terraform fmt/init/validate/plan/apply, Proxmox cleanup/retry, Ansible bootstrap, Flux bootstrap, addon gates, Rancher gate, observability image seeding, health checks, tailnet smoke checks.
- Update manifests and docs together when behavior changes. - Deploy and destroy workflows share `concurrency.group: prod-cluster`; destroy only requires workflow input `confirm: destroy` and has no backup gate.
- Use `set -euo pipefail` in workflow shell blocks. - Keep `set -euo pipefail` in workflow shell blocks.
- CI deploy order is Terraform -> Ansible -> Flux bootstrap -> Rancher restore -> health checks. - Terraform retry cleanup has hard-coded target VMIDs/names in `.gitea/workflows/deploy.yml`; update it when changing node counts, names, or VMIDs.
- One object per Kubernetes YAML file; keep filenames kebab-case. - Fresh VMs have unreliable registry/chart egress, so critical images are prepared by `skopeo` on the runner and imported with `k3s ctr`; update the workflow archive lists when adding bootstrap-time images.
- If `kubectl` points at `localhost:8080` after a rebuild, refresh kubeconfig from the primary control-plane IP. - CI applies `clusters/prod/flux-system/gotk-components.yaml` directly and then patches Flux controller deployments inline; changes only in `gotk-controller-cp1-patches.yaml` do not affect CI bootstrap.
## Repo-Specific Gotchas ## GitOps Addons
- `rancher-backup` uses a postRenderer to swap the broken hook image to `rancher/kubectl:v1.34.0`; do not put S3 config in HelmRelease values. Put it in the Backup CR. - Vendored charts are intentional: `infrastructure/charts/{cert-manager,traefik,kube-prometheus-stack,tailscale-operator,rancher}`. Do not restore remote `HelmRepository` objects unless cluster-side chart fetch reliability is intentionally changed.
- Tailscale cleanup only runs before service proxies exist; it removes stale offline `rancher`/`grafana`/`prometheus`/`flux` devices, then must stop so live proxies are not deleted. - External Secrets and Loki/Promtail use Flux `OCIRepository`; Rancher, Tailscale, cert-manager, Traefik, and kube-prometheus-stack use `GitRepository` chart paths.
- Keep the Tailscale operator on the stable Helm repo `https://pkgs.tailscale.com/helmcharts` at `1.96.5` unless you have a reason to change it. - Use fully qualified `helmchart.source.toolkit.fluxcd.io/...` in scripts; K3s also has `helmcharts.helm.cattle.io`, so `helmchart/...` can target the wrong resource.
- Current private URLs: - `doppler-bootstrap` only creates the `external-secrets` namespace and Doppler token secret. The deploy workflow creates `ClusterSecretStore/doppler-hetznerterra` after ESO CRDs and webhook endpoints exist.
- Rancher: `https://rancher.silverside-gopher.ts.net/` - The checked-in `infrastructure/addons/external-secrets/clustersecretstore-doppler-hetznerterra.yaml` is not included by that addon kustomization; do not assume Flux applies it.
- Grafana: `http://grafana.silverside-gopher.ts.net/` - Keep Kubernetes manifests one object per file with kebab-case filenames.
- Prometheus: `http://prometheus.silverside-gopher.ts.net:9090/`
## Gotchas
- Rancher chart `2.13.3` requires Kubernetes `<1.35.0-0`; K3s `latest` can break Rancher. Role defaults pin `v1.34.6+k3s1`; do not reintroduce a generated-inventory `k3s_version=latest` override.
- The repo no longer uses a cloud controller manager. `providerID`, Hetzner CCM/CSI, or Hetzner firewall/load-balancer logic is stale.
- Tailscale cleanup must only remove stale offline reserved hostnames before live service proxies exist; do not delete active `rancher`, `grafana`, `prometheus`, or `flux` devices.
- Proxmox endpoint should be the base URL, for example `https://100.105.0.115:8006/`; provider/workflow code strips `/api2/json` when needed.
- Current private URLs: Rancher `https://rancher.silverside-gopher.ts.net/`, Grafana `http://grafana.silverside-gopher.ts.net/`, Prometheus `http://prometheus.silverside-gopher.ts.net:9090/`.
## Secrets ## Secrets
- Runtime secrets live in Doppler + External Secrets. - Runtime secrets are Doppler + External Secrets; Terraform/bootstrap/CI secrets stay in Gitea Actions secrets.
- Bootstrap and CI secrets stay in Gitea; never commit secrets, kubeconfigs, or private keys. - Never commit secrets, kubeconfigs, private keys, `terraform.tfvars`, or generated `outputs/` artifacts.
+287
View File
@@ -0,0 +1,287 @@
# App Repo Deployment Guide
This guide explains the recommended way to deploy an application to this cluster.
## Recommended Model
Use two repos:
- `HetznerTerra` (this repo): cluster, addons, shared infrastructure, Flux wiring
- `your-app-repo`: application source, Dockerfile, CI, Kubernetes manifests or Helm chart
Why:
- cluster lifecycle stays separate from app code
- app CI can build and tag images independently
- this repo remains the source of truth for what the cluster is allowed to deploy
## Current Cluster Assumptions
- Flux is already installed and reconciles this repo from `main`
- `clusters/prod/flux-system/kustomization-apps.yaml` points at `./apps`
- `apps` is suspended by default
- private access is through Tailscale
- runtime secrets should come from Doppler via External Secrets
## Deployment Options
### Option A: Separate app repo
Recommended for most real applications.
Flow:
1. App repo builds and pushes an image.
2. This repo defines a `GitRepository` pointing at the app repo.
3. This repo defines a `Kustomization` pointing at a path in the app repo.
4. Flux pulls the app repo and applies the manifests.
### Option B: In-repo app manifests
Only use this when the application is tiny or tightly coupled to the platform.
Flow:
1. Put Kubernetes manifests directly under `apps/` in this repo.
2. Unsuspend the top-level `apps` Kustomization.
This is simpler, but mixes platform and app changes together.
## App Repo Structure
Suggested layout:
```text
your-app-repo/
├── src/
├── Dockerfile
├── .gitea/workflows/
└── deploy/
├── base/
│ ├── namespace.yaml
│ ├── deployment.yaml
│ ├── service.yaml
│ ├── externalsecret.yaml
│ └── kustomization.yaml
└── prod/
├── kustomization.yaml
└── patch-*.yaml
```
If you prefer Helm, replace `deploy/base` and `deploy/prod` with a chart path and point Flux at that instead.
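If you take the Helm route, the attachment object in this repo becomes a `HelmRelease` that builds the chart from the app's `GitRepository` (defined later in this guide) instead of a `Kustomization`. A minimal sketch, assuming the chart lives at `deploy/chart` in the app repo:

```yaml
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: my-app
  namespace: flux-system
spec:
  interval: 10m
  targetNamespace: my-app
  chart:
    spec:
      chart: ./deploy/chart   # path inside the app repo (assumed)
      sourceRef:
        kind: GitRepository
        name: my-app
      interval: 5m
  values:
    image:
      tag: "1.0.0"            # illustrative only
```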
## What the App Repo Should Own
- application source code
- image build pipeline
- image tag strategy
- Deployment / Service / Ingress or Tailscale-facing Service manifests
- app-specific `ExternalSecret` manifests
- app-specific namespace
## What This Repo Should Own
- cluster-level permission to deploy the app
- the `GitRepository` and top-level `Kustomization` that attach the app repo to the cluster
- whether the `apps` layer is suspended or active
## Recommended First App Integration
In this repo, add Flux objects under `apps/` that point to the app repo.
Example files to add:
- `apps/gitrepository-my-app.yaml`
- `apps/kustomization-my-app.yaml`
- update `apps/kustomization.yaml`
Example `apps/gitrepository-my-app.yaml`:
```yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
name: my-app
namespace: flux-system
spec:
interval: 1m
ref:
branch: main
secretRef:
name: flux-system
url: ssh://git@<your-git-host>:<port>/<org>/<your-app-repo>.git
```
Example `apps/kustomization-my-app.yaml`:
```yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
name: my-app
namespace: flux-system
spec:
interval: 10m
prune: true
sourceRef:
kind: GitRepository
name: my-app
path: ./deploy/prod
wait: true
timeout: 5m
dependsOn:
- name: infrastructure
```
Then update `apps/kustomization.yaml`:
```yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- gitrepository-my-app.yaml
- kustomization-my-app.yaml
```
## App Secrets
Recommended path:
1. Put runtime values in Doppler.
2. In the app manifests, create an `ExternalSecret` that reads from `doppler-hetznerterra`.
3. Reference the resulting Kubernetes Secret from the Deployment.
Example app-side `ExternalSecret`:
```yaml
apiVersion: external-secrets.io/v1
kind: ExternalSecret
metadata:
name: my-app-env
namespace: my-app
spec:
refreshInterval: 1h
secretStoreRef:
name: doppler-hetznerterra
kind: ClusterSecretStore
target:
name: my-app-env
creationPolicy: Owner
data:
- secretKey: DATABASE_URL
remoteRef:
key: MY_APP_DATABASE_URL
```
## Image Delivery
Recommended flow:
1. App repo CI builds a container image.
2. CI pushes it to a registry.
3. The app repo updates the Kubernetes image tag in `deploy/prod`.
4. Flux notices the Git change and deploys it.
Keep the first version simple. Do not add image automation until the basic deploy path is proven.
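For step 3, a hedged sketch of what the app repo's CI might run after pushing the image; the registry, image name, and use of `kustomize edit` are assumptions about the app repo, not requirements of this one:

```bash
# Bump the image tag referenced by deploy/prod and commit it so Flux deploys it.
cd deploy/prod
kustomize edit set image \
  registry.example.com/my-app=registry.example.com/my-app:"${GITHUB_SHA::8}"
git add kustomization.yaml
git commit -m "deploy: my-app ${GITHUB_SHA::8}"
git push
```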
## Exposing the App
Pick one:
### Private app over Tailscale
Best fit for this cluster right now.
Create a Service like the existing Rancher/Grafana/Prometheus pattern:
```yaml
apiVersion: v1
kind: Service
metadata:
name: my-app-tailscale
namespace: my-app
annotations:
tailscale.com/hostname: my-app
tailscale.com/tags: "tag:prod"
tailscale.com/proxy-class: infra-stable
spec:
type: LoadBalancer
loadBalancerClass: tailscale
selector:
app.kubernetes.io/name: my-app
ports:
- name: http
port: 80
protocol: TCP
targetPort: 3000
```
Use `http://my-app.<your-tailnet>` or your chosen hostname.
### Cluster-internal only
Create only a `ClusterIP` Service.
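A minimal sketch, mirroring the selector and ports from the Tailscale example above:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-app
  namespace: my-app
spec:
  type: ClusterIP
  selector:
    app.kubernetes.io/name: my-app
  ports:
    - name: http
      port: 80
      protocol: TCP
      targetPort: 3000
```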
### Public ingress
Not recommended as the first app path in this repo. Get the private path working first.
## Enabling the Apps Layer
The cluster-wide `apps` Kustomization is suspended by default.
When you are ready to let Flux deploy app attachments from `apps/`, unsuspend it:
```bash
kubectl -n flux-system patch kustomization apps --type=merge -p '{"spec":{"suspend":false}}'
```
Or commit a change to `clusters/prod/flux-system/kustomization-apps.yaml` that flips:
```yaml
suspend: true
```
to:
```yaml
suspend: false
```
## First Deploy Checklist
Before deploying the first app, make sure:
1. app image builds successfully
2. app repo contains valid `deploy/prod` manifests
3. this repo contains the `GitRepository` + `Kustomization` attachment objects
4. required Doppler secrets exist
5. `apps` is unsuspended if you are using the top-level `apps` layer
## Verification Commands
From a machine with cluster access:
```bash
kubectl -n flux-system get gitrepositories,kustomizations
kubectl get ns
kubectl -n my-app get deploy,svc,pods,externalsecret,secret
```
If private over Tailscale:
```bash
kubectl -n my-app get svc my-app-tailscale -o wide
```
## Minimal Recommendation
If you want the simplest, lowest-risk first deploy:
1. create a separate app repo
2. add `deploy/base` + `deploy/prod`
3. add a `GitRepository` + `Kustomization` in this repo under `apps/`
4. keep the app private with a Tailscale `LoadBalancer` Service
5. use Doppler + `ExternalSecret` for runtime config
That matches the current cluster design with the least surprise.
+183 -328
@@ -1,296 +1,268 @@
# Hetzner Kubernetes Cluster # Proxmox Kubernetes Cluster
Production-ready Kubernetes cluster on Hetzner Cloud using Terraform and Ansible. Private HA K3s cluster on Proxmox, provisioned by Terraform, bootstrapped by Ansible, and reconciled by Flux.
## Architecture ## Architecture
| Component | Details | | Component | Current Baseline |
|-----------|---------| |-----------|------------------|
| **Control Plane** | 3x CX23 (HA) | | **Control plane** | 3 Proxmox VMs, VMIDs `200-202`, IPs `10.27.27.30-32`, 2 vCPU / 4 GiB / 32 GiB |
| **Workers** | 3x CX33 | | **Workers** | 5 Proxmox VMs, VMIDs `210-214`, IPs `10.27.27.41-45`, 4 vCPU / 8 GiB / 64 GiB |
| **K8s** | k3s (latest, HA) | | **Kubernetes** | K3s `v1.34.6+k3s1`, HA embedded etcd, kube-vip API VIP `10.27.27.40` |
| **Addons** | Hetzner CCM + CSI + Prometheus + Grafana + Loki | | **Proxmox** | Node `flex`, template VMID `9000`, datastore `Flash`, bridge `vmbr0` |
| **Access** | SSH/API and private services restricted to Tailnet | | **Storage** | Raw-manifest `nfs-subdir-external-provisioner`, `10.27.27.239:/TheFlash/k8s-nfs`, default StorageClass `flash-nfs` |
| **Bootstrap** | Terraform + Ansible + Flux | | **GitOps** | Flux source `platform` on branch `main`; `apps` Kustomization is intentionally suspended |
| **Private access** | Tailscale operator exposes Rancher, Grafana, and Prometheus; no public ingress baseline |
| **Runtime secrets** | Doppler service token bootstraps External Secrets Operator |
K3s is pinned because Rancher chart `2.13.3` requires Kubernetes `<1.35.0-0`.
## Prerequisites ## Prerequisites
### 1. Hetzner Cloud API Token - Terraform `>= 1.0`.
- Ansible with Python `jinja2` and `pyyaml`.
- `kubectl` for local verification.
- Proxmox API token for the `bpg/proxmox` provider.
- S3-compatible bucket for Terraform state, currently Backblaze B2.
- SSH key pair available to Terraform and Ansible, defaulting to `~/.ssh/infra` and `~/.ssh/infra.pub`.
1. Go to [Hetzner Cloud Console](https://console.hetzner.com/) Expected Proxmox inputs:
2. Select your project (or create a new one)
3. Navigate to **Security****API Tokens**
4. Click **Generate API Token**
5. Set description: `k8s-cluster-terraform`
6. Select permissions: **Read & Write**
7. Click **Generate API Token**
8. **Copy the token immediately** - it won't be shown again!
### 2. Backblaze B2 Bucket (for Terraform State) | Setting | Value |
|---------|-------|
| Endpoint | `https://100.105.0.115:8006/` |
| Node | `flex` |
| Clone source | Template VMID `9000` (`ubuntu-2404-k8s-template`) |
| Storage | `Flash` |
1. Go to [Backblaze B2](https://secure.backblaze.com/b2_buckets.htm) ## Local Setup
2. Click **Create a Bucket**
3. Set bucket name: `k8s-terraform-state` (must be globally unique)
4. Choose **Private** access
5. Click **Create Bucket**
6. Create application key:
- Go to **App Keys****Add a New Application Key**
- Name: `terraform-state`
- Allow access to: `k8s-terraform-state` bucket only
- Type: **Read and Write**
- Copy **keyID** (access key) and **applicationKey** (secret key)
7. Note your bucket's S3 endpoint (e.g., `https://s3.eu-central-003.backblazeb2.com`)
### 3. SSH Key Pair Create local variables from the example:
```bash
ssh-keygen -t ed25519 -C "k8s@hetzner" -f ~/.ssh/hetzner_k8s
```
### 4. Local Tools
- [Terraform](https://terraform.io/downloads) >= 1.0
- [Ansible](https://docs.ansible.com/ansible/latest/installation_guide/intro_installation.html) >= 2.9
- Python 3 with `jinja2` and `pyyaml`
## Setup
### 1. Clone Repository
```bash
git clone <your-gitea-repo>/HetznerTerra.git
cd HetznerTerra
```
### 2. Configure Variables
```bash ```bash
cp terraform.tfvars.example terraform.tfvars cp terraform.tfvars.example terraform.tfvars
``` ```
Edit `terraform.tfvars`: Important defaults in `terraform.tfvars.example`:
```hcl ```hcl
hcloud_token = "your-hetzner-api-token" proxmox_endpoint = "https://100.105.0.115:8006/"
proxmox_api_token_id = "terraform-prov@pve!k8s-cluster"
proxmox_api_token_secret = "your-proxmox-api-token-secret"
ssh_public_key = "~/.ssh/hetzner_k8s.pub" ssh_public_key = "~/.ssh/infra.pub"
ssh_private_key = "~/.ssh/hetzner_k8s" ssh_private_key = "~/.ssh/infra"
s3_access_key = "your-backblaze-key-id" s3_access_key = "your-backblaze-key-id"
s3_secret_key = "your-backblaze-application-key" s3_secret_key = "your-backblaze-application-key"
s3_endpoint = "https://s3.eu-central-003.backblazeb2.com" s3_endpoint = "https://s3.eu-central-003.backblazeb2.com"
s3_bucket = "k8s-terraform-state" s3_bucket = "k8s-terraform-state"
tailscale_auth_key = "tskey-auth-..." tailscale_tailnet = "yourtailnet.ts.net"
tailscale_tailnet = "yourtailnet.ts.net" kube_api_vip = "10.27.27.40"
restrict_api_ssh_to_tailnet = true
tailnet_cidr = "100.64.0.0/10"
enable_nodeport_public = false
allowed_ssh_ips = []
allowed_api_ips = []
``` ```
### 3. Initialize Terraform Initialize Terraform with backend credentials:
```bash ```bash
cd terraform terraform -chdir=terraform init \
-backend-config="endpoint=<s3-endpoint>" \
# Create backend config file (or use CLI args) -backend-config="bucket=<s3-bucket>" \
cat > backend.hcl << EOF -backend-config="region=auto" \
endpoint = "https://s3.eu-central-003.backblazeb2.com" -backend-config="access_key=<s3-access-key>" \
bucket = "k8s-terraform-state" -backend-config="secret_key=<s3-secret-key>" \
access_key = "your-backblaze-key-id" -backend-config="skip_requesting_account_id=true"
secret_key = "your-backblaze-application-key"
skip_requesting_account_id = true
EOF
terraform init -backend-config=backend.hcl
``` ```
### 4. Plan and Apply ## Common Commands
Terraform:
```bash ```bash
terraform plan -var-file=../terraform.tfvars terraform -chdir=terraform fmt -recursive
terraform apply -var-file=../terraform.tfvars terraform -chdir=terraform validate
terraform -chdir=terraform plan -var-file=../terraform.tfvars
terraform -chdir=terraform apply -var-file=../terraform.tfvars
``` ```
### 5. Generate Ansible Inventory Ansible setup:
```bash ```bash
cd ../ansible ansible-galaxy collection install -r ansible/requirements.yml
cd ansible
python3 generate_inventory.py python3 generate_inventory.py
ansible-playbook site.yml --syntax-check
``` ```
### 6. Bootstrap Cluster Manual Ansible bootstrap uses the same extra vars as the deploy workflow:
```bash ```bash
ansible-playbook site.yml cd ansible
ansible-playbook site.yml \
-e "tailscale_auth_key=$TAILSCALE_AUTH_KEY" \
-e "tailscale_tailnet=$TAILSCALE_TAILNET" \
-e "tailscale_oauth_client_id=$TAILSCALE_OAUTH_CLIENT_ID" \
-e "tailscale_oauth_client_secret=$TAILSCALE_OAUTH_CLIENT_SECRET" \
-e "doppler_hetznerterra_service_token=$DOPPLER_HETZNERTERRA_SERVICE_TOKEN" \
-e "tailscale_api_key=${TAILSCALE_API_KEY:-}" \
-e "grafana_admin_password=${GRAFANA_ADMIN_PASSWORD:-}" \
-e "cluster_name=k8s-cluster"
``` ```
### 7. Get Kubeconfig Flux/Kustomize verification:
```bash ```bash
kubectl kustomize infrastructure/addons/<addon>
kubectl kustomize infrastructure/addons
kubectl kustomize clusters/prod/flux-system
```
Refresh kubeconfig after rebuilds:
```bash
scripts/refresh-kubeconfig.sh 10.27.27.30
export KUBECONFIG=$(pwd)/outputs/kubeconfig export KUBECONFIG=$(pwd)/outputs/kubeconfig
kubectl get nodes kubectl get nodes
``` ```
Use `scripts/refresh-kubeconfig.sh <cp1-public-ip>` to refresh kubeconfig against the primary control-plane public IP after rebuilds. Run the tailnet smoke check from cp1:
```bash
ssh ubuntu@10.27.27.30 'bash -s' < scripts/smoke-check-tailnet-services.sh
```
## Gitea CI/CD ## Gitea CI/CD
This repository includes Gitea workflows for: The supported full rebuild path is the Gitea deploy workflow.
- **deploy**: End-to-end Terraform + Ansible + Flux bootstrap + restore + health checks | Workflow | Trigger | Purpose |
- **destroy**: Cluster teardown with backup-aware cleanup |----------|---------|---------|
- **dashboards**: Fast workflow that updates Grafana datasources/dashboards only | `.gitea/workflows/deploy.yml` | PR to `main`, push to `main`, manual dispatch | PRs run Terraform plan; pushes run Terraform apply, Ansible bootstrap, Flux bootstrap, addon gates, health checks, and tailnet smoke checks |
| `.gitea/workflows/destroy.yml` | Manual dispatch with `confirm: destroy` | Terraform destroy with retries; no Rancher backup gate |
| `.gitea/workflows/dashboards.yml` | Grafana content changes or manual dispatch | Fast Grafana datasource/dashboard update through `ansible/dashboards.yml` |
### Required Gitea Secrets Deploy and destroy share `concurrency.group: prod-cluster` so they do not run at the same time.
Set these in your Gitea repository settings (**Settings** → **Secrets****Actions**): Deploy sequence on push to `main`:
1. Terraform fmt/init/validate/plan/apply.
2. Cleanup/retry around known transient Proxmox clone and disk-update failures.
3. Generate Ansible inventory from Terraform outputs.
4. Prepare critical image archives with `skopeo` on the runner.
5. Run `ansible/site.yml` to bootstrap nodes, K3s, kube-vip, prerequisite secrets, and kubeconfig.
6. Apply Flux CRDs/controllers and the `clusters/prod/flux-system` graph.
7. Gate cert-manager, External Secrets, Tailscale, NFS, Rancher, and observability.
8. Run post-deploy health checks and Tailscale service smoke checks.
Required Gitea secrets:
| Secret | Description | | Secret | Description |
|--------|-------------| |--------|-------------|
| `HCLOUD_TOKEN` | Hetzner Cloud API token | | `PROXMOX_ENDPOINT` | Proxmox API endpoint, for example `https://100.105.0.115:8006/` |
| `S3_ACCESS_KEY` | Backblaze B2 keyID | | `PROXMOX_API_TOKEN_ID` | Proxmox API token ID |
| `S3_SECRET_KEY` | Backblaze B2 applicationKey | | `PROXMOX_API_TOKEN_SECRET` | Proxmox API token secret |
| `S3_ENDPOINT` | Backblaze S3 endpoint (e.g., `https://s3.eu-central-003.backblazeb2.com`) | | `S3_ACCESS_KEY` | S3/Backblaze access key for Terraform state |
| `S3_BUCKET` | S3 bucket name (e.g., `k8s-terraform-state`) | | `S3_SECRET_KEY` | S3/Backblaze secret key for Terraform state |
| `S3_ENDPOINT` | S3 endpoint, for example `https://s3.eu-central-003.backblazeb2.com` |
| `S3_BUCKET` | Terraform state bucket, for example `k8s-terraform-state` |
| `TAILSCALE_AUTH_KEY` | Tailscale auth key for node bootstrap | | `TAILSCALE_AUTH_KEY` | Tailscale auth key for node bootstrap |
| `TAILSCALE_TAILNET` | Tailnet domain (e.g., `yourtailnet.ts.net`) | | `TAILSCALE_TAILNET` | Tailnet domain, for example `silverside-gopher.ts.net` |
| `TAILSCALE_OAUTH_CLIENT_ID` | Tailscale OAuth client ID for Kubernetes Operator | | `TAILSCALE_OAUTH_CLIENT_ID` | Tailscale OAuth client ID for the Kubernetes operator |
| `TAILSCALE_OAUTH_CLIENT_SECRET` | Tailscale OAuth client secret for Kubernetes Operator | | `TAILSCALE_OAUTH_CLIENT_SECRET` | Tailscale OAuth client secret for the Kubernetes operator |
| `DOPPLER_HETZNERTERRA_SERVICE_TOKEN` | Doppler service token for `hetznerterra` runtime secrets | | `TAILSCALE_API_KEY` | Optional API key used to delete stale offline reserved devices before service proxies exist |
| `GRAFANA_ADMIN_PASSWORD` | Optional admin password for Grafana (auto-generated if unset) | | `DOPPLER_HETZNERTERRA_SERVICE_TOKEN` | Doppler service token for runtime cluster secrets |
| `RUNNER_ALLOWED_CIDRS` | Optional CIDR list for CI runner access if you choose to pass it via tfvars/secrets | | `GRAFANA_ADMIN_PASSWORD` | Optional Grafana admin password |
| `SSH_PUBLIC_KEY` | SSH public key content | | `SSH_PUBLIC_KEY` | SSH public key content |
| `SSH_PRIVATE_KEY` | SSH private key content | | `SSH_PRIVATE_KEY` | SSH private key content |
## GitOps (Flux) ## GitOps Graph
This repo uses Flux for continuous reconciliation after Terraform + Ansible bootstrap. Flux entrypoint:
### Stable private-only baseline ```text
clusters/prod/flux-system/
├── gotk-components.yaml
├── gitrepository-platform.yaml
├── kustomization-infrastructure.yaml
└── kustomization-apps.yaml # suspend: true
```
The current default target is the HA private baseline: Active infrastructure addons from `infrastructure/addons/kustomization.yaml`:
- `3` control plane nodes - `addon-nfs-storage`
- `3` worker nodes - `addon-external-secrets`
- private Hetzner network only - `addon-cert-manager`
- Tailscale for operator and service access - `addon-tailscale-operator`
- Flux-managed platform addons with `apps` suspended by default - `addon-tailscale-proxyclass`
- `traefik` HelmRelease manifests applied directly by the top-level infrastructure Kustomization
- `addon-observability`
- `addon-observability-content`
- `addon-rancher`
- `addon-rancher-config`
Detailed phase gates and success criteria live in `STABLE_BASELINE.md`. Chart/source strategy:
This is the default until rebuilds are consistently green. High availability, public ingress, and app-layer expansion come later. - Vendored charts are intentional: `cert-manager`, `traefik`, `kube-prometheus-stack`, `tailscale-operator`, and `rancher` live under `infrastructure/charts/`.
- External Secrets, Loki, and Promtail use Flux `OCIRepository` sources.
- NFS storage is raw Kubernetes manifests, not a Helm chart.
- Rancher backup/restore is not part of the current live graph.
### Runtime secrets Doppler bootstrap details:
Runtime cluster secrets are moving to Doppler + External Secrets Operator. - `ansible/roles/doppler-bootstrap` creates the `external-secrets` namespace and the Doppler token secret only.
- The deploy workflow creates `ClusterSecretStore/doppler-hetznerterra` after ESO CRDs and webhook endpoints exist.
- The checked-in `infrastructure/addons/external-secrets/clustersecretstore-doppler-hetznerterra.yaml` is not included by the addon kustomization.
- Doppler project: `hetznerterra` ## Access URLs
- Initial auth: service token via `DOPPLER_HETZNERTERRA_SERVICE_TOKEN`
- First synced secrets:
- `GRAFANA_ADMIN_PASSWORD`
Terraform/bootstrap secrets remain in Gitea Actions secrets and are not managed by Doppler. | Service | URL |
|---------|-----|
| Rancher | `https://rancher.silverside-gopher.ts.net/` |
| Grafana | `http://grafana.silverside-gopher.ts.net/` |
| Prometheus | `http://prometheus.silverside-gopher.ts.net:9090/` |
### Repository layout Fallback port-forward from a tailnet-connected machine:
- `clusters/prod/`: cluster entrypoint and Flux reconciliation objects
- `clusters/prod/flux-system/`: `GitRepository` source and top-level `Kustomization` graph
- `infrastructure/`: infrastructure addon reconciliation graph
- `infrastructure/addons/*`: per-addon manifests for Flux-managed cluster addons
- `apps/`: application workload layer (currently scaffolded)
### Reconciliation graph
- `infrastructure` (top-level)
- `addon-ccm`
- `addon-csi` depends on `addon-ccm`
- `addon-tailscale-operator`
- `addon-observability`
- `addon-observability-content` depends on `addon-observability`
- `apps` depends on `infrastructure`
### Bootstrap notes
1. Install Flux controllers in `flux-system`.
2. Create the Flux deploy key/secret named `flux-system` in `flux-system` namespace.
3. Apply `clusters/prod/flux-system/` once to establish source + reconciliation graph.
4. Bootstrap-only Ansible creates prerequisite secrets; Flux manages addon lifecycle after bootstrap.
### Current addon status
- Core infrastructure addons are Flux-managed from `infrastructure/addons/`.
- Active Flux addons for the current baseline: `addon-ccm`, `addon-csi`, `addon-cert-manager`, `addon-external-secrets`, `addon-tailscale-operator`, `addon-tailscale-proxyclass`, `addon-observability`, `addon-observability-content`, `addon-rancher`, `addon-rancher-config`, `addon-rancher-backup`, `addon-rancher-backup-config`.
- `apps` remains suspended until workload rollout is explicitly enabled.
- Ansible is limited to cluster bootstrap, prerequisite secret creation, pre-proxy Tailscale cleanup, and kubeconfig finalization.
- Weave GitOps / Flux UI is no longer deployed; use Rancher or the `flux` CLI for Flux operations.
### Rancher access
- Rancher is private-only and exposed through Tailscale at `https://rancher.silverside-gopher.ts.net/`.
- The public Hetzner load balancer path is not used for Rancher.
- Rancher stores state in embedded etcd; no external database is used.
### Stable baseline acceptance
A rebuild is considered successful only when all of the following pass without manual intervention:
- Terraform create succeeds for the default `3` control planes and `3` workers.
- Ansible bootstrap succeeds end-to-end.
- All nodes become `Ready`.
- Flux core reconciliation is healthy.
- External Secrets Operator is ready.
- Tailscale operator is ready.
- Tailnet smoke checks pass for Rancher, Grafana, and Prometheus.
- Terraform destroy succeeds cleanly or succeeds after workflow retries.
## Observability Stack
Flux deploys a lightweight observability stack in the `observability` namespace:
- `kube-prometheus-stack` (Prometheus + Grafana)
- `loki`
- `promtail`
Grafana content is managed as code via ConfigMaps in `infrastructure/addons/observability-content/`.
Grafana and Prometheus are exposed through dedicated Tailscale LoadBalancer services when the Tailscale Kubernetes Operator is healthy.
### Access Grafana and Prometheus
Preferred private access:
- Grafana: `http://grafana.silverside-gopher.ts.net/`
- Prometheus: `http://prometheus.silverside-gopher.ts.net:9090/`
Fallback (port-forward from a tailnet-connected machine):
Run from a tailnet-connected machine:
```bash ```bash
export KUBECONFIG=$(pwd)/outputs/kubeconfig export KUBECONFIG=$(pwd)/outputs/kubeconfig
kubectl -n observability port-forward svc/kube-prometheus-stack-grafana 3000:80 kubectl -n observability port-forward svc/kube-prometheus-stack-grafana 3000:80
kubectl -n observability port-forward svc/kube-prometheus-stack-prometheus 9090:9090 kubectl -n observability port-forward svc/kube-prometheus-stack-prometheus 9090:9090
``` ```
Then open: Grafana user is `admin`; password comes from the `GRAFANA_ADMIN_PASSWORD` Doppler secret or the workflow-provided fallback.
- Grafana: http://127.0.0.1:3000 ## Operations
- Prometheus: http://127.0.0.1:9090
Grafana user: `admin` Scale workers by updating `terraform.tfvars` counts, IP lists, and VMID lists together. If node names or VMIDs change, also update the hard-coded retry cleanup target map in `.gitea/workflows/deploy.yml`.
Grafana password: value of `GRAFANA_ADMIN_PASSWORD` secret (or the generated value shown by Ansible output)
### Verify Tailscale exposure Upgrade K3s by changing the role defaults in `ansible/roles/k3s-server/defaults/main.yml` and `ansible/roles/k3s-agent/defaults/main.yml`. Check Rancher chart compatibility before moving to a Kubernetes minor outside `<1.35.0-0`.
Destroy through the Gitea `Destroy` workflow with `confirm: destroy`, or locally with:
```bash ```bash
export KUBECONFIG=$(pwd)/outputs/kubeconfig terraform -chdir=terraform destroy -var-file=../terraform.tfvars
```
## Troubleshooting
Check K3s from cp1:
```bash
ssh ubuntu@10.27.27.30 'sudo k3s kubectl get nodes -o wide'
ssh ubuntu@10.27.27.30 'sudo journalctl -u k3s -n 120 --no-pager'
```
Check Flux and Rancher:
```bash
kubectl -n flux-system get gitrepositories,kustomizations,helmreleases,ocirepositories
kubectl -n flux-system describe helmrelease rancher
kubectl -n cattle-system get pods,deploy -o wide
```
Check Tailscale services:
```bash
kubectl -n tailscale-system get pods kubectl -n tailscale-system get pods
kubectl -n cattle-system get svc rancher-tailscale kubectl -n cattle-system get svc rancher-tailscale
kubectl -n observability get svc grafana-tailscale prometheus-tailscale kubectl -n observability get svc grafana-tailscale prometheus-tailscale
@@ -299,131 +271,14 @@ kubectl -n observability describe svc grafana-tailscale | grep TailscaleProxyRea
kubectl -n observability describe svc prometheus-tailscale | grep TailscaleProxyReady kubectl -n observability describe svc prometheus-tailscale | grep TailscaleProxyReady
``` ```
If `TailscaleProxyReady=False`, check: If local `kubectl` falls back to `localhost:8080`, refresh `outputs/kubeconfig` with `scripts/refresh-kubeconfig.sh 10.27.27.30`.
```bash
kubectl -n tailscale-system logs deployment/operator --tail=100
```
Common cause: OAuth client missing tag/scopes permissions.
### Fast dashboard iteration workflow
Use the `Deploy Grafana Content` workflow when changing dashboard/data source templates.
It avoids full cluster provisioning and only applies Grafana content resources:
- `ansible/roles/observability-content/templates/grafana-datasources.yaml.j2`
- `ansible/roles/observability-content/templates/grafana-dashboard-k8s-overview.yaml.j2`
- `ansible/dashboards.yml`
## File Structure
```
.
├── terraform/
│ ├── main.tf
│ ├── variables.tf
│ ├── network.tf
│ ├── firewall.tf
│ ├── ssh.tf
│ ├── servers.tf
│ ├── outputs.tf
│ └── backend.tf
├── ansible/
│ ├── inventory.tmpl
│ ├── generate_inventory.py
│ ├── site.yml
│ ├── roles/
│ │ ├── common/
│ │ ├── k3s-server/
│ │ ├── k3s-agent/
│ │ ├── addon-secrets-bootstrap/
│ │ ├── observability-content/
│ │ └── observability/
│ └── ansible.cfg
├── .gitea/
│ └── workflows/
│ ├── terraform.yml
│ ├── ansible.yml
│ └── dashboards.yml
├── outputs/
├── terraform.tfvars.example
└── README.md
```
## Firewall Rules
| Port | Source | Purpose |
|------|--------|---------|
| 22 | Tailnet CIDR | SSH |
| 6443 | Tailnet CIDR + internal | Kubernetes API |
| 41641/udp | Any | Tailscale WireGuard |
| 9345 | 10.0.0.0/16 | k3s Supervisor (HA join) |
| 2379 | 10.0.0.0/16 | etcd Client |
| 2380 | 10.0.0.0/16 | etcd Peer |
| 8472 | 10.0.0.0/16 | Flannel VXLAN |
| 10250 | 10.0.0.0/16 | Kubelet |
| 30000-32767 | Optional | NodePorts (disabled by default) |
## Operations
### Scale Workers
Edit `terraform.tfvars`:
```hcl
worker_count = 5
```
Then:
```bash
terraform apply
ansible-playbook site.yml
```
### Upgrade k3s
```bash
ansible-playbook site.yml -t upgrade
```
### Destroy Cluster
```bash
terraform destroy
```
## Troubleshooting
### Check k3s Logs
```bash
ssh root@<control-plane-ip> journalctl -u k3s -f
```
### Reset k3s
```bash
ansible-playbook site.yml -t reset
```
## Costs Breakdown
| Resource | Quantity | Unit Price | Monthly |
|----------|----------|------------|---------|
| CX23 (Control Plane) | 3 | €2.99 | €8.97 |
| CX33 (Workers) | 4 | €4.99 | €19.96 |
| Backblaze B2 | ~1 GB | Free (first 10GB) | €0.00 |
| **Total** | | | **€28.93/mo** |
## Security Notes ## Security Notes
- Control plane has HA (3 nodes, can survive 1 failure) - Never commit `terraform.tfvars`, kubeconfigs, private keys, `outputs/`, or real secret values.
- Consider adding Hetzner load balancer for API server - Terraform/bootstrap/CI secrets stay in Gitea Actions secrets.
- Rotate API tokens regularly - Runtime cluster secrets are sourced from Doppler through External Secrets.
- Use network policies in Kubernetes - This repo does not manage Proxmox/LAN firewalls or public ingress.
- Enable audit logging for production
## License ## License
+14 -7
@@ -1,6 +1,6 @@
# Gitea Secrets Setup # Gitea Secrets Setup
This document describes the secrets required for the HetznerTerra deployment workflow. This document describes the secrets required for the Proxmox-based deployment workflow.
## Required Secrets ## Required Secrets
@@ -9,10 +9,17 @@ Add these secrets in your Gitea repository settings:
### Infrastructure Secrets ### Infrastructure Secrets
#### `HCLOUD_TOKEN` #### `PROXMOX_ENDPOINT`
- Hetzner Cloud API token - Proxmox VE API endpoint
- Get from: https://console.hetzner.com/projects/{project-id}/security/api-tokens - Example: `https://100.105.0.115:8006/`
- Permissions: Read & Write
#### `PROXMOX_API_TOKEN_ID`
- Proxmox API token ID
- Example: `terraform-prov@pve!k8s-cluster`
#### `PROXMOX_API_TOKEN_SECRET`
- Proxmox API token secret
- Create with `pveum user token add terraform-prov@pve k8s-cluster`
#### `S3_ACCESS_KEY` & `S3_SECRET_KEY` #### `S3_ACCESS_KEY` & `S3_SECRET_KEY`
- Backblaze B2 credentials for Terraform state storage - Backblaze B2 credentials for Terraform state storage
@@ -31,7 +38,7 @@ Add these secrets in your Gitea repository settings:
#### `SSH_PRIVATE_KEY` & `SSH_PUBLIC_KEY` #### `SSH_PRIVATE_KEY` & `SSH_PUBLIC_KEY`
- SSH key pair for cluster access - SSH key pair for cluster access
- Generate with: `ssh-keygen -t ed25519 -C "k8s@hetzner" -f ~/.ssh/hetzner_k8s` - Generate with: `ssh-keygen -t ed25519 -C "k8s@proxmox" -f ~/.ssh/infra`
- Private key content (include BEGIN/END lines) - Private key content (include BEGIN/END lines)
- Public key content (full line starting with ssh-ed25519) - Public key content (full line starting with ssh-ed25519)
@@ -90,4 +97,4 @@ Check the workflow logs to verify all secrets are being used correctly.
- Prefer Doppler for runtime app/platform secrets after cluster bootstrap - Prefer Doppler for runtime app/platform secrets after cluster bootstrap
- Rotate Tailscale auth keys periodically - Rotate Tailscale auth keys periodically
- Review OAuth client permissions regularly - Review OAuth client permissions regularly
- The workflow automatically opens SSH/API access only for the runner's IP during deployment - CI expects direct SSH access to the Proxmox VMs and direct Proxmox API access
+12 -14
@@ -5,9 +5,9 @@ This document defines the current engineering target for this repository.
## Topology ## Topology
- 3 control planes (HA etcd cluster) - 3 control planes (HA etcd cluster)
- 3 workers - 5 workers
- Hetzner Load Balancer for Kubernetes API - kube-vip API VIP (`10.27.27.40`)
- private Hetzner network - private Proxmox/LAN network (`10.27.27.0/24`)
- Tailscale operator access and service exposure - Tailscale operator access and service exposure
- Rancher exposed through Tailscale (`rancher.silverside-gopher.ts.net`) - Rancher exposed through Tailscale (`rancher.silverside-gopher.ts.net`)
- Grafana exposed through Tailscale (`grafana.silverside-gopher.ts.net`) - Grafana exposed through Tailscale (`grafana.silverside-gopher.ts.net`)
@@ -17,11 +17,10 @@ This document defines the current engineering target for this repository.
## In Scope ## In Scope
- Terraform infrastructure bootstrap - Terraform infrastructure bootstrap
- Ansible k3s bootstrap with external cloud provider - Ansible k3s bootstrap on Ubuntu cloud-init VMs
- **HA control plane (3 nodes with etcd quorum)** - **HA control plane (3 nodes with etcd quorum)**
- **Hetzner Load Balancer for Kubernetes API** - **kube-vip for Kubernetes API HA**
- **Hetzner CCM deployed via Ansible (before workers join)** - **NFS-backed persistent volumes via `nfs-subdir-external-provisioner`**
- **Hetzner CSI for persistent volumes (via Flux)**
- Flux core reconciliation - Flux core reconciliation
- External Secrets Operator with Doppler - External Secrets Operator with Doppler
- Tailscale private access and smoke-check validation - Tailscale private access and smoke-check validation
@@ -45,15 +44,14 @@ This document defines the current engineering target for this repository.
## Phase Gates ## Phase Gates
1. Terraform apply completes for HA topology (3 CP, 3 workers, 1 LB). 1. Terraform apply completes for HA topology (3 CP, 5 workers, 1 VIP).
2. Load Balancer is healthy with all 3 control plane targets. 2. Primary control plane bootstraps with `--cluster-init`.
3. Primary control plane bootstraps with `--cluster-init`. 3. kube-vip advertises `10.27.27.40:6443` from the control-plane set.
4. Secondary control planes join via Load Balancer endpoint. 4. Secondary control planes join via the kube-vip endpoint.
5. **CCM deployed via Ansible before workers join** (fixes uninitialized taint issue). 5. Workers join successfully via the kube-vip endpoint.
6. Workers join successfully via Load Balancer and all nodes show proper `providerID`.
7. etcd reports 3 healthy members. 7. etcd reports 3 healthy members.
8. Flux source and infrastructure reconciliation are healthy. 8. Flux source and infrastructure reconciliation are healthy.
9. **CSI deploys and creates `hcloud-volumes` StorageClass**. 9. **NFS provisioner deploys and creates `flash-nfs` StorageClass**.
10. **PVC provisioning tested and working**. 10. **PVC provisioning tested and working**.
11. External Secrets sync required secrets. 11. External Secrets sync required secrets.
12. Tailscale private access works for Rancher, Grafana, and Prometheus. 12. Tailscale private access works for Rancher, Grafana, and Prometheus.
+2 -1
@@ -3,7 +3,8 @@ inventory = inventory.ini
host_key_checking = False host_key_checking = False
retry_files_enabled = False retry_files_enabled = False
roles_path = roles roles_path = roles
stdout_callback = yaml stdout_callback = default
result_format = yaml
interpreter_python = auto_silent interpreter_python = auto_silent
[privilege_escalation] [privilege_escalation]
+1 -2
@@ -13,8 +13,7 @@ control_plane
workers workers
[cluster:vars] [cluster:vars]
ansible_user=root ansible_user=ubuntu
ansible_python_interpreter=/usr/bin/python3 ansible_python_interpreter=/usr/bin/python3
ansible_ssh_private_key_file={{ private_key_file }} ansible_ssh_private_key_file={{ private_key_file }}
k3s_version=latest
kube_api_endpoint={{ kube_api_lb_ip }} kube_api_endpoint={{ kube_api_lb_ip }}
@@ -1,14 +1,4 @@
--- ---
- name: Apply Hetzner cloud secret
shell: >-
kubectl -n kube-system create secret generic hcloud
--from-literal=token='{{ hcloud_token }}'
--from-literal=network='{{ cluster_name }}-network'
--dry-run=client -o yaml | kubectl apply -f -
changed_when: true
no_log: true
when: hcloud_token | default('') | length > 0
- name: Ensure Tailscale operator namespace exists - name: Ensure Tailscale operator namespace exists
command: >- command: >-
kubectl create namespace {{ tailscale_operator_namespace | default('tailscale-system') }} kubectl create namespace {{ tailscale_operator_namespace | default('tailscale-system') }}
@@ -0,0 +1,12 @@
---
bootstrap_prepull_images:
- docker.io/rancher/mirrored-pause:3.6
- docker.io/rancher/mirrored-coredns-coredns:1.14.2
- docker.io/rancher/mirrored-metrics-server:v0.8.1
- docker.io/rancher/local-path-provisioner:v0.0.35
- docker.io/rancher/mirrored-library-traefik:3.6.10
- docker.io/rancher/klipper-helm:v0.9.14-build20260309
- ghcr.io/fluxcd/source-controller:v1.8.0
- ghcr.io/fluxcd/kustomize-controller:v1.8.1
- ghcr.io/fluxcd/helm-controller:v1.5.1
- ghcr.io/fluxcd/notification-controller:v1.8.1
@@ -0,0 +1,59 @@
---
- name: Check for runner-provided bootstrap image archives
stat:
path: "{{ playbook_dir }}/../outputs/bootstrap-image-archives/{{ item | regex_replace('[/:]', '_') }}.tar"
delegate_to: localhost
become: false
register: bootstrap_image_archive_stats
loop: "{{ bootstrap_prepull_images }}"
- name: Ensure remote bootstrap image archive directory exists
file:
path: /tmp/bootstrap-image-archives
state: directory
mode: "0755"
- name: Copy runner-provided bootstrap image archives
copy:
src: "{{ item.stat.path }}"
dest: "/tmp/bootstrap-image-archives/{{ item.item | regex_replace('[/:]', '_') }}.tar"
mode: "0644"
loop: "{{ bootstrap_image_archive_stats.results }}"
loop_control:
label: "{{ item.item }}"
when: item.stat.exists
- name: Import or pull bootstrap images into containerd
shell: |
if /usr/local/bin/ctr -n k8s.io images ls -q | grep -Fx -- "{{ item }}" >/dev/null; then
echo "already present"
exit 0
fi
archive="/tmp/bootstrap-image-archives/{{ item | regex_replace('[/:]', '_') }}.tar"
if [ -s "${archive}" ]; then
for attempt in 1 2 3; do
if /usr/local/bin/ctr -n k8s.io images import "${archive}" && /usr/local/bin/ctr -n k8s.io images ls -q | grep -Fx -- "{{ item }}" >/dev/null; then
echo "imported image"
exit 0
fi
sleep 10
done
fi
for attempt in 1 2 3 4 5; do
if timeout 180s /usr/local/bin/ctr -n k8s.io images pull "{{ item }}"; then
echo "pulled image"
exit 0
fi
sleep 10
done
exit 1
args:
executable: /bin/bash
register: bootstrap_image_pull
loop: "{{ bootstrap_prepull_images }}"
changed_when: "'imported image' in bootstrap_image_pull.stdout or 'pulled image' in bootstrap_image_pull.stdout"
-82
@@ -1,82 +0,0 @@
---
- name: Check if hcloud secret exists
command: kubectl -n kube-system get secret hcloud
register: hcloud_secret_check
changed_when: false
failed_when: false
- name: Fail if hcloud secret is missing
fail:
msg: "hcloud secret not found in kube-system namespace. CCM requires it."
when: hcloud_secret_check.rc != 0
- name: Check if helm is installed
command: which helm
register: helm_check
changed_when: false
failed_when: false
- name: Install helm
when: helm_check.rc != 0
block:
- name: Download helm install script
get_url:
url: https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
dest: /tmp/get-helm-3.sh
mode: "0755"
- name: Run helm install script
command: /tmp/get-helm-3.sh
args:
creates: /usr/local/bin/helm
- name: Add Hetzner Helm repository
kubernetes.core.helm_repository:
name: hcloud
repo_url: https://charts.hetzner.cloud
kubeconfig: /etc/rancher/k3s/k3s.yaml
environment:
KUBECONFIG: /etc/rancher/k3s/k3s.yaml
- name: Deploy Hetzner Cloud Controller Manager
kubernetes.core.helm:
name: hcloud-cloud-controller-manager
chart_ref: hcloud/hcloud-cloud-controller-manager
release_namespace: kube-system
create_namespace: true
values:
networking:
enabled: true
nodeSelector:
kubernetes.io/hostname: "{{ inventory_hostname }}"
additionalTolerations:
- key: node-role.kubernetes.io/control-plane
operator: Exists
effect: NoSchedule
kubeconfig: /etc/rancher/k3s/k3s.yaml
wait: true
wait_timeout: 300s
environment:
KUBECONFIG: /etc/rancher/k3s/k3s.yaml
- name: Wait for CCM to be ready
command: kubectl -n kube-system rollout status deployment/hcloud-cloud-controller-manager --timeout=120s
changed_when: false
register: ccm_rollout
until: ccm_rollout.rc == 0
retries: 3
delay: 10
- name: Pause to ensure CCM is fully ready to process new nodes
pause:
seconds: 10
- name: Verify CCM is removing uninitialized taints
command: kubectl get nodes -o jsonpath='{.items[*].spec.taints[?(@.key=="node.cloudprovider.kubernetes.io/uninitialized")].key}'
register: uninitialized_taints
changed_when: false
failed_when: false
- name: Display taint status
debug:
msg: "Nodes with uninitialized taint: {{ uninitialized_taints.stdout }}"
+52 -6
@@ -1,12 +1,32 @@
--- ---
- name: Check if cloud-init is installed
command: which cloud-init
register: cloud_init_binary
changed_when: false
failed_when: false
- name: Wait for cloud-init to finish first-boot tasks
command: cloud-init status --wait
register: cloud_init_wait
changed_when: false
failed_when: >-
cloud_init_wait.rc not in [0, 2] or
(
'status: done' not in cloud_init_wait.stdout and
'status: disabled' not in cloud_init_wait.stdout
)
when: cloud_init_binary.rc == 0
- name: Update apt cache - name: Update apt cache
apt: apt:
update_cache: true update_cache: true
cache_valid_time: 3600 cache_valid_time: 3600
lock_timeout: 600
- name: Upgrade packages - name: Upgrade packages
apt: apt:
upgrade: dist upgrade: dist
lock_timeout: 600
when: common_upgrade_packages | default(false) when: common_upgrade_packages | default(false)
- name: Install required packages - name: Install required packages
@@ -19,18 +39,27 @@
- lsb-release - lsb-release
- software-properties-common - software-properties-common
- jq - jq
- nfs-common
- htop - htop
- vim - vim
state: present state: present
lock_timeout: 600
- name: Check active swap
command: swapon --noheadings
register: active_swap
changed_when: false
failed_when: false
- name: Disable swap - name: Disable swap
command: swapoff -a command: swapoff -a
changed_when: true changed_when: true
when: active_swap.stdout | trim | length > 0
- name: Remove swap from fstab - name: Remove swap from fstab
mount: lineinfile:
name: swap path: /etc/fstab
fstype: swap regexp: '^\s*[^#]\S+\s+\S+\s+swap\s+.*$'
state: absent state: absent
- name: Load br_netfilter module - name: Load br_netfilter module
@@ -66,6 +95,10 @@
- name: Install tailscale - name: Install tailscale
shell: curl -fsSL https://tailscale.com/install.sh | sh shell: curl -fsSL https://tailscale.com/install.sh | sh
register: tailscale_install
until: tailscale_install.rc == 0
retries: 5
delay: 15
when: when:
- tailscale_auth_key | length > 0 - tailscale_auth_key | length > 0
- tailscale_binary.rc != 0 - tailscale_binary.rc != 0
@@ -78,9 +111,22 @@
failed_when: false failed_when: false
when: tailscale_auth_key | length > 0 when: tailscale_auth_key | length > 0
- name: Connect node to tailnet - name: Parse tailscale connection state
command: tailscale up --authkey {{ tailscale_auth_key }} --hostname {{ inventory_hostname }} --ssh={{ tailscale_ssh | ternary('true', 'false') }} --accept-routes={{ tailscale_accept_routes | ternary('true', 'false') }} set_fact:
tailscale_backend_state: "{{ (tailscale_status.stdout | from_json).BackendState | default('') }}"
when: when:
- tailscale_auth_key | length > 0 - tailscale_auth_key | length > 0
- tailscale_status.rc != 0 or '"BackendState":"Running"' not in tailscale_status.stdout - tailscale_status.rc == 0
- tailscale_status.stdout | length > 0
- name: Connect node to tailnet
command: tailscale up --authkey {{ tailscale_auth_key }} --hostname {{ inventory_hostname }} --ssh={{ tailscale_ssh | ternary('true', 'false') }} --accept-routes={{ tailscale_accept_routes | ternary('true', 'false') }}
register: tailscale_up
until: tailscale_up.rc == 0
retries: 5
delay: 15
no_log: true
when:
- tailscale_auth_key | length > 0
- tailscale_status.rc != 0 or (tailscale_backend_state | default('')) != 'Running'
changed_when: true changed_when: true
+3 -29
@@ -15,36 +15,10 @@
--from-literal=dopplerToken='{{ doppler_hetznerterra_service_token }}' --from-literal=dopplerToken='{{ doppler_hetznerterra_service_token }}'
--dry-run=client -o yaml | kubectl apply -f - --dry-run=client -o yaml | kubectl apply -f -
changed_when: true changed_when: true
no_log: true
- name: Check for ClusterSecretStore CRD
command: kubectl get crd clustersecretstores.external-secrets.io
register: doppler_clustersecretstore_crd
changed_when: false
failed_when: false
- name: Apply Doppler ClusterSecretStore
shell: |
cat <<'EOF' | kubectl apply -f -
apiVersion: external-secrets.io/v1
kind: ClusterSecretStore
metadata:
name: doppler-hetznerterra
spec:
provider:
doppler:
auth:
secretRef:
dopplerToken:
name: doppler-hetznerterra-service-token
key: dopplerToken
namespace: external-secrets
EOF
changed_when: true
when: doppler_clustersecretstore_crd.rc == 0
- name: Note pending Doppler ClusterSecretStore bootstrap - name: Note pending Doppler ClusterSecretStore bootstrap
debug: debug:
msg: >- msg: >-
Skipping Doppler ClusterSecretStore bootstrap because the External Secrets CRD Doppler service token secret is bootstrapped. The deploy workflow creates the
is not available yet. Re-run after External Secrets is installed. ClusterSecretStore after External Secrets CRDs and webhook endpoints are ready.
when: doppler_clustersecretstore_crd.rc != 0
+3 -2
@@ -1,6 +1,7 @@
--- ---
k3s_version: latest k3s_version: v1.34.6+k3s1
k3s_server_url: "" k3s_server_url: ""
k3s_token: "" k3s_token: ""
k3s_node_ip: "" k3s_node_ip: ""
k3s_kubelet_cloud_provider_external: true k3s_kubelet_cloud_provider_external: false
k3s_flannel_iface: "{{ ansible_default_ipv4.interface | default('eth0') }}"
+75 -30
@@ -1,19 +1,53 @@
--- ---
- name: Check if k3s agent is already installed - name: Check if k3s agent service exists
stat: stat:
path: /usr/local/bin/k3s-agent path: /etc/systemd/system/k3s-agent.service
register: k3s_agent_binary register: k3s_agent_service
- name: Check k3s agent service state
command: systemctl is-active k3s-agent
register: k3s_agent_service_state
changed_when: false
failed_when: false
when: k3s_agent_service.stat.exists
- name: Check installed k3s version
command: k3s --version
register: installed_k3s_version
changed_when: false
failed_when: false
when: k3s_agent_service.stat.exists
- name: Determine whether k3s agent install is needed
set_fact:
k3s_agent_install_needed: >-
{{
(not k3s_agent_service.stat.exists)
or ((k3s_agent_service_state.stdout | default('')) != 'active')
or (k3s_version != 'latest' and k3s_version not in (installed_k3s_version.stdout | default('')))
}}
- name: Download k3s install script - name: Download k3s install script
get_url: get_url:
url: https://get.k3s.io url: https://get.k3s.io
dest: /tmp/install-k3s.sh dest: /tmp/install-k3s.sh
mode: "0755" mode: "0755"
when: not k3s_agent_binary.stat.exists register: k3s_agent_install_script
until: k3s_agent_install_script is succeeded
retries: 5
delay: 10
when: k3s_agent_install_needed
- name: Install k3s agent - name: Install k3s agent
when: not k3s_agent_binary.stat.exists when: k3s_agent_install_needed
block: block:
- name: Wait for Kubernetes API endpoint before agent join
wait_for:
host: "{{ k3s_server_url | regex_replace('^https?://([^:/]+).*$', '\\1') }}"
port: 6443
state: started
timeout: 180
- name: Run k3s agent install - name: Run k3s agent install
environment: environment:
INSTALL_K3S_VERSION: "{{ k3s_version if k3s_version != 'latest' else '' }}" INSTALL_K3S_VERSION: "{{ k3s_version if k3s_version != 'latest' else '' }}"
@@ -22,32 +56,12 @@
command: >- command: >-
/tmp/install-k3s.sh agent /tmp/install-k3s.sh agent
--node-ip {{ k3s_node_ip }} --node-ip {{ k3s_node_ip }}
--flannel-iface=enp7s0 --flannel-iface={{ k3s_flannel_iface }}
{% if k3s_kubelet_cloud_provider_external | bool %}--kubelet-arg=cloud-provider=external{% endif %} {% if k3s_kubelet_cloud_provider_external | bool %}--kubelet-arg=cloud-provider=external{% endif %}
args: register: k3s_agent_install
creates: /usr/local/bin/k3s-agent until: k3s_agent_install.rc == 0
rescue: retries: 3
- name: Show k3s-agent service status after failed install delay: 20
command: systemctl status k3s-agent --no-pager
register: k3s_agent_status_after_install
changed_when: false
failed_when: false
- name: Show recent k3s-agent logs after failed install
command: journalctl -u k3s-agent -n 120 --no-pager
register: k3s_agent_journal_after_install
changed_when: false
failed_when: false
- name: Fail with k3s-agent diagnostics
fail:
msg: |
k3s agent install failed on {{ inventory_hostname }}.
Service status:
{{ k3s_agent_status_after_install.stdout | default('n/a') }}
Recent logs:
{{ k3s_agent_journal_after_install.stdout | default('n/a') }}
- name: Wait for k3s agent to be ready - name: Wait for k3s agent to be ready
command: systemctl is-active k3s-agent command: systemctl is-active k3s-agent
@@ -56,3 +70,34 @@
retries: 30 retries: 30
delay: 10 delay: 10
changed_when: false changed_when: false
- name: Show k3s-agent service status on failure
command: systemctl status k3s-agent --no-pager
register: k3s_agent_status
changed_when: false
failed_when: false
when: agent_status is failed
- name: Show recent k3s-agent logs on failure
command: journalctl -u k3s-agent -n 120 --no-pager
register: k3s_agent_journal
changed_when: false
failed_when: false
when: agent_status is failed
- name: Fail with k3s-agent diagnostics
fail:
msg: |
k3s agent failed to become ready on {{ inventory_hostname }}.
Install stdout:
{{ k3s_agent_install.stdout | default('n/a') }}
Install stderr:
{{ k3s_agent_install.stderr | default('n/a') }}
Service status:
{{ k3s_agent_status.stdout | default('n/a') }}
Recent logs:
{{ k3s_agent_journal.stdout | default('n/a') }}
when: agent_status is failed
+4 -3
@@ -1,11 +1,12 @@
--- ---
k3s_version: latest k3s_version: v1.34.6+k3s1
k3s_token: "" k3s_token: ""
k3s_node_ip: "" k3s_node_ip: ""
k3s_primary_public_ip: "" k3s_primary_public_ip: ""
k3s_disable_embedded_ccm: true k3s_disable_embedded_ccm: false
k3s_disable_servicelb: true k3s_disable_servicelb: true
k3s_kubelet_cloud_provider_external: true k3s_kubelet_cloud_provider_external: false
k3s_flannel_iface: "{{ ansible_default_ipv4.interface | default('eth0') }}"
# Load Balancer endpoint for HA cluster joins (set in inventory) # Load Balancer endpoint for HA cluster joins (set in inventory)
kube_api_endpoint: "" kube_api_endpoint: ""
# Tailscale DNS names for control planes (to enable tailnet access) # Tailscale DNS names for control planes (to enable tailnet access)
+26 -32
@@ -11,9 +11,21 @@
failed_when: false failed_when: false
when: k3s_service.stat.exists when: k3s_service.stat.exists
- name: Check installed k3s version
command: k3s --version
register: installed_k3s_version
changed_when: false
failed_when: false
when: k3s_service.stat.exists
- name: Determine whether k3s install is needed - name: Determine whether k3s install is needed
set_fact: set_fact:
k3s_install_needed: "{{ (not k3s_service.stat.exists) or ((k3s_service_state.stdout | default('')) != 'active') }}" k3s_install_needed: >-
{{
(not k3s_service.stat.exists)
or ((k3s_service_state.stdout | default('')) != 'active')
or (k3s_version != 'latest' and k3s_version not in (installed_k3s_version.stdout | default('')))
}}
- name: Wait for API endpoint on 6443 (secondary only) - name: Wait for API endpoint on 6443 (secondary only)
wait_for: wait_for:
@@ -50,6 +62,10 @@
url: https://get.k3s.io url: https://get.k3s.io
dest: /tmp/install-k3s.sh dest: /tmp/install-k3s.sh
mode: "0755" mode: "0755"
register: k3s_install_script
until: k3s_install_script is succeeded
retries: 5
delay: 10
when: k3s_install_needed when: k3s_install_needed
- name: Install k3s server (primary) - name: Install k3s server (primary)
@@ -61,7 +77,7 @@
--cluster-init --cluster-init
--advertise-address={{ k3s_primary_ip }} --advertise-address={{ k3s_primary_ip }}
--node-ip={{ k3s_node_ip }} --node-ip={{ k3s_node_ip }}
--flannel-iface=enp7s0 --flannel-iface={{ k3s_flannel_iface }}
--tls-san={{ k3s_primary_ip }} --tls-san={{ k3s_primary_ip }}
--tls-san={{ k3s_primary_public_ip }} --tls-san={{ k3s_primary_public_ip }}
--tls-san={{ kube_api_endpoint }} --tls-san={{ kube_api_endpoint }}
@@ -69,6 +85,10 @@
{% if k3s_disable_embedded_ccm | bool %}--disable-cloud-controller{% endif %} {% if k3s_disable_embedded_ccm | bool %}--disable-cloud-controller{% endif %}
{% if k3s_disable_servicelb | bool %}--disable=servicelb{% endif %} {% if k3s_disable_servicelb | bool %}--disable=servicelb{% endif %}
{% if k3s_kubelet_cloud_provider_external | bool %}--kubelet-arg=cloud-provider=external{% endif %} {% if k3s_kubelet_cloud_provider_external | bool %}--kubelet-arg=cloud-provider=external{% endif %}
register: primary_install
until: primary_install.rc == 0
retries: 3
delay: 20
when: when:
- k3s_install_needed - k3s_install_needed
- k3s_primary | default(false) - k3s_primary | default(false)
@@ -87,40 +107,14 @@
--server https://{{ k3s_join_endpoint | default(k3s_primary_ip) }}:6443 --server https://{{ k3s_join_endpoint | default(k3s_primary_ip) }}:6443
--advertise-address={{ k3s_node_ip }} --advertise-address={{ k3s_node_ip }}
--node-ip={{ k3s_node_ip }} --node-ip={{ k3s_node_ip }}
--flannel-iface=enp7s0 --flannel-iface={{ k3s_flannel_iface }}
{% if k3s_disable_embedded_ccm | bool %}--disable-cloud-controller{% endif %} {% if k3s_disable_embedded_ccm | bool %}--disable-cloud-controller{% endif %}
{% if k3s_disable_servicelb | bool %}--disable=servicelb{% endif %} {% if k3s_disable_servicelb | bool %}--disable=servicelb{% endif %}
{% if k3s_kubelet_cloud_provider_external | bool %}--kubelet-arg=cloud-provider=external{% endif %} {% if k3s_kubelet_cloud_provider_external | bool %}--kubelet-arg=cloud-provider=external{% endif %}
register: secondary_install register: secondary_install
until: secondary_install.rc == 0
rescue: retries: 3
- name: Show k3s service status after failed secondary install delay: 20
command: systemctl status k3s --no-pager
register: k3s_status_after_install
changed_when: false
failed_when: false
- name: Show recent k3s logs after failed secondary install
command: journalctl -u k3s -n 120 --no-pager
register: k3s_journal_after_install
changed_when: false
failed_when: false
- name: Fail with secondary install diagnostics
fail:
msg: |
Secondary k3s install failed on {{ inventory_hostname }}.
Install stdout:
{{ secondary_install.stdout | default('n/a') }}
Install stderr:
{{ secondary_install.stderr | default('n/a') }}
Service status:
{{ k3s_status_after_install.stdout | default('n/a') }}
Recent logs:
{{ k3s_journal_after_install.stdout | default('n/a') }}
- name: Wait for k3s to be ready - name: Wait for k3s to be ready
command: "{{ (k3s_primary | default(false)) | ternary('kubectl get nodes', 'systemctl is-active k3s') }}" command: "{{ (k3s_primary | default(false)) | ternary('kubectl get nodes', 'systemctl is-active k3s') }}"
@@ -0,0 +1,7 @@
---
kube_vip_version: v1.1.2
kube_vip_interface: "{{ ansible_default_ipv4.interface | default('eth0') }}"
kube_vip_address: "{{ kube_api_endpoint }}"
kube_vip_prepull_images:
- docker.io/rancher/mirrored-pause:3.6
- ghcr.io/kube-vip/kube-vip:{{ kube_vip_version }}
@@ -0,0 +1,102 @@
---
- name: Check for runner-provided kube-vip image archive
stat:
path: "{{ playbook_dir }}/../outputs/kube-vip-bootstrap.tar"
delegate_to: localhost
become: false
register: kube_vip_bootstrap_archive
- name: Copy runner-provided kube-vip image archive
copy:
src: "{{ playbook_dir }}/../outputs/kube-vip-bootstrap.tar"
dest: /tmp/kube-vip-bootstrap.tar
mode: "0644"
when: kube_vip_bootstrap_archive.stat.exists
- name: Import runner-provided kube-vip image archive
command: /usr/local/bin/ctr -n k8s.io images import /tmp/kube-vip-bootstrap.tar
changed_when: false
when: kube_vip_bootstrap_archive.stat.exists
- name: Pre-pull kube-vip bootstrap images into containerd
shell: |
if /usr/local/bin/ctr -n k8s.io images ls -q | grep -Fx -- "{{ item }}" >/dev/null; then
echo "already present"
exit 0
fi
for attempt in 1 2 3; do
if timeout 120s /usr/local/bin/ctr -n k8s.io images pull "{{ item }}"; then
echo "pulled image"
exit 0
fi
sleep 10
done
exit 1
args:
executable: /bin/bash
register: kube_vip_image_pull
loop: "{{ kube_vip_prepull_images }}"
changed_when: "'pulled image' in kube_vip_image_pull.stdout"
- name: Render kube-vip control plane manifest
template:
src: kube-vip-control-plane.yaml.j2
dest: /tmp/kube-vip-control-plane.yaml
mode: "0644"
- name: Apply kube-vip control plane manifest
command: kubectl apply -f /tmp/kube-vip-control-plane.yaml
register: kube_vip_apply
until: kube_vip_apply.rc == 0
retries: 3
delay: 10
changed_when: true
- name: Wait for local kube-vip pod to be ready
shell: >-
kubectl -n kube-system get pods
-l app.kubernetes.io/name=kube-vip
--field-selector spec.nodeName={{ inventory_hostname }}
-o jsonpath='{.items[0].status.conditions[?(@.type=="Ready")].status}'
register: kube_vip_pod_ready
changed_when: false
until: kube_vip_pod_ready.stdout == "True"
retries: 30
delay: 10
- name: Show kube-vip pod status on failure
command: kubectl -n kube-system get pods -l app.kubernetes.io/name=kube-vip -o wide
register: kube_vip_pods
changed_when: false
failed_when: false
when: kube_vip_pod_ready is failed
- name: Describe kube-vip pod on failure
shell: >-
kubectl -n kube-system describe pod
$(kubectl -n kube-system get pods -l app.kubernetes.io/name=kube-vip --field-selector spec.nodeName={{ inventory_hostname }} -o jsonpath='{.items[0].metadata.name}')
register: kube_vip_pod_describe
changed_when: false
failed_when: false
when: kube_vip_pod_ready is failed
- name: Fail with kube-vip diagnostics
fail:
msg: |
kube-vip failed to become ready on {{ inventory_hostname }}.
Pods:
{{ kube_vip_pods.stdout | default('n/a') }}
Describe:
{{ kube_vip_pod_describe.stdout | default('n/a') }}
when: kube_vip_pod_ready is failed
- name: Wait for API VIP on 6443
wait_for:
host: "{{ kube_vip_address }}"
port: 6443
state: started
timeout: 180
@@ -0,0 +1,110 @@
apiVersion: v1
kind: ServiceAccount
metadata:
name: kube-vip
namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: system:kube-vip-role
rules:
- apiGroups: [""]
resources: ["services/status"]
verbs: ["update"]
- apiGroups: [""]
resources: ["services", "endpoints"]
verbs: ["list", "get", "watch", "update"]
- apiGroups: [""]
resources: ["nodes"]
verbs: ["list", "get", "watch", "update", "patch"]
- apiGroups: ["coordination.k8s.io"]
resources: ["leases"]
verbs: ["list", "get", "watch", "update", "create"]
- apiGroups: ["discovery.k8s.io"]
resources: ["endpointslices"]
verbs: ["list", "get", "watch", "update"]
- apiGroups: [""]
resources: ["pods"]
verbs: ["list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: system:kube-vip-binding
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: system:kube-vip-role
subjects:
- kind: ServiceAccount
name: kube-vip
namespace: kube-system
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: kube-vip
namespace: kube-system
spec:
selector:
matchLabels:
app.kubernetes.io/name: kube-vip
template:
metadata:
labels:
app.kubernetes.io/name: kube-vip
spec:
serviceAccountName: kube-vip
hostNetwork: true
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: node-role.kubernetes.io/control-plane
operator: Exists
tolerations:
- key: node-role.kubernetes.io/control-plane
operator: Exists
effect: NoSchedule
- key: node-role.kubernetes.io/master
operator: Exists
effect: NoSchedule
containers:
- name: kube-vip
image: ghcr.io/kube-vip/kube-vip:{{ kube_vip_version }}
imagePullPolicy: IfNotPresent
args:
- manager
env:
- name: vip_arp
value: "true"
- name: port
value: "6443"
- name: vip_interface
value: {{ kube_vip_interface | quote }}
- name: vip_subnet
value: "32"
- name: cp_enable
value: "true"
- name: cp_namespace
value: kube-system
- name: vip_ddns
value: "false"
- name: vip_leaderelection
value: "true"
- name: vip_leaseduration
value: "5"
- name: vip_renewdeadline
value: "3"
- name: vip_retryperiod
value: "1"
- name: address
value: {{ kube_vip_address | quote }}
securityContext:
capabilities:
add:
- NET_ADMIN
- NET_RAW
- SYS_TIME
@@ -105,6 +105,11 @@
register: grafana_loki_labels register: grafana_loki_labels
changed_when: false changed_when: false
failed_when: false failed_when: false
until: >-
grafana_loki_labels.rc != 0 or
'"data":[]' not in (grafana_loki_labels.stdout | replace(' ', ''))
retries: 30
delay: 10
when: loki_enabled when: loki_enabled
- name: Fail when Loki is reachable but has zero indexed labels - name: Fail when Loki is reachable but has zero indexed labels
@@ -0,0 +1,6 @@
---
rancher_images_to_prepull:
- docker.io/rancher/rancher:v2.13.3
- docker.io/rancher/rancher-webhook:v0.9.3
- docker.io/rancher/system-upgrade-controller:v0.17.0
- docker.io/rancher/shell:v0.6.2
@@ -0,0 +1,59 @@
---
- name: Check for runner-provided Rancher image archives
stat:
path: "{{ playbook_dir }}/../outputs/bootstrap-image-archives/{{ item | regex_replace('[/:]', '_') }}.tar"
delegate_to: localhost
become: false
register: rancher_image_archive_stats
loop: "{{ rancher_images_to_prepull }}"
- name: Ensure remote Rancher image archive directory exists
file:
path: /tmp/bootstrap-image-archives
state: directory
mode: "0755"
- name: Copy runner-provided Rancher image archives
copy:
src: "{{ item.stat.path }}"
dest: "/tmp/bootstrap-image-archives/{{ item.item | regex_replace('[/:]', '_') }}.tar"
mode: "0644"
loop: "{{ rancher_image_archive_stats.results }}"
loop_control:
label: "{{ item.item }}"
when: item.stat.exists
- name: Import or pull Rancher images into containerd
shell: |
if /usr/local/bin/ctr -n k8s.io images ls -q | grep -Fx -- "{{ item }}" >/dev/null; then
echo "already present"
exit 0
fi
archive="/tmp/bootstrap-image-archives/{{ item | regex_replace('[/:]', '_') }}.tar"
if [ -s "${archive}" ]; then
for attempt in 1 2 3; do
if /usr/local/bin/ctr -n k8s.io images import "${archive}" && /usr/local/bin/ctr -n k8s.io images ls -q | grep -Fx -- "{{ item }}" >/dev/null; then
echo "imported image"
exit 0
fi
sleep 10
done
fi
for attempt in 1 2 3 4 5; do
if timeout 180s /usr/local/bin/ctr -n k8s.io images pull "{{ item }}"; then
echo "pulled image"
exit 0
fi
sleep 10
done
exit 1
args:
executable: /bin/bash
register: rancher_image_pull
loop: "{{ rancher_images_to_prepull }}"
changed_when: "'imported image' in rancher_image_pull.stdout or 'pulled image' in rancher_image_pull.stdout"
@@ -9,22 +9,26 @@
Authorization: "Bearer {{ tailscale_api_key }}" Authorization: "Bearer {{ tailscale_api_key }}"
return_content: true return_content: true
register: ts_devices register: ts_devices
until: ts_devices.status == 200
retries: 5
delay: 10
- name: Find stale devices matching reserved hostnames - name: Find stale devices matching reserved hostnames
set_fact: set_fact:
stale_devices: >- stale_devices: >-
{{ ts_devices.json.devices | default([]) {{ (ts_devices.json.devices | default([])
| selectattr('hostname', 'defined') | selectattr('hostname', 'defined')
| selectattr('hostname', 'in', tailscale_reserved_hostnames) | selectattr('hostname', 'in', tailscale_reserved_hostnames)
| rejectattr('online', 'defined') | selectattr('connectedToControl', 'defined')
| list | rejectattr('connectedToControl', 'equalto', true)
+ | list
ts_devices.json.devices | default([]) +
| selectattr('hostname', 'defined') ts_devices.json.devices | default([])
| selectattr('hostname', 'in', tailscale_reserved_hostnames) | selectattr('hostname', 'defined')
| selectattr('online', 'defined') | selectattr('hostname', 'in', tailscale_reserved_hostnames)
| rejectattr('online', 'equalto', true) | selectattr('online', 'defined')
| list }} | rejectattr('online', 'equalto', true)
| list) | unique(attribute='id') | list }}
- name: Delete stale devices - name: Delete stale devices
uri: uri:
@@ -33,6 +37,10 @@
headers:
Authorization: "Bearer {{ tailscale_api_key }}"
status_code: 200
+register: ts_delete_device
+until: ts_delete_device.status == 200
+retries: 3
+delay: 5
loop: "{{ stale_devices }}"
loop_control:
label: "{{ item.name }} ({{ item.id }})"
@@ -1,14 +1,26 @@
---
+- name: Clean up stale Tailscale cluster node devices
+hosts: localhost
+connection: local
+vars:
+tailscale_reserved_hostnames: "{{ groups['cluster'] | default([]) | list }}"
+roles:
+- tailscale-cleanup
- name: Bootstrap Kubernetes cluster
hosts: cluster
become: true
-gather_facts: true
+gather_facts: false
pre_tasks:
- name: Wait for SSH
wait_for_connection:
delay: 10
-timeout: 300
+timeout: 600
+- name: Gather facts after SSH is reachable
+setup:
roles:
- common
@@ -57,12 +69,24 @@
roles:
- addon-secrets-bootstrap
-- name: Deploy Hetzner CCM (required for workers with external cloud provider)
+- name: Deploy kube-vip for API HA
hosts: control_plane[0]
become: true
roles:
-- ccm-deploy
+- kube-vip-deploy
+- name: Wait for Kubernetes API VIP readiness
+hosts: control_plane[0]
+become: true
+tasks:
+- name: Wait for Kubernetes readyz through the VIP
+command: kubectl --server=https://{{ kube_api_endpoint }}:6443 get --raw=/readyz
+register: api_readyz
+until: api_readyz.rc == 0
+retries: 30
+delay: 10
+changed_when: false
- name: Setup secondary control planes
hosts: control_plane[1:]
@@ -80,6 +104,64 @@
roles:
- k3s-server
- name: Export kube-vip image from primary control plane
hosts: control_plane[0]
become: true
tasks:
- name: Export kube-vip image for secondary control planes
command: >-
/usr/local/bin/ctr -n k8s.io images export
/tmp/kube-vip-bootstrap.tar
ghcr.io/kube-vip/kube-vip:v1.1.2
changed_when: false
- name: Fetch kube-vip image archive
fetch:
src: /tmp/kube-vip-bootstrap.tar
dest: ../outputs/kube-vip-bootstrap.tar
flat: true
- name: Seed kube-vip image on secondary control planes
hosts: control_plane[1:]
become: true
tasks:
- name: Copy kube-vip image archive
copy:
src: ../outputs/kube-vip-bootstrap.tar
dest: /tmp/kube-vip-bootstrap.tar
mode: "0644"
- name: Import kube-vip image into containerd
command: /usr/local/bin/ctr -n k8s.io images import /tmp/kube-vip-bootstrap.tar
register: kube_vip_secondary_import
until: kube_vip_secondary_import.rc == 0
retries: 3
delay: 10
changed_when: false
- name: Wait for all control plane nodes to be Ready
hosts: control_plane[0]
become: true
tasks:
- name: Wait for control plane node readiness
command: kubectl wait --for=condition=Ready node/{{ item }} --timeout=30s
register: control_plane_ready
until: control_plane_ready.rc == 0
retries: 20
delay: 15
changed_when: false
loop: "{{ groups['control_plane'] }}"
- name: Wait for Kubernetes readyz before worker joins
command: kubectl --server=https://{{ kube_api_endpoint }}:6443 get --raw=/readyz
register: api_readyz_before_workers
until: api_readyz_before_workers.rc == 0
retries: 30
delay: 10
changed_when: false
- name: Setup workers
hosts: workers
become: true
@@ -93,6 +175,21 @@
roles:
- k3s-agent
- name: Pre-pull bootstrap control-plane images
hosts: control_plane[0]
become: true
roles:
- bootstrap-image-prepull
- name: Pre-pull Rancher bootstrap images
hosts: workers
become: true
roles:
- role: rancher-image-prepull
when: rancher_image_prepull_enabled | default(false) | bool
- name: Deploy observability stack
hosts: control_plane[0]
become: true
@@ -148,10 +245,16 @@
hosts: localhost
connection: local
tasks:
+- name: Check whether kubeconfig was fetched
+stat:
+path: ../outputs/kubeconfig
+register: kubeconfig_file
- name: Update kubeconfig server address
command: |
sed -i 's/127.0.0.1/{{ hostvars[groups["control_plane"][0]]["ansible_host"] }}/g' ../outputs/kubeconfig
changed_when: true
+when: kubeconfig_file.stat.exists
- name: Display success message
debug:
@@ -8,6 +8,10 @@ spec:
spec:
nodeSelector:
kubernetes.io/hostname: k8s-cluster-cp-1
+tolerations:
+- key: node-role.kubernetes.io/control-plane
+operator: Exists
+effect: NoSchedule
---
apiVersion: apps/v1
kind: Deployment
@@ -19,6 +23,10 @@ spec:
spec:
nodeSelector:
kubernetes.io/hostname: k8s-cluster-cp-1
+tolerations:
+- key: node-role.kubernetes.io/control-plane
+operator: Exists
+effect: NoSchedule
---
apiVersion: apps/v1
kind: Deployment
@@ -30,6 +38,10 @@ spec:
spec:
nodeSelector:
kubernetes.io/hostname: k8s-cluster-cp-1
+tolerations:
+- key: node-role.kubernetes.io/control-plane
+operator: Exists
+effect: NoSchedule
---
apiVersion: apps/v1
kind: Deployment
@@ -41,3 +53,7 @@ spec:
spec:
nodeSelector:
kubernetes.io/hostname: k8s-cluster-cp-1
+tolerations:
+- key: node-role.kubernetes.io/control-plane
+operator: Exists
+effect: NoSchedule
@@ -1,36 +0,0 @@
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
name: hcloud-cloud-controller-manager
namespace: flux-system
spec:
interval: 10m
targetNamespace: kube-system
chart:
spec:
chart: hcloud-cloud-controller-manager
version: 1.30.1
sourceRef:
kind: HelmRepository
name: hcloud
namespace: flux-system
install:
createNamespace: true
remediation:
retries: 3
upgrade:
remediation:
retries: 3
values:
selectorLabels:
app: hcloud-cloud-controller-manager
args:
secure-port: "0"
networking:
enabled: true
nodeSelector:
kubernetes.io/hostname: k8s-cluster-cp-1
additionalTolerations:
- key: node-role.kubernetes.io/control-plane
operator: Exists
effect: NoSchedule
@@ -1,8 +0,0 @@
apiVersion: source.toolkit.fluxcd.io/v1
kind: HelmRepository
metadata:
name: hcloud
namespace: flux-system
spec:
interval: 1h
url: https://charts.hetzner.cloud
@@ -1,5 +0,0 @@
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- helmrepository-hcloud.yaml
- helmrelease-hcloud-ccm.yaml
@@ -5,14 +5,14 @@ metadata:
namespace: flux-system
spec:
interval: 10m
+timeout: 15m
targetNamespace: cert-manager
chart:
spec:
-chart: cert-manager
-version: "v1.17.2"
+chart: ./infrastructure/charts/cert-manager
sourceRef:
-kind: HelmRepository
-name: jetstack
+kind: GitRepository
+name: platform
namespace: flux-system
install:
createNamespace: true
@@ -1,8 +0,0 @@
apiVersion: source.toolkit.fluxcd.io/v1
kind: HelmRepository
metadata:
name: jetstack
namespace: flux-system
spec:
interval: 1h
url: https://charts.jetstack.io
@@ -2,5 +2,4 @@ apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- namespace.yaml
-- helmrepository-cert-manager.yaml
- helmrelease-cert-manager.yaml
@@ -1,36 +0,0 @@
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
name: hcloud-csi
namespace: flux-system
spec:
interval: 10m
targetNamespace: kube-system
chart:
spec:
chart: hcloud-csi
version: 2.20.0
sourceRef:
kind: HelmRepository
name: hcloud
namespace: flux-system
install:
createNamespace: true
remediation:
retries: 3
upgrade:
remediation:
retries: 3
values:
controller:
nodeSelector:
kubernetes.io/hostname: k8s-cluster-cp-1
tolerations:
- key: node-role.kubernetes.io/control-plane
operator: Exists
effect: NoSchedule
hcloudVolumeDefaultLocation: nbg1
storageClasses:
- name: hcloud-volumes
defaultStorageClass: true
reclaimPolicy: Delete
@@ -1,8 +0,0 @@
apiVersion: source.toolkit.fluxcd.io/v1
kind: HelmRepository
metadata:
name: hcloud
namespace: flux-system
spec:
interval: 1h
url: https://charts.hetzner.cloud
@@ -1,5 +1,4 @@
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
-- backup-recurring.yaml
-- restore-from-b2.yaml
+- clustersecretstore-doppler-hetznerterra.yaml
@@ -6,14 +6,10 @@ metadata:
spec:
interval: 10m
targetNamespace: external-secrets
-chart:
-spec:
-chart: external-secrets
-version: 2.1.0
-sourceRef:
-kind: HelmRepository
-name: external-secrets
-namespace: flux-system
+chartRef:
+kind: OCIRepository
+name: external-secrets
+namespace: flux-system
install:
createNamespace: true
remediation:
@@ -23,13 +19,25 @@ spec:
retries: 3
values:
installCRDs: true
+image:
+repository: oci.external-secrets.io/external-secrets/external-secrets
+tag: v2.1.0
+pullPolicy: IfNotPresent
nodeSelector:
kubernetes.io/hostname: k8s-cluster-cp-1
webhook:
failurePolicy: Ignore
+image:
+repository: oci.external-secrets.io/external-secrets/external-secrets
+tag: v2.1.0
+pullPolicy: IfNotPresent
nodeSelector:
kubernetes.io/hostname: k8s-cluster-cp-1
certController:
+image:
+repository: oci.external-secrets.io/external-secrets/external-secrets
+tag: v2.1.0
+pullPolicy: IfNotPresent
nodeSelector:
kubernetes.io/hostname: k8s-cluster-cp-1
serviceMonitor:
@@ -1,8 +0,0 @@
apiVersion: source.toolkit.fluxcd.io/v1
kind: HelmRepository
metadata:
name: external-secrets
namespace: flux-system
spec:
interval: 1h
url: https://charts.external-secrets.io
@@ -2,5 +2,5 @@ apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- namespace.yaml
-- helmrepository-external-secrets.yaml
+- ocirepository-external-secrets.yaml
- helmrelease-external-secrets.yaml
@@ -0,0 +1,13 @@
apiVersion: source.toolkit.fluxcd.io/v1
kind: OCIRepository
metadata:
name: external-secrets
namespace: flux-system
spec:
interval: 10m
url: oci://ghcr.io/external-secrets/charts/external-secrets
ref:
tag: 2.1.0
layerSelector:
mediaType: application/vnd.cncf.helm.chart.content.v1.tar+gzip
operation: copy
@@ -1,15 +0,0 @@
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
name: addon-ccm
namespace: flux-system
spec:
interval: 10m
prune: true
sourceRef:
kind: GitRepository
name: platform
path: ./infrastructure/addons/ccm
wait: true
timeout: 10m
suspend: false
@@ -11,5 +11,5 @@ spec:
name: platform
path: ./infrastructure/addons/cert-manager
wait: true
-timeout: 10m
+timeout: 20m
suspend: false
@@ -1,17 +0,0 @@
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
name: addon-csi
namespace: flux-system
spec:
interval: 10m
prune: true
sourceRef:
kind: GitRepository
name: platform
path: ./infrastructure/addons/csi
dependsOn:
- name: addon-ccm
wait: true
timeout: 10m
suspend: false
@@ -0,0 +1,21 @@
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
name: addon-external-secrets-store
namespace: flux-system
spec:
interval: 10m
prune: true
sourceRef:
kind: GitRepository
name: platform
path: ./infrastructure/addons/external-secrets-store
dependsOn:
- name: addon-external-secrets
wait: false
healthChecks:
- apiVersion: external-secrets.io/v1
kind: ClusterSecretStore
name: doppler-hetznerterra
timeout: 5m
suspend: false
@@ -10,6 +10,19 @@ spec:
kind: GitRepository
name: platform
path: ./infrastructure/addons/external-secrets
-wait: true
-timeout: 5m
+wait: false
+healthChecks:
+- apiVersion: helm.toolkit.fluxcd.io/v2
+kind: HelmRelease
+name: external-secrets
+namespace: flux-system
+- apiVersion: apps/v1
+kind: Deployment
+name: external-secrets-external-secrets
+namespace: external-secrets
+- apiVersion: apps/v1
+kind: Deployment
+name: external-secrets-external-secrets-webhook
+namespace: external-secrets
+timeout: 10m
suspend: false
@@ -1,7 +1,7 @@
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
-name: addon-rancher-backup
+name: addon-nfs-storage
namespace: flux-system
spec:
interval: 10m
@@ -9,10 +9,12 @@ spec:
sourceRef:
kind: GitRepository
name: platform
-path: ./infrastructure/addons/rancher-backup
+path: ./infrastructure/addons/nfs-storage
wait: true
+healthChecks:
+- apiVersion: apps/v1
+kind: Deployment
+name: nfs-subdir-external-provisioner
+namespace: kube-system
timeout: 10m
suspend: false
-dependsOn:
-- name: addon-external-secrets
-- name: addon-rancher
@@ -0,0 +1,26 @@
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
name: addon-observability-secrets
namespace: flux-system
spec:
interval: 10m
prune: true
sourceRef:
kind: GitRepository
name: platform
path: ./infrastructure/addons/observability-secrets
dependsOn:
- name: addon-external-secrets-store
wait: false
healthChecks:
- apiVersion: external-secrets.io/v1
kind: ExternalSecret
name: grafana-admin
namespace: observability
- apiVersion: v1
kind: Secret
name: grafana-admin-credentials
namespace: observability
timeout: 5m
suspend: false
@@ -11,9 +11,23 @@ spec:
name: platform
path: ./infrastructure/addons/observability
dependsOn:
-- name: addon-external-secrets
+- name: addon-observability-secrets
+- name: addon-nfs-storage
- name: addon-tailscale-operator
- name: addon-tailscale-proxyclass
-wait: true
-timeout: 5m
+wait: false
+healthChecks:
+- apiVersion: helm.toolkit.fluxcd.io/v2
+kind: HelmRelease
+name: kube-prometheus-stack
+namespace: flux-system
+- apiVersion: helm.toolkit.fluxcd.io/v2
+kind: HelmRelease
+name: loki
+namespace: flux-system
+- apiVersion: helm.toolkit.fluxcd.io/v2
+kind: HelmRelease
+name: promtail
+namespace: flux-system
+timeout: 30m
suspend: false
@@ -1,16 +0,0 @@
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
name: addon-rancher-backup-config
namespace: flux-system
spec:
interval: 10m
prune: true
sourceRef:
kind: GitRepository
name: platform
path: ./infrastructure/addons/rancher-backup-config
timeout: 5m
suspend: false
dependsOn:
- name: addon-rancher-backup
@@ -13,5 +13,5 @@ spec:
dependsOn:
- name: addon-rancher
wait: true
-timeout: 5m
+timeout: 10m
suspend: false
@@ -0,0 +1,34 @@
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
name: addon-rancher-secrets
namespace: flux-system
spec:
interval: 10m
prune: true
sourceRef:
kind: GitRepository
name: platform
path: ./infrastructure/addons/rancher-secrets
dependsOn:
- name: addon-external-secrets-store
wait: false
healthChecks:
- apiVersion: external-secrets.io/v1
kind: ExternalSecret
name: rancher-bootstrap-password
namespace: flux-system
- apiVersion: v1
kind: Secret
name: rancher-bootstrap-password
namespace: flux-system
- apiVersion: external-secrets.io/v1
kind: ExternalSecret
name: rancher-bootstrap-password
namespace: cattle-system
- apiVersion: v1
kind: Secret
name: rancher-bootstrap-password
namespace: cattle-system
timeout: 5m
suspend: false
@@ -10,11 +10,32 @@ spec:
kind: GitRepository
name: platform
path: ./infrastructure/addons/rancher
-wait: true
-timeout: 15m
+timeout: 30m
suspend: false
dependsOn:
- name: addon-tailscale-operator
- name: addon-tailscale-proxyclass
-- name: addon-external-secrets
+- name: addon-rancher-secrets
- name: addon-cert-manager
+wait: false
+healthChecks:
+- apiVersion: helm.toolkit.fluxcd.io/v2
+kind: HelmRelease
+name: rancher
+namespace: flux-system
+- apiVersion: apps/v1
+kind: Deployment
+name: cattle-system-rancher
+namespace: cattle-system
+- apiVersion: apps/v1
+kind: Deployment
+name: rancher-webhook
+namespace: cattle-system
+- apiVersion: cert-manager.io/v1
+kind: Issuer
+name: cattle-system-rancher
+namespace: cattle-system
+- apiVersion: cert-manager.io/v1
+kind: Certificate
+name: tls-rancher-ingress
+namespace: cattle-system
@@ -10,6 +10,6 @@ spec:
kind: GitRepository
name: platform
path: ./infrastructure/addons/tailscale-operator
-wait: true
-timeout: 5m
+wait: false
+timeout: 10m
suspend: false
@@ -1,16 +1,16 @@
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
-- kustomization-ccm.yaml
-- kustomization-csi.yaml
+- kustomization-nfs-storage.yaml
- kustomization-external-secrets.yaml
+- kustomization-external-secrets-store.yaml
- kustomization-cert-manager.yaml
- kustomization-tailscale-operator.yaml
- kustomization-tailscale-proxyclass.yaml
- traefik
+- kustomization-observability-secrets.yaml
- kustomization-observability.yaml
- kustomization-observability-content.yaml
+- kustomization-rancher-secrets.yaml
- kustomization-rancher.yaml
- kustomization-rancher-config.yaml
-- kustomization-rancher-backup.yaml
-- kustomization-rancher-backup-config.yaml
@@ -0,0 +1,20 @@
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: nfs-subdir-external-provisioner-runner
rules:
- apiGroups: [""]
resources: ["nodes"]
verbs: ["get", "list", "watch"]
- apiGroups: [""]
resources: ["persistentvolumes"]
verbs: ["get", "list", "watch", "create", "delete"]
- apiGroups: [""]
resources: ["persistentvolumeclaims"]
verbs: ["get", "list", "watch", "update"]
- apiGroups: ["storage.k8s.io"]
resources: ["storageclasses"]
verbs: ["get", "list", "watch"]
- apiGroups: [""]
resources: ["events"]
verbs: ["create", "update", "patch"]
@@ -0,0 +1,12 @@
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: run-nfs-subdir-external-provisioner
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: nfs-subdir-external-provisioner-runner
subjects:
- kind: ServiceAccount
name: nfs-subdir-external-provisioner
namespace: kube-system
@@ -0,0 +1,41 @@
apiVersion: apps/v1
kind: Deployment
metadata:
name: nfs-subdir-external-provisioner
namespace: kube-system
spec:
replicas: 1
selector:
matchLabels:
app: nfs-subdir-external-provisioner
template:
metadata:
labels:
app: nfs-subdir-external-provisioner
spec:
serviceAccountName: nfs-subdir-external-provisioner
nodeSelector:
kubernetes.io/hostname: k8s-cluster-cp-1
tolerations:
- key: node-role.kubernetes.io/control-plane
operator: Exists
effect: NoSchedule
containers:
- name: nfs-subdir-external-provisioner
image: registry.k8s.io/sig-storage/nfs-subdir-external-provisioner:v4.0.2
imagePullPolicy: IfNotPresent
env:
- name: PROVISIONER_NAME
value: flash-nfs
- name: NFS_SERVER
value: 10.27.27.239
- name: NFS_PATH
value: /TheFlash/k8s-nfs
volumeMounts:
- name: nfs-subdir-external-provisioner-root
mountPath: /persistentvolumes
volumes:
- name: nfs-subdir-external-provisioner-root
nfs:
server: 10.27.27.239
path: /TheFlash/k8s-nfs
@@ -0,0 +1,10 @@
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- serviceaccount-nfs-subdir-external-provisioner.yaml
- clusterrole-nfs-subdir-external-provisioner.yaml
- clusterrolebinding-nfs-subdir-external-provisioner.yaml
- role-nfs-subdir-external-provisioner.yaml
- rolebinding-nfs-subdir-external-provisioner.yaml
- storageclass-flash-nfs.yaml
- deployment-nfs-subdir-external-provisioner.yaml
@@ -0,0 +1,9 @@
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
name: leader-locking-nfs-subdir-external-provisioner
namespace: kube-system
rules:
- apiGroups: [""]
resources: ["endpoints"]
verbs: ["get", "list", "watch", "create", "update", "patch"]
@@ -0,0 +1,13 @@
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: leader-locking-nfs-subdir-external-provisioner
namespace: kube-system
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: Role
name: leader-locking-nfs-subdir-external-provisioner
subjects:
- kind: ServiceAccount
name: nfs-subdir-external-provisioner
namespace: kube-system
@@ -0,0 +1,5 @@
apiVersion: v1
kind: ServiceAccount
metadata:
name: nfs-subdir-external-provisioner
namespace: kube-system
@@ -0,0 +1,12 @@
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: flash-nfs
annotations:
storageclass.kubernetes.io/is-default-class: "true"
provisioner: flash-nfs
parameters:
archiveOnDelete: "true"
reclaimPolicy: Delete
allowVolumeExpansion: true
volumeBindingMode: Immediate
@@ -1,5 +1,5 @@
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
-- helmrepository-hcloud.yaml
-- helmrelease-hcloud-csi.yaml
+- namespace.yaml
+- grafana-admin-externalsecret.yaml
@@ -5,14 +5,14 @@ metadata:
namespace: flux-system
spec:
interval: 10m
+timeout: 15m
targetNamespace: observability
chart:
spec:
-chart: kube-prometheus-stack
-version: 68.4.4
+chart: ./infrastructure/charts/kube-prometheus-stack
sourceRef:
-kind: HelmRepository
-name: prometheus-community
+kind: GitRepository
+name: platform
namespace: flux-system
install:
createNamespace: true
@@ -21,6 +21,7 @@ spec:
upgrade:
remediation:
retries: 3
+strategy: uninstall
values:
grafana:
enabled: true
@@ -6,14 +6,10 @@ metadata:
spec:
interval: 10m
targetNamespace: observability
-chart:
-spec:
-chart: loki
-version: 6.10.0
-sourceRef:
-kind: HelmRepository
-name: grafana
-namespace: flux-system
+chartRef:
+kind: OCIRepository
+name: loki
+namespace: flux-system
install:
createNamespace: true
remediation:
@@ -50,7 +46,7 @@ spec:
replicas: 1
persistence:
size: 10Gi
-storageClass: local-path
+storageClass: flash-nfs
resources:
requests:
cpu: 100m
@@ -87,11 +83,11 @@ spec:
test:
enabled: false
chunksCache:
-enabled: true
-allocatedMemory: 128
+enabled: false
resultsCache:
-enabled: true
-allocatedMemory: 128
+enabled: false
+lokiCanary:
+enabled: false
monitoring:
selfMonitoring:
enabled: false
@@ -5,15 +5,12 @@ metadata:
namespace: flux-system
spec:
interval: 10m
+timeout: 20m
targetNamespace: observability
-chart:
-spec:
-chart: promtail
-version: 6.16.6
-sourceRef:
-kind: HelmRepository
-name: grafana
-namespace: flux-system
+chartRef:
+kind: OCIRepository
+name: promtail
+namespace: flux-system
install:
createNamespace: true
remediation:
@@ -22,6 +19,8 @@ spec:
remediation:
retries: 3
values:
+image:
+pullPolicy: IfNotPresent
config:
clients:
-- url: http://loki.observability.svc.cluster.local:3100/loki/api/v1/push
+- url: http://observability-loki.observability.svc.cluster.local:3100/loki/api/v1/push
@@ -1,8 +0,0 @@
apiVersion: source.toolkit.fluxcd.io/v1
kind: HelmRepository
metadata:
name: grafana
namespace: flux-system
spec:
interval: 1h
url: https://grafana.github.io/helm-charts
@@ -1,8 +0,0 @@
apiVersion: source.toolkit.fluxcd.io/v1
kind: HelmRepository
metadata:
name: prometheus-community
namespace: flux-system
spec:
interval: 1h
url: https://prometheus-community.github.io/helm-charts
@@ -1,10 +1,8 @@
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
-- namespace.yaml
-- grafana-admin-externalsecret.yaml
-- helmrepository-prometheus-community.yaml
-- helmrepository-grafana.yaml
+- ocirepository-loki.yaml
+- ocirepository-promtail.yaml
- helmrelease-kube-prometheus-stack.yaml
- helmrelease-loki.yaml
- helmrelease-promtail.yaml
@@ -0,0 +1,13 @@
apiVersion: source.toolkit.fluxcd.io/v1
kind: OCIRepository
metadata:
name: loki
namespace: flux-system
spec:
interval: 10m
url: oci://ghcr.io/grafana/helm-charts/loki
ref:
tag: 6.46.0
layerSelector:
mediaType: application/vnd.cncf.helm.chart.content.v1.tar+gzip
operation: copy
@@ -0,0 +1,13 @@
apiVersion: source.toolkit.fluxcd.io/v1
kind: OCIRepository
metadata:
name: promtail
namespace: flux-system
spec:
interval: 10m
url: oci://ghcr.io/grafana/helm-charts/promtail
ref:
tag: 6.16.6
layerSelector:
mediaType: application/vnd.cncf.helm.chart.content.v1.tar+gzip
operation: copy
@@ -1,17 +0,0 @@
apiVersion: resources.cattle.io/v1
kind: Backup
metadata:
name: rancher-b2-recurring
namespace: cattle-resources-system
spec:
resourceSetName: rancher-resource-set-full
storageLocation:
s3:
credentialSecretName: rancher-b2-creds
credentialSecretNamespace: cattle-resources-system
bucketName: HetznerTerra
folder: rancher-backups
endpoint: s3.us-east-005.backblazeb2.com
region: us-east-005
schedule: "0 3 * * *"
retentionCount: 7
@@ -1,19 +0,0 @@
# Uncomment and set backupFilename to restore from a specific backup on rebuild.
# Find the latest backup filename in B2: rancher-backups/ folder.
# After restore succeeds, Rancher will have all users/settings from the backup.
#
# apiVersion: resources.cattle.io/v1
# kind: Restore
# metadata:
# name: restore-from-b2
# namespace: cattle-resources-system
# spec:
# backupFilename: rancher-b2-manual-test-0a416444-2c8a-4d34-8a07-d9e406750374-2026-03-30T00-08-02Z.tar.gz
# storageLocation:
# s3:
# credentialSecretName: rancher-b2-creds
# credentialSecretNamespace: cattle-resources-system
# bucketName: HetznerTerra
# folder: rancher-backups
# endpoint: s3.us-east-005.backblazeb2.com
# region: us-east-005
@@ -1,25 +0,0 @@
apiVersion: external-secrets.io/v1
kind: ExternalSecret
metadata:
name: rancher-b2-creds
namespace: cattle-resources-system
spec:
refreshInterval: 1h
secretStoreRef:
name: doppler-hetznerterra
kind: ClusterSecretStore
target:
name: rancher-b2-creds
creationPolicy: Owner
template:
type: Opaque
data:
accessKey: "{{ .B2_ACCOUNT_ID }}"
secretKey: "{{ .B2_APPLICATION_KEY }}"
data:
- secretKey: B2_ACCOUNT_ID
remoteRef:
key: B2_ACCOUNT_ID
- secretKey: B2_APPLICATION_KEY
remoteRef:
key: B2_APPLICATION_KEY
@@ -1,23 +0,0 @@
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
name: rancher-backup-crd
namespace: flux-system
spec:
interval: 10m
targetNamespace: cattle-resources-system
chart:
spec:
chart: rancher-backup-crd
version: "106.0.2+up8.1.0"
sourceRef:
kind: HelmRepository
name: rancher-charts
namespace: flux-system
install:
createNamespace: true
remediation:
retries: 3
upgrade:
remediation:
retries: 3
@@ -1,42 +0,0 @@
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
name: rancher-backup
namespace: flux-system
spec:
interval: 10m
targetNamespace: cattle-resources-system
dependsOn:
- name: rancher-backup-crd
chart:
spec:
chart: rancher-backup
version: "106.0.2+up8.1.0"
sourceRef:
kind: HelmRepository
name: rancher-charts
namespace: flux-system
install:
createNamespace: true
remediation:
retries: 3
upgrade:
remediation:
retries: 3
values:
image:
repository: rancher/backup-restore-operator
kubectl:
image:
repository: rancher/kubectl
tag: "v1.34.0"
postRenderers:
- kustomize:
patches:
- target:
kind: Job
name: rancher-backup-patch-sa
patch: |
- op: replace
path: /spec/template/spec/containers/0/image
value: rancher/kubectl:v1.34.0
@@ -1,8 +0,0 @@
apiVersion: source.toolkit.fluxcd.io/v1
kind: HelmRepository
metadata:
name: rancher-charts
namespace: flux-system
spec:
interval: 1h
url: https://charts.rancher.io
@@ -1,8 +0,0 @@
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- namespace.yaml
- helmrepository-rancher-backup.yaml
- helmrelease-rancher-backup-crd.yaml
- helmrelease-rancher-backup.yaml
- b2-credentials-externalsecret.yaml
@@ -1,4 +0,0 @@
apiVersion: v1
kind: Namespace
metadata:
name: cattle-resources-system
@@ -0,0 +1,6 @@
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- namespace.yaml
- rancher-bootstrap-password-flux-externalsecret.yaml
- rancher-bootstrap-password-externalsecret.yaml
@@ -5,14 +5,14 @@ metadata:
namespace: flux-system
spec:
interval: 10m
+timeout: 15m
targetNamespace: cattle-system
chart:
spec:
-chart: rancher
-version: "2.13.3"
+chart: ./infrastructure/charts/rancher
sourceRef:
-kind: HelmRepository
-name: rancher-stable
+kind: GitRepository
+name: platform
namespace: flux-system
install:
createNamespace: true
@@ -23,10 +23,18 @@ spec:
retries: 3
values:
hostname: rancher.silverside-gopher.ts.net
+systemDefaultRegistry: registry.rancher.com
replicas: 1
extraEnv:
- name: CATTLE_PROMETHEUS_METRICS
value: "true"
+- name: CATTLE_FEATURES
+value: "managed-system-upgrade-controller=false"
+webhook:
+image:
+repository: rancher/rancher-webhook
+tag: v0.9.3
+imagePullPolicy: IfNotPresent
resources:
requests:
cpu: 500m
@@ -34,6 +42,10 @@ spec:
limits:
cpu: 1000m
memory: 1Gi
+startupProbe:
+timeoutSeconds: 5
+periodSeconds: 10
+failureThreshold: 60
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
@@ -1,8 +0,0 @@
apiVersion: source.toolkit.fluxcd.io/v1
kind: HelmRepository
metadata:
name: rancher-stable
namespace: flux-system
spec:
interval: 1h
url: https://releases.rancher.com/server-charts/stable
@@ -1,9 +1,5 @@
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
-- namespace.yaml
-- helmrepository-rancher.yaml
- helmrelease-rancher.yaml
-- rancher-bootstrap-password-flux-externalsecret.yaml
-- rancher-bootstrap-password-externalsecret.yaml
- rancher-tailscale-service.yaml
@@ -8,11 +8,10 @@ spec:
targetNamespace: tailscale-system
chart:
spec:
-chart: tailscale-operator
-version: 1.96.5
+chart: ./infrastructure/charts/tailscale-operator
sourceRef:
-kind: HelmRepository
-name: tailscale
+kind: GitRepository
+name: platform
namespace: flux-system
install:
createNamespace: true
@@ -28,6 +27,10 @@ spec:
operatorConfig:
defaultTags:
- tag:k8s
+image:
+repository: ghcr.io/tailscale/k8s-operator
+tag: v1.96.5
+pullPolicy: IfNotPresent
nodeSelector:
kubernetes.io/hostname: k8s-cluster-cp-1
tolerations:
@@ -37,3 +40,6 @@ spec:
proxyConfig:
defaultTags: tag:k8s
defaultProxyClass: infra-stable
+image:
+repository: ghcr.io/tailscale/tailscale
+tag: v1.96.5
@@ -1,8 +0,0 @@
apiVersion: source.toolkit.fluxcd.io/v1
kind: HelmRepository
metadata:
name: tailscale
namespace: flux-system
spec:
interval: 1h
url: https://pkgs.tailscale.com/helmcharts
@@ -2,5 +2,4 @@ apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- namespace.yaml
-- helmrepository-tailscale.yaml
- helmrelease-tailscale-operator.yaml
@@ -8,11 +8,10 @@ spec:
targetNamespace: kube-system
chart:
spec:
-chart: traefik
-version: "39.0.0"
+chart: ./infrastructure/charts/traefik
sourceRef:
-kind: HelmRepository
-name: traefik
+kind: GitRepository
+name: platform
namespace: flux-system
install:
createNamespace: true
@@ -1,9 +0,0 @@
apiVersion: source.toolkit.fluxcd.io/v1
kind: HelmRepository
metadata:
name: traefik
namespace: flux-system
spec:
interval: 10m
url: https://traefik.github.io/charts
provider: generic
@@ -1,5 +1,4 @@
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
-- helmrepository-traefik.yaml
- helmrelease-traefik.yaml
@@ -0,0 +1,26 @@
annotations:
artifacthub.io/category: security
artifacthub.io/license: Apache-2.0
artifacthub.io/prerelease: "false"
artifacthub.io/signKey: |
fingerprint: 1020CF3C033D4F35BAE1C19E1226061C665DF13E
url: https://cert-manager.io/public-keys/cert-manager-keyring-2021-09-20-1020CF3C033D4F35BAE1C19E1226061C665DF13E.gpg
apiVersion: v2
appVersion: v1.17.2
description: A Helm chart for cert-manager
home: https://cert-manager.io
icon: https://raw.githubusercontent.com/cert-manager/community/4d35a69437d21b76322157e6284be4cd64e6d2b7/logo/logo-small.png
keywords:
- cert-manager
- kube-lego
- letsencrypt
- tls
kubeVersion: '>= 1.22.0-0'
maintainers:
- email: cert-manager-maintainers@googlegroups.com
name: cert-manager-maintainers
url: https://cert-manager.io
name: cert-manager
sources:
- https://github.com/cert-manager/cert-manager
version: v1.17.2
File diff suppressed because it is too large.

Some files were not shown because too many files have changed in this diff.