Commit Graph

335 Commits

Author SHA1 Message Date
micqdf d0be48b65c fix: gate Tailscale addon on Helm release
Deploy Cluster / Terraform (push) Successful in 32s
Deploy Cluster / Ansible (push) Failing after 36m36s
2026-04-25 21:21:34 +00:00
micqdf 40647318b4 fix: tolerate cached Helm repository artifacts
Deploy Cluster / Terraform (push) Successful in 32s
Deploy Cluster / Ansible (push) Failing after 29m36s
2026-04-25 20:44:03 +00:00
micqdf cdb26904d2 fix: retry Tailscale chart pulls during bootstrap
Deploy Cluster / Terraform (push) Successful in 32s
Deploy Cluster / Ansible (push) Failing after 27m40s
2026-04-25 20:11:43 +00:00
micqdf 3c06e046c2 fix: warm External Secrets image before install
Deploy Cluster / Terraform (push) Successful in 32s
Deploy Cluster / Ansible (push) Failing after 21m10s
2026-04-25 19:46:21 +00:00
micqdf 17f1815e7f fix: use CRI pulls for Flux image warmup
Deploy Cluster / Terraform (push) Successful in 30s
Deploy Cluster / Ansible (push) Failing after 15m3s
2026-04-25 19:28:29 +00:00
micqdf 66e86e55ea fix: require Flux image warmup before bootstrap
Deploy Cluster / Terraform (push) Successful in 31s
Deploy Cluster / Ansible (push) Failing after 23m13s
2026-04-25 19:02:32 +00:00
micqdf 43df412243 fix: handle missing Proxmox VM config during cleanup
Deploy Cluster / Terraform (push) Successful in 1m41s
Deploy Cluster / Ansible (push) Failing after 44m51s
2026-04-25 17:40:51 +00:00
micqdf 383ef9e9ac fix: clean orphan Proxmox cloud-init volumes
Deploy Cluster / Terraform (push) Failing after 19s
Deploy Cluster / Ansible (push) Has been skipped
2026-04-25 17:38:57 +00:00
micqdf 18abc5073b fix: keep concurrent Terraform apply
Deploy Cluster / Terraform (push) Failing after 1m28s
Deploy Cluster / Ansible (push) Has been skipped
2026-04-25 17:30:59 +00:00
micqdf f8da2594ca fix: serialize Proxmox VM apply
Deploy Cluster / Ansible (push) Has been cancelled
Deploy Cluster / Terraform (push) Has been cancelled
2026-04-25 17:27:59 +00:00
micqdf e0359f0097 tes
Deploy Cluster / Terraform (push) Failing after 1m26s
Deploy Cluster / Ansible (push) Has been skipped
2026-04-25 17:22:12 +00:00
micqdf 003333a061 fix: make health checks observe Flux readiness
Deploy Cluster / Terraform (push) Successful in 28s
Deploy Cluster / Ansible (push) Successful in 11m14s
2026-04-25 03:52:43 +00:00
micqdf a6071c504b fix: point Promtail at Loki service
Deploy Cluster / Terraform (push) Successful in 28s
Deploy Cluster / Ansible (push) Has been cancelled
2026-04-25 03:43:23 +00:00
micqdf 08123457f1 fix: ignore stale install hook pods in health check
Deploy Cluster / Terraform (push) Successful in 29s
Deploy Cluster / Ansible (push) Has been cancelled
2026-04-25 03:41:00 +00:00
micqdf 757d88ed52 fix: use cached Promtail images when available
Deploy Cluster / Terraform (push) Successful in 29s
Deploy Cluster / Ansible (push) Failing after 13m15s
2026-04-25 03:25:44 +00:00
micqdf 15defc686f fix: allow slow Promtail image pulls
Deploy Cluster / Terraform (push) Successful in 28s
Deploy Cluster / Ansible (push) Has been cancelled
2026-04-25 03:10:47 +00:00
micqdf abb7578328 fix: run post-deploy checks with bash
Deploy Cluster / Terraform (push) Successful in 28s
Deploy Cluster / Ansible (push) Failing after 12m17s
2026-04-25 02:42:54 +00:00
micqdf bc87a7ca43 fix: avoid immutable observability PVC changes
Deploy Cluster / Terraform (push) Successful in 28s
Deploy Cluster / Ansible (push) Failing after 10m47s
2026-04-25 02:25:40 +00:00
micqdf 045880bdd6 fix: ignore stale Rancher helm operation pods
Deploy Cluster / Terraform (push) Successful in 28s
Deploy Cluster / Ansible (push) Has been cancelled
2026-04-25 02:23:30 +00:00
micqdf bfcf57bcc5 fix: enforce post-deploy health checks
Deploy Cluster / Terraform (push) Successful in 29s
Deploy Cluster / Ansible (push) Has been cancelled
2026-04-25 02:22:16 +00:00
micqdf 7e3ebec95b fix: wait for Rancher resources before rollout checks
Deploy Cluster / Terraform (push) Successful in 29s
Deploy Cluster / Ansible (push) Successful in 17m31s
2026-04-25 01:54:21 +00:00
micqdf 0c31c3b1d5 fix: fail fast on stalled Flux Helm releases
Deploy Cluster / Terraform (push) Successful in 30s
Deploy Cluster / Ansible (push) Failing after 10m33s
2026-04-25 01:40:42 +00:00
micqdf 5523feb563 fix: wait for Rancher Flux resources before rollout
Deploy Cluster / Terraform (push) Successful in 27s
Deploy Cluster / Ansible (push) Failing after 39m43s
2026-04-25 00:59:16 +00:00
micqdf cafa2fa0b3 fix: reset stalled bootstrap Helm releases
Deploy Cluster / Terraform (push) Successful in 27s
Deploy Cluster / Ansible (push) Failing after 9m5s
2026-04-25 00:48:33 +00:00
micqdf a7fd4c0b97 fix: wait on actual ESO deployment names
Deploy Cluster / Terraform (push) Successful in 30s
Deploy Cluster / Ansible (push) Failing after 38m19s
2026-04-25 00:07:48 +00:00
micqdf e56a3a6c38 fix: wait for ESO webhook before ClusterSecretStore
Deploy Cluster / Terraform (push) Successful in 29s
Deploy Cluster / Ansible (push) Failing after 10m13s
2026-04-24 23:13:03 +00:00
micqdf 7b2eca07ab fix: pull external-secrets chart from OCI
Deploy Cluster / Terraform (push) Successful in 30s
Deploy Cluster / Ansible (push) Failing after 9m41s
2026-04-24 15:24:58 +00:00
micqdf 347ca041ba fix: reduce rerun bootstrap pre-pull delays
Deploy Cluster / Terraform (push) Successful in 28s
Deploy Cluster / Ansible (push) Failing after 39m26s
2026-04-24 12:09:34 +00:00
micqdf 3f52bad854 fix: make Ansible reruns faster and idempotent
Deploy Cluster / Terraform (push) Successful in 28s
Deploy Cluster / Ansible (push) Has been cancelled
2026-04-24 11:44:11 +00:00
micqdf c89c31adea fix: clean up Ansible bootstrap warnings
Deploy Cluster / Terraform (push) Successful in 27s
Deploy Cluster / Ansible (push) Has been cancelled
2026-04-24 11:07:13 +00:00
micqdf 68b293efe4 fix: qualify Flux HelmChart bootstrap resources
Deploy Cluster / Terraform (push) Successful in 27s
Deploy Cluster / Ansible (push) Has been cancelled
2026-04-24 10:47:13 +00:00
micqdf 1f465cc0c1 fix: force reconcile bootstrap Helm charts
Deploy Cluster / Terraform (push) Successful in 30s
Deploy Cluster / Ansible (push) Failing after 15m37s
2026-04-24 10:17:49 +00:00
micqdf 6e22bd26b3 fix: wait directly on ESO Helm readiness
Deploy Cluster / Terraform (push) Successful in 27s
Deploy Cluster / Ansible (push) Failing after 47m9s
2026-04-23 22:09:45 +00:00
micqdf 869880c152 fix: wait for ESO resources before CRD conditions
Deploy Cluster / Terraform (push) Successful in 31s
Deploy Cluster / Ansible (push) Failing after 31m14s
2026-04-23 21:17:44 +00:00
micqdf 31e95eb227 fix: pre-pull Flux controllers before bootstrap rollout
Deploy Cluster / Terraform (push) Successful in 28s
Deploy Cluster / Ansible (push) Failing after 16m39s
2026-04-23 20:36:57 +00:00
micqdf 12675417bd fix: use correct namespace and deployment name for ESO rollout check
Deploy Cluster / Terraform (push) Successful in 1m36s
Deploy Cluster / Ansible (push) Failing after 40m40s
The ESO deployment is named external-secrets-external-secrets in the
external-secrets namespace, not external-secrets in kube-system.
2026-04-23 19:00:15 +00:00
micqdf 8e081ddfda fix: wait on ESO deployment directly instead of Flux Kustomization status
Deploy Cluster / Terraform (push) Successful in 29s
Deploy Cluster / Ansible (push) Failing after 19m8s
The addon-external-secrets Flux Kustomization was timing out during bootstrap
because image pulls on fresh Proxmox VMs are slow. The critical dependency is
the ESO deployment being available for the Doppler ClusterSecretStore. Replace
the Kustomization readiness check with direct checks for ESO CRD establishment
and deployment rollout, which are the actual prerequisites for the next step.
2026-04-23 07:32:19 +00:00
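The idea in 8e081ddfda, waiting on the concrete prerequisites rather than on an aggregate status, can be sketched as a small polling helper. This is only an illustration under stated assumptions: `wait_for_all` and the check names are hypothetical, and the real checks in the repository (CRD establishment, deployment rollout) are not shown in this log.

```python
import time
from typing import Callable, Dict, List


def wait_for_all(
    checks: Dict[str, Callable[[], bool]],
    timeout_s: float,
    interval_s: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Poll named checks until all pass or the timeout expires.

    Returns the names of checks still failing; an empty list means success.
    Each check is a hypothetical callable returning True once its
    prerequisite (for example "CRD established") is met.
    """
    deadline = clock() + timeout_s
    pending = dict(checks)
    while True:
        # Keep only the checks that are still failing on this pass.
        pending = {name: fn for name, fn in pending.items() if not fn()}
        if not pending or clock() >= deadline:
            return sorted(pending)
        sleep(interval_s)
```

Returning the names of the failing checks, rather than a bare boolean, lets the caller report which prerequisite was still missing when the timeout expired.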
micqdf 4b7517c9c5 fix: health-check external-secrets addon via HelmRelease only
Deploy Cluster / Terraform (push) Successful in 27s
Deploy Cluster / Ansible (push) Failing after 17m22s
The external-secrets Kustomization was still using wait=true, which makes Flux
hold the addon in a failed state when the HelmRepository has transient fetch
errors even though the HelmRelease and runtime controller deployments are
healthy. Switch it to an explicit HelmRelease health check like the other
helm-backed addons.
2026-04-23 07:11:21 +00:00
micqdf f9bc53723f fix: make image pre-pull roles fully best effort
Deploy Cluster / Terraform (push) Successful in 27s
Deploy Cluster / Ansible (push) Failing after 22m46s
The pre-pull roles were still blocking the playbook because they retried until
success and exhausted their retry budget during registry TLS timeouts. Keep the
image pulls as opportunistic cache warmers, but never let them fail the
bootstrap; log any missed images instead.
2026-04-23 06:41:21 +00:00
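The best-effort behaviour described in f9bc53723f can be sketched as follows. This is a minimal illustration, not the Ansible role itself: `prepull_best_effort` and the `pull` callable are hypothetical names, and `pull` stands in for whatever actually fetches an image and raises on failure.

```python
from typing import Callable, Iterable, List


def prepull_best_effort(
    images: Iterable[str],
    pull: Callable[[str], None],
    attempts: int = 2,
) -> List[str]:
    """Try to pull each image; return the ones that could not be pulled.

    `pull` is a hypothetical callable that raises on failure. Pull errors
    are never propagated to the caller, so a flaky registry cannot fail
    the surrounding bootstrap.
    """
    missed: List[str] = []
    for image in images:
        for _ in range(attempts):
            try:
                pull(image)
                break  # pulled successfully, move on to the next image
            except Exception:
                continue  # swallow the error and use up one attempt
        else:
            # The retry budget ran out without a successful pull.
            missed.append(image)
    return missed
```

The caller would log the returned list and carry on, which matches the "log any missed images instead" intent of the commit.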
micqdf ee6417c18e fix: pre-pull core bootstrap images on cp1 before Flux bootstrap
Deploy Cluster / Terraform (push) Successful in 28s
Deploy Cluster / Ansible (push) Has been cancelled
Fresh clusters were repeatedly timing out while kubelet pulled the pause image,
k3s packaged component images, and Flux controller images onto the first
control plane. Pre-pull the core control-plane bootstrap images into
containerd on cp-1 so Flux and packaged addons start from a warm cache instead
of racing registry TLS timeouts.
2026-04-23 05:55:14 +00:00
micqdf 1156dc0203 fix: pre-pull kube-vip images before waiting for VIP
Deploy Cluster / Terraform (push) Successful in 29s
Deploy Cluster / Ansible (push) Failing after 43m31s
The primary control plane was stalling because kubelet still had to pull both
the Rancher pause image and the kube-vip image before the DaemonSet pod could
become Ready. Pre-pull those images into containerd, extend the readiness wait,
and emit pod diagnostics if kube-vip still does not come up.
2026-04-23 03:55:52 +00:00
micqdf 4151027e01 fix: clean stale Tailscale node devices before bootstrap
Deploy Cluster / Terraform (push) Successful in 1m40s
Deploy Cluster / Ansible (push) Failing after 14m30s
Run the Tailscale cleanup role against the cluster hostnames before any node
reconnects to the tailnet. This removes stale offline cp/worker devices from
previous rebuilds so replacement VMs can reclaim their original hostnames
instead of getting -1 suffixes.
2026-04-23 03:25:17 +00:00
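The selection step behind 4151027e01 can be sketched as a pure filter over a device list. The record fields used here (`id`, `hostname`, `online`) are assumptions made for illustration and are not the Tailscale API schema; the actual cleanup role is not shown in this log.

```python
from typing import Dict, Iterable, List, Set


def stale_device_ids(
    devices: Iterable[Dict[str, object]],
    cluster_hostnames: Set[str],
) -> List[str]:
    """Return IDs of offline devices whose hostname belongs to the cluster.

    Online devices and devices with unrelated hostnames are left alone,
    so only leftovers from previous rebuilds are selected for removal.
    """
    stale: List[str] = []
    for device in devices:
        is_cluster_node = device["hostname"] in cluster_hostnames
        if is_cluster_node and not device["online"]:
            stale.append(str(device["id"]))
    return stale
```

Running this before any node rejoins the tailnet is what lets a replacement VM reclaim its original hostname instead of receiving a suffixed one.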
micqdf 9269e9df1b docs: add guide for deploying app repos to the cluster
Deploy Cluster / Terraform (push) Successful in 1m36s
Deploy Cluster / Ansible (push) Has been cancelled
Document the recommended two-repo model for application delivery, including
Flux attachment objects, Doppler/ExternalSecret wiring, Tailscale service
exposure, and the steps for enabling the suspended apps layer.
2026-04-23 02:43:00 +00:00
micqdf d9374bc209 fix: remove duplicate wait keys from helm addon kustomizations
Deploy Cluster / Terraform (push) Successful in 29s
Deploy Cluster / Ansible (push) Has been cancelled
The repo-only Kustomization healthCheck change accidentally left the original
wait:true keys in the Rancher and Rancher backup Kustomizations, which broke
the infrastructure kustomize build. Remove the duplicate keys so Flux can
apply the HelmRelease-only health checks cleanly.
2026-04-23 02:20:57 +00:00
micqdf c570a476b5 fix: make helm-based addon kustomizations health-check HelmReleases only
Deploy Cluster / Terraform (push) Successful in 29s
Deploy Cluster / Ansible (push) Has been cancelled
These addon Kustomizations were using wait=true, which made Flux treat transient
HelmRepository fetch timeouts as addon failures even when the HelmRelease and
runtime workloads were healthy. Switch the affected Kustomizations to explicit
HelmRelease healthChecks so readiness reflects the actual deployed platform
state instead of repository fetch flakiness.
2026-04-23 02:15:45 +00:00
micqdf a7f11ccf94 fix: give Rancher more time to pass startup probe during upgrades
Deploy Cluster / Terraform (push) Successful in 28s
Deploy Cluster / Ansible (push) Successful in 18m59s
Rancher needs longer than the chart default 2-minute startup probe budget on
this cluster while it restores local catalogs and finishes API startup. Extend
the startup probe failure threshold so Helm upgrades can complete instead of
restarting the new pod before it becomes ready.
2026-04-23 01:44:25 +00:00
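The arithmetic behind a7f11ccf94 is that a Kubernetes startup probe gives a container roughly `failureThreshold × periodSeconds` to start before it is restarted. The sketch below computes the threshold needed for a given budget. The 10-second period and 10-minute target are assumed values for illustration only; the chart's actual probe settings are not given in this log, and any initial delay is ignored.

```python
import math


def failure_threshold_for(budget_s: int, period_s: int) -> int:
    """Smallest failureThreshold whose budget covers `budget_s` seconds."""
    if period_s <= 0:
        raise ValueError("period_s must be positive")
    return math.ceil(budget_s / period_s)


# Assumed 10 s probe period: a 2-minute budget needs 12 failures allowed,
# while a hypothetical 10-minute budget needs 60.
default_threshold = failure_threshold_for(120, 10)
extended_threshold = failure_threshold_for(600, 10)
```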
micqdf a7d540ca65 fix: stop forcing Flux releases during deploy bootstrap
Deploy Cluster / Terraform (push) Successful in 32s
Deploy Cluster / Ansible (push) Successful in 21m12s
Remove the HelmRelease reset/force annotations from the deploy workflow now
that the cluster can converge on its own. The runtime waits remain, but CI no
longer re-triggers Rancher and NFS churn on every bootstrap attempt.
2026-04-23 00:35:31 +00:00
micqdf 098bd98876 fix: wait on Rancher and storage runtime objects during bootstrap
Deploy Cluster / Terraform (push) Successful in 26s
Deploy Cluster / Ansible (push) Failing after 25m19s
Flux can leave HelmRelease and Kustomization conditions stale after transient
chart fetch or image pull failures even when the underlying workloads recover.
Switch the deploy workflow to wait on the concrete runtime resources we care
about: the NFS provisioner deployment and StorageClass, Rancher deployment,
webhook, cert-manager issuer/certificate, and the rancher-backup deployment.
2026-04-22 18:41:09 +00:00
micqdf 55d7b8201e fix: make Rancher image pre-pull best effort and disable managed SUC
Deploy Cluster / Terraform (push) Successful in 27s
Deploy Cluster / Ansible (push) Failing after 32m19s
Docker Hub TLS handshakes are too flaky to make pre-pulling a hard bootstrap
requirement. Treat image pre-pull as opportunistic and disable Rancher's
managed system-upgrade-controller feature so that image is removed from the
critical install path while Rancher and its webhook converge.
2026-04-22 11:33:13 +00:00
micqdf 9c0523e880 fix: pre-pull Rancher images and reset Rancher release during bootstrap
Deploy Cluster / Terraform (push) Successful in 28s
Deploy Cluster / Ansible (push) Failing after 27m30s
Rancher installs were stalling on transient Docker Hub TLS handshake timeouts
for rancher shell, webhook, and system-upgrade-controller images. Pre-pull the
required images onto all nodes after k3s comes up, extend the Rancher HelmRelease
timeout, and reset/force the Rancher HelmRelease before waiting on addon-rancher
so bootstrap can recover from stale failed remediation state.
2026-04-22 11:00:54 +00:00