Commit Graph

351 Commits

micqdf 31e95eb227 fix: pre-pull Flux controllers before bootstrap rollout
Deploy Cluster / Terraform (push) Successful in 28s
Deploy Cluster / Ansible (push) Failing after 16m39s
2026-04-23 20:36:57 +00:00
micqdf 12675417bd fix: use correct namespace and deployment name for ESO rollout check
Deploy Cluster / Terraform (push) Successful in 1m36s
Deploy Cluster / Ansible (push) Failing after 40m40s
The ESO deployment is named external-secrets-external-secrets in the
external-secrets namespace, not external-secrets in kube-system.
2026-04-23 19:00:15 +00:00
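A minimal sketch of the corrected check as an Ansible task — the deployment name and namespace come from the commit message, while the kubeconfig path and timeout are assumptions:

```yaml
- name: Wait for the ESO controller deployment to become available
  ansible.builtin.command: >
    kubectl --kubeconfig /etc/rancher/k3s/k3s.yaml
    -n external-secrets rollout status
    deployment/external-secrets-external-secrets --timeout=600s
  changed_when: false
```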
micqdf 8e081ddfda fix: wait on ESO deployment directly instead of Flux Kustomization status
Deploy Cluster / Terraform (push) Successful in 29s
Deploy Cluster / Ansible (push) Failing after 19m8s
The addon-external-secrets Flux Kustomization was timing out during bootstrap
because image pulls on fresh Proxmox VMs are slow. The critical dependency is
the ESO deployment being available for the Doppler ClusterSecretStore. Replace
the Kustomization readiness check with direct checks for ESO CRD establishment
and deployment rollout, which are the actual prerequisites for the next step.
2026-04-23 07:32:19 +00:00
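The CRD half of that check might look like the sketch below; the CRD names are the upstream ESO names and the timeout is an assumption:

```yaml
- name: Wait for the ESO CRDs to be established
  ansible.builtin.command: >
    kubectl wait --for=condition=Established
    crd/externalsecrets.external-secrets.io
    crd/clustersecretstores.external-secrets.io
    --timeout=300s
  changed_when: false
```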
micqdf 4b7517c9c5 fix: health-check external-secrets addon via HelmRelease only
Deploy Cluster / Terraform (push) Successful in 27s
Deploy Cluster / Ansible (push) Failing after 17m22s
The external-secrets Kustomization was still using wait=true, which made Flux
hold the addon in a failed state whenever the HelmRepository hit transient fetch
errors, even though the HelmRelease and runtime controller deployments were
healthy. Switch it to an explicit HelmRelease health check like the other
helm-backed addons.
2026-04-23 07:11:21 +00:00
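The pattern described here maps to a Flux Kustomization roughly like the following sketch; the repo path is an assumption:

```yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: addon-external-secrets
  namespace: flux-system
spec:
  interval: 10m
  path: ./infrastructure/addons/external-secrets  # assumed path
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system
  wait: false  # do not gate readiness on every child object
  healthChecks:  # readiness == the HelmRelease itself
    - apiVersion: helm.toolkit.fluxcd.io/v2
      kind: HelmRelease
      name: external-secrets
      namespace: external-secrets
```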
micqdf f9bc53723f fix: make image pre-pull roles fully best effort
Deploy Cluster / Terraform (push) Successful in 27s
Deploy Cluster / Ansible (push) Failing after 22m46s
The pre-pull roles were still blocking the playbook because they retried until
success and exhausted their retry budget during registry TLS timeouts. Keep the
image pulls as opportunistic cache warmers, but never let them fail the
bootstrap; log any missed images instead.
2026-04-23 06:41:21 +00:00
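A best-effort pre-pull in Ansible might look like this sketch; the prepull_images variable is hypothetical:

```yaml
- name: Warm the containerd image cache (never fail the bootstrap)
  ansible.builtin.command: "k3s crictl pull {{ item }}"
  loop: "{{ prepull_images }}"
  register: prepull
  failed_when: false    # opportunistic: a miss must not stop the play
  changed_when: false

- name: Log any images the pre-pull missed
  ansible.builtin.debug:
    msg: "pre-pull missed: {{ item.item }}"
  loop: "{{ prepull.results }}"
  when: item.rc != 0
```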
micqdf ee6417c18e fix: pre-pull core bootstrap images on cp1 before Flux bootstrap
Deploy Cluster / Terraform (push) Successful in 28s
Deploy Cluster / Ansible (push) Has been cancelled
Fresh clusters were repeatedly timing out while kubelet pulled the pause image,
k3s packaged component images, and Flux controller images onto the first
control plane. Pre-pull the core control-plane bootstrap images into
containerd on cp-1 so Flux and packaged addons start from a warm cache instead
of racing registry TLS timeouts.
2026-04-23 05:55:14 +00:00
micqdf 1156dc0203 fix: pre-pull kube-vip images before waiting for VIP
Deploy Cluster / Terraform (push) Successful in 29s
Deploy Cluster / Ansible (push) Failing after 43m31s
The primary control plane was stalling because kubelet still had to pull both
the Rancher pause image and the kube-vip image before the DaemonSet pod could
become Ready. Pre-pull those images into containerd, extend the readiness wait,
and emit pod diagnostics if kube-vip still does not come up.
2026-04-23 03:55:52 +00:00
micqdf 4151027e01 fix: clean stale Tailscale node devices before bootstrap
Deploy Cluster / Terraform (push) Successful in 1m40s
Deploy Cluster / Ansible (push) Failing after 14m30s
Run the Tailscale cleanup role against the cluster hostnames before any node
reconnects to the tailnet. This removes stale offline cp/worker devices from
previous rebuilds so replacement VMs can reclaim their original hostnames
instead of getting -1 suffixes.
2026-04-23 03:25:17 +00:00
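Cleanups like this are typically driven through the Tailscale v2 API. A sketch — the tailscale_api_key and cluster_hostnames variables are hypothetical, and staleness filtering is simplified to a hostname match:

```yaml
- name: List tailnet devices
  ansible.builtin.uri:
    url: https://api.tailscale.com/api/v2/tailnet/-/devices
    headers:
      Authorization: "Bearer {{ tailscale_api_key }}"
  register: tailnet_devices

- name: Delete stale devices that collide with cluster hostnames
  ansible.builtin.uri:
    url: "https://api.tailscale.com/api/v2/device/{{ item.id }}"
    method: DELETE
    headers:
      Authorization: "Bearer {{ tailscale_api_key }}"
    status_code: [200, 204]
  loop: "{{ tailnet_devices.json.devices }}"
  when: item.hostname in cluster_hostnames
```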
micqdf 9269e9df1b docs: add guide for deploying app repos to the cluster
Deploy Cluster / Terraform (push) Successful in 1m36s
Deploy Cluster / Ansible (push) Has been cancelled
Document the recommended two-repo model for application delivery, including
Flux attachment objects, Doppler/ExternalSecret wiring, Tailscale service
exposure, and the steps for enabling the suspended apps layer.
2026-04-23 02:43:00 +00:00
micqdf d9374bc209 fix: remove duplicate wait keys from helm addon kustomizations
Deploy Cluster / Terraform (push) Successful in 29s
Deploy Cluster / Ansible (push) Has been cancelled
The repo-only Kustomization healthCheck change accidentally left the original
wait:true keys in the Rancher and Rancher backup Kustomizations, which broke
the infrastructure kustomize build. Remove the duplicate keys so Flux can
apply the HelmRelease-only health checks cleanly.
2026-04-23 02:20:57 +00:00
micqdf c570a476b5 fix: make helm-based addon kustomizations health-check HelmReleases only
Deploy Cluster / Terraform (push) Successful in 29s
Deploy Cluster / Ansible (push) Has been cancelled
These addon Kustomizations were using wait=true, which made Flux treat transient
HelmRepository fetch timeouts as addon failures even when the HelmRelease and
runtime workloads were healthy. Switch the affected Kustomizations to explicit
HelmRelease healthChecks so readiness reflects the actual deployed platform
state instead of repository fetch flakiness.
2026-04-23 02:15:45 +00:00
micqdf a7f11ccf94 fix: give Rancher more time to pass startup probe during upgrades
Deploy Cluster / Terraform (push) Successful in 28s
Deploy Cluster / Ansible (push) Successful in 18m59s
Rancher needs longer than the chart default 2-minute startup probe budget on
this cluster while it restores local catalogs and finishes API startup. Extend
the startup probe failure threshold so Helm upgrades can complete instead of
restarting the new pod before it becomes ready.
2026-04-23 01:44:25 +00:00
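Assuming the Rancher chart exposes startup probe settings as values, the extension would be a HelmRelease values fragment along these lines (thresholds illustrative):

```yaml
spec:
  values:
    startupProbe:
      periodSeconds: 10
      failureThreshold: 60   # ~10 minutes instead of the ~2-minute chart default
```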
micqdf a7d540ca65 fix: stop forcing Flux releases during deploy bootstrap
Deploy Cluster / Terraform (push) Successful in 32s
Deploy Cluster / Ansible (push) Successful in 21m12s
Remove the HelmRelease reset/force annotations from the deploy workflow now
that the cluster can converge on its own. The runtime waits remain, but CI no
longer re-triggers Rancher and NFS churn on every bootstrap attempt.
2026-04-23 00:35:31 +00:00
micqdf 098bd98876 fix: wait on Rancher and storage runtime objects during bootstrap
Deploy Cluster / Terraform (push) Successful in 26s
Deploy Cluster / Ansible (push) Failing after 25m19s
Flux can leave HelmRelease and Kustomization conditions stale after transient
chart fetch or image pull failures even when the underlying workloads recover.
Switch the deploy workflow to wait on the concrete runtime resources we care
about: the NFS provisioner deployment and StorageClass, Rancher deployment,
webhook, cert-manager issuer/certificate, and the rancher-backup deployment.
2026-04-22 18:41:09 +00:00
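Waiting on concrete runtime objects rather than Flux conditions might look like this sketch; the Rancher names are chart defaults, while the NFS namespace, deployment, and StorageClass names are assumptions:

```yaml
- name: Wait for the runtime deployments to roll out
  ansible.builtin.command: >
    kubectl -n {{ item.ns }} rollout status
    deployment/{{ item.name }} --timeout=900s
  loop:
    - { ns: cattle-system, name: rancher }
    - { ns: cattle-system, name: rancher-webhook }
    - { ns: cattle-resources-system, name: rancher-backup }
    - { ns: kube-system, name: nfs-subdir-external-provisioner }  # assumed
  changed_when: false

- name: Confirm the NFS StorageClass exists
  ansible.builtin.command: kubectl get storageclass nfs-client  # assumed name
  changed_when: false
```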
micqdf 55d7b8201e fix: make Rancher image pre-pull best effort and disable managed SUC
Deploy Cluster / Terraform (push) Successful in 27s
Deploy Cluster / Ansible (push) Failing after 32m19s
Docker Hub TLS handshakes are too flaky to make pre-pulling a hard bootstrap
requirement. Treat image pre-pull as opportunistic and disable Rancher's
managed system-upgrade-controller feature so that image is removed from the
critical install path while Rancher and its webhook converge.
2026-04-22 11:33:13 +00:00
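If the feature is toggled through the chart's features value, the HelmRelease fragment might read as below; the flag name is an assumption, not confirmed by the commit:

```yaml
spec:
  values:
    features: "managed-system-upgrade-controller=false"  # assumed flag name
```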
micqdf 9c0523e880 fix: pre-pull Rancher images and reset Rancher release during bootstrap
Deploy Cluster / Terraform (push) Successful in 28s
Deploy Cluster / Ansible (push) Failing after 27m30s
Rancher installs were stalling on transient Docker Hub TLS handshake timeouts
for the Rancher shell, webhook, and system-upgrade-controller images. Pre-pull the
required images onto all nodes after k3s comes up, extend the Rancher HelmRelease
timeout, and reset/force the Rancher HelmRelease before waiting on addon-rancher
so bootstrap can recover from stale failed remediation state.
2026-04-22 11:00:54 +00:00
micqdf 8372d562ad fix: reset and force nfs helmrelease during bootstrap
Deploy Cluster / Terraform (push) Successful in 29s
Deploy Cluster / Ansible (push) Failing after 20m22s
When the NFS storage HelmRelease has already entered a failed remediation state,
a plain reconcile request is not enough to clear the stale failure counters.
Send requestedAt, resetAt, and forceAt together so helm-controller retries the
release cleanly before the workflow waits on addon-nfs-storage.
2026-04-22 10:35:32 +00:00
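All three annotations are sent with the same token; a sketch, with the HelmRelease name and namespace as assumptions:

```yaml
- name: Reset and force the NFS HelmRelease before waiting on it
  ansible.builtin.shell: |
    ts="$(date -u +%Y-%m-%dT%H:%M:%SZ)"
    kubectl -n kube-system annotate helmrelease/nfs-storage \
      reconcile.fluxcd.io/requestedAt="$ts" \
      reconcile.fluxcd.io/resetAt="$ts" \
      reconcile.fluxcd.io/forceAt="$ts" \
      --overwrite
```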
micqdf 1bb11dfe3a fix: force nfs storage reconcile during flux bootstrap
Deploy Cluster / Terraform (push) Successful in 27s
Deploy Cluster / Ansible (push) Failing after 19m0s
The NFS HelmRelease can remain in a failed state from an earlier bootstrap
attempt even after the backing NFS export is corrected and the pod becomes
healthy. Request a fresh reconcile of the HelmRelease and addon kustomization
before waiting on addon-nfs-storage so the bootstrap step can observe the
recovered state.
2026-04-22 10:08:20 +00:00
micqdf 624cd5aab6 fix: point NFS provisioner at active Proxmox host export
Deploy Cluster / Terraform (push) Successful in 27s
Deploy Cluster / Ansible (push) Failing after 18m51s
The cluster nodes can reach the exported NFS path on 10.27.27.239, not
10.27.27.22. Update the storage addon and repo note so the NFS provisioner
mounts the live export and Flux health checks can converge.
2026-04-22 09:46:01 +00:00
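Assuming the addon wraps the nfs-subdir-external-provisioner chart, the change is a values fragment like this; the export path is an assumption:

```yaml
spec:
  values:
    nfs:
      server: 10.27.27.239   # the reachable Proxmox host export
      path: /export/k8s      # assumed export path
```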
micqdf 71bdc6a709 fix: extend Flux bootstrap timeouts on fresh clusters
Deploy Cluster / Terraform (push) Successful in 26s
Deploy Cluster / Ansible (push) Failing after 18m44s
Fresh Proxmox clusters need longer for the Flux controller rollouts and first
GitRepository/Kustomization reconciliations, especially while images are still
being pulled onto the control plane. Increase the bootstrap wait windows so CI
does not fail while the controllers are still converging.
2026-04-22 08:36:27 +00:00
micqdf 714f20417b fix: tolerate control-plane taint when pinning Flux to cp1
Deploy Cluster / Terraform (push) Successful in 28s
Deploy Cluster / Ansible (push) Failing after 10m19s
Flux bootstrap patches the controllers onto k8s-cluster-cp-1, but the
control-plane node is tainted NoSchedule. Add the matching toleration in both
the checked-in patch manifest and the bootstrap workflow so the controllers can
actually schedule and roll out on cp-1.
2026-04-22 05:05:15 +00:00
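Flux's documented way of patching all controllers at once fits this fix. A sketch of the flux-system kustomization — the labels are the Flux defaults, the node name is from the commit:

```yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - gotk-components.yaml
  - gotk-sync.yaml
patches:
  - patch: |
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: all   # placeholder; selection happens via target below
      spec:
        template:
          spec:
            nodeSelector:
              kubernetes.io/hostname: k8s-cluster-cp-1
            tolerations:
              - key: node-role.kubernetes.io/control-plane
                operator: Exists
                effect: NoSchedule
    target:
      kind: Deployment
      labelSelector: app.kubernetes.io/part-of=flux
```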
micqdf c32bec34bc fix: quote kube-vip readiness jsonpath in bootstrap role
Deploy Cluster / Terraform (push) Successful in 27s
Deploy Cluster / Ansible (push) Failing after 10m11s
The local kube-vip readiness probe used an unquoted jsonpath predicate,
which made kubectl treat Ready as an identifier instead of a string. Use a
quoted jsonpath via shell so bootstrap can detect the primary kube-vip pod
properly before waiting on the API VIP.
2026-04-22 04:41:48 +00:00
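The quoting fix amounts to running the predicate through the shell with the string literal intact. A sketch — the label selector and retry budget are assumptions:

```yaml
- name: Detect a Ready kube-vip pod on the primary control plane
  ansible.builtin.shell: >
    kubectl -n kube-system get pods
    -l app.kubernetes.io/name=kube-vip
    --field-selector spec.nodeName=k8s-cluster-cp-1
    -o jsonpath='{.items[0].status.conditions[?(@.type=="Ready")].status}'
  register: kubevip_ready
  until: kubevip_ready.stdout == "True"
  retries: 30
  delay: 10
  changed_when: false
  failed_when: false   # the until loop decides success
```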
micqdf 6519a7673d fix: wait for kube-vip on primary node during bootstrap
Deploy Cluster / Terraform (push) Successful in 28s
Deploy Cluster / Ansible (push) Failing after 9m11s
The kube-vip DaemonSet is applied before the secondary control planes join,
so waiting for a full DaemonSet rollout blocks bootstrap on nodes that do not
exist in the cluster yet. Wait only for the primary node's kube-vip pod and
then verify the VIP is reachable on 6443.
2026-04-22 04:29:29 +00:00
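The VIP reachability check can be a plain TCP wait; a sketch, with kube_vip_address as a hypothetical variable:

```yaml
- name: Verify the API VIP answers on 6443
  ansible.builtin.wait_for:
    host: "{{ kube_vip_address }}"
    port: 6443
    timeout: 300
```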
micqdf d1c31cdb91 fix: rely on k3s service readiness instead of installer exit code
Deploy Cluster / Terraform (push) Successful in 27s
Deploy Cluster / Ansible (push) Failing after 8m9s
The k3s install script can return non-zero while systemd is still bringing the
service up, especially on worker agents. Do not fail immediately on the
installer command; wait for the service to become active and only emit
install diagnostics if the later readiness check fails.
2026-04-22 04:14:31 +00:00
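A sketch of the decoupled pattern — the installer staging path and retry budget are assumptions, and worker agents would poll k3s-agent instead:

```yaml
- name: Run the k3s installer (exit code is advisory only)
  ansible.builtin.command: /tmp/k3s-install.sh
  register: k3s_install
  failed_when: false

- name: Wait for the k3s service to become active
  ansible.builtin.command: systemctl is-active k3s
  register: k3s_active
  until: k3s_active.stdout == "active"
  retries: 30
  delay: 10
  changed_when: false
  failed_when: false   # readiness is judged by the until loop
```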
micqdf b3e88712bd fix: derive cluster network interface from host facts
Deploy Cluster / Terraform (push) Successful in 28s
Deploy Cluster / Ansible (push) Failing after 12m32s
The Proxmox Ubuntu clones are exposing their primary NIC as eth0, not ens18.
Use ansible_default_ipv4.interface for k3s flannel and kube-vip so bootstrap
tracks the actual interface name instead of a guessed template default.
2026-04-22 03:50:03 +00:00
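In variable terms the fix is a one-liner per consumer; the variable names here are assumptions:

```yaml
k3s_flannel_iface: "{{ ansible_default_ipv4.interface }}"   # eth0 on these clones
kube_vip_interface: "{{ ansible_default_ipv4.interface }}"
```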
micqdf 06366ee5e6 fix: accept cloud-init exit code 2 after first boot
Deploy Cluster / Terraform (push) Successful in 28s
Deploy Cluster / Ansible (push) Failing after 6m2s
Ubuntu cloud-init returns exit code 2 for some completed boots even when the
status output is 'done'. Treat that as a successful wait state so Ansible can
continue into the package install phase instead of aborting early.
2026-04-22 03:40:55 +00:00
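A sketch of the tolerant wait — cloud-init documents exit code 2 as a completed-but-degraded boot:

```yaml
- name: Wait for cloud-init to finish first boot
  ansible.builtin.command: cloud-init status --wait
  register: cloud_init
  changed_when: false
  failed_when: cloud_init.rc not in [0, 2]   # 2 == done with recoverable errors
```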
micqdf 9a2d213114 fix: wait for cloud-init before package install during bootstrap
Deploy Cluster / Terraform (push) Successful in 29s
Deploy Cluster / Ansible (push) Failing after 2m36s
Fresh Ubuntu cloud-init clones still hold apt and dpkg locks during first boot,
which caused the Ansible common role to fail before the control plane could
finish bootstrap. Wait for cloud-init, increase apt lock timeouts, and skip the
final kubeconfig rewrite when no kubeconfig was fetched yet.
2026-04-22 03:34:53 +00:00
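The lock half of the fix might look like this; the package list variable is hypothetical, while lock_timeout is the apt module's own parameter:

```yaml
- name: Install base packages once the first-boot apt/dpkg locks clear
  ansible.builtin.apt:
    name: "{{ common_packages }}"
    state: present
    update_cache: true
    lock_timeout: 600
```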
micqdf 9482a0f551 fix: skip clone storage override for linked Proxmox clones
Deploy Cluster / Terraform (push) Successful in 1m43s
Deploy Cluster / Ansible (push) Failing after 6m24s
The bpg/proxmox provider rejects clone.datastore_id when creating linked
clones. Only pass the target datastore when full clones are enabled so the
linked-clone baseline can provision from template 9000 successfully.
2026-04-22 03:22:50 +00:00
micqdf 5c53b8e06e fix: normalize Proxmox endpoint and stop dashboards self-trigger
Deploy Cluster / Terraform (push) Failing after 53s
Deploy Cluster / Ansible (push) Has been skipped
Accept Proxmox API endpoints with or without /api2/json in CI and local
tfvars, and avoid running the dashboards workflow just because its own
workflow file changed during platform migrations.
2026-04-22 03:13:22 +00:00
micqdf b1dae28aa5 feat: migrate cluster baseline from Hetzner to Proxmox
Deploy Cluster / Terraform (push) Failing after 52s
Deploy Cluster / Ansible (push) Has been skipped
Deploy Grafana Content / Grafana Content (push) Failing after 1m37s
Replace Hetzner infrastructure and cloud-provider assumptions with Proxmox
VM clones, kube-vip API HA, and NFS-backed storage. Update bootstrap,
Flux addons, CI workflows, and docs to target the new private Proxmox
baseline while preserving the existing Tailscale, Doppler, Flux, Rancher,
and B2 backup flows.
2026-04-22 03:02:13 +00:00
micqdf 6c6b9d20ca update README
Deploy Cluster / Ansible (push) Has been cancelled
Deploy Cluster / Terraform (push) Has been cancelled
2026-04-22 01:14:21 +00:00
micqdf c3a2f25c94 docs: record validated Rancher restore drill
Deploy Cluster / Terraform (push) Successful in 2m11s
Deploy Cluster / Ansible (push) Successful in 10m9s
Update the baseline to treat Rancher backup and restore validation as part
of the accepted platform state, and capture the successful live drill run
performed on 2026-04-18.
2026-04-18 21:27:42 +00:00
micqdf 7385c2263e fix: add tailnet smoke checks and move Tailscale operator to stable
Deploy Cluster / Terraform (push) Successful in 49s
Deploy Cluster / Ansible (push) Successful in 5m55s
Add a post-deploy smoke test that validates Tailscale DNS, proxy readiness,
reachability, and service responses for Rancher, Grafana, and Prometheus.
Move the operator to the stable Helm repo/version and align the baseline docs
with the current HA private-only architecture.
2026-04-18 19:59:13 +00:00
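A smoke test in this shape can be a handful of uri probes. A sketch — the URLs, schemes, and accepted status codes are assumptions, with the Prometheus port taken from the port-alignment commit further down:

```yaml
- name: Smoke-test exposed tailnet services
  ansible.builtin.uri:
    url: "{{ item }}"
    status_code: [200, 302, 401]   # auth redirects still prove the proxy path
  loop:
    - https://rancher.silverside-gopher.ts.net
    - https://grafana.silverside-gopher.ts.net
    - http://prometheus.silverside-gopher.ts.net:9090
```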
micqdf 60f466ab98 remove Weave GitOps addon
Deploy Cluster / Terraform (push) Successful in 41s
Deploy Cluster / Ansible (push) Successful in 5m37s
Drop the Flux UI addon and its Tailscale exposure because the UI lags the
current Flux APIs and reports misleading HelmRelease errors. Keep Flux managed
through the controllers themselves and use Rancher or the flux CLI for access.
2026-04-18 18:44:55 +00:00
micqdf b20356e9fe fix: only clean stale Tailscale names before proxies exist
Deploy Cluster / Terraform (push) Failing after 51s
Deploy Cluster / Ansible (push) Has been skipped
The Tailscale cleanup role was deleting reserved service hostnames on later
deploy runs, which removed the live Rancher/Grafana/Prometheus/Flux proxy
nodes from the tailnet. Skip cleanup whenever the current cluster already has
those Tailscale services, while still allowing cleanup on fresh rebuilds.
2026-04-18 18:16:27 +00:00
micqdf 2ba6b6a896 fix: remove unused Flux CLI install from deploy workflow
Deploy Cluster / Terraform (push) Successful in 49s
Deploy Cluster / Ansible (push) Successful in 5m40s
The deploy pipeline never uses the flux binary after installation, so the
GitHub release download only adds a flaky failure point. Remove the step and
keep the bootstrap path kubectl-only.
2026-04-18 17:45:59 +00:00
micqdf 9126de1423 fix: Align Prometheus external URL with Tailscale service port
Deploy Cluster / Terraform (push) Successful in 48s
Deploy Cluster / Ansible (push) Failing after 4m52s
Prometheus is exposed on port 9090 through the Tailscale LoadBalancer
service, so the configured external URL and repo docs should match the
actual address users reach after rebuilds.
2026-04-18 17:11:16 +00:00
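For kube-prometheus-stack this is a single values path; a fragment, with the scheme assumed:

```yaml
spec:
  values:
    prometheus:
      prometheusSpec:
        externalUrl: http://prometheus.silverside-gopher.ts.net:9090
```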
micqdf 4532b9ed74 chore: trigger rebuild
Deploy Cluster / Terraform (push) Successful in 2m8s
Deploy Cluster / Ansible (push) Successful in 12m54s
2026-04-18 06:09:54 +00:00
micqdf 68dbd2e5b7 fix: Reserve Tailscale service hostnames and tag exposed proxies
Deploy Cluster / Terraform (push) Successful in 53s
Deploy Cluster / Ansible (push) Successful in 6m3s
Reserve grafana/prometheus/flux alongside rancher during rebuild cleanup so
stale tailnet devices do not force -1 hostnames. Tag the exposed Tailscale
services so operator-managed proxies are provisioned with explicit prod/service
tags from the tailnet policy.
2026-04-18 05:48:26 +00:00
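With the Tailscale operator, the hostname and tags ride on Service annotations. A sketch — the namespace, selector, ports, and exact tag names are assumptions:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: grafana-tailscale
  namespace: monitoring
  annotations:
    tailscale.com/hostname: grafana
    tailscale.com/tags: "tag:prod,tag:service"
spec:
  type: LoadBalancer
  loadBalancerClass: tailscale
  selector:
    app.kubernetes.io/name: grafana
  ports:
    - port: 80
      targetPort: 3000   # Grafana's default container port
```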
micqdf ceefcc3b29 cleanup: Remove obsolete port-forwarding, deferred Traefik files, and CI workaround
Deploy Cluster / Terraform (push) Successful in 2m21s
Deploy Cluster / Ansible (push) Successful in 13m9s
- Remove ansible/roles/private-access/ (replaced by Tailscale LB services)
- Remove deferred observability ingress/traefik files (replaced by direct Tailscale LBs)
- Remove orphaned kustomization-traefik-config.yaml (no backing directory)
- Simplify CI: remove SA patch + job deletion workaround for rancher-backup
  (now handled by postRenderer in HelmRelease)
- Update AGENTS.md to reflect current architecture
2026-04-02 01:21:23 +00:00
micqdf 0d339b3163 fix: Use rancher/kubectl image for rancher-backup hook
Deploy Cluster / Terraform (push) Successful in 53s
Deploy Cluster / Ansible (push) Successful in 5m41s
The bitnami/kubectl:1.34 tag doesn't exist; rancher/kubectl is already
available in the cluster's image cache.
2026-04-02 01:00:27 +00:00
micqdf 30ccf13c82 fix: Use postRenderer to replace broken kuberlr-kubectl image in rancher-backup hook
Deploy Cluster / Terraform (push) Successful in 55s
Deploy Cluster / Ansible (push) Has been cancelled
The chart's post-install hook hardcodes rancher/kuberlr-kubectl which
can't download kubectl. Use Flux postRenderers to patch the job image
to bitnami/kubectl at render time.
2026-04-02 00:51:50 +00:00
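A Flux postRenderer that swaps a hook image could look like this JSON6902 sketch; the Job name is an assumption, and the commit above this one replaced the image with rancher/kubectl:

```yaml
spec:
  postRenderers:
    - kustomize:
        patches:
          - target:
              kind: Job
              name: rancher-backup-patch-sa   # assumed hook job name
            patch: |
              - op: replace
                path: /spec/template/spec/containers/0/image
                value: bitnami/kubectl
```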
micqdf 75e3604f30 fix: Skip post-install hooks for rancher-backup HelmRelease
Deploy Cluster / Terraform (push) Successful in 57s
Deploy Cluster / Ansible (push) Has been cancelled
The chart's post-install hook uses rancher/kuberlr-kubectl which fails
to download kubectl. The SA automountServiceAccountToken is managed
manually, so the hook is unnecessary.
2026-04-02 00:45:03 +00:00
micqdf e4235a6e58 fix: Correct Flux UI pod selector labels to match deployed weave-gitops labels
Deploy Cluster / Terraform (push) Successful in 51s
Deploy Cluster / Ansible (push) Successful in 20m36s
Actual labels are app.kubernetes.io/name=weave-gitops and
app.kubernetes.io/instance=flux-system-weave-gitops.
2026-04-01 02:08:12 +00:00
micqdf ea2d534171 fix: Use admin.existingSecret for Grafana creds from Doppler
Deploy Cluster / Terraform (push) Successful in 50s
Deploy Cluster / Ansible (push) Successful in 20m42s
Revert to the idiomatic Grafana chart approach. ExternalSecret creates the
secret with admin-user/admin-password keys before Grafana's first start
on fresh cluster creation.
2026-04-01 01:41:49 +00:00
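The wiring described here is an ExternalSecret that materializes the chart's expected keys. A sketch — the namespace, store name, and Doppler key names are assumptions, while admin-user/admin-password are the Grafana chart defaults:

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: grafana-admin
  namespace: monitoring
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: doppler
  target:
    name: grafana-admin-credentials   # referenced by admin.existingSecret
  data:
    - secretKey: admin-user
      remoteRef:
        key: GRAFANA_ADMIN_USER
    - secretKey: admin-password
      remoteRef:
        key: GRAFANA_ADMIN_PASSWORD
```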
micqdf a1b9fe6aa6 fix: Use Flux valuesFrom to inject Doppler Grafana creds as Helm values
Deploy Cluster / Terraform (push) Successful in 49s
Deploy Cluster / Ansible (push) Successful in 20m38s
Switch from admin.existingSecret to valuesFrom so Flux reads the
Doppler-managed secret and injects credentials as standard Helm values.
2026-03-31 23:40:54 +00:00
micqdf 33765657ec fix: Correct pod selectors for Prometheus and Flux Tailscale services, use Doppler for Grafana creds
Deploy Cluster / Terraform (push) Successful in 50s
Deploy Cluster / Ansible (push) Successful in 21m0s
Prometheus needs the operator.prometheus.io/name label selector. Flux UI pods
are labeled gitops-server, not weave-gitops. Grafana now reads admin creds
from Doppler via ExternalSecret instead of hardcoded values.
2026-03-31 22:54:57 +00:00
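The Prometheus selector fix maps to a Service fragment like this; the Prometheus object name, and therefore the label value, is an assumption:

```yaml
spec:
  type: LoadBalancer
  loadBalancerClass: tailscale
  selector:
    operator.prometheus.io/name: kube-prometheus-stack-prometheus
  ports:
    - port: 9090
      targetPort: 9090
```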
micqdf b8f64fa952 feat: Expose Grafana, Prometheus, and Flux UI via Tailscale LoadBalancer services
Deploy Cluster / Terraform (push) Successful in 55s
Deploy Cluster / Ansible (push) Successful in 20m47s
Replace Ansible port-forwarding + tailscale serve with direct Tailscale LB
services matching the existing Rancher pattern. Each service gets its own
tailnet hostname (grafana/prometheus/flux.silverside-gopher.ts.net).
2026-03-31 08:53:28 +00:00
micqdf 569d741751 push
Deploy Cluster / Terraform (push) Successful in 2m37s
Deploy Cluster / Ansible (push) Successful in 25m37s
2026-03-31 02:46:55 +00:00
micqdf 89e53d9ec9 fix: Handle restricted B2 keys and safe JSON parsing in restore step
Deploy Cluster / Terraform (push) Successful in 52s
Deploy Cluster / Ansible (push) Successful in 20m48s
2026-03-31 01:43:04 +00:00