Flux can leave HelmRelease and Kustomization conditions stale after transient
chart fetch or image pull failures even when the underlying workloads recover.
Switch the deploy workflow to wait on the concrete runtime resources we care
about: the NFS provisioner deployment and StorageClass, Rancher deployment,
webhook, cert-manager issuer/certificate, and the rancher-backup deployment.
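Roughly what the wait step amounts to now; resource names, namespaces, and timeouts below are illustrative (the issuer/certificate names are the usual Rancher chart defaults), not the exact workflow values:

```bash
# Wait on the runtime resources themselves instead of Flux object conditions.
# Names and namespaces are assumptions; adjust to the deployed manifests.
kubectl -n nfs-storage rollout status deploy/nfs-subdir-external-provisioner --timeout=300s
kubectl get storageclass nfs-client                      # exits nonzero if the class is missing

kubectl -n cattle-system rollout status deploy/rancher --timeout=600s
kubectl -n cattle-system rollout status deploy/rancher-webhook --timeout=300s

kubectl -n cattle-system wait issuer/rancher --for=condition=Ready --timeout=300s
kubectl -n cattle-system wait certificate/tls-rancher-ingress --for=condition=Ready --timeout=300s

kubectl -n cattle-resources-system rollout status deploy/rancher-backup --timeout=300s
```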
Rancher installs were stalling on transient Docker Hub TLS handshake timeouts
for the rancher shell, webhook, and system-upgrade-controller images. Pre-pull the
required images onto all nodes after k3s comes up, extend the Rancher HelmRelease
timeout, and reset/force the Rancher HelmRelease before waiting on addon-rancher
so bootstrap can recover from stale failed remediation state.
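Sketch of the pre-pull step against k3s's embedded containerd; the image list and tags are placeholders, the real ones come from the pinned Rancher chart:

```bash
# Pre-pull the images Rancher needs so a transient Docker Hub TLS failure
# cannot stall the Helm install later. Tags below are illustrative.
IMAGES=(
  rancher/shell:v0.1.22
  rancher/rancher-webhook:v0.4.2
  rancher/system-upgrade-controller:v0.13.2
)
for image in "${IMAGES[@]}"; do
  for attempt in 1 2 3; do                      # ride out transient registry hiccups
    k3s ctr images pull "docker.io/${image}" && break
    sleep 10
  done
done
```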
When the NFS storage HelmRelease has already entered a failed remediation state,
a plain reconcile request is not enough to clear the stale failure counters.
Send requestedAt, resetAt, and forceAt together so helm-controller retries the
release cleanly before the workflow waits on addon-nfs-storage.
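These are Flux's documented reconcile annotations; the HelmRelease name and namespace here are assumptions:

```bash
# One token for all three annotations: resetAt clears the failure counters,
# forceAt forces a one-off upgrade, requestedAt triggers the reconcile.
TOKEN="$(date +%s)"
kubectl -n flux-system annotate --overwrite helmrelease/nfs-storage \
  reconcile.fluxcd.io/requestedAt="${TOKEN}" \
  reconcile.fluxcd.io/resetAt="${TOKEN}" \
  reconcile.fluxcd.io/forceAt="${TOKEN}"
```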
The NFS HelmRelease can remain in a failed state from an earlier bootstrap
attempt even after the backing NFS export is corrected and the pod becomes
healthy. Request a fresh reconcile of the HelmRelease and addon kustomization
before waiting on addon-nfs-storage so the bootstrap step can observe the
recovered state.
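The plain reconcile request plus the wait looks roughly like this (object names are assumptions):

```bash
# Nudge the HelmRelease and its addon Kustomization, then wait for the
# Kustomization to report Ready again.
NOW="$(date +%s)"
kubectl -n flux-system annotate --overwrite helmrelease/nfs-storage \
  reconcile.fluxcd.io/requestedAt="${NOW}"
kubectl -n flux-system annotate --overwrite kustomization/addon-nfs-storage \
  reconcile.fluxcd.io/requestedAt="${NOW}"
kubectl -n flux-system wait kustomization/addon-nfs-storage \
  --for=condition=Ready --timeout=600s
```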
Fresh Proxmox clusters need more time for the Flux controller rollouts and the
first GitRepository/Kustomization reconciliations, especially while images are
still being pulled onto the control plane. Increase the bootstrap wait windows so CI
does not fail while the controllers are still converging.
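The extended waits amount to something like the following; the timeout values here are placeholders, not the pinned ones:

```bash
# Give the Flux controllers and the first reconciliations more headroom on a
# fresh cluster that is still pulling images.
for d in source-controller kustomize-controller helm-controller notification-controller; do
  kubectl -n flux-system rollout status "deploy/${d}" --timeout=10m
done
kubectl -n flux-system wait gitrepository/flux-system --for=condition=Ready --timeout=10m
kubectl -n flux-system wait kustomization/flux-system --for=condition=Ready --timeout=15m
```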
Flux bootstrap patches the controllers onto k8s-cluster-cp-1, but the
control-plane node is tainted NoSchedule. Add the matching toleration in both
the checked-in patch manifest and the bootstrap workflow so the controllers can
actually schedule and roll out on cp-1.
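The workflow side of the change is roughly the patch below, assuming the standard node-role.kubernetes.io/control-plane taint key; the checked-in kustomize patch carries the same toleration block:

```bash
# Let the Flux controllers tolerate the control-plane NoSchedule taint so they
# can schedule on k8s-cluster-cp-1.
for d in source-controller kustomize-controller helm-controller notification-controller; do
  kubectl -n flux-system patch deployment "${d}" --type merge -p '{
    "spec": {"template": {"spec": {"tolerations": [{
      "key": "node-role.kubernetes.io/control-plane",
      "operator": "Exists",
      "effect": "NoSchedule"
    }]}}}
  }'
done
```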
Replace Hetzner infrastructure and cloud-provider assumptions with Proxmox
VM clones, kube-vip API HA, and NFS-backed storage. Update bootstrap,
Flux addons, CI workflows, and docs to target the new private Proxmox
baseline while preserving the existing Tailscale, Doppler, Flux, Rancher,
and B2 backup flows.
Add a post-deploy smoke test that validates Tailscale DNS, proxy readiness,
reachability, and service responses for Rancher, Grafana, and Prometheus.
Move the operator to the stable Helm repo/version and align the baseline docs
with the current HA private-only architecture.
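The smoke test boils down to checks like these; the MagicDNS names, the tailscale namespace, and the endpoints are assumptions:

```bash
set -euo pipefail
# Proxy readiness: the operator's proxy pods run in the tailscale namespace by default.
kubectl -n tailscale wait pod --all --for=condition=Ready --timeout=300s
# DNS, reachability, and a basic HTTP response per service.
for host in rancher grafana prometheus; do
  fqdn="${host}.example-tailnet.ts.net"                  # assumed MagicDNS name
  getent hosts "${fqdn}" >/dev/null                      # Tailscale DNS resolves
  curl -fsk --max-time 10 "https://${fqdn}/" >/dev/null  # service responds
done
```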
The deploy pipeline never uses the flux binary after installation, so the
GitHub release download only adds a flaky failure point. Remove the step and
keep the bootstrap path kubectl-only.
- Wait for Rancher and rancher-backup operator to be ready
- Patch default SA in cattle-resources-system (fixes post-install hook failure)
- Clean up failed patch-sa jobs
- Force reconcile rancher-backup HelmRelease
- Find latest backup from B2 using Backblaze API
- Create Restore CR to restore Rancher state from the latest backup (see the sketch after this list)
- Wait for restore to complete before continuing
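The Restore step sketched below assumes an S3-compatible B2 storage location and illustrative secret/bucket names; LATEST_BACKUP is whatever the Backblaze API lookup returned:

```bash
kubectl apply -f - <<EOF
apiVersion: resources.cattle.io/v1
kind: Restore
metadata:
  name: restore-bootstrap
spec:
  backupFilename: ${LATEST_BACKUP}
  prune: false
  storageLocation:
    s3:
      credentialSecretName: b2-backup-creds           # assumed secret name
      credentialSecretNamespace: cattle-resources-system
      bucketName: rancher-backups                     # assumed bucket
      endpoint: s3.us-west-004.backblazeb2.com        # region-specific B2 endpoint
EOF
# Wait for the operator to mark the Restore done before continuing
# (assumes a Ready condition; adjust if the workflow polls restoreCompletionTs instead).
kubectl wait restore/restore-bootstrap --for=condition=Ready --timeout=900s
```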
Adds a tailscale-cleanup Ansible role that uses the Tailscale API to
delete offline devices matching reserved hostnames (e.g. rancher).
Runs during site.yml before Finalize to prevent hostname collisions
like rancher-1 on rebuild.
Requires TAILSCALE_API_KEY (an API access token) passed as an extra var.
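What the role does, sketched as shell against the Tailscale API (the role itself is Ansible; the tailnet name, reserved-hostname pattern, and offline heuristic are assumptions):

```bash
TAILNET="example.com"        # assumed tailnet name
RESERVED="rancher"           # assumed reserved-hostname pattern
curl -fsS -u "${TAILSCALE_API_KEY}:" \
  "https://api.tailscale.com/api/v2/tailnet/${TAILNET}/devices" |
  jq -r --arg re "^${RESERVED}" '
    .devices[]
    | select(.hostname | test($re))
    # "offline" approximated as not seen for five minutes (assumption)
    | select((now - (.lastSeen | fromdateiso8601)) > 300)
    | .id
  ' |
  while read -r device_id; do
    curl -fsS -X DELETE -u "${TAILSCALE_API_KEY}:" \
      "https://api.tailscale.com/api/v2/device/${device_id}"
  done
```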
Major changes:
- Terraform: Scale to 3 control planes (cx23) + 3 workers (cx33)
- Terraform: Add Hetzner Load Balancer (lb11) for Kubernetes API
- Terraform: Add kube_api_lb_ip output
- Ansible: Add community.network collection to requirements
- Ansible: Update inventory to include LB endpoint
- Ansible: Configure secondary CPs and workers to join via LB
- Ansible: Add k3s_join_endpoint variable for HA joins (see the join sketch after this list)
- Workflow: Add imports for cp-2, cp-3, and worker-3
- Docs: Update STABLE_BASELINE.md with HA topology and phase gates
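The join path through the LB, reduced to the underlying k3s install commands (the Ansible role drives this via k3s_join_endpoint; token handling is simplified here):

```bash
# Secondary control planes join the existing cluster through the LB endpoint
# exposed by the kube_api_lb_ip output.
curl -sfL https://get.k3s.io | \
  INSTALL_K3S_EXEC="server --server https://${KUBE_API_LB_IP}:6443" \
  K3S_TOKEN="${K3S_TOKEN}" sh -

# Workers join as agents through the same endpoint.
curl -sfL https://get.k3s.io | \
  K3S_URL="https://${KUBE_API_LB_IP}:6443" \
  K3S_TOKEN="${K3S_TOKEN}" sh -
```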
Topology:
- 3 control planes (cx23 - 2 vCPU, 8GB RAM each)
- 3 workers (cx33 - 4 vCPU, 16GB RAM each)
- 1 Load Balancer (lb11) routing to all 3 control planes on port 6443
- Workers and secondary CPs join via LB endpoint for HA
Cost impact: +~€26/month (2 extra CPs + 1 extra worker + LB)
- Enable --kubelet-arg=cloud-provider=external on all nodes (control planes and workers)
- Activate CCM Kustomization with 10m timeout for Hetzner cloud-controller-manager
- Activate CSI Kustomization with dependsOn CCM and 10m timeout for hcloud-csi
- Update deploy workflow to wait for CCM/CSI readiness (600s timeout)
- Add providerID verification to post-deploy health checks (see the sketch below)
This enables proper cloud provider integration with Hetzner CCM for node
labeling and Hetzner CSI for persistent volume provisioning.
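The providerID verification mentioned above is essentially:

```bash
# Every node should carry a providerID set by the Hetzner CCM (hcloud://<id>).
missing=$(kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.providerID}{"\n"}{end}' \
  | awk -F'\t' '$2 == "" {print $1}')
if [ -n "${missing}" ]; then
  echo "Nodes missing providerID: ${missing}" >&2
  exit 1
fi
```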