HetznerTerra

Author	SHA1	Message	Date
micqdf	55d7b8201e	fix: make Rancher image pre-pull best effort and disable managed SUC [Deploy Cluster / Terraform (push) Successful in 27s; Deploy Cluster / Ansible (push) Failing after 32m19s] Docker Hub TLS handshakes are too flaky to make pre-pulling a hard bootstrap requirement. Treat image pre-pull as opportunistic and disable Rancher's managed system-upgrade-controller feature so that image is removed from the critical install path while Rancher and its webhook converge.	2026-04-22 11:33:13 +00:00
micqdf	9c0523e880	fix: pre-pull Rancher images and reset Rancher release during bootstrap [Deploy Cluster / Terraform (push) Successful in 28s; Deploy Cluster / Ansible (push) Failing after 27m30s] Rancher installs were stalling on transient Docker Hub TLS handshake timeouts for rancher shell, webhook, and system-upgrade-controller images. Pre-pull the required images onto all nodes after k3s comes up, extend the Rancher HelmRelease timeout, and reset/force the Rancher HelmRelease before waiting on addon-rancher so bootstrap can recover from stale failed remediation state.	2026-04-22 11:00:54 +00:00
micqdf	c32bec34bc	fix: quote kube-vip readiness jsonpath in bootstrap role [Deploy Cluster / Terraform (push) Successful in 27s; Deploy Cluster / Ansible (push) Failing after 10m11s] The local kube-vip readiness probe used an unquoted jsonpath predicate, which made kubectl treat Ready as an identifier instead of a string. Use a quoted jsonpath via shell so bootstrap can detect the primary kube-vip pod properly before waiting on the API VIP.	2026-04-22 04:41:48 +00:00
micqdf	6519a7673d	fix: wait for kube-vip on primary node during bootstrap [Deploy Cluster / Terraform (push) Successful in 28s; Deploy Cluster / Ansible (push) Failing after 9m11s] The kube-vip DaemonSet is applied before the secondary control planes join, so waiting for a full DaemonSet rollout blocks bootstrap on nodes that do not exist in the cluster yet. Wait only for the primary node's kube-vip pod and then verify the VIP is reachable on 6443.	2026-04-22 04:29:29 +00:00
micqdf	d1c31cdb91	fix: rely on k3s service readiness instead of installer exit code [Deploy Cluster / Terraform (push) Successful in 27s; Deploy Cluster / Ansible (push) Failing after 8m9s] The k3s install script can return non-zero while systemd is still bringing the service up, especially on worker agents. Do not fail immediately on the installer command; wait for the service to become active and only emit install diagnostics if the later readiness check fails.	2026-04-22 04:14:31 +00:00
micqdf	b3e88712bd	fix: derive cluster network interface from host facts [Deploy Cluster / Terraform (push) Successful in 28s; Deploy Cluster / Ansible (push) Failing after 12m32s] The Proxmox Ubuntu clones are exposing their primary NIC as eth0, not ens18. Use ansible_default_ipv4.interface for k3s flannel and kube-vip so bootstrap tracks the actual interface name instead of a guessed template default.	2026-04-22 03:50:03 +00:00
micqdf	06366ee5e6	fix: accept cloud-init exit code 2 after first boot [Deploy Cluster / Terraform (push) Successful in 28s; Deploy Cluster / Ansible (push) Failing after 6m2s] Ubuntu cloud-init returns exit code 2 for some completed boots even when the status output is 'done'. Treat that as a successful wait state so Ansible can continue into the package install phase instead of aborting early.	2026-04-22 03:40:55 +00:00
micqdf	9a2d213114	fix: wait for cloud-init before package install during bootstrap [Deploy Cluster / Terraform (push) Successful in 29s; Deploy Cluster / Ansible (push) Failing after 2m36s] Fresh Ubuntu cloud-init clones still hold apt and dpkg locks during first boot, which caused the Ansible common role to fail before the control plane could finish bootstrap. Wait for cloud-init, increase apt lock timeouts, and skip the final kubeconfig rewrite when no kubeconfig was fetched yet.	2026-04-22 03:34:53 +00:00
micqdf	b1dae28aa5	feat: migrate cluster baseline from Hetzner to Proxmox [Deploy Cluster / Terraform (push) Failing after 52s; Deploy Cluster / Ansible (push) Has been skipped; Deploy Grafana Content / Grafana Content (push) Failing after 1m37s] Replace Hetzner infrastructure and cloud-provider assumptions with Proxmox VM clones, kube-vip API HA, and NFS-backed storage. Update bootstrap, Flux addons, CI workflows, and docs to target the new private Proxmox baseline while preserving the existing Tailscale, Doppler, Flux, Rancher, and B2 backup flows.	2026-04-22 03:02:13 +00:00
micqdf	ceefcc3b29	cleanup: Remove obsolete port-forwarding, deferred Traefik files, and CI workaround [Deploy Cluster / Terraform (push) Successful in 2m21s; Deploy Cluster / Ansible (push) Successful in 13m9s] - Remove ansible/roles/private-access/ (replaced by Tailscale LB services) - Remove deferred observability ingress/traefik files (replaced by direct Tailscale LBs) - Remove orphaned kustomization-traefik-config.yaml (no backing directory) - Simplify CI: remove SA patch + job deletion workaround for rancher-backup (now handled by postRenderer in HelmRelease) - Update AGENTS.md to reflect current architecture	2026-04-02 01:21:23 +00:00
micqdf	efdf13976a	fix: Handle missing 'online' field in Tailscale API response [Deploy Cluster / Terraform (push) Successful in 2m12s; Deploy Cluster / Ansible (push) Successful in 9m19s]	2026-03-29 13:52:23 +00:00
micqdf	5269884408	feat: Auto-cleanup stale Tailscale devices before cluster boot [Deploy Cluster / Terraform (push) Successful in 2m17s; Deploy Cluster / Ansible (push) Failing after 6m35s] Adds tailscale-cleanup Ansible role that uses the Tailscale API to delete offline devices matching reserved hostnames (e.g. rancher). Runs during site.yml before Finalize to prevent hostname collisions like rancher-1 on rebuild. Requires TAILSCALE_API_KEY (API access token) passed as extra var.	2026-03-29 11:47:53 +00:00
micqdf	f36445d99a	Fix CNI: configure flannel to use private network interface (enp7s0) instead of public [Deploy Cluster / Terraform (push) Successful in 34s; Deploy Cluster / Ansible (push) Successful in 8m42s]	2026-03-25 01:44:33 +00:00
micqdf	89c2c99963	Fix Rancher: remove conflicting LoadBalancer, add HTTPS port-forward, use tailscale serve only [Deploy Cluster / Terraform (push) Successful in 2m21s; Deploy Cluster / Ansible (push) Successful in 9m2s]	2026-03-25 00:59:16 +00:00
micqdf	ab2f287bfb	Fix Rancher: use correct service name cattle-system-rancher [Deploy Cluster / Terraform (push) Successful in 39s; Deploy Cluster / Ansible (push) Successful in 4m23s]	2026-03-24 22:30:49 +00:00
micqdf	60ceac4624	Fix Rancher access: add kubectl port-forward + tailscale serve setup [Deploy Cluster / Ansible (push) Has been cancelled; Deploy Cluster / Terraform (push) Has been cancelled]	2026-03-24 20:01:57 +00:00
micqdf	0e52d8f159	Use Tailscale DNS names instead of IPs for TLS SANs [Deploy Cluster / Terraform (push) Successful in 2m21s; Deploy Cluster / Ansible (push) Successful in 9m0s] Changed from hardcoded Tailscale IPs to DNS names: - k8s-cluster-cp-1.silverside-gopher.ts.net - k8s-cluster-cp-2.silverside-gopher.ts.net - k8s-cluster-cp-3.silverside-gopher.ts.net This is more robust since Tailscale IPs change on rebuild, but DNS names remain consistent. After next rebuild, cluster accessible via: - kubectl --server=https://k8s-cluster-cp-1.silverside-gopher.ts.net:6443	2026-03-23 23:50:48 +00:00
micqdf	4726db2b5b	Add Tailscale IPs to k3s TLS SANs for secure tailnet access [Deploy Cluster / Terraform (push) Successful in 2m30s; Deploy Cluster / Ansible (push) Successful in 9m48s] Changes: - Add tailscale_control_plane_ips list to k3s-server defaults - Include all 3 control plane Tailscale IPs (100.120.55.97, 100.108.90.123, 100.92.149.85) - Update primary k3s install to add Tailscale IPs to TLS certificates - Enables kubectl access via Tailscale without certificate errors After next deploy, cluster will be accessible via: - kubectl --server=https://100.120.55.97:6443 (or any CP tailscale IP) - kubectl --server=https://k8s-cluster-cp-1:6443 (via tailscale DNS)	2026-03-23 23:04:00 +00:00
micqdf	90d105e5ea	Fix kube_api_endpoint variable passing for HA cluster [Deploy Cluster / Terraform (push) Successful in 2m18s; Deploy Cluster / Ansible (push) Successful in 8m55s] - Remove circular variable reference in site.yml - Add kube_api_endpoint default to k3s-server role - Variable is set via inventory group_vars and passed to role - Primary CP now correctly adds LB IP to TLS SANs Note: Existing cluster needs destroy/rebuild to regenerate certificates.	2026-03-23 03:01:53 +00:00
micqdf	952a80a742	Fix HA cluster join via Load Balancer private IP [Deploy Cluster / Terraform (push) Successful in 36s; Deploy Cluster / Ansible (push) Failing after 3m5s] Changes: - Use LB private IP (10.0.1.5) instead of public IP for cluster joins - Add LB private IP to k3s TLS SANs on primary control plane - This allows secondary CPs and workers to verify certificates when joining via LB Fixes x509 certificate validation error when joining via LB public IP.	2026-03-23 02:56:41 +00:00
micqdf	ff31cb4e74	Implement HA control plane with Load Balancer (3-3 topology) [Deploy Cluster / Terraform (push) Failing after 10s; Deploy Cluster / Ansible (push) Has been skipped] Major changes: - Terraform: Scale to 3 control planes (cx23) + 3 workers (cx33) - Terraform: Add Hetzner Load Balancer (lb11) for Kubernetes API - Terraform: Add kube_api_lb_ip output - Ansible: Add community.network collection to requirements - Ansible: Update inventory to include LB endpoint - Ansible: Configure secondary CPs and workers to join via LB - Ansible: Add k3s_join_endpoint variable for HA joins - Workflow: Add imports for cp-2, cp-3, and worker-3 - Docs: Update STABLE_BASELINE.md with HA topology and phase gates Topology: - 3 control planes (cx23 - 2 vCPU, 8GB RAM each) - 3 workers (cx33 - 4 vCPU, 16GB RAM each) - 1 Load Balancer (lb11) routing to all 3 control planes on port 6443 - Workers and secondary CPs join via LB endpoint for HA Cost impact: +~€26/month (2 extra CPs + 1 extra worker + LB)	2026-03-23 02:39:39 +00:00
micqdf	e447795395	Install helm binary in ccm-deploy role before using it [Deploy Cluster / Terraform (push) Successful in 2m1s; Deploy Cluster / Ansible (push) Successful in 6m35s] The kubernetes.core.helm module requires helm CLI to be installed on the target node. Added check and install step using the official helm install script.	2026-03-23 00:07:39 +00:00
micqdf	31b82c9371	Deploy CCM via Ansible before workers join to fix external cloud provider [Deploy Cluster / Terraform (push) Successful in 31s; Deploy Cluster / Ansible (push) Failing after 1m48s] This fixes the chicken-and-egg problem where workers with --kubelet-arg=cloud-provider=external couldn't join because CCM wasn't running yet to remove the node.cloudprovider.kubernetes.io/uninitialized taint. Changes: - Create ansible/roles/ccm-deploy/ to deploy CCM via Helm during Ansible phase - Reorder site.yml: CCM deploys after secrets but before workers join - CCM runs on control_plane[0] with proper tolerations for control plane nodes - Add 10s pause after CCM ready to ensure it can process new nodes - Workers can now successfully join with external cloud provider enabled Flux still manages CCM for updates, but initial install happens in Ansible.	2026-03-22 23:58:03 +00:00
micqdf	561cd67b0c	Enable Hetzner CCM and CSI for cloud provider integration [Deploy Cluster / Terraform (push) Successful in 30s; Deploy Cluster / Ansible (push) Failing after 3m21s] - Enable --kubelet-arg=cloud-provider=external on all nodes (control planes and workers) - Activate CCM Kustomization with 10m timeout for Hetzner cloud-controller-manager - Activate CSI Kustomization with dependsOn CCM and 10m timeout for hcloud-csi - Update deploy workflow to wait for CCM/CSI readiness (600s timeout) - Add providerID verification to post-deploy health checks This enables proper cloud provider integration with Hetzner CCM for node labeling and Hetzner CSI for persistent volume provisioning.	2026-03-22 22:26:21 +00:00
micqdf	8d1f9f4944	fix: add k3s reset logic for primary control plane [Deploy Cluster / Terraform (push) Successful in 39s; Deploy Cluster / Ansible (push) Failing after 4m19s]	2026-03-21 16:10:17 +00:00
micqdf	d4fd43e2f5	refactor: simplify k3s-server bootstrap for	2026-03-21 15:48:33 +00:00
micqdf	48a80c362c	fix: disable external cloud-provider kubelet arg for stable baseline [Deploy Cluster / Terraform (push) Successful in 50s; Deploy Cluster / Ansible (push) Failing after 4m21s]	2026-03-21 14:36:54 +00:00
micqdf	528a8dc210	fix: defer doppler store until eso is installed [Deploy Cluster / Terraform (push) Successful in 45s; Deploy Cluster / Ansible (push) Failing after 24m34s]	2026-03-20 09:30:17 +00:00
micqdf	349f75729a	fix: bootstrap tailscale namespace before secret [Deploy Cluster / Terraform (push) Successful in 44s; Deploy Cluster / Ansible (push) Failing after 3m30s]	2026-03-20 09:24:35 +00:00
micqdf	5bd4c41c2d	fix: restore k3s agent bootstrap [Deploy Cluster / Terraform (push) Successful in 49s; Deploy Cluster / Ansible (push) Failing after 18m16s]	2026-03-20 01:50:16 +00:00
micqdf	9d2f30de32	fix: prepare k3s for external cloud provider [Deploy Cluster / Terraform (push) Successful in 46s; Deploy Cluster / Ansible (push) Successful in 4m4s]	2026-03-17 01:21:23 +00:00
micqdf	08a3031276	refactor: retire imperative addon roles [Deploy Cluster / Terraform (push) Successful in 52s; Deploy Cluster / Ansible (push) Successful in 4m2s]	2026-03-17 01:04:02 +00:00
micqdf	bed8e4afc8	feat: migrate core addons toward flux [Deploy Cluster / Terraform (push) Successful in 49s; Deploy Cluster / Ansible (push) Successful in 4m6s]	2026-03-11 17:43:35 +00:00
micqdf	2d4de6cff8	fix: bootstrap doppler store outside flux [Deploy Cluster / Terraform (push) Successful in 43s; Deploy Cluster / Ansible (push) Successful in 9m42s]	2026-03-09 02:58:26 +00:00
micqdf	6f2e056b98	feat: sync runtime secrets from doppler [Deploy Cluster / Terraform (push) Successful in 45s; Deploy Cluster / Ansible (push) Successful in 9m56s]	2026-03-09 00:25:41 +00:00
micqdf	f95e0051a5	feat: automate private tailnet access on cp1 [Deploy Cluster / Terraform (push) Successful in 47s; Deploy Cluster / Ansible (push) Successful in 9m45s]	2026-03-08 04:16:06 +00:00
micqdf	480a079dc8	fix: fail fast when loki datasource has no labels [Deploy Grafana Content / Grafana Content (push) Successful in 1m59s; Deploy Cluster / Terraform (push) Successful in 44s; Deploy Cluster / Ansible (push) Successful in 22m51s]	2026-03-04 21:00:01 +00:00
micqdf	ff8e32daf5	fix: add loki nodeport fallback for grafana datasource reachability [Deploy Grafana Content / Grafana Content (push) Successful in 2m18s; Deploy Cluster / Terraform (push) Successful in 48s; Deploy Cluster / Ansible (push) Successful in 22m59s]	2026-03-04 19:39:16 +00:00
micqdf	eb1ad0bea7	fix: make grafana prometheus datasource resilient with nodeport fallback [Deploy Cluster / Terraform (push) Successful in 45s; Deploy Grafana Content / Grafana Content (push) Successful in 1m46s; Deploy Cluster / Ansible (push) Has been cancelled]	2026-03-04 19:22:31 +00:00
micqdf	9ff9d1e633	fix: clear stale helm pending revisions before kube-prometheus upgrade [Deploy Cluster / Terraform (push) Successful in 43s; Deploy Cluster / Ansible (push) Successful in 22m22s]	2026-03-04 18:35:55 +00:00
micqdf	6177b581e4	fix: correct dashboard verification checks and retry helm upgrade lock [Deploy Cluster / Terraform (push) Successful in 44s; Deploy Grafana Content / Grafana Content (push) Successful in 1m29s; Deploy Cluster / Ansible (push) Failing after 11m11s]	2026-03-04 08:48:30 +00:00
micqdf	2f166ed9e7	feat: manage grafana content as code with fast dashboard workflow [Deploy Cluster / Terraform (push) Successful in 46s; Deploy Cluster / Ansible (push) Has been cancelled; Deploy Grafana Content / Grafana Content (push) Has been cancelled]	2026-03-04 03:36:01 +00:00
micqdf	1c39274df7	feat: stabilize tailscale observability exposure with declarative proxy class [Deploy Cluster / Terraform (push) Successful in 54s; Deploy Cluster / Ansible (push) Successful in 22m19s]	2026-03-04 01:37:00 +00:00
micqdf	28eaa36ec4	fix: use tag:k8s for tailscale operator default tags [Deploy Cluster / Terraform (push) Successful in 55s; Deploy Cluster / Ansible (push) Successful in 24m25s]	2026-03-04 00:57:33 +00:00
micqdf	02fa71c0aa	fix: use tag:k8 for tailscale operator default tag [Deploy Cluster / Terraform (push) Successful in 44s; Deploy Cluster / Ansible (push) Successful in 23m16s]	2026-03-04 00:27:47 +00:00
micqdf	2bbf05cdca	fix: make tailscale operator non-blocking by default and gate observability patching on readiness [Deploy Cluster / Terraform (push) Successful in 44s; Deploy Cluster / Ansible (push) Successful in 22m44s]	2026-03-03 21:47:16 +00:00
micqdf	213c1fb4e4	fix: detect tailscale tag permission errors and clean access output [Deploy Cluster / Terraform (push) Successful in 46s; Deploy Cluster / Ansible (push) Failing after 14m7s]	2026-03-03 08:51:25 +00:00
micqdf	414ac73c25	fix: fail fast on tailscale oauth 403 with actionable message [Deploy Cluster / Terraform (push) Successful in 46s; Deploy Cluster / Ansible (push) Successful in 27m37s]	2026-03-02 23:57:53 +00:00
micqdf	542d7a6be5	fix: align tailscale proxy tags with operator tags [Deploy Cluster / Terraform (push) Successful in 45s; Deploy Cluster / Ansible (push) Failing after 19m38s]	2026-03-02 23:36:18 +00:00
micqdf	210b617cc9	fix: pin tailscale operator to control-plane node for DNS stability [Deploy Cluster / Terraform (push) Successful in 44s; Deploy Cluster / Ansible (push) Has been cancelled]	2026-03-02 23:32:36 +00:00
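
Commit 06366ee5e6 in the log above describes accepting exit code 2 from the cloud-init wait when the reported status is 'done'. The repository does this in its Ansible bootstrap; the Python below is only a minimal sketch of the decision rule, assuming the wait command's exit code and status text are captured separately. The function name `cloud_init_wait_succeeded` and the `status: <value>` line format it parses are illustrative assumptions, not taken from the repository.

```python
def cloud_init_wait_succeeded(return_code: int, status_output: str) -> bool:
    """Decide whether a finished cloud-init wait should count as success.

    Hypothetical helper: the name and the 'status: <value>' parsing are
    assumptions made for this sketch.
    """
    # Exit code 0 is the ordinary success case.
    if return_code == 0:
        return True

    # Pull the reported status out of a line such as "status: done".
    reported = None
    for line in status_output.splitlines():
        key, _, value = line.partition(":")
        if key.strip().lower() == "status":
            reported = value.strip().lower()
            break

    # Exit code 2 is tolerated only when cloud-init itself reports 'done'.
    return return_code == 2 and reported == "done"
```

Any other non-zero exit code, or exit code 2 with a status other than 'done', is still treated as a failure in this sketch, so the bootstrap would stop rather than continue into package installation.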


108 Commits
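
Commits 5269884408 and efdf13976a in this log describe deleting offline Tailscale devices whose hostnames match reserved names, and tolerating a missing 'online' field in the API response. The selection step might look roughly like the Python sketch below. Everything in it is an assumption for illustration: the function name, the `hostname` key, the rule that a reserved name also matches a numbered variant such as `rancher-1`, and the `missing_online_means_offline` switch. The log does not say how the repository interprets a missing 'online' field, so the sketch leaves that as an explicit choice and defaults to the conservative option of not deleting.

```python
import re
from typing import Any, Dict, Iterable, List


def select_stale_devices(
    devices: Iterable[Dict[str, Any]],
    reserved_hostnames: Iterable[str],
    missing_online_means_offline: bool = False,
) -> List[Dict[str, Any]]:
    """Return devices that are offline and use a reserved hostname.

    Hypothetical helper: field names and matching rule are assumptions.
    """
    reserved = tuple(reserved_hostnames)
    stale = []
    for device in devices:
        hostname = str(device.get("hostname", ""))
        # Match the reserved name itself or a numbered variant (name-N).
        matches = any(
            re.fullmatch(rf"{re.escape(name)}(-\d+)?", hostname)
            for name in reserved
        )
        if not matches:
            continue
        # The 'online' key may be absent, so never index it directly.
        online = device.get("online")
        if online is None:
            online = not missing_online_means_offline
        if not online:
            stale.append(device)
    return stale
```

With `missing_online_means_offline=False`, a device that lacks the 'online' field is kept, which avoids deleting a live node on incomplete data; passing `True` makes the cleanup more aggressive.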