The pre-pull roles were still blocking the playbook: they retried until success,
and persistent registry TLS timeouts exhausted their retry budget and failed the run. Keep the
image pulls as opportunistic cache warmers, but never let them fail the
bootstrap; log any missed images instead.
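Roughly the shape of the best-effort pull (the image list variable and the k3s ctr invocation are illustrative, not the actual role):

```yaml
- name: Pre-pull bootstrap images (best effort, never fatal)
  ansible.builtin.command: k3s ctr -n k8s.io images pull {{ item }}
  loop: "{{ prepull_images }}"
  register: prepull_result
  failed_when: false          # a missed pull must never fail the bootstrap
  changed_when: prepull_result.rc == 0

- name: Log any images the pre-pull missed
  ansible.builtin.debug:
    msg: "Pre-pull missed {{ item.item }} (rc={{ item.rc }})"
  loop: "{{ prepull_result.results }}"
  loop_control:
    label: "{{ item.item }}"
  when: item.rc != 0
```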
Fresh clusters were repeatedly timing out while kubelet pulled the pause image,
k3s-packaged component images, and Flux controller images onto the first
control plane. Pre-pull the core control-plane bootstrap images into
containerd on cp-1 so Flux and packaged addons start from a warm cache instead
of racing registry TLS timeouts.
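The warm-cache image set lives in role defaults; an illustrative excerpt (the variable name and tags-as-variables are placeholders, not the real list):

```yaml
k3s_bootstrap_prepull_images:
  - "docker.io/rancher/mirrored-pause:{{ pause_tag }}"
  - "ghcr.io/fluxcd/source-controller:{{ flux_source_controller_tag }}"
  - "ghcr.io/fluxcd/kustomize-controller:{{ flux_kustomize_controller_tag }}"
  - "ghcr.io/fluxcd/helm-controller:{{ flux_helm_controller_tag }}"
```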
The primary control plane was stalling because kubelet still had to pull both
the Rancher pause image and the kube-vip image before the DaemonSet pod could
become Ready. Pre-pull those images into containerd, extend the readiness wait,
and emit pod diagnostics if kube-vip still does not come up.
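A sketch of the wait-plus-diagnostics shape (label selector, timeout, and the k3s kubectl wrapper are assumptions):

```yaml
- block:
    - name: Wait for the primary kube-vip pod to become Ready
      ansible.builtin.command: >
        k3s kubectl -n kube-system wait pod
        -l app.kubernetes.io/name=kube-vip-ds
        --field-selector spec.nodeName={{ inventory_hostname }}
        --for=condition=Ready --timeout=300s
      changed_when: false
  rescue:
    - name: Collect kube-vip pod diagnostics
      ansible.builtin.command: >
        k3s kubectl -n kube-system describe pod
        -l app.kubernetes.io/name=kube-vip-ds
      register: kube_vip_diag
      changed_when: false
      failed_when: false
    - name: Fail with the collected diagnostics
      ansible.builtin.fail:
        msg: "{{ kube_vip_diag.stdout }}"
```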
Run the Tailscale cleanup role against the cluster hostnames before any node
reconnects to the tailnet. This removes stale offline cp/worker devices from
previous rebuilds so replacement VMs can reclaim their original hostnames
instead of getting -1 suffixes.
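In site.yml this is just a play that runs before any node rejoins the tailnet; a sketch (group names are illustrative):

```yaml
- name: Remove stale tailnet devices from previous rebuilds
  hosts: localhost
  gather_facts: false
  roles:
    - role: tailscale-cleanup
      vars:
        tailscale_reserved_hostnames: "{{ groups['cp'] + groups['worker'] }}"
```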
Docker Hub TLS handshakes are too flaky to make pre-pulling a hard bootstrap
requirement. Treat image pre-pull as opportunistic and disable Rancher's
managed system-upgrade-controller feature so that image is removed from the
critical install path while Rancher and its webhook converge.
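One plausible shape for the second half, assuming the Rancher chart takes feature flags through its features value (an assumption, not verified here):

```yaml
# Rancher HelmRelease values excerpt (sketch only)
spec:
  values:
    features: "managed-system-upgrade-controller=false"
```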
Rancher installs were stalling on transient Docker Hub TLS handshake timeouts
for the Rancher shell, webhook, and system-upgrade-controller images. Pre-pull the
required images onto all nodes after k3s comes up, extend the Rancher HelmRelease
timeout, and reset/force the Rancher HelmRelease before waiting on addon-rancher
so bootstrap can recover from stale failed remediation state.
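The timeout and remediation side is plain HelmRelease configuration; a sketch with placeholder values:

```yaml
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: rancher
  namespace: cattle-system
spec:
  timeout: 30m              # extended so transient registry stalls do not abort the install
  install:
    remediation:
      retries: 3            # lets bootstrap retry instead of sticking in a failed release
  upgrade:
    remediation:
      retries: 3
```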
The local kube-vip readiness probe used an unquoted jsonpath predicate,
which made kubectl treat Ready as an identifier instead of a string. Use a
quoted jsonpath via shell so bootstrap can detect the primary kube-vip pod
properly before waiting on the API VIP.
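The fix is quoting inside the JSONPath expression and letting the shell pass it through; a sketch of the probe (selector and retry budget are illustrative):

```yaml
- name: Wait for the primary kube-vip pod to report Ready
  ansible.builtin.shell: >
    k3s kubectl -n kube-system get pod
    -l app.kubernetes.io/name=kube-vip-ds
    --field-selector spec.nodeName={{ inventory_hostname }}
    -o jsonpath='{.items[0].status.conditions[?(@.type=="Ready")].status}'
  register: kube_vip_ready
  until: kube_vip_ready.stdout == "True"
  retries: 30
  delay: 10
  changed_when: false
```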
The kube-vip DaemonSet is applied before the secondary control planes join,
so waiting for a full DaemonSet rollout blocks bootstrap on nodes that do not
exist in the cluster yet. Wait only for the primary node's kube-vip pod and
then verify the VIP is reachable on 6443.
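Verifying the VIP itself is then just a TCP probe; a sketch (the VIP variable name is illustrative):

```yaml
- name: Verify the API VIP answers on 6443
  ansible.builtin.wait_for:
    host: "{{ kube_vip_address }}"
    port: 6443
    timeout: 120
```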
The k3s install script can return non-zero while systemd is still bringing the
service up, especially on worker agents. Do not fail immediately on the
installer command; wait for the service to become active and only emit
install diagnostics if the later readiness check fails.
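Roughly the shape of that (script path and service name are illustrative; servers use the k3s unit instead of k3s-agent):

```yaml
- name: Run the k3s install script (exit code not trusted)
  ansible.builtin.command: /tmp/k3s-install.sh
  register: k3s_install
  failed_when: false

- name: Wait for the k3s agent service to become active
  ansible.builtin.command: systemctl is-active k3s-agent
  register: k3s_agent_state
  until: k3s_agent_state.stdout == "active"
  retries: 30
  delay: 10
  changed_when: false
```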
The Proxmox Ubuntu clones are exposing their primary NIC as eth0, not ens18.
Use ansible_default_ipv4.interface for k3s flannel and kube-vip so bootstrap
tracks the actual interface name instead of a guessed template default.
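A sketch of the fact-driven wiring (variable names are illustrative):

```yaml
k3s_flannel_iface: "{{ ansible_default_ipv4.interface }}"   # eth0 on the Proxmox clones
kube_vip_interface: "{{ ansible_default_ipv4.interface }}"

# consumed later roughly as:
#   --flannel-iface={{ k3s_flannel_iface }}
#   vip_interface: {{ kube_vip_interface }}
```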
Ubuntu cloud-init returns exit code 2 for some completed boots even when the
status output is 'done'. Treat that as a successful wait state so Ansible can
continue into the package install phase instead of aborting early.
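A sketch of the wait task (rc 2 is "done with recoverable warnings" on newer cloud-init):

```yaml
- name: Wait for cloud-init to finish first boot
  ansible.builtin.command: cloud-init status --wait
  register: cloud_init_status
  changed_when: false
  failed_when: cloud_init_status.rc not in [0, 2]
```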
Fresh Ubuntu cloud-init clones still hold apt and dpkg locks during first boot,
which caused the Ansible common role to fail before the control plane could
finish bootstrap. Wait for cloud-init, increase apt lock timeouts, and skip the
final kubeconfig rewrite when no kubeconfig was fetched yet.
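The apt side is a single parameter; a sketch (package list variable and timeout value are illustrative):

```yaml
- name: Install base packages once first-boot apt/dpkg locks are released
  ansible.builtin.apt:
    name: "{{ common_packages }}"
    state: present
    update_cache: true
    lock_timeout: 600      # default is 60s; cloud-init's first boot can hold the lock longer
```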
Replace Hetzner infrastructure and cloud-provider assumptions with Proxmox
VM clones, kube-vip API HA, and NFS-backed storage. Update bootstrap,
Flux addons, CI workflows, and docs to target the new private Proxmox
baseline while preserving the existing Tailscale, Doppler, Flux, Rancher,
and B2 backup flows.
The Tailscale cleanup role was deleting reserved service hostnames on later
deploy runs, which removed the live Rancher/Grafana/Prometheus/Flux proxy
nodes from the tailnet. Skip cleanup whenever the current cluster already has
those Tailscale services, while still allowing cleanup on fresh rebuilds.
Reserve grafana/prometheus/flux alongside rancher during rebuild cleanup so
stale tailnet devices do not force -1 hostnames. Tag the exposed Tailscale
services so operator-managed proxies are provisioned with explicit prod/service
tags from the tailnet policy.
Replace Ansible port-forwarding + tailscale serve with direct Tailscale LB
services matching the existing Rancher pattern. Each service gets its own
tailnet hostname (grafana/prometheus/flux.silverside-gopher.ts.net).
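With the Tailscale operator in place each proxy is a LoadBalancer Service; a Grafana-shaped sketch combining the hostname and the prod/service tags from the entry above (namespace, selector, and tag names are illustrative):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: grafana-tailscale
  namespace: monitoring
  annotations:
    tailscale.com/hostname: "grafana"            # becomes grafana.silverside-gopher.ts.net
    tailscale.com/tags: "tag:prod,tag:service"   # must be grantable in the tailnet policy
spec:
  type: LoadBalancer
  loadBalancerClass: tailscale
  selector:
    app.kubernetes.io/name: grafana
  ports:
    - port: 80
      targetPort: 3000
```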
Adds tailscale-cleanup Ansible role that uses the Tailscale API to
delete offline devices matching reserved hostnames (e.g. rancher).
Runs during site.yml before Finalize to prevent hostname collisions
like rancher-1 on rebuild.
Requires TAILSCALE_API_KEY (API access token) passed as extra var.
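A sketch of the two API calls the role makes (the "-" tailnet alias resolves to the key's own tailnet; the offline check via lastSeen is elided):

```yaml
- name: List tailnet devices
  ansible.builtin.uri:
    url: https://api.tailscale.com/api/v2/tailnet/-/devices
    headers:
      Authorization: "Bearer {{ tailscale_api_key }}"
    return_content: true
  register: ts_devices

- name: Delete devices whose hostnames are reserved for the cluster
  ansible.builtin.uri:
    url: "https://api.tailscale.com/api/v2/device/{{ item.id }}"
    method: DELETE
    headers:
      Authorization: "Bearer {{ tailscale_api_key }}"
  loop: "{{ ts_devices.json.devices }}"
  loop_control:
    label: "{{ item.hostname }}"
  when: item.hostname in tailscale_reserved_hostnames
```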
- scripts/refresh-kubeconfig.sh fetches a fresh kubeconfig from CP1
- Ansible site.yml Finalize step now uses public IP instead of Tailscale
hostname for the kubeconfig server address (sketched below)
- Updated AGENTS.md with kubeconfig refresh instructions
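A sketch of the Finalize rewrite from the second bullet (paths and fact names are illustrative):

```yaml
- name: Fetch kubeconfig from CP1
  ansible.builtin.fetch:
    src: /etc/rancher/k3s/k3s.yaml
    dest: "{{ playbook_dir }}/kubeconfig"
    flat: true

- name: Point the kubeconfig at CP1's public IP
  ansible.builtin.replace:
    path: "{{ playbook_dir }}/kubeconfig"
    regexp: 'https://127\.0\.0\.1:6443'
    replace: "https://{{ ansible_host }}:6443"
  delegate_to: localhost
```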
Changed from hardcoded Tailscale IPs to DNS names:
- k8s-cluster-cp-1.silverside-gopher.ts.net
- k8s-cluster-cp-2.silverside-gopher.ts.net
- k8s-cluster-cp-3.silverside-gopher.ts.net
This is more robust: Tailscale IPs change on rebuild, while the DNS names
remain consistent.
After next rebuild, cluster accessible via:
- kubectl --server=https://k8s-cluster-cp-1.silverside-gopher.ts.net:6443
Changes:
- Add tailscale_control_plane_ips list to k3s-server defaults
- Include all 3 control plane Tailscale IPs (100.120.55.97, 100.108.90.123, 100.92.149.85)
- Update primary k3s install to add Tailscale IPs to TLS certificates (sketched below)
- Enables kubectl access via Tailscale without certificate errors
After next deploy, cluster will be accessible via:
- kubectl --server=https://100.120.55.97:6443 (or any CP Tailscale IP)
- kubectl --server=https://k8s-cluster-cp-1:6443 (via Tailscale DNS)
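A sketch of the defaults entry and the --tls-san wiring it feeds (the args variable name is illustrative; the IPs are the ones listed above):

```yaml
tailscale_control_plane_ips:
  - 100.120.55.97
  - 100.108.90.123
  - 100.92.149.85

# appended to the primary server's install arguments
k3s_tls_san_args: >-
  {% for ip in tailscale_control_plane_ips %}--tls-san {{ ip }} {% endfor %}
```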