35 Commits

micqdf a33a993867 fix: harden cluster rebuild determinism
Deploy Grafana Content / Grafana Content (push) Failing after 1m14s
Deploy Cluster / Terraform (push) Failing after 4m59s
Deploy Cluster / Ansible (push) Has been skipped
2026-04-30 07:36:27 +00:00
micqdf fd5451a5ef fix: wait for ssh before gathering facts
Deploy Cluster / Terraform (push) Successful in 30s
Deploy Cluster / Ansible (push) Failing after 1h13m38s
2026-04-30 03:44:13 +00:00
micqdf 46b2ff7d19 fix: harden final health checks
Deploy Cluster / Terraform (push) Successful in 31s
Deploy Cluster / Ansible (push) Failing after 17m50s
2026-04-26 02:14:02 +00:00
micqdf 347ca041ba fix: reduce bootstrap pre-pull delays on reruns
Deploy Cluster / Terraform (push) Successful in 28s
Deploy Cluster / Ansible (push) Failing after 39m26s
2026-04-24 12:09:34 +00:00
micqdf ee6417c18e fix: pre-pull core bootstrap images on cp1 before Flux bootstrap
Deploy Cluster / Terraform (push) Successful in 28s
Deploy Cluster / Ansible (push) Has been cancelled
Fresh clusters were repeatedly timing out while kubelet pulled the pause image,
k3s packaged component images, and Flux controller images onto the first
control plane. Pre-pull the core control-plane bootstrap images into
containerd on cp-1 so Flux and packaged addons start from a warm cache instead
of racing registry TLS timeouts.
2026-04-23 05:55:14 +00:00
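A minimal Ansible sketch of the pre-pull step described above, using k3s's bundled crictl; the image list, tags, and group name are illustrative, not taken from the repo:

    # Warm containerd's image cache on the first control plane so Flux
    # bootstrap does not race registry TLS timeouts. Image tags are examples.
    - name: Pre-pull core bootstrap images on cp-1
      ansible.builtin.command: k3s crictl pull {{ item }}
      loop:
        - rancher/mirrored-pause:3.6
        - ghcr.io/fluxcd/source-controller:v1.3.0
        - ghcr.io/fluxcd/kustomize-controller:v1.3.0
        - ghcr.io/fluxcd/helm-controller:v1.0.1
      register: prepull
      retries: 5
      delay: 10
      until: prepull is succeeded
      when: inventory_hostname == groups['control_plane'][0]
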
micqdf 4151027e01 fix: clean stale Tailscale node devices before bootstrap
Deploy Cluster / Terraform (push) Successful in 1m40s
Deploy Cluster / Ansible (push) Failing after 14m30s
Run the Tailscale cleanup role against the cluster hostnames before any node
reconnects to the tailnet. This removes stale offline cp/worker devices from
previous rebuilds so replacement VMs can reclaim their original hostnames
instead of getting -1 suffixes.
2026-04-23 03:25:17 +00:00
micqdf 9c0523e880 fix: pre-pull Rancher images and reset Rancher release during bootstrap
Deploy Cluster / Terraform (push) Successful in 28s
Deploy Cluster / Ansible (push) Failing after 27m30s
Rancher installs were stalling on transient Docker Hub TLS handshake timeouts
for the rancher/shell, webhook, and system-upgrade-controller images. Pre-pull the
required images onto all nodes after k3s comes up, extend the Rancher HelmRelease
timeout, and reset/force the Rancher HelmRelease before waiting on addon-rancher
so bootstrap can recover from stale failed remediation state.
2026-04-22 11:00:54 +00:00
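A hedged sketch of the reset/force step, assuming a recent Flux CLI where flux reconcile helmrelease supports --reset and --force, and Rancher's usual cattle-system namespace:

    # Clear stale failed-remediation state, then retry the release.
    - name: Reset and force the Rancher HelmRelease
      ansible.builtin.command: >
        flux reconcile helmrelease rancher
        --namespace cattle-system --reset --force
      changed_when: true

    - name: Wait for the Rancher HelmRelease to become Ready
      ansible.builtin.command: >
        kubectl wait helmrelease/rancher -n cattle-system
        --for=condition=Ready --timeout=30m
      changed_when: false
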
micqdf 9a2d213114 fix: wait for cloud-init before package install during bootstrap
Deploy Cluster / Terraform (push) Successful in 29s
Deploy Cluster / Ansible (push) Failing after 2m36s
Fresh Ubuntu cloud-init clones still hold apt and dpkg locks during first boot,
which caused the Ansible common role to fail before the control plane could
finish bootstrap. Wait for cloud-init, increase apt lock timeouts, and skip the
final kubeconfig rewrite when no kubeconfig was fetched yet.
2026-04-22 03:34:53 +00:00
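One way the cloud-init guard could look as Ansible tasks; common_packages is a hypothetical variable standing in for the common role's package list:

    # Block until cloud-init's first boot has fully finished, so apt and
    # dpkg locks are released before the common role installs packages.
    - name: Wait for cloud-init to complete
      ansible.builtin.command: cloud-init status --wait
      changed_when: false

    - name: Install base packages with a generous lock timeout
      ansible.builtin.apt:
        name: "{{ common_packages }}"
        state: present
        update_cache: true
        lock_timeout: 600
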
micqdf b1dae28aa5 feat: migrate cluster baseline from Hetzner to Proxmox
Deploy Cluster / Terraform (push) Failing after 52s
Deploy Cluster / Ansible (push) Has been skipped
Deploy Grafana Content / Grafana Content (push) Failing after 1m37s
Replace Hetzner infrastructure and cloud-provider assumptions with Proxmox
VM clones, kube-vip API HA, and NFS-backed storage. Update bootstrap,
Flux addons, CI workflows, and docs to target the new private Proxmox
baseline while preserving the existing Tailscale, Doppler, Flux, Rancher,
and B2 backup flows.
2026-04-22 03:02:13 +00:00
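For the kube-vip API HA piece, a common pattern on k3s is to render a static pod manifest from the kube-vip image itself; the interface, VIP, and version below are placeholders, not values from this repo:

    # Runs on cp-1 once k3s's containerd is available; kubelet then
    # starts kube-vip as a static pod advertising the API VIP.
    - name: Pull the kube-vip image
      ansible.builtin.command: k3s ctr image pull ghcr.io/kube-vip/kube-vip:v0.8.0

    - name: Render the kube-vip static pod manifest
      ansible.builtin.shell: >
        k3s ctr run --rm --net-host ghcr.io/kube-vip/kube-vip:v0.8.0 vip
        /kube-vip manifest pod --interface eth0 --address 192.168.1.50
        --controlplane --arp --leaderElection
        > /var/lib/rancher/k3s/agent/pod-manifests/kube-vip.yaml
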
micqdf b20356e9fe fix: only clean stale Tailscale names before proxies exist
Deploy Cluster / Terraform (push) Failing after 51s
Deploy Cluster / Ansible (push) Has been skipped
The Tailscale cleanup role was deleting reserved service hostnames on later
deploy runs, which removed the live Rancher/Grafana/Prometheus/Flux proxy
nodes from the tailnet. Skip cleanup whenever the current cluster already has
those Tailscale services, while still allowing cleanup on fresh rebuilds.
2026-04-18 18:16:27 +00:00
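A guard like the one described could be built by listing tailnet devices first; the variable names are illustrative, and "-" addresses the API key's default tailnet:

    # Skip cleanup whenever the live proxy devices already exist.
    - name: List tailnet devices
      ansible.builtin.uri:
        url: https://api.tailscale.com/api/v2/tailnet/-/devices
        headers:
          Authorization: "Bearer {{ tailscale_api_key }}"
        return_content: true
      register: tailnet_devices

    - name: Only allow cleanup on fresh rebuilds
      ansible.builtin.set_fact:
        tailscale_cleanup_safe: >-
          {{ tailnet_devices.json.devices
             | selectattr('hostname', 'in',
                          ['rancher', 'grafana', 'prometheus', 'flux'])
             | list | length == 0 }}
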
micqdf 68dbd2e5b7 fix: Reserve Tailscale service hostnames and tag exposed proxies
Deploy Cluster / Terraform (push) Successful in 53s
Deploy Cluster / Ansible (push) Successful in 6m3s
Reserve grafana/prometheus/flux alongside rancher during rebuild cleanup so
stale tailnet devices do not force -1 hostnames. Tag the exposed Tailscale
services so operator-managed proxies are provisioned with explicit prod/service
tags from the tailnet policy.
2026-04-18 05:48:26 +00:00
micqdf b8f64fa952 feat: Expose Grafana, Prometheus, and Flux UI via Tailscale LoadBalancer services
Deploy Cluster / Terraform (push) Successful in 55s
Deploy Cluster / Ansible (push) Successful in 20m47s
Replace Ansible port-forwarding + tailscale serve with direct Tailscale LB
services matching the existing Rancher pattern. Each service gets its own
tailnet hostname (grafana/prometheus/flux.silverside-gopher.ts.net).
2026-03-31 08:53:28 +00:00
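Sketch of one of the three Services following the pattern the message describes; the namespace, selector, and ports are assumptions about the Grafana chart, not taken from the repo:

    apiVersion: v1
    kind: Service
    metadata:
      name: grafana-tailscale
      namespace: monitoring
      annotations:
        tailscale.com/hostname: grafana   # becomes grafana.silverside-gopher.ts.net
    spec:
      type: LoadBalancer
      loadBalancerClass: tailscale        # provisioned by the Tailscale operator
      selector:
        app.kubernetes.io/name: grafana
      ports:
        - port: 80
          targetPort: 3000
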
micqdf 5269884408 feat: Auto-cleanup stale Tailscale devices before cluster boot
Deploy Cluster / Terraform (push) Successful in 2m17s
Deploy Cluster / Ansible (push) Failing after 6m35s
Adds tailscale-cleanup Ansible role that uses the Tailscale API to
delete offline devices matching reserved hostnames (e.g. rancher).
Runs during site.yml before Finalize to prevent hostname collisions
like rancher-1 on rebuild.

Requires TAILSCALE_API_KEY (API access token) passed as extra var.
2026-03-29 11:47:53 +00:00
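A minimal sketch of the deletion loop, reusing a device list fetched from /api/v2/tailnet/-/devices; tailscale_reserved_hostnames is illustrative, and the real role's offline check (via each device's lastSeen age) is omitted here for brevity:

    # Delete stale devices whose hostnames are reserved for the cluster.
    - name: Delete offline devices with reserved hostnames
      ansible.builtin.uri:
        url: "https://api.tailscale.com/api/v2/device/{{ item.id }}"
        method: DELETE
        headers:
          Authorization: "Bearer {{ tailscale_api_key }}"
      loop: "{{ tailnet_devices.json.devices }}"
      when: item.hostname in tailscale_reserved_hostnames
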
micqdf 6e5b0518be feat: Add kubeconfig refresh script and fix Ansible Finalize to use public IP
Deploy Cluster / Terraform (push) Successful in 53s
Deploy Cluster / Ansible (push) Successful in 5m25s
- scripts/refresh-kubeconfig.sh fetches a fresh kubeconfig from CP1
- Ansible site.yml Finalize step now uses public IP instead of Tailscale
  hostname for the kubeconfig server address
- Updated AGENTS.md with kubeconfig refresh instructions
2026-03-29 03:31:36 +00:00
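The Finalize rewrite could be as small as a regex replace on the fetched file; the path and variable names here are assumptions:

    # Point the fetched kubeconfig at cp-1's public IP instead of the
    # Tailscale hostname.
    - name: Rewrite kubeconfig server address
      ansible.builtin.replace:
        path: "{{ playbook_dir }}/kubeconfig"
        regexp: 'server: https://[^:]+:6443'
        replace: 'server: https://{{ cp1_public_ip }}:6443'
      delegate_to: localhost
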
micqdf 60ceac4624 Fix Rancher access: add kubectl port-forward + tailscale serve setup
Deploy Cluster / Ansible (push) Has been cancelled
Deploy Cluster / Terraform (push) Has been cancelled
2026-03-24 20:01:57 +00:00
micqdf 90d105e5ea Fix kube_api_endpoint variable passing for HA cluster
Deploy Cluster / Terraform (push) Successful in 2m18s
Deploy Cluster / Ansible (push) Successful in 8m55s
- Remove circular variable reference in site.yml
- Add kube_api_endpoint default to k3s-server role
- Variable is set via inventory group_vars and passed to role
- Primary CP now correctly adds LB IP to TLS SANs

Note: Existing cluster needs destroy/rebuild to regenerate certificates.
2026-03-23 03:01:53 +00:00
micqdf 952a80a742 Fix HA cluster join via Load Balancer private IP
Deploy Cluster / Terraform (push) Successful in 36s
Deploy Cluster / Ansible (push) Failing after 3m5s
Changes:
- Use LB private IP (10.0.1.5) instead of public IP for cluster joins
- Add LB private IP to k3s TLS SANs on primary control plane
- This allows secondary CPs and workers to verify certificates when joining via LB

Fixes x509 certificate validation error when joining via LB public IP.
2026-03-23 02:56:41 +00:00
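In k3s config terms, the two changes above amount to something like this sketch (10.0.1.5 is the LB private IP from the message; the rest is illustrative):

    # Primary control plane, /etc/rancher/k3s/config.yaml:
    # add the LB private IP to the API server cert SANs.
    tls-san:
      - 10.0.1.5

    # Secondary CPs and workers: join through the LB's private IP.
    server: https://10.0.1.5:6443
    token: "{{ k3s_token }}"
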
micqdf ff31cb4e74 Implement HA control plane with Load Balancer (3-3 topology)
Deploy Cluster / Terraform (push) Failing after 10s
Deploy Cluster / Ansible (push) Has been skipped
Major changes:
- Terraform: Scale to 3 control planes (cx23) + 3 workers (cx33)
- Terraform: Add Hetzner Load Balancer (lb11) for Kubernetes API
- Terraform: Add kube_api_lb_ip output
- Ansible: Add community.network collection to requirements
- Ansible: Update inventory to include LB endpoint
- Ansible: Configure secondary CPs and workers to join via LB
- Ansible: Add k3s_join_endpoint variable for HA joins
- Workflow: Add imports for cp-2, cp-3, and worker-3
- Docs: Update STABLE_BASELINE.md with HA topology and phase gates

Topology:
- 3 control planes (cx23 - 2 vCPU, 8GB RAM each)
- 3 workers (cx33 - 4 vCPU, 16GB RAM each)
- 1 Load Balancer (lb11) routing to all 3 control planes on port 6443
- Workers and secondary CPs join via LB endpoint for HA

Cost impact: +~€26/month (2 extra CPs + 1 extra worker + LB)
2026-03-23 02:39:39 +00:00
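An illustrative shape for the inventory wiring listed above; the hostnames and the plumbing of the Terraform kube_api_lb_ip output into Ansible are assumptions:

    all:
      vars:
        # kube_api_lb_ip is assumed to be injected from the Terraform output.
        kube_api_endpoint: "{{ kube_api_lb_ip }}"
        k3s_join_endpoint: "https://{{ kube_api_lb_ip }}:6443"
      children:
        control_plane:
          hosts:
            cp-1:
            cp-2:
            cp-3:
        workers:
          hosts:
            worker-1:
            worker-2:
            worker-3:
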
micqdf 31b82c9371 Deploy CCM via Ansible before workers join to fix external cloud provider
Deploy Cluster / Terraform (push) Successful in 31s
Deploy Cluster / Ansible (push) Failing after 1m48s
This fixes the chicken-and-egg problem where workers with
--kubelet-arg=cloud-provider=external couldn't join because CCM wasn't
running yet to remove the node.cloudprovider.kubernetes.io/uninitialized taint.

Changes:
- Create ansible/roles/ccm-deploy/ to deploy CCM via Helm during Ansible phase
- Reorder site.yml: CCM deploys after secrets but before workers join
- CCM runs on control_plane[0] with proper tolerations for control plane nodes
- Add 10s pause after CCM ready to ensure it can process new nodes
- Workers can now successfully join with external cloud provider enabled

Flux still manages CCM for updates, but initial install happens in Ansible.
2026-03-22 23:58:03 +00:00
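A hedged sketch of the ccm-deploy role's core tasks, using the public Hetzner CCM chart; the release name and omitted values/tolerations are illustrative:

    - name: Add the Hetzner Helm repository
      kubernetes.core.helm_repository:
        name: hcloud
        repo_url: https://charts.hetzner.cloud

    - name: Deploy hcloud-cloud-controller-manager before workers join
      kubernetes.core.helm:
        name: hccm
        chart_ref: hcloud/hcloud-cloud-controller-manager
        release_namespace: kube-system
        wait: true

    - name: Give CCM a moment to start processing new nodes
      ansible.builtin.pause:
        seconds: 10
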
micqdf 08a3031276 refactor: retire imperative addon roles
Deploy Cluster / Terraform (push) Successful in 52s
Deploy Cluster / Ansible (push) Successful in 4m2s
2026-03-17 01:04:02 +00:00
micqdf bed8e4afc8 feat: migrate core addons toward flux
Deploy Cluster / Terraform (push) Successful in 49s
Deploy Cluster / Ansible (push) Successful in 4m6s
2026-03-11 17:43:35 +00:00
micqdf 6f2e056b98 feat: sync runtime secrets from doppler
Deploy Cluster / Terraform (push) Successful in 45s
Deploy Cluster / Ansible (push) Successful in 9m56s
2026-03-09 00:25:41 +00:00
micqdf f95e0051a5 feat: automate private tailnet access on cp1
Deploy Cluster / Terraform (push) Successful in 47s
Deploy Cluster / Ansible (push) Successful in 9m45s
2026-03-08 04:16:06 +00:00
micqdf 86fb5d5b90 fix: move observability gitops gating to role level
Deploy Cluster / Terraform (push) Successful in 44s
Deploy Cluster / Ansible (push) Successful in 9m17s
2026-03-05 00:17:25 +00:00
micqdf 8b403cd1d6 feat: migrate observability stack to flux gitops
Deploy Cluster / Terraform (push) Successful in 45s
Deploy Cluster / Ansible (push) Failing after 1m11s
2026-03-04 23:38:40 +00:00
micqdf 2f166ed9e7 feat: manage grafana content as code with fast dashboard workflow
Deploy Cluster / Terraform (push) Successful in 46s
Deploy Cluster / Ansible (push) Has been cancelled
Deploy Grafana Content / Grafana Content (push) Has been cancelled
2026-03-04 03:36:01 +00:00
micqdf a0ed6523ec feat: add Tailscale Kubernetes Operator for Grafana/Prometheus access
Deploy Cluster / Ansible (push) Has been cancelled
Deploy Cluster / Terraform (push) Has been cancelled
2026-03-02 20:28:51 +00:00
micqdf b30977a158 feat: deploy lightweight observability stack via Ansible
Deploy Cluster / Terraform (push) Successful in 45s
Deploy Cluster / Ansible (push) Has been cancelled
2026-03-02 01:33:41 +00:00
micqdf 2bc9749b81 feat: switch kubeconfig to tailnet endpoint and deploy Hetzner CSI
Deploy Cluster / Terraform (push) Successful in 51s
Deploy Cluster / Ansible (push) Successful in 3m12s
2026-03-01 17:12:12 +00:00
micqdf b5b8f89dc2 fix: derive k3s node IPs from terraform private addresses
Deploy Cluster / Terraform (push) Successful in 18s
Deploy Cluster / Ansible (push) Failing after 3m9s
2026-03-01 03:08:56 +00:00
micqdf b703cb269b fix: bootstrap k3s HA on private network with dual SANs
Deploy Cluster / Terraform (push) Successful in 2m31s
Deploy Cluster / Ansible (push) Failing after 4m38s
2026-03-01 02:45:00 +00:00
micqdf 64dfbf7315 fix: use primary public IP for k3s join to match existing API cert SAN
Deploy Cluster / Terraform (push) Successful in 18s
Deploy Cluster / Ansible (push) Failing after 17m50s
2026-03-01 02:25:13 +00:00
micqdf 27b29322cd fix: use private network IPs for k3s join and node addressing
Deploy Cluster / Terraform (push) Successful in 24s
Deploy Cluster / Ansible (push) Failing after 8m13s
2026-03-01 00:42:55 +00:00
micqdf 1db435cd42 fix: Use private IP for k3s HA cluster join and advertise
Deploy Cluster / Terraform (push) Successful in 19s
Deploy Cluster / Ansible (push) Failing after 8m11s
2026-03-01 00:32:03 +00:00
micqdf 3b3084b997 feat: Add HA Kubernetes cluster with Terraform + Ansible
Terraform / Validate (push) Failing after 17s
Terraform / Plan (push) Has been skipped
Terraform / Apply (push) Has been skipped
- 3x CX23 control plane nodes (HA)
- 4x CX33 worker nodes
- k3s with embedded etcd
- Hetzner CCM for load balancers
- Gitea CI/CD workflows
- Backblaze B2 for Terraform state
2026-02-28 20:24:55 +00:00
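For the embedded etcd piece, k3s HA bootstrapping reduces to a small config split; this is a generic sketch, not the repo's actual files:

    # cp-1, /etc/rancher/k3s/config.yaml: initialize the embedded etcd cluster.
    cluster-init: true

    # cp-2 / cp-3: join the first server.
    server: https://<cp-1 address>:6443
    token: "{{ k3s_token }}"
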