HetznerTerra

Author	SHA1	Message	CI status	Date
micqdf	4151027e01	fix: clean stale Tailscale node devices before bootstrap. Run the Tailscale cleanup role against the cluster hostnames before any node reconnects to the tailnet. This removes stale offline cp/worker devices from previous rebuilds so replacement VMs can reclaim their original hostnames instead of getting -1 suffixes.	Deploy Cluster / Terraform (push): Successful in 1m40s; Deploy Cluster / Ansible (push): Failing after 14m30s	2026-04-23 03:25:17 +00:00
micqdf	9c0523e880	fix: pre-pull Rancher images and reset Rancher release during bootstrap. Rancher installs were stalling on transient Docker Hub TLS handshake timeouts for rancher shell, webhook, and system-upgrade-controller images. Pre-pull the required images onto all nodes after k3s comes up, extend the Rancher HelmRelease timeout, and reset/force the Rancher HelmRelease before waiting on addon-rancher so bootstrap can recover from stale failed remediation state.	Deploy Cluster / Terraform (push): Successful in 28s; Deploy Cluster / Ansible (push): Failing after 27m30s	2026-04-22 11:00:54 +00:00
micqdf	9a2d213114	fix: wait for cloud-init before package install during bootstrap. Fresh Ubuntu cloud-init clones still hold apt and dpkg locks during first boot, which caused the Ansible common role to fail before the control plane could finish bootstrap. Wait for cloud-init, increase apt lock timeouts, and skip the final kubeconfig rewrite when no kubeconfig was fetched yet.	Deploy Cluster / Terraform (push): Successful in 29s; Deploy Cluster / Ansible (push): Failing after 2m36s	2026-04-22 03:34:53 +00:00
micqdf	b1dae28aa5	feat: migrate cluster baseline from Hetzner to Proxmox. Replace Hetzner infrastructure and cloud-provider assumptions with Proxmox VM clones, kube-vip API HA, and NFS-backed storage. Update bootstrap, Flux addons, CI workflows, and docs to target the new private Proxmox baseline while preserving the existing Tailscale, Doppler, Flux, Rancher, and B2 backup flows.	Deploy Cluster / Terraform (push): Failing after 52s; Deploy Cluster / Ansible (push): Has been skipped; Deploy Grafana Content / Grafana Content (push): Failing after 1m37s	2026-04-22 03:02:13 +00:00
micqdf	b20356e9fe	fix: only clean stale Tailscale names before proxies exist. The Tailscale cleanup role was deleting reserved service hostnames on later deploy runs, which removed the live Rancher/Grafana/Prometheus/Flux proxy nodes from the tailnet. Skip cleanup whenever the current cluster already has those Tailscale services, while still allowing cleanup on fresh rebuilds.	Deploy Cluster / Terraform (push): Failing after 51s; Deploy Cluster / Ansible (push): Has been skipped	2026-04-18 18:16:27 +00:00
micqdf	68dbd2e5b7	fix: Reserve Tailscale service hostnames and tag exposed proxies. Reserve grafana/prometheus/flux alongside rancher during rebuild cleanup so stale tailnet devices do not force -1 hostnames. Tag the exposed Tailscale services so operator-managed proxies are provisioned with explicit prod/service tags from the tailnet policy.	Deploy Cluster / Terraform (push): Successful in 53s; Deploy Cluster / Ansible (push): Successful in 6m3s	2026-04-18 05:48:26 +00:00
micqdf	b8f64fa952	feat: Expose Grafana, Prometheus, and Flux UI via Tailscale LoadBalancer services. Replace Ansible port-forwarding + tailscale serve with direct Tailscale LB services matching the existing Rancher pattern. Each service gets its own tailnet hostname (grafana/prometheus/flux.silverside-gopher.ts.net).	Deploy Cluster / Terraform (push): Successful in 55s; Deploy Cluster / Ansible (push): Successful in 20m47s	2026-03-31 08:53:28 +00:00
micqdf	5269884408	feat: Auto-cleanup stale Tailscale devices before cluster boot. Adds tailscale-cleanup Ansible role that uses the Tailscale API to delete offline devices matching reserved hostnames (e.g. rancher). Runs during site.yml before Finalize to prevent hostname collisions like rancher-1 on rebuild. Requires TAILSCALE_API_KEY (API access token) passed as extra var.	Deploy Cluster / Terraform (push): Successful in 2m17s; Deploy Cluster / Ansible (push): Failing after 6m35s	2026-03-29 11:47:53 +00:00
micqdf	6e5b0518be	feat: Add kubeconfig refresh script and fix Ansible Finalize to use public IP. scripts/refresh-kubeconfig.sh fetches a fresh kubeconfig from CP1; Ansible site.yml Finalize step now uses public IP instead of Tailscale hostname for the kubeconfig server address; Updated AGENTS.md with kubeconfig refresh instructions	Deploy Cluster / Terraform (push): Successful in 53s; Deploy Cluster / Ansible (push): Successful in 5m25s	2026-03-29 03:31:36 +00:00
micqdf	60ceac4624	Fix Rancher access: add kubectl port-forward + tailscale serve setup	Deploy Cluster / Ansible (push): Has been cancelled; Deploy Cluster / Terraform (push): Has been cancelled	2026-03-24 20:01:57 +00:00
micqdf	90d105e5ea	Fix kube_api_endpoint variable passing for HA cluster. Remove circular variable reference in site.yml; Add kube_api_endpoint default to k3s-server role; Variable is set via inventory group_vars and passed to role; Primary CP now correctly adds LB IP to TLS SANs. Note: Existing cluster needs destroy/rebuild to regenerate certificates.	Deploy Cluster / Terraform (push): Successful in 2m18s; Deploy Cluster / Ansible (push): Successful in 8m55s	2026-03-23 03:01:53 +00:00
micqdf	952a80a742	Fix HA cluster join via Load Balancer private IP. Changes: Use LB private IP (10.0.1.5) instead of public IP for cluster joins; Add LB private IP to k3s TLS SANs on primary control plane; This allows secondary CPs and workers to verify certificates when joining via LB. Fixes x509 certificate validation error when joining via LB public IP.	Deploy Cluster / Terraform (push): Successful in 36s; Deploy Cluster / Ansible (push): Failing after 3m5s	2026-03-23 02:56:41 +00:00
micqdf	ff31cb4e74	Implement HA control plane with Load Balancer (3-3 topology). Major changes: Terraform: Scale to 3 control planes (cx23) + 3 workers (cx33); Terraform: Add Hetzner Load Balancer (lb11) for Kubernetes API; Terraform: Add kube_api_lb_ip output; Ansible: Add community.network collection to requirements; Ansible: Update inventory to include LB endpoint; Ansible: Configure secondary CPs and workers to join via LB; Ansible: Add k3s_join_endpoint variable for HA joins; Workflow: Add imports for cp-2, cp-3, and worker-3; Docs: Update STABLE_BASELINE.md with HA topology and phase gates. Topology: 3 control planes (cx23 - 2 vCPU, 8GB RAM each); 3 workers (cx33 - 4 vCPU, 16GB RAM each); 1 Load Balancer (lb11) routing to all 3 control planes on port 6443; Workers and secondary CPs join via LB endpoint for HA. Cost impact: +~€26/month (2 extra CPs + 1 extra worker + LB)	Deploy Cluster / Terraform (push): Failing after 10s; Deploy Cluster / Ansible (push): Has been skipped	2026-03-23 02:39:39 +00:00
micqdf	31b82c9371	Deploy CCM via Ansible before workers join to fix external cloud provider. This fixes the chicken-and-egg problem where workers with --kubelet-arg=cloud-provider=external couldn't join because CCM wasn't running yet to remove the node.cloudprovider.kubernetes.io/uninitialized taint. Changes: Create ansible/roles/ccm-deploy/ to deploy CCM via Helm during Ansible phase; Reorder site.yml: CCM deploys after secrets but before workers join; CCM runs on control_plane[0] with proper tolerations for control plane nodes; Add 10s pause after CCM ready to ensure it can process new nodes; Workers can now successfully join with external cloud provider enabled. Flux still manages CCM for updates, but initial install happens in Ansible.	Deploy Cluster / Terraform (push): Successful in 31s; Deploy Cluster / Ansible (push): Failing after 1m48s	2026-03-22 23:58:03 +00:00
micqdf	08a3031276	refactor: retire imperative addon roles	Deploy Cluster / Terraform (push): Successful in 52s; Deploy Cluster / Ansible (push): Successful in 4m2s	2026-03-17 01:04:02 +00:00
micqdf	bed8e4afc8	feat: migrate core addons toward flux	Deploy Cluster / Terraform (push): Successful in 49s; Deploy Cluster / Ansible (push): Successful in 4m6s	2026-03-11 17:43:35 +00:00
micqdf	6f2e056b98	feat: sync runtime secrets from doppler	Deploy Cluster / Terraform (push): Successful in 45s; Deploy Cluster / Ansible (push): Successful in 9m56s	2026-03-09 00:25:41 +00:00
micqdf	f95e0051a5	feat: automate private tailnet access on cp1	Deploy Cluster / Terraform (push): Successful in 47s; Deploy Cluster / Ansible (push): Successful in 9m45s	2026-03-08 04:16:06 +00:00
micqdf	86fb5d5b90	fix: move observability gitops gating to role level	Deploy Cluster / Terraform (push): Successful in 44s; Deploy Cluster / Ansible (push): Successful in 9m17s	2026-03-05 00:17:25 +00:00
micqdf	8b403cd1d6	feat: migrate observability stack to flux gitops	Deploy Cluster / Terraform (push): Successful in 45s; Deploy Cluster / Ansible (push): Failing after 1m11s	2026-03-04 23:38:40 +00:00
micqdf	2f166ed9e7	feat: manage grafana content as code with fast dashboard workflow	Deploy Cluster / Terraform (push): Successful in 46s; Deploy Cluster / Ansible (push): Has been cancelled; Deploy Grafana Content / Grafana Content (push): Has been cancelled	2026-03-04 03:36:01 +00:00
micqdf	a0ed6523ec	feat: add Tailscale Kubernetes Operator for Grafana/Prometheus access	Deploy Cluster / Ansible (push): Has been cancelled; Deploy Cluster / Terraform (push): Has been cancelled	2026-03-02 20:28:51 +00:00
micqdf	b30977a158	feat: deploy lightweight observability stack via Ansible	Deploy Cluster / Terraform (push): Successful in 45s; Deploy Cluster / Ansible (push): Has been cancelled	2026-03-02 01:33:41 +00:00
micqdf	2bc9749b81	feat: switch kubeconfig to tailnet endpoint and deploy Hetzner CSI	Deploy Cluster / Terraform (push): Successful in 51s; Deploy Cluster / Ansible (push): Successful in 3m12s	2026-03-01 17:12:12 +00:00
micqdf	b5b8f89dc2	fix: derive k3s node IPs from terraform private addresses	Deploy Cluster / Terraform (push): Successful in 18s; Deploy Cluster / Ansible (push): Failing after 3m9s	2026-03-01 03:08:56 +00:00
micqdf	b703cb269b	fix: bootstrap k3s HA on private network with dual SANs	Deploy Cluster / Terraform (push): Successful in 2m31s; Deploy Cluster / Ansible (push): Failing after 4m38s	2026-03-01 02:45:00 +00:00
micqdf	64dfbf7315	fix: use primary public IP for k3s join to match existing API cert SAN	Deploy Cluster / Terraform (push): Successful in 18s; Deploy Cluster / Ansible (push): Failing after 17m50s	2026-03-01 02:25:13 +00:00
micqdf	27b29322cd	fix: use private network IPs for k3s join and node addressing	Deploy Cluster / Terraform (push): Successful in 24s; Deploy Cluster / Ansible (push): Failing after 8m13s	2026-03-01 00:42:55 +00:00
micqdf	1db435cd42	fix: Use private IP for k3s HA cluster join and advertise	Deploy Cluster / Terraform (push): Successful in 19s; Deploy Cluster / Ansible (push): Failing after 8m11s	2026-03-01 00:32:03 +00:00
micqdf	3b3084b997	feat: Add HA Kubernetes cluster with Terraform + Ansible. 3x CX23 control plane nodes (HA); 4x CX33 worker nodes; k3s with embedded etcd; Hetzner CCM for load balancers; Gitea CI/CD workflows; Backblaze B2 for Terraform state	Terraform / Validate (push): Failing after 17s; Terraform / Plan (push): Has been skipped; Terraform / Apply (push): Has been skipped	2026-02-28 20:24:55 +00:00

30 Commits
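
The Tailscale cleanup described in commits 5269884408 and 68dbd2e5b7 (delete offline tailnet devices whose names match reserved hostnames, so rebuilt nodes do not get -1 suffixes) can be illustrated with a minimal Python sketch. This is not the repository's Ansible role. The function name `select_stale_devices` and the device shape (dicts with `hostname` and `online` keys) are assumptions made for illustration and are not the Tailscale API's response schema. The sketch covers only the selection step, not the API calls that list or delete devices.

```python
def select_stale_devices(devices, reserved_hostnames):
    """Return the devices that are offline and hold a reserved hostname.

    `devices` is a list of dicts with hypothetical keys:
      - "hostname": the device's tailnet hostname (str)
      - "online":   whether the device is currently connected (bool)
    """
    # Compare case-insensitively so "Rancher" and "rancher" are treated alike.
    reserved = {name.lower() for name in reserved_hostnames}
    return [
        device
        for device in devices
        if device["hostname"].lower() in reserved and not device["online"]
    ]
```

Online devices are never selected, which matches the concern in commit b20356e9fe about not removing live proxy nodes. That commit's additional guard (skip cleanup entirely when the cluster already has the Tailscale services) would sit outside this function.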