TerraHome

Author	SHA1	Message	Date
MichaelFisher1997	63213a4bc3	fix: ignore stale SSH host keys for ephemeral homelab VMs. Fresh destroy/recreate cycles change VM host keys, which was breaking bootstrap after rebuilds. Use a disposable known-hosts policy in the controller SSH options so automation does not fail on expected key rotation. [Terraform Plan (push): successful in 16s]	2026-03-09 03:16:18 +00:00
MichaelFisher1997	808c290c71	chore: clarify stale template cloud-init failure message. Make SSH bootstrap failures explain the real root cause when fresh clones never accept the injected user/key: the Proxmox source template itself still needs the updated cloud-init-capable NixOS configuration. [Terraform Plan (push): failing after 31s]	2026-03-08 13:16:37 +00:00
MichaelFisher1997	4c167f618a	fix: wait for SSH readiness after VM provisioning. Freshly recreated VMs can take a few minutes before cloud-init users and SSH are available. Retry SSH authentication in the bootstrap controller before failing so rebuild/bootstrap does not abort immediately on new hosts. [Terraform Plan (push): successful in 17s]	2026-03-08 05:00:39 +00:00
MichaelFisher1997	b7b364a112	fix: vendor Flannel manifest and harden CNI bootstrap timing. Stop depending on GitHub during cluster bring-up by shipping the Flannel manifest in-repo, ensure required host paths exist on NixOS nodes, and wait/retry against a stable API before applying the CNI. This removes the TLS handshake timeout failure mode and makes early network bootstrap deterministic. [Terraform Plan (push): successful in 17s]	2026-03-08 03:24:16 +00:00
MichaelFisher1997	065567210e	debug: print detailed Flannel pod diagnostics on rollout timeout. When kube-flannel daemonset rollout stalls, print pod descriptions and per-container logs for the init containers and main flannel container so the next failure shows the actual cause instead of only Init:0/2. [Terraform Plan (push): successful in 18s]	2026-03-07 12:19:21 +00:00
MichaelFisher1997	a0b07816b9	refactor: simplify homelab bootstrap around static IPs and fresh runs. Make Terraform the source of truth for node IPs, remove guest-agent/SSH discovery from the normal workflow path, simplify the bootstrap controller to a fresh-run flow, and swap the initial CNI to Flannel so cluster readiness is easier to prove before reintroducing more complex reconcile behavior. [Terraform Plan (push): failing after 10s]	2026-03-07 00:52:35 +00:00
MichaelFisher1997	e06b2c692e	fix: point Cilium directly at API server and print rollout diagnostics. Set Cilium k8sServiceHost/k8sServicePort to the primary control-plane API endpoint to avoid in-cluster service routing dependency during early bootstrap. Also print cilium daemonset/pod/log diagnostics when rollout times out. [Terraform Plan (push): successful in 18s]	2026-03-05 01:21:21 +00:00
MichaelFisher1997	ca54c44fa4	fix: stabilize Cilium install defaults and add rollout diagnostics. Set Cilium kubeProxyReplacement from env (default false for homelab stability) and collect cilium daemonset/pod/log diagnostics when rollout times out during verification. [Terraform Plan (push): successful in 17s]	2026-03-05 00:48:41 +00:00
MichaelFisher1997	a70de061b0	fix: wait for Cilium and node readiness before marking bootstrap success. Update verification stage to block on cilium daemonset rollout and all nodes reaching Ready. This prevents workflows from reporting success while the cluster is still NotReady immediately after join. [Terraform Plan (push): successful in 18s]	2026-03-04 22:26:43 +00:00
MichaelFisher1997	5ddd00f711	fix: add join preflight ignores for homelab control planes. Append --ignore-preflight-errors=NumCPU,HTTPProxyCIDR to control-plane join commands and HTTPProxyCIDR to worker joins so kubeadm join does not fail on known single-CPU/proxy CIDR checks in this environment. [Terraform Plan (push): successful in 16s]	2026-03-04 21:09:27 +00:00
MichaelFisher1997	422b7d7f23	fix: force fresh kubeadm init after rebuild and make kubelet enable-able. Always re-run primary init when reconcile performs node rebuilds to avoid stale/partial cluster state causing join preflight failures. Also add wantedBy for kubelet so systemctl enable works as expected during join/init flows. [Terraform Plan (push): successful in 17s]	2026-03-04 00:55:20 +00:00
MichaelFisher1997	3ebeb121b4	fix: force fresh bootstrap stages after rebuild and stabilize join node identity. Clear completed bootstrap stage checkpoints whenever nodes are rebuilt so reconcile does not skip required init/cni/join work on fresh hosts. Also pass explicit --node-name for control-plane and worker joins, and ensure kubelet is enabled before join commands run. [Terraform Plan (push): successful in 17s]	2026-03-04 00:26:37 +00:00
MichaelFisher1997	a66ae788f6	fix: run Cilium install with sudo and explicit kubeconfig. Use sudo for helm/kubectl on cp-1 and pass /etc/kubernetes/admin.conf so controller can install Cilium without permission errors. [Terraform Plan (push): successful in 17s]	2026-03-03 08:55:22 +00:00
MichaelFisher1997	cbb8358ce6	fix: ensure kubelet is enabled for kubeadm init node registration. Enable kubelet before kubeadm init and stop forcing kubelet out of wantedBy so kubeadm can reliably register the node during upload-config/kubelet. Also clear stale kubelet config files during remote prep to avoid restart-loop leftovers. [Terraform Plan (push): successful in 17s]	2026-03-03 01:04:50 +00:00
MichaelFisher1997	a16112a87a	fix: rebuild nodes by default on reconcile. Do not skip node rebuilds unless SKIP_REBUILD=1 is explicitly set. This prevents stale remote helper scripts from being reused across retries after bootstrap logic changes. [Terraform Plan (push): successful in 17s]	2026-03-03 00:34:55 +00:00
MichaelFisher1997	6fecfb3ee6	refactor: add Python bootstrap controller with resumable state. Introduce a clean orchestration layer in nixos/kubeadm/bootstrap/controller.py and slim rebuild-and-bootstrap.sh into a thin wrapper. The controller now owns preflight, rebuild, init, CNI install, join, and verify stages with persisted checkpoints on cp-1 plus a local state copy for CI debugging. [Terraform Plan (push): successful in 17s]	2026-03-03 00:09:10 +00:00

16 Commits
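
The two SSH fixes in the log above (63213a4bc3 and 4c167f618a) describe a disposable known-hosts policy and a retry loop around SSH authentication. The sketch below shows one way those two ideas might fit together. It is not taken from controller.py: the function names `ssh_base_args` and `wait_for_ssh`, the attempt count, the delay, and the connect timeout are all assumptions made for illustration. Only the `ssh` options themselves (`StrictHostKeyChecking`, `UserKnownHostsFile`, `BatchMode`, `ConnectTimeout`) are standard OpenSSH settings.

```python
import subprocess
import time
from typing import Callable, List, Sequence


def ssh_base_args(user: str, host: str) -> List[str]:
    """Build an ssh argv that neither checks nor stores host keys.

    Hypothetical helper: rebuilt VMs get new host keys, so the known-hosts
    file is pointed at /dev/null and strict checking is turned off.
    """
    return [
        "ssh",
        "-o", "StrictHostKeyChecking=no",
        "-o", "UserKnownHostsFile=/dev/null",
        "-o", "BatchMode=yes",       # never prompt for a password in CI
        "-o", "ConnectTimeout=10",   # assumed value
        f"{user}@{host}",
    ]


def _run(argv: Sequence[str]) -> int:
    """Run a command and return its exit code (output is discarded)."""
    return subprocess.run(argv, capture_output=True).returncode


def wait_for_ssh(
    user: str,
    host: str,
    attempts: int = 30,              # assumed value
    delay: float = 10.0,             # assumed value, in seconds
    run: Callable[[Sequence[str]], int] = _run,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Retry a trivial remote command until SSH login works.

    Returns True on the first successful login, or False once all
    attempts are used up. `run` and `sleep` are injectable for testing.
    """
    argv = ssh_base_args(user, host) + ["true"]
    for attempt in range(1, attempts + 1):
        if run(argv) == 0:
            return True
        if attempt < attempts:
            sleep(delay)
    return False
```

A caller would run something like `wait_for_ssh("admin", "cp-1")` before the rebuild stage and abort with a descriptive error if it returns `False`. Disabling host key verification is only reasonable here because the VMs are ephemeral and live on a trusted homelab network.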