13 Commits

Author SHA1 Message Date
b7b364a112 fix: vendor Flannel manifest and harden CNI bootstrap timing
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 17s
Stop depending on GitHub during cluster bring-up by shipping the Flannel manifest in-repo, ensure required host paths exist on NixOS nodes, and wait/retry against a stable API before applying the CNI. This removes the TLS handshake timeout failure mode and makes early network bootstrap deterministic.
2026-03-08 03:24:16 +00:00
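The wait/retry step from b7b364a112 could look roughly like the sketch below. It is not the repository's code. It assumes "stable" means several consecutive successful probes, and the function name, parameters, and defaults are all hypothetical. The probe is injected so the loop can be exercised without a cluster.

```python
import time
from typing import Callable


def wait_for_stable_api(
    probe: Callable[[], bool],
    required_successes: int = 3,   # assumption: "stable" = N successes in a row
    max_attempts: int = 30,        # assumption: illustrative retry budget
    delay_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True once `probe` succeeds `required_successes` times in a row."""
    streak = 0
    for attempt in range(max_attempts):
        if probe():
            streak += 1
            if streak >= required_successes:
                return True
        else:
            # Any failure resets the streak, so a flapping API is not treated as stable.
            streak = 0
        if attempt < max_attempts - 1:
            sleep(delay_seconds)
    return False
```

In practice the probe might shell out to something like `kubectl get --raw /readyz` and return whether the exit code was zero. That choice is also an assumption.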
065567210e debug: print detailed Flannel pod diagnostics on rollout timeout
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 18s
When kube-flannel daemonset rollout stalls, print pod descriptions and per-container logs for the init containers and main flannel container so the next failure shows the actual cause instead of only Init:0/2.
2026-03-07 12:19:21 +00:00
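As a rough illustration of the diagnostics in 065567210e, the sketch below builds the `kubectl` commands that would describe each pod and fetch logs per container. The namespace and container names follow the upstream Flannel manifest as far as I know, but they are assumptions here, as are the function name and the `--tail` limit.

```python
from typing import List

# Assumptions: these names mirror the upstream Flannel manifest and are not
# confirmed by the commit message.
NAMESPACE = "kube-flannel"
INIT_CONTAINERS = ["install-cni-plugin", "install-cni"]
MAIN_CONTAINER = "kube-flannel"


def flannel_diagnostic_commands(pod_names: List[str]) -> List[List[str]]:
    """Build describe/logs commands for every Flannel pod and container."""
    commands: List[List[str]] = []
    for pod in pod_names:
        commands.append(["kubectl", "-n", NAMESPACE, "describe", "pod", pod])
        # Init containers come first because a stuck Init:0/2 points at them.
        for container in INIT_CONTAINERS + [MAIN_CONTAINER]:
            commands.append(
                ["kubectl", "-n", NAMESPACE, "logs", pod, "-c", container, "--tail=100"]
            )
    return commands
```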
a0b07816b9 refactor: simplify homelab bootstrap around static IPs and fresh runs
Some checks failed
Terraform Plan / Terraform Plan (push) Failing after 10s
Make Terraform the source of truth for node IPs, remove guest-agent/SSH discovery from the normal workflow path, simplify the bootstrap controller to a fresh-run flow, and swap the initial CNI to Flannel so cluster readiness is easier to prove before reintroducing more complex reconcile behavior.
2026-03-07 00:52:35 +00:00
e06b2c692e fix: point Cilium directly at API server and print rollout diagnostics
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 18s
Set Cilium k8sServiceHost/k8sServicePort to the primary control-plane API endpoint to avoid in-cluster service routing dependency during early bootstrap. Also print cilium daemonset/pod/log diagnostics when rollout times out.
2026-03-05 01:21:21 +00:00
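The API endpoint override from e06b2c692e amounts to two Helm values. A minimal sketch of how a controller might build those arguments follows. The helper name is hypothetical, and the default port of 6443 is the usual kubeadm API server port rather than something the commit states.

```python
from typing import List


def cilium_api_overrides(api_host: str, api_port: int = 6443) -> List[str]:
    """Return `helm --set` arguments pointing Cilium at the API server directly."""
    return [
        "--set", f"k8sServiceHost={api_host}",
        "--set", f"k8sServicePort={api_port}",
    ]
```

The returned list would be appended to the `helm` install or upgrade command for the Cilium chart.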
ca54c44fa4 fix: stabilize Cilium install defaults and add rollout diagnostics
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 17s
Set Cilium kubeProxyReplacement from env (default false for homelab stability) and collect cilium daemonset/pod/log diagnostics when rollout times out during verification.
2026-03-05 00:48:41 +00:00
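Reading kubeProxyReplacement from the environment, as ca54c44fa4 describes, might be sketched as below. The environment variable name is hypothetical because the commit does not name it. Only the default of false comes from the message.

```python
import os
from typing import List, Mapping


def kube_proxy_replacement_args(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the `helm --set` argument for kubeProxyReplacement."""
    # Assumption: the variable name is illustrative. The default of "false"
    # matches the homelab default described in the commit.
    value = env.get("CILIUM_KUBE_PROXY_REPLACEMENT", "false")
    return ["--set", f"kubeProxyReplacement={value}"]
```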
a70de061b0 fix: wait for Cilium and node readiness before marking bootstrap success
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 18s
Update verification stage to block on cilium daemonset rollout and all nodes reaching Ready. This prevents workflows from reporting success while the cluster is still NotReady immediately after join.
2026-03-04 22:26:43 +00:00
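The node readiness check in a70de061b0 can be illustrated with a small helper that inspects the JSON from `kubectl get nodes -o json`. This is a sketch with a hypothetical function name. A verification stage would keep polling until the returned list is empty.

```python
from typing import Any, Dict, List


def not_ready_nodes(node_list: Dict[str, Any]) -> List[str]:
    """Return names of nodes whose Ready condition is not "True"."""
    pending: List[str] = []
    for node in node_list.get("items", []):
        conditions = node.get("status", {}).get("conditions", [])
        ready = any(
            c.get("type") == "Ready" and c.get("status") == "True"
            for c in conditions
        )
        if not ready:
            pending.append(node["metadata"]["name"])
    return pending
```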
5ddd00f711 fix: add join preflight ignores for homelab control planes
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 16s
Append --ignore-preflight-errors=NumCPU,HTTPProxyCIDR to control-plane join commands and HTTPProxyCIDR to worker joins so kubeadm join does not fail on known single-CPU/proxy CIDR checks in this environment.
2026-03-04 21:09:27 +00:00
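The preflight ignores from 5ddd00f711 differ by node role. A minimal sketch of appending them to an existing join command is shown below. The helper name is hypothetical, and the ignore lists are taken from the commit message.

```python
def with_preflight_ignores(join_command: str, control_plane: bool) -> str:
    """Append the role-specific --ignore-preflight-errors flag to a join command."""
    # Control planes skip the CPU count and proxy CIDR checks.
    # Workers skip only the proxy CIDR check.
    ignored = ["NumCPU", "HTTPProxyCIDR"] if control_plane else ["HTTPProxyCIDR"]
    return f"{join_command} --ignore-preflight-errors={','.join(ignored)}"
```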
422b7d7f23 fix: force fresh kubeadm init after rebuild and make kubelet enable-able
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 17s
Always re-run primary init when reconcile performs node rebuilds to avoid stale/partial cluster state causing join preflight failures. Also add wantedBy for kubelet so systemctl enable works as expected during join/init flows.
2026-03-04 00:55:20 +00:00
3ebeb121b4 fix: force fresh bootstrap stages after rebuild and stabilize join node identity
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 17s
Clear completed bootstrap stage checkpoints whenever nodes are rebuilt so reconcile does not skip required init/cni/join work on fresh hosts. Also pass explicit --node-name for control-plane and worker joins, and ensure kubelet is enabled before join commands run.
2026-03-04 00:26:37 +00:00
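Clearing checkpoints after a rebuild, as in 3ebeb121b4, might look like the sketch below. The stage identifiers and the choice to also clear the verify stage are assumptions. The commit names only init, cni, and join work.

```python
from typing import FrozenSet, Set

# Assumption: stages that must re-run on freshly rebuilt hosts.
POST_REBUILD_STAGES: FrozenSet[str] = frozenset({"init", "cni", "join", "verify"})


def clear_stale_checkpoints(completed: Set[str], nodes_rebuilt: bool) -> Set[str]:
    """Drop post-rebuild checkpoints when nodes were rebuilt; otherwise keep all."""
    if not nodes_rebuilt:
        return set(completed)
    return {stage for stage in completed if stage not in POST_REBUILD_STAGES}
```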
a66ae788f6 fix: run Cilium install with sudo and explicit kubeconfig
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 17s
Use sudo for helm/kubectl on cp-1 and pass /etc/kubernetes/admin.conf so controller can install Cilium without permission errors.
2026-03-03 08:55:22 +00:00
cbb8358ce6 fix: ensure kubelet is enabled for kubeadm init node registration
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 17s
Enable kubelet before kubeadm init and stop forcing kubelet out of wantedBy so kubeadm can reliably register the node during upload-config/kubelet. Also clear stale kubelet config files during remote prep to avoid restart-loop leftovers.
2026-03-03 01:04:50 +00:00
a16112a87a fix: rebuild nodes by default on reconcile
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 17s
Do not skip node rebuilds unless SKIP_REBUILD=1 is explicitly set. This prevents stale remote helper scripts from being reused across retries after bootstrap logic changes.
2026-03-03 00:34:55 +00:00
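The rebuild default from a16112a87a reduces to a single check. In this sketch only the exact value `1` opts out, which is one reading of "explicitly set". The function name is hypothetical.

```python
import os
from typing import Mapping


def should_rebuild(env: Mapping[str, str] = os.environ) -> bool:
    """Rebuild nodes unless SKIP_REBUILD is exactly "1"."""
    return env.get("SKIP_REBUILD") != "1"
```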
6fecfb3ee6 refactor: add Python bootstrap controller with resumable state
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 17s
Introduce a clean orchestration layer in nixos/kubeadm/bootstrap/controller.py and slim rebuild-and-bootstrap.sh into a thin wrapper. The controller now owns preflight, rebuild, init, CNI install, join, and verify stages with persisted checkpoints on cp-1 plus a local state copy for CI debugging.
2026-03-03 00:09:10 +00:00
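The resumable controller from 6fecfb3ee6 can be pictured as a loop over ordered stages that records each completion. This is a simplified sketch, not the contents of controller.py. The stage identifiers, the JSON layout, and the single local state file are assumptions, and the real controller is described as persisting checkpoints on cp-1 as well.

```python
import json
from pathlib import Path
from typing import Callable, Dict, List

# Stage order follows the commit message; the identifiers are illustrative.
STAGES = ["preflight", "rebuild", "init", "cni", "join", "verify"]


def run_stages(actions: Dict[str, Callable[[], None]], state_path: Path) -> List[str]:
    """Run stages in order, skipping those already checkpointed in state_path."""
    if state_path.exists():
        state = json.loads(state_path.read_text())
    else:
        state = {"completed": []}

    executed: List[str] = []
    for stage in STAGES:
        if stage in state["completed"]:
            continue
        actions[stage]()  # an exception here leaves earlier checkpoints intact
        state["completed"].append(stage)
        # Persist after every stage so a later run resumes at the failed stage.
        state_path.write_text(json.dumps(state))
        executed.append(stage)
    return executed
```

Writing the state file after every stage means a failed run can be retried without repeating the stages that already succeeded.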