Commit Graph

333 Commits

Author SHA1 Message Date
0778de9719 fix: hard-reset nodes before kubeadm join retries
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 17s
Before control-plane and worker joins, remove stale kubelet/kubernetes identity files and run kubeadm reset -f. This prevents preflight failures like FileAvailable--etc-kubernetes-kubelet.conf during repeated reconcile attempts.
2026-03-04 23:38:15 +00:00
fc4eb1bc6e fix: add heuristic SSH inventory fallback for generic hostnames
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 16s
When Proxmox guest-agent IPs are empty and SSH discovery returns duplicate generic hostnames (e.g. flex), assign remaining missing nodes from unmatched SSH-reachable IPs in deterministic order. Also emit SSH-reachable IP diagnostics on failure.
2026-03-04 23:07:45 +00:00
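The fallback described in fc4eb1bc6e can be illustrated with a minimal sketch. The function name, argument shapes, and addresses below are hypothetical and not taken from the repository; the sketch only shows the idea of handing leftover SSH-reachable IPs to still-unresolved nodes in a stable order and reporting the reachable IPs on failure.

```python
from ipaddress import ip_address


def assign_fallback_ips(expected_nodes, resolved, reachable_ips):
    """Fill in nodes without an IP from unmatched SSH-reachable IPs.

    expected_nodes: node names in their configured order (hypothetical input)
    resolved:       {node: ip} for nodes already matched by hostname
    reachable_ips:  every IP that answered over SSH
    """
    used = set(resolved.values())
    # Sort numerically so repeated runs produce the same assignment.
    spare = sorted((ip for ip in set(reachable_ips) if ip not in used), key=ip_address)
    missing = [node for node in expected_nodes if node not in resolved]

    result = dict(resolved)
    for node, ip in zip(missing, spare):
        result[node] = ip

    still_missing = [node for node in missing if node not in result]
    if still_missing:
        # Surface the reachable IPs so a failed run can be diagnosed.
        seen = sorted(set(reachable_ips), key=ip_address)
        raise RuntimeError(f"unresolved nodes {still_missing}; SSH-reachable IPs: {seen}")
    return result
```

Sorting by parsed address rather than by string keeps the order numeric, so an address ending in .9 sorts before one ending in .10.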
a70de061b0 fix: wait for Cilium and node readiness before marking bootstrap success
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 18s
Update verification stage to block on the Cilium DaemonSet rollout and on all nodes reaching Ready. This prevents workflows from reporting success while the cluster is still NotReady immediately after join.
2026-03-04 22:26:43 +00:00
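As a rough illustration of the readiness gate in a70de061b0, the sketch below inspects a node list shaped like the output of `kubectl get nodes -o json` and returns the nodes that are not yet Ready. The helper name is hypothetical, and how the controller actually fetches and polls this data is not shown in the log.

```python
def not_ready_nodes(node_list):
    """Return names of nodes whose Ready condition is not "True".

    node_list is assumed to be the parsed JSON of `kubectl get nodes -o json`.
    """
    pending = []
    for node in node_list.get("items", []):
        conditions = node.get("status", {}).get("conditions", [])
        ready = any(
            c.get("type") == "Ready" and c.get("status") == "True"
            for c in conditions
        )
        if not ready:
            pending.append(node["metadata"]["name"])
    return pending
```

A verification stage could poll until this list is empty, or fail with the remaining names once a timeout expires.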
5ddd00f711 fix: add join preflight ignores for homelab control planes
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 16s
Append --ignore-preflight-errors=NumCPU,HTTPProxyCIDR to control-plane join commands and HTTPProxyCIDR to worker joins so kubeadm join does not fail on known single-CPU/proxy CIDR checks in this environment.
2026-03-04 21:09:27 +00:00
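The flag handling in 5ddd00f711 could look roughly like the following. The helper and the placeholder base command are hypothetical; only the `--ignore-preflight-errors` values come from the commit message.

```python
def with_preflight_ignores(base_command, control_plane):
    """Append the preflight ignores named in the commit to a kubeadm join command.

    base_command is a list of argv tokens; it is not modified.
    """
    ignored = ["NumCPU", "HTTPProxyCIDR"] if control_plane else ["HTTPProxyCIDR"]
    return [*base_command, "--ignore-preflight-errors=" + ",".join(ignored)]
```

Building the command as a list of arguments avoids shell quoting problems when it is later passed to an SSH or subprocess call.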
034869347a fix: require kubelet kubeconfig before starting service
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 17s
Inline kubelet bootstrap/kubeconfig flags in ExecStart and gate startup on /etc/kubernetes/*kubelet.conf in addition to config.yaml. This prevents kubelet from entering standalone mode with webhook auth enabled when no client config is present.
2026-03-04 20:45:47 +00:00
f0093deedc fix: avoid assigning control-plane VIP as node SSH address
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 15s
Exclude the configured VIP suffix from subnet scans and prefer non-VIP IPs when multiple SSH endpoints resolve to the same node. This prevents cp-1 from being discovered as .250 and later failing SSH commands against the floating VIP.
2026-03-04 19:26:37 +00:00
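A minimal sketch of the two rules in f0093deedc, under stated assumptions: the commit works with a VIP suffix, while this example takes a full VIP address for simplicity, and the function names and documentation-range addresses are hypothetical.

```python
from ipaddress import ip_address, ip_network


def scan_candidates(subnet, vip):
    """Hosts to probe over SSH, with the floating VIP left out."""
    vip_addr = ip_address(vip)
    return [str(host) for host in ip_network(subnet).hosts() if host != vip_addr]


def pick_node_ip(endpoints, vip):
    """Prefer a non-VIP address when several endpoints resolve to one node.

    Returns None when the VIP is the only endpoint seen for the node.
    """
    non_vip = sorted((ip for ip in endpoints if ip != vip), key=ip_address)
    return non_vip[0] if non_vip else None
```

Returning None instead of the VIP forces the caller to treat the node as unresolved, which is safer than recording an address that can move to another control plane.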
6b6ca021c9 fix: add kubelet bootstrap kubeconfig args to systemd unit
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 17s
Include KUBELET_KUBECONFIG_ARGS in kubelet ExecStart so kubelet can authenticate with bootstrap-kubelet.conf/kubelet.conf and register node objects during kubeadm init.
2026-03-04 19:26:07 +00:00
90ef0ec33f Merge branch 'master' into stage
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 17s
2026-03-04 18:42:22 +00:00
ba6cf42c04 fix: restart kubelet during CRISocket recovery and add registration diagnostics
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 16s
When kubeadm init fails at upload-config/kubelet due to a missing node object, explicitly restart kubelet to ensure bootstrap flags are loaded before waiting for node registration. Add kubelet flag dump and focused registration log output to surface auth/cert errors.
2026-03-04 18:37:50 +00:00
3cd0c70727 fix: stop overriding kubelet config in kubeadm init
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 17s
Remove custom KubeletConfiguration from init config so kubeadm uses default kubelet authn/authz settings and bootstrap registration path. This avoids the standalone-style kubelet behavior where the node never appears in the API.
2026-03-04 18:35:34 +00:00
3281ebd216 Merge pull request 'fix: recover from kubeadm CRISocket node-registration race' (#111) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 18m6s
Reviewed-on: #111
2026-03-04 03:03:17 +00:00
d2dd6105a6 fix: recover from kubeadm CRISocket node-registration race
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 17s
Handle kubeadm init failures where upload-config/kubelet runs before the node object exists. When that specific error occurs, wait for cp-1 registration and run upload-config kubelet phase explicitly instead of aborting immediately.
2026-03-04 03:00:34 +00:00
981afc509a Merge pull request 'fix: use kubeadm v1beta4 list format for kubeletExtraArgs' (#110) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 19m48s
Reviewed-on: #110
2026-03-04 02:32:22 +00:00
b3c975bd73 fix: use kubeadm v1beta4 list format for kubeletExtraArgs
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 17s
kubeadm v1beta4 expects nodeRegistration.kubeletExtraArgs as a list of name/value args, not a map. Switch hostname-override to the correct structure so init config unmarshals successfully.
2026-03-04 02:00:07 +00:00
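The structural change in b3c975bd73 is small but easy to get wrong, so here is a sketch of the conversion from the older map form to the v1beta4 list of name/value entries. The helper name is hypothetical, and the surrounding `node_registration` dict is only an illustration of where the list would sit in an init configuration.

```python
def to_v1beta4_extra_args(args):
    """Convert a {flag: value} map into a list of {"name", "value"} entries."""
    # Sorted for stable output; values are emitted as strings.
    return [{"name": name, "value": str(value)} for name, value in sorted(args.items())]


# Illustrative nodeRegistration fragment using the list form.
node_registration = {
    "name": "cp-1",
    "kubeletExtraArgs": to_v1beta4_extra_args({"hostname-override": "cp-1"}),
}
```

Serialized to YAML, `kubeletExtraArgs` then becomes a sequence whose single item has `name: hostname-override` and `value: cp-1`, rather than a mapping keyed by flag name.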
8aab666fad Merge pull request 'fix: hard reset kubelet identity before kubeadm init' (#109) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 12m25s
Reviewed-on: #109
2026-03-04 01:42:55 +00:00
308a2fd4b7 fix: hard reset kubelet identity before kubeadm init
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 17s
Clear kubelet cert/bootstrap artifacts after reset and force hostname override in kubeadm nodeRegistration so the node consistently registers as cp-1 instead of inheriting stale template identity.
2026-03-04 01:35:41 +00:00
3fd7ed48b1 Merge pull request 'fix: pin kubeadm init node identity to flake hostname' (#108) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 15m22s
Reviewed-on: #108
2026-03-04 01:18:51 +00:00
0cc0de2aea fix: pin kubeadm init node identity to flake hostname
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 17s
Set hostname before init and inject nodeRegistration.name into kubeadm InitConfiguration so the first control plane registers as cp-1 instead of inheriting the template hostname. This fixes upload-config/kubelet failures caused by node lookup for k8s-base-template.
2026-03-04 01:17:44 +00:00
99458ca829 Merge pull request 'fix: force fresh kubeadm init after rebuild and make kubelet enable-able' (#107) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 17m1s
Reviewed-on: #107
2026-03-04 00:56:30 +00:00
422b7d7f23 fix: force fresh kubeadm init after rebuild and make kubelet enable-able
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 17s
Always re-run primary init when reconcile performs node rebuilds to avoid stale/partial cluster state causing join preflight failures. Also add wantedBy for kubelet so systemctl enable works as expected during join/init flows.
2026-03-04 00:55:20 +00:00
adc8a620f4 Merge pull request 'fix: force fresh bootstrap stages after rebuild and stabilize join node identity' (#106) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 20m28s
Reviewed-on: #106
2026-03-04 00:32:06 +00:00
3ebeb121b4 fix: force fresh bootstrap stages after rebuild and stabilize join node identity
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 17s
Clear completed bootstrap stage checkpoints whenever nodes are rebuilt so reconcile does not skip required init/cni/join work on fresh hosts. Also pass explicit --node-name for control-plane and worker joins, and ensure kubelet is enabled before join commands run.
2026-03-04 00:26:37 +00:00
f11aadf79c Merge pull request 'fix: map SSH-discovered nodes by VMID when hostnames are generic' (#105) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 27m43s
Reviewed-on: #105
2026-03-03 23:37:45 +00:00
b4265a649e fix: map SSH-discovered nodes by VMID when hostnames are generic
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 16s
Some freshly cloned VMs still report template/generic hostnames during discovery. Probe DMI product serial over SSH and map it to Terraform VMIDs so cp-2/cp-3/wk-2 can be resolved even before hostname reconciliation.
2026-03-03 22:16:35 +00:00
09d2f56967 Merge pull request 'fix: make SSH inventory discovery more reliable on CI' (#104) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 8m46s
Reviewed-on: #104
2026-03-03 21:45:57 +00:00
9ae8eb6134 fix: make SSH inventory discovery more reliable on CI
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 16s
Increase default SSH timeout, reduce scan concurrency, and add a second slower scan pass to avoid transient misses on busy runners. Also print discovered hostnames to improve failure diagnostics when node-name matching fails.
2026-03-03 21:08:29 +00:00
f2b9da8a59 Merge pull request 'fix: run Cilium install with sudo and explicit kubeconfig' (#103) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 3m22s
Reviewed-on: #103
2026-03-03 08:56:49 +00:00
a66ae788f6 fix: run Cilium install with sudo and explicit kubeconfig
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 17s
Use sudo for helm/kubectl on cp-1 and pass /etc/kubernetes/admin.conf so the controller can install Cilium without permission errors.
2026-03-03 08:55:22 +00:00
5fa96e27d7 Merge pull request 'fix: ensure kubelet is enabled for kubeadm init node registration' (#102) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 10m43s
Reviewed-on: #102
2026-03-03 01:13:47 +00:00
cbb8358ce6 fix: ensure kubelet is enabled for kubeadm init node registration
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 17s
Enable kubelet before kubeadm init and stop forcing kubelet out of wantedBy so kubeadm can reliably register the node during upload-config/kubelet. Also clear stale kubelet config files during remote prep to avoid restart-loop leftovers.
2026-03-03 01:04:50 +00:00
31017b5c3e Merge pull request 'fix: rebuild nodes by default on reconcile' (#101) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 13m53s
Reviewed-on: #101
2026-03-03 00:46:26 +00:00
a16112a87a fix: rebuild nodes by default on reconcile
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 17s
Do not skip node rebuilds unless SKIP_REBUILD=1 is explicitly set. This prevents stale remote helper scripts from being reused across retries after bootstrap logic changes.
2026-03-03 00:34:55 +00:00
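The default in a16112a87a amounts to an opt-out check on one environment variable. A minimal sketch, assuming the check lives in Python (the log does not say whether it is in the shell wrapper or the controller); the function name is hypothetical.

```python
import os


def should_rebuild(env=None):
    """Rebuild unless SKIP_REBUILD is explicitly set to "1"."""
    env = os.environ if env is None else env
    return env.get("SKIP_REBUILD") != "1"
```

Comparing against the exact string "1" means an empty or unrelated value such as "0" or "false" still triggers a rebuild, which matches the "explicitly set" wording of the commit.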
f53d087c9c Merge pull request 'fix: use valid kube-vip log flag value' (#100) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 6m29s
Reviewed-on: #100
2026-03-03 00:26:08 +00:00
51b56e562e fix: use valid kube-vip log flag value
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 17s
kube-vip expects an unsigned integer for --log. Replace --log -4 with --log 4 so manifest generation no longer fails during bootstrap.
2026-03-03 00:25:25 +00:00
0e0643a6fc Merge pull request 'refactor: add Python bootstrap controller with resumable state' (#99) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 11m46s
Reviewed-on: #99
2026-03-03 00:10:19 +00:00
6fecfb3ee6 refactor: add Python bootstrap controller with resumable state
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 17s
Introduce a clean orchestration layer in nixos/kubeadm/bootstrap/controller.py and slim rebuild-and-bootstrap.sh into a thin wrapper. The controller now owns preflight, rebuild, init, CNI install, join, and verify stages with persisted checkpoints on cp-1 plus a local state copy for CI debugging.
2026-03-03 00:09:10 +00:00
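The resumable-stage idea from 6fecfb3ee6 can be sketched as follows. This is not the contents of controller.py: the class, the JSON file format, and the stage identifiers are assumptions, and the sketch keeps checkpoints in a single local file rather than on cp-1 with a local copy.

```python
import json
from pathlib import Path

# Stage order follows the commit message; the identifiers are illustrative.
STAGES = ["preflight", "rebuild", "init", "cni", "join", "verify"]


class Checkpoints:
    """Persist the set of completed stages as a JSON list."""

    def __init__(self, path):
        self.path = Path(path)
        self.done = set(json.loads(self.path.read_text())) if self.path.exists() else set()

    def mark(self, stage):
        self.done.add(stage)
        self.path.write_text(json.dumps(sorted(self.done)))


def run_stages(handlers, checkpoints):
    """Run each stage once, skipping those already checkpointed."""
    ran = []
    for stage in STAGES:
        if stage in checkpoints.done:
            continue
        handlers[stage]()  # an exception here leaves the stage unmarked
        checkpoints.mark(stage)
        ran.append(stage)
    return ran
```

Because a stage is marked only after its handler returns, a failed run resumes at the failing stage on the next invocation instead of starting over.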
7a0016b003 Merge pull request 'fix: preserve kube-vip mount path and only swap hostPath to super-admin' (#98) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Has been cancelled
Reviewed-on: #98
2026-03-03 00:00:48 +00:00
355273add5 fix: preserve kube-vip mount path and only swap hostPath to super-admin
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 19s
The previous replacement changed both mountPath and hostPath, causing kube-vip to lose its expected in-container kubeconfig path and exit. Keep mountPath at /etc/kubernetes/admin.conf, swap only hostPath during bootstrap, and enable kube-vip debug log level.
2026-03-02 23:59:41 +00:00
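To make the distinction in 355273add5 concrete, the sketch below patches a pod manifest that has already been parsed into a dict. It rewrites only the volume's hostPath and leaves every container mountPath alone. The function name is hypothetical, and the super-admin.conf location is assumed to be alongside admin.conf in /etc/kubernetes.

```python
import copy

ADMIN_CONF = "/etc/kubernetes/admin.conf"
SUPER_ADMIN_CONF = "/etc/kubernetes/super-admin.conf"  # assumed location


def use_super_admin_host_path(pod):
    """Return a copy of the pod with hostPath volumes repointed to super-admin.conf.

    Container volumeMounts are untouched, so the process inside the container
    still reads its kubeconfig from the original mountPath.
    """
    patched = copy.deepcopy(pod)
    for volume in patched["spec"].get("volumes", []):
        host_path = volume.get("hostPath")
        if host_path and host_path.get("path") == ADMIN_CONF:
            host_path["path"] = SUPER_ADMIN_CONF
    return patched
```

Working on the parsed structure avoids the failure mode the commit describes, where a plain text replacement of the path changes both fields at once.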
e5162c220c Merge pull request 'fix: bootstrap kube-vip without leader election' (#97) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 17m12s
Reviewed-on: #97
2026-03-02 23:31:52 +00:00
262e9eb4d7 fix: bootstrap kube-vip without leader election
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 17s
Run first-control-plane kube-vip manifest without --leaderElection so VIP can bind before API/RBAC are fully available. Also print kube-vip container exit details on failure.
2026-03-02 23:28:44 +00:00
84513f4bb8 Merge pull request 'fix: run kube-vip in control-plane-only mode during bootstrap' (#96) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 16m50s
Reviewed-on: #96
2026-03-02 22:53:22 +00:00
c445638d4a fix: run kube-vip in control-plane-only mode during bootstrap
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 17s
Remove --services from kube-vip static pod manifests for init/join. Service LB mode can crash-loop during kubeadm bootstrap before cluster RBAC is ready, which prevented VIP binding.
2026-03-02 22:52:44 +00:00
678b383063 Merge pull request 'stage' (#95) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 17m14s
Reviewed-on: #95
2026-03-02 22:33:27 +00:00
880bbcceca ci: speed up Terraform plan by skipping refresh in pipelines
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 16s
Use terraform plan -refresh=false in plan/apply workflows to avoid slow Proxmox state refresh on every push. This keeps CI fast while preserving apply behavior from the generated plan.
2026-03-02 22:32:10 +00:00
190dc2e095 fix: restore compatibility with older nixos-rebuild sudo flag
Some checks failed
Terraform Plan / Terraform Plan (push) Has been cancelled
Use --use-remote-sudo in rebuild script since the runner's nixos-rebuild does not support --sudo yet.
2026-03-02 22:30:38 +00:00
d86b0a32a2 Merge pull request 'fix: stabilize kubeadm bootstrap and reduce Proxmox plan latency' (#94) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 16m3s
Reviewed-on: #94
2026-03-02 22:13:28 +00:00
a81799a2b5 fix: stabilize kubeadm bootstrap and reduce Proxmox plan latency
Some checks failed
Terraform Plan / Terraform Plan (push) Has been cancelled
Move kubeadm reset ahead of kube-vip manifest generation, use super-admin.conf during bootstrap for kube-vip, and restore admin.conf after init. Also switch nixos-rebuild to --sudo and make QEMU guest agent optional so Terraform plan can skip slow guest-agent refreshes when it is not installed.
2026-03-02 22:09:10 +00:00
6c7182b8f5 Merge pull request 'fix: run kube-vip daemon before kubeadm init' (#93) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 24m52s
Reviewed-on: #93
2026-03-02 21:02:11 +00:00
46c0786e57 fix: run kube-vip daemon before kubeadm init
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 10m8s
- Start kube-vip as a detached container to claim VIP before kubeadm init
- Wait for VIP to be bound before proceeding
- Generate static pod manifest for kube-vip
- Stop bootstrap kube-vip after API server is healthy (static pod takes over)
- Add kube-vip logs output if VIP fails to bind
2026-03-02 20:39:28 +00:00
8b15f061bc Merge pull request 'fix: skip kubeadm wait-control-plane phase, wait for VIP manually' (#92) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 23m51s
Reviewed-on: #92
2026-03-02 19:42:56 +00:00