Commit Graph

319 Commits

Author SHA1 Message Date
8aab666fad Merge pull request 'fix: hard reset kubelet identity before kubeadm init' (#109) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 12m25s
Reviewed-on: #109
2026-03-04 01:42:55 +00:00
308a2fd4b7 fix: hard reset kubelet identity before kubeadm init
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 17s
Clear kubelet cert/bootstrap artifacts after reset and force hostname override in kubeadm nodeRegistration so the node consistently registers as cp-1 instead of inheriting stale template identity.
2026-03-04 01:35:41 +00:00
3fd7ed48b1 Merge pull request 'fix: pin kubeadm init node identity to flake hostname' (#108) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 15m22s
Reviewed-on: #108
2026-03-04 01:18:51 +00:00
0cc0de2aea fix: pin kubeadm init node identity to flake hostname
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 17s
Set the hostname before init and inject nodeRegistration.name into the kubeadm InitConfiguration so the node registers as cp-1 instead of inheriting the template hostname. This fixes upload-config/kubelet failures caused by node lookups for k8s-base-template.
2026-03-04 01:17:44 +00:00
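As a rough illustration of the nodeRegistration.name injection described in the commit above, the sketch below renders a minimal InitConfiguration document. The function name and the default apiVersion are assumptions made for this example; the repository's actual config template is not shown in this log.

```python
def render_init_configuration(
    node_name: str,
    api_version: str = "kubeadm.k8s.io/v1beta4",  # assumption: match your kubeadm version
) -> str:
    """Return a minimal kubeadm InitConfiguration that pins the node name."""
    # Reject empty or whitespace-containing names so the YAML stays valid.
    if not node_name or any(ch.isspace() for ch in node_name):
        raise ValueError("node_name must be a non-empty name without whitespace")
    return (
        f"apiVersion: {api_version}\n"
        "kind: InitConfiguration\n"
        "nodeRegistration:\n"
        f"  name: {node_name}\n"
    )


if __name__ == "__main__":
    print(render_init_configuration("cp-1"))
```

Pinning the name in the config file, rather than relying on the current hostname, keeps the registered node name independent of whatever hostname the cloned template still reports.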
99458ca829 Merge pull request 'fix: force fresh kubeadm init after rebuild and make kubelet enable-able' (#107) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 17m1s
Reviewed-on: #107
2026-03-04 00:56:30 +00:00
422b7d7f23 fix: force fresh kubeadm init after rebuild and make kubelet enable-able
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 17s
Always re-run primary init when reconcile performs node rebuilds to avoid stale/partial cluster state causing join preflight failures. Also add wantedBy for kubelet so systemctl enable works as expected during join/init flows.
2026-03-04 00:55:20 +00:00
adc8a620f4 Merge pull request 'fix: force fresh bootstrap stages after rebuild and stabilize join node identity' (#106) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 20m28s
Reviewed-on: #106
2026-03-04 00:32:06 +00:00
3ebeb121b4 fix: force fresh bootstrap stages after rebuild and stabilize join node identity
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 17s
Clear completed bootstrap stage checkpoints whenever nodes are rebuilt so reconcile does not skip required init/cni/join work on fresh hosts. Also pass explicit --node-name for control-plane and worker joins, and ensure kubelet is enabled before join commands run.
2026-03-04 00:26:37 +00:00
f11aadf79c Merge pull request 'fix: map SSH-discovered nodes by VMID when hostnames are generic' (#105) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 27m43s
Reviewed-on: #105
2026-03-03 23:37:45 +00:00
b4265a649e fix: map SSH-discovered nodes by VMID when hostnames are generic
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 16s
Some freshly cloned VMs still report template/generic hostnames during discovery. Probe DMI product serial over SSH and map it to Terraform VMIDs so cp-2/cp-3/wk-2 can be resolved even before hostname reconciliation.
2026-03-03 22:16:35 +00:00
09d2f56967 Merge pull request 'fix: make SSH inventory discovery more reliable on CI' (#104) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 8m46s
Reviewed-on: #104
2026-03-03 21:45:57 +00:00
9ae8eb6134 fix: make SSH inventory discovery more reliable on CI
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 16s
Increase default SSH timeout, reduce scan concurrency, and add a second slower scan pass to avoid transient misses on busy runners. Also print discovered hostnames to improve failure diagnostics when node-name matching fails.
2026-03-03 21:08:29 +00:00
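The two-pass scan described in the commit above could look roughly like the following sketch. The probe callable, the timeouts, and the worker counts are all hypothetical; the real discovery code and its defaults are not part of this log.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, List, Optional, Tuple

# probe(host, timeout) returns the hostname reported over SSH, or raises.
Probe = Callable[[str, float], str]


def scan(hosts: List[str], probe: Probe, timeout: float, workers: int) -> Dict[str, str]:
    """Probe hosts concurrently and return {host: hostname} for the ones that answered."""

    def one(host: str) -> Tuple[str, Optional[str]]:
        try:
            return host, probe(host, timeout)
        except Exception:
            # A miss is not fatal here; the slower pass may still succeed.
            return host, None

    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(one, hosts))
    return {host: name for host, name in results if name}


def discover(
    hosts: Iterable[str],
    probe: Probe,
    fast: Tuple[float, int] = (5.0, 16),   # hypothetical (timeout, workers)
    slow: Tuple[float, int] = (15.0, 4),   # hypothetical slower retry pass
) -> Dict[str, str]:
    """Run a fast pass, then retry only the missing hosts with a longer timeout."""
    host_list = list(hosts)
    found = scan(host_list, probe, *fast)
    missing = [h for h in host_list if h not in found]
    if missing:
        found.update(scan(missing, probe, *slow))
    return found
```

Printing the returned mapping when node-name matching fails gives the kind of diagnostic output the commit message mentions.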
f2b9da8a59 Merge pull request 'fix: run Cilium install with sudo and explicit kubeconfig' (#103) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 3m22s
Reviewed-on: #103
2026-03-03 08:56:49 +00:00
a66ae788f6 fix: run Cilium install with sudo and explicit kubeconfig
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 17s
Use sudo for helm/kubectl on cp-1 and pass /etc/kubernetes/admin.conf so the controller can install Cilium without permission errors.
2026-03-03 08:55:22 +00:00
5fa96e27d7 Merge pull request 'fix: ensure kubelet is enabled for kubeadm init node registration' (#102) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 10m43s
Reviewed-on: #102
2026-03-03 01:13:47 +00:00
cbb8358ce6 fix: ensure kubelet is enabled for kubeadm init node registration
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 17s
Enable kubelet before kubeadm init and stop forcing kubelet out of wantedBy so kubeadm can reliably register the node during upload-config/kubelet. Also clear stale kubelet config files during remote prep to avoid restart-loop leftovers.
2026-03-03 01:04:50 +00:00
31017b5c3e Merge pull request 'fix: rebuild nodes by default on reconcile' (#101) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 13m53s
Reviewed-on: #101
2026-03-03 00:46:26 +00:00
a16112a87a fix: rebuild nodes by default on reconcile
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 17s
Do not skip node rebuilds unless SKIP_REBUILD=1 is explicitly set. This prevents stale remote helper scripts from being reused across retries after bootstrap logic changes.
2026-03-03 00:34:55 +00:00
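The opt-out behaviour described in the commit above amounts to a strict environment check. A minimal sketch follows; the function name is hypothetical, and only the SKIP_REBUILD=1 convention comes from the commit message.

```python
import os
from typing import Mapping, Optional


def should_rebuild(env: Optional[Mapping[str, str]] = None) -> bool:
    """Rebuild unless SKIP_REBUILD is set to exactly "1"."""
    env = os.environ if env is None else env
    # Any other value (unset, "0", "true", ...) keeps the safe default of rebuilding.
    return env.get("SKIP_REBUILD") != "1"
```

Treating only the exact value "1" as an opt-out means a typo or an empty variable falls back to rebuilding, which is the safer default after bootstrap logic changes.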
f53d087c9c Merge pull request 'fix: use valid kube-vip log flag value' (#100) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 6m29s
Reviewed-on: #100
2026-03-03 00:26:08 +00:00
51b56e562e fix: use valid kube-vip log flag value
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 17s
kube-vip expects an unsigned integer for --log. Replace --log -4 with --log 4 so manifest generation no longer fails during bootstrap.
2026-03-03 00:25:25 +00:00
0e0643a6fc Merge pull request 'refactor: add Python bootstrap controller with resumable state' (#99) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 11m46s
Reviewed-on: #99
2026-03-03 00:10:19 +00:00
6fecfb3ee6 refactor: add Python bootstrap controller with resumable state
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 17s
Introduce a clean orchestration layer in nixos/kubeadm/bootstrap/controller.py and slim rebuild-and-bootstrap.sh into a thin wrapper. The controller now owns preflight, rebuild, init, CNI install, join, and verify stages with persisted checkpoints on cp-1 plus a local state copy for CI debugging.
2026-03-03 00:09:10 +00:00
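A resumable stage runner of the kind the commit above describes might be sketched as follows. The stage names come from the commit message; the state file layout, function names, and handler signature are assumptions for illustration and are not taken from controller.py.

```python
import json
from pathlib import Path
from typing import Callable, Dict, List

# Stage order as listed in the commit message.
STAGES = ["preflight", "rebuild", "init", "cni", "join", "verify"]


def load_state(path: Path) -> Dict[str, bool]:
    """Read the checkpoint file, or return an empty state if none exists."""
    if path.exists():
        return json.loads(path.read_text())
    return {}


def save_state(path: Path, state: Dict[str, bool]) -> None:
    """Write via a temp file and rename so a crash cannot leave a partial file."""
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(state, indent=2, sort_keys=True))
    tmp.replace(path)


def run_stages(path: Path, handlers: Dict[str, Callable[[], None]]) -> List[str]:
    """Run every stage not yet checkpointed; return the stages executed this run."""
    state = load_state(path)
    executed: List[str] = []
    for name in STAGES:
        if state.get(name):
            continue  # completed in an earlier run
        handlers[name]()  # a handler signals failure by raising
        state[name] = True
        save_state(path, state)  # checkpoint after each successful stage
        executed.append(name)
    return executed
```

Because the checkpoint is saved after each stage, a retry after a failed init resumes at init rather than repeating preflight and rebuild. Deleting the state file forces a full run.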
7a0016b003 Merge pull request 'fix: preserve kube-vip mount path and only swap hostPath to super-admin' (#98) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Has been cancelled
Reviewed-on: #98
2026-03-03 00:00:48 +00:00
355273add5 fix: preserve kube-vip mount path and only swap hostPath to super-admin
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 19s
The previous replacement changed both mountPath and hostPath, causing kube-vip to lose its expected in-container kubeconfig path and exit. Keep mountPath at /etc/kubernetes/admin.conf, swap only hostPath during bootstrap, and enable kube-vip debug log level.
2026-03-02 23:59:41 +00:00
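The distinction in the commit above, rewriting hostPath while leaving mountPath alone, can be illustrated with a small text transform. This is only a sketch: the function name is hypothetical, the real script may edit the manifest differently, and /etc/kubernetes/super-admin.conf is assumed to be the kubeadm default location of the super-admin kubeconfig.

```python
import re

ADMIN_CONF = "/etc/kubernetes/admin.conf"
SUPER_ADMIN_CONF = "/etc/kubernetes/super-admin.conf"  # assumed default path

# Matches only lines whose key is exactly "path:" (the hostPath entry).
# "mountPath:" lines do not match because the key must start the line
# after indentation.
_HOST_PATH_LINE = re.compile(
    r"^([ \t]*path:[ \t]*)" + re.escape(ADMIN_CONF) + r"[ \t]*$",
    flags=re.MULTILINE,
)


def swap_host_path(manifest: str) -> str:
    """Point the hostPath at super-admin.conf; keep the in-container mountPath."""
    return _HOST_PATH_LINE.sub(lambda m: m.group(1) + SUPER_ADMIN_CONF, manifest)


if __name__ == "__main__":
    sample = (
        "    volumeMounts:\n"
        "    - mountPath: /etc/kubernetes/admin.conf\n"
        "      name: kubeconfig\n"
        "  volumes:\n"
        "  - hostPath:\n"
        "      path: /etc/kubernetes/admin.conf\n"
        "    name: kubeconfig\n"
    )
    print(swap_host_path(sample))
```

A blanket string replacement of admin.conf would also rewrite the mountPath line, which is the failure the commit message describes.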
e5162c220c Merge pull request 'fix: bootstrap kube-vip without leader election' (#97) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 17m12s
Reviewed-on: #97
2026-03-02 23:31:52 +00:00
262e9eb4d7 fix: bootstrap kube-vip without leader election
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 17s
Run first-control-plane kube-vip manifest without --leaderElection so VIP can bind before API/RBAC are fully available. Also print kube-vip container exit details on failure.
2026-03-02 23:28:44 +00:00
84513f4bb8 Merge pull request 'fix: run kube-vip in control-plane-only mode during bootstrap' (#96) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 16m50s
Reviewed-on: #96
2026-03-02 22:53:22 +00:00
c445638d4a fix: run kube-vip in control-plane-only mode during bootstrap
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 17s
Remove --services from kube-vip static pod manifests for init/join. Service LB mode can crash-loop during kubeadm bootstrap before cluster RBAC is ready, which prevented VIP binding.
2026-03-02 22:52:44 +00:00
678b383063 Merge pull request 'stage' (#95) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 17m14s
Reviewed-on: #95
2026-03-02 22:33:27 +00:00
880bbcceca ci: speed up Terraform plan by skipping refresh in pipelines
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 16s
Use terraform plan -refresh=false in plan/apply workflows to avoid slow Proxmox state refresh on every push. This keeps CI fast while preserving apply behavior from the generated plan.
2026-03-02 22:32:10 +00:00
190dc2e095 fix: restore compatibility with older nixos-rebuild sudo flag
Some checks failed
Terraform Plan / Terraform Plan (push) Has been cancelled
Use --use-remote-sudo in rebuild script since the runner's nixos-rebuild does not support --sudo yet.
2026-03-02 22:30:38 +00:00
d86b0a32a2 Merge pull request 'fix: stabilize kubeadm bootstrap and reduce Proxmox plan latency' (#94) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 16m3s
Reviewed-on: #94
2026-03-02 22:13:28 +00:00
a81799a2b5 fix: stabilize kubeadm bootstrap and reduce Proxmox plan latency
Some checks failed
Terraform Plan / Terraform Plan (push) Has been cancelled
Move kubeadm reset ahead of kube-vip manifest generation, use super-admin.conf during bootstrap for kube-vip, and restore admin.conf after init. Also switch nixos-rebuild to --sudo and make QEMU guest agent optional so Terraform plan can skip slow guest-agent refreshes when it is not installed.
2026-03-02 22:09:10 +00:00
6c7182b8f5 Merge pull request 'fix: run kube-vip daemon before kubeadm init' (#93) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 24m52s
Reviewed-on: #93
2026-03-02 21:02:11 +00:00
46c0786e57 fix: run kube-vip daemon before kubeadm init
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 10m8s
- Start kube-vip as a detached container to claim VIP before kubeadm init
- Wait for VIP to be bound before proceeding
- Generate static pod manifest for kube-vip
- Stop bootstrap kube-vip after API server is healthy (static pod takes over)
- Add kube-vip logs output if VIP fails to bind
2026-03-02 20:39:28 +00:00
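The "wait for VIP to be bound" step in the commit above is essentially a poll with a deadline. The sketch below assumes the check is done by parsing `ip -o -4 addr show` output; both helper names are hypothetical, and the real script may test the VIP another way.

```python
import time
from typing import Callable


def vip_is_listed(ip_addr_output: str, vip: str) -> bool:
    """Return True if `vip` appears as an inet address in `ip -o -4 addr show` output."""
    for line in ip_addr_output.splitlines():
        fields = line.split()
        for i, field in enumerate(fields[:-1]):
            # The address follows the "inet" keyword, written as ADDRESS/PREFIX.
            if field == "inet" and fields[i + 1].split("/")[0] == vip:
                return True
    return False


def wait_until(
    check: Callable[[], bool],
    timeout_s: float,
    interval_s: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `check` until it returns True or `timeout_s` elapses."""
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

In use, `check` would run the `ip` command (for example through `subprocess.run`) and pass its stdout to `vip_is_listed`. When `wait_until` returns False, that is the natural point to dump the kube-vip logs, as the last bullet of the commit suggests.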
8b15f061bc Merge pull request 'fix: skip kubeadm wait-control-plane phase, wait for VIP manually' (#92) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 23m51s
Reviewed-on: #92
2026-03-02 19:42:56 +00:00
1af45ca51e fix: skip kubeadm wait-control-plane phase, wait for VIP manually
Some checks failed
Terraform Plan / Terraform Plan (push) Has been cancelled
- Use --skip-phases=wait-control-plane to avoid 4-minute timeout
- Wait for kube-vip to bind VIP before checking API server health
- Add kube-vip logs and VIP status to debug output
2026-03-02 19:37:06 +00:00
c91d28a5dc Merge pull request 'fix: add image pre-pull and debug output for kubeadm init' (#91) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 26m27s
Reviewed-on: #91
2026-03-02 18:36:46 +00:00
533f5a91e0 fix: add image pre-pull and debug output for kubeadm init
Some checks failed
Terraform Plan / Terraform Plan (push) Has been cancelled
- Pre-pull k8s control plane images before init to speed up startup
- Add crictl pods and crictl ps -a output on failure for debugging
2026-03-02 18:35:41 +00:00
cfdfab3ec0 Merge pull request 'fix: disable webhook authz and clean stale kubelet configs' (#90) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 25m1s
Reviewed-on: #90
2026-03-02 18:01:33 +00:00
c061dda31d fix: disable webhook authz and clean stale kubelet configs
Some checks failed
Terraform Plan / Terraform Plan (push) Has been cancelled
- Add authorization.mode: AlwaysAllow to KubeletConfiguration
- Remove stale kubelet config.yaml before unmasking in all kubeadm scripts
- This prevents the 'no client provided, cannot use webhook authorization' error
2026-03-02 17:59:31 +00:00
cec60c003c Merge pull request 'fix: disable kubelet webhook auth in kubeadm init config' (#89) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 25m1s
Reviewed-on: #89
2026-03-02 16:50:31 +00:00
fb21fbef4f fix: disable kubelet webhook auth in kubeadm init config
Some checks failed
Terraform Plan / Terraform Plan (push) Has been cancelled
- Use explicit kubeadm config file with KubeletConfiguration
- Disable webhook authentication which was causing 'no client provided' error
- Add ConditionPathExists to kubelet systemd unit
2026-03-02 16:49:21 +00:00
6cc57f8b0e Merge pull request 'fix: kubelet directories and containerd readiness' (#88) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 24m54s
Reviewed-on: #88
2026-03-02 14:45:54 +00:00
1b76e07326 fix: kubelet directories and containerd readiness
Some checks failed
Terraform Plan / Terraform Plan (push) Has been cancelled
- Create /var/lib/kubelet and /var/lib/kubelet/pki directories via tmpfiles
- Ensure containerd is running before kubeadm init
- Add kubelet logs output on kubeadm init failure for debugging
2026-03-02 14:44:47 +00:00
9d17dd17cc Merge pull request 'fix: remove kubelet ConditionPathExists, add daemon-reload' (#87) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 25m5s
Reviewed-on: #87
2026-03-02 14:01:06 +00:00
db72dcab75 fix: remove kubelet ConditionPathExists, add daemon-reload
Some checks failed
Terraform Plan / Terraform Plan (push) Has been cancelled
- Remove ConditionPathExists from kubelet service definition as it
  prevents kubelet from starting when managed by kubeadm
- Add systemctl daemon-reload after unmasking in all kubeadm scripts
- Add reset-failed for consistent state cleanup
2026-03-02 13:58:49 +00:00
23d61a6308 Merge pull request 'fix: mask kubelet before rebuild, unmask in kubeadm helpers' (#86) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 24m58s
Reviewed-on: #86
2026-03-02 12:54:37 +00:00
d42e83358c fix: mask kubelet before rebuild, unmask in kubeadm helpers
Some checks failed
Terraform Plan / Terraform Plan (push) Has been cancelled
- Mask kubelet service entirely before nixos-rebuild to prevent systemd
  from restarting it during switch
- Unmask kubelet in th-kubeadm-init/join scripts before starting
2026-03-02 12:44:40 +00:00
198c147b79 Merge pull request 'fix: prevent kubelet auto-start during rebuild' (#85) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 18m58s
Reviewed-on: #85
2026-03-02 12:14:38 +00:00