368 Commits

Author SHA1 Message Date
5bfc135350 Merge pull request 'fix: ignore stale SSH host keys for ephemeral homelab VMs' (#130) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 19m24s
Reviewed-on: #130
2026-03-09 03:45:11 +00:00
63213a4bc3 fix: ignore stale SSH host keys for ephemeral homelab VMs
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 16s
Fresh destroy/recreate cycles change VM host keys, which was breaking bootstrap after rebuilds. Use a disposable known-hosts policy in the controller SSH options so automation does not fail on expected key rotation.
2026-03-09 03:16:18 +00:00
e4243c7667 Merge pull request 'fix: keep DHCP enabled by default on template VM' (#129) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 1h50m42s
Reviewed-on: #129
2026-03-08 22:03:17 +00:00
33bb0ffb17 fix: keep DHCP enabled by default on template VM
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 14s
The template machine can lose connectivity when rebuilt directly because it has no cloud-init network data during template maintenance. Restore DHCP as the default for the template itself while keeping cloud-init + networkd enabled so cloned VMs can still consume injected network settings.
2026-03-08 20:12:03 +00:00
7434a65590 Merge pull request 'stage' (#128) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 6m54s
Reviewed-on: #128
2026-03-08 18:06:46 +00:00
cd8e538c51 ci: switch checkout action source away from gitea.com mirror
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 16s
The gitea.com checkout action mirror is timing out during workflow startup. Use actions/checkout@v4 directly so jobs do not fail before any repository logic runs.
2026-03-08 13:36:21 +00:00
808c290c71 chore: clarify stale template cloud-init failure message
Some checks failed
Terraform Plan / Terraform Plan (push) Failing after 31s
Make SSH bootstrap failures explain the real root cause when fresh clones never accept the injected user/key: the Proxmox source template itself still needs the updated cloud-init-capable NixOS configuration.
2026-03-08 13:16:37 +00:00
15e6471e7e Merge pull request 'fix: enable cloud-init networking in NixOS template' (#127) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 7m10s
Reviewed-on: #127
2026-03-08 05:33:57 +00:00
79a4c941e5 fix: enable cloud-init networking in NixOS template
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 16s
Freshly recreated VMs were reachable but did not accept the injected SSH key, which indicates Proxmox cloud-init settings were not being applied. Enable cloud-init and cloud-init network handling in the base template so static IPs, hostname, ciuser, and SSH keys take effect on first boot.
2026-03-08 05:16:19 +00:00
e9bac70cae Merge pull request 'fix: wait for SSH readiness after VM provisioning' (#126) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 6m56s
Reviewed-on: #126
2026-03-08 05:04:43 +00:00
4c167f618a fix: wait for SSH readiness after VM provisioning
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 17s
Freshly recreated VMs can take a few minutes before cloud-init users and SSH are available. Retry SSH authentication in the bootstrap controller before failing so rebuild/bootstrap does not abort immediately on new hosts.
2026-03-08 05:00:39 +00:00
97295a7071 Merge pull request 'ci: speed up Terraform destroy plan by skipping refresh' (#125) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 7m0s
Reviewed-on: #125
2026-03-08 04:47:02 +00:00
7bc861b3e8 ci: speed up Terraform destroy plan by skipping refresh
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 16s
Use terraform plan -refresh=false for destroy workflows so manual NUKE runs do not spend minutes refreshing Proxmox VM state before building the destroy plan.
2026-03-08 04:37:52 +00:00
6ca189b32c Merge pull request 'fix: vendor Flannel manifest and harden CNI bootstrap timing' (#124) from stage into master
All checks were successful
Terraform Apply / Terraform Apply (push) Successful in 15m11s
Reviewed-on: #124
2026-03-08 04:10:47 +00:00
b7b364a112 fix: vendor Flannel manifest and harden CNI bootstrap timing
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 17s
Stop depending on GitHub during cluster bring-up by shipping the Flannel manifest in-repo, ensure required host paths exist on NixOS nodes, and wait/retry against a stable API before applying the CNI. This removes the TLS handshake timeout failure mode and makes early network bootstrap deterministic.
2026-03-08 03:24:16 +00:00
2aa9950f59 Merge pull request 'fix: add mount utility to kubelet service PATH' (#123) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 11m10s
Reviewed-on: #123
2026-03-08 02:16:23 +00:00
bd866f7dac fix: add mount utility to kubelet service PATH
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 16s
Flannel pods were stuck because kubelet could not execute mount for projected service account volumes on NixOS. Add util-linux to the kubelet systemd PATH so mount is available during volume setup.
2026-03-07 14:18:20 +00:00
c1f86483ad Merge pull request 'debug: print detailed Flannel pod diagnostics on rollout timeout' (#122) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 23m50s
Reviewed-on: #122
2026-03-07 12:31:43 +00:00
0cce4bcf72 Merge branch 'master' into stage
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 16s
2026-03-07 12:22:01 +00:00
065567210e debug: print detailed Flannel pod diagnostics on rollout timeout
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 18s
When kube-flannel daemonset rollout stalls, print pod descriptions and per-container logs for the init containers and main flannel container so the next failure shows the actual cause instead of only Init:0/2.
2026-03-07 12:19:21 +00:00
c5f0b1ac37 Merge pull request 'stage' (#121) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 30m28s
Reviewed-on: #121
2026-03-07 01:01:38 +00:00
e740d47011 Merge branch 'master' into stage
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 16s
2026-03-07 00:57:47 +00:00
d9d3976c4c fix: use self-contained Terraform variable validations
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 17s
Terraform variable validation blocks can only reference the variable under validation. Replace count-based checks with fixed-length validations for the current 3 control planes and 3 workers.
2026-03-07 00:54:51 +00:00
a0b07816b9 refactor: simplify homelab bootstrap around static IPs and fresh runs
Some checks failed
Terraform Plan / Terraform Plan (push) Failing after 10s
Make Terraform the source of truth for node IPs, remove guest-agent/SSH discovery from the normal workflow path, simplify the bootstrap controller to a fresh-run flow, and swap the initial CNI to Flannel so cluster readiness is easier to prove before reintroducing more complex reconcile behavior.
2026-03-07 00:52:35 +00:00
d964ff8b50 Merge pull request 'fix: point Cilium directly at API server and print rollout diagnostics' (#120) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 26m43s
Reviewed-on: #120
2026-03-05 01:25:52 +00:00
e06b2c692e fix: point Cilium directly at API server and print rollout diagnostics
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 18s
Set Cilium k8sServiceHost/k8sServicePort to the primary control-plane API endpoint to avoid in-cluster service routing dependency during early bootstrap. Also print cilium daemonset/pod/log diagnostics when rollout times out.
2026-03-05 01:21:21 +00:00
c48bbddef3 Merge pull request 'fix: stabilize Cilium install defaults and add rollout diagnostics' (#119) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 26m43s
Reviewed-on: #119
2026-03-05 00:52:04 +00:00
ca54c44fa4 fix: stabilize Cilium install defaults and add rollout diagnostics
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 17s
Set Cilium kubeProxyReplacement from env (default false for homelab stability) and collect cilium daemonset/pod/log diagnostics when rollout times out during verification.
2026-03-05 00:48:41 +00:00
8bda08be07 Merge pull request 'fix: hard-reset nodes before kubeadm join retries' (#118) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 29m30s
Reviewed-on: #118
2026-03-05 00:16:31 +00:00
0778de9719 fix: hard-reset nodes before kubeadm join retries
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 17s
Before control-plane and worker joins, remove stale kubelet/kubernetes identity files and run kubeadm reset -f. This prevents preflight failures like FileAvailable--etc-kubernetes-kubelet.conf during repeated reconcile attempts.
2026-03-04 23:38:15 +00:00
92f0658995 Merge pull request 'fix: add heuristic SSH inventory fallback for generic hostnames' (#117) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 19m52s
Reviewed-on: #117
2026-03-04 23:13:08 +00:00
fc4eb1bc6e fix: add heuristic SSH inventory fallback for generic hostnames
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 16s
When Proxmox guest-agent IPs are empty and SSH discovery returns duplicate generic hostnames (e.g. flex), assign remaining missing nodes from unmatched SSH-reachable IPs in deterministic order. Also emit SSH-reachable IP diagnostics on failure.
2026-03-04 23:07:45 +00:00
4b017364c8 Merge pull request 'fix: wait for Cilium and node readiness before marking bootstrap success' (#116) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 8m47s
Reviewed-on: #116
2026-03-04 22:57:39 +00:00
a70de061b0 fix: wait for Cilium and node readiness before marking bootstrap success
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 18s
Update verification stage to block on cilium daemonset rollout and all nodes reaching Ready. This prevents workflows from reporting success while the cluster is still NotReady immediately after join.
2026-03-04 22:26:43 +00:00
9d98f56725 Merge pull request 'fix: add join preflight ignores for homelab control planes' (#115) from stage into master
All checks were successful
Terraform Apply / Terraform Apply (push) Successful in 44m43s
Reviewed-on: #115
2026-03-04 21:13:02 +00:00
5ddd00f711 fix: add join preflight ignores for homelab control planes
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 16s
Append --ignore-preflight-errors=NumCPU,HTTPProxyCIDR to control-plane join commands and HTTPProxyCIDR to worker joins so kubeadm join does not fail on known single-CPU/proxy CIDR checks in this environment.
2026-03-04 21:09:27 +00:00
5af4021228 Merge pull request 'fix: require kubelet kubeconfig before starting service' (#114) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 16m56s
Reviewed-on: #114
2026-03-04 20:46:48 +00:00
034869347a fix: require kubelet kubeconfig before starting service
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 17s
Inline kubelet bootstrap/kubeconfig flags in ExecStart and gate startup on /etc/kubernetes/*kubelet.conf in addition to config.yaml. This prevents kubelet entering standalone mode with webhook auth enabled when no client config is present.
2026-03-04 20:45:47 +00:00
50d0d99332 Merge pull request 'stage' (#113) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 18m7s
Reviewed-on: #113
2026-03-04 19:32:40 +00:00
f0093deedc fix: avoid assigning control-plane VIP as node SSH address
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 15s
Exclude the configured VIP suffix from subnet scans and prefer non-VIP IPs when multiple SSH endpoints resolve to the same node. This prevents cp-1 being discovered as .250 and later failing SSH commands against the floating VIP.
2026-03-04 19:26:37 +00:00
6b6ca021c9 fix: add kubelet bootstrap kubeconfig args to systemd unit
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 17s
Include KUBELET_KUBECONFIG_ARGS in kubelet ExecStart so kubelet can authenticate with bootstrap-kubelet.conf/kubelet.conf and register node objects during kubeadm init.
2026-03-04 19:26:07 +00:00
c034f7975c Merge pull request 'stage' (#112) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 28m53s
Reviewed-on: #112
2026-03-04 18:51:53 +00:00
90ef0ec33f Merge branch 'master' into stage
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 17s
2026-03-04 18:42:22 +00:00
ba6cf42c04 fix: restart kubelet during CRISocket recovery and add registration diagnostics
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 16s
When kubeadm init fails at upload-config/kubelet due missing node object, explicitly restart kubelet to ensure bootstrap flags are loaded before waiting for node registration. Add kubelet flag dump and focused registration log output to surface auth/cert errors.
2026-03-04 18:37:50 +00:00
3cd0c70727 fix: stop overriding kubelet config in kubeadm init
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 17s
Remove custom KubeletConfiguration from init config so kubeadm uses default kubelet authn/authz settings and bootstrap registration path. This avoids the standalone-style kubelet behavior where the node never appears in the API.
2026-03-04 18:35:34 +00:00
3281ebd216 Merge pull request 'fix: recover from kubeadm CRISocket node-registration race' (#111) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 18m6s
Reviewed-on: #111
2026-03-04 03:03:17 +00:00
d2dd6105a6 fix: recover from kubeadm CRISocket node-registration race
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 17s
Handle kubeadm init failures where upload-config/kubelet runs before the node object exists. When that specific error occurs, wait for cp-1 registration and run upload-config kubelet phase explicitly instead of aborting immediately.
2026-03-04 03:00:34 +00:00
981afc509a Merge pull request 'fix: use kubeadm v1beta4 list format for kubeletExtraArgs' (#110) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 19m48s
Reviewed-on: #110
2026-03-04 02:32:22 +00:00
b3c975bd73 fix: use kubeadm v1beta4 list format for kubeletExtraArgs
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 17s
kubeadm v1beta4 expects nodeRegistration.kubeletExtraArgs as a list of name/value args, not a map. Switch hostname-override to the correct structure so init config unmarshals successfully.
2026-03-04 02:00:07 +00:00
8aab666fad Merge pull request 'fix: hard reset kubelet identity before kubeadm init' (#109) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 12m25s
Reviewed-on: #109
2026-03-04 01:42:55 +00:00