238 Commits

Author SHA1 Message Date
5bfc135350 Merge pull request 'fix: ignore stale SSH host keys for ephemeral homelab VMs' (#130) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 19m24s
Reviewed-on: #130
2026-03-09 03:45:11 +00:00
63213a4bc3 fix: ignore stale SSH host keys for ephemeral homelab VMs
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 16s
Fresh destroy/recreate cycles change VM host keys, which broke bootstrap after rebuilds. Use a disposable known-hosts policy in the controller SSH options so automation does not fail on expected key rotation.
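A minimal sketch of such a disposable policy, assuming plain OpenSSH client options (the controller's exact flags and the micqdf@cp-1 endpoint are illustrative):

# Hedged sketch: never persist or enforce host keys for throwaway VMs.
ssh -o UserKnownHostsFile=/dev/null \
    -o StrictHostKeyChecking=accept-new \
    -o ConnectTimeout=10 \
    micqdf@cp-1 true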
2026-03-09 03:16:18 +00:00
e4243c7667 Merge pull request 'fix: keep DHCP enabled by default on template VM' (#129) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 1h50m42s
Reviewed-on: #129
2026-03-08 22:03:17 +00:00
33bb0ffb17 fix: keep DHCP enabled by default on template VM
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 14s
The template machine can lose connectivity when rebuilt directly because it has no cloud-init network data during template maintenance. Restore DHCP as the default for the template itself while keeping cloud-init + networkd enabled so cloned VMs can still consume injected network settings.
2026-03-08 20:12:03 +00:00
7434a65590 Merge pull request 'stage' (#128) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 6m54s
Reviewed-on: #128
2026-03-08 18:06:46 +00:00
cd8e538c51 ci: switch checkout action source away from gitea.com mirror
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 16s
The gitea.com checkout action mirror is timing out during workflow startup. Use actions/checkout@v4 directly so jobs do not fail before any repository logic runs.
2026-03-08 13:36:21 +00:00
808c290c71 chore: clarify stale template cloud-init failure message
Some checks failed
Terraform Plan / Terraform Plan (push) Failing after 31s
Make SSH bootstrap failures explain the real root cause when fresh clones never accept the injected user/key: the Proxmox source template itself still needs the updated cloud-init-capable NixOS configuration.
2026-03-08 13:16:37 +00:00
15e6471e7e Merge pull request 'fix: enable cloud-init networking in NixOS template' (#127) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 7m10s
Reviewed-on: #127
2026-03-08 05:33:57 +00:00
79a4c941e5 fix: enable cloud-init networking in NixOS template
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 16s
Freshly recreated VMs were reachable but did not accept the injected SSH key, which indicates Proxmox cloud-init settings were not being applied. Enable cloud-init and cloud-init network handling in the base template so static IPs, hostname, ciuser, and SSH keys take effect on first boot.
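For reference, the Proxmox side of that injection looks roughly like the following qm invocation (VMID, user, key path, and addresses are placeholders; Terraform drives the same settings through the provider API rather than the CLI):

# Hedged sketch: the cloud-init fields a clone consumes on first boot.
qm set 201 \
  --ciuser micqdf \
  --sshkeys /tmp/authorized_keys.pub \
  --ipconfig0 "ip=192.168.1.61/24,gw=192.168.1.1"
qm cloudinit update 201  # regenerate the cloud-init drive (recent PVE)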
2026-03-08 05:16:19 +00:00
e9bac70cae Merge pull request 'fix: wait for SSH readiness after VM provisioning' (#126) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 6m56s
Reviewed-on: #126
2026-03-08 05:04:43 +00:00
4c167f618a fix: wait for SSH readiness after VM provisioning
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 17s
Freshly recreated VMs can take a few minutes before cloud-init users and SSH are available. Retry SSH authentication in the bootstrap controller before failing so rebuild/bootstrap does not abort immediately on new hosts.
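A sketch of the retry shape as a shell loop (the controller implements this in Python; address and timings are illustrative):

# Hedged sketch: poll SSH auth for up to ~5 minutes before giving up.
for attempt in $(seq 1 30); do
  if ssh -o BatchMode=yes -o ConnectTimeout=5 "micqdf@$NODE_IP" true 2>/dev/null; then
    echo "SSH ready after $attempt attempt(s)"; break
  fi
  [ "$attempt" -eq 30 ] && { echo "SSH never became ready" >&2; exit 1; }
  sleep 10
done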
2026-03-08 05:00:39 +00:00
97295a7071 Merge pull request 'ci: speed up Terraform destroy plan by skipping refresh' (#125) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 7m0s
Reviewed-on: #125
2026-03-08 04:47:02 +00:00
7bc861b3e8 ci: speed up Terraform destroy plan by skipping refresh
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 16s
Use terraform plan -refresh=false for destroy workflows so manual NUKE runs do not spend minutes refreshing Proxmox VM state before building the destroy plan.
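Concretely, the destroy plan can be produced and applied without the refresh step (standard Terraform CLI flags):

terraform plan -destroy -refresh=false -out=destroy.tfplan
terraform apply destroy.tfplan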
2026-03-08 04:37:52 +00:00
6ca189b32c Merge pull request 'fix: vendor Flannel manifest and harden CNI bootstrap timing' (#124) from stage into master
All checks were successful
Terraform Apply / Terraform Apply (push) Successful in 15m11s
Reviewed-on: #124
2026-03-08 04:10:47 +00:00
b7b364a112 fix: vendor Flannel manifest and harden CNI bootstrap timing
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 17s
Stop depending on GitHub during cluster bring-up by shipping the Flannel manifest in-repo, ensure required host paths exist on NixOS nodes, and wait/retry against a stable API before applying the CNI. This removes the TLS handshake timeout failure mode and makes early network bootstrap deterministic.
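A rough sketch of the wait-then-apply ordering, assuming a vendored manifest at manifests/kube-flannel.yml (path illustrative):

# Hedged sketch: block on a stable API server before installing the CNI.
until kubectl --kubeconfig /etc/kubernetes/admin.conf get --raw /healthz >/dev/null 2>&1; do
  sleep 5
done
kubectl --kubeconfig /etc/kubernetes/admin.conf apply -f manifests/kube-flannel.yml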
2026-03-08 03:24:16 +00:00
2aa9950f59 Merge pull request 'fix: add mount utility to kubelet service PATH' (#123) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 11m10s
Reviewed-on: #123
2026-03-08 02:16:23 +00:00
bd866f7dac fix: add mount utility to kubelet service PATH
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 16s
Flannel pods were stuck because kubelet could not execute mount for projected service account volumes on NixOS. Add util-linux to the kubelet systemd PATH so mount is available during volume setup.
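On NixOS this lands in the kubelet module's path list; expressed as a plain systemd drop-in it would look roughly like this (the NixOS system-profile path is shown for illustration):

# Hedged sketch: make mount(8) from util-linux visible to kubelet.
mkdir -p /etc/systemd/system/kubelet.service.d
cat > /etc/systemd/system/kubelet.service.d/10-path.conf <<'EOF'
[Service]
Environment=PATH=/run/current-system/sw/bin:/usr/bin:/bin
EOF
systemctl daemon-reload && systemctl restart kubelet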
2026-03-07 14:18:20 +00:00
c1f86483ad Merge pull request 'debug: print detailed Flannel pod diagnostics on rollout timeout' (#122) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 23m50s
Reviewed-on: #122
2026-03-07 12:31:43 +00:00
0cce4bcf72 Merge branch 'master' into stage
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 16s
2026-03-07 12:22:01 +00:00
065567210e debug: print detailed Flannel pod diagnostics on rollout timeout
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 18s
When kube-flannel daemonset rollout stalls, print pod descriptions and per-container logs for the init containers and main flannel container so the next failure shows the actual cause instead of only Init:0/2.
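The diagnostics boil down to describe-plus-logs per container, assuming the upstream kube-flannel manifest's namespace and container names:

kubectl -n kube-flannel describe pods -l app=flannel
for c in install-cni-plugin install-cni kube-flannel; do
  echo "--- logs: $c ---"
  kubectl -n kube-flannel logs -l app=flannel -c "$c" --tail=50 || true
done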
2026-03-07 12:19:21 +00:00
c5f0b1ac37 Merge pull request 'stage' (#121) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 30m28s
Reviewed-on: #121
2026-03-07 01:01:38 +00:00
e740d47011 Merge branch 'master' into stage
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 16s
2026-03-07 00:57:47 +00:00
d9d3976c4c fix: use self-contained Terraform variable validations
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 17s
Terraform variable validation blocks can only reference the variable under validation. Replace count-based checks with fixed-length validations for the current 3 control planes and 3 workers.
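A minimal sketch of the self-contained form (variable name illustrative; the repo's real files use its own variables):

cat > example-validation.tf <<'EOF'
variable "control_plane_ips" {
  type = list(string)
  validation {
    # May only reference var.control_plane_ips itself, not other variables.
    condition     = length(var.control_plane_ips) == 3
    error_message = "Exactly 3 control-plane IPs are required."
  }
}
EOF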
2026-03-07 00:54:51 +00:00
a0b07816b9 refactor: simplify homelab bootstrap around static IPs and fresh runs
Some checks failed
Terraform Plan / Terraform Plan (push) Failing after 10s
Make Terraform the source of truth for node IPs, remove guest-agent/SSH discovery from the normal workflow path, simplify the bootstrap controller to a fresh-run flow, and swap the initial CNI to Flannel so cluster readiness is easier to prove before reintroducing more complex reconcile behavior.
2026-03-07 00:52:35 +00:00
d964ff8b50 Merge pull request 'fix: point Cilium directly at API server and print rollout diagnostics' (#120) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 26m43s
Reviewed-on: #120
2026-03-05 01:25:52 +00:00
e06b2c692e fix: point Cilium directly at API server and print rollout diagnostics
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 18s
Set Cilium k8sServiceHost/k8sServicePort to the primary control-plane API endpoint to avoid in-cluster service routing dependency during early bootstrap. Also print cilium daemonset/pod/log diagnostics when rollout times out.
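With the Cilium Helm chart these correspond to two values; a hedged sketch with a placeholder control-plane IP, assuming the cilium repo is configured:

helm upgrade --install cilium cilium/cilium -n kube-system \
  --set k8sServiceHost="$CP1_IP" \
  --set k8sServicePort=6443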
2026-03-05 01:21:21 +00:00
c48bbddef3 Merge pull request 'fix: stabilize Cilium install defaults and add rollout diagnostics' (#119) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 26m43s
Reviewed-on: #119
2026-03-05 00:52:04 +00:00
ca54c44fa4 fix: stabilize Cilium install defaults and add rollout diagnostics
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 17s
Set Cilium kubeProxyReplacement from env (default false for homelab stability) and collect cilium daemonset/pod/log diagnostics when rollout times out during verification.
2026-03-05 00:48:41 +00:00
8bda08be07 Merge pull request 'fix: hard-reset nodes before kubeadm join retries' (#118) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 29m30s
Reviewed-on: #118
2026-03-05 00:16:31 +00:00
0778de9719 fix: hard-reset nodes before kubeadm join retries
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 17s
Before control-plane and worker joins, remove stale kubelet/kubernetes identity files and run kubeadm reset -f. This prevents preflight failures like FileAvailable--etc-kubernetes-kubelet.conf during repeated reconcile attempts.
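A sketch of the pre-join cleanup (file list abridged; the controller may clear more state):

# Hedged sketch: drop stale identity, then reset before retrying the join.
rm -f /etc/kubernetes/kubelet.conf /etc/kubernetes/bootstrap-kubelet.conf
rm -rf /var/lib/kubelet/pki
kubeadm reset -f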
2026-03-04 23:38:15 +00:00
92f0658995 Merge pull request 'fix: add heuristic SSH inventory fallback for generic hostnames' (#117) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 19m52s
Reviewed-on: #117
2026-03-04 23:13:08 +00:00
fc4eb1bc6e fix: add heuristic SSH inventory fallback for generic hostnames
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 16s
When Proxmox guest-agent IPs are empty and SSH discovery returns duplicate generic hostnames (e.g. flex), assign remaining missing nodes from unmatched SSH-reachable IPs in deterministic order. Also emit SSH-reachable IP diagnostics on failure.
2026-03-04 23:07:45 +00:00
4b017364c8 Merge pull request 'fix: wait for Cilium and node readiness before marking bootstrap success' (#116) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 8m47s
Reviewed-on: #116
2026-03-04 22:57:39 +00:00
a70de061b0 fix: wait for Cilium and node readiness before marking bootstrap success
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 18s
Update verification stage to block on cilium daemonset rollout and all nodes reaching Ready. This prevents workflows from reporting success while the cluster is still NotReady immediately after join.
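The verification gate reduces to two blocking kubectl calls (daemonset name per the Cilium default; timeouts illustrative):

kubectl -n kube-system rollout status ds/cilium --timeout=10m
kubectl wait --for=condition=Ready nodes --all --timeout=10m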
2026-03-04 22:26:43 +00:00
9d98f56725 Merge pull request 'fix: add join preflight ignores for homelab control planes' (#115) from stage into master
All checks were successful
Terraform Apply / Terraform Apply (push) Successful in 44m43s
Reviewed-on: #115
2026-03-04 21:13:02 +00:00
5ddd00f711 fix: add join preflight ignores for homelab control planes
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 16s
Append --ignore-preflight-errors=NumCPU,HTTPProxyCIDR to control-plane join commands and HTTPProxyCIDR to worker joins so kubeadm join does not fail on known single-CPU/proxy CIDR checks in this environment.
2026-03-04 21:09:27 +00:00
5af4021228 Merge pull request 'fix: require kubelet kubeconfig before starting service' (#114) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 16m56s
Reviewed-on: #114
2026-03-04 20:46:48 +00:00
034869347a fix: require kubelet kubeconfig before starting service
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 17s
Inline kubelet bootstrap/kubeconfig flags in ExecStart and gate startup on /etc/kubernetes/*kubelet.conf in addition to config.yaml. This prevents kubelet entering standalone mode with webhook auth enabled when no client config is present.
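Expressed as systemd conditions, the gate looks roughly like this (drop-in path illustrative; the NixOS module generates the equivalent):

cat > /etc/systemd/system/kubelet.service.d/20-gate.conf <<'EOF'
[Unit]
# Hedged sketch: refuse to start until kubeadm has written client config.
ConditionPathExists=/var/lib/kubelet/config.yaml
ConditionPathExistsGlob=/etc/kubernetes/*kubelet.conf
EOF
systemctl daemon-reload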
2026-03-04 20:45:47 +00:00
50d0d99332 Merge pull request 'stage' (#113) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 18m7s
Reviewed-on: #113
2026-03-04 19:32:40 +00:00
f0093deedc fix: avoid assigning control-plane VIP as node SSH address
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 15s
Exclude the configured VIP suffix from subnet scans and prefer non-VIP IPs when multiple SSH endpoints resolve to the same node. This prevents cp-1 being discovered as .250 and later failing SSH commands against the floating VIP.
2026-03-04 19:26:37 +00:00
6b6ca021c9 fix: add kubelet bootstrap kubeconfig args to systemd unit
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 17s
Include KUBELET_KUBECONFIG_ARGS in kubelet ExecStart so kubelet can authenticate with bootstrap-kubelet.conf/kubelet.conf and register node objects during kubeadm init.
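These are the standard kubeadm client-config flags; a sketch with the usual values:

# Hedged sketch: flags kubelet needs to register via the bootstrap path.
KUBELET_KUBECONFIG_ARGS="--bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf"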
2026-03-04 19:26:07 +00:00
c034f7975c Merge pull request 'stage' (#112) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 28m53s
Reviewed-on: #112
2026-03-04 18:51:53 +00:00
90ef0ec33f Merge branch 'master' into stage
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 17s
2026-03-04 18:42:22 +00:00
ba6cf42c04 fix: restart kubelet during CRISocket recovery and add registration diagnostics
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 16s
When kubeadm init fails at upload-config/kubelet due to a missing node object, explicitly restart kubelet to ensure bootstrap flags are loaded before waiting for node registration. Add a kubelet flag dump and focused registration log output to surface auth/cert errors.
2026-03-04 18:37:50 +00:00
3cd0c70727 fix: stop overriding kubelet config in kubeadm init
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 17s
Remove custom KubeletConfiguration from init config so kubeadm uses default kubelet authn/authz settings and bootstrap registration path. This avoids the standalone-style kubelet behavior where the node never appears in the API.
2026-03-04 18:35:34 +00:00
3281ebd216 Merge pull request 'fix: recover from kubeadm CRISocket node-registration race' (#111) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 18m6s
Reviewed-on: #111
2026-03-04 03:03:17 +00:00
d2dd6105a6 fix: recover from kubeadm CRISocket node-registration race
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 17s
Handle kubeadm init failures where upload-config/kubelet runs before the node object exists. When that specific error occurs, wait for cp-1 registration and run upload-config kubelet phase explicitly instead of aborting immediately.
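A sketch of the recovery path, assuming the init config lives at /etc/kubernetes/kubeadm-init.yaml (path illustrative):

# Hedged sketch: wait for the node object, then re-run only the failed phase.
until kubectl --kubeconfig /etc/kubernetes/admin.conf get node cp-1 >/dev/null 2>&1; do
  sleep 5
done
kubeadm init phase upload-config kubelet --config /etc/kubernetes/kubeadm-init.yaml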
2026-03-04 03:00:34 +00:00
981afc509a Merge pull request 'fix: use kubeadm v1beta4 list format for kubeletExtraArgs' (#110) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 19m48s
Reviewed-on: #110
2026-03-04 02:32:22 +00:00
b3c975bd73 fix: use kubeadm v1beta4 list format for kubeletExtraArgs
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 17s
kubeadm v1beta4 expects nodeRegistration.kubeletExtraArgs as a list of name/value args, not a map. Switch hostname-override to the correct structure so init config unmarshals successfully.
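For comparison, the v1beta4 shape (the older v1beta3 map form `hostname-override: cp-1` no longer unmarshals):

cat > /tmp/init-snippet.yaml <<'EOF'
apiVersion: kubeadm.k8s.io/v1beta4
kind: InitConfiguration
nodeRegistration:
  kubeletExtraArgs:
    - name: hostname-override
      value: cp-1
EOF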
2026-03-04 02:00:07 +00:00
8aab666fad Merge pull request 'fix: hard reset kubelet identity before kubeadm init' (#109) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 12m25s
Reviewed-on: #109
2026-03-04 01:42:55 +00:00
308a2fd4b7 fix: hard reset kubelet identity before kubeadm init
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 17s
Clear kubelet cert/bootstrap artifacts after reset and force hostname override in kubeadm nodeRegistration so the node consistently registers as cp-1 instead of inheriting stale template identity.
2026-03-04 01:35:41 +00:00
3fd7ed48b1 Merge pull request 'fix: pin kubeadm init node identity to flake hostname' (#108) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 15m22s
Reviewed-on: #108
2026-03-04 01:18:51 +00:00
0cc0de2aea fix: pin kubeadm init node identity to flake hostname
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 17s
Set hostname before init and inject nodeRegistration.name into kubeadm InitConfiguration so cp-1 registers as the expected node (cp-1) instead of inheriting the template hostname. This fixes upload-config/kubelet failures caused by node lookup for k8s-base-template.
2026-03-04 01:17:44 +00:00
99458ca829 Merge pull request 'fix: force fresh kubeadm init after rebuild and make kubelet enable-able' (#107) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 17m1s
Reviewed-on: #107
2026-03-04 00:56:30 +00:00
422b7d7f23 fix: force fresh kubeadm init after rebuild and make kubelet enable-able
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 17s
Always re-run primary init when reconcile performs node rebuilds to avoid stale/partial cluster state causing join preflight failures. Also add wantedBy for kubelet so systemctl enable works as expected during join/init flows.
2026-03-04 00:55:20 +00:00
adc8a620f4 Merge pull request 'fix: force fresh bootstrap stages after rebuild and stabilize join node identity' (#106) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 20m28s
Reviewed-on: #106
2026-03-04 00:32:06 +00:00
3ebeb121b4 fix: force fresh bootstrap stages after rebuild and stabilize join node identity
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 17s
Clear completed bootstrap stage checkpoints whenever nodes are rebuilt so reconcile does not skip required init/cni/join work on fresh hosts. Also pass explicit --node-name for control-plane and worker joins, and ensure kubelet is enabled before join commands run.
2026-03-04 00:26:37 +00:00
f11aadf79c Merge pull request 'fix: map SSH-discovered nodes by VMID when hostnames are generic' (#105) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 27m43s
Reviewed-on: #105
2026-03-03 23:37:45 +00:00
b4265a649e fix: map SSH-discovered nodes by VMID when hostnames are generic
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 16s
Some freshly cloned VMs still report template/generic hostnames during discovery. Probe DMI product serial over SSH and map it to Terraform VMIDs so cp-2/cp-3/wk-2 can be resolved even before hostname reconciliation.
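The probe itself is a one-liner over SSH (the sysfs path is standard; product_serial is root-readable, hence sudo):

ssh "micqdf@$ip" sudo cat /sys/class/dmi/id/product_serial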
2026-03-03 22:16:35 +00:00
09d2f56967 Merge pull request 'fix: make SSH inventory discovery more reliable on CI' (#104) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 8m46s
Reviewed-on: #104
2026-03-03 21:45:57 +00:00
9ae8eb6134 fix: make SSH inventory discovery more reliable on CI
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 16s
Increase default SSH timeout, reduce scan concurrency, and add a second slower scan pass to avoid transient misses on busy runners. Also print discovered hostnames to improve failure diagnostics when node-name matching fails.
2026-03-03 21:08:29 +00:00
f2b9da8a59 Merge pull request 'fix: run Cilium install with sudo and explicit kubeconfig' (#103) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 3m22s
Reviewed-on: #103
2026-03-03 08:56:49 +00:00
a66ae788f6 fix: run Cilium install with sudo and explicit kubeconfig
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 17s
Use sudo for helm/kubectl on cp-1 and pass /etc/kubernetes/admin.conf so the controller can install Cilium without permission errors.
2026-03-03 08:55:22 +00:00
5fa96e27d7 Merge pull request 'fix: ensure kubelet is enabled for kubeadm init node registration' (#102) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 10m43s
Reviewed-on: #102
2026-03-03 01:13:47 +00:00
cbb8358ce6 fix: ensure kubelet is enabled for kubeadm init node registration
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 17s
Enable kubelet before kubeadm init and stop forcing kubelet out of wantedBy so kubeadm can reliably register the node during upload-config/kubelet. Also clear stale kubelet config files during remote prep to avoid restart-loop leftovers.
2026-03-03 01:04:50 +00:00
31017b5c3e Merge pull request 'fix: rebuild nodes by default on reconcile' (#101) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 13m53s
Reviewed-on: #101
2026-03-03 00:46:26 +00:00
a16112a87a fix: rebuild nodes by default on reconcile
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 17s
Do not skip node rebuilds unless SKIP_REBUILD=1 is explicitly set. This prevents stale remote helper scripts from being reused across retries after bootstrap logic changes.
2026-03-03 00:34:55 +00:00
f53d087c9c Merge pull request 'fix: use valid kube-vip log flag value' (#100) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 6m29s
Reviewed-on: #100
2026-03-03 00:26:08 +00:00
51b56e562e fix: use valid kube-vip log flag value
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 17s
kube-vip expects an unsigned integer for --log. Replace --log -4 with --log 4 so manifest generation no longer fails during bootstrap.
2026-03-03 00:25:25 +00:00
0e0643a6fc Merge pull request 'refactor: add Python bootstrap controller with resumable state' (#99) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 11m46s
Reviewed-on: #99
2026-03-03 00:10:19 +00:00
6fecfb3ee6 refactor: add Python bootstrap controller with resumable state
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 17s
Introduce a clean orchestration layer in nixos/kubeadm/bootstrap/controller.py and slim rebuild-and-bootstrap.sh into a thin wrapper. The controller now owns preflight, rebuild, init, CNI install, join, and verify stages with persisted checkpoints on cp-1 plus a local state copy for CI debugging.
2026-03-03 00:09:10 +00:00
7a0016b003 Merge pull request 'fix: preserve kube-vip mount path and only swap hostPath to super-admin' (#98) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Has been cancelled
Reviewed-on: #98
2026-03-03 00:00:48 +00:00
355273add5 fix: preserve kube-vip mount path and only swap hostPath to super-admin
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 19s
The previous replacement changed both mountPath and hostPath, causing kube-vip to lose its expected in-container kubeconfig path and exit. Keep mountPath at /etc/kubernetes/admin.conf, swap only hostPath during bootstrap, and enable kube-vip debug log level.
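A sketch of the narrower replacement, assuming the manifest's hostPath volume uses the lowercase `path:` key (standard for static pods) while the container keeps `mountPath:`:

# Hedged sketch: rewrite only the hostPath source, not the mountPath.
sed -i 's|path: /etc/kubernetes/admin.conf|path: /etc/kubernetes/super-admin.conf|' \
  /etc/kubernetes/manifests/kube-vip.yaml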
2026-03-02 23:59:41 +00:00
e5162c220c Merge pull request 'fix: bootstrap kube-vip without leader election' (#97) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 17m12s
Reviewed-on: #97
2026-03-02 23:31:52 +00:00
262e9eb4d7 fix: bootstrap kube-vip without leader election
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 17s
Run first-control-plane kube-vip manifest without --leaderElection so VIP can bind before API/RBAC are fully available. Also print kube-vip container exit details on failure.
2026-03-02 23:28:44 +00:00
84513f4bb8 Merge pull request 'fix: run kube-vip in control-plane-only mode during bootstrap' (#96) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 16m50s
Reviewed-on: #96
2026-03-02 22:53:22 +00:00
c445638d4a fix: run kube-vip in control-plane-only mode during bootstrap
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 17s
Remove --services from kube-vip static pod manifests for init/join. Service LB mode can crash-loop during kubeadm bootstrap before cluster RBAC is ready, which prevented VIP binding.
2026-03-02 22:52:44 +00:00
678b383063 Merge pull request 'stage' (#95) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 17m14s
Reviewed-on: #95
2026-03-02 22:33:27 +00:00
880bbcceca ci: speed up Terraform plan by skipping refresh in pipelines
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 16s
Use terraform plan -refresh=false in plan/apply workflows to avoid slow Proxmox state refresh on every push. This keeps CI fast while preserving apply behavior from the generated plan.
2026-03-02 22:32:10 +00:00
190dc2e095 fix: restore compatibility with older nixos-rebuild sudo flag
Some checks failed
Terraform Plan / Terraform Plan (push) Has been cancelled
Use --use-remote-sudo in the rebuild script since the runner's nixos-rebuild does not support --sudo yet.
2026-03-02 22:30:38 +00:00
d86b0a32a2 Merge pull request 'fix: stabilize kubeadm bootstrap and reduce Proxmox plan latency' (#94) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 16m3s
Reviewed-on: #94
2026-03-02 22:13:28 +00:00
a81799a2b5 fix: stabilize kubeadm bootstrap and reduce Proxmox plan latency
Some checks failed
Terraform Plan / Terraform Plan (push) Has been cancelled
Move kubeadm reset ahead of kube-vip manifest generation, use super-admin.conf during bootstrap for kube-vip, and restore admin.conf after init. Also switch nixos-rebuild to --sudo and make QEMU guest agent optional so Terraform plan can skip slow guest-agent refreshes when it is not installed.
2026-03-02 22:09:10 +00:00
6c7182b8f5 Merge pull request 'fix: run kube-vip daemon before kubeadm init' (#93) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 24m52s
Reviewed-on: #93
2026-03-02 21:02:11 +00:00
46c0786e57 fix: run kube-vip daemon before kubeadm init
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 10m8s
- Start kube-vip as a detached container to claim VIP before kubeadm init
- Wait for VIP to be bound before proceeding
- Generate static pod manifest for kube-vip
- Stop bootstrap kube-vip after API server is healthy (static pod takes over)
- Add kube-vip logs output if VIP fails to bind
2026-03-02 20:39:28 +00:00
8b15f061bc Merge pull request 'fix: skip kubeadm wait-control-plane phase, wait for VIP manually' (#92) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 23m51s
Reviewed-on: #92
2026-03-02 19:42:56 +00:00
1af45ca51e fix: skip kubeadm wait-control-plane phase, wait for VIP manually
Some checks failed
Terraform Plan / Terraform Plan (push) Has been cancelled
- Use --skip-phases=wait-control-plane to avoid 4-minute timeout
- Wait for kube-vip to bind VIP before checking API server health
- Add kube-vip logs and VIP status to debug output
2026-03-02 19:37:06 +00:00
c91d28a5dc Merge pull request 'fix: add image pre-pull and debug output for kubeadm init' (#91) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 26m27s
Reviewed-on: #91
2026-03-02 18:36:46 +00:00
533f5a91e0 fix: add image pre-pull and debug output for kubeadm init
Some checks failed
Terraform Plan / Terraform Plan (push) Has been cancelled
- Pre-pull k8s control plane images before init to speed up startup
- Add crictl pods and crictl ps -a output on failure for debugging
2026-03-02 18:35:41 +00:00
cfdfab3ec0 Merge pull request 'fix: disable webhook authz and clean stale kubelet configs' (#90) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 25m1s
Reviewed-on: #90
2026-03-02 18:01:33 +00:00
c061dda31d fix: disable webhook authz and clean stale kubelet configs
Some checks failed
Terraform Plan / Terraform Plan (push) Has been cancelled
- Add authorization.mode: AlwaysAllow to KubeletConfiguration
- Remove stale kubelet config.yaml before unmasking in all kubeadm scripts
- This prevents 'no client provided, cannot use webhook authorization' error
2026-03-02 17:59:31 +00:00
cec60c003c Merge pull request 'fix: disable kubelet webhook auth in kubeadm init config' (#89) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 25m1s
Reviewed-on: #89
2026-03-02 16:50:31 +00:00
fb21fbef4f fix: disable kubelet webhook auth in kubeadm init config
Some checks failed
Terraform Plan / Terraform Plan (push) Has been cancelled
- Use explicit kubeadm config file with KubeletConfiguration
- Disable webhook authentication which was causing 'no client provided' error
- Add ConditionPathExists to kubelet systemd unit
2026-03-02 16:49:21 +00:00
6cc57f8b0e Merge pull request 'fix: kubelet directories and containerd readiness' (#88) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 24m54s
Reviewed-on: #88
2026-03-02 14:45:54 +00:00
1b76e07326 fix: kubelet directories and containerd readiness
Some checks failed
Terraform Plan / Terraform Plan (push) Has been cancelled
- Create /var/lib/kubelet and /var/lib/kubelet/pki directories via tmpfiles
- Ensure containerd is running before kubeadm init
- Add kubelet logs output on kubeadm init failure for debugging
2026-03-02 14:44:47 +00:00
9d17dd17cc Merge pull request 'fix: remove kubelet ConditionPathExists, add daemon-reload' (#87) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 25m5s
Reviewed-on: #87
2026-03-02 14:01:06 +00:00
db72dcab75 fix: remove kubelet ConditionPathExists, add daemon-reload
Some checks failed
Terraform Plan / Terraform Plan (push) Has been cancelled
- Remove ConditionPathExists from kubelet service definition as it
  prevents kubelet from starting when managed by kubeadm
- Add systemctl daemon-reload after unmasking in all kubeadm scripts
- Add reset-failed for consistent state cleanup
2026-03-02 13:58:49 +00:00
23d61a6308 Merge pull request 'fix: mask kubelet before rebuild, unmask in kubeadm helpers' (#86) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 24m58s
Reviewed-on: #86
2026-03-02 12:54:37 +00:00
d42e83358c fix: mask kubelet before rebuild, unmask in kubeadm helpers
Some checks failed
Terraform Plan / Terraform Plan (push) Has been cancelled
- Mask kubelet service entirely before nixos-rebuild to prevent systemd
  from restarting it during switch
- Unmask kubelet in th-kubeadm-init/join scripts before starting
2026-03-02 12:44:40 +00:00
198c147b79 Merge pull request 'fix: prevent kubelet auto-start during rebuild' (#85) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 18m58s
Reviewed-on: #85
2026-03-02 12:14:38 +00:00
93e43a546f fix: prevent kubelet auto-start during rebuild
Some checks failed
Terraform Plan / Terraform Plan (push) Has been cancelled
Add wantedBy = [] to prevent kubelet from being started by multi-user.target
during nixos-rebuild switch. This allows rebuilds to succeed even when the
cluster is in a transitional state. Kubelet will be started by kubeadm
init/join commands instead.
2026-03-02 12:13:05 +00:00
3b03e68f3e Merge pull request 'fix: disable lingering kubelet service before node rebuild' (#84) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 18m50s
Reviewed-on: #84
2026-03-02 10:09:20 +00:00
ab5cc8b01d fix: disable lingering kubelet service before node rebuild
Some checks failed
Terraform Plan / Terraform Plan (push) Has been cancelled
2026-03-02 10:08:27 +00:00
92759407a6 Merge pull request 'fix: stop auto-enabling kubelet during base node rebuild' (#83) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 19m4s
Reviewed-on: #83
2026-03-02 09:17:26 +00:00
f65a414959 fix: stop auto-enabling kubelet during base node rebuild
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 10m8s
2026-03-02 09:13:53 +00:00
03c6d0454a Merge pull request 'fix: gate kubelet startup until kubeadm config exists' (#82) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 18m56s
Reviewed-on: #82
2026-03-02 08:40:39 +00:00
7c849ed019 fix: gate kubelet startup until kubeadm config exists
Some checks failed
Terraform Plan / Terraform Plan (push) Has been cancelled
2026-03-02 08:39:22 +00:00
b8bd9686d3 Merge pull request 'fix: align kubelet systemd unit with kubeadm flags' (#81) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 18m42s
Reviewed-on: #81
2026-03-02 03:48:09 +00:00
388b0c4f5d fix: align kubelet systemd unit with kubeadm flags
Some checks failed
Terraform Plan / Terraform Plan (push) Has been cancelled
2026-03-02 03:44:35 +00:00
cfd72fa750 Merge pull request 'fix: ignore kubeadm HTTPProxyCIDR preflight in homelab workflow' (#80) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 28m13s
Reviewed-on: #80
2026-03-02 03:10:37 +00:00
d810547675 fix: ignore kubeadm HTTPProxyCIDR preflight in homelab workflow
Some checks failed
Terraform Plan / Terraform Plan (push) Has been cancelled
2026-03-02 03:06:29 +00:00
3ed3381140 Merge pull request 'fix: run kubeadm init/reset with clean environment' (#79) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 20m22s
Reviewed-on: #79
2026-03-02 02:39:27 +00:00
9426968cd4 fix: run kubeadm init/reset with clean environment
Some checks failed
Terraform Plan / Terraform Plan (push) Has been cancelled
2026-03-02 02:36:57 +00:00
4569fcd2ea Merge pull request 'fix: harden kubeadm scripts for proxy and preflight issues' (#78) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 20m33s
Reviewed-on: #78
2026-03-02 02:09:11 +00:00
02a6bca60b fix: harden kubeadm scripts for proxy and preflight issues
Some checks failed
Terraform Plan / Terraform Plan (push) Has been cancelled
2026-03-02 02:02:38 +00:00
f7f3c7df3e Merge pull request 'fix: avoid sudo env loss for kube-vip image reference' (#77) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 20m59s
Reviewed-on: #77
2026-03-02 01:32:53 +00:00
a098c0aa29 fix: avoid sudo env loss for kube-vip image reference
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 10m8s
2026-03-02 01:27:44 +00:00
766cd5db4f Merge pull request 'fix: correctly propagate remote command exit status' (#76) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 19m10s
Reviewed-on: #76
2026-03-02 01:04:44 +00:00
9b03cec23e fix: correctly propagate remote command exit status
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 10m7s
2026-03-02 00:52:24 +00:00
5fe36d0963 Merge pull request 'chore: trigger workflows' (#75) from stage into master
All checks were successful
Terraform Apply / Terraform Apply (push) Successful in 19m29s
Reviewed-on: #75
2026-03-02 00:18:38 +00:00
c794e07ab2 chore: trigger workflows
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 10m7s
2026-03-02 00:18:10 +00:00
8103b02883 Merge pull request 'fix: require admin kubeconfig before skipping cp init' (#74) from stage into master
All checks were successful
Terraform Apply / Terraform Apply (push) Successful in 19m40s
Reviewed-on: #74
2026-03-01 23:43:29 +00:00
fd7be1a428 fix: require admin kubeconfig before skipping cp init
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 10m8s
2026-03-01 23:42:56 +00:00
6262f61506 Merge pull request 'fix: make cp-1 init detection and join token generation robust' (#73) from stage into master
All checks were successful
Terraform Apply / Terraform Apply (push) Successful in 19m26s
Reviewed-on: #73
2026-03-01 22:40:10 +00:00
c0b820c92a Merge branch 'master' into stage
Some checks are pending
Terraform Plan / Terraform Plan (push) Waiting to run
2026-03-01 22:40:05 +00:00
f9e7356f94 fix: make cp-1 init detection and join token generation robust
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 9m44s
2026-03-01 21:56:59 +00:00
27185ed17a Merge pull request 'fix: recover when admin kubeconfig is missing on primary control plane' (#72) from stage into master
All checks were successful
Terraform Apply / Terraform Apply (push) Successful in 19m30s
Reviewed-on: #72
2026-03-01 21:30:33 +00:00
9baf35d886 Merge branch 'master' into stage
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 10m7s
2026-03-01 21:30:28 +00:00
a5f0f0a420 fix: recover when admin kubeconfig is missing on primary control plane
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 10m7s
2026-03-01 20:58:44 +00:00
310d273378 Merge pull request 'fix: use admin kubeconfig for final cluster node check' (#71) from stage into master
All checks were successful
Terraform Apply / Terraform Apply (push) Successful in 19m16s
Reviewed-on: #71
2026-03-01 20:38:17 +00:00
661fbc2ff4 fix: use admin kubeconfig for final cluster node check
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 10m7s
2026-03-01 20:31:57 +00:00
3b0219f211 Merge pull request 'feat: add SSH-based fallback for kubeadm IP inventory' (#70) from stage into master
All checks were successful
Terraform Apply / Terraform Apply (push) Successful in 20m6s
Reviewed-on: #70
2026-03-01 20:07:55 +00:00
3fa227d7c9 feat: add SSH-based fallback for kubeadm IP inventory
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 10m7s
2026-03-01 19:28:15 +00:00
61db9a26d9 Merge pull request 'fix: retry kubeadm inventory generation until VM IPs appear' (#69) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 12m43s
Reviewed-on: #69
2026-03-01 19:04:05 +00:00
8f915201e3 Merge branch 'master' into stage
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 10m6s
2026-03-01 18:46:59 +00:00
a933341c28 fix: retry kubeadm inventory generation until VM IPs appear
Some checks failed
Terraform Plan / Terraform Plan (push) Has been cancelled
2026-03-01 18:42:18 +00:00
f90e971fab Merge pull request 'fix: fail fast when terraform node IP outputs are empty' (#68) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 10m9s
Reviewed-on: #68
2026-03-01 18:07:20 +00:00
920c0c10b8 Merge branch 'master' into stage
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 10m6s
2026-03-01 18:07:02 +00:00
718a9930e8 fix: fail fast when terraform node IP outputs are empty
Some checks failed
Terraform Plan / Terraform Plan (push) Has been cancelled
2026-03-01 18:01:09 +00:00
a9f6153623 Merge pull request 'fix: auto-detect kube-vip interface and tighten SSH fallback' (#67) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 11m28s
Reviewed-on: #67
2026-03-01 17:35:34 +00:00
9edb8f807d Merge branch 'master' into stage
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 10m5s
2026-03-01 17:34:57 +00:00
7ec1ce92cf fix: auto-detect kube-vip interface and tighten SSH fallback
Some checks failed
Terraform Plan / Terraform Plan (push) Has been cancelled
2026-03-01 17:34:09 +00:00
198f0e2910 Merge pull request 'stage' (#66) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 50m2s
Reviewed-on: #66
2026-03-01 13:55:31 +00:00
88db11292d fix: fallback SSH user per host during bootstrap steps
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 10m6s
2026-03-01 13:34:15 +00:00
8bd064c828 fix: keep micqdf user during kubeadm node rebuilds
Some checks failed
Terraform Plan / Terraform Plan (push) Has been cancelled
2026-03-01 13:31:46 +00:00
364d407fb7 Merge pull request 'fix: avoid in-place VM updates on unreliable provider' (#65) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 55m11s
Reviewed-on: #65
2026-03-01 03:58:10 +00:00
c8771b897c Merge branch 'master' into stage
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 15s
2026-03-01 03:57:40 +00:00
68c896d629 fix: avoid in-place VM updates on unreliable provider
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 1m58s
2026-03-01 03:45:28 +00:00
39f1e44f9b Merge pull request 'perf: speed up first bootstrap with fast-mode defaults' (#64) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 5m14s
Reviewed-on: #64
2026-03-01 03:36:21 +00:00
760d0e8b5b perf: speed up first bootstrap with fast-mode defaults
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 1m59s
2026-03-01 03:33:42 +00:00
e48726934f Merge pull request 'feat: convert template-base into k8s-ready VM template' (#63) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Has been cancelled
Reviewed-on: #63
2026-03-01 03:03:49 +00:00
92a0908ff5 Merge branch 'master' into stage
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 14s
2026-03-01 03:03:24 +00:00
3bdf3f8d84 feat: convert template-base into k8s-ready VM template
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 16s
2026-03-01 01:24:45 +00:00
42b931668f Merge pull request 'fix: restore use-remote-sudo for nixos-rebuild compatibility' (#62) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Has been cancelled
Reviewed-on: #62
2026-03-01 00:22:57 +00:00
dad409a5b7 fix: restore use-remote-sudo for nixos-rebuild compatibility
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 20s
2026-02-28 23:20:12 +00:00
4d6ac7d9dd Merge pull request 'fix: preserve terraform PATH in destroy plan retry' (#61) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 8m42s
Reviewed-on: #61
2026-02-28 23:05:24 +00:00
0a51dfc0e1 fix: preserve terraform PATH in destroy plan retry
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 19s
2026-02-28 23:04:12 +00:00
92084c3e1a Merge pull request 'fix: enable nix-command for remote gc and use --sudo' (#60) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 2m55s
Reviewed-on: #60
2026-02-28 22:58:28 +00:00
6a77c96ad9 Merge branch 'master' into stage
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 19s
2026-02-28 22:57:59 +00:00
45e818b113 fix: enable nix-command for remote gc and use --sudo
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 21s
2026-02-28 22:55:15 +00:00
47ec65a7fd Merge pull request 'stage' (#59) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Has been cancelled
Reviewed-on: #59
2026-02-28 22:45:17 +00:00
97795fe376 Merge branch 'master' into stage
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 15s
2026-02-28 22:44:17 +00:00
24c3f56399 fix: add timeout and retry for terraform refresh-heavy plans
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 5m22s
2026-02-28 22:23:01 +00:00
f5d9eba9d0 feat: parallelize worker rebuilds with retry and timeout
Some checks failed
Terraform Plan / Terraform Plan (push) Has been cancelled
2026-02-28 22:15:48 +00:00
3e720f1d58 Merge pull request 'stage' (#58) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Has been cancelled
Reviewed-on: #58
2026-02-28 21:29:05 +00:00
23a85cc099 Merge branch 'master' into stage
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 14s
2026-02-28 21:28:58 +00:00
824e3c09d1 update: increase VM disk sizes for kubeadm nodes
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 13s
2026-02-28 21:25:44 +00:00
327c07314c fix: reclaim remote nix store space before rebuild
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 13s
2026-02-28 21:24:26 +00:00
21425c363d Merge pull request 'fix: force bash for remote kubeadm commands' (#57) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 10m49s
Reviewed-on: #57
2026-02-28 21:09:49 +00:00
f6805f8a39 Merge branch 'master' into stage
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 13s
2026-02-28 21:07:53 +00:00
3b5d04dda2 fix: force bash for remote kubeadm commands
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 18s
2026-02-28 21:06:35 +00:00
f5675d2a84 Merge pull request 'fix: preconfigure remote nix trusted-users before rebuild' (#56) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 6m40s
Reviewed-on: #56
2026-02-28 20:58:58 +00:00
cf98bdf229 Merge branch 'master' into stage
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 14s
2026-02-28 20:58:27 +00:00
ba912810d1 fix: preconfigure remote nix trusted-users before rebuild
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 12s
2026-02-28 20:25:50 +00:00
727c21e43b Merge pull request 'stage' (#55) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 9m38s
Reviewed-on: #55
2026-02-28 20:13:44 +00:00
70ff5ccef9 Merge branch 'master' into stage
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 14s
2026-02-28 20:11:37 +00:00
5c037d9a99 fix: prefer root SSH for deploy and trust micqdf in nix
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 18s
2026-02-28 20:03:26 +00:00
244887e9c2 fix: auto-detect SSH login user for node operations
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 18s
2026-02-28 19:25:48 +00:00
129c639e4d Merge pull request 'fix: ignore recurrent Proxmox cloud-init drift fields' (#54) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 9m39s
Reviewed-on: #54
2026-02-28 19:13:39 +00:00
6105a314b7 Merge remote-tracking branch 'origin/master' into stage
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 13s
2026-02-28 19:09:38 +00:00
89bc2242cb fix: ignore recurrent Proxmox cloud-init drift fields
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 12s
2026-02-28 19:06:38 +00:00
fce8f9c70c Merge pull request 'fix: allow required VM reboots and serialize apply' (#53) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 2m4s
Reviewed-on: #53
2026-02-28 19:02:04 +00:00
c1c1b3d7f7 Merge remote-tracking branch 'origin/master' into stage
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 19s
2026-02-28 19:00:36 +00:00
cc40dff49a fix: allow required VM reboots and serialize apply
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 18s
2026-02-28 18:55:07 +00:00
812fcb8066 Merge pull request 'fix: ignore cloud-init ssh drift on existing VMs' (#52) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 26s
Reviewed-on: #52
2026-02-28 18:51:57 +00:00
d190f64181 fix: ignore cloud-init ssh drift on existing VMs
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 18s
2026-02-28 18:46:14 +00:00
2126cf5004 Merge pull request 'fix: repair SSH key step quoting in workflows' (#51) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 25s
Reviewed-on: #51
2026-02-28 18:38:07 +00:00
2a5ecebd99 fix: repair SSH key step quoting in workflows
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 17s
2026-02-28 18:36:58 +00:00
17ac3fad4c Merge pull request 'fix: support base64 SSH private keys in workflows' (#50) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 5m14s
Reviewed-on: #50
2026-02-28 18:25:36 +00:00
3ee5cfa823 fix: support base64 SSH private keys in workflows
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 17s
2026-02-28 18:13:56 +00:00
2078afa8a3 Merge pull request 'fix: normalize escaped SSH private key secrets' (#49) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 5m10s
Reviewed-on: #49
2026-02-28 18:06:31 +00:00
2d9d6cdcd5 fix: normalize escaped SSH private key secrets
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 18s
2026-02-28 17:57:58 +00:00
8b363497b7 Merge pull request 'fix: prefer SSH_KEY_PRIVATE and validate keypair fingerprint' (#48) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 5m8s
Reviewed-on: #48
2026-02-28 17:50:47 +00:00
03fff813ac fix: prefer SSH_KEY_PRIVATE and validate keypair fingerprint
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 13s
2026-02-28 17:40:25 +00:00
a8195f97dc Merge pull request 'fix: force explicit SSH identity for kubeadm remote operations' (#47) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 4m53s
Reviewed-on: #47
2026-02-28 17:22:56 +00:00
c94c1f61d8 fix: force explicit SSH identity for kubeadm remote operations
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 18s
2026-02-28 17:16:31 +00:00
7cdb0bb00b Merge pull request 'fix: preseed known_hosts for kubeadm SSH operations' (#46) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 4m48s
Reviewed-on: #46
2026-02-28 17:09:04 +00:00
046de9b3d4 fix: preseed known_hosts for kubeadm SSH operations
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 19s
2026-02-28 17:07:43 +00:00
b75e6b0124 Merge pull request 'fix: avoid PATH override that hides bash on runners' (#45) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 4m55s
Reviewed-on: #45
2026-02-28 17:01:34 +00:00
b6ce31ad6c fix: avoid PATH override that hides bash on runners
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 17s
2026-02-28 17:01:00 +00:00
6f2fa0ef06 Merge pull request 'fix: load nix profile from root path on act runners' (#44) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 1m47s
Reviewed-on: #44
2026-02-28 16:57:42 +00:00
71890c00c0 fix: load nix profile from root path on act runners
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 17s
2026-02-28 16:57:08 +00:00
f8379e6d08 Merge pull request 'fix: add nixbld users as explicit group members' (#43) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 1m3s
Reviewed-on: #43
2026-02-28 16:55:01 +00:00
8d809355eb fix: add nixbld users as explicit group members
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 17s
2026-02-28 16:53:41 +00:00
0f171a668b Merge pull request 'fix: provision nixbld users for root nix install' (#42) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 35s
Reviewed-on: #42
2026-02-28 16:52:35 +00:00
7759c47fea fix: provision nixbld users for root nix install
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 18s
2026-02-28 16:49:45 +00:00
8b83bb9d3a Merge pull request 'fix: create /nix when installing nix on root runners' (#41) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 33s
Reviewed-on: #41
2026-02-28 16:48:13 +00:00
9e922dd62c fix: create /nix when installing nix on root runners
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 17s
2026-02-28 16:47:22 +00:00
3539ae9b50 Merge pull request 'stage' (#40) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 32s
Reviewed-on: #40
2026-02-28 16:44:18 +00:00
5669305e59 feat: make kubeadm workflows auto-scale with terraform outputs
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 19s
2026-02-28 16:43:22 +00:00
f341816112 feat: run kubeadm reconcile after terraform apply on master
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 18s
2026-02-28 16:39:04 +00:00
c04ef106a3 fix: install nix tooling in bootstrap workflow when missing
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 17s
2026-02-28 16:36:42 +00:00
c154ff4d15 Merge pull request 'stage' (#39) from stage into master
All checks were successful
Terraform Apply / Terraform Apply (push) Successful in 27s
Reviewed-on: #39
2026-02-28 16:34:24 +00:00
8bcc162956 feat: auto-discover kubeadm node IPs from terraform state
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 17s
2026-02-28 16:31:23 +00:00
b0779c51c0 feat: add gitea workflows for kubeadm bootstrap and reset
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 17s
2026-02-28 16:26:51 +00:00
9fe845b53d feat: add repeatable kubeadm rebuild and reset scripts
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 17s
2026-02-28 16:24:45 +00:00
885a92f494 chore: add lightweight flake checks for kubeadm configs
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 17s
2026-02-28 16:19:37 +00:00
91dd20e60e fix: escape shell expansion in kubeadm helper scripts
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 17s
2026-02-28 16:12:25 +00:00
abac6300ca refactor: generate kubeadm host configs from flake
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 17s
2026-02-28 16:09:05 +00:00
7206d8cd41 feat: implement kubeadm bootstrap scaffolding for Nix nodes
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 18s
2026-02-28 16:04:14 +00:00
a42d44bb27 Merge pull request 'stage' (#38) from stage into master
All checks were successful
Terraform Apply / Terraform Apply (push) Successful in 27s
Reviewed-on: #38
2026-02-28 15:41:58 +00:00
a99516a2a3 chore: format terraform configuration
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 17s
2026-02-28 15:41:14 +00:00
5c69abf9ff fix: disable automatic reboot for proxmox VM updates
Some checks failed
Terraform Plan / Terraform Plan (push) Failing after 10s
2026-02-28 15:40:18 +00:00
5fc8bcc406 Merge pull request 'update: set wk-3 worker cores to 4' (#37) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 2m54s
Reviewed-on: #37
2026-02-28 15:36:30 +00:00
16d5a87586 update: set wk-3 worker cores to 4
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 18s
2026-02-28 15:35:52 +00:00
9a02c05983 Merge pull request 'fix: harden destroy workflow and recover state push' (#36) from stage into master
All checks were successful
Terraform Apply / Terraform Apply (push) Successful in 5m13s
Reviewed-on: #36
2026-02-28 15:20:29 +00:00
1304afd793 fix: harden destroy workflow and recover state push
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 13s
2026-02-28 15:17:42 +00:00
d1dcbe0feb Merge pull request 'fix: harden apply workflow for gitea runner' (#35) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Has been cancelled
Reviewed-on: #35
2026-02-28 15:14:24 +00:00
df4740071a fix: harden apply workflow for gitea runner
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 17s
2026-02-28 15:10:33 +00:00
54c0b684c8 Merge pull request 'fix: remove proxmox snippet dependency for cloud-init' (#34) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 5m14s
Reviewed-on: #34
2026-02-28 14:53:00 +00:00
2577669e12 fix: remove proxmox snippet dependency for cloud-init
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 12s
2026-02-28 14:48:14 +00:00
dd3a37dfd1 Merge pull request 'stage' (#33) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 3m19s
Reviewed-on: #33
2026-02-28 14:44:40 +00:00
35f0a0dccb fix: disable terraform wrapper in plan workflow
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 17s
2026-02-28 14:41:47 +00:00
583d5c3591 fix: use gitea checkout action in plan workflow
Some checks failed
Terraform Plan / Terraform Plan (push) Failing after 26s
2026-02-28 14:39:45 +00:00
77626ed93c fix: restore checkout in plan workflow
Some checks failed
Terraform Plan / Terraform Plan (push) Failing after 27s
2026-02-28 14:38:21 +00:00
a5d5ddb618 fix: remove checkout action from plan workflow
Some checks failed
Terraform Plan / Terraform Plan (push) Failing after 2s
2026-02-28 14:35:48 +00:00
a5f8d72bff fix: disable artifact upload in plan workflow
Some checks failed
Terraform Plan / Terraform Plan (push) Failing after 16s
2026-02-28 14:28:33 +00:00
335254b7b2 fix: remove cross-variable validation from worker lists
Some checks failed
Terraform Plan / Terraform Plan (push) Failing after 17s
Terraform variable validation blocks can only reference the variable itself, so list length checks against worker_count were removed to restore init/plan.
2026-02-28 14:19:00 +00:00
21be01346b feat: refactor infra to cp/wk kubeadm topology
Some checks failed
Terraform Plan / Terraform Plan (push) Failing after 9s
Provision 3 thin control planes and 3 workers with role-specific sizing and VMID ranges (701/711), generate per-node cloud-init snippets with SSH key injection, and add NixOS kubeadm host/module scaffolding for cp-1..3 and wk-1..3.
2026-02-28 14:16:55 +00:00
31 changed files with 2523 additions and 140 deletions


@@ -0,0 +1,181 @@
name: Kubeadm Bootstrap
run-name: ${{ gitea.actor }} requested kubeadm bootstrap
on:
workflow_dispatch:
inputs:
confirm:
description: "Type BOOTSTRAP to run rebuild + kubeadm bootstrap"
required: true
type: string
concurrency:
group: kubeadm-bootstrap
cancel-in-progress: false
jobs:
bootstrap:
name: "Rebuild and Bootstrap Cluster"
runs-on: ubuntu-latest
steps:
- name: Validate confirmation phrase
run: |
if [ "${{ inputs.confirm }}" != "BOOTSTRAP" ]; then
echo "Confirmation failed. You must type BOOTSTRAP."
exit 1
fi
- name: Checkout repository
uses: actions/checkout@v4
- name: Create SSH key
run: |
install -m 0700 -d ~/.ssh
KEY_SOURCE=""
KEY_CONTENT=""
KEY_B64="$(printf '%s' "${{ secrets.SSH_KEY_PRIVATE_BASE64 }}")"
if [ -n "$KEY_B64" ]; then
KEY_SOURCE="SSH_KEY_PRIVATE_BASE64"
KEY_CONTENT="$(printf '%s' "$KEY_B64" | base64 -d)"
else
KEY_CONTENT="$(printf '%s' "${{ secrets.SSH_KEY_PRIVATE }}")"
if [ -n "$KEY_CONTENT" ]; then
KEY_SOURCE="SSH_KEY_PRIVATE"
else
KEY_CONTENT="$(printf '%s' "${{ secrets.KUBEADM_SSH_PRIVATE_KEY }}")"
KEY_SOURCE="KUBEADM_SSH_PRIVATE_KEY"
fi
fi
if [ -z "$KEY_CONTENT" ]; then
echo "Missing SSH private key secret. Set SSH_KEY_PRIVATE_BASE64, SSH_KEY_PRIVATE, or KUBEADM_SSH_PRIVATE_KEY."
exit 1
fi
KEY_CONTENT="$(printf '%s' "$KEY_CONTENT" | tr -d '\r')"
if printf '%s' "$KEY_CONTENT" | grep -q '\\n'; then
printf '%b' "$KEY_CONTENT" > ~/.ssh/id_ed25519
else
printf '%s\n' "$KEY_CONTENT" > ~/.ssh/id_ed25519
fi
chmod 0600 ~/.ssh/id_ed25519
if ! ssh-keygen -y -f ~/.ssh/id_ed25519 >/dev/null 2>&1; then
echo "Invalid private key content from $KEY_SOURCE"
exit 1
fi
- name: Set up Terraform
uses: hashicorp/setup-terraform@v2
with:
terraform_version: 1.6.6
terraform_wrapper: false
- name: Build Terraform backend files
working-directory: terraform
run: |
cat > secrets.auto.tfvars << EOF
pm_api_token_secret = "${{ secrets.PM_API_TOKEN_SECRET }}"
SSH_KEY_PUBLIC = "$(printf '%s' "${{ secrets.SSH_KEY_PUBLIC }}" | tr -d '\r\n')"
EOF
cat > backend.hcl << EOF
bucket = "${{ secrets.B2_TF_BUCKET }}"
key = "terraform.tfstate"
region = "us-east-005"
endpoints = {
s3 = "${{ secrets.B2_TF_ENDPOINT }}"
}
access_key = "$(printf '%s' "${{ secrets.B2_KEY_ID }}" | tr -d '\r\n')"
secret_key = "$(printf '%s' "${{ secrets.B2_APPLICATION_KEY }}" | tr -d '\r\n')"
skip_credentials_validation = true
skip_metadata_api_check = true
skip_region_validation = true
skip_requesting_account_id = true
use_path_style = true
EOF
- name: Terraform init for state read
working-directory: terraform
run: terraform init -reconfigure -backend-config=backend.hcl
- name: Create kubeadm inventory
env:
KUBEADM_SSH_USER: ${{ secrets.KUBEADM_SSH_USER }}
run: |
set -euo pipefail
terraform -chdir=terraform output -json | ./nixos/kubeadm/scripts/render-inventory-from-tf-output.py > nixos/kubeadm/scripts/inventory.env
- name: Validate nix installation
run: |
if [ -x /nix/var/nix/profiles/default/bin/nix ]; then
/nix/var/nix/profiles/default/bin/nix --version
exit 0
fi
if command -v nix >/dev/null 2>&1; then
nix --version
exit 0
fi
echo "Nix missing; installing no-daemon Nix for this runner job"
if [ "$(id -u)" -eq 0 ]; then
mkdir -p /nix
chown root:root /nix
chmod 0755 /nix
if ! getent group nixbld >/dev/null 2>&1; then
groupadd --system nixbld
fi
for i in $(seq 1 10); do
if ! id "nixbld$i" >/dev/null 2>&1; then
useradd --system --create-home --home-dir /var/empty --shell /usr/sbin/nologin "nixbld$i"
fi
usermod -a -G nixbld "nixbld$i"
done
fi
sh <(curl -L https://nixos.org/nix/install) --no-daemon
if [ -f "$HOME/.nix-profile/etc/profile.d/nix.sh" ]; then
. "$HOME/.nix-profile/etc/profile.d/nix.sh"
elif [ -f "/root/.nix-profile/etc/profile.d/nix.sh" ]; then
. /root/.nix-profile/etc/profile.d/nix.sh
fi
export PATH="$HOME/.nix-profile/bin:/root/.nix-profile/bin:/nix/var/nix/profiles/default/bin:$PATH"
nix --version
- name: Install nixos-rebuild tool
env:
NIX_CONFIG: experimental-features = nix-command flakes
run: |
if [ -f "$HOME/.nix-profile/etc/profile.d/nix.sh" ]; then
. "$HOME/.nix-profile/etc/profile.d/nix.sh"
elif [ -f "/root/.nix-profile/etc/profile.d/nix.sh" ]; then
. /root/.nix-profile/etc/profile.d/nix.sh
fi
export PATH="$HOME/.nix-profile/bin:/root/.nix-profile/bin:/nix/var/nix/profiles/default/bin:$PATH"
nix profile install nixpkgs#nixos-rebuild
- name: Run cluster rebuild and bootstrap
env:
NIX_CONFIG: experimental-features = nix-command flakes
FAST_MODE: "1"
WORKER_PARALLELISM: "3"
REBUILD_TIMEOUT: "45m"
REBUILD_RETRIES: "2"
run: |
if [ -f "$HOME/.nix-profile/etc/profile.d/nix.sh" ]; then
. "$HOME/.nix-profile/etc/profile.d/nix.sh"
elif [ -f "/root/.nix-profile/etc/profile.d/nix.sh" ]; then
. /root/.nix-profile/etc/profile.d/nix.sh
fi
export PATH="$HOME/.nix-profile/bin:/root/.nix-profile/bin:/nix/var/nix/profiles/default/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:$PATH"
./nixos/kubeadm/scripts/rebuild-and-bootstrap.sh


@@ -0,0 +1,112 @@
name: Kubeadm Reset
run-name: ${{ gitea.actor }} requested kubeadm reset
on:
workflow_dispatch:
inputs:
confirm:
description: "Type RESET to run kubeadm reset on all nodes"
required: true
type: string
concurrency:
group: kubeadm-bootstrap
cancel-in-progress: false
jobs:
reset:
name: "Reset Cluster Nodes"
runs-on: ubuntu-latest
steps:
- name: Validate confirmation phrase
run: |
if [ "${{ inputs.confirm }}" != "RESET" ]; then
echo "Confirmation failed. You must type RESET."
exit 1
fi
- name: Checkout repository
uses: actions/checkout@v4
- name: Create SSH key
run: |
install -m 0700 -d ~/.ssh
KEY_SOURCE=""
KEY_CONTENT=""
KEY_B64="$(printf '%s' "${{ secrets.SSH_KEY_PRIVATE_BASE64 }}")"
if [ -n "$KEY_B64" ]; then
KEY_SOURCE="SSH_KEY_PRIVATE_BASE64"
KEY_CONTENT="$(printf '%s' "$KEY_B64" | base64 -d)"
else
KEY_CONTENT="$(printf '%s' "${{ secrets.SSH_KEY_PRIVATE }}")"
if [ -n "$KEY_CONTENT" ]; then
KEY_SOURCE="SSH_KEY_PRIVATE"
else
KEY_CONTENT="$(printf '%s' "${{ secrets.KUBEADM_SSH_PRIVATE_KEY }}")"
KEY_SOURCE="KUBEADM_SSH_PRIVATE_KEY"
fi
fi
if [ -z "$KEY_CONTENT" ]; then
echo "Missing SSH private key secret. Set SSH_KEY_PRIVATE_BASE64, SSH_KEY_PRIVATE, or KUBEADM_SSH_PRIVATE_KEY."
exit 1
fi
KEY_CONTENT="$(printf '%s' "$KEY_CONTENT" | tr -d '\r')"
if printf '%s' "$KEY_CONTENT" | grep -q '\\n'; then
printf '%b' "$KEY_CONTENT" > ~/.ssh/id_ed25519
else
printf '%s\n' "$KEY_CONTENT" > ~/.ssh/id_ed25519
fi
chmod 0600 ~/.ssh/id_ed25519
if ! ssh-keygen -y -f ~/.ssh/id_ed25519 >/dev/null 2>&1; then
echo "Invalid private key content from $KEY_SOURCE"
exit 1
fi
- name: Set up Terraform
uses: hashicorp/setup-terraform@v2
with:
terraform_version: 1.6.6
terraform_wrapper: false
- name: Build Terraform backend files
working-directory: terraform
run: |
cat > secrets.auto.tfvars << EOF
pm_api_token_secret = "${{ secrets.PM_API_TOKEN_SECRET }}"
SSH_KEY_PUBLIC = "$(printf '%s' "${{ secrets.SSH_KEY_PUBLIC }}" | tr -d '\r\n')"
EOF
cat > backend.hcl << EOF
bucket = "${{ secrets.B2_TF_BUCKET }}"
key = "terraform.tfstate"
region = "us-east-005"
endpoints = {
s3 = "${{ secrets.B2_TF_ENDPOINT }}"
}
access_key = "$(printf '%s' "${{ secrets.B2_KEY_ID }}" | tr -d '\r\n')"
secret_key = "$(printf '%s' "${{ secrets.B2_APPLICATION_KEY }}" | tr -d '\r\n')"
skip_credentials_validation = true
skip_metadata_api_check = true
skip_region_validation = true
skip_requesting_account_id = true
use_path_style = true
EOF
- name: Terraform init for state read
working-directory: terraform
run: terraform init -reconfigure -backend-config=backend.hcl
- name: Create kubeadm inventory
env:
KUBEADM_SSH_USER: ${{ secrets.KUBEADM_SSH_USER }}
run: |
set -euo pipefail
terraform -chdir=terraform output -json | ./nixos/kubeadm/scripts/render-inventory-from-tf-output.py > nixos/kubeadm/scripts/inventory.env
- name: Run cluster reset
run: |
./nixos/kubeadm/scripts/reset-cluster-nodes.sh


@@ -45,6 +45,7 @@ jobs:
uses: hashicorp/setup-terraform@v2
with:
terraform_version: 1.6.6
terraform_wrapper: false
- name: Terraform Init
working-directory: terraform
@@ -52,7 +53,20 @@ jobs:
- name: Terraform Plan
working-directory: terraform
run: terraform plan -out=tfplan
run: |
set -euo pipefail
for attempt in 1 2; do
echo "Terraform plan attempt $attempt/2"
if timeout 20m terraform plan -refresh=false -parallelism=1 -out=tfplan; then
exit 0
fi
if [ "$attempt" -eq 1 ]; then
echo "Plan attempt failed or timed out; retrying in 20s"
sleep 20
fi
done
echo "Terraform plan failed after retries"
exit 1
- name: Block accidental destroy
env:
@@ -69,4 +83,127 @@ jobs:
- name: Terraform Apply
working-directory: terraform
run: terraform apply -auto-approve tfplan
run: terraform apply -parallelism=1 -auto-approve tfplan
- name: Create SSH key
run: |
install -m 0700 -d ~/.ssh
KEY_SOURCE=""
KEY_CONTENT=""
KEY_B64="$(printf '%s' "${{ secrets.SSH_KEY_PRIVATE_BASE64 }}")"
if [ -n "$KEY_B64" ]; then
KEY_SOURCE="SSH_KEY_PRIVATE_BASE64"
KEY_CONTENT="$(printf '%s' "$KEY_B64" | base64 -d)"
else
KEY_CONTENT="$(printf '%s' "${{ secrets.SSH_KEY_PRIVATE }}")"
if [ -n "$KEY_CONTENT" ]; then
KEY_SOURCE="SSH_KEY_PRIVATE"
else
KEY_CONTENT="$(printf '%s' "${{ secrets.KUBEADM_SSH_PRIVATE_KEY }}")"
KEY_SOURCE="KUBEADM_SSH_PRIVATE_KEY"
fi
fi
if [ -z "$KEY_CONTENT" ]; then
echo "Missing SSH private key secret. Set SSH_KEY_PRIVATE_BASE64, SSH_KEY_PRIVATE, or KUBEADM_SSH_PRIVATE_KEY."
exit 1
fi
KEY_CONTENT="$(printf '%s' "$KEY_CONTENT" | tr -d '\r')"
if printf '%s' "$KEY_CONTENT" | grep -q '\\n'; then
printf '%b' "$KEY_CONTENT" > ~/.ssh/id_ed25519
else
printf '%s\n' "$KEY_CONTENT" > ~/.ssh/id_ed25519
fi
chmod 0600 ~/.ssh/id_ed25519
if ! ssh-keygen -y -f ~/.ssh/id_ed25519 >/dev/null 2>&1; then
echo "Invalid private key content from $KEY_SOURCE"
exit 1
fi
- name: Verify SSH keypair match
run: |
if ! ssh-keygen -y -f ~/.ssh/id_ed25519 >/tmp/key.pub 2>/tmp/key.err; then
echo "Invalid private key content in SSH_KEY_PRIVATE/KUBEADM_SSH_PRIVATE_KEY"
cat /tmp/key.err
exit 1
fi
printf '%s\n' "${{ secrets.SSH_KEY_PUBLIC }}" | tr -d '\r' > /tmp/secret.pub
if ! ssh-keygen -lf /tmp/secret.pub >/tmp/secret.fp 2>/tmp/secret.err; then
echo "Invalid SSH_KEY_PUBLIC format"
cat /tmp/secret.err
exit 1
fi
PRIV_FP="$(ssh-keygen -lf /tmp/key.pub | awk '{print $2}')"
PUB_FP="$(awk '{print $2}' /tmp/secret.fp)"
echo "private fingerprint: $PRIV_FP"
echo "public fingerprint: $PUB_FP"
if [ "$PRIV_FP" != "$PUB_FP" ]; then
echo "SSH_KEY_PRIVATE does not match SSH_KEY_PUBLIC. Update secrets with the same keypair."
exit 1
fi
- name: Create kubeadm inventory from Terraform outputs
env:
KUBEADM_SSH_USER: ${{ secrets.KUBEADM_SSH_USER }}
run: |
set -euo pipefail
terraform -chdir=terraform output -json | ./nixos/kubeadm/scripts/render-inventory-from-tf-output.py > nixos/kubeadm/scripts/inventory.env
- name: Ensure nix and nixos-rebuild
env:
NIX_CONFIG: experimental-features = nix-command flakes
run: |
if [ ! -x /nix/var/nix/profiles/default/bin/nix ] && ! command -v nix >/dev/null 2>&1; then
if [ "$(id -u)" -eq 0 ]; then
mkdir -p /nix
chown root:root /nix
chmod 0755 /nix
if ! getent group nixbld >/dev/null 2>&1; then
groupadd --system nixbld
fi
for i in $(seq 1 10); do
if ! id "nixbld$i" >/dev/null 2>&1; then
useradd --system --create-home --home-dir /var/empty --shell /usr/sbin/nologin "nixbld$i"
fi
usermod -a -G nixbld "nixbld$i"
done
fi
sh <(curl -L https://nixos.org/nix/install) --no-daemon
fi
if [ -f "$HOME/.nix-profile/etc/profile.d/nix.sh" ]; then
. "$HOME/.nix-profile/etc/profile.d/nix.sh"
elif [ -f "/root/.nix-profile/etc/profile.d/nix.sh" ]; then
. /root/.nix-profile/etc/profile.d/nix.sh
fi
export PATH="$HOME/.nix-profile/bin:/root/.nix-profile/bin:/nix/var/nix/profiles/default/bin:$PATH"
nix --version
nix profile install nixpkgs#nixos-rebuild
- name: Rebuild and bootstrap/reconcile kubeadm cluster
env:
NIX_CONFIG: experimental-features = nix-command flakes
FAST_MODE: "1"
WORKER_PARALLELISM: "3"
REBUILD_TIMEOUT: "45m"
REBUILD_RETRIES: "2"
run: |
if [ -f "$HOME/.nix-profile/etc/profile.d/nix.sh" ]; then
. "$HOME/.nix-profile/etc/profile.d/nix.sh"
elif [ -f "/root/.nix-profile/etc/profile.d/nix.sh" ]; then
. /root/.nix-profile/etc/profile.d/nix.sh
fi
export PATH="$HOME/.nix-profile/bin:/root/.nix-profile/bin:/nix/var/nix/profiles/default/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:$PATH"
./nixos/kubeadm/scripts/rebuild-and-bootstrap.sh


@@ -15,8 +15,8 @@ on:
type: choice
options:
- all
- alpacas
- llamas
- control-planes
- workers
concurrency:
group: terraform-global
@@ -65,6 +65,7 @@ jobs:
uses: hashicorp/setup-terraform@v2
with:
terraform_version: 1.6.6
terraform_wrapper: false
- name: Terraform Init
working-directory: terraform
@@ -73,15 +74,16 @@ jobs:
- name: Terraform Destroy Plan
working-directory: terraform
run: |
set -euo pipefail
case "${{ inputs.target }}" in
all)
terraform plan -destroy -out=tfdestroy
TF_PLAN_CMD="terraform plan -refresh=false -parallelism=1 -destroy -out=tfdestroy"
;;
alpacas)
terraform plan -destroy -target=proxmox_vm_qemu.alpacas -out=tfdestroy
control-planes)
TF_PLAN_CMD="terraform plan -refresh=false -parallelism=1 -destroy -target=proxmox_vm_qemu.control_planes -out=tfdestroy"
;;
llamas)
terraform plan -destroy -target=proxmox_vm_qemu.llamas -out=tfdestroy
workers)
TF_PLAN_CMD="terraform plan -refresh=false -parallelism=1 -destroy -target=proxmox_vm_qemu.workers -out=tfdestroy"
;;
*)
echo "Invalid destroy target: ${{ inputs.target }}"
@@ -89,6 +91,36 @@ jobs:
;;
esac
for attempt in 1 2; do
echo "Terraform destroy plan attempt $attempt/2"
if timeout 20m sh -c "$TF_PLAN_CMD"; then
exit 0
fi
if [ "$attempt" -eq 1 ]; then
echo "Destroy plan attempt failed or timed out; retrying in 20s"
sleep 20
fi
done
echo "Terraform destroy plan failed after retries"
exit 1
- name: Terraform Destroy Apply
working-directory: terraform
run: terraform apply -auto-approve tfdestroy
run: |
set +e
terraform apply -auto-approve tfdestroy 2>&1 | tee destroy-apply.log
APPLY_EXIT=${PIPESTATUS[0]}
if [ "$APPLY_EXIT" -ne 0 ] && [ -f errored.tfstate ] && grep -q "Failed to persist state to backend" destroy-apply.log; then
echo "Detected backend state write failure after destroy; attempting recovery push..."
terraform state push errored.tfstate
PUSH_EXIT=$?
if [ "$PUSH_EXIT" -eq 0 ]; then
echo "Recovered by pushing errored.tfstate to backend."
exit 0
fi
fi
exit "$APPLY_EXIT"


@@ -51,6 +51,7 @@ jobs:
uses: hashicorp/setup-terraform@v2
with:
terraform_version: 1.6.6
terraform_wrapper: false
- name: Terraform Init
working-directory: terraform
@@ -66,7 +67,20 @@ jobs:
- name: Terraform Plan
working-directory: terraform
run: terraform plan -out=tfplan
run: |
set -euo pipefail
for attempt in 1 2; do
echo "Terraform plan attempt $attempt/2"
if timeout 20m terraform plan -refresh=false -parallelism=1 -out=tfplan; then
exit 0
fi
if [ "$attempt" -eq 1 ]; then
echo "Plan attempt failed or timed out; retrying in 20s"
sleep 20
fi
done
echo "Terraform plan failed after retries"
exit 1
- name: Block accidental destroy
env:
@@ -81,8 +95,7 @@ jobs:
exit 1
fi
- name: Upload Terraform Plan
uses: actions/upload-artifact@v3
with:
name: terraform-plan
path: terraform/tfplan
# NOTE: Disabled artifact upload for now.
# On this Gitea/act runner, post-job hooks from artifact actions can
# fail during "Complete job" even when all Terraform steps succeeded.
# Re-enable once runner/action compatibility is confirmed.

nixos/kubeadm/README.md Normal file

@@ -0,0 +1,169 @@
# Kubeadm Cluster Layout (NixOS)
This folder defines role-based NixOS configs for a kubeadm cluster.
## Topology
- Control planes: `cp-1`, `cp-2`, `cp-3`
- Workers: `wk-1`, `wk-2`, `wk-3`
## What this provides
- Shared Kubernetes/node prerequisites in `modules/k8s-common.nix`
- Shared cluster defaults in `modules/k8s-cluster-settings.nix`
- Role-specific settings for control planes and workers
- Generated per-node host configs from `flake.nix` (no duplicated host files)
- Bootstrap helper commands on each node:
- `th-kubeadm-init`
- `th-kubeadm-join-control-plane`
- `th-kubeadm-join-worker`
- `th-kubeadm-status`
- A Python bootstrap controller for orchestration:
- `bootstrap/controller.py`
## Layered architecture
- `terraform/`: VM lifecycle only
- `nixos/kubeadm/modules/`: declarative node OS config only
- `nixos/kubeadm/bootstrap/controller.py`: imperative cluster reconciliation state machine
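The controller runs the whole pipeline or any single stage by name (the stage names mirror its `main()` dispatch table); assuming the default inventory path under `scripts/`, invocations look like:
```bash
# Full bring-up: preflight -> rebuild -> init -> CNI -> joins -> verify
python3 bootstrap/controller.py reconcile
# Re-run a single stage against an explicit inventory
python3 bootstrap/controller.py install-cni --inventory scripts/inventory.env
```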
## Hardware config files
The flake automatically imports `hosts/hardware/<host>.nix` if present.
Copy each node's generated hardware config into this folder:
```bash
sudo nixos-generate-config
sudo cp /etc/nixos/hardware-configuration.nix ./hosts/hardware/cp-1.nix
```
Repeat for each node (`cp-2`, `cp-3`, `wk-1`, `wk-2`, `wk-3`).
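To avoid six manual round trips, a loop in this spirit works (a sketch: it assumes the node names resolve from your workstation and passwordless sudo on the nodes; otherwise substitute the inventory IPs):
```bash
for n in cp-1 cp-2 cp-3 wk-1 wk-2 wk-3; do
  ssh "micqdf@$n" sudo cat /etc/nixos/hardware-configuration.nix > "./hosts/hardware/$n.nix"
done
```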
## Deploy approach
Start from one node at a time while experimenting:
```bash
sudo nixos-rebuild switch --flake .#cp-1
```
For remote target-host workflows, use your preferred deploy wrapper later
(`nixos-rebuild --target-host ...` or deploy-rs/colmena).
## Bootstrap runbook (kubeadm + kube-vip + Flannel)
1. Apply Nix config on all nodes (`cp-*`, then `wk-*`).
2. On `cp-1`, run:
```bash
sudo th-kubeadm-init
```
This infers the control-plane VIP as `<node-subnet>.250` on `eth0`, creates the
kube-vip static pod manifest, and runs `kubeadm init`.
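The VIP inference amounts to the following (a condensed sketch of the logic in the `th-kubeadm-init` helper; the real script falls back to the default route's device when `eth0` is absent):
```bash
iface=eth0
cidr=$(ip -4 -o addr show dev "$iface" | awk 'NR==1 {print $4}')       # e.g. 10.27.27.11/24
prefix=$(echo "$cidr" | cut -d/ -f1 | awk -F. '{print $1"."$2"."$3}')  # e.g. 10.27.27
echo "control-plane endpoint: $prefix.250:6443"
```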
3. Install Flannel from `cp-1`:
```bash
kubectl apply -f https://raw.githubusercontent.com/flannel-io/flannel/v0.25.5/Documentation/kube-flannel.yml
```
4. Generate join commands on `cp-1`:
```bash
sudo kubeadm token create --print-join-command
sudo kubeadm init phase upload-certs --upload-certs
```
5. Join `cp-2` and `cp-3`:
```bash
sudo th-kubeadm-join-control-plane '<kubeadm join ... --control-plane --certificate-key ...>'
```
6. Join workers:
```bash
sudo th-kubeadm-join-worker '<kubeadm join ...>'
```
7. Validate from a control plane:
```bash
kubectl get nodes -o wide
kubectl -n kube-system get pods -o wide
```
## Fresh bootstrap flow (recommended)
1. Copy and edit inventory:
```bash
cp ./scripts/inventory.example.env ./scripts/inventory.env
$EDITOR ./scripts/inventory.env
```
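The file is plain shell-style assignments; a hypothetical three-plus-three layout (example IPs) looks like:
```bash
SSH_USER=micqdf
PRIMARY_CONTROL_PLANE=cp-1
# example addresses; use your actual node IPs
CONTROL_PLANES="cp-1=10.27.27.11 cp-2=10.27.27.12 cp-3=10.27.27.13"
WORKERS="wk-1=10.27.27.21 wk-2=10.27.27.22 wk-3=10.27.27.23"
```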
2. Rebuild all nodes and bootstrap a fresh cluster:
```bash
./scripts/rebuild-and-bootstrap.sh
```
Optional tuning env vars:
```bash
FAST_MODE=1 WORKER_PARALLELISM=3 REBUILD_TIMEOUT=45m REBUILD_RETRIES=2 ./scripts/rebuild-and-bootstrap.sh
```
- `FAST_MODE=1` skips pre-rebuild remote GC cleanup to reduce wall-clock time.
- Set `FAST_MODE=0` for a slower but more aggressive space cleanup pass.
### Bootstrap controller state
The controller stores checkpoints in two places:
- Remote (source of truth): `/var/lib/terrahome/bootstrap-state.json` on `cp-1`
- Local copy (workflow/debug artifact): `nixos/kubeadm/bootstrap/bootstrap-state-last.json`
This makes retries resumable and keeps failure context visible from CI.
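To peek at the checkpoint while debugging a stuck run, something like this works (assumes `jq` on your workstation and the default SSH user; `<cp-1-ip>` is a placeholder):
```bash
ssh micqdf@<cp-1-ip> sudo cat /var/lib/terrahome/bootstrap-state.json | jq .
jq . nixos/kubeadm/bootstrap/bootstrap-state-last.json   # local copy from the last workflow run
```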
3. If you only want to reset Kubernetes state on existing VMs:
```bash
./scripts/reset-cluster-nodes.sh
```
For a full nuke/recreate lifecycle:
- run Terraform destroy/apply for VMs first,
- then run `./scripts/rebuild-and-bootstrap.sh` again.
Node lists now come directly from static Terraform outputs, so bootstrap no longer
depends on Proxmox guest-agent IP discovery or SSH subnet scanning.
## Optional Gitea workflow automation
Primary flow:
- Push to `master` triggers `.gitea/workflows/terraform-apply.yml`
- That workflow now does Terraform apply and then runs a fresh kubeadm bootstrap automatically
Manual dispatch workflows are available:
- `.gitea/workflows/kubeadm-bootstrap.yml`
- `.gitea/workflows/kubeadm-reset.yml`
Required repository secrets:
- Existing Terraform/backend secrets used by current workflows (`B2_*`, `PM_API_TOKEN_SECRET`, `SSH_KEY_PUBLIC`)
- SSH private key: the workflows check `SSH_KEY_PRIVATE_BASE64` first, then `SSH_KEY_PRIVATE`, then `KUBEADM_SSH_PRIVATE_KEY`
Optional secrets:
- `KUBEADM_SSH_USER` (defaults to `micqdf`)
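For the `SSH_KEY_PRIVATE_BASE64` variant, one way to produce the secret value (assumes GNU coreutils; `-w0` keeps the output on a single line so the workflow's `base64 -d` round-trips cleanly):
```bash
base64 -w0 < ~/.ssh/id_ed25519
```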
Node IPs are rendered directly from static Terraform outputs (`control_plane_vm_ipv4`, `worker_vm_ipv4`), so you do not need per-node IP secrets or SSH discovery fallbacks.
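To render the inventory by hand, run the same pipeline the workflows use, from the repository root:
```bash
export KUBEADM_SSH_USER=micqdf   # optional; this is also the script's default
terraform -chdir=terraform output -json \
  | ./nixos/kubeadm/scripts/render-inventory-from-tf-output.py > nixos/kubeadm/scripts/inventory.env
```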
## Notes
- Scripts are intentionally manual-triggered (predictable for homelab bring-up).
- If `.250` on the node subnet is already in use, change `controlPlaneVipSuffix`
in `modules/k8s-cluster-settings.nix` before bootstrap.
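A crude availability probe before changing anything (example subnet; note that a host dropping ICMP will look free even when it is not):
```bash
ping -c 1 -W 1 10.27.27.250 >/dev/null 2>&1 && echo "VIP already answers" || echo "no reply; likely free"
```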


@@ -0,0 +1,447 @@
#!/usr/bin/env python3
import argparse
import base64
import json
import os
import shlex
import subprocess
import sys
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
def run_local(cmd, check=True, capture=False):
if isinstance(cmd, str):
shell = True
else:
shell = False
return subprocess.run(
cmd,
shell=shell,
check=check,
text=True,
capture_output=capture,
)
def load_inventory(inventory_file):
inventory_file = Path(inventory_file).resolve()
if not inventory_file.exists():
raise RuntimeError(f"Missing inventory file: {inventory_file}")
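# Source the inventory with `set -a` so plain KEY=value lines become exported
# environment variables, then dump that environment as JSON for Python to parse.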
cmd = (
"set -a; "
f"source {shlex.quote(str(inventory_file))}; "
"python3 - <<'PY'\n"
"import json, os\n"
"print(json.dumps(dict(os.environ)))\n"
"PY"
)
proc = run_local(["bash", "-lc", cmd], capture=True)
env = json.loads(proc.stdout)
node_ips = {}
cp_names = []
wk_names = []
control_planes = env.get("CONTROL_PLANES", "").strip()
workers = env.get("WORKERS", "").strip()
if control_planes:
for pair in control_planes.split():
name, ip = pair.split("=", 1)
node_ips[name] = ip
cp_names.append(name)
else:
for key in sorted(k for k in env if k.startswith("CP_") and k[3:].isdigit()):
idx = key.split("_", 1)[1]
name = f"cp-{idx}"
node_ips[name] = env[key]
cp_names.append(name)
if workers:
for pair in workers.split():
name, ip = pair.split("=", 1)
node_ips[name] = ip
wk_names.append(name)
else:
for key in sorted(k for k in env if k.startswith("WK_") and k[3:].isdigit()):
idx = key.split("_", 1)[1]
name = f"wk-{idx}"
node_ips[name] = env[key]
wk_names.append(name)
if not cp_names or not wk_names:
raise RuntimeError("Inventory must include control planes and workers")
primary_cp = env.get("PRIMARY_CONTROL_PLANE", "cp-1")
if primary_cp not in node_ips:
primary_cp = cp_names[0]
return {
"env": env,
"node_ips": node_ips,
"cp_names": cp_names,
"wk_names": wk_names,
"primary_cp": primary_cp,
"inventory_file": str(inventory_file),
}
class Controller:
def __init__(self, cfg):
self.env = cfg["env"]
self.node_ips = cfg["node_ips"]
self.cp_names = cfg["cp_names"]
self.wk_names = cfg["wk_names"]
self.primary_cp = cfg["primary_cp"]
self.primary_ip = self.node_ips[self.primary_cp]
self.script_dir = Path(__file__).resolve().parent
self.flake_dir = Path(self.env.get("FLAKE_DIR") or (self.script_dir.parent)).resolve()
self.ssh_user = self.env.get("SSH_USER", "micqdf")
self.ssh_candidates = self.env.get("SSH_USER_CANDIDATES", f"root {self.ssh_user}").split()
self.active_ssh_user = self.ssh_user
self.ssh_key = self.env.get("SSH_KEY_PATH", str(Path.home() / ".ssh" / "id_ed25519"))
self.ssh_opts = [
"-o",
"BatchMode=yes",
"-o",
"IdentitiesOnly=yes",
"-o",
"StrictHostKeyChecking=no",
"-o",
"UserKnownHostsFile=/dev/null",
"-i",
self.ssh_key,
]
self.rebuild_timeout = self.env.get("REBUILD_TIMEOUT", "45m")
self.rebuild_retries = int(self.env.get("REBUILD_RETRIES", "2"))
self.worker_parallelism = int(self.env.get("WORKER_PARALLELISM", "3"))
self.fast_mode = self.env.get("FAST_MODE", "1")
self.skip_rebuild = self.env.get("SKIP_REBUILD", "0") == "1"
self.force_reinit = True
self.ssh_ready_retries = int(self.env.get("SSH_READY_RETRIES", "20"))
self.ssh_ready_delay = int(self.env.get("SSH_READY_DELAY_SEC", "15"))
def log(self, msg):
print(f"==> {msg}")
def _ssh(self, user, ip, cmd, check=True):
full = ["ssh", *self.ssh_opts, f"{user}@{ip}", f"bash -lc {shlex.quote(cmd)}"]
return run_local(full, check=check, capture=True)
def detect_user(self, ip):
for attempt in range(1, self.ssh_ready_retries + 1):
for user in self.ssh_candidates:
proc = self._ssh(user, ip, "true", check=False)
if proc.returncode == 0:
self.active_ssh_user = user
self.log(f"Using SSH user '{user}' for {ip}")
return
if attempt < self.ssh_ready_retries:
self.log(
f"SSH not ready on {ip} yet; retrying in {self.ssh_ready_delay}s "
f"({attempt}/{self.ssh_ready_retries})"
)
time.sleep(self.ssh_ready_delay)
raise RuntimeError(
"Unable to authenticate to "
f"{ip} with users: {', '.join(self.ssh_candidates)}. "
"If this is a freshly cloned VM, the Proxmox source template likely does not yet include the "
"current cloud-init-capable NixOS template configuration from nixos/template-base. "
"Terraform can only clone what exists in Proxmox; it cannot retrofit cloud-init support into an old template."
)
def remote(self, ip, cmd, check=True):
ordered = [self.active_ssh_user] + [u for u in self.ssh_candidates if u != self.active_ssh_user]
last = None
for user in ordered:
proc = self._ssh(user, ip, cmd, check=False)
if proc.returncode == 0:
self.active_ssh_user = user
return proc
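# ssh itself exits 255 on connection/auth failures, so try the next candidate user;
# any other nonzero status means the remote command ran and failed, so stop here.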
if proc.returncode != 255:
last = proc
break
last = proc
if check:
stdout = (last.stdout or "").strip()
stderr = (last.stderr or "").strip()
raise RuntimeError(f"Remote command failed on {ip}: {cmd}\n{stdout}\n{stderr}")
return last
def prepare_known_hosts(self):
pass
def prepare_remote_nix(self, ip):
self.remote(ip, "sudo mkdir -p /etc/nix")
self.remote(ip, "if [ -f /etc/nix/nix.conf ]; then sudo sed -i '/^trusted-users[[:space:]]*=/d' /etc/nix/nix.conf; fi")
self.remote(ip, "echo 'trusted-users = root micqdf' | sudo tee -a /etc/nix/nix.conf >/dev/null")
self.remote(ip, "sudo systemctl restart nix-daemon 2>/dev/null || true")
def prepare_remote_kubelet(self, ip):
self.remote(ip, "sudo systemctl stop kubelet >/dev/null 2>&1 || true")
self.remote(ip, "sudo systemctl disable kubelet >/dev/null 2>&1 || true")
self.remote(ip, "sudo systemctl mask kubelet >/dev/null 2>&1 || true")
self.remote(ip, "sudo systemctl reset-failed kubelet >/dev/null 2>&1 || true")
self.remote(ip, "sudo rm -f /var/lib/kubelet/config.yaml /var/lib/kubelet/kubeadm-flags.env || true")
def prepare_remote_space(self, ip):
self.remote(ip, "sudo nix-collect-garbage -d || true")
self.remote(ip, "sudo nix --extra-experimental-features nix-command store gc || true")
self.remote(ip, "sudo rm -rf /tmp/nix* /tmp/nixos-rebuild* || true")
def rebuild_node_once(self, name, ip):
self.detect_user(ip)
cmd = [
"timeout",
self.rebuild_timeout,
"nixos-rebuild",
"switch",
"--flake",
f"{self.flake_dir}#{name}",
"--target-host",
f"{self.active_ssh_user}@{ip}",
"--use-remote-sudo",
]
env = os.environ.copy()
env["NIX_SSHOPTS"] = " ".join(self.ssh_opts)
proc = subprocess.run(cmd, text=True, env=env)
return proc.returncode == 0
def rebuild_with_retry(self, name, ip):
max_attempts = self.rebuild_retries + 1
for attempt in range(1, max_attempts + 1):
self.log(f"Rebuild attempt {attempt}/{max_attempts} for {name}")
if self.rebuild_node_once(name, ip):
return
if attempt < max_attempts:
self.log(f"Rebuild failed for {name}, retrying in 20s")
time.sleep(20)
raise RuntimeError(f"Rebuild failed permanently for {name}")
def stage_preflight(self):
self.prepare_known_hosts()
self.detect_user(self.primary_ip)
def stage_rebuild(self):
if self.skip_rebuild:
self.log("Skipping node rebuild (SKIP_REBUILD=1)")
return
self.detect_user(self.primary_ip)
for name in self.cp_names:
ip = self.node_ips[name]
self.log(f"Preparing and rebuilding {name} ({ip})")
self.prepare_remote_nix(ip)
self.prepare_remote_kubelet(ip)
if self.fast_mode != "1":
self.prepare_remote_space(ip)
self.rebuild_with_retry(name, ip)
for name in self.wk_names:
ip = self.node_ips[name]
self.log(f"Preparing {name} ({ip})")
self.prepare_remote_nix(ip)
self.prepare_remote_kubelet(ip)
if self.fast_mode != "1":
self.prepare_remote_space(ip)
failures = []
with ThreadPoolExecutor(max_workers=self.worker_parallelism) as pool:
futures = {pool.submit(self.rebuild_with_retry, name, self.node_ips[name]): name for name in self.wk_names}
for fut in as_completed(futures):
name = futures[fut]
try:
fut.result()
except Exception as exc:
failures.append((name, str(exc)))
if failures:
raise RuntimeError(f"Worker rebuild failures: {failures}")
def has_admin_conf(self):
return self.remote(self.primary_ip, "sudo test -f /etc/kubernetes/admin.conf", check=False).returncode == 0
def cluster_ready(self):
cmd = "sudo test -f /etc/kubernetes/admin.conf && sudo kubectl --kubeconfig /etc/kubernetes/admin.conf get --raw=/readyz >/dev/null 2>&1"
return self.remote(self.primary_ip, cmd, check=False).returncode == 0
def stage_init_primary(self):
self.log(f"Initializing primary control plane on {self.primary_cp}")
self.remote(self.primary_ip, "sudo th-kubeadm-init")
def stage_install_cni(self):
self.log("Installing Flannel")
manifest_path = self.script_dir.parent / "manifests" / "kube-flannel.yml"
manifest_b64 = base64.b64encode(manifest_path.read_bytes()).decode()
self.remote(
self.primary_ip,
(
"sudo mkdir -p /var/lib/terrahome && "
f"echo {shlex.quote(manifest_b64)} | base64 -d | sudo tee /var/lib/terrahome/kube-flannel.yml >/dev/null"
),
)
self.log("Waiting for API readiness before applying Flannel")
ready = False
for _ in range(30):
if self.cluster_ready():
ready = True
break
time.sleep(10)
if not ready:
raise RuntimeError("API server did not become ready before Flannel install")
last_error = None
for attempt in range(1, 6):
proc = self.remote(
self.primary_ip,
"sudo kubectl --kubeconfig /etc/kubernetes/admin.conf apply -f /var/lib/terrahome/kube-flannel.yml",
check=False,
)
if proc.returncode == 0:
return
last_error = (proc.stdout or "") + ("\n" if proc.stdout and proc.stderr else "") + (proc.stderr or "")
self.log(f"Flannel apply attempt {attempt}/5 failed; retrying in 15s")
time.sleep(15)
raise RuntimeError(f"Flannel apply failed after retries\n{last_error or ''}")
def cluster_has_node(self, name):
cmd = f"sudo kubectl --kubeconfig /etc/kubernetes/admin.conf get node {shlex.quote(name)} >/dev/null 2>&1"
return self.remote(self.primary_ip, cmd, check=False).returncode == 0
def build_join_cmds(self):
join_cmd = self.remote(
self.primary_ip,
"sudo KUBECONFIG=/etc/kubernetes/admin.conf kubeadm token create --print-join-command",
).stdout.strip()
cert_key = self.remote(
self.primary_ip,
"sudo KUBECONFIG=/etc/kubernetes/admin.conf kubeadm init phase upload-certs --upload-certs | tail -n 1",
).stdout.strip()
cp_join = f"{join_cmd} --control-plane --certificate-key {cert_key}"
return join_cmd, cp_join
def stage_join_control_planes(self):
_, cp_join = self.build_join_cmds()
for node in self.cp_names:
if node == self.primary_cp:
continue
if self.cluster_has_node(node):
self.log(f"{node} already joined")
continue
self.log(f"Joining control plane {node}")
ip = self.node_ips[node]
node_join = f"{cp_join} --node-name {node} --ignore-preflight-errors=NumCPU,HTTPProxyCIDR"
self.remote(ip, f"sudo th-kubeadm-join-control-plane {shlex.quote(node_join)}")
def stage_join_workers(self):
join_cmd, _ = self.build_join_cmds()
for node in self.wk_names:
if self.cluster_has_node(node):
self.log(f"{node} already joined")
continue
self.log(f"Joining worker {node}")
ip = self.node_ips[node]
node_join = f"{join_cmd} --node-name {node} --ignore-preflight-errors=HTTPProxyCIDR"
self.remote(ip, f"sudo th-kubeadm-join-worker {shlex.quote(node_join)}")
def stage_verify(self):
self.log("Final node verification")
try:
self.remote(
self.primary_ip,
"sudo kubectl --kubeconfig /etc/kubernetes/admin.conf -n kube-flannel rollout status ds/kube-flannel-ds --timeout=10m",
)
except Exception:
self.log("Flannel rollout failed; collecting diagnostics")
proc = self.remote(
self.primary_ip,
"sudo kubectl --kubeconfig /etc/kubernetes/admin.conf -n kube-flannel get ds -o wide || true",
check=False,
)
print(proc.stdout)
proc = self.remote(
self.primary_ip,
"sudo kubectl --kubeconfig /etc/kubernetes/admin.conf -n kube-flannel get pods -o wide || true",
check=False,
)
print(proc.stdout)
proc = self.remote(
self.primary_ip,
"for p in $(sudo kubectl --kubeconfig /etc/kubernetes/admin.conf -n kube-flannel get pods -o name 2>/dev/null); do echo \"--- describe $p ---\"; sudo kubectl --kubeconfig /etc/kubernetes/admin.conf -n kube-flannel describe $p || true; done",
check=False,
)
print(proc.stdout)
proc = self.remote(
self.primary_ip,
"for p in $(sudo kubectl --kubeconfig /etc/kubernetes/admin.conf -n kube-flannel get pods -o name 2>/dev/null); do echo \"--- logs $p kube-flannel ---\"; sudo kubectl --kubeconfig /etc/kubernetes/admin.conf -n kube-flannel logs $p -c kube-flannel --tail=120 || true; echo \"--- logs $p install-cni-plugin ---\"; sudo kubectl --kubeconfig /etc/kubernetes/admin.conf -n kube-flannel logs $p -c install-cni-plugin --tail=120 || true; echo \"--- logs $p install-cni ---\"; sudo kubectl --kubeconfig /etc/kubernetes/admin.conf -n kube-flannel logs $p -c install-cni --tail=120 || true; done",
check=False,
)
print(proc.stdout)
proc = self.remote(
self.primary_ip,
"for p in $(sudo kubectl --kubeconfig /etc/kubernetes/admin.conf -n kube-flannel get pods -o name 2>/dev/null); do sudo kubectl --kubeconfig /etc/kubernetes/admin.conf -n kube-flannel logs --tail=120 $p || true; done",
check=False,
)
print(proc.stdout)
raise
self.remote(
self.primary_ip,
"sudo kubectl --kubeconfig /etc/kubernetes/admin.conf wait --for=condition=Ready nodes --all --timeout=10m",
)
proc = self.remote(self.primary_ip, "sudo kubectl --kubeconfig /etc/kubernetes/admin.conf get nodes -o wide")
print(proc.stdout)
def reconcile(self):
self.stage_preflight()
self.stage_rebuild()
self.stage_init_primary()
self.stage_install_cni()
self.stage_join_control_planes()
self.stage_join_workers()
self.stage_verify()
def main():
parser = argparse.ArgumentParser(description="TerraHome kubeadm bootstrap controller")
parser.add_argument("command", choices=[
"reconcile",
"preflight",
"rebuild",
"init-primary",
"install-cni",
"join-control-planes",
"join-workers",
"verify",
])
parser.add_argument("--inventory", default=str(Path(__file__).resolve().parent.parent / "scripts" / "inventory.env"))
args = parser.parse_args()
cfg = load_inventory(args.inventory)
ctl = Controller(cfg)
dispatch = {
"reconcile": ctl.reconcile,
"preflight": ctl.stage_preflight,
"rebuild": ctl.stage_rebuild,
"init-primary": ctl.stage_init_primary,
"install-cni": ctl.stage_install_cni,
"join-control-planes": ctl.stage_join_control_planes,
"join-workers": ctl.stage_join_workers,
"verify": ctl.stage_verify,
}
try:
dispatch[args.command]()
except Exception as exc:
print(f"ERROR: {exc}", file=sys.stderr)
sys.exit(1)
if __name__ == "__main__":
main()

nixos/kubeadm/flake.lock generated Normal file

@@ -0,0 +1,27 @@
{
"nodes": {
"nixpkgs": {
"locked": {
"lastModified": 1767313136,
"narHash": "sha256-16KkgfdYqjaeRGBaYsNrhPRRENs0qzkQVUooNHtoy2w=",
"owner": "NixOS",
"repo": "nixpkgs",
"rev": "ac62194c3917d5f474c1a844b6fd6da2db95077d",
"type": "github"
},
"original": {
"owner": "NixOS",
"ref": "nixos-25.05",
"repo": "nixpkgs",
"type": "github"
}
},
"root": {
"inputs": {
"nixpkgs": "nixpkgs"
}
}
},
"root": "root",
"version": 7
}

nixos/kubeadm/flake.nix Normal file

@@ -0,0 +1,77 @@
{
description = "NixOS kubeadm cluster configs";
inputs = {
nixpkgs.url = "github:NixOS/nixpkgs/nixos-25.05";
};
outputs = { nixpkgs, ... }:
let
system = "x86_64-linux";
lib = nixpkgs.lib;
pkgs = nixpkgs.legacyPackages.${system};
nodeNames = [ "cp-1" "cp-2" "cp-3" "wk-1" "wk-2" "wk-3" ];
mkNode = {
name,
role,
extraModules ? [ ],
}:
let
roleModule = if role == "control-plane" then ./modules/k8s-control-plane.nix else ./modules/k8s-worker.nix;
hardwarePath = ./hosts/hardware + "/${name}.nix";
in
nixpkgs.lib.nixosSystem {
inherit system;
modules = [
./modules/k8s-cluster-settings.nix
./modules/k8s-common.nix
roleModule
({ lib, ... }: {
imports = lib.optional (builtins.pathExists hardwarePath) hardwarePath;
networking.hostName = name;
system.stateVersion = "25.05";
boot.loader.grub.devices = lib.mkDefault [ "/dev/sda" ];
fileSystems."/" = lib.mkDefault {
device = "/dev/disk/by-label/nixos";
fsType = "ext4";
};
})
] ++ extraModules;
};
mkNodeByName = name:
mkNode {
inherit name;
role = if lib.hasPrefix "cp-" name then "control-plane" else "worker";
};
mkEvalCheck = name:
let
cfg = mkNode {
inherit name;
role = if lib.hasPrefix "cp-" name then "control-plane" else "worker";
extraModules = [
({ lib, ... }: {
boot.loader.grub.devices = lib.mkDefault [ "/dev/sda" ];
fileSystems."/" = lib.mkDefault {
device = "/dev/disk/by-label/nixos";
fsType = "ext4";
};
})
];
};
in
pkgs.runCommand "eval-${name}" { } ''
cat > "$out" <<'EOF'
host=${cfg.config.networking.hostName}
role=${if lib.hasPrefix "cp-" name then "control-plane" else "worker"}
stateVersion=${cfg.config.system.stateVersion}
EOF
'';
in {
nixosConfigurations = lib.genAttrs nodeNames mkNodeByName;
checks.${system} = lib.genAttrs nodeNames mkEvalCheck;
};
}


@@ -0,0 +1,212 @@
---
kind: Namespace
apiVersion: v1
metadata:
name: kube-flannel
labels:
k8s-app: flannel
pod-security.kubernetes.io/enforce: privileged
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
labels:
k8s-app: flannel
name: flannel
rules:
- apiGroups:
- ""
resources:
- pods
verbs:
- get
- apiGroups:
- ""
resources:
- nodes
verbs:
- get
- list
- watch
- apiGroups:
- ""
resources:
- nodes/status
verbs:
- patch
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
labels:
k8s-app: flannel
name: flannel
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: flannel
subjects:
- kind: ServiceAccount
name: flannel
namespace: kube-flannel
---
apiVersion: v1
kind: ServiceAccount
metadata:
labels:
k8s-app: flannel
name: flannel
namespace: kube-flannel
---
kind: ConfigMap
apiVersion: v1
metadata:
name: kube-flannel-cfg
namespace: kube-flannel
labels:
tier: node
k8s-app: flannel
app: flannel
data:
cni-conf.json: |
{
"name": "cbr0",
"cniVersion": "0.3.1",
"plugins": [
{
"type": "flannel",
"delegate": {
"hairpinMode": true,
"isDefaultGateway": true
}
},
{
"type": "portmap",
"capabilities": {
"portMappings": true
}
}
]
}
net-conf.json: |
{
"Network": "10.244.0.0/16",
"EnableNFTables": false,
"Backend": {
"Type": "vxlan"
}
}
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: kube-flannel-ds
namespace: kube-flannel
labels:
tier: node
app: flannel
k8s-app: flannel
spec:
selector:
matchLabels:
app: flannel
template:
metadata:
labels:
tier: node
app: flannel
spec:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: kubernetes.io/os
operator: In
values:
- linux
hostNetwork: true
priorityClassName: system-node-critical
tolerations:
- operator: Exists
effect: NoSchedule
serviceAccountName: flannel
initContainers:
- name: install-cni-plugin
image: docker.io/flannel/flannel-cni-plugin:v1.5.1-flannel1
command:
- cp
args:
- -f
- /flannel
- /opt/cni/bin/flannel
volumeMounts:
- name: cni-plugin
mountPath: /opt/cni/bin
- name: install-cni
image: docker.io/flannel/flannel:v0.25.5
command:
- cp
args:
- -f
- /etc/kube-flannel/cni-conf.json
- /etc/cni/net.d/10-flannel.conflist
volumeMounts:
- name: cni
mountPath: /etc/cni/net.d
- name: flannel-cfg
mountPath: /etc/kube-flannel/
containers:
- name: kube-flannel
image: docker.io/flannel/flannel:v0.25.5
command:
- /opt/bin/flanneld
args:
- --ip-masq
- --kube-subnet-mgr
resources:
requests:
cpu: "100m"
memory: "50Mi"
securityContext:
privileged: false
capabilities:
add: ["NET_ADMIN", "NET_RAW"]
env:
- name: POD_NAME
valueFrom:
fieldRef:
fieldPath: metadata.name
- name: POD_NAMESPACE
valueFrom:
fieldRef:
fieldPath: metadata.namespace
- name: EVENT_QUEUE_DEPTH
value: "5000"
volumeMounts:
- name: run
mountPath: /run/flannel
- name: flannel-cfg
mountPath: /etc/kube-flannel/
- name: xtables-lock
mountPath: /run/xtables.lock
volumes:
- name: run
hostPath:
path: /run/flannel
type: DirectoryOrCreate
- name: cni-plugin
hostPath:
path: /opt/cni/bin
type: DirectoryOrCreate
- name: cni
hostPath:
path: /etc/cni/net.d
type: DirectoryOrCreate
- name: flannel-cfg
configMap:
name: kube-flannel-cfg
- name: xtables-lock
hostPath:
path: /run/xtables.lock
type: FileOrCreate


@@ -0,0 +1,12 @@
{ ... }:
{
terrahome.kubeadm = {
k8sMinor = "1.31";
controlPlaneInterface = "eth0";
controlPlaneVipSuffix = 250;
podSubnet = "10.244.0.0/16";
serviceSubnet = "10.96.0.0/12";
clusterDomain = "cluster.local";
};
}


@@ -0,0 +1,420 @@
{ config, lib, pkgs, ... }:
let
pinnedK8s = lib.attrByPath [ "kubernetes_1_31" ] pkgs.kubernetes pkgs;
kubeVipImage = "ghcr.io/kube-vip/kube-vip:v0.8.9";
in
{
options.terrahome.kubeadm = {
k8sMinor = lib.mkOption {
type = lib.types.str;
default = "1.31";
};
controlPlaneInterface = lib.mkOption {
type = lib.types.str;
default = "eth0";
};
controlPlaneVipSuffix = lib.mkOption {
type = lib.types.int;
default = 250;
};
podSubnet = lib.mkOption {
type = lib.types.str;
default = "10.244.0.0/16";
};
serviceSubnet = lib.mkOption {
type = lib.types.str;
default = "10.96.0.0/12";
};
clusterDomain = lib.mkOption {
type = lib.types.str;
default = "cluster.local";
};
};
config = {
boot.kernelModules = [ "overlay" "br_netfilter" ];
boot.kernel.sysctl = {
"net.ipv4.ip_forward" = 1;
"net.bridge.bridge-nf-call-iptables" = 1;
"net.bridge.bridge-nf-call-ip6tables" = 1;
};
virtualisation.containerd.enable = true;
virtualisation.containerd.settings = {
plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options.SystemdCgroup = true;
};
swapDevices = lib.mkForce [ ];
services.openssh.enable = true;
services.openssh.settings = {
PasswordAuthentication = false;
KbdInteractiveAuthentication = false;
};
users.users.micqdf = {
isNormalUser = true;
extraGroups = [ "wheel" ];
};
security.sudo.wheelNeedsPassword = false;
nix.settings.trusted-users = [ "root" "micqdf" ];
nix.gc = {
automatic = true;
dates = "daily";
options = "--delete-older-than 3d";
};
nix.settings.auto-optimise-store = true;
environment.variables = {
KUBECONFIG = "/etc/kubernetes/admin.conf";
KUBE_VIP_IMAGE = kubeVipImage;
};
environment.systemPackages = (with pkgs; [
containerd
cri-tools
cni-plugins
pinnedK8s
kubernetes-helm
conntrack-tools
socat
ethtool
ipvsadm
iproute2
iptables
ebtables
jq
curl
vim
gawk
]) ++ [
(pkgs.writeShellScriptBin "th-kubeadm-init" ''
set -euo pipefail
unset http_proxy https_proxy HTTP_PROXY HTTPS_PROXY no_proxy NO_PROXY
iface="${config.terrahome.kubeadm.controlPlaneInterface}"
if ! ip link show "$iface" >/dev/null 2>&1; then
iface="$(ip -o -4 route show to default | awk 'NR==1 {print $5}')"
fi
if [ -z "''${iface:-}" ]; then
echo "Could not determine network interface for kube-vip"
exit 1
fi
suffix="${toString config.terrahome.kubeadm.controlPlaneVipSuffix}"
pod_subnet="${config.terrahome.kubeadm.podSubnet}"
service_subnet="${config.terrahome.kubeadm.serviceSubnet}"
domain="${config.terrahome.kubeadm.clusterDomain}"
node_name="${config.networking.hostName}"
local_ip_cidr=$(ip -4 -o addr show dev "$iface" | awk 'NR==1 {print $4}')
if [ -z "''${local_ip_cidr:-}" ]; then
echo "Could not determine IPv4 CIDR on interface $iface"
exit 1
fi
subnet_prefix=$(echo "$local_ip_cidr" | cut -d/ -f1 | awk -F. '{print $1"."$2"."$3}')
vip="$subnet_prefix.$suffix"
echo "Using control-plane endpoint: $vip:6443"
echo "Using kube-vip interface: $iface"
echo "Using kubeadm node name: $node_name"
hostname "$node_name" || true
rm -f /var/lib/kubelet/config.yaml /var/lib/kubelet/kubeadm-flags.env
systemctl unmask kubelet || true
systemctl stop kubelet || true
systemctl reset-failed kubelet || true
env -i PATH=/run/current-system/sw/bin:/usr/bin:/bin kubeadm reset -f || true
rm -f /etc/kubernetes/kubelet.conf /etc/kubernetes/bootstrap-kubelet.conf
rm -f /var/lib/kubelet/kubeconfig /var/lib/kubelet/instance-config.yaml
rm -rf /var/lib/kubelet/pki
systemctl daemon-reload
systemctl unmask kubelet || true
systemctl enable kubelet || true
echo "==> Ensuring containerd is running"
systemctl start containerd || true
sleep 2
if ! systemctl is-active containerd; then
echo "ERROR: containerd not running"
journalctl -xeu containerd --no-pager -n 30
exit 1
fi
mkdir -p /etc/kubernetes/manifests
mkdir -p /tmp/kubeadm
cat > /tmp/kubeadm/init-config.yaml << 'KUBEADMCONFIG'
apiVersion: kubeadm.k8s.io/v1beta4
kind: InitConfiguration
nodeRegistration:
name: "KUBEADM_NODE_NAME"
criSocket: unix:///run/containerd/containerd.sock
kubeletExtraArgs:
- name: hostname-override
value: "KUBEADM_NODE_NAME"
---
apiVersion: kubeadm.k8s.io/v1beta4
kind: ClusterConfiguration
controlPlaneEndpoint: "KUBEADM_ENDPOINT"
networking:
podSubnet: "KUBEADM_POD_SUBNET"
serviceSubnet: "KUBEADM_SERVICE_SUBNET"
dnsDomain: "KUBEADM_DNS_DOMAIN"
KUBEADMCONFIG
sed -i "s|KUBEADM_ENDPOINT|$vip:6443|g" /tmp/kubeadm/init-config.yaml
sed -i "s|KUBEADM_POD_SUBNET|$pod_subnet|g" /tmp/kubeadm/init-config.yaml
sed -i "s|KUBEADM_SERVICE_SUBNET|$service_subnet|g" /tmp/kubeadm/init-config.yaml
sed -i "s|KUBEADM_DNS_DOMAIN|$domain|g" /tmp/kubeadm/init-config.yaml
sed -i "s|KUBEADM_NODE_NAME|$node_name|g" /tmp/kubeadm/init-config.yaml
echo "==> Pre-pulling kubeadm images"
env -i PATH=/run/current-system/sw/bin:/usr/bin:/bin kubeadm config images pull --config /tmp/kubeadm/init-config.yaml || true
echo "==> Creating kube-vip static pod manifest"
ctr image pull "${kubeVipImage}"
ctr run --rm --net-host "${kubeVipImage}" kube-vip-manifest /kube-vip manifest pod \
--log 4 \
--interface "$iface" \
--address "$vip" \
--controlplane \
--arp \
> /etc/kubernetes/manifests/kube-vip.yaml
# kube-vip bootstrap workaround for Kubernetes >=1.29.
# During early kubeadm phases, super-admin.conf is available before admin.conf is fully usable.
sed -i 's#path: /etc/kubernetes/admin.conf#path: /etc/kubernetes/super-admin.conf#' /etc/kubernetes/manifests/kube-vip.yaml || true
echo "==> kube-vip manifest kubeconfig mount"
grep -E 'mountPath:|path:' /etc/kubernetes/manifests/kube-vip.yaml | grep -E 'kubernetes/(admin|super-admin)\.conf' || true
KUBEADM_INIT_LOG=/tmp/kubeadm-init.log
if ! env -i PATH=/run/current-system/sw/bin:/usr/bin:/bin kubeadm init \
--config /tmp/kubeadm/init-config.yaml \
--upload-certs \
--ignore-preflight-errors=NumCPU,HTTPProxyCIDR,Port-10250 2>&1 | tee "$KUBEADM_INIT_LOG"; then
if grep -q "error writing CRISocket for this node: nodes" "$KUBEADM_INIT_LOG" && [ -f /etc/kubernetes/admin.conf ]; then
echo "==> kubeadm hit CRISocket race; waiting for node registration"
echo "==> forcing kubelet restart to pick bootstrap flags"
systemctl daemon-reload || true
systemctl restart kubelet || true
sleep 3
echo "==> kubelet bootstrap flags"
cat /var/lib/kubelet/kubeadm-flags.env || true
registered=0
for i in $(seq 1 60); do
if KUBECONFIG=/etc/kubernetes/admin.conf kubectl get node "$node_name" >/dev/null 2>&1; then
echo "==> node $node_name registered; uploading kubelet config"
env -i PATH=/run/current-system/sw/bin:/usr/bin:/bin kubeadm init phase upload-config kubelet --config /tmp/kubeadm/init-config.yaml
registered=1
break
fi
sleep 2
done
if [ "$registered" -ne 1 ]; then
echo "==> node $node_name did not register after kubeadm init failure"
KUBECONFIG=/etc/kubernetes/admin.conf kubectl get nodes -o wide || true
echo "==> kubelet logs (registration hints)"
journalctl -u kubelet --no-pager -n 120 | grep -Ei "register|node|bootstrap|certificate|forbidden|unauthorized|refused|x509" || true
exit 1
fi
else
echo "==> kubeadm init failed, checking pod status:"
crictl pods || true
crictl ps -a || true
echo "==> kube-vip containers:"
crictl ps -a --name kube-vip || true
echo "==> kube-vip logs:"
for container_id in $(crictl ps -a --name kube-vip -q 2>/dev/null); do
echo "--- kube-vip container $container_id ---"
crictl logs "$container_id" 2>/dev/null || true
crictl inspect "$container_id" 2>/dev/null | jq -r '.status | "exitCode=\(.exitCode) reason=\(.reason // "") message=\(.message // "")"' || true
done
echo "==> Checking if VIP is bound:"
ip -4 addr show | grep "$vip" || echo "VIP NOT BOUND"
echo "==> kubelet logs:"
journalctl -xeu kubelet --no-pager -n 50
exit 1
fi
fi
echo "==> Waiting for kube-vip to claim VIP $vip"
for i in $(seq 1 90); do
if ip -4 addr show | grep -q "$vip"; then
echo "==> VIP $vip is bound"
break
fi
if [ "$i" -eq 90 ]; then
echo "==> ERROR: VIP not bound after 3 minutes"
crictl ps -a --name kube-vip || true
for container_id in $(crictl ps -a --name kube-vip -q 2>/dev/null); do
echo "--- kube-vip container $container_id ---"
crictl logs "$container_id" 2>/dev/null || true
done
exit 1
fi
sleep 2
done
echo "==> Waiting for API server to be ready"
for i in $(seq 1 60); do
if curl -sk "https://$vip:6443/healthz" 2>/dev/null | grep -q "ok"; then
echo "==> API server is healthy"
break
fi
if [ "$i" -eq 60 ]; then
echo "==> ERROR: API server not healthy after 2 minutes"
crictl pods || true
crictl ps -a || true
exit 1
fi
sleep 2
done
# Switch kube-vip to normal admin.conf after bootstrap finishes.
sed -i 's#path: /etc/kubernetes/super-admin.conf#path: /etc/kubernetes/admin.conf#' /etc/kubernetes/manifests/kube-vip.yaml || true
mkdir -p /root/.kube
cp /etc/kubernetes/admin.conf /root/.kube/config
chmod 600 /root/.kube/config
echo
echo "Next: install Flannel, then generate join commands:"
echo " kubeadm token create --print-join-command"
echo " kubeadm token create --print-join-command --certificate-key <key>"
'')
(pkgs.writeShellScriptBin "th-kubeadm-join-control-plane" ''
set -euo pipefail
unset http_proxy https_proxy HTTP_PROXY HTTPS_PROXY no_proxy NO_PROXY
if [ "$#" -lt 1 ]; then
echo "Usage: th-kubeadm-join-control-plane '<kubeadm join ... --control-plane --certificate-key ...>'"
exit 1
fi
iface="${config.terrahome.kubeadm.controlPlaneInterface}"
if ! ip link show "$iface" >/dev/null 2>&1; then
iface="$(ip -o -4 route show to default | awk 'NR==1 {print $5}')"
fi
if [ -z "''${iface:-}" ]; then
echo "Could not determine network interface for kube-vip"
exit 1
fi
suffix="${toString config.terrahome.kubeadm.controlPlaneVipSuffix}"
local_ip_cidr=$(ip -4 -o addr show dev "$iface" | awk 'NR==1 {print $4}')
if [ -z "''${local_ip_cidr:-}" ]; then
echo "Could not determine IPv4 CIDR on interface $iface"
exit 1
fi
subnet_prefix=$(echo "$local_ip_cidr" | cut -d/ -f1 | awk -F. '{print $1"."$2"."$3}')
vip="$subnet_prefix.$suffix"
mkdir -p /etc/kubernetes/manifests
ctr image pull "${kubeVipImage}"
ctr run --rm --net-host "${kubeVipImage}" kube-vip /kube-vip manifest pod \
--log 4 \
--interface "$iface" \
--address "$vip" \
--controlplane \
--arp \
--leaderElection \
> /etc/kubernetes/manifests/kube-vip.yaml
rm -f /var/lib/kubelet/config.yaml /var/lib/kubelet/kubeadm-flags.env
rm -f /etc/kubernetes/kubelet.conf /etc/kubernetes/bootstrap-kubelet.conf
rm -f /var/lib/kubelet/kubeconfig /var/lib/kubelet/instance-config.yaml
rm -rf /var/lib/kubelet/pki
systemctl unmask kubelet || true
systemctl stop kubelet || true
systemctl enable kubelet || true
systemctl reset-failed kubelet || true
systemctl daemon-reload
env -i PATH=/run/current-system/sw/bin:/usr/bin:/bin kubeadm reset -f || true
eval "$1"
'')
(pkgs.writeShellScriptBin "th-kubeadm-join-worker" ''
set -euo pipefail
unset http_proxy https_proxy HTTP_PROXY HTTPS_PROXY no_proxy NO_PROXY
if [ "$#" -lt 1 ]; then
echo "Usage: th-kubeadm-join-worker '<kubeadm join ...>'"
exit 1
fi
rm -f /var/lib/kubelet/config.yaml /var/lib/kubelet/kubeadm-flags.env
rm -f /etc/kubernetes/kubelet.conf /etc/kubernetes/bootstrap-kubelet.conf
rm -f /var/lib/kubelet/kubeconfig /var/lib/kubelet/instance-config.yaml
rm -rf /var/lib/kubelet/pki
systemctl unmask kubelet || true
systemctl stop kubelet || true
systemctl enable kubelet || true
systemctl reset-failed kubelet || true
systemctl daemon-reload
env -i PATH=/run/current-system/sw/bin:/usr/bin:/bin kubeadm reset -f || true
eval "$1"
'')
(pkgs.writeShellScriptBin "th-kubeadm-status" ''
set -euo pipefail
systemctl is-active containerd || true
systemctl is-active kubelet || true
crictl info >/dev/null && echo "crictl: ok" || echo "crictl: not-ready"
'')
];
systemd.services.kubelet = {
description = "Kubernetes Kubelet";
wantedBy = [ "multi-user.target" ];
path = [ pkgs.util-linux ];
wants = [ "network-online.target" ];
after = [ "containerd.service" "network-online.target" ];
serviceConfig = {
Environment = [
"KUBELET_CONFIG_ARGS=--config=/var/lib/kubelet/config.yaml"
"KUBELET_KUBEADM_ARGS="
"KUBELET_EXTRA_ARGS="
];
EnvironmentFile = [
"-/var/lib/kubelet/kubeadm-flags.env"
"-/etc/default/kubelet"
];
ExecStart = "${pinnedK8s}/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf \$KUBELET_CONFIG_ARGS \$KUBELET_KUBEADM_ARGS \$KUBELET_EXTRA_ARGS";
Restart = "on-failure";
RestartSec = "10";
};
unitConfig = {
ConditionPathExists = "/var/lib/kubelet/config.yaml";
ConditionPathExistsGlob = "/etc/kubernetes/*kubelet.conf";
};
};
systemd.tmpfiles.rules = [
"d /etc/kubernetes 0755 root root -"
"d /etc/kubernetes/manifests 0755 root root -"
"d /etc/cni/net.d 0755 root root -"
"d /opt/cni/bin 0755 root root -"
"d /run/flannel 0755 root root -"
"d /var/lib/kubelet 0755 root root -"
"d /var/lib/kubelet/pki 0755 root root -"
];
};
}


@@ -0,0 +1,14 @@
{
networking.firewall.allowedTCPPorts = [
6443
2379
2380
10250
10257
10259
];
networking.firewall.allowedUDPPorts = [
8472
];
}


@@ -0,0 +1,11 @@
{
networking.firewall.allowedTCPPorts = [
10250
];
# NodePort services use the whole 30000-32767 range; open it as a range rather than just the two endpoint ports.
networking.firewall.allowedTCPPortRanges = [
{ from = 30000; to = 32767; }
];
networking.firewall.allowedUDPPorts = [
8472
];
}


@@ -0,0 +1,182 @@
#!/usr/bin/env python3
import concurrent.futures
import ipaddress
import json
import os
import subprocess
import sys
from typing import Dict, Set, Tuple


def derive_prefix(payload: dict) -> str:
    explicit = os.environ.get("KUBEADM_SUBNET_PREFIX", "").strip()
    if explicit:
        return explicit
    for key in ("control_plane_vm_ipv4", "worker_vm_ipv4"):
        values = payload.get(key, {}).get("value", {})
        for ip in values.values():
            if ip:
                parts = ip.split(".")
                if len(parts) == 4:
                    return ".".join(parts[:3])
    return "10.27.27"


def ssh_probe(ip: str, users: list[str], key_path: str, timeout_sec: int) -> Tuple[str, str, str] | None:
    cmd_tail = [
        "-o",
        "BatchMode=yes",
        "-o",
        "IdentitiesOnly=yes",
        "-o",
        "StrictHostKeyChecking=accept-new",
        "-o",
        f"ConnectTimeout={timeout_sec}",
        "-i",
        key_path,
    ]
    for user in users:
        cmd = [
            "ssh",
            *cmd_tail,
            f"{user}@{ip}",
            "hn=$(hostnamectl --static 2>/dev/null || hostname); serial=$(cat /sys/class/dmi/id/product_serial 2>/dev/null || true); printf '%s|%s\n' \"$hn\" \"$serial\"",
        ]
        try:
            out = subprocess.check_output(cmd, stderr=subprocess.DEVNULL, text=True, timeout=timeout_sec + 2).strip()
        except Exception:
            continue
        if out:
            line = out.splitlines()[0].strip()
            if "|" in line:
                host, serial = line.split("|", 1)
            else:
                host, serial = line, ""
            return host.strip(), ip, serial.strip()
    return None


def build_inventory(names: Set[str], found: Dict[str, str], ssh_user: str) -> str:
    cp = sorted([n for n in names if n.startswith("cp-")], key=lambda x: int(x.split("-")[1]))
    wk = sorted([n for n in names if n.startswith("wk-")], key=lambda x: int(x.split("-")[1]))
    cp_pairs = " ".join(f"{n}={found[n]}" for n in cp)
    wk_pairs = " ".join(f"{n}={found[n]}" for n in wk)
    primary = cp[0] if cp else "cp-1"
    return "\n".join(
        [
            f"SSH_USER={ssh_user}",
            f"PRIMARY_CONTROL_PLANE={primary}",
            f'CONTROL_PLANES="{cp_pairs}"',
            f'WORKERS="{wk_pairs}"',
            "",
        ]
    )


def main() -> int:
    payload = json.load(sys.stdin)
    cp_names = set(payload.get("control_plane_vm_ids", {}).get("value", {}).keys())
    wk_names = set(payload.get("worker_vm_ids", {}).get("value", {}).keys())
    target_names = cp_names | wk_names
    if not target_names:
        raise SystemExit("Could not determine target node names from Terraform outputs")
    ssh_user = os.environ.get("KUBEADM_SSH_USER", "").strip() or "micqdf"
    users = [u for u in os.environ.get("SSH_USER_CANDIDATES", f"{ssh_user} root").split() if u]
    key_path = os.environ.get("SSH_KEY_PATH", os.path.expanduser("~/.ssh/id_ed25519"))
    timeout_sec = int(os.environ.get("SSH_DISCOVERY_TIMEOUT_SEC", "6"))
    max_workers = int(os.environ.get("SSH_DISCOVERY_WORKERS", "32"))
    prefix = derive_prefix(payload)
    start = int(os.environ.get("KUBEADM_SUBNET_START", "2"))
    end = int(os.environ.get("KUBEADM_SUBNET_END", "254"))
    vip_suffix = int(os.environ.get("KUBEADM_CONTROL_PLANE_VIP_SUFFIX", "250"))

    def is_vip_ip(ip: str) -> bool:
        try:
            return int(ip.split(".")[-1]) == vip_suffix
        except Exception:
            return False

    scan_ips = [
        str(ipaddress.IPv4Address(f"{prefix}.{i}"))
        for i in range(start, end + 1)
        if i != vip_suffix
    ]
    found: Dict[str, str] = {}
    vmid_to_name: Dict[str, str] = {}
    for name, vmid in payload.get("control_plane_vm_ids", {}).get("value", {}).items():
        vmid_to_name[str(vmid)] = name
    for name, vmid in payload.get("worker_vm_ids", {}).get("value", {}).items():
        vmid_to_name[str(vmid)] = name
    seen_hostnames: Dict[str, str] = {}
    seen_ips: Dict[str, Tuple[str, str]] = {}

    def run_pass(pass_timeout: int, pass_workers: int) -> None:
        with concurrent.futures.ThreadPoolExecutor(max_workers=pass_workers) as pool:
            futures = [pool.submit(ssh_probe, ip, users, key_path, pass_timeout) for ip in scan_ips]
            for fut in concurrent.futures.as_completed(futures):
                result = fut.result()
                if not result:
                    continue
                host, ip, serial = result
                if host not in seen_hostnames:
                    seen_hostnames[host] = ip
                if ip not in seen_ips:
                    seen_ips[ip] = (host, serial)
                target = None
                if serial in vmid_to_name:
                    target = vmid_to_name[serial]
                elif host in target_names:
                    target = host
                if target:
                    existing = found.get(target)
                    if existing is None or (is_vip_ip(existing) and not is_vip_ip(ip)):
                        found[target] = ip
                if all(name in found for name in target_names):
                    return

    run_pass(timeout_sec, max_workers)
    if not all(name in found for name in target_names):
        # Slower second pass for busy runners/networks.
        run_pass(max(timeout_sec + 2, 8), max(8, max_workers // 2))
    # Heuristic fallback: if nodes are still missing, assign from the remaining
    # SSH-reachable IPs not already used, ordered by IP. This helps when cloned
    # nodes temporarily share a generic hostname (e.g. "flex") and DMI serial
    # mapping is unavailable.
    missing = sorted([n for n in target_names if n not in found])
    if missing:
        used_ips = set(found.values())
        candidates = sorted(ip for ip in seen_ips.keys() if ip not in used_ips)
        if len(candidates) >= len(missing):
            for name, ip in zip(missing, candidates):
                found[name] = ip
    missing = sorted([n for n in target_names if n not in found])
    if missing:
        discovered = ", ".join(sorted(seen_hostnames.keys())[:20])
        if discovered:
            sys.stderr.write(f"Discovered hostnames during scan: {discovered}\n")
        if seen_ips:
            sample = ", ".join(f"{ip}={meta[0]}" for ip, meta in sorted(seen_ips.items())[:20])
            sys.stderr.write(f"SSH-reachable IPs: {sample}\n")
        raise SystemExit(
            "Failed SSH-based IP discovery for nodes: " + ", ".join(missing) +
            f" (scanned {prefix}.{start}-{prefix}.{end})"
        )
    sys.stdout.write(build_inventory(target_names, found, ssh_user))
    return 0


if __name__ == "__main__":
    raise SystemExit(main())
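A sketch of how this discovery script is meant to be driven (the script path is hypothetical; the JSON shape comes from the Terraform outputs later in this change): pipe terraform output -json in on stdin and capture the rendered inventory.

cd terraform
terraform output -json \
  | KUBEADM_SSH_USER=micqdf \
    SSH_KEY_PATH="$HOME/.ssh/id_ed25519" \
    KUBEADM_SUBNET_PREFIX=10.27.27 \
    python3 ../kubeadm/discover_inventory.py > ../kubeadm/inventory.env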


@@ -0,0 +1,7 @@
SSH_USER=micqdf
PRIMARY_CONTROL_PLANE=cp-1
# Name=IP pairs (space-separated)
CONTROL_PLANES="cp-1=192.168.1.101 cp-2=192.168.1.102 cp-3=192.168.1.103"
WORKERS="wk-1=192.168.1.111 wk-2=192.168.1.112 wk-3=192.168.1.113"


@@ -0,0 +1,14 @@
#!/usr/bin/env bash
set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
INVENTORY_FILE="${1:-$SCRIPT_DIR/inventory.env}"
CONTROLLER="$SCRIPT_DIR/../bootstrap/controller.py"

if [ ! -f "$INVENTORY_FILE" ]; then
  echo "Missing inventory file: $INVENTORY_FILE"
  echo "Copy $SCRIPT_DIR/inventory.example.env to $SCRIPT_DIR/inventory.env and edit node mappings."
  exit 1
fi

python3 "$CONTROLLER" reconcile --inventory "$INVENTORY_FILE"


@@ -0,0 +1,65 @@
#!/usr/bin/env python3
import json
import os
import re
import sys


def natural_key(name: str):
    m = re.match(r"^([a-zA-Z-]+)-(\d+)$", name)
    if m:
        return (m.group(1), int(m.group(2)))
    return (name, 0)


def map_to_pairs(items: dict[str, str]) -> str:
    ordered = sorted(items.items(), key=lambda kv: natural_key(kv[0]))
    return " ".join(f"{k}={v}" for k, v in ordered)


def require_non_empty_ips(label: str, items: dict[str, str]) -> dict[str, str]:
    cleaned: dict[str, str] = {}
    missing: list[str] = []
    for name, ip in items.items():
        ip_value = (ip or "").strip()
        if not ip_value:
            missing.append(name)
            continue
        cleaned[name] = ip_value
    if missing:
        names = ", ".join(sorted(missing, key=natural_key))
        raise SystemExit(
            f"Missing IPv4 addresses for {label}: {names}. "
            "Terraform outputs are present but empty. "
            "This usually means Proxmox guest IP discovery is not yet available for these VMs."
        )
    return cleaned


def main() -> int:
    payload = json.load(sys.stdin)
    cp_map = payload.get("control_plane_vm_ipv4", {}).get("value", {})
    wk_map = payload.get("worker_vm_ipv4", {}).get("value", {})
    if not cp_map or not wk_map:
        raise SystemExit("Missing control_plane_vm_ipv4 or worker_vm_ipv4 in terraform output")
    cp_map = require_non_empty_ips("control planes", cp_map)
    wk_map = require_non_empty_ips("workers", wk_map)
    ssh_user = os.environ.get("KUBEADM_SSH_USER", "").strip() or "micqdf"
    print(f"SSH_USER={ssh_user}")
    print("PRIMARY_CONTROL_PLANE=cp-1")
    print(f'CONTROL_PLANES="{map_to_pairs(cp_map)}"')
    print(f'WORKERS="{map_to_pairs(wk_map)}"')
    return 0


if __name__ == "__main__":
    raise SystemExit(main())
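A minimal usage sketch (the script name is hypothetical). The first command shows the two maps the script reads from stdin, which require_non_empty_ips then validates; the second renders the inventory.

terraform output -json | jq '{cp: .control_plane_vm_ipv4.value, wk: .worker_vm_ipv4.value}'
terraform output -json | python3 render_inventory.py > inventory.env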


@@ -0,0 +1,106 @@
#!/usr/bin/env bash
set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
INVENTORY_FILE="${1:-$SCRIPT_DIR/inventory.env}"

if [ ! -f "$INVENTORY_FILE" ]; then
  echo "Missing inventory file: $INVENTORY_FILE"
  echo "Copy $SCRIPT_DIR/inventory.example.env to $SCRIPT_DIR/inventory.env and edit node mappings."
  exit 1
fi

# shellcheck disable=SC1090
source "$INVENTORY_FILE"

SSH_USER="${SSH_USER:-micqdf}"
SSH_KEY_PATH="${SSH_KEY_PATH:-$HOME/.ssh/id_ed25519}"
SSH_OPTS="${SSH_OPTS:--o BatchMode=yes -o IdentitiesOnly=yes -o StrictHostKeyChecking=accept-new -i $SSH_KEY_PATH}"
SSH_USER_CANDIDATES="${SSH_USER_CANDIDATES:-root $SSH_USER}"

declare -A NODE_IPS=()

add_pair() {
  local pair="$1"
  local name="${pair%%=*}"
  local ip="${pair#*=}"
  if [ -z "$name" ] || [ -z "$ip" ] || [ "$name" = "$ip" ]; then
    echo "Invalid node pair '$pair' (expected name=ip)."
    exit 1
  fi
  NODE_IPS["$name"]="$ip"
}

if [ -n "${CONTROL_PLANES:-}" ]; then
  for pair in $CONTROL_PLANES; do
    add_pair "$pair"
  done
else
  while IFS= read -r var_name; do
    idx="${var_name#CP_}"
    add_pair "cp-$idx=${!var_name}"
  done < <(compgen -A variable | grep -E '^CP_[0-9]+$' | sort -V)
fi

if [ -n "${WORKERS:-}" ]; then
  for pair in $WORKERS; do
    add_pair "$pair"
  done
else
  while IFS= read -r var_name; do
    idx="${var_name#WK_}"
    add_pair "wk-$idx=${!var_name}"
  done < <(compgen -A variable | grep -E '^WK_[0-9]+$' | sort -V)
fi

if [ "${#NODE_IPS[@]}" -eq 0 ]; then
  echo "No nodes found in inventory."
  exit 1
fi

detect_ssh_user() {
  local probe_ip="$1"
  local candidate
  for candidate in $SSH_USER_CANDIDATES; do
    if ssh $SSH_OPTS "$candidate@$probe_ip" "true" >/dev/null 2>&1; then
      ACTIVE_SSH_USER="$candidate"
      echo "==> Using SSH user '$ACTIVE_SSH_USER'"
      return 0
    fi
  done
  echo "Unable to authenticate to $probe_ip with candidates: $SSH_USER_CANDIDATES"
  return 1
}

mkdir -p "$HOME/.ssh"
chmod 700 "$HOME/.ssh"
touch "$HOME/.ssh/known_hosts"
chmod 600 "$HOME/.ssh/known_hosts"

# Refresh known_hosts entries so rotated host keys on rebuilt VMs do not abort the run.
for node_name in "${!NODE_IPS[@]}"; do
  ssh-keygen -R "${NODE_IPS[$node_name]}" >/dev/null 2>&1 || true
  ssh-keyscan -H "${NODE_IPS[$node_name]}" >> "$HOME/.ssh/known_hosts" 2>/dev/null || true
done

reset_node() {
  local node_name="$1"
  local node_ip="$2"
  echo "==> Resetting $node_name ($node_ip)"
  local cmd="sudo kubeadm reset -f && sudo systemctl stop kubelet && sudo rm -rf /etc/kubernetes /var/lib/etcd /var/lib/cni /etc/cni/net.d"
  local quoted_cmd
  quoted_cmd="$(printf '%q' "$cmd")"
  ssh $SSH_OPTS "$ACTIVE_SSH_USER@$node_ip" "bash -lc $quoted_cmd"
}

FIRST_NODE_IP="${NODE_IPS[$(printf '%s\n' "${!NODE_IPS[@]}" | sort -V | head -n1)]}"
ACTIVE_SSH_USER="$SSH_USER"
detect_ssh_user "$FIRST_NODE_IP"

while IFS= read -r node_name; do
  reset_node "$node_name" "${NODE_IPS[$node_name]}"
done < <(printf '%s\n' "${!NODE_IPS[@]}" | sort -V)

echo "Cluster components reset on all listed nodes."


@@ -1,17 +1,16 @@
# NixOS Proxmox Template Base
# NixOS Proxmox k8s-base Template
This folder contains a minimal NixOS base config you can copy into a new
This folder contains a Kubernetes-ready NixOS base config for your Proxmox
template VM build.
## Files
- `flake.nix`: pins `nixos-24.11` and exposes one host config.
- `configuration.nix`: base settings for Proxmox guest use.
- `flake.nix`: pins `nixos-25.05` and exposes one host config.
- `configuration.nix`: k8s-base settings for Proxmox guests.
## Before first apply
1. Replace `REPLACE_WITH_YOUR_SSH_PUBLIC_KEY` in `configuration.nix`.
2. Add `hardware-configuration.nix` from the VM install:
1. Add `hardware-configuration.nix` from the VM install:
- `nixos-generate-config --root /`
- copy `/etc/nixos/hardware-configuration.nix` next to `configuration.nix`
@@ -23,5 +22,6 @@ sudo nixos-rebuild switch --flake .#template
## Notes
- This is intentionally minimal and avoids cloud-init assumptions.
- If you want host-specific settings, create additional modules and import them.
- This pre-installs heavy shared Kubernetes dependencies (containerd + kube tools)
to reduce per-node bootstrap time.
- Cloud-init still injects the runtime SSH key and per-node hostname/IP.


@@ -1,12 +1,17 @@
{ lib, pkgs, ... }:
let
pinnedK8s = lib.attrByPath [ "kubernetes_1_31" ] pkgs.kubernetes pkgs;
in
{
imports =
lib.optional (builtins.pathExists ./hardware-configuration.nix)
./hardware-configuration.nix;
networking.hostName = "nixos-template";
networking.hostName = "k8s-base-template";
networking.useDHCP = lib.mkDefault true;
networking.useNetworkd = true;
networking.nameservers = [ "1.1.1.1" "8.8.8.8" ];
boot.loader.systemd-boot.enable = lib.mkForce false;
@@ -16,14 +21,40 @@
};
services.qemuGuest.enable = true;
services.cloud-init.enable = true;
services.cloud-init.network.enable = true;
services.openssh.enable = true;
services.tailscale.enable = true;
services.openssh.settings = {
PasswordAuthentication = false;
KbdInteractiveAuthentication = false;
PermitRootLogin = "prohibit-password";
};
boot.kernelModules = [ "overlay" "br_netfilter" ];
boot.kernel.sysctl = {
"net.ipv4.ip_forward" = 1;
"net.bridge.bridge-nf-call-iptables" = 1;
"net.bridge.bridge-nf-call-ip6tables" = 1;
};
virtualisation.containerd.enable = true;
virtualisation.containerd.settings = {
plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options.SystemdCgroup = true;
};
swapDevices = lib.mkForce [ ];
nix.settings = {
trusted-users = [ "root" "micqdf" ];
auto-optimise-store = true;
};
nix.gc = {
automatic = true;
dates = "daily";
options = "--delete-older-than 3d";
};
programs.fish.enable = true;
users.users.micqdf = {
@@ -36,16 +67,27 @@
environment.systemPackages = with pkgs; [
btop
cni-plugins
conntrack-tools
containerd
cri-tools
curl
dig
ebtables
ethtool
eza
fd
fzf
git
htop
iproute2
iptables
ipvsadm
jq
kubernetes-helm
pinnedK8s
ripgrep
tailscale
socat
tree
unzip
vim

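The boot.kernelModules and boot.kernel.sysctl settings in this template are the standard kubeadm preflight prerequisites. A quick post-boot check on any clone, as a sketch:

lsmod | grep -E 'overlay|br_netfilter'
sysctl net.ipv4.ip_forward net.bridge.bridge-nf-call-iptables net.bridge.bridge-nf-call-ip6tables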
nixos/template-base/flake.lock (generated)

@@ -0,0 +1,27 @@
{
  "nodes": {
    "nixpkgs": {
      "locked": {
        "lastModified": 1767313136,
        "narHash": "sha256-16KkgfdYqjaeRGBaYsNrhPRRENs0qzkQVUooNHtoy2w=",
        "owner": "NixOS",
        "repo": "nixpkgs",
        "rev": "ac62194c3917d5f474c1a844b6fd6da2db95077d",
        "type": "github"
      },
      "original": {
        "owner": "NixOS",
        "ref": "nixos-25.05",
        "repo": "nixpkgs",
        "type": "github"
      }
    },
    "root": {
      "inputs": {
        "nixpkgs": "nixpkgs"
      }
    }
  },
  "root": "root",
  "version": 7
}


@@ -1,8 +1,8 @@
{
description = "Base NixOS config for Proxmox template";
description = "Kubernetes-ready NixOS base template";
inputs = {
nixpkgs.url = "github:NixOS/nixpkgs/nixos-24.11";
nixpkgs.url = "github:NixOS/nixpkgs/nixos-25.05";
};
outputs = { nixpkgs, ... }: {


@@ -1,42 +1,6 @@
# This file is maintained automatically by "terraform init".
# Manual edits may be lost in future updates.
provider "registry.terraform.io/hashicorp/local" {
version = "2.7.0"
hashes = [
"h1:2RYa3j7m/0WmET2fqotY4CHxE1Hpk0fgn47/126l+Og=",
"zh:261fec71bca13e0a7812dc0d8ae9af2b4326b24d9b2e9beab3d2400fab5c5f9a",
"zh:308da3b5376a9ede815042deec5af1050ec96a5a5410a2206ae847d82070a23e",
"zh:3d056924c420464dc8aba10e1915956b2e5c4d55b11ffff79aa8be563fbfe298",
"zh:643256547b155459c45e0a3e8aab0570db59923c68daf2086be63c444c8c445b",
"zh:78d5eefdd9e494defcb3c68d282b8f96630502cac21d1ea161f53cfe9bb483b3",
"zh:7aa4d0b853f84205e8cf79f30c9b2c562afbfa63592f7231b6637e5d7a6b5b27",
"zh:7dc251bbc487d58a6ab7f5b07ec9edc630edb45d89b761dba28e0e2ba6b1c11f",
"zh:7ee0ca546cd065030039168d780a15cbbf1765a4c70cd56d394734ab112c93da",
"zh:b1d5d80abb1906e6c6b3685a52a0192b4ca6525fe090881c64ec6f67794b1300",
"zh:d81ea9856d61db3148a4fc6c375bf387a721d78fc1fea7a8823a027272a47a78",
"zh:df0a1f0afc947b8bfc88617c1ad07a689ce3bd1a29fd97318392e6bdd32b230b",
"zh:dfbcad800240e0c68c43e0866f2a751cff09777375ec701918881acf67a268da",
]
}
provider "registry.terraform.io/hashicorp/template" {
version = "2.2.0"
hashes = [
"h1:94qn780bi1qjrbC3uQtjJh3Wkfwd5+tTtJHOb7KTg9w=",
"zh:01702196f0a0492ec07917db7aaa595843d8f171dc195f4c988d2ffca2a06386",
"zh:09aae3da826ba3d7df69efeb25d146a1de0d03e951d35019a0f80e4f58c89b53",
"zh:09ba83c0625b6fe0a954da6fbd0c355ac0b7f07f86c91a2a97849140fea49603",
"zh:0e3a6c8e16f17f19010accd0844187d524580d9fdb0731f675ffcf4afba03d16",
"zh:45f2c594b6f2f34ea663704cc72048b212fe7d16fb4cfd959365fa997228a776",
"zh:77ea3e5a0446784d77114b5e851c970a3dde1e08fa6de38210b8385d7605d451",
"zh:8a154388f3708e3df5a69122a23bdfaf760a523788a5081976b3d5616f7d30ae",
"zh:992843002f2db5a11e626b3fc23dc0c87ad3729b3b3cff08e32ffb3df97edbde",
"zh:ad906f4cebd3ec5e43d5cd6dc8f4c5c9cc3b33d2243c89c5fc18f97f7277b51d",
"zh:c979425ddb256511137ecd093e23283234da0154b7fa8b21c2687182d9aea8b2",
]
}
provider "registry.terraform.io/telmate/proxmox" {
version = "3.0.2-rc07"
constraints = "3.0.2-rc07"


@@ -1,12 +0,0 @@
data "template_file" "cloud_init_global" {
template = file("${path.module}/files/cloud_init_global.tpl")
vars = {
SSH_KEY_PUBLIC = var.SSH_KEY_PUBLIC
}
}
resource "local_file" "cloud_init_global" {
content = data.template_file.cloud_init_global.rendered
filename = "${path.module}/files/rendered/cloud_init_global.yaml"
}


@@ -1,4 +1,5 @@
#cloud-config
hostname: ${hostname}
manage_etc_hosts: true
resolv_conf:
nameservers:
@@ -6,6 +7,7 @@ resolv_conf:
- 1.1.1.1
preserve_hostname: false
fqdn: ${hostname}.${domain}
users:
- name: micqdf


@@ -9,6 +9,15 @@ terraform {
}
}
locals {
control_plane_ipconfig = [
for ip in var.control_plane_ips : "ip=${ip}/${var.network_prefix_length},gw=${var.network_gateway}"
]
worker_ipconfig = [
for ip in var.worker_ips : "ip=${ip}/${var.network_prefix_length},gw=${var.network_gateway}"
]
}
provider "proxmox" {
pm_api_url = var.pm_api_url
pm_api_token_id = var.pm_api_token_id
@@ -16,33 +25,35 @@ provider "proxmox" {
pm_tls_insecure = true
}
resource "proxmox_vm_qemu" "alpacas" {
count = var.alpaca_vm_count
name = "alpaca-${count.index + 1}"
vmid = 500 + count.index + 1
target_node = var.target_node
clone = var.clone_template
full_clone = true
os_type = "cloud-init"
agent = 1
resource "proxmox_vm_qemu" "control_planes" {
count = var.control_plane_count
name = "cp-${count.index + 1}"
vmid = var.control_plane_vmid_start + count.index
target_node = var.target_node
clone = var.clone_template
full_clone = true
os_type = "cloud-init"
agent = var.qemu_agent_enabled ? 1 : 0
automatic_reboot = true
cpu {
sockets = var.sockets
cores = var.cores
sockets = 1
cores = var.control_plane_cores
}
memory = var.memory
memory = var.control_plane_memory_mb
scsihw = "virtio-scsi-pci"
boot = "order=scsi0"
bootdisk = "scsi0"
ipconfig0 = "ip=dhcp"
cicustom = "user=local:snippets/cloud_init_global.yaml"
ipconfig0 = local.control_plane_ipconfig[count.index]
ciuser = "micqdf"
sshkeys = var.SSH_KEY_PUBLIC
disks {
scsi {
scsi0 {
disk {
size = var.disk_size
size = var.control_plane_disk_size
storage = var.storage
}
}
@@ -62,35 +73,41 @@ resource "proxmox_vm_qemu" "alpacas" {
model = "virtio"
bridge = var.bridge
}
lifecycle {
ignore_changes = all
}
}
resource "proxmox_vm_qemu" "llamas" {
count = var.llama_vm_count
name = "llama-${count.index + 1}"
vmid = 600 + count.index + 1
target_node = var.target_node
clone = var.clone_template
full_clone = true
os_type = "cloud-init"
agent = 1
resource "proxmox_vm_qemu" "workers" {
count = var.worker_count
name = "wk-${count.index + 1}"
vmid = var.worker_vmid_start + count.index
target_node = var.target_node
clone = var.clone_template
full_clone = true
os_type = "cloud-init"
agent = var.qemu_agent_enabled ? 1 : 0
automatic_reboot = true
cpu {
sockets = var.sockets
cores = var.cores
sockets = 1
cores = var.worker_cores[count.index]
}
memory = var.memory
memory = var.worker_memory_mb[count.index]
scsihw = "virtio-scsi-pci"
boot = "order=scsi0"
bootdisk = "scsi0"
ipconfig0 = "ip=dhcp"
cicustom = "user=local:snippets/cloud_init_global.yaml"
ipconfig0 = local.worker_ipconfig[count.index]
ciuser = "micqdf"
sshkeys = var.SSH_KEY_PUBLIC
disks {
scsi {
scsi0 {
disk {
size = var.disk_size
size = var.worker_disk_size
storage = var.storage
}
}
@@ -111,4 +128,8 @@ resource "proxmox_vm_qemu" "llamas" {
model = "virtio"
bridge = var.bridge
}
lifecycle {
ignore_changes = all
}
}
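The ipconfig0 values come from the locals at the top of this file. To inspect what Proxmox will receive, a sketch using terraform console (assumes the module directory is initialized with the shipped tfvars):

echo 'local.control_plane_ipconfig' | terraform console
# With the defaults this renders entries like "ip=10.27.27.50/10,gw=10.27.27.1".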


@@ -1,21 +1,35 @@
output "alpaca_vm_ids" {
output "control_plane_vm_ids" {
value = {
for i in range(var.alpaca_vm_count) :
"alpaca-${i + 1}" => proxmox_vm_qemu.alpacas[i].vmid
for i in range(var.control_plane_count) :
"cp-${i + 1}" => proxmox_vm_qemu.control_planes[i].vmid
}
}
output "alpaca_vm_names" {
value = [for vm in proxmox_vm_qemu.alpacas : vm.name]
output "control_plane_vm_names" {
value = [for vm in proxmox_vm_qemu.control_planes : vm.name]
}
output "llama_vm_ids" {
output "control_plane_vm_ipv4" {
value = {
for i in range(var.llama_vm_count) :
"llama-${i + 1}" => proxmox_vm_qemu.llamas[i].vmid
for i in range(var.control_plane_count) :
proxmox_vm_qemu.control_planes[i].name => var.control_plane_ips[i]
}
}
output "llama_vm_names" {
value = [for vm in proxmox_vm_qemu.llamas : vm.name]
output "worker_vm_ids" {
value = {
for i in range(var.worker_count) :
"wk-${i + 1}" => proxmox_vm_qemu.workers[i].vmid
}
}
output "worker_vm_names" {
value = [for vm in proxmox_vm_qemu.workers : vm.name]
}
output "worker_vm_ipv4" {
value = {
for i in range(var.worker_count) :
proxmox_vm_qemu.workers[i].name => var.worker_ips[i]
}
}


@@ -1,10 +1,25 @@
target_node = "flex"
clone_template = "nixos-template"
cores = 1
memory = 1024
disk_size = "15G"
sockets = 1
clone_template = "k8s-base-template"
bridge = "vmbr0"
storage = "Flash"
pm_api_url = "https://100.105.0.115:8006/api2/json"
pm_api_token_id = "terraform-prov@pve!mytoken"
control_plane_count = 3
worker_count = 3
control_plane_vmid_start = 701
worker_vmid_start = 711
control_plane_cores = 1
control_plane_memory_mb = 4096
control_plane_disk_size = "80G"
worker_cores = [4, 4, 4]
worker_memory_mb = [12288, 12288, 12288]
worker_disk_size = "120G"
network_prefix_length = 10
network_gateway = "10.27.27.1"
control_plane_ips = ["10.27.27.50", "10.27.27.51", "10.27.27.49"]
worker_ips = ["10.27.27.47", "10.27.27.46", "10.27.27.48"]


@@ -27,20 +27,98 @@ variable "clone_template" {
type = string
}
variable "cores" {
type = number
variable "control_plane_count" {
type = number
default = 3
description = "Number of control plane VMs"
}
variable "memory" {
type = number
variable "worker_count" {
type = number
default = 3
description = "Number of worker VMs"
}
variable "disk_size" {
type = string
variable "control_plane_vmid_start" {
type = number
default = 701
description = "Starting VMID for control plane VMs"
}
variable "sockets" {
type = number
variable "worker_vmid_start" {
type = number
default = 711
description = "Starting VMID for worker VMs"
}
variable "control_plane_cores" {
type = number
default = 1
description = "vCPU cores per control plane VM"
}
variable "control_plane_memory_mb" {
type = number
default = 4096
description = "Memory in MB per control plane VM"
}
variable "worker_cores" {
type = list(number)
default = [4, 4, 4]
description = "vCPU cores for each worker VM"
}
variable "worker_memory_mb" {
type = list(number)
default = [12288, 12288, 12288]
description = "Memory in MB for each worker VM"
}
variable "control_plane_disk_size" {
type = string
default = "80G"
description = "Disk size for control plane VMs"
}
variable "worker_disk_size" {
type = string
default = "120G"
description = "Disk size for worker VMs"
}
variable "network_prefix_length" {
type = number
default = 10
description = "CIDR prefix length for static VM addresses"
}
variable "network_gateway" {
type = string
default = "10.27.27.1"
description = "Gateway for static VM addresses"
}
variable "control_plane_ips" {
type = list(string)
default = ["10.27.27.50", "10.27.27.51", "10.27.27.49"]
description = "Static IPv4 addresses for control plane VMs"
validation {
condition = length(var.control_plane_ips) == 3
error_message = "control_plane_ips must contain exactly 3 IPs."
}
}
variable "worker_ips" {
type = list(string)
default = ["10.27.27.47", "10.27.27.46", "10.27.27.48"]
description = "Static IPv4 addresses for worker VMs"
validation {
condition = length(var.worker_ips) == 3
error_message = "worker_ips must contain exactly 3 IPs."
}
}
variable "bridge" {
@@ -55,16 +133,10 @@ variable "pm_api_url" {
type = string
}
variable "alpaca_vm_count" {
type = number
default = 1
description = "How many Alpaca VMs to create"
}
variable "llama_vm_count" {
type = number
default = 1
description = "How many Llama VMs to create"
variable "qemu_agent_enabled" {
type = bool
default = false
description = "Enable QEMU guest agent integration in Proxmox resources"
}
variable "SSH_KEY_PUBLIC" {