82 Commits

Author SHA1 Message Date
5bfc135350 Merge pull request 'fix: ignore stale SSH host keys for ephemeral homelab VMs' (#130) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 19m24s
Reviewed-on: #130
2026-03-09 03:45:11 +00:00
63213a4bc3 fix: ignore stale SSH host keys for ephemeral homelab VMs
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 16s
Fresh destroy/recreate cycles change VM host keys, which was breaking bootstrap after rebuilds. Use a disposable known-hosts policy in the controller SSH options so automation does not fail on expected key rotation.
2026-03-09 03:16:18 +00:00
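A minimal sketch of such a disposable known-hosts policy, expressed as plain `ssh` options (the controller's exact flag list may differ):

```shell
# Disposable known-hosts policy for ephemeral VMs: accept whatever host
# key is presented and record it nowhere, so a destroy/recreate cycle
# (which rotates the key) cannot break automation.
SSH_OPTS=(
  -o StrictHostKeyChecking=no
  -o UserKnownHostsFile=/dev/null
  -o IdentitiesOnly=yes
)
echo "example: ssh ${SSH_OPTS[*]} user@vm hostname"
```

This is only reasonable because the VMs are ephemeral homelab hosts: it trades host authentication for rebuild resilience.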
e4243c7667 Merge pull request 'fix: keep DHCP enabled by default on template VM' (#129) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 1h50m42s
Reviewed-on: #129
2026-03-08 22:03:17 +00:00
33bb0ffb17 fix: keep DHCP enabled by default on template VM
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 14s
The template machine can lose connectivity when rebuilt directly because it has no cloud-init network data during template maintenance. Restore DHCP as the default for the template itself while keeping cloud-init + networkd enabled so cloned VMs can still consume injected network settings.
2026-03-08 20:12:03 +00:00
7434a65590 Merge pull request 'stage' (#128) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 6m54s
Reviewed-on: #128
2026-03-08 18:06:46 +00:00
cd8e538c51 ci: switch checkout action source away from gitea.com mirror
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 16s
The gitea.com checkout action mirror is timing out during workflow startup. Use actions/checkout@v4 directly so jobs do not fail before any repository logic runs.
2026-03-08 13:36:21 +00:00
808c290c71 chore: clarify stale template cloud-init failure message
Some checks failed
Terraform Plan / Terraform Plan (push) Failing after 31s
Make SSH bootstrap failures explain the real root cause when fresh clones never accept the injected user/key: the Proxmox source template itself still needs the updated cloud-init-capable NixOS configuration.
2026-03-08 13:16:37 +00:00
15e6471e7e Merge pull request 'fix: enable cloud-init networking in NixOS template' (#127) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 7m10s
Reviewed-on: #127
2026-03-08 05:33:57 +00:00
79a4c941e5 fix: enable cloud-init networking in NixOS template
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 16s
Freshly recreated VMs were reachable but did not accept the injected SSH key, which indicates Proxmox cloud-init settings were not being applied. Enable cloud-init and cloud-init network handling in the base template so static IPs, hostname, ciuser, and SSH keys take effect on first boot.
2026-03-08 05:16:19 +00:00
e9bac70cae Merge pull request 'fix: wait for SSH readiness after VM provisioning' (#126) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 6m56s
Reviewed-on: #126
2026-03-08 05:04:43 +00:00
4c167f618a fix: wait for SSH readiness after VM provisioning
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 17s
Freshly recreated VMs can take a few minutes before cloud-init users and SSH are available. Retry SSH authentication in the bootstrap controller before failing so rebuild/bootstrap does not abort immediately on new hosts.
2026-03-08 05:00:39 +00:00
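The retry shape can be sketched as a generic helper (the helper name and the commented 20x15s example are assumptions mirroring typical knobs, not the controller's actual code):

```shell
# Retry a command up to <attempts> times, sleeping <delay> seconds
# between failures; succeeds as soon as the command does.
retry() {
  local attempts="$1" delay="$2" i
  shift 2
  for ((i = 1; i <= attempts; i++)); do
    if "$@"; then
      return 0
    fi
    if ((i < attempts)); then
      echo "attempt $i/$attempts failed; retrying in ${delay}s" >&2
      sleep "$delay"
    fi
  done
  return 1
}

# e.g. wait for cloud-init users/SSH on a fresh VM before bootstrapping:
# retry 20 15 ssh -o ConnectTimeout=5 user@10.0.0.11 true
```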
97295a7071 Merge pull request 'ci: speed up Terraform destroy plan by skipping refresh' (#125) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 7m0s
Reviewed-on: #125
2026-03-08 04:47:02 +00:00
7bc861b3e8 ci: speed up Terraform destroy plan by skipping refresh
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 16s
Use terraform plan -refresh=false for destroy workflows so manual NUKE runs do not spend minutes refreshing Proxmox VM state before building the destroy plan.
2026-03-08 04:37:52 +00:00
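In effect the destroy workflow's plan command gains `-refresh=false` (a standard Terraform flag); a sketch of assembling it, with `target` standing in for the workflow input:

```shell
# Build a destroy plan without refreshing state first: -refresh=false
# skips re-reading every Proxmox VM, which is the slow part of NUKE runs.
target="all"  # hypothetical stand-in for the workflow's target input
case "$target" in
  all)     plan_cmd="terraform plan -refresh=false -destroy -out=tfdestroy" ;;
  workers) plan_cmd="terraform plan -refresh=false -destroy -target=proxmox_vm_qemu.workers -out=tfdestroy" ;;
  *)       echo "unknown target: $target" >&2; exit 1 ;;
esac
echo "$plan_cmd"
```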
6ca189b32c Merge pull request 'fix: vendor Flannel manifest and harden CNI bootstrap timing' (#124) from stage into master
All checks were successful
Terraform Apply / Terraform Apply (push) Successful in 15m11s
Reviewed-on: #124
2026-03-08 04:10:47 +00:00
b7b364a112 fix: vendor Flannel manifest and harden CNI bootstrap timing
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 17s
Stop depending on GitHub during cluster bring-up by shipping the Flannel manifest in-repo, ensure required host paths exist on NixOS nodes, and wait/retry against a stable API before applying the CNI. This removes the TLS handshake timeout failure mode and makes early network bootstrap deterministic.
2026-03-08 03:24:16 +00:00
2aa9950f59 Merge pull request 'fix: add mount utility to kubelet service PATH' (#123) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 11m10s
Reviewed-on: #123
2026-03-08 02:16:23 +00:00
bd866f7dac fix: add mount utility to kubelet service PATH
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 16s
Flannel pods were stuck because kubelet could not execute mount for projected service account volumes on NixOS. Add util-linux to the kubelet systemd PATH so mount is available during volume setup.
2026-03-07 14:18:20 +00:00
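In NixOS module terms, a minimal sketch of this fix (assuming the repo defines the kubelet unit via `systemd.services.kubelet`):

```nix
{ pkgs, ... }:
{
  # util-linux supplies `mount`; adding it to the unit's path makes it
  # available to kubelet when it sets up projected service-account volumes.
  systemd.services.kubelet.path = [ pkgs.util-linux ];
}
```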
c1f86483ad Merge pull request 'debug: print detailed Flannel pod diagnostics on rollout timeout' (#122) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 23m50s
Reviewed-on: #122
2026-03-07 12:31:43 +00:00
0cce4bcf72 Merge branch 'master' into stage
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 16s
2026-03-07 12:22:01 +00:00
065567210e debug: print detailed Flannel pod diagnostics on rollout timeout
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 18s
When kube-flannel daemonset rollout stalls, print pod descriptions and per-container logs for the init containers and main flannel container so the next failure shows the actual cause instead of only Init:0/2.
2026-03-07 12:19:21 +00:00
c5f0b1ac37 Merge pull request 'stage' (#121) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 30m28s
Reviewed-on: #121
2026-03-07 01:01:38 +00:00
e740d47011 Merge branch 'master' into stage
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 16s
2026-03-07 00:57:47 +00:00
d9d3976c4c fix: use self-contained Terraform variable validations
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 17s
Terraform variable validation blocks can only reference the variable under validation. Replace count-based checks with fixed-length validations for the current 3 control planes and 3 workers.
2026-03-07 00:54:51 +00:00
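The constraint that a `validation` block may reference only its own variable is what forces the fixed-length style; a hedged sketch (the variable name here is an assumption):

```hcl
variable "control_plane_ipv4" {
  type = list(string)

  # A validation condition may only reference the variable being
  # validated, so the expected count is hard-coded rather than derived
  # from a separate count variable.
  validation {
    condition     = length(var.control_plane_ipv4) == 3
    error_message = "Expected exactly 3 control-plane IPv4 addresses."
  }
}
```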
a0b07816b9 refactor: simplify homelab bootstrap around static IPs and fresh runs
Some checks failed
Terraform Plan / Terraform Plan (push) Failing after 10s
Make Terraform the source of truth for node IPs, remove guest-agent/SSH discovery from the normal workflow path, simplify the bootstrap controller to a fresh-run flow, and swap the initial CNI to Flannel so cluster readiness is easier to prove before reintroducing more complex reconcile behavior.
2026-03-07 00:52:35 +00:00
d964ff8b50 Merge pull request 'fix: point Cilium directly at API server and print rollout diagnostics' (#120) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 26m43s
Reviewed-on: #120
2026-03-05 01:25:52 +00:00
e06b2c692e fix: point Cilium directly at API server and print rollout diagnostics
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 18s
Set Cilium k8sServiceHost/k8sServicePort to the primary control-plane API endpoint to avoid in-cluster service routing dependency during early bootstrap. Also print cilium daemonset/pod/log diagnostics when rollout times out.
2026-03-05 01:21:21 +00:00
c48bbddef3 Merge pull request 'fix: stabilize Cilium install defaults and add rollout diagnostics' (#119) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 26m43s
Reviewed-on: #119
2026-03-05 00:52:04 +00:00
ca54c44fa4 fix: stabilize Cilium install defaults and add rollout diagnostics
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 17s
Set Cilium kubeProxyReplacement from env (default false for homelab stability) and collect cilium daemonset/pod/log diagnostics when rollout times out during verification.
2026-03-05 00:48:41 +00:00
8bda08be07 Merge pull request 'fix: hard-reset nodes before kubeadm join retries' (#118) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 29m30s
Reviewed-on: #118
2026-03-05 00:16:31 +00:00
0778de9719 fix: hard-reset nodes before kubeadm join retries
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 17s
Before control-plane and worker joins, remove stale kubelet/kubernetes identity files and run kubeadm reset -f. This prevents preflight failures like FileAvailable--etc-kubernetes-kubelet.conf during repeated reconcile attempts.
2026-03-04 23:38:15 +00:00
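The pre-join cleanup can be sketched against a configurable root so it is exercisable outside a live node; on a real host `root` would be `/` and the commented `kubeadm reset -f` does the rest:

```shell
# Remove stale kubelet/kubernetes identity files that trip kubeadm's
# FileAvailable-- preflight checks on repeated join attempts.
clean_node_identity() {
  local root="$1"
  rm -f "$root/etc/kubernetes/kubelet.conf" \
        "$root/etc/kubernetes/bootstrap-kubelet.conf" \
        "$root/etc/kubernetes/pki/ca.crt"
  rm -rf "$root/var/lib/kubelet/pki"
  # On a live node, follow with: sudo kubeadm reset -f
}
```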
92f0658995 Merge pull request 'fix: add heuristic SSH inventory fallback for generic hostnames' (#117) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 19m52s
Reviewed-on: #117
2026-03-04 23:13:08 +00:00
fc4eb1bc6e fix: add heuristic SSH inventory fallback for generic hostnames
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 16s
When Proxmox guest-agent IPs are empty and SSH discovery returns duplicate generic hostnames (e.g. flex), assign remaining missing nodes from unmatched SSH-reachable IPs in deterministic order. Also emit SSH-reachable IP diagnostics on failure.
2026-03-04 23:07:45 +00:00
4b017364c8 Merge pull request 'fix: wait for Cilium and node readiness before marking bootstrap success' (#116) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 8m47s
Reviewed-on: #116
2026-03-04 22:57:39 +00:00
a70de061b0 fix: wait for Cilium and node readiness before marking bootstrap success
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 18s
Update verification stage to block on cilium daemonset rollout and all nodes reaching Ready. This prevents workflows from reporting success while the cluster is still NotReady immediately after join.
2026-03-04 22:26:43 +00:00
9d98f56725 Merge pull request 'fix: add join preflight ignores for homelab control planes' (#115) from stage into master
All checks were successful
Terraform Apply / Terraform Apply (push) Successful in 44m43s
Reviewed-on: #115
2026-03-04 21:13:02 +00:00
5ddd00f711 fix: add join preflight ignores for homelab control planes
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 16s
Append --ignore-preflight-errors=NumCPU,HTTPProxyCIDR to control-plane join commands and HTTPProxyCIDR to worker joins so kubeadm join does not fail on known single-CPU/proxy CIDR checks in this environment.
2026-03-04 21:09:27 +00:00
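A sketch of appending the role-specific ignores to generated join commands (the function name is illustrative; the flag values are the ones from this commit):

```shell
# kubeadm join fails preflight on single-CPU control planes (NumCPU) and
# on proxy CIDR checks (HTTPProxyCIDR) in this homelab; append the
# matching ignore list per role.
join_with_ignores() {
  local role="$1" join_cmd="$2"
  case "$role" in
    control-plane) echo "$join_cmd --ignore-preflight-errors=NumCPU,HTTPProxyCIDR" ;;
    worker)        echo "$join_cmd --ignore-preflight-errors=HTTPProxyCIDR" ;;
  esac
}
```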
5af4021228 Merge pull request 'fix: require kubelet kubeconfig before starting service' (#114) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 16m56s
Reviewed-on: #114
2026-03-04 20:46:48 +00:00
034869347a fix: require kubelet kubeconfig before starting service
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 17s
Inline kubelet bootstrap/kubeconfig flags in ExecStart and gate startup on /etc/kubernetes/*kubelet.conf in addition to config.yaml. This prevents kubelet entering standalone mode with webhook auth enabled when no client config is present.
2026-03-04 20:45:47 +00:00
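An illustrative systemd sketch of the gating (the real unit is generated from the Nix config; the paths follow the commit message):

```ini
[Unit]
# Refuse to start until kubeadm has written both the kubelet config and
# a client kubeconfig; otherwise kubelet comes up in standalone mode with
# webhook auth enabled and no API server credentials.
ConditionPathExists=/var/lib/kubelet/config.yaml
ConditionPathExistsGlob=/etc/kubernetes/*kubelet.conf
```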
50d0d99332 Merge pull request 'stage' (#113) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 18m7s
Reviewed-on: #113
2026-03-04 19:32:40 +00:00
f0093deedc fix: avoid assigning control-plane VIP as node SSH address
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 15s
Exclude the configured VIP suffix from subnet scans and prefer non-VIP IPs when multiple SSH endpoints resolve to the same node. This prevents cp-1 being discovered as .250 and later failing SSH commands against the floating VIP.
2026-03-04 19:26:37 +00:00
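The non-VIP preference can be sketched as a small filter (the `.250` suffix comes from the runbook's VIP convention; names are illustrative):

```shell
# Drop the floating control-plane VIP from candidate SSH addresses so a
# node is never bound to an address that can move between hosts.
VIP_SUFFIX=250
filter_out_vip() {
  local ip
  for ip in "$@"; do
    [[ "$ip" == *".${VIP_SUFFIX}" ]] || echo "$ip"
  done
}
```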
6b6ca021c9 fix: add kubelet bootstrap kubeconfig args to systemd unit
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 17s
Include KUBELET_KUBECONFIG_ARGS in kubelet ExecStart so kubelet can authenticate with bootstrap-kubelet.conf/kubelet.conf and register node objects during kubeadm init.
2026-03-04 19:26:07 +00:00
c034f7975c Merge pull request 'stage' (#112) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 28m53s
Reviewed-on: #112
2026-03-04 18:51:53 +00:00
90ef0ec33f Merge branch 'master' into stage
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 17s
2026-03-04 18:42:22 +00:00
3281ebd216 Merge pull request 'fix: recover from kubeadm CRISocket node-registration race' (#111) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 18m6s
Reviewed-on: #111
2026-03-04 03:03:17 +00:00
981afc509a Merge pull request 'fix: use kubeadm v1beta4 list format for kubeletExtraArgs' (#110) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 19m48s
Reviewed-on: #110
2026-03-04 02:32:22 +00:00
8aab666fad Merge pull request 'fix: hard reset kubelet identity before kubeadm init' (#109) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 12m25s
Reviewed-on: #109
2026-03-04 01:42:55 +00:00
3fd7ed48b1 Merge pull request 'fix: pin kubeadm init node identity to flake hostname' (#108) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 15m22s
Reviewed-on: #108
2026-03-04 01:18:51 +00:00
99458ca829 Merge pull request 'fix: force fresh kubeadm init after rebuild and make kubelet enable-able' (#107) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 17m1s
Reviewed-on: #107
2026-03-04 00:56:30 +00:00
adc8a620f4 Merge pull request 'fix: force fresh bootstrap stages after rebuild and stabilize join node identity' (#106) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 20m28s
Reviewed-on: #106
2026-03-04 00:32:06 +00:00
f11aadf79c Merge pull request 'fix: map SSH-discovered nodes by VMID when hostnames are generic' (#105) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 27m43s
Reviewed-on: #105
2026-03-03 23:37:45 +00:00
09d2f56967 Merge pull request 'fix: make SSH inventory discovery more reliable on CI' (#104) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 8m46s
Reviewed-on: #104
2026-03-03 21:45:57 +00:00
f2b9da8a59 Merge pull request 'fix: run Cilium install with sudo and explicit kubeconfig' (#103) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 3m22s
Reviewed-on: #103
2026-03-03 08:56:49 +00:00
5fa96e27d7 Merge pull request 'fix: ensure kubelet is enabled for kubeadm init node registration' (#102) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 10m43s
Reviewed-on: #102
2026-03-03 01:13:47 +00:00
31017b5c3e Merge pull request 'fix: rebuild nodes by default on reconcile' (#101) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 13m53s
Reviewed-on: #101
2026-03-03 00:46:26 +00:00
f53d087c9c Merge pull request 'fix: use valid kube-vip log flag value' (#100) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 6m29s
Reviewed-on: #100
2026-03-03 00:26:08 +00:00
0e0643a6fc Merge pull request 'refactor: add Python bootstrap controller with resumable state' (#99) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 11m46s
Reviewed-on: #99
2026-03-03 00:10:19 +00:00
7a0016b003 Merge pull request 'fix: preserve kube-vip mount path and only swap hostPath to super-admin' (#98) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Has been cancelled
Reviewed-on: #98
2026-03-03 00:00:48 +00:00
e5162c220c Merge pull request 'fix: bootstrap kube-vip without leader election' (#97) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 17m12s
Reviewed-on: #97
2026-03-02 23:31:52 +00:00
84513f4bb8 Merge pull request 'fix: run kube-vip in control-plane-only mode during bootstrap' (#96) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 16m50s
Reviewed-on: #96
2026-03-02 22:53:22 +00:00
678b383063 Merge pull request 'stage' (#95) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 17m14s
Reviewed-on: #95
2026-03-02 22:33:27 +00:00
d86b0a32a2 Merge pull request 'fix: stabilize kubeadm bootstrap and reduce Proxmox plan latency' (#94) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 16m3s
Reviewed-on: #94
2026-03-02 22:13:28 +00:00
6c7182b8f5 Merge pull request 'fix: run kube-vip daemon before kubeadm init' (#93) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 24m52s
Reviewed-on: #93
2026-03-02 21:02:11 +00:00
8b15f061bc Merge pull request 'fix: skip kubeadm wait-control-plane phase, wait for VIP manually' (#92) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 23m51s
Reviewed-on: #92
2026-03-02 19:42:56 +00:00
c91d28a5dc Merge pull request 'fix: add image pre-pull and debug output for kubeadm init' (#91) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 26m27s
Reviewed-on: #91
2026-03-02 18:36:46 +00:00
cfdfab3ec0 Merge pull request 'fix: disable webhook authz and clean stale kubelet configs' (#90) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 25m1s
Reviewed-on: #90
2026-03-02 18:01:33 +00:00
cec60c003c Merge pull request 'fix: disable kubelet webhook auth in kubeadm init config' (#89) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 25m1s
Reviewed-on: #89
2026-03-02 16:50:31 +00:00
6cc57f8b0e Merge pull request 'fix: kubelet directories and containerd readiness' (#88) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 24m54s
Reviewed-on: #88
2026-03-02 14:45:54 +00:00
9d17dd17cc Merge pull request 'fix: remove kubelet ConditionPathExists, add daemon-reload' (#87) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 25m5s
Reviewed-on: #87
2026-03-02 14:01:06 +00:00
23d61a6308 Merge pull request 'fix: mask kubelet before rebuild, unmask in kubeadm helpers' (#86) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 24m58s
Reviewed-on: #86
2026-03-02 12:54:37 +00:00
198c147b79 Merge pull request 'fix: prevent kubelet auto-start during rebuild' (#85) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 18m58s
Reviewed-on: #85
2026-03-02 12:14:38 +00:00
3b03e68f3e Merge pull request 'fix: disable lingering kubelet service before node rebuild' (#84) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 18m50s
Reviewed-on: #84
2026-03-02 10:09:20 +00:00
92759407a6 Merge pull request 'fix: stop auto-enabling kubelet during base node rebuild' (#83) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 19m4s
Reviewed-on: #83
2026-03-02 09:17:26 +00:00
03c6d0454a Merge pull request 'fix: gate kubelet startup until kubeadm config exists' (#82) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 18m56s
Reviewed-on: #82
2026-03-02 08:40:39 +00:00
b8bd9686d3 Merge pull request 'fix: align kubelet systemd unit with kubeadm flags' (#81) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 18m42s
Reviewed-on: #81
2026-03-02 03:48:09 +00:00
cfd72fa750 Merge pull request 'fix: ignore kubeadm HTTPProxyCIDR preflight in homelab workflow' (#80) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 28m13s
Reviewed-on: #80
2026-03-02 03:10:37 +00:00
3ed3381140 Merge pull request 'fix: run kubeadm init/reset with clean environment' (#79) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 20m22s
Reviewed-on: #79
2026-03-02 02:39:27 +00:00
4569fcd2ea Merge pull request 'fix: harden kubeadm scripts for proxy and preflight issues' (#78) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 20m33s
Reviewed-on: #78
2026-03-02 02:09:11 +00:00
f7f3c7df3e Merge pull request 'fix: avoid sudo env loss for kube-vip image reference' (#77) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 20m59s
Reviewed-on: #77
2026-03-02 01:32:53 +00:00
766cd5db4f Merge pull request 'fix: correctly propagate remote command exit status' (#76) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 19m10s
Reviewed-on: #76
2026-03-02 01:04:44 +00:00
5fe36d0963 Merge pull request 'chore: trigger workflows' (#75) from stage into master
All checks were successful
Terraform Apply / Terraform Apply (push) Successful in 19m29s
Reviewed-on: #75
2026-03-02 00:18:38 +00:00
8103b02883 Merge pull request 'fix: require admin kubeconfig before skipping cp init' (#74) from stage into master
All checks were successful
Terraform Apply / Terraform Apply (push) Successful in 19m40s
Reviewed-on: #74
2026-03-01 23:43:29 +00:00
6262f61506 Merge pull request 'fix: make cp-1 init detection and join token generation robust' (#73) from stage into master
All checks were successful
Terraform Apply / Terraform Apply (push) Successful in 19m26s
Reviewed-on: #73
2026-03-01 22:40:10 +00:00
15 changed files with 450 additions and 197 deletions

View File

@@ -27,7 +27,7 @@ jobs:
           fi
       - name: Checkout repository
-        uses: https://gitea.com/actions/checkout@v4
+        uses: actions/checkout@v4
       - name: Create SSH key
         run: |
@@ -103,25 +103,9 @@ jobs:
       - name: Create kubeadm inventory
         env:
           KUBEADM_SSH_USER: ${{ secrets.KUBEADM_SSH_USER }}
-          KUBEADM_SUBNET_PREFIX: ${{ secrets.KUBEADM_SUBNET_PREFIX }}
         run: |
           set -euo pipefail
-          TF_OUTPUT_JSON=""
-          for attempt in 1 2 3 4 5 6; do
-            echo "Inventory render attempt $attempt/6"
-            TF_OUTPUT_JSON="$(terraform -chdir=terraform output -json)"
-            if printf '%s' "$TF_OUTPUT_JSON" | ./nixos/kubeadm/scripts/render-inventory-from-tf-output.py > nixos/kubeadm/scripts/inventory.env; then
-              exit 0
-            fi
-            if [ "$attempt" -lt 6 ]; then
-              echo "VM IPv4s not available yet; waiting 30s before retry"
-              sleep 30
-            fi
-          done
-          echo "Falling back to SSH-based inventory discovery"
-          printf '%s' "$TF_OUTPUT_JSON" | ./nixos/kubeadm/scripts/discover-inventory-from-ssh.py > nixos/kubeadm/scripts/inventory.env
+          terraform -chdir=terraform output -json | ./nixos/kubeadm/scripts/render-inventory-from-tf-output.py > nixos/kubeadm/scripts/inventory.env
       - name: Validate nix installation
         run: |

View File

@@ -27,7 +27,7 @@ jobs:
           fi
       - name: Checkout repository
-        uses: https://gitea.com/actions/checkout@v4
+        uses: actions/checkout@v4
       - name: Create SSH key
         run: |
@@ -103,25 +103,9 @@ jobs:
       - name: Create kubeadm inventory
         env:
           KUBEADM_SSH_USER: ${{ secrets.KUBEADM_SSH_USER }}
-          KUBEADM_SUBNET_PREFIX: ${{ secrets.KUBEADM_SUBNET_PREFIX }}
         run: |
           set -euo pipefail
-          TF_OUTPUT_JSON=""
-          for attempt in 1 2 3 4 5 6; do
-            echo "Inventory render attempt $attempt/6"
-            TF_OUTPUT_JSON="$(terraform -chdir=terraform output -json)"
-            if printf '%s' "$TF_OUTPUT_JSON" | ./nixos/kubeadm/scripts/render-inventory-from-tf-output.py > nixos/kubeadm/scripts/inventory.env; then
-              exit 0
-            fi
-            if [ "$attempt" -lt 6 ]; then
-              echo "VM IPv4s not available yet; waiting 30s before retry"
-              sleep 30
-            fi
-          done
-          echo "Falling back to SSH-based inventory discovery"
-          printf '%s' "$TF_OUTPUT_JSON" | ./nixos/kubeadm/scripts/discover-inventory-from-ssh.py > nixos/kubeadm/scripts/inventory.env
+          terraform -chdir=terraform output -json | ./nixos/kubeadm/scripts/render-inventory-from-tf-output.py > nixos/kubeadm/scripts/inventory.env
       - name: Run cluster reset
         run: |

View File

@@ -16,7 +16,7 @@ jobs:
     steps:
       - name: Checkout repository
-        uses: https://gitea.com/actions/checkout@v4
+        uses: actions/checkout@v4
       - name: Create secrets.tfvars
         working-directory: terraform
@@ -151,25 +151,9 @@ jobs:
       - name: Create kubeadm inventory from Terraform outputs
         env:
           KUBEADM_SSH_USER: ${{ secrets.KUBEADM_SSH_USER }}
-          KUBEADM_SUBNET_PREFIX: ${{ secrets.KUBEADM_SUBNET_PREFIX }}
         run: |
           set -euo pipefail
-          TF_OUTPUT_JSON=""
-          for attempt in 1 2 3 4 5 6; do
-            echo "Inventory render attempt $attempt/6"
-            TF_OUTPUT_JSON="$(terraform -chdir=terraform output -json)"
-            if printf '%s' "$TF_OUTPUT_JSON" | ./nixos/kubeadm/scripts/render-inventory-from-tf-output.py > nixos/kubeadm/scripts/inventory.env; then
-              exit 0
-            fi
-            if [ "$attempt" -lt 6 ]; then
-              echo "VM IPv4s not available yet; waiting 30s before retry"
-              sleep 30
-            fi
-          done
-          echo "Falling back to SSH-based inventory discovery"
-          printf '%s' "$TF_OUTPUT_JSON" | ./nixos/kubeadm/scripts/discover-inventory-from-ssh.py > nixos/kubeadm/scripts/inventory.env
+          terraform -chdir=terraform output -json | ./nixos/kubeadm/scripts/render-inventory-from-tf-output.py > nixos/kubeadm/scripts/inventory.env
       - name: Ensure nix and nixos-rebuild
         env:

View File

@@ -36,7 +36,7 @@ jobs:
           fi
       - name: Checkout repository
-        uses: https://gitea.com/actions/checkout@v4
+        uses: actions/checkout@v4
       - name: Create Terraform secret files
         working-directory: terraform
@@ -77,13 +77,13 @@ jobs:
           set -euo pipefail
           case "${{ inputs.target }}" in
             all)
-              TF_PLAN_CMD="terraform plan -parallelism=1 -destroy -out=tfdestroy"
+              TF_PLAN_CMD="terraform plan -refresh=false -parallelism=1 -destroy -out=tfdestroy"
               ;;
             control-planes)
-              TF_PLAN_CMD="terraform plan -parallelism=1 -destroy -target=proxmox_vm_qemu.control_planes -out=tfdestroy"
+              TF_PLAN_CMD="terraform plan -refresh=false -parallelism=1 -destroy -target=proxmox_vm_qemu.control_planes -out=tfdestroy"
               ;;
             workers)
-              TF_PLAN_CMD="terraform plan -parallelism=1 -destroy -target=proxmox_vm_qemu.workers -out=tfdestroy"
+              TF_PLAN_CMD="terraform plan -refresh=false -parallelism=1 -destroy -target=proxmox_vm_qemu.workers -out=tfdestroy"
               ;;
             *)
               echo "Invalid destroy target: ${{ inputs.target }}"

View File

@@ -17,7 +17,7 @@ jobs:
     steps:
       - name: Checkout repository
-        uses: https://gitea.com/actions/checkout@v4
+        uses: actions/checkout@v4
      - name: Create secrets.tfvars
        working-directory: terraform

View File

@@ -50,7 +50,7 @@ sudo nixos-rebuild switch --flake .#cp-1
For remote target-host workflows, use your preferred deploy wrapper later For remote target-host workflows, use your preferred deploy wrapper later
(`nixos-rebuild --target-host ...` or deploy-rs/colmena). (`nixos-rebuild --target-host ...` or deploy-rs/colmena).
## Bootstrap runbook (kubeadm + kube-vip + Cilium) ## Bootstrap runbook (kubeadm + kube-vip + Flannel)
1. Apply Nix config on all nodes (`cp-*`, then `wk-*`). 1. Apply Nix config on all nodes (`cp-*`, then `wk-*`).
2. On `cp-1`, run: 2. On `cp-1`, run:
@@ -62,14 +62,10 @@ sudo th-kubeadm-init
This infers the control-plane VIP as `<node-subnet>.250` on `eth0`, creates the This infers the control-plane VIP as `<node-subnet>.250` on `eth0`, creates the
kube-vip static pod manifest, and runs `kubeadm init`. kube-vip static pod manifest, and runs `kubeadm init`.
3. Install Cilium from `cp-1`: 3. Install Flannel from `cp-1`:
```bash ```bash
helm repo add cilium https://helm.cilium.io kubectl apply -f https://raw.githubusercontent.com/flannel-io/flannel/v0.25.5/Documentation/kube-flannel.yml
helm repo update
helm upgrade --install cilium cilium/cilium \
--namespace kube-system \
--set kubeProxyReplacement=true
``` ```
4. Generate join commands on `cp-1`: 4. Generate join commands on `cp-1`:
@@ -98,7 +94,7 @@ kubectl get nodes -o wide
kubectl -n kube-system get pods -o wide kubectl -n kube-system get pods -o wide
``` ```
-## Repeatable rebuild flow (recommended)
+## Fresh bootstrap flow (recommended)

 1. Copy and edit inventory:

@@ -107,7 +103,7 @@ cp ./scripts/inventory.example.env ./scripts/inventory.env
 $EDITOR ./scripts/inventory.env
 ```
-2. Rebuild all nodes and bootstrap/reconcile cluster:
+2. Rebuild all nodes and bootstrap a fresh cluster:
 ```bash
 ./scripts/rebuild-and-bootstrap.sh

@@ -141,15 +137,15 @@ For a full nuke/recreate lifecycle:
 - run Terraform destroy/apply for VMs first,
 - then run `./scripts/rebuild-and-bootstrap.sh` again.

-Node lists are discovered from Terraform outputs, so adding new workers/control
-planes in Terraform is picked up automatically by the bootstrap/reconcile flow.
+Node lists now come directly from static Terraform outputs, so bootstrap no longer
+depends on Proxmox guest-agent IP discovery or SSH subnet scanning.

 ## Optional Gitea workflow automation

 Primary flow:
 - Push to `master` triggers `.gitea/workflows/terraform-apply.yml`
-- That workflow now does Terraform apply and then runs kubeadm rebuild/bootstrap reconciliation automatically
+- That workflow now does Terraform apply and then runs a fresh kubeadm bootstrap automatically

 Manual dispatch workflows are available:

@@ -164,9 +160,7 @@ Required repository secrets:
 Optional secrets:
 - `KUBEADM_SSH_USER` (defaults to `micqdf`)
-- `KUBEADM_SUBNET_PREFIX` (optional, e.g. `10.27.27`; used for SSH-based IP discovery fallback)
-Node IPs are auto-discovered from Terraform state outputs (`control_plane_vm_ipv4`, `worker_vm_ipv4`), so you do not need per-node IP secrets.
+Node IPs are rendered directly from static Terraform outputs (`control_plane_vm_ipv4`, `worker_vm_ipv4`), so you do not need per-node IP secrets or SSH discovery fallbacks.

 ## Notes
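The inventory step above assumes `scripts/inventory.env` is a flat KEY=VALUE file whose variables the controller later reads via `self.env.get` (names like `SSH_USER`, `WORKER_PARALLELISM`, `SSH_READY_RETRIES` appear in the controller diff below; the parser itself is illustrative, since the actual loader is not shown in this diff):

```python
def parse_env_file(text: str) -> dict:
    """Parse simple KEY=VALUE lines, skipping blanks and comments."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip().strip('"')
    return env


# Illustrative contents; the real inventory.example.env may differ.
sample = """
# scripts/inventory.env
SSH_USER=micqdf
WORKER_PARALLELISM=3
SSH_READY_RETRIES=20
SSH_READY_DELAY_SEC=15
"""
env = parse_env_file(sample)
print(env["SSH_USER"])  # micqdf
```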


@@ -11,9 +11,6 @@ from concurrent.futures import ThreadPoolExecutor, as_completed
 from pathlib import Path

-REMOTE_STATE_PATH = "/var/lib/terrahome/bootstrap-state.json"
-
 def run_local(cmd, check=True, capture=False):
     if isinstance(cmd, str):
         shell = True

@@ -102,7 +99,6 @@ class Controller:
         self.script_dir = Path(__file__).resolve().parent
         self.flake_dir = Path(self.env.get("FLAKE_DIR") or (self.script_dir.parent)).resolve()
-        self.local_state_path = self.script_dir / "bootstrap-state-last.json"
         self.ssh_user = self.env.get("SSH_USER", "micqdf")
         self.ssh_candidates = self.env.get("SSH_USER_CANDIDATES", f"root {self.ssh_user}").split()

@@ -114,7 +110,9 @@ class Controller:
             "-o",
             "IdentitiesOnly=yes",
             "-o",
-            "StrictHostKeyChecking=accept-new",
+            "StrictHostKeyChecking=no",
+            "-o",
+            "UserKnownHostsFile=/dev/null",
             "-i",
             self.ssh_key,
         ]

@@ -124,7 +122,9 @@ class Controller:
         self.worker_parallelism = int(self.env.get("WORKER_PARALLELISM", "3"))
         self.fast_mode = self.env.get("FAST_MODE", "1")
         self.skip_rebuild = self.env.get("SKIP_REBUILD", "0") == "1"
-        self.force_reinit = False
+        self.force_reinit = True
+        self.ssh_ready_retries = int(self.env.get("SSH_READY_RETRIES", "20"))
+        self.ssh_ready_delay = int(self.env.get("SSH_READY_DELAY_SEC", "15"))

     def log(self, msg):
         print(f"==> {msg}")

@@ -134,13 +134,26 @@ class Controller:
         return run_local(full, check=check, capture=True)

     def detect_user(self, ip):
-        for user in self.ssh_candidates:
-            proc = self._ssh(user, ip, "true", check=False)
-            if proc.returncode == 0:
-                self.active_ssh_user = user
-                self.log(f"Using SSH user '{user}' for {ip}")
-                return
-        raise RuntimeError(f"Unable to authenticate to {ip} with users: {', '.join(self.ssh_candidates)}")
+        for attempt in range(1, self.ssh_ready_retries + 1):
+            for user in self.ssh_candidates:
+                proc = self._ssh(user, ip, "true", check=False)
+                if proc.returncode == 0:
+                    self.active_ssh_user = user
+                    self.log(f"Using SSH user '{user}' for {ip}")
+                    return
+            if attempt < self.ssh_ready_retries:
+                self.log(
+                    f"SSH not ready on {ip} yet; retrying in {self.ssh_ready_delay}s "
+                    f"({attempt}/{self.ssh_ready_retries})"
+                )
+                time.sleep(self.ssh_ready_delay)
+        raise RuntimeError(
+            "Unable to authenticate to "
+            f"{ip} with users: {', '.join(self.ssh_candidates)}. "
+            "If this is a freshly cloned VM, the Proxmox source template likely does not yet include the "
+            "current cloud-init-capable NixOS template configuration from nixos/template-base. "
+            "Terraform can only clone what exists in Proxmox; it cannot retrofit cloud-init support into an old template."
+        )

     def remote(self, ip, cmd, check=True):
         ordered = [self.active_ssh_user] + [u for u in self.ssh_candidates if u != self.active_ssh_user]

@@ -161,53 +174,7 @@ class Controller:
         return last

     def prepare_known_hosts(self):
-        ssh_dir = Path.home() / ".ssh"
-        ssh_dir.mkdir(parents=True, exist_ok=True)
-        (ssh_dir / "known_hosts").touch()
-        run_local(["chmod", "700", str(ssh_dir)])
-        run_local(["chmod", "600", str(ssh_dir / "known_hosts")])
-        for ip in self.node_ips.values():
-            run_local(["ssh-keygen", "-R", ip], check=False)
-            run_local(f"ssh-keyscan -H {shlex.quote(ip)} >> {shlex.quote(str(ssh_dir / 'known_hosts'))}", check=False)
-
-    def get_state(self):
-        proc = self.remote(
-            self.primary_ip,
-            "sudo test -f /var/lib/terrahome/bootstrap-state.json && sudo cat /var/lib/terrahome/bootstrap-state.json || echo '{}'",
-        )
-        try:
-            state = json.loads(proc.stdout.strip() or "{}")
-        except Exception:
-            state = {}
-        return state
-
-    def set_state(self, state):
-        payload = json.dumps(state, sort_keys=True)
-        b64 = base64.b64encode(payload.encode()).decode()
-        self.remote(
-            self.primary_ip,
-            (
-                "sudo mkdir -p /var/lib/terrahome && "
-                f"echo {shlex.quote(b64)} | base64 -d | sudo tee {REMOTE_STATE_PATH} >/dev/null"
-            ),
-        )
-        self.local_state_path.write_text(payload + "\n", encoding="utf-8")
-
-    def mark_done(self, key):
-        state = self.get_state()
-        state[key] = True
-        state["updated_at"] = int(time.time())
-        self.set_state(state)
-
-    def clear_done(self, keys):
-        state = self.get_state()
-        for key in keys:
-            state.pop(key, None)
-        state["updated_at"] = int(time.time())
-        self.set_state(state)
-
-    def stage_done(self, key):
-        return bool(self.get_state().get(key))
+        pass

     def prepare_remote_nix(self, ip):
         self.remote(ip, "sudo mkdir -p /etc/nix")

@@ -257,15 +224,11 @@ class Controller:
         raise RuntimeError(f"Rebuild failed permanently for {name}")

     def stage_preflight(self):
-        if self.stage_done("preflight_done"):
-            self.log("Preflight already complete")
-            return
         self.prepare_known_hosts()
         self.detect_user(self.primary_ip)
-        self.mark_done("preflight_done")

     def stage_rebuild(self):
-        if self.skip_rebuild and self.stage_done("nodes_rebuilt"):
+        if self.skip_rebuild:
             self.log("Node rebuild already complete")
             return

@@ -299,17 +262,6 @@ class Controller:
         if failures:
             raise RuntimeError(f"Worker rebuild failures: {failures}")

-        # Rebuild can invalidate prior bootstrap stages; force reconciliation.
-        self.force_reinit = True
-        self.clear_done([
-            "primary_initialized",
-            "cni_installed",
-            "control_planes_joined",
-            "workers_joined",
-            "verified",
-        ])
-        self.mark_done("nodes_rebuilt")
-
     def has_admin_conf(self):
         return self.remote(self.primary_ip, "sudo test -f /etc/kubernetes/admin.conf", check=False).returncode == 0

@@ -318,37 +270,52 @@ class Controller:
         return self.remote(self.primary_ip, cmd, check=False).returncode == 0

     def stage_init_primary(self):
-        if (not self.force_reinit) and self.stage_done("primary_initialized") and self.has_admin_conf() and self.cluster_ready():
-            self.log("Primary control plane init already complete")
-            return
-        if (not self.force_reinit) and self.has_admin_conf() and self.cluster_ready():
-            self.log("Existing cluster detected on primary control plane")
-        else:
-            self.log(f"Initializing primary control plane on {self.primary_cp}")
-            self.remote(self.primary_ip, "sudo th-kubeadm-init")
-        self.mark_done("primary_initialized")
+        self.log(f"Initializing primary control plane on {self.primary_cp}")
+        self.remote(self.primary_ip, "sudo th-kubeadm-init")

     def stage_install_cni(self):
-        if self.stage_done("cni_installed") and self.cluster_ready():
-            self.log("CNI install already complete")
-            return
-        self.log("Installing or upgrading Cilium")
-        self.remote(self.primary_ip, "sudo helm repo add cilium https://helm.cilium.io >/dev/null 2>&1 || true")
-        self.remote(self.primary_ip, "sudo helm repo update >/dev/null")
-        self.remote(self.primary_ip, "sudo kubectl --kubeconfig /etc/kubernetes/admin.conf create namespace kube-system >/dev/null 2>&1 || true")
-        self.remote(
-            self.primary_ip,
-            "sudo KUBECONFIG=/etc/kubernetes/admin.conf helm upgrade --install cilium cilium/cilium --namespace kube-system --set kubeProxyReplacement=true",
-        )
-        self.mark_done("cni_installed")
+        self.log("Installing Flannel")
+        manifest_path = self.script_dir.parent / "manifests" / "kube-flannel.yml"
+        manifest_b64 = base64.b64encode(manifest_path.read_bytes()).decode()
+        self.remote(
+            self.primary_ip,
+            (
+                "sudo mkdir -p /var/lib/terrahome && "
+                f"echo {shlex.quote(manifest_b64)} | base64 -d | sudo tee /var/lib/terrahome/kube-flannel.yml >/dev/null"
+            ),
+        )
+        self.log("Waiting for API readiness before applying Flannel")
+        ready = False
+        for _ in range(30):
+            if self.cluster_ready():
+                ready = True
+                break
+            time.sleep(10)
+        if not ready:
+            raise RuntimeError("API server did not become ready before Flannel install")
+        last_error = None
+        for attempt in range(1, 6):
+            proc = self.remote(
+                self.primary_ip,
+                "sudo kubectl --kubeconfig /etc/kubernetes/admin.conf apply -f /var/lib/terrahome/kube-flannel.yml",
+                check=False,
+            )
+            if proc.returncode == 0:
+                return
+            last_error = (proc.stdout or "") + ("\n" if proc.stdout and proc.stderr else "") + (proc.stderr or "")
+            self.log(f"Flannel apply attempt {attempt}/5 failed; retrying in 15s")
+            time.sleep(15)
+        raise RuntimeError(f"Flannel apply failed after retries\n{last_error or ''}")

     def cluster_has_node(self, name):
         cmd = f"sudo kubectl --kubeconfig /etc/kubernetes/admin.conf get node {shlex.quote(name)} >/dev/null 2>&1"
         return self.remote(self.primary_ip, cmd, check=False).returncode == 0

     def build_join_cmds(self):
-        if not self.has_admin_conf():
-            self.remote(self.primary_ip, "sudo th-kubeadm-init")
         join_cmd = self.remote(
             self.primary_ip,
             "sudo KUBECONFIG=/etc/kubernetes/admin.conf kubeadm token create --print-join-command",

@@ -361,9 +328,6 @@ class Controller:
         return join_cmd, cp_join

     def stage_join_control_planes(self):
-        if self.stage_done("control_planes_joined"):
-            self.log("Control-plane join already complete")
-            return
         _, cp_join = self.build_join_cmds()
         for node in self.cp_names:
             if node == self.primary_cp:

@@ -373,14 +337,10 @@ class Controller:
                 continue
             self.log(f"Joining control plane {node}")
             ip = self.node_ips[node]
-            node_join = f"{cp_join} --node-name {node}"
+            node_join = f"{cp_join} --node-name {node} --ignore-preflight-errors=NumCPU,HTTPProxyCIDR"
             self.remote(ip, f"sudo th-kubeadm-join-control-plane {shlex.quote(node_join)}")
-        self.mark_done("control_planes_joined")

     def stage_join_workers(self):
-        if self.stage_done("workers_joined"):
-            self.log("Worker join already complete")
-            return
         join_cmd, _ = self.build_join_cmds()
         for node in self.wk_names:
             if self.cluster_has_node(node):

@@ -388,18 +348,55 @@ class Controller:
                 continue
             self.log(f"Joining worker {node}")
             ip = self.node_ips[node]
-            node_join = f"{join_cmd} --node-name {node}"
+            node_join = f"{join_cmd} --node-name {node} --ignore-preflight-errors=HTTPProxyCIDR"
            self.remote(ip, f"sudo th-kubeadm-join-worker {shlex.quote(node_join)}")
-        self.mark_done("workers_joined")

     def stage_verify(self):
-        if self.stage_done("verified"):
-            self.log("Verification already complete")
-            return
         self.log("Final node verification")
+        try:
+            self.remote(
+                self.primary_ip,
+                "sudo kubectl --kubeconfig /etc/kubernetes/admin.conf -n kube-flannel rollout status ds/kube-flannel-ds --timeout=10m",
+            )
+        except Exception:
+            self.log("Flannel rollout failed; collecting diagnostics")
+            proc = self.remote(
+                self.primary_ip,
+                "sudo kubectl --kubeconfig /etc/kubernetes/admin.conf -n kube-flannel get ds -o wide || true",
+                check=False,
+            )
+            print(proc.stdout)
+            proc = self.remote(
+                self.primary_ip,
+                "sudo kubectl --kubeconfig /etc/kubernetes/admin.conf -n kube-flannel get pods -o wide || true",
+                check=False,
+            )
+            print(proc.stdout)
+            proc = self.remote(
+                self.primary_ip,
+                "for p in $(sudo kubectl --kubeconfig /etc/kubernetes/admin.conf -n kube-flannel get pods -o name 2>/dev/null); do echo \"--- describe $p ---\"; sudo kubectl --kubeconfig /etc/kubernetes/admin.conf -n kube-flannel describe $p || true; done",
+                check=False,
+            )
+            print(proc.stdout)
+            proc = self.remote(
+                self.primary_ip,
+                "for p in $(sudo kubectl --kubeconfig /etc/kubernetes/admin.conf -n kube-flannel get pods -o name 2>/dev/null); do echo \"--- logs $p kube-flannel ---\"; sudo kubectl --kubeconfig /etc/kubernetes/admin.conf -n kube-flannel logs $p -c kube-flannel --tail=120 || true; echo \"--- logs $p install-cni-plugin ---\"; sudo kubectl --kubeconfig /etc/kubernetes/admin.conf -n kube-flannel logs $p -c install-cni-plugin --tail=120 || true; echo \"--- logs $p install-cni ---\"; sudo kubectl --kubeconfig /etc/kubernetes/admin.conf -n kube-flannel logs $p -c install-cni --tail=120 || true; done",
+                check=False,
+            )
+            print(proc.stdout)
+            proc = self.remote(
+                self.primary_ip,
+                "for p in $(sudo kubectl --kubeconfig /etc/kubernetes/admin.conf -n kube-flannel get pods -o name 2>/dev/null); do sudo kubectl --kubeconfig /etc/kubernetes/admin.conf -n kube-flannel logs --tail=120 $p || true; done",
+                check=False,
+            )
+            print(proc.stdout)
+            raise
+        self.remote(
+            self.primary_ip,
+            "sudo kubectl --kubeconfig /etc/kubernetes/admin.conf wait --for=condition=Ready nodes --all --timeout=10m",
+        )
         proc = self.remote(self.primary_ip, "sudo kubectl --kubeconfig /etc/kubernetes/admin.conf get nodes -o wide")
         print(proc.stdout)
-        self.mark_done("verified")

     def reconcile(self):
         self.stage_preflight()
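The rewritten `stage_install_cni` ships the manifest by base64-encoding it and piping it through `base64 -d | sudo tee` on the node, which sidesteps quoting problems in the remote shell. A minimal local sketch of the same encode/decode round trip (no SSH or sudo; the temp path is illustrative):

```python
import base64
import os
import shlex
import subprocess
import tempfile

payload = b"apiVersion: v1\nkind: Namespace\n"
b64 = base64.b64encode(payload).decode()

# Same command shape as the controller uses remotely, minus ssh/sudo:
# decode the base64 stream into a target file.
target = os.path.join(tempfile.mkdtemp(), "kube-flannel.yml")
subprocess.run(
    f"echo {shlex.quote(b64)} | base64 -d > {shlex.quote(target)}",
    shell=True,
    check=True,
)

with open(target, "rb") as f:
    assert f.read() == payload  # round trip is byte-exact
```

Because only the base64 alphabet crosses the shell boundary, the manifest's quotes, `$` signs, and newlines survive untouched.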


@@ -0,0 +1,212 @@
---
kind: Namespace
apiVersion: v1
metadata:
  name: kube-flannel
  labels:
    k8s-app: flannel
    pod-security.kubernetes.io/enforce: privileged
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  labels:
    k8s-app: flannel
  name: flannel
rules:
- apiGroups:
  - ""
  resources:
  - pods
  verbs:
  - get
- apiGroups:
  - ""
  resources:
  - nodes
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - nodes/status
  verbs:
  - patch
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  labels:
    k8s-app: flannel
  name: flannel
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: flannel
subjects:
- kind: ServiceAccount
  name: flannel
  namespace: kube-flannel
---
apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    k8s-app: flannel
  name: flannel
  namespace: kube-flannel
---
kind: ConfigMap
apiVersion: v1
metadata:
  name: kube-flannel-cfg
  namespace: kube-flannel
  labels:
    tier: node
    k8s-app: flannel
    app: flannel
data:
  cni-conf.json: |
    {
      "name": "cbr0",
      "cniVersion": "0.3.1",
      "plugins": [
        {
          "type": "flannel",
          "delegate": {
            "hairpinMode": true,
            "isDefaultGateway": true
          }
        },
        {
          "type": "portmap",
          "capabilities": {
            "portMappings": true
          }
        }
      ]
    }
  net-conf.json: |
    {
      "Network": "10.244.0.0/16",
      "EnableNFTables": false,
      "Backend": {
        "Type": "vxlan"
      }
    }
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: kube-flannel-ds
  namespace: kube-flannel
  labels:
    tier: node
    app: flannel
    k8s-app: flannel
spec:
  selector:
    matchLabels:
      app: flannel
  template:
    metadata:
      labels:
        tier: node
        app: flannel
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: kubernetes.io/os
                operator: In
                values:
                - linux
      hostNetwork: true
      priorityClassName: system-node-critical
      tolerations:
      - operator: Exists
        effect: NoSchedule
      serviceAccountName: flannel
      initContainers:
      - name: install-cni-plugin
        image: docker.io/flannel/flannel-cni-plugin:v1.5.1-flannel1
        command:
        - cp
        args:
        - -f
        - /flannel
        - /opt/cni/bin/flannel
        volumeMounts:
        - name: cni-plugin
          mountPath: /opt/cni/bin
      - name: install-cni
        image: docker.io/flannel/flannel:v0.25.5
        command:
        - cp
        args:
        - -f
        - /etc/kube-flannel/cni-conf.json
        - /etc/cni/net.d/10-flannel.conflist
        volumeMounts:
        - name: cni
          mountPath: /etc/cni/net.d
        - name: flannel-cfg
          mountPath: /etc/kube-flannel/
      containers:
      - name: kube-flannel
        image: docker.io/flannel/flannel:v0.25.5
        command:
        - /opt/bin/flanneld
        args:
        - --ip-masq
        - --kube-subnet-mgr
        resources:
          requests:
            cpu: "100m"
            memory: "50Mi"
        securityContext:
          privileged: false
          capabilities:
            add: ["NET_ADMIN", "NET_RAW"]
        env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: POD_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        - name: EVENT_QUEUE_DEPTH
          value: "5000"
        volumeMounts:
        - name: run
          mountPath: /run/flannel
        - name: flannel-cfg
          mountPath: /etc/kube-flannel/
        - name: xtables-lock
          mountPath: /run/xtables.lock
      volumes:
      - name: run
        hostPath:
          path: /run/flannel
          type: DirectoryOrCreate
      - name: cni-plugin
        hostPath:
          path: /opt/cni/bin
          type: DirectoryOrCreate
      - name: cni
        hostPath:
          path: /etc/cni/net.d
          type: DirectoryOrCreate
      - name: flannel-cfg
        configMap:
          name: kube-flannel-cfg
      - name: xtables-lock
        hostPath:
          path: /run/xtables.lock
          type: FileOrCreate
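The `Network` in `net-conf.json` above must agree with the pod CIDR the cluster was initialized with (`10.244.0.0/16` is flannel's default, and kubeadm's `--pod-network-cidr` has to match it, or per-node subnets fall outside the overlay). A quick consistency-check sketch (the per-node podCIDR value is illustrative):

```python
import ipaddress
import json

# The net-conf.json payload from the ConfigMap above.
net_conf = json.loads("""
{
  "Network": "10.244.0.0/16",
  "EnableNFTables": false,
  "Backend": { "Type": "vxlan" }
}
""")

flannel_net = ipaddress.ip_network(net_conf["Network"])

# Example per-node podCIDR as the controller-manager might allocate it
# (illustrative; read the real value from `kubectl get node -o jsonpath=...`).
node_pod_cidr = ipaddress.ip_network("10.244.1.0/24")

# Every node's allocated subnet must sit inside flannel's overlay network.
assert node_pod_cidr.subnet_of(flannel_net)
```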


@@ -338,12 +338,16 @@ in
           > /etc/kubernetes/manifests/kube-vip.yaml
         rm -f /var/lib/kubelet/config.yaml /var/lib/kubelet/kubeadm-flags.env
+        rm -f /etc/kubernetes/kubelet.conf /etc/kubernetes/bootstrap-kubelet.conf
+        rm -f /var/lib/kubelet/kubeconfig /var/lib/kubelet/instance-config.yaml
+        rm -rf /var/lib/kubelet/pki
         systemctl unmask kubelet || true
         systemctl stop kubelet || true
         systemctl enable kubelet || true
         systemctl reset-failed kubelet || true
         systemctl daemon-reload
+        env -i PATH=/run/current-system/sw/bin:/usr/bin:/bin kubeadm reset -f || true
         eval "$1"
       '')

@@ -356,12 +360,16 @@ in
         fi
         rm -f /var/lib/kubelet/config.yaml /var/lib/kubelet/kubeadm-flags.env
+        rm -f /etc/kubernetes/kubelet.conf /etc/kubernetes/bootstrap-kubelet.conf
+        rm -f /var/lib/kubelet/kubeconfig /var/lib/kubelet/instance-config.yaml
+        rm -rf /var/lib/kubelet/pki
         systemctl unmask kubelet || true
         systemctl stop kubelet || true
         systemctl enable kubelet || true
         systemctl reset-failed kubelet || true
         systemctl daemon-reload
+        env -i PATH=/run/current-system/sw/bin:/usr/bin:/bin kubeadm reset -f || true
         eval "$1"
       '')

@@ -376,6 +384,7 @@ in
   systemd.services.kubelet = {
     description = "Kubernetes Kubelet";
     wantedBy = [ "multi-user.target" ];
+    path = [ pkgs.util-linux ];
     wants = [ "network-online.target" ];
     after = [ "containerd.service" "network-online.target" ];
     serviceConfig = {

@@ -388,18 +397,22 @@ in
         "-/var/lib/kubelet/kubeadm-flags.env"
         "-/etc/default/kubelet"
       ];
-      ExecStart = "${pinnedK8s}/bin/kubelet \$KUBELET_CONFIG_ARGS \$KUBELET_KUBEADM_ARGS \$KUBELET_EXTRA_ARGS";
+      ExecStart = "${pinnedK8s}/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf \$KUBELET_CONFIG_ARGS \$KUBELET_KUBEADM_ARGS \$KUBELET_EXTRA_ARGS";
       Restart = "on-failure";
       RestartSec = "10";
     };
     unitConfig = {
       ConditionPathExists = "/var/lib/kubelet/config.yaml";
+      ConditionPathExistsGlob = "/etc/kubernetes/*kubelet.conf";
     };
   };

   systemd.tmpfiles.rules = [
     "d /etc/kubernetes 0755 root root -"
     "d /etc/kubernetes/manifests 0755 root root -"
+    "d /etc/cni/net.d 0755 root root -"
+    "d /opt/cni/bin 0755 root root -"
+    "d /run/flannel 0755 root root -"
     "d /var/lib/kubelet 0755 root root -"
     "d /var/lib/kubelet/pki 0755 root root -"
   ];


@@ -96,8 +96,19 @@ def main() -> int:
     prefix = derive_prefix(payload)
     start = int(os.environ.get("KUBEADM_SUBNET_START", "2"))
     end = int(os.environ.get("KUBEADM_SUBNET_END", "254"))
+    vip_suffix = int(os.environ.get("KUBEADM_CONTROL_PLANE_VIP_SUFFIX", "250"))

-    scan_ips = [str(ipaddress.IPv4Address(f"{prefix}.{i}")) for i in range(start, end + 1)]
+    def is_vip_ip(ip: str) -> bool:
+        try:
+            return int(ip.split(".")[-1]) == vip_suffix
+        except Exception:
+            return False
+
+    scan_ips = [
+        str(ipaddress.IPv4Address(f"{prefix}.{i}"))
+        for i in range(start, end + 1)
+        if i != vip_suffix
+    ]
     found: Dict[str, str] = {}
     vmid_to_name: Dict[str, str] = {}
     for name, vmid in payload.get("control_plane_vm_ids", {}).get("value", {}).items():

@@ -106,6 +117,7 @@ def main() -> int:
         vmid_to_name[str(vmid)] = name

     seen_hostnames: Dict[str, str] = {}
+    seen_ips: Dict[str, Tuple[str, str]] = {}

     def run_pass(pass_timeout: int, pass_workers: int) -> None:
         with concurrent.futures.ThreadPoolExecutor(max_workers=pass_workers) as pool:

@@ -117,12 +129,19 @@ def main() -> int:
                 host, ip, serial = result
                 if host not in seen_hostnames:
                     seen_hostnames[host] = ip
-                if host in target_names and host not in found:
-                    found[host] = ip
-                elif serial in vmid_to_name:
-                    inferred = vmid_to_name[serial]
-                    if inferred not in found:
-                        found[inferred] = ip
+                if ip not in seen_ips:
+                    seen_ips[ip] = (host, serial)
+                target = None
+                if serial in vmid_to_name:
+                    inferred = vmid_to_name[serial]
+                    target = inferred
+                elif host in target_names:
+                    target = host
+                if target:
+                    existing = found.get(target)
+                    if existing is None or (is_vip_ip(existing) and not is_vip_ip(ip)):
+                        found[target] = ip
                 if all(name in found for name in target_names):
                     return

@@ -131,11 +150,25 @@ def main() -> int:
     # Slower second pass for busy runners/networks.
     run_pass(max(timeout_sec + 2, 8), max(8, max_workers // 2))

+    # Heuristic fallback: if nodes still missing, assign from remaining SSH-reachable
+    # IPs not already used, ordered by IP. This helps when cloned nodes temporarily
+    # share a generic hostname (e.g. "flex") and DMI serial mapping is unavailable.
+    missing = sorted([n for n in target_names if n not in found])
+    if missing:
+        used_ips = set(found.values())
+        candidates = sorted(ip for ip in seen_ips.keys() if ip not in used_ips)
+        if len(candidates) >= len(missing):
+            for name, ip in zip(missing, candidates):
+                found[name] = ip
+
     missing = sorted([n for n in target_names if n not in found])
     if missing:
         discovered = ", ".join(sorted(seen_hostnames.keys())[:20])
         if discovered:
             sys.stderr.write(f"Discovered hostnames during scan: {discovered}\n")
+        if seen_ips:
+            sample = ", ".join(f"{ip}={meta[0]}" for ip, meta in list(sorted(seen_ips.items()))[:20])
+            sys.stderr.write(f"SSH-reachable IPs: {sample}\n")
         raise SystemExit(
             "Failed SSH-based IP discovery for nodes: " + ", ".join(missing) +
             f" (scanned {prefix}.{start}-{prefix}.{end})"
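The discovery change above prefers a node's real address over the floating control-plane VIP: a mapping that already holds a VIP-suffixed IP is replaced when a non-VIP IP turns up for the same node. A condensed sketch of just that preference rule (node names and addresses are illustrative):

```python
vip_suffix = 250  # mirrors KUBEADM_CONTROL_PLANE_VIP_SUFFIX's default above

def is_vip_ip(ip: str) -> bool:
    try:
        return int(ip.split(".")[-1]) == vip_suffix
    except ValueError:
        return False

found = {}

def record(target: str, ip: str) -> None:
    # Keep an existing mapping unless it points at the VIP and a
    # non-VIP address for the same node has now been discovered.
    existing = found.get(target)
    if existing is None or (is_vip_ip(existing) and not is_vip_ip(ip)):
        found[target] = ip

record("cp-1", "10.27.27.250")  # the VIP answers the scan first
record("cp-1", "10.27.27.50")   # the node's real address wins
assert found["cp-1"] == "10.27.27.50"
```

Without this rule, whichever control plane currently holds the VIP could be recorded under the wrong node name, since the VIP answers SSH as that node.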


@@ -11,6 +11,7 @@ in
   networking.hostName = "k8s-base-template";
   networking.useDHCP = lib.mkDefault true;
+  networking.useNetworkd = true;
   networking.nameservers = [ "1.1.1.1" "8.8.8.8" ];

   boot.loader.systemd-boot.enable = lib.mkForce false;

@@ -20,6 +21,8 @@ in
   };

   services.qemuGuest.enable = true;
+  services.cloud-init.enable = true;
+  services.cloud-init.network.enable = true;
   services.openssh.enable = true;
   services.openssh.settings = {
     PasswordAuthentication = false;


@@ -9,6 +9,15 @@ terraform {
   }
 }

+locals {
+  control_plane_ipconfig = [
+    for ip in var.control_plane_ips : "ip=${ip}/${var.network_prefix_length},gw=${var.network_gateway}"
+  ]
+  worker_ipconfig = [
+    for ip in var.worker_ips : "ip=${ip}/${var.network_prefix_length},gw=${var.network_gateway}"
+  ]
+}
+
 provider "proxmox" {
   pm_api_url      = var.pm_api_url
   pm_api_token_id = var.pm_api_token_id

@@ -35,7 +44,7 @@ resource "proxmox_vm_qemu" "control_planes" {
   scsihw    = "virtio-scsi-pci"
   boot      = "order=scsi0"
   bootdisk  = "scsi0"
-  ipconfig0 = "ip=dhcp"
+  ipconfig0 = local.control_plane_ipconfig[count.index]
   ciuser    = "micqdf"
   sshkeys   = var.SSH_KEY_PUBLIC

@@ -90,7 +99,7 @@ resource "proxmox_vm_qemu" "workers" {
   scsihw    = "virtio-scsi-pci"
   boot      = "order=scsi0"
   bootdisk  = "scsi0"
-  ipconfig0 = "ip=dhcp"
+  ipconfig0 = local.worker_ipconfig[count.index]
   ciuser    = "micqdf"
   sshkeys   = var.SSH_KEY_PUBLIC
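The locals above render the cloud-init `ipconfig0` strings Proxmox expects, in the form `ip=<addr>/<prefix>,gw=<gateway>`. A Python sketch that mirrors the HCL expression, using the defaults from the tfvars below, makes the rendered values concrete:

```python
# Values taken from the terraform.tfvars defaults in this change.
network_prefix_length = 10
network_gateway = "10.27.27.1"
control_plane_ips = ["10.27.27.50", "10.27.27.51", "10.27.27.49"]

# Mirrors: "ip=${ip}/${var.network_prefix_length},gw=${var.network_gateway}"
control_plane_ipconfig = [
    f"ip={ip}/{network_prefix_length},gw={network_gateway}"
    for ip in control_plane_ips
]
print(control_plane_ipconfig[0])  # ip=10.27.27.50/10,gw=10.27.27.1
```

`count.index` then selects one string per VM, so each clone boots with its static address instead of DHCP.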


@@ -11,8 +11,8 @@ output "control_plane_vm_names" {
 output "control_plane_vm_ipv4" {
   value = {
-    for vm in proxmox_vm_qemu.control_planes :
-    vm.name => vm.default_ipv4_address
+    for i in range(var.control_plane_count) :
+    proxmox_vm_qemu.control_planes[i].name => var.control_plane_ips[i]
   }
 }

@@ -29,7 +29,7 @@ output "worker_vm_names" {
 output "worker_vm_ipv4" {
   value = {
-    for vm in proxmox_vm_qemu.workers :
-    vm.name => vm.default_ipv4_address
+    for i in range(var.worker_count) :
+    proxmox_vm_qemu.workers[i].name => var.worker_ips[i]
   }
 }


@@ -17,3 +17,9 @@ control_plane_disk_size = "80G"
 worker_cores     = [4, 4, 4]
 worker_memory_mb = [12288, 12288, 12288]
 worker_disk_size = "120G"
+network_prefix_length = 10
+network_gateway       = "10.27.27.1"
+control_plane_ips     = ["10.27.27.50", "10.27.27.51", "10.27.27.49"]
+worker_ips            = ["10.27.27.47", "10.27.27.46", "10.27.27.48"]


@@ -87,6 +87,40 @@ variable "worker_disk_size" {
   description = "Disk size for worker VMs"
 }

+variable "network_prefix_length" {
+  type        = number
+  default     = 10
+  description = "CIDR prefix length for static VM addresses"
+}
+
+variable "network_gateway" {
+  type        = string
+  default     = "10.27.27.1"
+  description = "Gateway for static VM addresses"
+}
+
+variable "control_plane_ips" {
+  type        = list(string)
+  default     = ["10.27.27.50", "10.27.27.51", "10.27.27.49"]
+  description = "Static IPv4 addresses for control plane VMs"
+  validation {
+    condition     = length(var.control_plane_ips) == 3
+    error_message = "control_plane_ips must contain exactly 3 IPs."
+  }
+}
+
+variable "worker_ips" {
+  type        = list(string)
+  default     = ["10.27.27.47", "10.27.27.46", "10.27.27.48"]
+  description = "Static IPv4 addresses for worker VMs"
+  validation {
+    condition     = length(var.worker_ips) == 3
+    error_message = "worker_ips must contain exactly 3 IPs."
+  }
+}
+
 variable "bridge" {
   type = string
 }