Merge pull request 'fix: ignore stale SSH host keys for ephemeral homelab VMs' (#130) from stage into master

Reviewed-on: #130
fix: ignore stale SSH host keys for ephemeral homelab VMs
2026-03-09 03:45:11 +00:00 · 2026-03-09 03:16:18 +00:00 · 2026-03-08 22:03:17 +00:00 · 2026-03-08 20:12:03 +00:00 · 2026-03-08 18:06:46 +00:00 · 2026-03-08 13:36:21 +00:00
15 changed files with 499 additions and 225 deletions
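
The fix named in the title comes down to telling `ssh` not to consult or update a known-hosts file for VMs that are destroyed and recreated at the same IPs. The sketch below is a minimal illustration under stated assumptions, not the repository's controller code: `build_ssh_base_cmd` and `ssh_argv` are hypothetical helper names, and the key path, user, and IP are placeholders. Only standard OpenSSH client options are used. Skipping host-key verification is a trade-off that is only reasonable for short-lived machines on a trusted network.

```python
from typing import List


def build_ssh_base_cmd(key_path: str, connect_timeout: int = 10) -> List[str]:
    """Return an ssh argv prefix that ignores stale host keys.

    Hypothetical helper for illustration only.
    """
    return [
        "ssh",
        "-o", "BatchMode=yes",                 # never prompt for passwords
        "-o", "IdentitiesOnly=yes",            # offer only the key given with -i
        "-o", "StrictHostKeyChecking=no",      # do not fail on unknown or changed keys
        "-o", "UserKnownHostsFile=/dev/null",  # do not persist keys between runs
        "-o", f"ConnectTimeout={connect_timeout}",
        "-i", key_path,
    ]


def ssh_argv(key_path: str, user: str, ip: str, remote_cmd: str) -> List[str]:
    """Build the full argv for running remote_cmd as user on ip."""
    return build_ssh_base_cmd(key_path) + [f"{user}@{ip}", remote_cmd]


if __name__ == "__main__":
    # Placeholder values; pass the list to subprocess.run() to execute it.
    print(" ".join(ssh_argv("/tmp/id_ed25519", "root", "10.0.0.5", "true")))
```

Because nothing is written to a known-hosts file, a recreated VM with a new host key does not trigger a "host identification has changed" failure on the next run.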
--- a/.gitea/workflows/kubeadm-bootstrap.yml
+++ b/.gitea/workflows/kubeadm-bootstrap.yml
@@ -27,7 +27,7 @@ jobs:
          fi
      - name: Checkout repository
-        uses: https://gitea.com/actions/checkout@v4
+        uses: actions/checkout@v4
      - name: Create SSH key
        run: |
@@ -103,25 +103,9 @@ jobs:
      - name: Create kubeadm inventory
        env:
          KUBEADM_SSH_USER: ${{ secrets.KUBEADM_SSH_USER }}
          KUBEADM_SUBNET_PREFIX: ${{ secrets.KUBEADM_SUBNET_PREFIX }}
        run: |
          set -euo pipefail
-          TF_OUTPUT_JSON=""
+          terraform -chdir=terraform output -json | ./nixos/kubeadm/scripts/render-inventory-from-tf-output.py > nixos/kubeadm/scripts/inventory.env
-          for attempt in 1 2 3 4 5 6; do
-            echo "Inventory render attempt $attempt/6"
-            TF_OUTPUT_JSON="$(terraform -chdir=terraform output -json)"
-            if printf '%s' "$TF_OUTPUT_JSON" | ./nixos/kubeadm/scripts/render-inventory-from-tf-output.py > nixos/kubeadm/scripts/inventory.env; then
-              exit 0
-            fi
-            if [ "$attempt" -lt 6 ]; then
-              echo "VM IPv4s not available yet; waiting 30s before retry"
-              sleep 30
-            fi
-          done
-          echo "Falling back to SSH-based inventory discovery"
-          printf '%s' "$TF_OUTPUT_JSON" | ./nixos/kubeadm/scripts/discover-inventory-from-ssh.py > nixos/kubeadm/scripts/inventory.env
      - name: Validate nix installation
        run: |
--- a/.gitea/workflows/kubeadm-reset.yml
+++ b/.gitea/workflows/kubeadm-reset.yml
@@ -27,7 +27,7 @@ jobs:
          fi
      - name: Checkout repository
-        uses: https://gitea.com/actions/checkout@v4
+        uses: actions/checkout@v4
      - name: Create SSH key
        run: |
@@ -103,25 +103,9 @@ jobs:
      - name: Create kubeadm inventory
        env:
          KUBEADM_SSH_USER: ${{ secrets.KUBEADM_SSH_USER }}
          KUBEADM_SUBNET_PREFIX: ${{ secrets.KUBEADM_SUBNET_PREFIX }}
        run: |
          set -euo pipefail
-          TF_OUTPUT_JSON=""
+          terraform -chdir=terraform output -json | ./nixos/kubeadm/scripts/render-inventory-from-tf-output.py > nixos/kubeadm/scripts/inventory.env
-          for attempt in 1 2 3 4 5 6; do
-            echo "Inventory render attempt $attempt/6"
-            TF_OUTPUT_JSON="$(terraform -chdir=terraform output -json)"
-            if printf '%s' "$TF_OUTPUT_JSON" | ./nixos/kubeadm/scripts/render-inventory-from-tf-output.py > nixos/kubeadm/scripts/inventory.env; then
-              exit 0
-            fi
-            if [ "$attempt" -lt 6 ]; then
-              echo "VM IPv4s not available yet; waiting 30s before retry"
-              sleep 30
-            fi
-          done
-          echo "Falling back to SSH-based inventory discovery"
-          printf '%s' "$TF_OUTPUT_JSON" | ./nixos/kubeadm/scripts/discover-inventory-from-ssh.py > nixos/kubeadm/scripts/inventory.env
      - name: Run cluster reset
        run: |
--- a/.gitea/workflows/terraform-apply.yml
+++ b/.gitea/workflows/terraform-apply.yml
@@ -16,7 +16,7 @@ jobs:
    steps:
      - name: Checkout repository
-        uses: https://gitea.com/actions/checkout@v4
+        uses: actions/checkout@v4
      - name: Create secrets.tfvars
        working-directory: terraform
@@ -151,25 +151,9 @@ jobs:
      - name: Create kubeadm inventory from Terraform outputs
        env:
          KUBEADM_SSH_USER: ${{ secrets.KUBEADM_SSH_USER }}
          KUBEADM_SUBNET_PREFIX: ${{ secrets.KUBEADM_SUBNET_PREFIX }}
        run: |
          set -euo pipefail
-          TF_OUTPUT_JSON=""
+          terraform -chdir=terraform output -json | ./nixos/kubeadm/scripts/render-inventory-from-tf-output.py > nixos/kubeadm/scripts/inventory.env
-          for attempt in 1 2 3 4 5 6; do
-            echo "Inventory render attempt $attempt/6"
-            TF_OUTPUT_JSON="$(terraform -chdir=terraform output -json)"
-            if printf '%s' "$TF_OUTPUT_JSON" | ./nixos/kubeadm/scripts/render-inventory-from-tf-output.py > nixos/kubeadm/scripts/inventory.env; then
-              exit 0
-            fi
-            if [ "$attempt" -lt 6 ]; then
-              echo "VM IPv4s not available yet; waiting 30s before retry"
-              sleep 30
-            fi
-          done
-          echo "Falling back to SSH-based inventory discovery"
-          printf '%s' "$TF_OUTPUT_JSON" | ./nixos/kubeadm/scripts/discover-inventory-from-ssh.py > nixos/kubeadm/scripts/inventory.env
      - name: Ensure nix and nixos-rebuild
        env:
--- a/.gitea/workflows/terraform-destroy.yml
+++ b/.gitea/workflows/terraform-destroy.yml
@@ -36,7 +36,7 @@ jobs:
          fi
      - name: Checkout repository
-        uses: https://gitea.com/actions/checkout@v4
+        uses: actions/checkout@v4
      - name: Create Terraform secret files
        working-directory: terraform
@@ -77,13 +77,13 @@ jobs:
          set -euo pipefail
          case "${{ inputs.target }}" in
            all)
-              TF_PLAN_CMD="terraform plan -parallelism=1 -destroy -out=tfdestroy"
+              TF_PLAN_CMD="terraform plan -refresh=false -parallelism=1 -destroy -out=tfdestroy"
              ;;
            control-planes)
-              TF_PLAN_CMD="terraform plan -parallelism=1 -destroy -target=proxmox_vm_qemu.control_planes -out=tfdestroy"
+              TF_PLAN_CMD="terraform plan -refresh=false -parallelism=1 -destroy -target=proxmox_vm_qemu.control_planes -out=tfdestroy"
              ;;
            workers)
-              TF_PLAN_CMD="terraform plan -parallelism=1 -destroy -target=proxmox_vm_qemu.workers -out=tfdestroy"
+              TF_PLAN_CMD="terraform plan -refresh=false -parallelism=1 -destroy -target=proxmox_vm_qemu.workers -out=tfdestroy"
              ;;
            *)
              echo "Invalid destroy target: ${{ inputs.target }}"
--- a/.gitea/workflows/terraform-plan.yml
+++ b/.gitea/workflows/terraform-plan.yml
@@ -17,7 +17,7 @@ jobs:
    steps:
      - name: Checkout repository
-        uses: https://gitea.com/actions/checkout@v4
+        uses: actions/checkout@v4
      - name: Create secrets.tfvars
        working-directory: terraform
--- a/nixos/kubeadm/README.md
+++ b/nixos/kubeadm/README.md
@@ -50,7 +50,7 @@ sudo nixos-rebuild switch --flake .#cp-1
 For remote target-host workflows, use your preferred deploy wrapper later
 (`nixos-rebuild --target-host ...` or deploy-rs/colmena).
-## Bootstrap runbook (kubeadm + kube-vip + Cilium)
+## Bootstrap runbook (kubeadm + kube-vip + Flannel)
 1. Apply Nix config on all nodes (`cp-*`, then `wk-*`).
 2. On `cp-1`, run:
@@ -62,14 +62,10 @@ sudo th-kubeadm-init
 This infers the control-plane VIP as `<node-subnet>.250` on `eth0`, creates the
 kube-vip static pod manifest, and runs `kubeadm init`.
-3. Install Cilium from `cp-1`:
+3. Install Flannel from `cp-1`:
 ```bash
-helm repo add cilium https://helm.cilium.io
+kubectl apply -f https://raw.githubusercontent.com/flannel-io/flannel/v0.25.5/Documentation/kube-flannel.yml
-helm repo update
-helm upgrade --install cilium cilium/cilium \
- --namespace kube-system \
- --set kubeProxyReplacement=true
 ```
 4. Generate join commands on `cp-1`:
@@ -98,7 +94,7 @@ kubectl get nodes -o wide
 kubectl -n kube-system get pods -o wide
 ```
-## Repeatable rebuild flow (recommended)
+## Fresh bootstrap flow (recommended)
 1. Copy and edit inventory:
@@ -107,7 +103,7 @@ cp ./scripts/inventory.example.env ./scripts/inventory.env
 $EDITOR ./scripts/inventory.env
 ```
-2. Rebuild all nodes and bootstrap/reconcile cluster:
+2. Rebuild all nodes and bootstrap a fresh cluster:
 ```bash
 ./scripts/rebuild-and-bootstrap.sh
@@ -141,15 +137,15 @@ For a full nuke/recreate lifecycle:
 - run Terraform destroy/apply for VMs first,
 - then run `./scripts/rebuild-and-bootstrap.sh` again.
-Node lists are discovered from Terraform outputs, so adding new workers/control
-planes in Terraform is picked up automatically by the bootstrap/reconcile flow.
+Node lists now come directly from static Terraform outputs, so bootstrap no longer
+depends on Proxmox guest-agent IP discovery or SSH subnet scanning.
 ## Optional Gitea workflow automation
 Primary flow:
 - Push to `master` triggers `.gitea/workflows/terraform-apply.yml`
-- That workflow now does Terraform apply and then runs kubeadm rebuild/bootstrap reconciliation automatically
+- That workflow now does Terraform apply and then runs a fresh kubeadm bootstrap automatically
 Manual dispatch workflows are available:
@@ -164,9 +160,7 @@ Required repository secrets:
 Optional secrets:
 - `KUBEADM_SSH_USER` (defaults to `micqdf`)
-- `KUBEADM_SUBNET_PREFIX` (optional, e.g. `10.27.27`; used for SSH-based IP discovery fallback)
-Node IPs are auto-discovered from Terraform state outputs (`control_plane_vm_ipv4`, `worker_vm_ipv4`), so you do not need per-node IP secrets.
+Node IPs are rendered directly from static Terraform outputs (`control_plane_vm_ipv4`, `worker_vm_ipv4`), so you do not need per-node IP secrets or SSH discovery fallbacks.
 ## Notes
--- a/nixos/kubeadm/bootstrap/controller.py
+++ b/nixos/kubeadm/bootstrap/controller.py
@@ -11,9 +11,6 @@ from concurrent.futures import ThreadPoolExecutor, as_completed
 from pathlib import Path
-REMOTE_STATE_PATH = "/var/lib/terrahome/bootstrap-state.json"
 def run_local(cmd, check=True, capture=False):
    if isinstance(cmd, str):
        shell = True
@@ -102,7 +99,6 @@ class Controller:
        self.script_dir = Path(__file__).resolve().parent
        self.flake_dir = Path(self.env.get("FLAKE_DIR") or (self.script_dir.parent)).resolve()
-        self.local_state_path = self.script_dir / "bootstrap-state-last.json"
        self.ssh_user = self.env.get("SSH_USER", "micqdf")
        self.ssh_candidates = self.env.get("SSH_USER_CANDIDATES", f"root {self.ssh_user}").split()
@@ -114,7 +110,9 @@ class Controller:
            "-o",
            "IdentitiesOnly=yes",
            "-o",
-            "StrictHostKeyChecking=accept-new",
+            "StrictHostKeyChecking=no",
+            "-o",
+            "UserKnownHostsFile=/dev/null",
            "-i",
            self.ssh_key,
        ]
@@ -124,7 +122,9 @@ class Controller:
        self.worker_parallelism = int(self.env.get("WORKER_PARALLELISM", "3"))
        self.fast_mode = self.env.get("FAST_MODE", "1")
        self.skip_rebuild = self.env.get("SKIP_REBUILD", "0") == "1"
-        self.force_reinit = False
+        self.force_reinit = True
+        self.ssh_ready_retries = int(self.env.get("SSH_READY_RETRIES", "20"))
+        self.ssh_ready_delay = int(self.env.get("SSH_READY_DELAY_SEC", "15"))
    def log(self, msg):
        print(f"==> {msg}")
@@ -134,13 +134,26 @@ class Controller:
        return run_local(full, check=check, capture=True)
    def detect_user(self, ip):
-        for user in self.ssh_candidates:
-            proc = self._ssh(user, ip, "true", check=False)
-            if proc.returncode == 0:
-                self.active_ssh_user = user
-                self.log(f"Using SSH user '{user}' for {ip}")
-                return
-        raise RuntimeError(f"Unable to authenticate to {ip} with users: {', '.join(self.ssh_candidates)}")
+        for attempt in range(1, self.ssh_ready_retries + 1):
+            for user in self.ssh_candidates:
+                proc = self._ssh(user, ip, "true", check=False)
+                if proc.returncode == 0:
+                    self.active_ssh_user = user
+                    self.log(f"Using SSH user '{user}' for {ip}")
+                    return
+            if attempt < self.ssh_ready_retries:
+                self.log(
+                    f"SSH not ready on {ip} yet; retrying in {self.ssh_ready_delay}s "
+                    f"({attempt}/{self.ssh_ready_retries})"
+                )
+                time.sleep(self.ssh_ready_delay)
+        raise RuntimeError(
+            "Unable to authenticate to "
+            f"{ip} with users: {', '.join(self.ssh_candidates)}. "
+            "If this is a freshly cloned VM, the Proxmox source template likely does not yet include the "
+            "current cloud-init-capable NixOS template configuration from nixos/template-base. "
+            "Terraform can only clone what exists in Proxmox; it cannot retrofit cloud-init support into an old template."
+        )
    def remote(self, ip, cmd, check=True):
        ordered = [self.active_ssh_user] + [u for u in self.ssh_candidates if u != self.active_ssh_user]
@@ -161,53 +174,7 @@ class Controller:
        return last
    def prepare_known_hosts(self):
-        ssh_dir = Path.home() / ".ssh"
-        ssh_dir.mkdir(parents=True, exist_ok=True)
-        (ssh_dir / "known_hosts").touch()
-        run_local(["chmod", "700", str(ssh_dir)])
-        run_local(["chmod", "600", str(ssh_dir / "known_hosts")])
-        for ip in self.node_ips.values():
-            run_local(["ssh-keygen", "-R", ip], check=False)
-            run_local(f"ssh-keyscan -H {shlex.quote(ip)} >> {shlex.quote(str(ssh_dir / 'known_hosts'))}", check=False)
+        pass
-    def get_state(self):
-        proc = self.remote(
-            self.primary_ip,
-            "sudo test -f /var/lib/terrahome/bootstrap-state.json && sudo cat /var/lib/terrahome/bootstrap-state.json || echo '{}'",
-        )
-        try:
-            state = json.loads(proc.stdout.strip() or "{}")
-        except Exception:
-            state = {}
-        return state
-    def set_state(self, state):
-        payload = json.dumps(state, sort_keys=True)
-        b64 = base64.b64encode(payload.encode()).decode()
-        self.remote(
-            self.primary_ip,
-            (
-                "sudo mkdir -p /var/lib/terrahome && "
-                f"echo {shlex.quote(b64)} | base64 -d | sudo tee {REMOTE_STATE_PATH} >/dev/null"
-            ),
-        )
-        self.local_state_path.write_text(payload + "\n", encoding="utf-8")
-    def mark_done(self, key):
-        state = self.get_state()
-        state[key] = True
-        state["updated_at"] = int(time.time())
-        self.set_state(state)
-    def clear_done(self, keys):
-        state = self.get_state()
-        for key in keys:
-            state.pop(key, None)
-        state["updated_at"] = int(time.time())
-        self.set_state(state)
-    def stage_done(self, key):
-        return bool(self.get_state().get(key))
    def prepare_remote_nix(self, ip):
        self.remote(ip, "sudo mkdir -p /etc/nix")
@@ -257,15 +224,11 @@ class Controller:
        raise RuntimeError(f"Rebuild failed permanently for {name}")
    def stage_preflight(self):
-        if self.stage_done("preflight_done"):
-            self.log("Preflight already complete")
-            return
         self.prepare_known_hosts()
         self.detect_user(self.primary_ip)
-        self.mark_done("preflight_done")
    def stage_rebuild(self):
-        if self.skip_rebuild and self.stage_done("nodes_rebuilt"):
+        if self.skip_rebuild:
            self.log("Node rebuild already complete")
            return
@@ -299,17 +262,6 @@ class Controller:
        if failures:
            raise RuntimeError(f"Worker rebuild failures: {failures}")
-        # Rebuild can invalidate prior bootstrap stages; force reconciliation.
-        self.force_reinit = True
-        self.clear_done([
-            "primary_initialized",
-            "cni_installed",
-            "control_planes_joined",
-            "workers_joined",
-            "verified",
-        ])
-        self.mark_done("nodes_rebuilt")
    def has_admin_conf(self):
        return self.remote(self.primary_ip, "sudo test -f /etc/kubernetes/admin.conf", check=False).returncode == 0
@@ -318,37 +270,52 @@ class Controller:
        return self.remote(self.primary_ip, cmd, check=False).returncode == 0
    def stage_init_primary(self):
-        if (not self.force_reinit) and self.stage_done("primary_initialized") and self.has_admin_conf() and self.cluster_ready():
-            self.log("Primary control plane init already complete")
-            return
-        if (not self.force_reinit) and self.has_admin_conf() and self.cluster_ready():
-            self.log("Existing cluster detected on primary control plane")
-        else:
-            self.log(f"Initializing primary control plane on {self.primary_cp}")
-            self.remote(self.primary_ip, "sudo th-kubeadm-init")
-        self.mark_done("primary_initialized")
+        self.log(f"Initializing primary control plane on {self.primary_cp}")
+        self.remote(self.primary_ip, "sudo th-kubeadm-init")
    def stage_install_cni(self):
-        if self.stage_done("cni_installed") and self.cluster_ready():
-            self.log("CNI install already complete")
-            return
-        self.log("Installing or upgrading Cilium")
-        self.remote(self.primary_ip, "sudo helm repo add cilium https://helm.cilium.io >/dev/null 2>&1 || true")
-        self.remote(self.primary_ip, "sudo helm repo update >/dev/null")
-        self.remote(self.primary_ip, "sudo kubectl --kubeconfig /etc/kubernetes/admin.conf create namespace kube-system >/dev/null 2>&1 || true")
+        self.log("Installing Flannel")
+        manifest_path = self.script_dir.parent / "manifests" / "kube-flannel.yml"
+        manifest_b64 = base64.b64encode(manifest_path.read_bytes()).decode()
+
        self.remote(
            self.primary_ip,
-            "sudo KUBECONFIG=/etc/kubernetes/admin.conf helm upgrade --install cilium cilium/cilium --namespace kube-system --set kubeProxyReplacement=true",
+            (
+                "sudo mkdir -p /var/lib/terrahome && "
+                f"echo {shlex.quote(manifest_b64)} | base64 -d | sudo tee /var/lib/terrahome/kube-flannel.yml >/dev/null"
+            ),
        )
-        self.mark_done("cni_installed")
+
+        self.log("Waiting for API readiness before applying Flannel")
+        ready = False
+        for _ in range(30):
+            if self.cluster_ready():
+                ready = True
+                break
+            time.sleep(10)
+        if not ready:
+            raise RuntimeError("API server did not become ready before Flannel install")
+        last_error = None
+        for attempt in range(1, 6):
+            proc = self.remote(
+                self.primary_ip,
+                "sudo kubectl --kubeconfig /etc/kubernetes/admin.conf apply -f /var/lib/terrahome/kube-flannel.yml",
+                check=False,
+            )
+            if proc.returncode == 0:
+                return
+            last_error = (proc.stdout or "") + ("\n" if proc.stdout and proc.stderr else "") + (proc.stderr or "")
+            self.log(f"Flannel apply attempt {attempt}/5 failed; retrying in 15s")
+            time.sleep(15)
+        raise RuntimeError(f"Flannel apply failed after retries\n{last_error or ''}")
    def cluster_has_node(self, name):
        cmd = f"sudo kubectl --kubeconfig /etc/kubernetes/admin.conf get node {shlex.quote(name)} >/dev/null 2>&1"
        return self.remote(self.primary_ip, cmd, check=False).returncode == 0
    def build_join_cmds(self):
        if not self.has_admin_conf():
            self.remote(self.primary_ip, "sudo th-kubeadm-init")
        join_cmd = self.remote(
            self.primary_ip,
            "sudo KUBECONFIG=/etc/kubernetes/admin.conf kubeadm token create --print-join-command",
@@ -361,9 +328,6 @@ class Controller:
        return join_cmd, cp_join
    def stage_join_control_planes(self):
-        if self.stage_done("control_planes_joined"):
-            self.log("Control-plane join already complete")
-            return
        _, cp_join = self.build_join_cmds()
        for node in self.cp_names:
            if node == self.primary_cp:
@@ -373,14 +337,10 @@ class Controller:
                continue
            self.log(f"Joining control plane {node}")
            ip = self.node_ips[node]
-            node_join = f"{cp_join} --node-name {node}"
+            node_join = f"{cp_join} --node-name {node} --ignore-preflight-errors=NumCPU,HTTPProxyCIDR"
            self.remote(ip, f"sudo th-kubeadm-join-control-plane {shlex.quote(node_join)}")
-        self.mark_done("control_planes_joined")
    def stage_join_workers(self):
-        if self.stage_done("workers_joined"):
-            self.log("Worker join already complete")
-            return
        join_cmd, _ = self.build_join_cmds()
        for node in self.wk_names:
            if self.cluster_has_node(node):
@@ -388,18 +348,55 @@ class Controller:
                continue
            self.log(f"Joining worker {node}")
            ip = self.node_ips[node]
-            node_join = f"{join_cmd} --node-name {node}"
+            node_join = f"{join_cmd} --node-name {node} --ignore-preflight-errors=HTTPProxyCIDR"
            self.remote(ip, f"sudo th-kubeadm-join-worker {shlex.quote(node_join)}")
-        self.mark_done("workers_joined")
    def stage_verify(self):
-        if self.stage_done("verified"):
-            self.log("Verification already complete")
-            return
        self.log("Final node verification")
+        try:
+            self.remote(
+                self.primary_ip,
+                "sudo kubectl --kubeconfig /etc/kubernetes/admin.conf -n kube-flannel rollout status ds/kube-flannel-ds --timeout=10m",
+            )
+        except Exception:
+            self.log("Flannel rollout failed; collecting diagnostics")
+            proc = self.remote(
+                self.primary_ip,
+                "sudo kubectl --kubeconfig /etc/kubernetes/admin.conf -n kube-flannel get ds -o wide || true",
+                check=False,
+            )
+            print(proc.stdout)
+            proc = self.remote(
+                self.primary_ip,
+                "sudo kubectl --kubeconfig /etc/kubernetes/admin.conf -n kube-flannel get pods -o wide || true",
+                check=False,
+            )
+            print(proc.stdout)
+            proc = self.remote(
+                self.primary_ip,
+                "for p in $(sudo kubectl --kubeconfig /etc/kubernetes/admin.conf -n kube-flannel get pods -o name 2>/dev/null); do echo \"--- describe $p ---\"; sudo kubectl --kubeconfig /etc/kubernetes/admin.conf -n kube-flannel describe $p || true; done",
+                check=False,
+            )
+            print(proc.stdout)
+            proc = self.remote(
+                self.primary_ip,
+                "for p in $(sudo kubectl --kubeconfig /etc/kubernetes/admin.conf -n kube-flannel get pods -o name 2>/dev/null); do echo \"--- logs $p kube-flannel ---\"; sudo kubectl --kubeconfig /etc/kubernetes/admin.conf -n kube-flannel logs $p -c kube-flannel --tail=120 || true; echo \"--- logs $p install-cni-plugin ---\"; sudo kubectl --kubeconfig /etc/kubernetes/admin.conf -n kube-flannel logs $p -c install-cni-plugin --tail=120 || true; echo \"--- logs $p install-cni ---\"; sudo kubectl --kubeconfig /etc/kubernetes/admin.conf -n kube-flannel logs $p -c install-cni --tail=120 || true; done",
+                check=False,
+            )
+            print(proc.stdout)
+            proc = self.remote(
+                self.primary_ip,
+                "for p in $(sudo kubectl --kubeconfig /etc/kubernetes/admin.conf -n kube-flannel get pods -o name 2>/dev/null); do sudo kubectl --kubeconfig /etc/kubernetes/admin.conf -n kube-flannel logs --tail=120 $p || true; done",
+                check=False,
+            )
+            print(proc.stdout)
+            raise
+        self.remote(
+            self.primary_ip,
+            "sudo kubectl --kubeconfig /etc/kubernetes/admin.conf wait --for=condition=Ready nodes --all --timeout=10m",
+        )
        proc = self.remote(self.primary_ip, "sudo kubectl --kubeconfig /etc/kubernetes/admin.conf get nodes -o wide")
        print(proc.stdout)
-        self.mark_done("verified")
    def reconcile(self):
        self.stage_preflight()
--- a/nixos/kubeadm/manifests/kube-flannel.yml
+++ b/nixos/kubeadm/manifests/kube-flannel.yml
@@ -0,0 +1,212 @@
 ---
 kind: Namespace
 apiVersion: v1
 metadata:
  name: kube-flannel
  labels:
    k8s-app: flannel
    pod-security.kubernetes.io/enforce: privileged
 ---
 kind: ClusterRole
 apiVersion: rbac.authorization.k8s.io/v1
 metadata:
  labels:
    k8s-app: flannel
  name: flannel
 rules:
 - apiGroups:
  - ""
  resources:
  - pods
  verbs:
  - get
 - apiGroups:
  - ""
  resources:
  - nodes
  verbs:
  - get
  - list
  - watch
 - apiGroups:
  - ""
  resources:
  - nodes/status
  verbs:
  - patch
 ---
 kind: ClusterRoleBinding
 apiVersion: rbac.authorization.k8s.io/v1
 metadata:
  labels:
    k8s-app: flannel
  name: flannel
 roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: flannel
 subjects:
 - kind: ServiceAccount
  name: flannel
  namespace: kube-flannel
 ---
 apiVersion: v1
 kind: ServiceAccount
 metadata:
  labels:
    k8s-app: flannel
  name: flannel
  namespace: kube-flannel
 ---
 kind: ConfigMap
 apiVersion: v1
 metadata:
  name: kube-flannel-cfg
  namespace: kube-flannel
  labels:
    tier: node
    k8s-app: flannel
    app: flannel
 data:
  cni-conf.json: |
    {
      "name": "cbr0",
      "cniVersion": "0.3.1",
      "plugins": [
        {
          "type": "flannel",
          "delegate": {
            "hairpinMode": true,
            "isDefaultGateway": true
          }
        },
        {
          "type": "portmap",
          "capabilities": {
            "portMappings": true
          }
        }
      ]
    }
  net-conf.json: |
    {
      "Network": "10.244.0.0/16",
      "EnableNFTables": false,
      "Backend": {
        "Type": "vxlan"
      }
    }
 ---
 apiVersion: apps/v1
 kind: DaemonSet
 metadata:
  name: kube-flannel-ds
  namespace: kube-flannel
  labels:
    tier: node
    app: flannel
    k8s-app: flannel
 spec:
  selector:
    matchLabels:
      app: flannel
  template:
    metadata:
      labels:
        tier: node
        app: flannel
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: kubernetes.io/os
                operator: In
                values:
                - linux
      hostNetwork: true
      priorityClassName: system-node-critical
      tolerations:
      - operator: Exists
        effect: NoSchedule
      serviceAccountName: flannel
      initContainers:
      - name: install-cni-plugin
        image: docker.io/flannel/flannel-cni-plugin:v1.5.1-flannel1
        command:
        - cp
        args:
        - -f
        - /flannel
        - /opt/cni/bin/flannel
        volumeMounts:
        - name: cni-plugin
          mountPath: /opt/cni/bin
      - name: install-cni
        image: docker.io/flannel/flannel:v0.25.5
        command:
        - cp
        args:
        - -f
        - /etc/kube-flannel/cni-conf.json
        - /etc/cni/net.d/10-flannel.conflist
        volumeMounts:
        - name: cni
          mountPath: /etc/cni/net.d
        - name: flannel-cfg
          mountPath: /etc/kube-flannel/
      containers:
      - name: kube-flannel
        image: docker.io/flannel/flannel:v0.25.5
        command:
        - /opt/bin/flanneld
        args:
        - --ip-masq
        - --kube-subnet-mgr
        resources:
          requests:
            cpu: "100m"
            memory: "50Mi"
        securityContext:
          privileged: false
          capabilities:
            add: ["NET_ADMIN", "NET_RAW"]
        env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: POD_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        - name: EVENT_QUEUE_DEPTH
          value: "5000"
        volumeMounts:
        - name: run
          mountPath: /run/flannel
        - name: flannel-cfg
          mountPath: /etc/kube-flannel/
        - name: xtables-lock
          mountPath: /run/xtables.lock
      volumes:
      - name: run
        hostPath:
          path: /run/flannel
          type: DirectoryOrCreate
      - name: cni-plugin
        hostPath:
          path: /opt/cni/bin
          type: DirectoryOrCreate
      - name: cni
        hostPath:
          path: /etc/cni/net.d
          type: DirectoryOrCreate
      - name: flannel-cfg
        configMap:
          name: kube-flannel-cfg
      - name: xtables-lock
        hostPath:
          path: /run/xtables.lock
          type: FileOrCreate
--- a/nixos/kubeadm/modules/k8s-common.nix
+++ b/nixos/kubeadm/modules/k8s-common.nix
@@ -165,7 +165,8 @@ in
        name: "KUBEADM_NODE_NAME"
        criSocket: unix:///run/containerd/containerd.sock
        kubeletExtraArgs:
-          hostname-override: "KUBEADM_NODE_NAME"
+          - name: hostname-override
+            value: "KUBEADM_NODE_NAME"
      ---
      apiVersion: kubeadm.k8s.io/v1beta4
      kind: ClusterConfiguration
@@ -174,14 +175,6 @@ in
        podSubnet: "KUBEADM_POD_SUBNET"
        serviceSubnet: "KUBEADM_SERVICE_SUBNET"
        dnsDomain: "KUBEADM_DNS_DOMAIN"
-      ---
-      apiVersion: kubelet.config.k8s.io/v1beta1
-      kind: KubeletConfiguration
-      authentication:
-        webhook:
-          enabled: false
-      authorization:
-        mode: AlwaysAllow
      KUBEADMCONFIG
      sed -i "s|KUBEADM_ENDPOINT|$vip:6443|g" /tmp/kubeadm/init-config.yaml
@@ -209,27 +202,55 @@ in
      echo "==> kube-vip manifest kubeconfig mount"
      grep -E 'mountPath:|path:' /etc/kubernetes/manifests/kube-vip.yaml | grep -E 'kubernetes/(admin|super-admin)\.conf' || true
-      env -i PATH=/run/current-system/sw/bin:/usr/bin:/bin kubeadm init \
+      KUBEADM_INIT_LOG=/tmp/kubeadm-init.log
+      if ! env -i PATH=/run/current-system/sw/bin:/usr/bin:/bin kubeadm init \
        --config /tmp/kubeadm/init-config.yaml \
        --upload-certs \
-        --ignore-preflight-errors=NumCPU,HTTPProxyCIDR,Port-10250 || {
-        echo "==> kubeadm init failed, checking pod status:"
-        crictl pods || true
-        crictl ps -a || true
-        echo "==> kube-vip containers:"
-        crictl ps -a --name kube-vip || true
-        echo "==> kube-vip logs:"
-        for container_id in $(crictl ps -a --name kube-vip -q 2>/dev/null); do
-          echo "--- kube-vip container $container_id ---"
-          crictl logs "$container_id" 2>/dev/null || true
-          crictl inspect "$container_id" 2>/dev/null | jq -r '.status | "exitCode=\(.exitCode) reason=\(.reason // "") message=\(.message // "")"' || true
-        done
-        echo "==> Checking if VIP is bound:"
-        ip -4 addr show | grep "$vip" || echo "VIP NOT BOUND"
-        echo "==> kubelet logs:"
-        journalctl -xeu kubelet --no-pager -n 50
-        exit 1
-      }
+        --ignore-preflight-errors=NumCPU,HTTPProxyCIDR,Port-10250 2>&1 | tee "$KUBEADM_INIT_LOG"; then
+        if grep -q "error writing CRISocket for this node: nodes" "$KUBEADM_INIT_LOG" && [ -f /etc/kubernetes/admin.conf ]; then
+          echo "==> kubeadm hit CRISocket race; waiting for node registration"
+          echo "==> forcing kubelet restart to pick bootstrap flags"
+          systemctl daemon-reload || true
+          systemctl restart kubelet || true
+          sleep 3
+          echo "==> kubelet bootstrap flags"
+          cat /var/lib/kubelet/kubeadm-flags.env || true
+          registered=0
+          for i in $(seq 1 60); do
+            if KUBECONFIG=/etc/kubernetes/admin.conf kubectl get node "$node_name" >/dev/null 2>&1; then
+              echo "==> node $node_name registered; uploading kubelet config"
+              env -i PATH=/run/current-system/sw/bin:/usr/bin:/bin kubeadm init phase upload-config kubelet --config /tmp/kubeadm/init-config.yaml
+              registered=1
+              break
+            fi
+            sleep 2
+          done
+          if [ "$registered" -ne 1 ]; then
+            echo "==> node $node_name did not register after kubeadm init failure"
+            KUBECONFIG=/etc/kubernetes/admin.conf kubectl get nodes -o wide || true
+            echo "==> kubelet logs (registration hints)"
+            journalctl -u kubelet --no-pager -n 120 | grep -Ei "register|node|bootstrap|certificate|forbidden|unauthorized|refused|x509" || true
+            exit 1
+          fi
+        else
+          echo "==> kubeadm init failed, checking pod status:"
+          crictl pods || true
+          crictl ps -a || true
+          echo "==> kube-vip containers:"
+          crictl ps -a --name kube-vip || true
+          echo "==> kube-vip logs:"
+          for container_id in $(crictl ps -a --name kube-vip -q 2>/dev/null); do
+            echo "--- kube-vip container $container_id ---"
+            crictl logs "$container_id" 2>/dev/null || true
+            crictl inspect "$container_id" 2>/dev/null | jq -r '.status | "exitCode=\(.exitCode) reason=\(.reason // "") message=\(.message // "")"' || true
+          done
+          echo "==> Checking if VIP is bound:"
+          ip -4 addr show | grep "$vip" || echo "VIP NOT BOUND"
+          echo "==> kubelet logs:"
+          journalctl -xeu kubelet --no-pager -n 50
+          exit 1
+        fi
+      fi
      echo "==> Waiting for kube-vip to claim VIP $vip"
      for i in $(seq 1 90); do
@@ -317,12 +338,16 @@ in
        > /etc/kubernetes/manifests/kube-vip.yaml
      rm -f /var/lib/kubelet/config.yaml /var/lib/kubelet/kubeadm-flags.env
      rm -f /etc/kubernetes/kubelet.conf /etc/kubernetes/bootstrap-kubelet.conf
      rm -f /var/lib/kubelet/kubeconfig /var/lib/kubelet/instance-config.yaml
      rm -rf /var/lib/kubelet/pki
      systemctl unmask kubelet || true
      systemctl stop kubelet || true
      systemctl enable kubelet || true
      systemctl reset-failed kubelet || true
      systemctl daemon-reload
      env -i PATH=/run/current-system/sw/bin:/usr/bin:/bin kubeadm reset -f || true
      eval "$1"
    '')
@@ -335,12 +360,16 @@ in
      fi
      rm -f /var/lib/kubelet/config.yaml /var/lib/kubelet/kubeadm-flags.env
      rm -f /etc/kubernetes/kubelet.conf /etc/kubernetes/bootstrap-kubelet.conf
      rm -f /var/lib/kubelet/kubeconfig /var/lib/kubelet/instance-config.yaml
      rm -rf /var/lib/kubelet/pki
      systemctl unmask kubelet || true
      systemctl stop kubelet || true
      systemctl enable kubelet || true
      systemctl reset-failed kubelet || true
      systemctl daemon-reload
      env -i PATH=/run/current-system/sw/bin:/usr/bin:/bin kubeadm reset -f || true
      eval "$1"
    '')
@@ -355,6 +384,7 @@ in
  systemd.services.kubelet = {
    description = "Kubernetes Kubelet";
    wantedBy = [ "multi-user.target" ];
    path = [ pkgs.util-linux ];
    wants = [ "network-online.target" ];
    after = [ "containerd.service" "network-online.target" ];
    serviceConfig = {
@@ -367,18 +397,22 @@ in
        "-/var/lib/kubelet/kubeadm-flags.env"
        "-/etc/default/kubelet"
      ];
-      ExecStart = "${pinnedK8s}/bin/kubelet \$KUBELET_CONFIG_ARGS \$KUBELET_KUBEADM_ARGS \$KUBELET_EXTRA_ARGS";
+      ExecStart = "${pinnedK8s}/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf \$KUBELET_CONFIG_ARGS \$KUBELET_KUBEADM_ARGS \$KUBELET_EXTRA_ARGS";
      Restart = "on-failure";
      RestartSec = "10";
    };
    unitConfig = {
      ConditionPathExists = "/var/lib/kubelet/config.yaml";
      ConditionPathExistsGlob = "/etc/kubernetes/*kubelet.conf";
    };
  };
  systemd.tmpfiles.rules = [
    "d /etc/kubernetes 0755 root root -"
    "d /etc/kubernetes/manifests 0755 root root -"
    "d /etc/cni/net.d 0755 root root -"
    "d /opt/cni/bin 0755 root root -"
    "d /run/flannel 0755 root root -"
    "d /var/lib/kubelet 0755 root root -"
    "d /var/lib/kubelet/pki 0755 root root -"
  ];
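The CRISocket recovery path in this file polls for node registration (60 attempts, 2 seconds apart) before re-running the `upload-config kubelet` phase. Below is a minimal Python sketch of that polling pattern, for illustration only. `wait_for` and `node_registered` are hypothetical names that do not exist in this repository; the `kubectl get node` call and the `admin.conf` path mirror the shell script.

```python
import os
import subprocess
import time
from typing import Callable


def wait_for(
    check: Callable[[], bool],
    attempts: int = 60,
    delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call check() up to `attempts` times, sleeping `delay` seconds after each miss."""
    for _ in range(attempts):
        if check():
            return True
        sleep(delay)
    return False


def node_registered(node_name: str, kubeconfig: str = "/etc/kubernetes/admin.conf") -> bool:
    """Hypothetical helper: True once `kubectl get node <name>` exits with status 0."""
    env = dict(os.environ, KUBECONFIG=kubeconfig)
    try:
        result = subprocess.run(
            ["kubectl", "get", "node", node_name],
            env=env,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=False,
        )
    except OSError:
        # kubectl missing or not executable: treat as "not registered yet".
        return False
    return result.returncode == 0
```

A caller would use it as `wait_for(lambda: node_registered("cp-1"))` and fall through to the diagnostics branch when it returns `False`. Passing `sleep` as a parameter is a design choice of this sketch so the loop can be tested without real delays.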
--- a/nixos/kubeadm/scripts/discover-inventory-from-ssh.py
+++ b/nixos/kubeadm/scripts/discover-inventory-from-ssh.py
@@ -96,8 +96,19 @@ def main() -> int:
    prefix = derive_prefix(payload)
    start = int(os.environ.get("KUBEADM_SUBNET_START", "2"))
    end = int(os.environ.get("KUBEADM_SUBNET_END", "254"))
    vip_suffix = int(os.environ.get("KUBEADM_CONTROL_PLANE_VIP_SUFFIX", "250"))
-    scan_ips = [str(ipaddress.IPv4Address(f"{prefix}.{i}")) for i in range(start, end + 1)]
+    def is_vip_ip(ip: str) -> bool:
+        try:
+            return int(ip.split(".")[-1]) == vip_suffix
+        except Exception:
+            return False
+    scan_ips = [
+        str(ipaddress.IPv4Address(f"{prefix}.{i}"))
+        for i in range(start, end + 1)
+        if i != vip_suffix
+    ]
    found: Dict[str, str] = {}
    vmid_to_name: Dict[str, str] = {}
    for name, vmid in payload.get("control_plane_vm_ids", {}).get("value", {}).items():
@@ -106,6 +117,7 @@ def main() -> int:
        vmid_to_name[str(vmid)] = name
    seen_hostnames: Dict[str, str] = {}
    seen_ips: Dict[str, Tuple[str, str]] = {}
    def run_pass(pass_timeout: int, pass_workers: int) -> None:
        with concurrent.futures.ThreadPoolExecutor(max_workers=pass_workers) as pool:
@@ -117,12 +129,19 @@ def main() -> int:
                host, ip, serial = result
                if host not in seen_hostnames:
                    seen_hostnames[host] = ip
-                if host in target_names and host not in found:
-                    found[host] = ip
-                elif serial in vmid_to_name:
+                if ip not in seen_ips:
+                    seen_ips[ip] = (host, serial)
+                target = None
+                if serial in vmid_to_name:
                    inferred = vmid_to_name[serial]
-                    if inferred not in found:
-                        found[inferred] = ip
+                    target = inferred
+                elif host in target_names:
+                    target = host
+                if target:
+                    existing = found.get(target)
+                    if existing is None or (is_vip_ip(existing) and not is_vip_ip(ip)):
+                        found[target] = ip
                if all(name in found for name in target_names):
                    return
@@ -131,11 +150,25 @@ def main() -> int:
        # Slower second pass for busy runners/networks.
        run_pass(max(timeout_sec + 2, 8), max(8, max_workers // 2))
    # Heuristic fallback: if nodes still missing, assign from remaining SSH-reachable
    # IPs not already used, ordered by IP. This helps when cloned nodes temporarily
    # share a generic hostname (e.g. "flex") and DMI serial mapping is unavailable.
    missing = sorted([n for n in target_names if n not in found])
    if missing:
        used_ips = set(found.values())
        candidates = sorted(ip for ip in seen_ips.keys() if ip not in used_ips)
        if len(candidates) >= len(missing):
            for name, ip in zip(missing, candidates):
                found[name] = ip
    missing = sorted([n for n in target_names if n not in found])
    if missing:
        discovered = ", ".join(sorted(seen_hostnames.keys())[:20])
        if discovered:
            sys.stderr.write(f"Discovered hostnames during scan: {discovered}\n")
        if seen_ips:
            sample = ", ".join(f"{ip}={meta[0]}" for ip, meta in list(sorted(seen_ips.items()))[:20])
            sys.stderr.write(f"SSH-reachable IPs: {sample}\n")
        raise SystemExit(
            "Failed SSH-based IP discovery for nodes: " + ", ".join(missing) +
            f" (scanned {prefix}.{start}-{prefix}.{end})"
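The selection rules in this file's hunks (prefer the DMI serial mapping over the hostname, replace a VIP address with a non-VIP one, then assign leftover SSH-reachable IPs to missing nodes in order) can be sketched as a standalone function. This is an illustrative reduction under assumptions, not the script itself: `assign_nodes` and its `probes` argument are hypothetical. It also sorts fallback candidates numerically via `ipaddress`, whereas the script sorts the IP strings, so the order can differ (for example `.100` sorts before `.47` as a string).

```python
import ipaddress
from typing import Dict, Iterable, List, Tuple

Probe = Tuple[str, str, str]  # (hostname, ip, dmi_serial), one per SSH-reachable host


def is_vip_ip(ip: str, vip_suffix: int) -> bool:
    """True when the last octet of `ip` equals the control-plane VIP suffix."""
    try:
        return int(ip.split(".")[-1]) == vip_suffix
    except ValueError:
        return False


def assign_nodes(
    probes: Iterable[Probe],
    target_names: List[str],
    vmid_to_name: Dict[str, str],
    vip_suffix: int = 250,
) -> Dict[str, str]:
    """Map node names to IPs; names that cannot be resolved are left out."""
    found: Dict[str, str] = {}
    seen_ips: List[str] = []
    for host, ip, serial in probes:
        if ip not in seen_ips:
            seen_ips.append(ip)
        target = None
        if serial in vmid_to_name:      # serial mapping wins over hostname
            target = vmid_to_name[serial]
        elif host in target_names:
            target = host
        if target:
            existing = found.get(target)
            # Keep the first hit unless it was the VIP and this one is not.
            if existing is None or (
                is_vip_ip(existing, vip_suffix) and not is_vip_ip(ip, vip_suffix)
            ):
                found[target] = ip
    # Heuristic fallback: pair missing names with unused IPs, both in sorted order.
    missing = sorted(n for n in target_names if n not in found)
    if missing:
        used = set(found.values())
        candidates = sorted(
            (ip for ip in seen_ips if ip not in used),
            key=ipaddress.IPv4Address,  # numeric order; an assumption of this sketch
        )
        if len(candidates) >= len(missing):
            for name, ip in zip(missing, candidates):
                found[name] = ip
    return found
```

The sketch does not filter the VIP out of the fallback candidates; the script avoids that case by never scanning the VIP suffix in the first place.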
--- a/nixos/template-base/configuration.nix
+++ b/nixos/template-base/configuration.nix
@@ -11,6 +11,7 @@ in
  networking.hostName = "k8s-base-template";
  networking.useDHCP = lib.mkDefault true;
  networking.useNetworkd = true;
  networking.nameservers = [ "1.1.1.1" "8.8.8.8" ];
  boot.loader.systemd-boot.enable = lib.mkForce false;
@@ -20,6 +21,8 @@ in
  };
  services.qemuGuest.enable = true;
  services.cloud-init.enable = true;
  services.cloud-init.network.enable = true;
  services.openssh.enable = true;
  services.openssh.settings = {
    PasswordAuthentication = false;
--- a/terraform/main.tf
+++ b/terraform/main.tf
@@ -9,6 +9,15 @@ terraform {
  }
 }
 locals {
  control_plane_ipconfig = [
    for ip in var.control_plane_ips : "ip=${ip}/${var.network_prefix_length},gw=${var.network_gateway}"
  ]
  worker_ipconfig = [
    for ip in var.worker_ips : "ip=${ip}/${var.network_prefix_length},gw=${var.network_gateway}"
  ]
 }
 provider "proxmox" {
  pm_api_url          = var.pm_api_url
  pm_api_token_id     = var.pm_api_token_id
@@ -35,7 +44,7 @@ resource "proxmox_vm_qemu" "control_planes" {
  scsihw    = "virtio-scsi-pci"
  boot      = "order=scsi0"
  bootdisk  = "scsi0"
-  ipconfig0 = "ip=dhcp"
+  ipconfig0 = local.control_plane_ipconfig[count.index]
  ciuser    = "micqdf"
  sshkeys   = var.SSH_KEY_PUBLIC
@@ -90,7 +99,7 @@ resource "proxmox_vm_qemu" "workers" {
  scsihw    = "virtio-scsi-pci"
  boot      = "order=scsi0"
  bootdisk  = "scsi0"
-  ipconfig0 = "ip=dhcp"
+  ipconfig0 = local.worker_ipconfig[count.index]
  ciuser    = "micqdf"
  sshkeys   = var.SSH_KEY_PUBLIC
--- a/terraform/outputs.tf
+++ b/terraform/outputs.tf
@@ -11,8 +11,8 @@ output "control_plane_vm_names" {
 output "control_plane_vm_ipv4" {
  value = {
-    for vm in proxmox_vm_qemu.control_planes :
-    vm.name => vm.default_ipv4_address
+    for i in range(var.control_plane_count) :
+    proxmox_vm_qemu.control_planes[i].name => var.control_plane_ips[i]
  }
 }
@@ -29,7 +29,7 @@ output "worker_vm_names" {
 output "worker_vm_ipv4" {
  value = {
-    for vm in proxmox_vm_qemu.workers :
-    vm.name => vm.default_ipv4_address
+    for i in range(var.worker_count) :
+    proxmox_vm_qemu.workers[i].name => var.worker_ips[i]
  }
 }
--- a/terraform/terraform.tfvars
+++ b/terraform/terraform.tfvars
@@ -17,3 +17,9 @@ control_plane_disk_size = "80G"
 worker_cores     = [4, 4, 4]
 worker_memory_mb = [12288, 12288, 12288]
 worker_disk_size = "120G"
 network_prefix_length = 10
 network_gateway       = "10.27.27.1"
 control_plane_ips = ["10.27.27.50", "10.27.27.51", "10.27.27.49"]
 worker_ips        = ["10.27.27.47", "10.27.27.46", "10.27.27.48"]
--- a/terraform/variables.tf
+++ b/terraform/variables.tf
@@ -87,6 +87,40 @@ variable "worker_disk_size" {
  description = "Disk size for worker VMs"
 }
 variable "network_prefix_length" {
  type        = number
  default     = 10
  description = "CIDR prefix length for static VM addresses"
 }
 variable "network_gateway" {
  type        = string
  default     = "10.27.27.1"
  description = "Gateway for static VM addresses"
 }
 variable "control_plane_ips" {
  type        = list(string)
  default     = ["10.27.27.50", "10.27.27.51", "10.27.27.49"]
  description = "Static IPv4 addresses for control plane VMs"
  validation {
    condition     = length(var.control_plane_ips) == 3
    error_message = "control_plane_ips must contain exactly 3 IPs."
  }
 }
 variable "worker_ips" {
  type        = list(string)
  default     = ["10.27.27.47", "10.27.27.46", "10.27.27.48"]
  description = "Static IPv4 addresses for worker VMs"
  validation {
    condition     = length(var.worker_ips) == 3
    error_message = "worker_ips must contain exactly 3 IPs."
  }
 }
 variable "bridge" {
  type = string
 }
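The static-addressing change renders each `ipconfig0` value as `ip=<addr>/<prefix>,gw=<gateway>` from the new variables. The Python sketch below renders the same strings and additionally checks that the gateway lies inside each address's network and that no address repeats. Both checks are assumptions added for illustration; the Terraform code only validates list length. `render_ipconfig` is a hypothetical name.

```python
import ipaddress
from typing import List


def render_ipconfig(ips: List[str], prefix_length: int, gateway: str) -> List[str]:
    """Render Proxmox-style ipconfig strings, rejecting unreachable gateways and duplicates."""
    if len(set(ips)) != len(ips):
        raise ValueError("duplicate IP addresses in list")
    gw = ipaddress.IPv4Address(gateway)
    rendered = []
    for ip in ips:
        # IPv4Interface accepts host bits, e.g. 10.27.27.50/10 -> network 10.0.0.0/10.
        iface = ipaddress.IPv4Interface(f"{ip}/{prefix_length}")
        if gw not in iface.network:
            raise ValueError(f"gateway {gateway} is outside {iface.network}")
        rendered.append(f"ip={ip}/{prefix_length},gw={gateway}")
    return rendered
```

With the values in `terraform.tfvars` (prefix length 10, gateway `10.27.27.1`), every listed address falls in `10.0.0.0/10`, so the gateway check passes.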
Author	SHA1	Message	Date
micqdf	5bfc135350	Merge pull request 'fix: ignore stale SSH host keys for ephemeral homelab VMs' (#130) from stage into master. Reviewed-on: #130. [CI: Terraform Apply (push), failing after 19m24s]	2026-03-09 03:45:11 +00:00
MichaelFisher1997	63213a4bc3	fix: ignore stale SSH host keys for ephemeral homelab VMs. Fresh destroy/recreate cycles change VM host keys, which was breaking bootstrap after rebuilds. Use a disposable known-hosts policy in the controller SSH options so automation does not fail on expected key rotation. [CI: Terraform Plan (push), successful in 16s]	2026-03-09 03:16:18 +00:00
micqdf	e4243c7667	Merge pull request 'fix: keep DHCP enabled by default on template VM' (#129) from stage into master. Reviewed-on: #129. [CI: Terraform Apply (push), failing after 1h50m42s]	2026-03-08 22:03:17 +00:00
MichaelFisher1997	33bb0ffb17	fix: keep DHCP enabled by default on template VM. The template machine can lose connectivity when rebuilt directly because it has no cloud-init network data during template maintenance. Restore DHCP as the default for the template itself while keeping cloud-init + networkd enabled so cloned VMs can still consume injected network settings. [CI: Terraform Plan (push), successful in 14s]	2026-03-08 20:12:03 +00:00
micqdf	7434a65590	Merge pull request 'stage' (#128) from stage into master. Reviewed-on: #128. [CI: Terraform Apply (push), failing after 6m54s]	2026-03-08 18:06:46 +00:00
MichaelFisher1997	cd8e538c51	ci: switch checkout action source away from gitea.com mirror. The gitea.com checkout action mirror is timing out during workflow startup. Use actions/checkout@v4 directly so jobs do not fail before any repository logic runs. [CI: Terraform Plan (push), successful in 16s]	2026-03-08 13:36:21 +00:00
MichaelFisher1997	808c290c71	chore: clarify stale template cloud-init failure message. Make SSH bootstrap failures explain the real root cause when fresh clones never accept the injected user/key: the Proxmox source template itself still needs the updated cloud-init-capable NixOS configuration. [CI: Terraform Plan (push), failing after 31s]	2026-03-08 13:16:37 +00:00
micqdf	15e6471e7e	Merge pull request 'fix: enable cloud-init networking in NixOS template' (#127) from stage into master. Reviewed-on: #127. [CI: Terraform Apply (push), failing after 7m10s]	2026-03-08 05:33:57 +00:00
MichaelFisher1997	79a4c941e5	fix: enable cloud-init networking in NixOS template. Freshly recreated VMs were reachable but did not accept the injected SSH key, which indicates Proxmox cloud-init settings were not being applied. Enable cloud-init and cloud-init network handling in the base template so static IPs, hostname, ciuser, and SSH keys take effect on first boot. [CI: Terraform Plan (push), successful in 16s]	2026-03-08 05:16:19 +00:00
micqdf	e9bac70cae	Merge pull request 'fix: wait for SSH readiness after VM provisioning' (#126) from stage into master. Reviewed-on: #126. [CI: Terraform Apply (push), failing after 6m56s]	2026-03-08 05:04:43 +00:00
MichaelFisher1997	4c167f618a	fix: wait for SSH readiness after VM provisioning. Freshly recreated VMs can take a few minutes before cloud-init users and SSH are available. Retry SSH authentication in the bootstrap controller before failing so rebuild/bootstrap does not abort immediately on new hosts. [CI: Terraform Plan (push), successful in 17s]	2026-03-08 05:00:39 +00:00
micqdf	97295a7071	Merge pull request 'ci: speed up Terraform destroy plan by skipping refresh' (#125) from stage into master. Reviewed-on: #125. [CI: Terraform Apply (push), failing after 7m0s]	2026-03-08 04:47:02 +00:00
MichaelFisher1997	7bc861b3e8	ci: speed up Terraform destroy plan by skipping refresh. Use terraform plan -refresh=false for destroy workflows so manual NUKE runs do not spend minutes refreshing Proxmox VM state before building the destroy plan. [CI: Terraform Plan (push), successful in 16s]	2026-03-08 04:37:52 +00:00
micqdf	6ca189b32c	Merge pull request 'fix: vendor Flannel manifest and harden CNI bootstrap timing' (#124) from stage into master. Reviewed-on: #124. [CI: Terraform Apply (push), successful in 15m11s]	2026-03-08 04:10:47 +00:00
MichaelFisher1997	b7b364a112	fix: vendor Flannel manifest and harden CNI bootstrap timing. Stop depending on GitHub during cluster bring-up by shipping the Flannel manifest in-repo, ensure required host paths exist on NixOS nodes, and wait/retry against a stable API before applying the CNI. This removes the TLS handshake timeout failure mode and makes early network bootstrap deterministic. [CI: Terraform Plan (push), successful in 17s]	2026-03-08 03:24:16 +00:00
micqdf	2aa9950f59	Merge pull request 'fix: add mount utility to kubelet service PATH' (#123) from stage into master. Reviewed-on: #123. [CI: Terraform Apply (push), failing after 11m10s]	2026-03-08 02:16:23 +00:00
MichaelFisher1997	bd866f7dac	fix: add mount utility to kubelet service PATH. Flannel pods were stuck because kubelet could not execute mount for projected service account volumes on NixOS. Add util-linux to the kubelet systemd PATH so mount is available during volume setup. [CI: Terraform Plan (push), successful in 16s]	2026-03-07 14:18:20 +00:00
micqdf	c1f86483ad	Merge pull request 'debug: print detailed Flannel pod diagnostics on rollout timeout' (#122) from stage into master. Reviewed-on: #122. [CI: Terraform Apply (push), failing after 23m50s]	2026-03-07 12:31:43 +00:00
micqdf	0cce4bcf72	Merge branch 'master' into stage. [CI: Terraform Plan (push), successful in 16s]	2026-03-07 12:22:01 +00:00
MichaelFisher1997	065567210e	debug: print detailed Flannel pod diagnostics on rollout timeout. When kube-flannel daemonset rollout stalls, print pod descriptions and per-container logs for the init containers and main flannel container so the next failure shows the actual cause instead of only Init:0/2. [CI: Terraform Plan (push), successful in 18s]	2026-03-07 12:19:21 +00:00
micqdf	c5f0b1ac37	Merge pull request 'stage' (#121) from stage into master. Reviewed-on: #121. [CI: Terraform Apply (push), failing after 30m28s]	2026-03-07 01:01:38 +00:00
micqdf	e740d47011	Merge branch 'master' into stage. [CI: Terraform Plan (push), successful in 16s]	2026-03-07 00:57:47 +00:00
MichaelFisher1997	d9d3976c4c	fix: use self-contained Terraform variable validations. Terraform variable validation blocks can only reference the variable under validation. Replace count-based checks with fixed-length validations for the current 3 control planes and 3 workers. [CI: Terraform Plan (push), successful in 17s]	2026-03-07 00:54:51 +00:00
MichaelFisher1997	a0b07816b9	refactor: simplify homelab bootstrap around static IPs and fresh runs. Make Terraform the source of truth for node IPs, remove guest-agent/SSH discovery from the normal workflow path, simplify the bootstrap controller to a fresh-run flow, and swap the initial CNI to Flannel so cluster readiness is easier to prove before reintroducing more complex reconcile behavior. [CI: Terraform Plan (push), failing after 10s]	2026-03-07 00:52:35 +00:00
micqdf	d964ff8b50	Merge pull request 'fix: point Cilium directly at API server and print rollout diagnostics' (#120) from stage into master. Reviewed-on: #120. [CI: Terraform Apply (push), failing after 26m43s]	2026-03-05 01:25:52 +00:00
MichaelFisher1997	e06b2c692e	fix: point Cilium directly at API server and print rollout diagnostics. Set Cilium k8sServiceHost/k8sServicePort to the primary control-plane API endpoint to avoid in-cluster service routing dependency during early bootstrap. Also print cilium daemonset/pod/log diagnostics when rollout times out. [CI: Terraform Plan (push), successful in 18s]	2026-03-05 01:21:21 +00:00
micqdf	c48bbddef3	Merge pull request 'fix: stabilize Cilium install defaults and add rollout diagnostics' (#119) from stage into master. Reviewed-on: #119. [CI: Terraform Apply (push), failing after 26m43s]	2026-03-05 00:52:04 +00:00
MichaelFisher1997	ca54c44fa4	fix: stabilize Cilium install defaults and add rollout diagnostics. Set Cilium kubeProxyReplacement from env (default false for homelab stability) and collect cilium daemonset/pod/log diagnostics when rollout times out during verification. [CI: Terraform Plan (push), successful in 17s]	2026-03-05 00:48:41 +00:00
micqdf	8bda08be07	Merge pull request 'fix: hard-reset nodes before kubeadm join retries' (#118) from stage into master. Reviewed-on: #118. [CI: Terraform Apply (push), failing after 29m30s]	2026-03-05 00:16:31 +00:00
MichaelFisher1997	0778de9719	fix: hard-reset nodes before kubeadm join retries. Before control-plane and worker joins, remove stale kubelet/kubernetes identity files and run kubeadm reset -f. This prevents preflight failures like FileAvailable--etc-kubernetes-kubelet.conf during repeated reconcile attempts. [CI: Terraform Plan (push), successful in 17s]	2026-03-04 23:38:15 +00:00
micqdf	92f0658995	Merge pull request 'fix: add heuristic SSH inventory fallback for generic hostnames' (#117) from stage into master. Reviewed-on: #117. [CI: Terraform Apply (push), failing after 19m52s]	2026-03-04 23:13:08 +00:00
MichaelFisher1997	fc4eb1bc6e	fix: add heuristic SSH inventory fallback for generic hostnames. When Proxmox guest-agent IPs are empty and SSH discovery returns duplicate generic hostnames (e.g. flex), assign remaining missing nodes from unmatched SSH-reachable IPs in deterministic order. Also emit SSH-reachable IP diagnostics on failure. [CI: Terraform Plan (push), successful in 16s]	2026-03-04 23:07:45 +00:00
micqdf	4b017364c8	Merge pull request 'fix: wait for Cilium and node readiness before marking bootstrap success' (#116) from stage into master. Reviewed-on: #116. [CI: Terraform Apply (push), failing after 8m47s]	2026-03-04 22:57:39 +00:00
MichaelFisher1997	a70de061b0	fix: wait for Cilium and node readiness before marking bootstrap success. Update verification stage to block on cilium daemonset rollout and all nodes reaching Ready. This prevents workflows from reporting success while the cluster is still NotReady immediately after join. [CI: Terraform Plan (push), successful in 18s]	2026-03-04 22:26:43 +00:00
micqdf	9d98f56725	Merge pull request 'fix: add join preflight ignores for homelab control planes' (#115) from stage into master. Reviewed-on: #115. [CI: Terraform Apply (push), successful in 44m43s]	2026-03-04 21:13:02 +00:00
MichaelFisher1997	5ddd00f711	fix: add join preflight ignores for homelab control planes. Append --ignore-preflight-errors=NumCPU,HTTPProxyCIDR to control-plane join commands and HTTPProxyCIDR to worker joins so kubeadm join does not fail on known single-CPU/proxy CIDR checks in this environment. [CI: Terraform Plan (push), successful in 16s]	2026-03-04 21:09:27 +00:00
micqdf	5af4021228	Merge pull request 'fix: require kubelet kubeconfig before starting service' (#114) from stage into master. Reviewed-on: #114. [CI: Terraform Apply (push), failing after 16m56s]	2026-03-04 20:46:48 +00:00
MichaelFisher1997	034869347a	fix: require kubelet kubeconfig before starting service. Inline kubelet bootstrap/kubeconfig flags in ExecStart and gate startup on /etc/kubernetes/*kubelet.conf in addition to config.yaml. This prevents kubelet entering standalone mode with webhook auth enabled when no client config is present. [CI: Terraform Plan (push), successful in 17s]	2026-03-04 20:45:47 +00:00
micqdf	50d0d99332	Merge pull request 'stage' (#113) from stage into master. Reviewed-on: #113. [CI: Terraform Apply (push), failing after 18m7s]	2026-03-04 19:32:40 +00:00
MichaelFisher1997	f0093deedc	fix: avoid assigning control-plane VIP as node SSH address. Exclude the configured VIP suffix from subnet scans and prefer non-VIP IPs when multiple SSH endpoints resolve to the same node. This prevents cp-1 being discovered as .250 and later failing SSH commands against the floating VIP. [CI: Terraform Plan (push), successful in 15s]	2026-03-04 19:26:37 +00:00
MichaelFisher1997	6b6ca021c9	fix: add kubelet bootstrap kubeconfig args to systemd unit. Include KUBELET_KUBECONFIG_ARGS in kubelet ExecStart so kubelet can authenticate with bootstrap-kubelet.conf/kubelet.conf and register node objects during kubeadm init. [CI: Terraform Plan (push), successful in 17s]	2026-03-04 19:26:07 +00:00
micqdf	c034f7975c	Merge pull request 'stage' (#112) from stage into master. Reviewed-on: #112. [CI: Terraform Apply (push), failing after 28m53s]	2026-03-04 18:51:53 +00:00
micqdf	90ef0ec33f	Merge branch 'master' into stage. [CI: Terraform Plan (push), successful in 17s]	2026-03-04 18:42:22 +00:00
MichaelFisher1997	ba6cf42c04	fix: restart kubelet during CRISocket recovery and add registration diagnostics. When kubeadm init fails at upload-config/kubelet due missing node object, explicitly restart kubelet to ensure bootstrap flags are loaded before waiting for node registration. Add kubelet flag dump and focused registration log output to surface auth/cert errors. [CI: Terraform Plan (push), successful in 16s]	2026-03-04 18:37:50 +00:00
MichaelFisher1997	3cd0c70727	fix: stop overriding kubelet config in kubeadm init. Remove custom KubeletConfiguration from init config so kubeadm uses default kubelet authn/authz settings and bootstrap registration path. This avoids the standalone-style kubelet behavior where the node never appears in the API. [CI: Terraform Plan (push), successful in 17s]	2026-03-04 18:35:34 +00:00
micqdf	3281ebd216	Merge pull request 'fix: recover from kubeadm CRISocket node-registration race' (#111) from stage into master. Reviewed-on: #111. [CI: Terraform Apply (push), failing after 18m6s]	2026-03-04 03:03:17 +00:00
MichaelFisher1997	d2dd6105a6	fix: recover from kubeadm CRISocket node-registration race. Handle kubeadm init failures where upload-config/kubelet runs before the node object exists. When that specific error occurs, wait for cp-1 registration and run upload-config kubelet phase explicitly instead of aborting immediately. [CI: Terraform Plan (push), successful in 17s]	2026-03-04 03:00:34 +00:00
micqdf	981afc509a	Merge pull request 'fix: use kubeadm v1beta4 list format for kubeletExtraArgs' (#110) from stage into master. Reviewed-on: #110. [CI: Terraform Apply (push), failing after 19m48s]	2026-03-04 02:32:22 +00:00
MichaelFisher1997	b3c975bd73	fix: use kubeadm v1beta4 list format for kubeletExtraArgs. kubeadm v1beta4 expects nodeRegistration.kubeletExtraArgs as a list of name/value args, not a map. Switch hostname-override to the correct structure so init config unmarshals successfully. [CI: Terraform Plan (push), successful in 17s]	2026-03-04 02:00:07 +00:00