13 Commits

Author SHA1 Message Date
5bfc135350 Merge pull request 'fix: ignore stale SSH host keys for ephemeral homelab VMs' (#130) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 19m24s
Reviewed-on: #130
2026-03-09 03:45:11 +00:00
63213a4bc3 fix: ignore stale SSH host keys for ephemeral homelab VMs
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 16s
Fresh destroy/recreate cycles change VM host keys, which was breaking bootstrap after rebuilds. Use a disposable known-hosts policy in the controller SSH options so automation does not fail on expected key rotation.
2026-03-09 03:16:18 +00:00
e4243c7667 Merge pull request 'fix: keep DHCP enabled by default on template VM' (#129) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 1h50m42s
Reviewed-on: #129
2026-03-08 22:03:17 +00:00
33bb0ffb17 fix: keep DHCP enabled by default on template VM
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 14s
The template machine can lose connectivity when rebuilt directly because it has no cloud-init network data during template maintenance. Restore DHCP as the default for the template itself while keeping cloud-init + networkd enabled so cloned VMs can still consume injected network settings.
2026-03-08 20:12:03 +00:00
7434a65590 Merge pull request 'stage' (#128) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 6m54s
Reviewed-on: #128
2026-03-08 18:06:46 +00:00
cd8e538c51 ci: switch checkout action source away from gitea.com mirror
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 16s
The gitea.com checkout action mirror is timing out during workflow startup. Use actions/checkout@v4 directly so jobs do not fail before any repository logic runs.
2026-03-08 13:36:21 +00:00
808c290c71 chore: clarify stale template cloud-init failure message
Some checks failed
Terraform Plan / Terraform Plan (push) Failing after 31s
Make SSH bootstrap failures explain the real root cause when fresh clones never accept the injected user/key: the Proxmox source template itself still needs the updated cloud-init-capable NixOS configuration.
2026-03-08 13:16:37 +00:00
15e6471e7e Merge pull request 'fix: enable cloud-init networking in NixOS template' (#127) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 7m10s
Reviewed-on: #127
2026-03-08 05:33:57 +00:00
79a4c941e5 fix: enable cloud-init networking in NixOS template
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 16s
Freshly recreated VMs were reachable but did not accept the injected SSH key, which indicates Proxmox cloud-init settings were not being applied. Enable cloud-init and cloud-init network handling in the base template so static IPs, hostname, ciuser, and SSH keys take effect on first boot.
2026-03-08 05:16:19 +00:00
e9bac70cae Merge pull request 'fix: wait for SSH readiness after VM provisioning' (#126) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 6m56s
Reviewed-on: #126
2026-03-08 05:04:43 +00:00
4c167f618a fix: wait for SSH readiness after VM provisioning
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 17s
Freshly recreated VMs can take a few minutes before cloud-init users and SSH are available. Retry SSH authentication in the bootstrap controller before failing so rebuild/bootstrap does not abort immediately on new hosts.
2026-03-08 05:00:39 +00:00
97295a7071 Merge pull request 'ci: speed up Terraform destroy plan by skipping refresh' (#125) from stage into master
Some checks failed
Terraform Apply / Terraform Apply (push) Failing after 7m0s
Reviewed-on: #125
2026-03-08 04:47:02 +00:00
7bc861b3e8 ci: speed up Terraform destroy plan by skipping refresh
All checks were successful
Terraform Plan / Terraform Plan (push) Successful in 16s
Use terraform plan -refresh=false for destroy workflows so manual NUKE runs do not spend minutes refreshing Proxmox VM state before building the destroy plan.
2026-03-08 04:37:52 +00:00
7 changed files with 37 additions and 24 deletions

View File

@@ -27,7 +27,7 @@ jobs:
fi
- name: Checkout repository
uses: https://gitea.com/actions/checkout@v4
uses: actions/checkout@v4
- name: Create SSH key
run: |

View File

@@ -27,7 +27,7 @@ jobs:
fi
- name: Checkout repository
uses: https://gitea.com/actions/checkout@v4
uses: actions/checkout@v4
- name: Create SSH key
run: |

View File

@@ -16,7 +16,7 @@ jobs:
steps:
- name: Checkout repository
uses: https://gitea.com/actions/checkout@v4
uses: actions/checkout@v4
- name: Create secrets.tfvars
working-directory: terraform

View File

@@ -36,7 +36,7 @@ jobs:
fi
- name: Checkout repository
uses: https://gitea.com/actions/checkout@v4
uses: actions/checkout@v4
- name: Create Terraform secret files
working-directory: terraform
@@ -77,13 +77,13 @@ jobs:
set -euo pipefail
case "${{ inputs.target }}" in
all)
TF_PLAN_CMD="terraform plan -parallelism=1 -destroy -out=tfdestroy"
TF_PLAN_CMD="terraform plan -refresh=false -parallelism=1 -destroy -out=tfdestroy"
;;
control-planes)
TF_PLAN_CMD="terraform plan -parallelism=1 -destroy -target=proxmox_vm_qemu.control_planes -out=tfdestroy"
TF_PLAN_CMD="terraform plan -refresh=false -parallelism=1 -destroy -target=proxmox_vm_qemu.control_planes -out=tfdestroy"
;;
workers)
TF_PLAN_CMD="terraform plan -parallelism=1 -destroy -target=proxmox_vm_qemu.workers -out=tfdestroy"
TF_PLAN_CMD="terraform plan -refresh=false -parallelism=1 -destroy -target=proxmox_vm_qemu.workers -out=tfdestroy"
;;
*)
echo "Invalid destroy target: ${{ inputs.target }}"

View File

@@ -17,7 +17,7 @@ jobs:
steps:
- name: Checkout repository
uses: https://gitea.com/actions/checkout@v4
uses: actions/checkout@v4
- name: Create secrets.tfvars
working-directory: terraform

View File

@@ -110,7 +110,9 @@ class Controller:
"-o",
"IdentitiesOnly=yes",
"-o",
"StrictHostKeyChecking=accept-new",
"StrictHostKeyChecking=no",
"-o",
"UserKnownHostsFile=/dev/null",
"-i",
self.ssh_key,
]
@@ -121,6 +123,8 @@ class Controller:
self.fast_mode = self.env.get("FAST_MODE", "1")
self.skip_rebuild = self.env.get("SKIP_REBUILD", "0") == "1"
self.force_reinit = True
self.ssh_ready_retries = int(self.env.get("SSH_READY_RETRIES", "20"))
self.ssh_ready_delay = int(self.env.get("SSH_READY_DELAY_SEC", "15"))
def log(self, msg):
print(f"==> {msg}")
@@ -130,13 +134,26 @@ class Controller:
return run_local(full, check=check, capture=True)
def detect_user(self, ip):
for user in self.ssh_candidates:
proc = self._ssh(user, ip, "true", check=False)
if proc.returncode == 0:
self.active_ssh_user = user
self.log(f"Using SSH user '{user}' for {ip}")
return
raise RuntimeError(f"Unable to authenticate to {ip} with users: {', '.join(self.ssh_candidates)}")
for attempt in range(1, self.ssh_ready_retries + 1):
for user in self.ssh_candidates:
proc = self._ssh(user, ip, "true", check=False)
if proc.returncode == 0:
self.active_ssh_user = user
self.log(f"Using SSH user '{user}' for {ip}")
return
if attempt < self.ssh_ready_retries:
self.log(
f"SSH not ready on {ip} yet; retrying in {self.ssh_ready_delay}s "
f"({attempt}/{self.ssh_ready_retries})"
)
time.sleep(self.ssh_ready_delay)
raise RuntimeError(
"Unable to authenticate to "
f"{ip} with users: {', '.join(self.ssh_candidates)}. "
"If this is a freshly cloned VM, the Proxmox source template likely does not yet include the "
"current cloud-init-capable NixOS template configuration from nixos/template-base. "
"Terraform can only clone what exists in Proxmox; it cannot retrofit cloud-init support into an old template."
)
def remote(self, ip, cmd, check=True):
ordered = [self.active_ssh_user] + [u for u in self.ssh_candidates if u != self.active_ssh_user]
@@ -157,14 +174,7 @@ class Controller:
return last
def prepare_known_hosts(self):
ssh_dir = Path.home() / ".ssh"
ssh_dir.mkdir(parents=True, exist_ok=True)
(ssh_dir / "known_hosts").touch()
run_local(["chmod", "700", str(ssh_dir)])
run_local(["chmod", "600", str(ssh_dir / "known_hosts")])
for ip in self.node_ips.values():
run_local(["ssh-keygen", "-R", ip], check=False)
run_local(f"ssh-keyscan -H {shlex.quote(ip)} >> {shlex.quote(str(ssh_dir / 'known_hosts'))}", check=False)
pass
def prepare_remote_nix(self, ip):
self.remote(ip, "sudo mkdir -p /etc/nix")

View File

@@ -11,6 +11,7 @@ in
networking.hostName = "k8s-base-template";
networking.useDHCP = lib.mkDefault true;
networking.useNetworkd = true;
networking.nameservers = [ "1.1.1.1" "8.8.8.8" ];
boot.loader.systemd-boot.enable = lib.mkForce false;
@@ -20,6 +21,8 @@ in
};
services.qemuGuest.enable = true;
services.cloud-init.enable = true;
services.cloud-init.network.enable = true;
services.openssh.enable = true;
services.openssh.settings = {
PasswordAuthentication = false;