Kubeadm Cluster Layout (NixOS)

This folder defines role-based NixOS configs for a kubeadm cluster.

Topology

  • Control planes: cp-1, cp-2, cp-3
  • Workers: wk-1, wk-2, wk-3

What this provides

  • Shared Kubernetes/node prerequisites in modules/k8s-common.nix
  • Shared cluster defaults in modules/k8s-cluster-settings.nix
  • Role-specific settings for control planes and workers
  • Generated per-node host configs from flake.nix (no duplicated host files)
  • Bootstrap helper commands on each node:
    • th-kubeadm-init
    • th-kubeadm-join-control-plane
    • th-kubeadm-join-worker
    • th-kubeadm-status
  • A Python bootstrap controller for orchestration:
    • bootstrap/controller.py

Layered architecture

  • terraform/: VM lifecycle only
  • nixos/kubeadm/modules/: declarative node OS config only
  • nixos/kubeadm/bootstrap/controller.py: imperative cluster reconciliation state machine

Hardware config files

The flake automatically imports hosts/hardware/<host>.nix if present. Generate and copy each node's hardware config into hosts/hardware/:

sudo nixos-generate-config
sudo cp /etc/nixos/hardware-configuration.nix ./hosts/hardware/cp-1.nix

Repeat for each node (cp-2, cp-3, wk-1, wk-2, wk-3).
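If the nodes are already reachable over SSH, that per-node copy can be scripted. A sketch, assuming the hostnames resolve as written and your user has sudo on each node:

```shell
# Generate and pull hardware configs for all six nodes over SSH.
# Reading via `sudo cat` avoids scp tripping over root-only file permissions.
for host in cp-1 cp-2 cp-3 wk-1 wk-2 wk-3; do
  ssh "$host" 'sudo nixos-generate-config'
  ssh "$host" 'sudo cat /etc/nixos/hardware-configuration.nix' \
    > "./hosts/hardware/${host}.nix"
done
```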

Deploy approach

Start with one node at a time while experimenting:

sudo nixos-rebuild switch --flake .#cp-1

For remote target-host workflows, use your preferred deploy wrapper later (nixos-rebuild --target-host ... or deploy-rs/colmena).

Bootstrap runbook (kubeadm + kube-vip + Flannel)

  1. Apply Nix config on all nodes (cp-*, then wk-*).
  2. On cp-1, run:
sudo th-kubeadm-init

This infers the control-plane VIP as <node-subnet>.250 on eth0, creates the kube-vip static pod manifest, and runs kubeadm init.
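The VIP derivation is just last-octet substitution, which can be sketched in a couple of lines (NODE_IP here is a hypothetical stand-in for the address th-kubeadm-init reads from eth0; the real logic lives in that helper):

```shell
# Sketch of the inference rule only, not th-kubeadm-init itself.
NODE_IP="192.168.20.31"   # hypothetical eth0 address of cp-1
VIP="${NODE_IP%.*}.250"   # drop the last octet, append the VIP suffix
echo "$VIP"               # prints 192.168.20.250
```

If .250 collides with something on your subnet, adjust controlPlaneVipSuffix in modules/k8s-cluster-settings.nix before running init.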

  3. Install Flannel from cp-1:
kubectl apply -f https://raw.githubusercontent.com/flannel-io/flannel/v0.25.5/Documentation/kube-flannel.yml
  4. Generate join commands on cp-1:
sudo kubeadm token create --print-join-command
sudo kubeadm init phase upload-certs --upload-certs
  5. Join cp-2 and cp-3:
sudo th-kubeadm-join-control-plane '<kubeadm join ... --control-plane --certificate-key ...>'
  6. Join workers:
sudo th-kubeadm-join-worker '<kubeadm join ...>'
  7. Validate from a control plane:
kubectl get nodes -o wide
kubectl -n kube-system get pods -o wide
  8. Copy and edit inventory:
cp ./scripts/inventory.example.env ./scripts/inventory.env
$EDITOR ./scripts/inventory.env
  9. Rebuild all nodes and bootstrap a fresh cluster:
./scripts/rebuild-and-bootstrap.sh

Optional tuning env vars:

FAST_MODE=1 WORKER_PARALLELISM=3 REBUILD_TIMEOUT=45m REBUILD_RETRIES=2 ./scripts/rebuild-and-bootstrap.sh
  • FAST_MODE=1 skips pre-rebuild remote GC cleanup to reduce wall-clock time.
  • Set FAST_MODE=0 for a slower but more aggressive space cleanup pass.

Bootstrap controller state

The controller stores checkpoints in two places:

  • Remote (source of truth): /var/lib/terrahome/bootstrap-state.json on cp-1
  • Local copy (workflow/debug artifact): nixos/kubeadm/bootstrap/bootstrap-state-last.json

This makes retries resumable and keeps failure context visible from CI.
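The resume behaviour boils down to "skip any step already recorded in the state file". A toy illustration of that pattern (the step name and JSON layout here are hypothetical; the real schema is defined by bootstrap/controller.py):

```shell
# Toy resumability check, not the controller's actual logic.
STATE="$(mktemp)"
echo '{"completed": ["kubeadm_init"]}' > "$STATE"   # pretend checkpoint
if grep -q '"kubeadm_init"' "$STATE"; then
  echo "kubeadm_init already done, skipping"
fi
rm -f "$STATE"
```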

If you only want to reset Kubernetes state on existing VMs:
./scripts/reset-cluster-nodes.sh

For a full nuke/recreate lifecycle:

  • run Terraform destroy/apply for VMs first,
  • then run ./scripts/rebuild-and-bootstrap.sh again.

Node lists come directly from static Terraform outputs, so bootstrap does not depend on Proxmox guest-agent IP discovery or SSH subnet scanning.
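Those outputs can be inspected by hand to see what the scripts consume. The JSON below is a made-up sample of what a list-valued output might return; on a real checkout it would come from `terraform output -json` (the scripts may well use jq, but tr keeps this sketch dependency-free):

```shell
# Real source: terraform -chdir=terraform output -json control_plane_vm_ipv4
CP_JSON='["192.168.20.31","192.168.20.32","192.168.20.33"]'  # sample only
echo "$CP_JSON" | tr -d '[]"' | tr ',' '\n'                  # one IP per line
```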

Optional Gitea workflow automation

Primary flow:

  • Push to master triggers .gitea/workflows/terraform-apply.yml
  • That workflow runs Terraform apply and then a fresh kubeadm bootstrap automatically

Manual dispatch workflows are available:

  • .gitea/workflows/kubeadm-bootstrap.yml
  • .gitea/workflows/kubeadm-reset.yml

Required repository secrets:

  • Existing Terraform/backend secrets used by current workflows (B2_*, PM_API_TOKEN_SECRET, SSH_KEY_PUBLIC)
  • SSH private key: prefer KUBEADM_SSH_PRIVATE_KEY, fallback to existing SSH_KEY_PRIVATE

Optional secrets:

  • KUBEADM_SSH_USER (defaults to micqdf)

Node IPs are rendered directly from static Terraform outputs (control_plane_vm_ipv4, worker_vm_ipv4), so you do not need per-node IP secrets or SSH discovery fallbacks.

Notes

  • Scripts are intentionally manual-triggered (predictable for homelab bring-up).
  • If .250 on the node subnet is already in use, change controlPlaneVipSuffix in modules/k8s-cluster-settings.nix before bootstrap.