refactor: add Python bootstrap controller with resumable state

Introduce a clean orchestration layer in nixos/kubeadm/bootstrap/controller.py and slim rebuild-and-bootstrap.sh into a thin wrapper. The controller now owns preflight, rebuild, init, CNI install, join, and verify stages with persisted checkpoints on cp-1 plus a local state copy for CI debugging.
2026-03-03 00:09:10 +00:00
parent 355273add5
commit 6fecfb3ee6
3 changed files with 451 additions and 339 deletions


@@ -13,11 +13,19 @@ This folder defines role-based NixOS configs for a kubeadm cluster.
- Shared cluster defaults in `modules/k8s-cluster-settings.nix`
- Role-specific settings for control planes and workers
- Generated per-node host configs from `flake.nix` (no duplicated host files)
- Bootstrap helper commands:
- Bootstrap helper commands on each node:
- `th-kubeadm-init`
- `th-kubeadm-join-control-plane`
- `th-kubeadm-join-worker`
- `th-kubeadm-status`
- A Python bootstrap controller for orchestration:
- `bootstrap/controller.py`
## Layered architecture
- `terraform/`: VM lifecycle only
- `nixos/kubeadm/modules/`: declarative node OS config only
- `nixos/kubeadm/bootstrap/controller.py`: imperative cluster reconciliation state machine
## Hardware config files
@@ -114,6 +122,15 @@ FAST_MODE=1 WORKER_PARALLELISM=3 REBUILD_TIMEOUT=45m REBUILD_RETRIES=2 ./scripts
- `FAST_MODE=1` skips pre-rebuild remote GC cleanup to reduce wall-clock time.
- Set `FAST_MODE=0` for a slower but more aggressive space cleanup pass.
### Bootstrap controller state
The controller stores checkpoints in two places:
- Remote (source of truth): `/var/lib/terrahome/bootstrap-state.json` on `cp-1`
- Local copy (workflow/debug artifact): `nixos/kubeadm/bootstrap/bootstrap-state-last.json`
This makes retries resumable and keeps failure context visible from CI.
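
For illustration only, here is a minimal sketch of how a resumable checkpoint loop of this kind could work. It is not the code in `bootstrap/controller.py`: the stage names are taken from the commit description, while the JSON layout (`completed`, `failed`), the helper names, and the handler mapping are assumptions. It covers a single local state file and leaves out the copy to and from `cp-1`.

```python
import json
import os
import tempfile
from pathlib import Path

# Stage names come from the commit description; the exact order is assumed.
STAGES = ["preflight", "rebuild", "init", "cni", "join", "verify"]


def load_state(path: Path) -> dict:
    """Return the saved checkpoint, or a fresh one if no file exists yet."""
    if path.exists():
        return json.loads(path.read_text())
    return {"completed": [], "failed": None}


def save_state(path: Path, state: dict) -> None:
    """Write the checkpoint via a temp file and rename, so readers never see a partial file."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle, indent=2)
    os.replace(tmp, path)


def run(path: Path, handlers: dict) -> dict:
    """Run stages in order, skipping any that a previous run already completed."""
    state = load_state(path)
    state["failed"] = None  # clear the failure recorded by an earlier attempt
    for stage in STAGES:
        if stage in state["completed"]:
            continue
        try:
            handlers[stage]()
        except Exception as exc:
            # Persist the failure context before re-raising so CI can inspect it.
            state["failed"] = {"stage": stage, "error": str(exc)}
            save_state(path, state)
            raise
        state["completed"].append(stage)
        save_state(path, state)
    return state
```

Under these assumptions, a run that fails in `init` leaves `preflight` and `rebuild` recorded as completed, and the next call to `run` with the same state file starts again at `init`.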
3. If you only want to reset Kubernetes state on existing VMs:
```bash