refactor: simplify homelab bootstrap around static IPs and fresh runs
Some checks failed
Terraform Plan / Terraform Plan (push) Failing after 10s

Make Terraform the source of truth for node IPs, remove guest-agent/SSH discovery from the normal workflow path, simplify the bootstrap controller to a fresh-run flow, and swap the initial CNI to Flannel so cluster readiness is easier to prove before reintroducing more complex reconcile behavior.
This commit is contained in:
2026-03-07 00:52:35 +00:00
parent e06b2c692e
commit a0b07816b9
9 changed files with 78 additions and 177 deletions

View File

@@ -50,7 +50,7 @@ sudo nixos-rebuild switch --flake .#cp-1
For remote target-host workflows, use your preferred deploy wrapper later
(`nixos-rebuild --target-host ...` or deploy-rs/colmena).
## Bootstrap runbook (kubeadm + kube-vip + Cilium)
## Bootstrap runbook (kubeadm + kube-vip + Flannel)
1. Apply Nix config on all nodes (`cp-*`, then `wk-*`).
2. On `cp-1`, run:
@@ -62,14 +62,10 @@ sudo th-kubeadm-init
This infers the control-plane VIP as `<node-subnet>.250` on `eth0`, creates the
kube-vip static pod manifest, and runs `kubeadm init`.
3. Install Cilium from `cp-1`:
3. Install Flannel from `cp-1`:
```bash
helm repo add cilium https://helm.cilium.io
helm repo update
helm upgrade --install cilium cilium/cilium \
--namespace kube-system \
--set kubeProxyReplacement=true
kubectl apply -f https://raw.githubusercontent.com/flannel-io/flannel/v0.25.5/Documentation/kube-flannel.yml
```
4. Generate join commands on `cp-1`:
@@ -98,7 +94,7 @@ kubectl get nodes -o wide
kubectl -n kube-system get pods -o wide
```
## Repeatable rebuild flow (recommended)
## Fresh bootstrap flow (recommended)
1. Copy and edit inventory:
@@ -107,7 +103,7 @@ cp ./scripts/inventory.example.env ./scripts/inventory.env
$EDITOR ./scripts/inventory.env
```
2. Rebuild all nodes and bootstrap/reconcile cluster:
2. Rebuild all nodes and bootstrap a fresh cluster:
```bash
./scripts/rebuild-and-bootstrap.sh
@@ -141,15 +137,15 @@ For a full nuke/recreate lifecycle:
- run Terraform destroy/apply for VMs first,
- then run `./scripts/rebuild-and-bootstrap.sh` again.
Node lists are discovered from Terraform outputs, so adding new workers/control
planes in Terraform is picked up automatically by the bootstrap/reconcile flow.
Node lists now come directly from static Terraform outputs, so bootstrap no longer
depends on Proxmox guest-agent IP discovery or SSH subnet scanning.
## Optional Gitea workflow automation
Primary flow:
- Push to `master` triggers `.gitea/workflows/terraform-apply.yml`
- That workflow now does Terraform apply and then runs kubeadm rebuild/bootstrap reconciliation automatically
- That workflow now does Terraform apply and then runs a fresh kubeadm bootstrap automatically
Manual dispatch workflows are available:
@@ -164,9 +160,7 @@ Required repository secrets:
Optional secrets:
- `KUBEADM_SSH_USER` (defaults to `micqdf`)
- `KUBEADM_SUBNET_PREFIX` (optional, e.g. `10.27.27`; used for SSH-based IP discovery fallback)
Node IPs are auto-discovered from Terraform state outputs (`control_plane_vm_ipv4`, `worker_vm_ipv4`), so you do not need per-node IP secrets.
Node IPs are rendered directly from static Terraform outputs (`control_plane_vm_ipv4`, `worker_vm_ipv4`), so you do not need per-node IP secrets or SSH discovery fallbacks.
## Notes