Commit Graph

20 Commits

Author SHA1 Message Date
90d105e5ea Fix kube_api_endpoint variable passing for HA cluster
All checks were successful
Deploy Cluster / Terraform (push) Successful in 2m18s
Deploy Cluster / Ansible (push) Successful in 8m55s
- Remove circular variable reference in site.yml
- Add kube_api_endpoint default to k3s-server role
- Variable is set via inventory group_vars and passed to role
- Primary CP now correctly adds LB IP to TLS SANs

Note: An existing cluster must be destroyed and rebuilt to regenerate its certificates.
2026-03-23 03:01:53 +00:00
952a80a742 Fix HA cluster join via Load Balancer private IP
Some checks failed
Deploy Cluster / Terraform (push) Successful in 36s
Deploy Cluster / Ansible (push) Failing after 3m5s
Changes:
- Use LB private IP (10.0.1.5) instead of public IP for cluster joins
- Add LB private IP to k3s TLS SANs on primary control plane
- This allows secondary CPs and workers to verify certificates when joining via LB

Fixes x509 certificate validation error when joining via LB public IP.
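
The x509 failure described above can be illustrated with a minimal Python sketch. It models only the address-matching step a TLS client performs against the certificate's SAN list, not full certificate verification. Only 10.0.1.5 comes from this commit message; the other SAN entries and the public IP (203.0.113.10, a documentation address) are hypothetical.

```python
import ipaddress


def _normalize(value):
    """Parse IP literals so equivalent spellings compare equal; lowercase hostnames."""
    try:
        return ipaddress.ip_address(value)
    except ValueError:
        return value.lower()


def endpoint_in_sans(endpoint, sans):
    """Return True if the join endpoint matches one of the certificate SANs.

    Wildcard hostnames are not handled in this sketch.
    """
    target = _normalize(endpoint)
    return any(_normalize(san) == target for san in sans)


# Hypothetical SAN lists before and after the LB private IP is added.
sans_before = ["127.0.0.1", "10.0.1.11"]
sans_after = sans_before + ["10.0.1.5"]

LB_PRIVATE_IP = "10.0.1.5"      # from the commit message
LB_PUBLIC_IP = "203.0.113.10"   # hypothetical placeholder

print(endpoint_in_sans(LB_PRIVATE_IP, sans_before))
print(endpoint_in_sans(LB_PRIVATE_IP, sans_after))
print(endpoint_in_sans(LB_PUBLIC_IP, sans_after))
```

Under these assumptions, a join via the LB private IP only passes the check once that IP is in the SAN list, and a join via the public IP still fails because that address was never added.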
2026-03-23 02:56:41 +00:00
ff31cb4e74 Implement HA control plane with Load Balancer (3-3 topology)
Some checks failed
Deploy Cluster / Terraform (push) Failing after 10s
Deploy Cluster / Ansible (push) Has been skipped
Major changes:
- Terraform: Scale to 3 control planes (cx23) + 3 workers (cx33)
- Terraform: Add Hetzner Load Balancer (lb11) for Kubernetes API
- Terraform: Add kube_api_lb_ip output
- Ansible: Add community.network collection to requirements
- Ansible: Update inventory to include LB endpoint
- Ansible: Configure secondary CPs and workers to join via LB
- Ansible: Add k3s_join_endpoint variable for HA joins
- Workflow: Add imports for cp-2, cp-3, and worker-3
- Docs: Update STABLE_BASELINE.md with HA topology and phase gates

Topology:
- 3 control planes (cx23 - 2 vCPU, 4GB RAM each)
- 3 workers (cx33 - 4 vCPU, 8GB RAM each)
- 1 Load Balancer (lb11) routing to all 3 control planes on port 6443
- Workers and secondary CPs join via LB endpoint for HA

Cost impact: +~€26/month (2 extra CPs + 1 extra worker + LB)
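
The reason for choosing three control planes can be sketched with the usual majority-quorum arithmetic. This assumes the control planes run k3s with embedded etcd, as the initial commit in this history states; the function names are illustrative only.

```python
def etcd_quorum(members):
    """Smallest majority of a cluster with the given number of members."""
    return members // 2 + 1


def tolerated_failures(members):
    """How many members can be lost while a majority still remains."""
    return members - etcd_quorum(members)


for n in (1, 2, 3):
    print(n, etcd_quorum(n), tolerated_failures(n))
```

With three members the quorum is two, so one control plane can fail without losing the API. Two members give no more fault tolerance than one.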
2026-03-23 02:39:39 +00:00
31b82c9371 Deploy CCM via Ansible before workers join to fix external cloud provider
Some checks failed
Deploy Cluster / Terraform (push) Successful in 31s
Deploy Cluster / Ansible (push) Failing after 1m48s
This fixes the chicken-and-egg problem where workers with
--kubelet-arg=cloud-provider=external couldn't join because CCM wasn't
running yet to remove the node.cloudprovider.kubernetes.io/uninitialized taint.

Changes:
- Create ansible/roles/ccm-deploy/ to deploy CCM via Helm during Ansible phase
- Reorder site.yml: CCM deploys after secrets but before workers join
- CCM runs on control_plane[0] with proper tolerations for control plane nodes
- Add 10s pause after CCM is ready so it can process new nodes
- Workers can now successfully join with external cloud provider enabled

Flux still manages CCM for updates, but initial install happens in Ansible.
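
The chicken-and-egg problem above can be modeled with a simplified Python sketch of taint and toleration matching. It compares taints only by key and effect and ignores operators and values. The assumption that the CCM pod itself tolerates the uninitialized taint is mine, not stated in this commit.

```python
# Taint that the kubelet sets on a node started with cloud-provider=external.
UNINITIALIZED = ("node.cloudprovider.kubernetes.io/uninitialized", "NoSchedule")


def can_schedule(node_taints, pod_tolerations):
    """A pod fits a node only if every taint on the node is tolerated."""
    return all(taint in pod_tolerations for taint in node_taints)


def ccm_initialize(node_taints):
    """Model the CCM removing the uninitialized taint from a node."""
    return [t for t in node_taints if t != UNINITIALIZED]


new_worker = [UNINITIALIZED]          # freshly joined worker
workload_tolerations = []             # ordinary pod, no tolerations
ccm_tolerations = [UNINITIALIZED]     # assumed toleration on the CCM pod

print(can_schedule(new_worker, workload_tolerations))
print(can_schedule(new_worker, ccm_tolerations))
print(can_schedule(ccm_initialize(new_worker), workload_tolerations))
```

In this model, ordinary workloads cannot land on a new worker until the CCM has processed it, which is why the CCM has to be running before the workers join.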
2026-03-22 23:58:03 +00:00
08a3031276 refactor: retire imperative addon roles
All checks were successful
Deploy Cluster / Terraform (push) Successful in 52s
Deploy Cluster / Ansible (push) Successful in 4m2s
2026-03-17 01:04:02 +00:00
bed8e4afc8 feat: migrate core addons toward flux
All checks were successful
Deploy Cluster / Terraform (push) Successful in 49s
Deploy Cluster / Ansible (push) Successful in 4m6s
2026-03-11 17:43:35 +00:00
6f2e056b98 feat: sync runtime secrets from doppler
All checks were successful
Deploy Cluster / Terraform (push) Successful in 45s
Deploy Cluster / Ansible (push) Successful in 9m56s
2026-03-09 00:25:41 +00:00
f95e0051a5 feat: automate private tailnet access on cp1
All checks were successful
Deploy Cluster / Terraform (push) Successful in 47s
Deploy Cluster / Ansible (push) Successful in 9m45s
2026-03-08 04:16:06 +00:00
86fb5d5b90 fix: move observability gitops gating to role level
All checks were successful
Deploy Cluster / Terraform (push) Successful in 44s
Deploy Cluster / Ansible (push) Successful in 9m17s
2026-03-05 00:17:25 +00:00
8b403cd1d6 feat: migrate observability stack to flux gitops
Some checks failed
Deploy Cluster / Terraform (push) Successful in 45s
Deploy Cluster / Ansible (push) Failing after 1m11s
2026-03-04 23:38:40 +00:00
2f166ed9e7 feat: manage grafana content as code with fast dashboard workflow
Some checks failed
Deploy Cluster / Terraform (push) Successful in 46s
Deploy Cluster / Ansible (push) Has been cancelled
Deploy Grafana Content / Grafana Content (push) Has been cancelled
2026-03-04 03:36:01 +00:00
a0ed6523ec feat: add Tailscale Kubernetes Operator for Grafana/Prometheus access
Some checks failed
Deploy Cluster / Ansible (push) Has been cancelled
Deploy Cluster / Terraform (push) Has been cancelled
2026-03-02 20:28:51 +00:00
b30977a158 feat: deploy lightweight observability stack via Ansible
Some checks failed
Deploy Cluster / Terraform (push) Successful in 45s
Deploy Cluster / Ansible (push) Has been cancelled
2026-03-02 01:33:41 +00:00
2bc9749b81 feat: switch kubeconfig to tailnet endpoint and deploy Hetzner CSI
All checks were successful
Deploy Cluster / Terraform (push) Successful in 51s
Deploy Cluster / Ansible (push) Successful in 3m12s
2026-03-01 17:12:12 +00:00
b5b8f89dc2 fix: derive k3s node IPs from terraform private addresses
Some checks failed
Deploy Cluster / Terraform (push) Successful in 18s
Deploy Cluster / Ansible (push) Failing after 3m9s
2026-03-01 03:08:56 +00:00
b703cb269b fix: bootstrap k3s HA on private network with dual SANs
Some checks failed
Deploy Cluster / Terraform (push) Successful in 2m31s
Deploy Cluster / Ansible (push) Failing after 4m38s
2026-03-01 02:45:00 +00:00
64dfbf7315 fix: use primary public IP for k3s join to match existing API cert SAN
Some checks failed
Deploy Cluster / Terraform (push) Successful in 18s
Deploy Cluster / Ansible (push) Failing after 17m50s
2026-03-01 02:25:13 +00:00
27b29322cd fix: use private network IPs for k3s join and node addressing
Some checks failed
Deploy Cluster / Terraform (push) Successful in 24s
Deploy Cluster / Ansible (push) Failing after 8m13s
2026-03-01 00:42:55 +00:00
1db435cd42 fix: Use private IP for k3s HA cluster join and advertise
Some checks failed
Deploy Cluster / Terraform (push) Successful in 19s
Deploy Cluster / Ansible (push) Failing after 8m11s
2026-03-01 00:32:03 +00:00
3b3084b997 feat: Add HA Kubernetes cluster with Terraform + Ansible
Some checks failed
Terraform / Validate (push) Failing after 17s
Terraform / Plan (push) Has been skipped
Terraform / Apply (push) Has been skipped
- 3x CX23 control plane nodes (HA)
- 4x CX33 worker nodes
- k3s with embedded etcd
- Hetzner CCM for load balancers
- Gitea CI/CD workflows
- Backblaze B2 for Terraform state
2026-02-28 20:24:55 +00:00