Commit Graph

177 Commits

Author SHA1 Message Date
e47ec2a3e7 Update Weave GitOps to v0.41.0 to support HelmRelease v2 API
All checks were successful
Deploy Cluster / Terraform (push) Successful in 37s
Deploy Cluster / Ansible (push) Successful in 4m30s
Fixes error: 'no matches for kind HelmRelease in version v2beta1'

The cluster uses HelmRelease v2 API but Weave GitOps v0.38.0 was looking
for the old v2beta1 API. Updated image tag to v0.41.0 which supports
the newer API version.
2026-03-24 01:33:10 +00:00
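The version mismatch behind this fix can be sketched in a few lines. This is an illustration only: the function name and the served-version set are hypothetical, and the error string imitates the wording quoted in the commit message rather than the exact text the API server returns.

```python
from typing import Optional, Set


def lookup_error(kind: str, requested: str, served: Set[str]) -> Optional[str]:
    """Return an error string if the requested API version is not served.

    Hypothetical helper: it mimics a client that is pinned to one API
    version and a cluster that serves a fixed set of versions.
    """
    if requested in served:
        return None
    return f"no matches for kind {kind} in version {requested}"


# Assumption: the cluster serves only the v2 HelmRelease API.
served_versions = {"v2"}

# A client pinned to the old beta API fails the lookup.
print(lookup_error("HelmRelease", "v2beta1", served_versions))
# A client that knows the v2 API succeeds (prints None).
print(lookup_error("HelmRelease", "v2", served_versions))
```

A client built against v2beta1 cannot be fixed by configuration alone under this model, which is consistent with the commit resolving the error by bumping the image tag.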
45c899d2bd Configure Weave GitOps to use Doppler-managed admin credentials
All checks were successful
Deploy Cluster / Terraform (push) Successful in 39s
Deploy Cluster / Ansible (push) Successful in 4m41s
Changes:
- Enable adminUser creation but disable Helm-managed secret
- Use ExternalSecret (cluster-user-auth) from Doppler instead
- Doppler secrets: WEAVE_GITOPS_ADMIN_USERNAME and WEAVE_GITOPS_ADMIN_PASSWORD_BCRYPT_HASH
- Added cluster-user-auth to viewSecretsResourceNames for RBAC

Login credentials are now managed via Doppler and External Secrets Operator.
2026-03-24 01:01:30 +00:00
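A rough sketch of the ExternalSecret this commit describes, built as a Python dict and printed as JSON (which kubectl also accepts). Several details are assumptions, not taken from the repository: the `external-secrets.io/v1beta1` API version, the store name `doppler`, the `flux-system` namespace, and the `username`/`password` key names in the resulting secret.

```python
import json


def cluster_user_auth_external_secret(store_name: str, namespace: str) -> dict:
    """Build an ExternalSecret manifest that materializes cluster-user-auth.

    Illustrative only; field values marked as assumptions may differ
    from the real manifest in the repository.
    """
    return {
        "apiVersion": "external-secrets.io/v1beta1",  # assumption
        "kind": "ExternalSecret",
        "metadata": {"name": "cluster-user-auth", "namespace": namespace},
        "spec": {
            "secretStoreRef": {"kind": "ClusterSecretStore", "name": store_name},
            "target": {"name": "cluster-user-auth"},
            "data": [
                {
                    "secretKey": "username",  # assumed key name
                    "remoteRef": {"key": "WEAVE_GITOPS_ADMIN_USERNAME"},
                },
                {
                    "secretKey": "password",  # assumed key name; holds the bcrypt hash
                    "remoteRef": {"key": "WEAVE_GITOPS_ADMIN_PASSWORD_BCRYPT_HASH"},
                },
            ],
        },
    }


# Store name and namespace are hypothetical placeholders.
manifest = cluster_user_auth_external_secret("doppler", "flux-system")
print(json.dumps(manifest, indent=2))
```

Only the bcrypt hash is synced in this sketch, so the plaintext password never needs to exist in the cluster.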
0e52d8f159 Use Tailscale DNS names instead of IPs for TLS SANs
All checks were successful
Deploy Cluster / Terraform (push) Successful in 2m21s
Deploy Cluster / Ansible (push) Successful in 9m0s
Changed from hardcoded Tailscale IPs to DNS names:
- k8s-cluster-cp-1.silverside-gopher.ts.net
- k8s-cluster-cp-2.silverside-gopher.ts.net
- k8s-cluster-cp-3.silverside-gopher.ts.net

This is more robust since Tailscale IPs change on rebuild,
but DNS names remain consistent.

After next rebuild, cluster accessible via:
- kubectl --server=https://k8s-cluster-cp-1.silverside-gopher.ts.net:6443
2026-03-23 23:50:48 +00:00
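As a minimal sketch of the idea, the DNS names above could be turned into repeated `--tls-san` arguments for the k3s server. The helper name is hypothetical, and the actual Ansible role may assemble these arguments differently.

```python
from typing import List


def k3s_tls_san_args(names: List[str]) -> List[str]:
    """Return one --tls-san argument per unique name, preserving order."""
    unique: List[str] = []
    for name in names:
        if name not in unique:
            unique.append(name)
    return [f"--tls-san={name}" for name in unique]


control_plane_dns = [
    "k8s-cluster-cp-1.silverside-gopher.ts.net",
    "k8s-cluster-cp-2.silverside-gopher.ts.net",
    "k8s-cluster-cp-3.silverside-gopher.ts.net",
]

# Print the arguments that would be appended to the k3s server command.
for arg in k3s_tls_san_args(control_plane_dns):
    print(arg)
```

Because the names are baked into the serving certificate at install time, a change to this list only takes effect once the certificate is regenerated, which matches the "after next rebuild" note in the commit.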
4726db2b5b Add Tailscale IPs to k3s TLS SANs for secure tailnet access
All checks were successful
Deploy Cluster / Terraform (push) Successful in 2m30s
Deploy Cluster / Ansible (push) Successful in 9m48s
Changes:
- Add tailscale_control_plane_ips list to k3s-server defaults
- Include all 3 control plane Tailscale IPs (100.120.55.97, 100.108.90.123, 100.92.149.85)
- Update primary k3s install to add Tailscale IPs to TLS certificates
- Enables kubectl access via Tailscale without certificate errors

After next deploy, cluster will be accessible via:
- kubectl --server=https://100.120.55.97:6443 (or any CP tailscale IP)
- kubectl --server=https://k8s-cluster-cp-1:6443 (via tailscale DNS)
2026-03-23 23:04:00 +00:00
90d105e5ea Fix kube_api_endpoint variable passing for HA cluster
All checks were successful
Deploy Cluster / Terraform (push) Successful in 2m18s
Deploy Cluster / Ansible (push) Successful in 8m55s
- Remove circular variable reference in site.yml
- Add kube_api_endpoint default to k3s-server role
- Variable is set via inventory group_vars and passed to role
- Primary CP now correctly adds LB IP to TLS SANs

Note: Existing cluster needs destroy/rebuild to regenerate certificates.
2026-03-23 03:01:53 +00:00
952a80a742 Fix HA cluster join via Load Balancer private IP
Some checks failed
Deploy Cluster / Terraform (push) Successful in 36s
Deploy Cluster / Ansible (push) Failing after 3m5s
Changes:
- Use LB private IP (10.0.1.5) instead of public IP for cluster joins
- Add LB private IP to k3s TLS SANs on primary control plane
- This allows secondary CPs and workers to verify certificates when joining via LB

Fixes x509 certificate validation error when joining via LB public IP.
2026-03-23 02:56:41 +00:00
4965017b86 Fix Load Balancer network attachment
Some checks failed
Deploy Cluster / Terraform (push) Successful in 54s
Deploy Cluster / Ansible (push) Failing after 3m44s
Add hcloud_load_balancer_network resource to attach LB to private network.
This is required before targets can use use_private_ip=true.
LB gets IP 10.0.1.5 on the private network.
2026-03-23 02:44:35 +00:00
b2b9c38b91 Fix Load Balancer output attribute - use ipv4 instead of ipv4_address
Some checks failed
Deploy Cluster / Terraform (push) Failing after 1m37s
Deploy Cluster / Ansible (push) Has been skipped
2026-03-23 02:40:50 +00:00
ff31cb4e74 Implement HA control plane with Load Balancer (3-3 topology)
Some checks failed
Deploy Cluster / Terraform (push) Failing after 10s
Deploy Cluster / Ansible (push) Has been skipped
Major changes:
- Terraform: Scale to 3 control planes (cx23) + 3 workers (cx33)
- Terraform: Add Hetzner Load Balancer (lb11) for Kubernetes API
- Terraform: Add kube_api_lb_ip output
- Ansible: Add community.network collection to requirements
- Ansible: Update inventory to include LB endpoint
- Ansible: Configure secondary CPs and workers to join via LB
- Ansible: Add k3s_join_endpoint variable for HA joins
- Workflow: Add imports for cp-2, cp-3, and worker-3
- Docs: Update STABLE_BASELINE.md with HA topology and phase gates

Topology:
- 3 control planes (cx23 - 2 vCPU, 8GB RAM each)
- 3 workers (cx33 - 4 vCPU, 16GB RAM each)
- 1 Load Balancer (lb11) routing to all 3 control planes on port 6443
- Workers and secondary CPs join via LB endpoint for HA

Cost impact: +~€26/month (2 extra CPs + 1 extra worker + LB)
2026-03-23 02:39:39 +00:00
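The choice of three control planes can be motivated with simple quorum arithmetic. This sketch assumes the control planes form a majority-quorum datastore such as embedded etcd, which the commit message does not state explicitly.

```python
def quorum(members: int) -> int:
    """Smallest majority of a cluster with the given member count."""
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """How many members can fail while a majority remains."""
    return members - quorum(members)


for n in (1, 2, 3):
    print(f"{n} control plane(s): quorum={quorum(n)}, "
          f"tolerated failures={tolerated_failures(n)}")
```

Under that assumption, two control planes tolerate no more failures than one, so three is the smallest size that survives the loss of a single node.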
8b4a445b37 Update STABLE_BASELINE.md - CCM/CSI integration achieved
All checks were successful
Deploy Cluster / Terraform (push) Successful in 31s
Deploy Cluster / Ansible (push) Successful in 3m36s
Document the successful completion of Hetzner CCM and CSI integration:
- CCM deployed via Ansible before workers join (fixes uninitialized taint)
- CSI provides hcloud-volumes StorageClass for persistent storage
- Two consecutive rebuilds passed all phase gates
- PVC provisioning tested and working

Platform now has full cloud provider integration with persistent volumes.
2026-03-23 02:25:00 +00:00
e447795395 Install helm binary in ccm-deploy role before using it
All checks were successful
Deploy Cluster / Terraform (push) Successful in 2m1s
Deploy Cluster / Ansible (push) Successful in 6m35s
The kubernetes.core.helm module requires helm CLI to be installed on
the target node. Added check and install step using the official
helm install script.
2026-03-23 00:07:39 +00:00
31b82c9371 Deploy CCM via Ansible before workers join to fix external cloud provider
Some checks failed
Deploy Cluster / Terraform (push) Successful in 31s
Deploy Cluster / Ansible (push) Failing after 1m48s
This fixes the chicken-and-egg problem where workers with
--kubelet-arg=cloud-provider=external couldn't join because CCM wasn't
running yet to remove the node.cloudprovider.kubernetes.io/uninitialized taint.

Changes:
- Create ansible/roles/ccm-deploy/ to deploy CCM via Helm during Ansible phase
- Reorder site.yml: CCM deploys after secrets but before workers join
- CCM runs on control_plane[0] with proper tolerations for control plane nodes
- Add 10s pause after CCM ready to ensure it can process new nodes
- Workers can now successfully join with external cloud provider enabled

Flux still manages CCM for updates, but initial install happens in Ansible.
2026-03-22 23:58:03 +00:00
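One way to observe the chicken-and-egg problem is to list nodes that still carry the uninitialized taint. The sketch below works on the JSON structure returned by `kubectl get nodes -o json`; the sample data and the function name are hypothetical.

```python
from typing import List

UNINITIALIZED_TAINT = "node.cloudprovider.kubernetes.io/uninitialized"


def uninitialized_nodes(node_list: dict) -> List[str]:
    """Return names of nodes the cloud controller manager has not initialized."""
    names = []
    for node in node_list.get("items", []):
        # spec.taints may be absent or null when a node has no taints.
        taints = node.get("spec", {}).get("taints") or []
        if any(t.get("key") == UNINITIALIZED_TAINT for t in taints):
            names.append(node["metadata"]["name"])
    return names


# Hypothetical sample: one initialized control plane, one waiting worker.
sample = {
    "items": [
        {"metadata": {"name": "cp-1"}, "spec": {}},
        {
            "metadata": {"name": "worker-1"},
            "spec": {
                "taints": [
                    {"key": UNINITIALIZED_TAINT, "value": "true", "effect": "NoSchedule"}
                ]
            },
        },
    ]
}

print(uninitialized_nodes(sample))
```

An empty result after the workers join would suggest the CCM is running and removing the taint as intended.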
cadfedacf1 Fix providerID health check - use shell module for piped grep
Some checks failed
Deploy Cluster / Terraform (push) Successful in 1m47s
Deploy Cluster / Ansible (push) Failing after 18m4s
2026-03-22 22:55:55 +00:00
561cd67b0c Enable Hetzner CCM and CSI for cloud provider integration
Some checks failed
Deploy Cluster / Terraform (push) Successful in 30s
Deploy Cluster / Ansible (push) Failing after 3m21s
- Enable --kubelet-arg=cloud-provider=external on all nodes (control planes and workers)
- Activate CCM Kustomization with 10m timeout for Hetzner cloud-controller-manager
- Activate CSI Kustomization with dependsOn CCM and 10m timeout for hcloud-csi
- Update deploy workflow to wait for CCM/CSI readiness (600s timeout)
- Add providerID verification to post-deploy health checks

This enables proper cloud provider integration with Hetzner CCM for node
labeling and Hetzner CSI for persistent volume provisioning.
2026-03-22 22:26:21 +00:00
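The providerID verification mentioned above could look roughly like this. It assumes the Hetzner CCM sets `spec.providerID` to a value beginning with `hcloud://`; the function name and sample data are hypothetical, and the real health check in the workflow may be implemented differently.

```python
from typing import List


def nodes_missing_provider_id(node_list: dict, prefix: str = "hcloud://") -> List[str]:
    """Return names of nodes whose providerID is unset or has another prefix."""
    missing = []
    for node in node_list.get("items", []):
        provider_id = node.get("spec", {}).get("providerID") or ""
        if not provider_id.startswith(prefix):
            missing.append(node["metadata"]["name"])
    return missing


# Hypothetical sample shaped like `kubectl get nodes -o json` output.
nodes = {
    "items": [
        {"metadata": {"name": "cp-1"}, "spec": {"providerID": "hcloud://12345"}},
        {"metadata": {"name": "worker-1"}, "spec": {}},
    ]
}

print(nodes_missing_provider_id(nodes))
```

Parsing the JSON directly avoids piping command output through grep, at the cost of needing Python wherever the check runs.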
4eebbca648 docs: update README for deferred observability baseline
All checks were successful
Deploy Cluster / Terraform (push) Successful in 1m41s
Deploy Cluster / Ansible (push) Successful in 5m37s
2026-03-22 01:04:53 +00:00
7b5d794dfc fix: update health checks for deferred observability
Some checks failed
Deploy Cluster / Ansible (push) Has been cancelled
Deploy Cluster / Terraform (push) Has been cancelled
2026-03-22 01:04:27 +00:00
8643bbfc12 fix: defer observability to get clean baseline
Some checks failed
Deploy Cluster / Ansible (push) Has been cancelled
Deploy Cluster / Terraform (push) Has been cancelled
2026-03-22 01:03:55 +00:00
84f446c2e6 fix: restore observability timeouts to 5 minutes
Some checks failed
Deploy Cluster / Terraform (push) Successful in 32s
Deploy Cluster / Ansible (push) Failing after 8m38s
2026-03-22 00:43:37 +00:00
d446e86ece fix: use static grafana password, remove externalsecret dependency
Some checks failed
Deploy Cluster / Ansible (push) Has been cancelled
Deploy Cluster / Terraform (push) Has been cancelled
2026-03-22 00:43:21 +00:00
90c7f565e0 fix: remove tailscale ingress dependencies from observability
Some checks failed
Deploy Cluster / Terraform (push) Successful in 39s
Deploy Cluster / Ansible (push) Has been cancelled
2026-03-22 00:42:35 +00:00
989848fa89 fix: increase observability timeouts to 10 minutes
Some checks failed
Deploy Cluster / Terraform (push) Successful in 2m1s
Deploy Cluster / Ansible (push) Failing after 13m54s
2026-03-21 19:34:43 +00:00
56e5807474 fix: create doppler ClusterSecretStore after ESO is installed
Some checks failed
Deploy Cluster / Terraform (push) Successful in 47s
Deploy Cluster / Ansible (push) Failing after 8m31s
2026-03-21 19:19:43 +00:00
df0511148c fix: unsuspend tailscale operator for stable baseline
Some checks failed
Deploy Cluster / Terraform (push) Successful in 41s
Deploy Cluster / Ansible (push) Failing after 8m44s
2026-03-21 19:03:39 +00:00
894e6275b1 docs: update stable baseline to defer ccm/csi
Some checks failed
Deploy Cluster / Terraform (push) Successful in 28s
Deploy Cluster / Ansible (push) Failing after 8m35s
2026-03-21 18:41:36 +00:00
a01cf435d4 fix: skip ccm/csi waits for stable baseline - using k3s embedded
Some checks failed
Deploy Cluster / Terraform (push) Successful in 37s
Deploy Cluster / Ansible (push) Has been cancelled
2026-03-21 18:40:53 +00:00
84f77c4a68 fix: use kubectl patch instead of apply for flux controller nodeSelector
Some checks failed
Deploy Cluster / Terraform (push) Successful in 38s
Deploy Cluster / Ansible (push) Failing after 9m41s
2026-03-21 18:05:41 +00:00
2e4196688c fix: bootstrap flux in phases - crds first, then resources
Some checks failed
Deploy Cluster / Terraform (push) Successful in 38s
Deploy Cluster / Ansible (push) Failing after 3m19s
2026-03-21 17:42:39 +00:00
8d1f9f4944 fix: add k3s reset logic for primary control plane
Some checks failed
Deploy Cluster / Terraform (push) Successful in 39s
Deploy Cluster / Ansible (push) Failing after 4m19s
2026-03-21 16:10:17 +00:00
d4fd43e2f5 refactor: simplify k3s-server bootstrap for
2026-03-21 15:48:33 +00:00
48a80c362c fix: disable external cloud-provider kubelet arg for stable baseline
Some checks failed
Deploy Cluster / Terraform (push) Successful in 50s
Deploy Cluster / Ansible (push) Failing after 4m21s
2026-03-21 14:36:54 +00:00
fcf7f139ff fix: use public api endpoint for flux bootstrap
Some checks failed
Deploy Cluster / Terraform (push) Successful in 41s
Deploy Cluster / Ansible (push) Failing after 2m16s
2026-03-21 00:07:51 +00:00
7139ae322d fix: bootstrap flux during cluster deploy
Some checks failed
Deploy Cluster / Terraform (push) Successful in 38s
Deploy Cluster / Ansible (push) Failing after 3m21s
2026-03-20 10:37:11 +00:00
528a8dc210 fix: defer doppler store until eso is installed
Some checks failed
Deploy Cluster / Terraform (push) Successful in 45s
Deploy Cluster / Ansible (push) Failing after 24m34s
2026-03-20 09:30:17 +00:00
349f75729a fix: bootstrap tailscale namespace before secret
Some checks failed
Deploy Cluster / Terraform (push) Successful in 44s
Deploy Cluster / Ansible (push) Failing after 3m30s
2026-03-20 09:24:35 +00:00
522626a52b refactor: simplify stable cluster baseline
Some checks failed
Deploy Cluster / Terraform (push) Successful in 1m48s
Deploy Cluster / Ansible (push) Failing after 4m7s
2026-03-20 02:24:37 +00:00
5bd4c41c2d fix: restore k3s agent bootstrap
Some checks failed
Deploy Cluster / Terraform (push) Successful in 49s
Deploy Cluster / Ansible (push) Failing after 18m16s
2026-03-20 01:50:16 +00:00
3e41f71b1b fix: harden terraform destroy workflow
Some checks failed
Deploy Cluster / Terraform (push) Successful in 2m28s
Deploy Cluster / Ansible (push) Failing after 20m4s
2026-03-19 23:26:03 +00:00
9d2f30de32 fix: prepare k3s for external cloud provider
All checks were successful
Deploy Cluster / Terraform (push) Successful in 46s
Deploy Cluster / Ansible (push) Successful in 4m4s
2026-03-17 01:21:23 +00:00
08a3031276 refactor: retire imperative addon roles
All checks were successful
Deploy Cluster / Terraform (push) Successful in 52s
Deploy Cluster / Ansible (push) Successful in 4m2s
2026-03-17 01:04:02 +00:00
e3ce91db62 fix: align flux ccm with live deployment
All checks were successful
Deploy Cluster / Terraform (push) Successful in 47s
Deploy Cluster / Ansible (push) Successful in 3m56s
2026-03-11 18:17:16 +00:00
bed8e4afc8 feat: migrate core addons toward flux
All checks were successful
Deploy Cluster / Terraform (push) Successful in 49s
Deploy Cluster / Ansible (push) Successful in 4m6s
2026-03-11 17:43:35 +00:00
2d4de6cff8 fix: bootstrap doppler store outside flux
All checks were successful
Deploy Cluster / Terraform (push) Successful in 43s
Deploy Cluster / Ansible (push) Successful in 9m42s
2026-03-09 02:58:26 +00:00
4a83d981c8 fix: skip dry-run validation for doppler store sync
Some checks failed
Deploy Cluster / Terraform (push) Successful in 44s
Deploy Cluster / Ansible (push) Has been cancelled
2026-03-09 02:52:08 +00:00
d188a51ef6 fix: move doppler store manifests out of ignored path
Some checks failed
Deploy Cluster / Terraform (push) Successful in 45s
Deploy Cluster / Ansible (push) Has been cancelled
2026-03-09 02:45:46 +00:00
646ef16258 fix: stabilize flux and external secrets reconciliation
All checks were successful
Deploy Cluster / Terraform (push) Successful in 48s
Deploy Cluster / Ansible (push) Successful in 9m42s
2026-03-09 02:25:27 +00:00
6f2e056b98 feat: sync runtime secrets from doppler
All checks were successful
Deploy Cluster / Terraform (push) Successful in 45s
Deploy Cluster / Ansible (push) Successful in 9m56s
2026-03-09 00:25:41 +00:00
e10a70475f fix: right-size flux observability workloads
All checks were successful
Deploy Cluster / Terraform (push) Successful in 47s
Deploy Cluster / Ansible (push) Successful in 9m37s
2026-03-08 05:17:22 +00:00
f95e0051a5 feat: automate private tailnet access on cp1
All checks were successful
Deploy Cluster / Terraform (push) Successful in 47s
Deploy Cluster / Ansible (push) Successful in 9m45s
2026-03-08 04:16:06 +00:00
7c15ac5846 feat: add flux ui on shared tailscale endpoint
All checks were successful
Deploy Cluster / Terraform (push) Successful in 46s
Deploy Cluster / Ansible (push) Successful in 9m40s
2026-03-07 12:30:17 +00:00
4c104f74e8 feat: route observability through one tailscale endpoint
All checks were successful
Deploy Cluster / Terraform (push) Successful in 51s
Deploy Cluster / Ansible (push) Successful in 9m33s
2026-03-07 01:04:03 +00:00