Commit Graph

218 Commits

Author SHA1 Message Date
cc14e32572 fix: Use gzip instead of lzop for backup compression
Some checks failed
Deploy Cluster / Terraform (push) Successful in 49s
Deploy Cluster / Ansible (push) Has been cancelled
2026-03-26 22:51:10 +00:00
a207a5a7fd fix: Remove invalid encryption field from CNP backup config
Some checks failed
Deploy Cluster / Terraform (push) Successful in 40s
Deploy Cluster / Ansible (push) Has been cancelled
2026-03-26 22:49:29 +00:00
4e1772c175 feat: Add B2 backup configuration to CNP Cluster
Some checks failed
Deploy Cluster / Terraform (push) Successful in 1m38s
Deploy Cluster / Ansible (push) Has been cancelled
2026-03-26 22:47:31 +00:00
ff70b12084 chore: Add HTTP/HTTPS firewall rules for Load Balancer
All checks were successful
Deploy Cluster / Terraform (push) Successful in 52s
Deploy Cluster / Ansible (push) Successful in 6m56s
2026-03-26 22:36:13 +00:00
a3963c56e6 cleanup: Remove traefik-config, simplify traefik helmrelease
All checks were successful
Deploy Cluster / Terraform (push) Successful in 50s
Deploy Cluster / Ansible (push) Successful in 6m20s
2026-03-26 03:16:56 +00:00
612435c42c fix: Add Hetzner LB health check config to Traefik
Some checks failed
Deploy Cluster / Terraform (push) Successful in 47s
Deploy Cluster / Ansible (push) Has been cancelled
2026-03-26 03:11:10 +00:00
ac42f671a2 fix: Remove addon-traefik-config dependency from flux-ui
Some checks failed
Deploy Cluster / Terraform (push) Successful in 50s
Deploy Cluster / Ansible (push) Has been cancelled
2026-03-26 03:05:58 +00:00
dbe7ec0468 fix: Remove expose boolean from traefik ports config
Some checks failed
Deploy Cluster / Terraform (push) Successful in 49s
Deploy Cluster / Ansible (push) Has been cancelled
2026-03-26 03:01:13 +00:00
816ac8b3c0 fix: Use official Traefik helm repo instead of rancher-stable
Some checks failed
Deploy Cluster / Terraform (push) Successful in 48s
Deploy Cluster / Ansible (push) Has been cancelled
2026-03-26 02:59:00 +00:00
6f7998639f fix: Use standard kustomize API in traefik addon
Some checks failed
Deploy Cluster / Terraform (push) Successful in 48s
Deploy Cluster / Ansible (push) Has been cancelled
2026-03-26 02:56:52 +00:00
7a14f89ad1 fix: Correct traefik kustomization path and sourceRef
Some checks failed
Deploy Cluster / Terraform (push) Successful in 48s
Deploy Cluster / Ansible (push) Has been cancelled
2026-03-26 02:55:37 +00:00
786901c5d7 fix: Correct traefik kustomization reference (directory not file)
Some checks failed
Deploy Cluster / Terraform (push) Successful in 47s
Deploy Cluster / Ansible (push) Has been cancelled
2026-03-26 02:54:29 +00:00
46f3d1130b feat: Add Flux-managed Traefik HelmRelease with Hetzner LB config
Some checks failed
Deploy Cluster / Terraform (push) Successful in 48s
Deploy Cluster / Ansible (push) Has been cancelled
2026-03-26 02:52:49 +00:00
2fe5a626d4 fix: Add Hetzner network zone annotation to Traefik LoadBalancer
All checks were successful
Deploy Cluster / Terraform (push) Successful in 52s
Deploy Cluster / Ansible (push) Successful in 6m20s
2026-03-26 02:30:43 +00:00
2ef68c8087 fix: Remove deprecated enablePodMonitor field in CNP Cluster
All checks were successful
Deploy Cluster / Terraform (push) Successful in 2m13s
Deploy Cluster / Ansible (push) Successful in 10m15s
2026-03-26 01:01:53 +00:00
e2cae18f5f fix: Remove backup config for initial deployment - add backup after DB is running
All checks were successful
Deploy Cluster / Terraform (push) Successful in 36s
Deploy Cluster / Ansible (push) Successful in 4m56s
2026-03-26 00:46:50 +00:00
e0c1e41ee9 fix: Remove bootstrap recovery - create fresh DB (recovery only needed after first backup)
Some checks failed
Deploy Cluster / Terraform (push) Successful in 35s
Deploy Cluster / Ansible (push) Has been cancelled
2026-03-26 00:43:49 +00:00
63533de901 fix: Fix retentionPolicy format (14d not keep14)
Some checks failed
Deploy Cluster / Terraform (push) Successful in 47s
Deploy Cluster / Ansible (push) Has been cancelled
2026-03-26 00:41:44 +00:00
1b39710f63 fix: Move retentionPolicy to correct location in backup spec
Some checks failed
Deploy Cluster / Terraform (push) Successful in 37s
Deploy Cluster / Ansible (push) Has been cancelled
2026-03-26 00:39:25 +00:00
8c034323dc fix: Fix Cluster CR with correct barmanObjectStore schema
Some checks failed
Deploy Cluster / Terraform (push) Successful in 35s
Deploy Cluster / Ansible (push) Has been cancelled
2026-03-26 00:35:23 +00:00
5fa2b411ee fix: Fix Cluster CR schema - use barmanObjectStore instead of b2
Some checks failed
Deploy Cluster / Terraform (push) Successful in 35s
Deploy Cluster / Ansible (push) Has been cancelled
2026-03-26 00:33:04 +00:00
3ea28e525f fix: Fix CNP operator image repository (cloudnative-pg not postgresql)
All checks were successful
Deploy Cluster / Terraform (push) Successful in 43s
Deploy Cluster / Ansible (push) Successful in 4m55s
2026-03-26 00:21:09 +00:00
4b95ba113d fix: Remove LPP helm (already installed by k3s), fix CNP chart version to 0.27.1
All checks were successful
Deploy Cluster / Terraform (push) Successful in 36s
Deploy Cluster / Ansible (push) Successful in 5m7s
2026-03-26 00:13:22 +00:00
13627bf81f fix: Split CNP operator from CNP cluster to fix CRD dependency
All checks were successful
Deploy Cluster / Terraform (push) Successful in 35s
Deploy Cluster / Ansible (push) Successful in 5m0s
- Move CNP operator HelmRelease to cnpg-operator folder
- Create addon-cnpg-operator kustomization (deploys operator first)
- Update addon-cnpg to dependOn addon-cnpg-operator
- Add addon-cnpg as dependency for addon-rancher (needs database)
2026-03-26 00:06:34 +00:00
ef3fb2489a fix: Convert kustomization-lpp and kustomization-cnpg to Flux Kustomization CRs
Some checks failed
Deploy Cluster / Terraform (push) Successful in 37s
Deploy Cluster / Ansible (push) Has been cancelled
2026-03-26 00:03:53 +00:00
7097495d72 fix: Add missing metadata.name to kustomization-lpp and kustomization-cnpg
Some checks failed
Deploy Cluster / Terraform (push) Successful in 1m7s
Deploy Cluster / Ansible (push) Has been cancelled
2026-03-25 23:39:45 +00:00
9d601dc77c feat: Add CloudNativePG with B2 backups for persistent Rancher database
Some checks failed
Deploy Cluster / Terraform (push) Successful in 4m16s
Deploy Cluster / Ansible (push) Failing after 12m27s
- Add Local Path Provisioner for storage
- Add CloudNativePG operator (v1.27.0) via Flux
- Create PostgreSQL cluster with B2 (Backblaze) auto-backup/restore
- Update Rancher to use external PostgreSQL via CATTLE_DB_CATTLE_* env vars
- Add weekly pg_dump CronJob to B2 (Sundays 2AM)
- Add pre-destroy backup hook to destroy workflow
- Add B2 credentials to Doppler (B2_ACCOUNT_ID, B2_APPLICATION_KEY)
- Generate RANCHER_DB_PASSWORD in Doppler

Backup location: HetznerTerra/rancher-backups/
Retention: 14 backups
2026-03-25 23:06:45 +00:00
f36445d99a Fix CNI: configure flannel to use private network interface (enp7s0) instead of public
All checks were successful
Deploy Cluster / Terraform (push) Successful in 34s
Deploy Cluster / Ansible (push) Successful in 8m42s
2026-03-25 01:44:33 +00:00
89c2c99963 Fix Rancher: remove conflicting LoadBalancer, add HTTPS port-forward, use tailscale serve only
All checks were successful
Deploy Cluster / Terraform (push) Successful in 2m21s
Deploy Cluster / Ansible (push) Successful in 9m2s
2026-03-25 00:59:16 +00:00
4a35cfb549 Fix Rancher: use correct targetPort 444 for HTTPS
Some checks failed
Deploy Cluster / Terraform (push) Successful in 43s
Deploy Cluster / Ansible (push) Failing after 18m56s
2026-03-24 23:30:58 +00:00
3d50bfc534 Fix Rancher service selector: use cattle-system-rancher label
Some checks failed
Deploy Cluster / Terraform (push) Successful in 44s
Deploy Cluster / Ansible (push) Has been cancelled
2026-03-24 23:25:36 +00:00
ab2f287bfb Fix Rancher: use correct service name cattle-system-rancher
All checks were successful
Deploy Cluster / Terraform (push) Successful in 39s
Deploy Cluster / Ansible (push) Successful in 4m23s
2026-03-24 22:30:49 +00:00
dcb2675b67 Upgrade Rancher to 2.13.3 for K8s 1.34 compatibility
All checks were successful
Deploy Cluster / Terraform (push) Successful in 37s
Deploy Cluster / Ansible (push) Successful in 4m13s
2026-03-24 21:42:51 +00:00
b40bec7e0e Fix Rancher: use Doppler secret instead of hardcoded password
All checks were successful
Deploy Cluster / Terraform (push) Successful in 37s
Deploy Cluster / Ansible (push) Successful in 4m12s
2026-03-24 21:13:23 +00:00
efe0c0cfd5 Fix Rancher: upgrade to 2.10.3 for K8s 1.34 compatibility
All checks were successful
Deploy Cluster / Terraform (push) Successful in 41s
Deploy Cluster / Ansible (push) Successful in 4m20s
2026-03-24 20:29:38 +00:00
c61d9f9c1d Remove traefik-config dependency from Rancher
All checks were successful
Deploy Cluster / Terraform (push) Successful in 2m5s
Deploy Cluster / Ansible (push) Successful in 8m18s
2026-03-24 20:02:08 +00:00
60ceac4624 Fix Rancher access: add kubectl port-forward + tailscale serve setup
Some checks failed
Deploy Cluster / Ansible (push) Has been cancelled
Deploy Cluster / Terraform (push) Has been cancelled
2026-03-24 20:01:57 +00:00
47b384a337 Fix Rancher access: add Tailscale service for Traefik with port 9442, fix deployment order
All checks were successful
Deploy Cluster / Terraform (push) Successful in 36s
Deploy Cluster / Ansible (push) Successful in 4m18s
2026-03-24 19:40:37 +00:00
ecf17113fb Fix Rancher deployment: add cattle-system namespace, fix Traefik config with port 9442
All checks were successful
Deploy Cluster / Terraform (push) Successful in 35s
Deploy Cluster / Ansible (push) Successful in 4m27s
2026-03-24 19:09:28 +00:00
4ffbcfa312 Add Rancher management UI
All checks were successful
Deploy Cluster / Terraform (push) Successful in 2m13s
Deploy Cluster / Ansible (push) Successful in 8m52s
2026-03-24 01:53:04 +00:00
8745bcda47 Fix Weave GitOps image tag - remove invalid v0.41.0
All checks were successful
Deploy Cluster / Terraform (push) Successful in 40s
Deploy Cluster / Ansible (push) Successful in 4m33s
The version v0.41.0 doesn't exist in the registry. Removing explicit
image tag to let the chart use its default compatible version.
2026-03-24 01:39:48 +00:00
e47ec2a3e7 Update Weave GitOps to v0.41.0 to support HelmRelease v2 API
All checks were successful
Deploy Cluster / Terraform (push) Successful in 37s
Deploy Cluster / Ansible (push) Successful in 4m30s
Fixes error: 'no matches for kind HelmRelease in version v2beta1'

The cluster uses HelmRelease v2 API but Weave GitOps v0.38.0 was looking
for the old v2beta1 API. Updated image tag to v0.41.0 which supports
the newer API version.
2026-03-24 01:33:10 +00:00
45c899d2bd Configure Weave GitOps to use Doppler-managed admin credentials
All checks were successful
Deploy Cluster / Terraform (push) Successful in 39s
Deploy Cluster / Ansible (push) Successful in 4m41s
Changes:
- Enable adminUser creation but disable Helm-managed secret
- Use ExternalSecret (cluster-user-auth) from Doppler instead
- Doppler secrets: WEAVE_GITOPS_ADMIN_USERNAME and WEAVE_GITOPS_ADMIN_PASSWORD_BCRYPT_HASH
- Added cluster-user-auth to viewSecretsResourceNames for RBAC

Login credentials are now managed via Doppler and External Secrets Operator.
2026-03-24 01:01:30 +00:00
0e52d8f159 Use Tailscale DNS names instead of IPs for TLS SANs
All checks were successful
Deploy Cluster / Terraform (push) Successful in 2m21s
Deploy Cluster / Ansible (push) Successful in 9m0s
Changed from hardcoded Tailscale IPs to DNS names:
- k8s-cluster-cp-1.silverside-gopher.ts.net
- k8s-cluster-cp-2.silverside-gopher.ts.net
- k8s-cluster-cp-3.silverside-gopher.ts.net

This is more robust since Tailscale IPs change on rebuild,
but DNS names remain consistent.

After next rebuild, cluster accessible via:
- kubectl --server=https://k8s-cluster-cp-1.silverside-gopher.ts.net:6443
2026-03-23 23:50:48 +00:00
4726db2b5b Add Tailscale IPs to k3s TLS SANs for secure tailnet access
All checks were successful
Deploy Cluster / Terraform (push) Successful in 2m30s
Deploy Cluster / Ansible (push) Successful in 9m48s
Changes:
- Add tailscale_control_plane_ips list to k3s-server defaults
- Include all 3 control plane Tailscale IPs (100.120.55.97, 100.108.90.123, 100.92.149.85)
- Update primary k3s install to add Tailscale IPs to TLS certificates
- Enables kubectl access via Tailscale without certificate errors

After next deploy, cluster will be accessible via:
- kubectl --server=https://100.120.55.97:6443 (or any CP tailscale IP)
- kubectl --server=https://k8s-cluster-cp-1:6443 (via tailscale DNS)
2026-03-23 23:04:00 +00:00
90d105e5ea Fix kube_api_endpoint variable passing for HA cluster
All checks were successful
Deploy Cluster / Terraform (push) Successful in 2m18s
Deploy Cluster / Ansible (push) Successful in 8m55s
- Remove circular variable reference in site.yml
- Add kube_api_endpoint default to k3s-server role
- Variable is set via inventory group_vars and passed to role
- Primary CP now correctly adds LB IP to TLS SANs

Note: Existing cluster needs destroy/rebuild to regenerate certificates.
2026-03-23 03:01:53 +00:00
952a80a742 Fix HA cluster join via Load Balancer private IP
Some checks failed
Deploy Cluster / Terraform (push) Successful in 36s
Deploy Cluster / Ansible (push) Failing after 3m5s
Changes:
- Use LB private IP (10.0.1.5) instead of public IP for cluster joins
- Add LB private IP to k3s TLS SANs on primary control plane
- This allows secondary CPs and workers to verify certificates when joining via LB

Fixes x509 certificate validation error when joining via LB public IP.
2026-03-23 02:56:41 +00:00
4965017b86 Fix Load Balancer network attachment
Some checks failed
Deploy Cluster / Terraform (push) Successful in 54s
Deploy Cluster / Ansible (push) Failing after 3m44s
Add hcloud_load_balancer_network resource to attach LB to private network.
This is required before targets can use use_private_ip=true.
LB gets IP 10.0.1.5 on the private network.
2026-03-23 02:44:35 +00:00
b2b9c38b91 Fix Load Balancer output attribute - use ipv4 instead of ipv4_address
Some checks failed
Deploy Cluster / Terraform (push) Failing after 1m37s
Deploy Cluster / Ansible (push) Has been skipped
2026-03-23 02:40:50 +00:00
ff31cb4e74 Implement HA control plane with Load Balancer (3-3 topology)
Some checks failed
Deploy Cluster / Terraform (push) Failing after 10s
Deploy Cluster / Ansible (push) Has been skipped
Major changes:
- Terraform: Scale to 3 control planes (cx23) + 3 workers (cx33)
- Terraform: Add Hetzner Load Balancer (lb11) for Kubernetes API
- Terraform: Add kube_api_lb_ip output
- Ansible: Add community.network collection to requirements
- Ansible: Update inventory to include LB endpoint
- Ansible: Configure secondary CPs and workers to join via LB
- Ansible: Add k3s_join_endpoint variable for HA joins
- Workflow: Add imports for cp-2, cp-3, and worker-3
- Docs: Update STABLE_BASELINE.md with HA topology and phase gates

Topology:
- 3 control planes (cx23 - 2 vCPU, 8GB RAM each)
- 3 workers (cx33 - 4 vCPU, 16GB RAM each)
- 1 Load Balancer (lb11) routing to all 3 control planes on port 6443
- Workers and secondary CPs join via LB endpoint for HA

Cost impact: +~€26/month (2 extra CPs + 1 extra worker + LB)
2026-03-23 02:39:39 +00:00