105 Commits

Author SHA1 Message Date
micqdf a7d540ca65 fix: stop forcing Flux releases during deploy bootstrap
Deploy Cluster / Terraform (push) Successful in 32s
Deploy Cluster / Ansible (push) Successful in 21m12s
Remove the HelmRelease reset/force annotations from the deploy workflow now
that the cluster can converge on its own. The runtime waits remain, but CI no
longer re-triggers Rancher and NFS churn on every bootstrap attempt.
2026-04-23 00:35:31 +00:00
micqdf 098bd98876 fix: wait on Rancher and storage runtime objects during bootstrap
Deploy Cluster / Terraform (push) Successful in 26s
Deploy Cluster / Ansible (push) Failing after 25m19s
Flux can leave HelmRelease and Kustomization conditions stale after transient
chart fetch or image pull failures even when the underlying workloads recover.
Switch the deploy workflow to wait on the concrete runtime resources we care
about: the NFS provisioner deployment and StorageClass, Rancher deployment,
webhook, cert-manager issuer/certificate, and the rancher-backup deployment.
2026-04-22 18:41:09 +00:00
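The "wait on concrete runtime objects" approach in 098bd98876 amounts to polling a readiness check until a deadline. A minimal sketch of such a loop follows; the `wait_for` helper and its parameters are hypothetical and are not taken from the deploy workflow, and the check passed in would be whatever probes the actual resource (for example a Deployment's ready replica count).

```python
import time
from typing import Callable


def wait_for(
    check: Callable[[], bool],
    timeout_s: float,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll check() until it returns True or timeout_s elapses.

    Returns True if the check passed, False on timeout.
    sleep and clock are injectable so the loop can be tested without waiting.
    """
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

Checking the resource itself, rather than a Flux condition, means a stale `Ready=False` on a HelmRelease does not fail the wait once the workload has actually recovered.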
micqdf 9c0523e880 fix: pre-pull Rancher images and reset Rancher release during bootstrap
Deploy Cluster / Terraform (push) Successful in 28s
Deploy Cluster / Ansible (push) Failing after 27m30s
Rancher installs were stalling on transient Docker Hub TLS handshake timeouts
for rancher shell, webhook, and system-upgrade-controller images. Pre-pull the
required images onto all nodes after k3s comes up, extend the Rancher HelmRelease
timeout, and reset/force the Rancher HelmRelease before waiting on addon-rancher
so bootstrap can recover from stale failed remediation state.
2026-04-22 11:00:54 +00:00
micqdf 8372d562ad fix: reset and force nfs helmrelease during bootstrap
Deploy Cluster / Terraform (push) Successful in 29s
Deploy Cluster / Ansible (push) Failing after 20m22s
When the NFS storage HelmRelease has already entered a failed remediation state,
a plain reconcile request is not enough to clear the stale failure counters.
Send requestedAt, resetAt, and forceAt together so helm-controller retries the
release cleanly before the workflow waits on addon-nfs-storage.
2026-04-22 10:35:32 +00:00
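As an illustration of what 8372d562ad describes, the sketch below builds a merge patch that sets the three reconcile annotations to one shared token. The `reconcile.fluxcd.io/` annotation prefix is the one used by Flux; the function names and the choice of a UTC timestamp as the token are assumptions of this sketch, not code from the repository.

```python
import json
from datetime import datetime, timezone

ANNOTATION_PREFIX = "reconcile.fluxcd.io"


def new_token() -> str:
    """Assumed token format: any value that changes per request would do."""
    return datetime.now(timezone.utc).isoformat()


def build_reset_force_patch(token: str) -> dict:
    """Merge patch setting requestedAt, resetAt, and forceAt to the same token."""
    return {
        "metadata": {
            "annotations": {
                f"{ANNOTATION_PREFIX}/requestedAt": token,
                f"{ANNOTATION_PREFIX}/resetAt": token,
                f"{ANNOTATION_PREFIX}/forceAt": token,
            }
        }
    }


def patch_json(token: str) -> str:
    """Serialize the patch for use as a patch payload."""
    return json.dumps(build_reset_force_patch(token))
```

The resulting JSON could be passed to `kubectl patch helmrelease <name> --type merge -p '<json>'`; the release name and namespace are left open here because the commit message does not state them.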
micqdf 1bb11dfe3a fix: force nfs storage reconcile during flux bootstrap
Deploy Cluster / Terraform (push) Successful in 27s
Deploy Cluster / Ansible (push) Failing after 19m0s
The NFS HelmRelease can remain in a failed state from an earlier bootstrap
attempt even after the backing NFS export is corrected and the pod becomes
healthy. Request a fresh reconcile of the HelmRelease and addon kustomization
before waiting on addon-nfs-storage so the bootstrap step can observe the
recovered state.
2026-04-22 10:08:20 +00:00
micqdf 71bdc6a709 fix: extend Flux bootstrap timeouts on fresh clusters
Deploy Cluster / Terraform (push) Successful in 26s
Deploy Cluster / Ansible (push) Failing after 18m44s
Fresh Proxmox clusters need longer for the Flux controller rollouts and first
GitRepository/Kustomization reconciliations, especially while images are still
being pulled onto the control plane. Increase the bootstrap wait windows so CI
does not fail while the controllers are still converging.
2026-04-22 08:36:27 +00:00
micqdf 714f20417b fix: tolerate control-plane taint when pinning Flux to cp1
Deploy Cluster / Terraform (push) Successful in 28s
Deploy Cluster / Ansible (push) Failing after 10m19s
Flux bootstrap patches the controllers onto k8s-cluster-cp-1, but the
control-plane node is tainted NoSchedule. Add the matching toleration in both
the checked-in patch manifest and the bootstrap workflow so the controllers can
actually schedule and roll out on cp-1.
2026-04-22 05:05:15 +00:00
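The pinning fix in 714f20417b pairs a node selector with a matching toleration. A rough sketch of such a Deployment patch is shown below. The taint key `node-role.kubernetes.io/control-plane` and the use of the `kubernetes.io/hostname` label are assumptions: the commit message only says the node is tainted NoSchedule, and the checked-in patch manifest may differ.

```python
import json

NODE_NAME = "k8s-cluster-cp-1"  # node name from the commit message
TAINT_KEY = "node-role.kubernetes.io/control-plane"  # assumed taint key


def build_pin_patch(node_name: str, taint_key: str) -> dict:
    """Patch that pins a Deployment's pods to one node and tolerates its NoSchedule taint."""
    return {
        "spec": {
            "template": {
                "spec": {
                    # Assumes the hostname label value equals the node name.
                    "nodeSelector": {"kubernetes.io/hostname": node_name},
                    "tolerations": [
                        {
                            "key": taint_key,
                            "operator": "Exists",
                            "effect": "NoSchedule",
                        }
                    ],
                }
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_pin_patch(NODE_NAME, TAINT_KEY), indent=2))
```

Without the toleration, the node selector alone leaves the controller pods Pending, which is the failure the commit addresses.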
micqdf b1dae28aa5 feat: migrate cluster baseline from Hetzner to Proxmox
Deploy Cluster / Terraform (push) Failing after 52s
Deploy Cluster / Ansible (push) Has been skipped
Deploy Grafana Content / Grafana Content (push) Failing after 1m37s
Replace Hetzner infrastructure and cloud-provider assumptions with Proxmox
VM clones, kube-vip API HA, and NFS-backed storage. Update bootstrap,
Flux addons, CI workflows, and docs to target the new private Proxmox
baseline while preserving the existing Tailscale, Doppler, Flux, Rancher,
and B2 backup flows.
2026-04-22 03:02:13 +00:00
micqdf 7385c2263e fix: add tailnet smoke checks and move Tailscale operator to stable
Deploy Cluster / Terraform (push) Successful in 49s
Deploy Cluster / Ansible (push) Successful in 5m55s
Add a post-deploy smoke test that validates Tailscale DNS, proxy readiness,
reachability, and service responses for Rancher, Grafana, and Prometheus.
Move the operator to the stable Helm repo/version and align the baseline docs
with the current HA private-only architecture.
2026-04-18 19:59:13 +00:00
micqdf 2ba6b6a896 fix: remove unused Flux CLI install from deploy workflow
Deploy Cluster / Terraform (push) Successful in 49s
Deploy Cluster / Ansible (push) Successful in 5m40s
The deploy pipeline never uses the flux binary after installation, so the
GitHub release download only adds a flaky failure point. Remove the step and
keep the bootstrap path kubectl-only.
2026-04-18 17:45:59 +00:00
micqdf ceefcc3b29 cleanup: Remove obsolete port-forwarding, deferred Traefik files, and CI workaround
Deploy Cluster / Terraform (push) Successful in 2m21s
Deploy Cluster / Ansible (push) Successful in 13m9s
- Remove ansible/roles/private-access/ (replaced by Tailscale LB services)
- Remove deferred observability ingress/traefik files (replaced by direct Tailscale LBs)
- Remove orphaned kustomization-traefik-config.yaml (no backing directory)
- Simplify CI: remove SA patch + job deletion workaround for rancher-backup
  (now handled by postRenderer in HelmRelease)
- Update AGENTS.md to reflect current architecture
2026-04-02 01:21:23 +00:00
micqdf 89e53d9ec9 fix: Handle restricted B2 keys and safe JSON parsing in restore step
Deploy Cluster / Terraform (push) Successful in 52s
Deploy Cluster / Ansible (push) Successful in 20m48s
2026-03-31 01:43:04 +00:00
micqdf 5a2551f40a fix: Fix flux CLI download URL - use correct GitHub URL with v prefix on version
Deploy Cluster / Terraform (push) Successful in 51s
Deploy Cluster / Ansible (push) Failing after 21m52s
2026-03-30 03:11:40 +00:00
micqdf 8c7b62c024 feat: Automate Rancher backup restore in CI pipeline
Deploy Cluster / Terraform (push) Successful in 2m18s
Deploy Cluster / Ansible (push) Failing after 6m28s
- Wait for Rancher and rancher-backup operator to be ready
- Patch default SA in cattle-resources-system (fixes post-install hook failure)
- Clean up failed patch-sa jobs
- Force reconcile rancher-backup HelmRelease
- Find latest backup from B2 using Backblaze API
- Create Restore CR to restore Rancher state from latest backup
- Wait for restore to complete before continuing
2026-03-30 01:56:29 +00:00
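The "find latest backup, then create a Restore CR" steps in 8c7b62c024 could look roughly like the sketch below. It assumes a B2 file listing has already been fetched and parsed into dicts with `fileName` and `uploadTimestamp` keys, and that backups end in `.tar.gz`; the helper names and the Restore object name are hypothetical. Only `spec.backupFilename` is filled in, and any storage location settings the real workflow needs are omitted.

```python
from typing import Optional


def latest_backup(files: list, suffix: str = ".tar.gz") -> Optional[str]:
    """Return the fileName with the highest uploadTimestamp, or None if none match."""
    candidates = [
        f for f in files
        if f.get("fileName", "").endswith(suffix) and "uploadTimestamp" in f
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda f: f["uploadTimestamp"])["fileName"]


def build_restore(name: str, backup_filename: str) -> dict:
    """Minimal Restore manifest for the rancher-backup operator."""
    return {
        "apiVersion": "resources.cattle.io/v1",
        "kind": "Restore",
        "metadata": {"name": name},
        "spec": {"backupFilename": backup_filename},
    }
```

Returning `None` when no backup matches lets the caller skip the restore on a first-ever deploy instead of failing.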
micqdf 5269884408 feat: Auto-cleanup stale Tailscale devices before cluster boot
Deploy Cluster / Terraform (push) Successful in 2m17s
Deploy Cluster / Ansible (push) Failing after 6m35s
Adds tailscale-cleanup Ansible role that uses the Tailscale API to
delete offline devices matching reserved hostnames (e.g. rancher).
Runs during site.yml before Finalize to prevent hostname collisions
like rancher-1 on rebuild.

Requires TAILSCALE_API_KEY (API access token) passed as extra var.
2026-03-29 11:47:53 +00:00
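The selection rule behind the tailscale-cleanup role in 5269884408 can be sketched as a pure function over a device list. The device shape (`id`, `hostname`, `online`) and the rule that a numeric suffix such as `-1` still counts as a match are assumptions made for illustration; the role's actual matching logic and the fields returned by the Tailscale API are not shown in the commit message.

```python
import re


def select_stale_devices(devices: list, reserved_hostnames: set) -> list:
    """Return ids of offline devices whose hostname matches a reserved name.

    A trailing "-<number>" is stripped before matching, so "rancher-1"
    is treated as a collision with the reserved name "rancher".
    """
    stale = []
    for device in devices:
        if device.get("online", False):
            continue  # never delete a device that is still connected
        base = re.sub(r"-\d+$", "", device.get("hostname", ""))
        if base in reserved_hostnames:
            stale.append(device["id"])
    return stale
```

Keeping the selection separate from the delete call makes it possible to log or dry-run the list before any device is removed.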
micqdf ff31cb4e74 Implement HA control plane with Load Balancer (3-3 topology)
Deploy Cluster / Terraform (push) Failing after 10s
Deploy Cluster / Ansible (push) Has been skipped
Major changes:
- Terraform: Scale to 3 control planes (cx23) + 3 workers (cx33)
- Terraform: Add Hetzner Load Balancer (lb11) for Kubernetes API
- Terraform: Add kube_api_lb_ip output
- Ansible: Add community.network collection to requirements
- Ansible: Update inventory to include LB endpoint
- Ansible: Configure secondary CPs and workers to join via LB
- Ansible: Add k3s_join_endpoint variable for HA joins
- Workflow: Add imports for cp-2, cp-3, and worker-3
- Docs: Update STABLE_BASELINE.md with HA topology and phase gates

Topology:
- 3 control planes (cx23 - 2 vCPU, 8GB RAM each)
- 3 workers (cx33 - 4 vCPU, 16GB RAM each)
- 1 Load Balancer (lb11) routing to all 3 control planes on port 6443
- Workers and secondary CPs join via LB endpoint for HA

Cost impact: +~€26/month (2 extra CPs + 1 extra worker + LB)
2026-03-23 02:39:39 +00:00
micqdf cadfedacf1 Fix providerID health check - use shell module for piped grep
Deploy Cluster / Terraform (push) Successful in 1m47s
Deploy Cluster / Ansible (push) Failing after 18m4s
2026-03-22 22:55:55 +00:00
micqdf 561cd67b0c Enable Hetzner CCM and CSI for cloud provider integration
Deploy Cluster / Terraform (push) Successful in 30s
Deploy Cluster / Ansible (push) Failing after 3m21s
- Enable --kubelet-arg=cloud-provider=external on all nodes (control planes and workers)
- Activate CCM Kustomization with 10m timeout for Hetzner cloud-controller-manager
- Activate CSI Kustomization with dependsOn CCM and 10m timeout for hcloud-csi
- Update deploy workflow to wait for CCM/CSI readiness (600s timeout)
- Add providerID verification to post-deploy health checks

This enables proper cloud provider integration with Hetzner CCM for node
labeling and Hetzner CSI for persistent volume provisioning.
2026-03-22 22:26:21 +00:00
micqdf 7b5d794dfc fix: update health checks for deferred observability
Deploy Cluster / Ansible (push) Has been cancelled
Deploy Cluster / Terraform (push) Has been cancelled
2026-03-22 01:04:27 +00:00
micqdf 8643bbfc12 fix: defer observability to get clean baseline
Deploy Cluster / Ansible (push) Has been cancelled
Deploy Cluster / Terraform (push) Has been cancelled
2026-03-22 01:03:55 +00:00
micqdf 84f446c2e6 fix: restore observability timeouts to 5 minutes
Deploy Cluster / Terraform (push) Successful in 32s
Deploy Cluster / Ansible (push) Failing after 8m38s
2026-03-22 00:43:37 +00:00
micqdf 989848fa89 fix: increase observability timeouts to 10 minutes
Deploy Cluster / Terraform (push) Successful in 2m1s
Deploy Cluster / Ansible (push) Failing after 13m54s
2026-03-21 19:34:43 +00:00
micqdf 56e5807474 fix: create doppler ClusterSecretStore after ESO is installed
Deploy Cluster / Terraform (push) Successful in 47s
Deploy Cluster / Ansible (push) Failing after 8m31s
2026-03-21 19:19:43 +00:00
micqdf a01cf435d4 fix: skip ccm/csi waits for stable baseline - using k3s embedded
Deploy Cluster / Terraform (push) Successful in 37s
Deploy Cluster / Ansible (push) Has been cancelled
2026-03-21 18:40:53 +00:00
micqdf 84f77c4a68 fix: use kubectl patch instead of apply for flux controller nodeSelector
Deploy Cluster / Terraform (push) Successful in 38s
Deploy Cluster / Ansible (push) Failing after 9m41s
2026-03-21 18:05:41 +00:00
micqdf 2e4196688c fix: bootstrap flux in phases - crds first, then resources
Deploy Cluster / Terraform (push) Successful in 38s
Deploy Cluster / Ansible (push) Failing after 3m19s
2026-03-21 17:42:39 +00:00
micqdf fcf7f139ff fix: use public api endpoint for flux bootstrap
Deploy Cluster / Terraform (push) Successful in 41s
Deploy Cluster / Ansible (push) Failing after 2m16s
2026-03-21 00:07:51 +00:00
micqdf 7139ae322d fix: bootstrap flux during cluster deploy
Deploy Cluster / Terraform (push) Successful in 38s
Deploy Cluster / Ansible (push) Failing after 3m21s
2026-03-20 10:37:11 +00:00
micqdf 522626a52b refactor: simplify stable cluster baseline
Deploy Cluster / Terraform (push) Successful in 1m48s
Deploy Cluster / Ansible (push) Failing after 4m7s
2026-03-20 02:24:37 +00:00
micqdf 6f2e056b98 feat: sync runtime secrets from doppler
Deploy Cluster / Terraform (push) Successful in 45s
Deploy Cluster / Ansible (push) Successful in 9m56s
2026-03-09 00:25:41 +00:00
micqdf 1c39274df7 feat: stabilize tailscale observability exposure with declarative proxy class
Deploy Cluster / Terraform (push) Successful in 54s
Deploy Cluster / Ansible (push) Successful in 22m19s
2026-03-04 01:37:00 +00:00
micqdf 63247b79a6 fix: harden Tailscale operator rollout with preflight and diagnostics
Deploy Cluster / Terraform (push) Successful in 47s
Deploy Cluster / Ansible (push) Has been cancelled
2026-03-02 21:39:47 +00:00
micqdf b30977a158 feat: deploy lightweight observability stack via Ansible
Deploy Cluster / Terraform (push) Successful in 45s
Deploy Cluster / Ansible (push) Has been cancelled
2026-03-02 01:33:41 +00:00
micqdf d92bde78f4 chore: enforce CSI smoke test and add post-deploy health checks
Deploy Cluster / Terraform (push) Successful in 42s
Deploy Cluster / Ansible (push) Failing after 8m20s
2026-03-01 23:45:27 +00:00
micqdf 2bc9749b81 feat: switch kubeconfig to tailnet endpoint and deploy Hetzner CSI
Deploy Cluster / Terraform (push) Successful in 51s
Deploy Cluster / Ansible (push) Successful in 3m12s
2026-03-01 17:12:12 +00:00
micqdf 54717cccad fix: allow current CI runner IP through firewall before Ansible
Deploy Cluster / Terraform (push) Successful in 35s
Deploy Cluster / Ansible (push) Successful in 5m13s
2026-03-01 14:50:55 +00:00
micqdf fffd3876fb fix: remove empty TF_VAR CIDR envs causing plan parse errors
Deploy Cluster / Terraform (push) Successful in 39s
Deploy Cluster / Ansible (push) Failing after 1m28s
2026-03-01 14:47:32 +00:00
micqdf 86c38e385f fix: remove CI tailscale dependency and allow runner CIDR exception
Deploy Cluster / Terraform (push) Failing after 31s
Deploy Cluster / Ansible (push) Has been skipped
2026-03-01 14:08:08 +00:00
micqdf d29a428f2d fix: robust tailscaled startup in CI runner
Deploy Cluster / Terraform (push) Successful in 34s
Deploy Cluster / Ansible (push) Failing after 2m44s
2026-03-01 13:57:12 +00:00
micqdf a8ef173713 fix: start tailscaled daemon before tailscale up in CI
Deploy Cluster / Terraform (push) Successful in 35s
Deploy Cluster / Ansible (push) Failing after 2m15s
2026-03-01 13:52:20 +00:00
micqdf 41d0abda16 fix: auto-import existing Hetzner servers into Terraform state in CI
Deploy Cluster / Terraform (push) Failing after 21s
Deploy Cluster / Ansible (push) Has been skipped
2026-03-01 13:27:02 +00:00
micqdf 011c220f59 fix: avoid server replacement; install tailscale via Ansible
Deploy Cluster / Terraform (push) Failing after 22s
Deploy Cluster / Ansible (push) Has been skipped
2026-03-01 04:51:19 +00:00
micqdf 1eebfe77df feat: integrate tailscale access and lock SSH/API to tailnet
Deploy Cluster / Terraform (push) Failing after 20s
Deploy Cluster / Ansible (push) Has been skipped
2026-03-01 04:04:56 +00:00
micqdf 7230b2b6c8 fix: Use --break-system-packages for pip on Debian 12
Deploy Cluster / Terraform (push) Successful in 20s
Deploy Cluster / Ansible (push) Failing after 1m12s
2026-02-28 22:50:31 +00:00
micqdf f40a090c7c fix: Install pip via apt before installing Python packages
Deploy Cluster / Terraform (push) Successful in 19s
Deploy Cluster / Ansible (push) Failing after 22s
2026-02-28 22:47:24 +00:00
micqdf 19ba491c54 fix: Use system Python instead of setup-python action
Deploy Cluster / Terraform (push) Successful in 21s
Deploy Cluster / Ansible (push) Failing after 12s
2026-02-28 22:45:50 +00:00
micqdf 34c2b6895e fix: Use Python 3.12 instead of 3.11
Deploy Cluster / Terraform (push) Successful in 18s
Deploy Cluster / Ansible (push) Failing after 14s
2026-02-28 22:44:46 +00:00
micqdf 2fcc8cff77 fix: Ansible fetches outputs directly from Terraform state instead of artifacts
Deploy Cluster / Terraform (push) Successful in 19s
Deploy Cluster / Ansible (push) Failing after 18s
2026-02-28 22:43:26 +00:00
micqdf 683f994905 fix: Create outputs directory before saving terraform outputs
Deploy Cluster / Terraform (push) Successful in 2m34s
Deploy Cluster / Ansible (push) Failing after 3m48s
2026-02-28 22:27:24 +00:00
micqdf ebe86cfacf fix: Typo in chmod path id_ed255 -> id_ed25519
Deploy Cluster / Terraform (push) Failing after 14s
Deploy Cluster / Ansible (push) Has been skipped
2026-02-28 21:27:37 +00:00