Commit Graph

321 Commits

Author SHA1 Message Date
micqdf 6c6b9d20ca update README
Deploy Cluster / Ansible (push) Has been cancelled
Deploy Cluster / Terraform (push) Has been cancelled
2026-04-22 01:14:21 +00:00
micqdf c3a2f25c94 docs: record validated Rancher restore drill
Deploy Cluster / Terraform (push) Successful in 2m11s
Deploy Cluster / Ansible (push) Successful in 10m9s
Update the baseline to treat Rancher backup and restore validation as part
of the accepted platform state, and capture the successful live drill run
performed on 2026-04-18.
2026-04-18 21:27:42 +00:00
micqdf 7385c2263e fix: add tailnet smoke checks and move Tailscale operator to stable
Deploy Cluster / Terraform (push) Successful in 49s
Deploy Cluster / Ansible (push) Successful in 5m55s
Add a post-deploy smoke test that validates Tailscale DNS, proxy readiness,
reachability, and service responses for Rancher, Grafana, and Prometheus.
Move the operator to the stable Helm repo/version and align the baseline docs
with the current HA private-only architecture.
2026-04-18 19:59:13 +00:00
micqdf 60f466ab98 remove Weave GitOps addon
Deploy Cluster / Terraform (push) Successful in 41s
Deploy Cluster / Ansible (push) Successful in 5m37s
Drop the Flux UI addon and its Tailscale exposure because the UI lags the
current Flux APIs and reports misleading HelmRelease errors. Keep Flux managed
through the controllers themselves and use Rancher or the flux CLI for access.
2026-04-18 18:44:55 +00:00
micqdf b20356e9fe fix: only clean stale Tailscale names before proxies exist
Deploy Cluster / Terraform (push) Failing after 51s
Deploy Cluster / Ansible (push) Has been skipped
The Tailscale cleanup role was deleting reserved service hostnames on later
deploy runs, which removed the live Rancher/Grafana/Prometheus/Flux proxy
nodes from the tailnet. Skip cleanup whenever the current cluster already has
those Tailscale services, while still allowing cleanup on fresh rebuilds.
2026-04-18 18:16:27 +00:00
micqdf 2ba6b6a896 fix: remove unused Flux CLI install from deploy workflow
Deploy Cluster / Terraform (push) Successful in 49s
Deploy Cluster / Ansible (push) Successful in 5m40s
The deploy pipeline never uses the flux binary after installation, so the
GitHub release download only adds a flaky failure point. Remove the step and
keep the bootstrap path kubectl-only.
2026-04-18 17:45:59 +00:00
micqdf 9126de1423 fix: Align Prometheus external URL with Tailscale service port
Deploy Cluster / Terraform (push) Successful in 48s
Deploy Cluster / Ansible (push) Failing after 4m52s
Prometheus is exposed on port 9090 through the Tailscale LoadBalancer
service, so the configured external URL and repo docs should match the
actual address users reach after rebuilds.
2026-04-18 17:11:16 +00:00
micqdf 4532b9ed74 chore: trigger rebuild
Deploy Cluster / Terraform (push) Successful in 2m8s
Deploy Cluster / Ansible (push) Successful in 12m54s
2026-04-18 06:09:54 +00:00
micqdf 68dbd2e5b7 fix: Reserve Tailscale service hostnames and tag exposed proxies
Deploy Cluster / Terraform (push) Successful in 53s
Deploy Cluster / Ansible (push) Successful in 6m3s
Reserve grafana/prometheus/flux alongside rancher during rebuild cleanup so
stale tailnet devices do not force -1 hostnames. Tag the exposed Tailscale
services so operator-managed proxies are provisioned with explicit prod/service
tags from the tailnet policy.
2026-04-18 05:48:26 +00:00
micqdf ceefcc3b29 cleanup: Remove obsolete port-forwarding, deferred Traefik files, and CI workaround
Deploy Cluster / Terraform (push) Successful in 2m21s
Deploy Cluster / Ansible (push) Successful in 13m9s
- Remove ansible/roles/private-access/ (replaced by Tailscale LB services)
- Remove deferred observability ingress/traefik files (replaced by direct Tailscale LBs)
- Remove orphaned kustomization-traefik-config.yaml (no backing directory)
- Simplify CI: remove SA patch + job deletion workaround for rancher-backup
  (now handled by postRenderer in HelmRelease)
- Update AGENTS.md to reflect current architecture
2026-04-02 01:21:23 +00:00
micqdf 0d339b3163 fix: Use rancher/kubectl image for rancher-backup hook
Deploy Cluster / Terraform (push) Successful in 53s
Deploy Cluster / Ansible (push) Successful in 5m41s
bitnami/kubectl:1.34 tag doesn't exist. rancher/kubectl is already
available in the cluster's image cache.
2026-04-02 01:00:27 +00:00
micqdf 30ccf13c82 fix: Use postRenderer to replace broken kuberlr-kubectl image in rancher-backup hook
Deploy Cluster / Terraform (push) Successful in 55s
Deploy Cluster / Ansible (push) Has been cancelled
The chart's post-install hook hardcodes rancher/kuberlr-kubectl which
can't download kubectl. Use Flux postRenderers to patch the job image
to bitnami/kubectl at render time.
2026-04-02 00:51:50 +00:00
micqdf 75e3604f30 fix: Skip post-install hooks for rancher-backup HelmRelease
Deploy Cluster / Terraform (push) Successful in 57s
Deploy Cluster / Ansible (push) Has been cancelled
The chart's post-install hook uses rancher/kuberlr-kubectl which fails
to download kubectl. The SA automountServiceAccountToken is managed
manually, so the hook is unnecessary.
2026-04-02 00:45:03 +00:00
micqdf e4235a6e58 fix: Correct Flux UI pod selector labels to match deployed weave-gitops labels
Deploy Cluster / Terraform (push) Successful in 51s
Deploy Cluster / Ansible (push) Successful in 20m36s
Actual labels are app.kubernetes.io/name=weave-gitops and
app.kubernetes.io/instance=flux-system-weave-gitops.
2026-04-01 02:08:12 +00:00
micqdf ea2d534171 fix: Use admin.existingSecret for Grafana creds from Doppler
Deploy Cluster / Terraform (push) Successful in 50s
Deploy Cluster / Ansible (push) Successful in 20m42s
Revert to idiomatic Grafana chart approach. ExternalSecret creates the
secret with admin-user/admin-password keys before Grafana's first start
on fresh cluster creation.
2026-04-01 01:41:49 +00:00
micqdf a1b9fe6aa6 fix: Use Flux valuesFrom to inject Doppler Grafana creds as Helm values
Deploy Cluster / Terraform (push) Successful in 49s
Deploy Cluster / Ansible (push) Successful in 20m38s
Switch from admin.existingSecret to valuesFrom so Flux reads the
Doppler-managed secret and injects credentials as standard Helm values.
2026-03-31 23:40:54 +00:00
micqdf 33765657ec fix: Correct pod selectors for Prometheus and Flux Tailscale services, use Doppler for Grafana creds
Deploy Cluster / Terraform (push) Successful in 50s
Deploy Cluster / Ansible (push) Successful in 21m0s
Prometheus needs operator.prometheus.io/name label selector. Flux UI pods
are labeled gitops-server not weave-gitops. Grafana now reads admin creds
from Doppler via ExternalSecret instead of hardcoded values.
2026-03-31 22:54:57 +00:00
micqdf b8f64fa952 feat: Expose Grafana, Prometheus, and Flux UI via Tailscale LoadBalancer services
Deploy Cluster / Terraform (push) Successful in 55s
Deploy Cluster / Ansible (push) Successful in 20m47s
Replace Ansible port-forwarding + tailscale serve with direct Tailscale LB
services matching the existing Rancher pattern. Each service gets its own
tailnet hostname (grafana/prometheus/flux.silverside-gopher.ts.net).
2026-03-31 08:53:28 +00:00
micqdf 569d741751 push
Deploy Cluster / Terraform (push) Successful in 2m37s
Deploy Cluster / Ansible (push) Successful in 25m37s
2026-03-31 02:46:55 +00:00
micqdf 89e53d9ec9 fix: Handle restricted B2 keys and safe JSON parsing in restore step
Deploy Cluster / Terraform (push) Successful in 52s
Deploy Cluster / Ansible (push) Successful in 20m48s
2026-03-31 01:43:04 +00:00
micqdf 5a2551f40a fix: Fix flux CLI download URL - use correct GitHub URL with v prefix on version
Deploy Cluster / Terraform (push) Successful in 51s
Deploy Cluster / Ansible (push) Failing after 21m52s
2026-03-30 03:11:40 +00:00
micqdf 8c7b62c024 feat: Automate Rancher backup restore in CI pipeline
Deploy Cluster / Terraform (push) Successful in 2m18s
Deploy Cluster / Ansible (push) Failing after 6m28s
- Wait for Rancher and rancher-backup operator to be ready
- Patch default SA in cattle-resources-system (fixes post-install hook failure)
- Clean up failed patch-sa jobs
- Force reconcile rancher-backup HelmRelease
- Find latest backup from B2 using Backblaze API
- Create Restore CR to restore Rancher state from latest backup
- Wait for restore to complete before continuing
2026-03-30 01:56:29 +00:00
micqdf a1f07f863a docs: Update restore template with real Backup CR format
Deploy Cluster / Terraform (push) Successful in 50s
Deploy Cluster / Ansible (push) Successful in 6m2s
Include actual restore CR spec and note the latest backup filename for reference.
2026-03-30 00:09:53 +00:00
micqdf 2c3a49c2e0 fix: Rename B2 secret keys to match rancher-backup operator expectations
Deploy Cluster / Terraform (push) Successful in 51s
Deploy Cluster / Ansible (push) Has been cancelled
The operator expects accessKey/secretKey, not aws_access_key_id/aws_secret_access_key.
2026-03-30 00:05:13 +00:00
micqdf a7ce3dcc1a fix: Remove s3 block from rancher-backup HelmRelease values
Deploy Cluster / Terraform (push) Successful in 49s
Deploy Cluster / Ansible (push) Successful in 6m12s
The S3 config caused the operator to try downloading kubectl, which fails in the container.
S3 credentials are correctly configured in the Backup CR and ExternalSecret instead.
2026-03-29 23:47:21 +00:00
micqdf 0ab9418458 fix: Re-add HTTPS port to Tailscale LB for Rancher
Deploy Cluster / Terraform (push) Successful in 51s
Deploy Cluster / Ansible (push) Successful in 6m6s
Rancher now manages its own TLS (no longer tls:external), so it serves
HTTPS on port 443. The Tailscale LoadBalancer needs to expose both
HTTP (80) and HTTPS (443) targeting the corresponding container ports.
2026-03-29 23:04:49 +00:00
micqdf c251672618 fix: Configure S3 bucketName for rancher-backup operator
Deploy Cluster / Terraform (push) Successful in 50s
Deploy Cluster / Ansible (push) Has been cancelled
2026-03-29 23:01:18 +00:00
micqdf 89364e8f37 fix: Add dependsOn for rancher-backup operator to wait for CRDs
Deploy Cluster / Terraform (push) Successful in 50s
Deploy Cluster / Ansible (push) Has been cancelled
2026-03-29 22:57:22 +00:00
micqdf 20d7a6f777 fix: Install rancher-backup CRD chart before operator
Deploy Cluster / Terraform (push) Successful in 50s
Deploy Cluster / Ansible (push) Has been cancelled
The rancher-backup operator requires CRDs from the rancher-backup-crd
chart to be installed first.
2026-03-29 22:51:34 +00:00
micqdf 22ce5fd6f4 feat: Add cert-manager as dependency for Rancher
Deploy Cluster / Terraform (push) Successful in 49s
Deploy Cluster / Ansible (push) Successful in 5m59s
Rancher requires cert-manager when managing its own TLS (not tls:external).
Added cert-manager HelmRelease with CRDs enabled.
2026-03-29 22:36:30 +00:00
micqdf afb1782d38 fix: Separate Backup CRs into their own kustomization
Deploy Cluster / Terraform (push) Successful in 41s
Deploy Cluster / Ansible (push) Successful in 5m57s
The Backup and Restore CRs need the rancher-backup CRDs to exist first.
Moved them to a separate kustomization that depends on the operator being ready.
2026-03-29 22:22:29 +00:00
micqdf 48870433bf fix: Remove tls:external from Rancher HelmRelease
Deploy Cluster / Terraform (push) Failing after 55s
Deploy Cluster / Ansible (push) Has been skipped
With Tailscale LoadBalancer, TLS is not actually terminated at the edge.
The Tailscale proxy does TCP passthrough, so Rancher must serve its own
TLS certs. Setting tls: external caused Rancher to listen HTTP-only,
which broke HTTPS access through Tailscale.
2026-03-29 22:19:23 +00:00
micqdf f2c506b350 refactor: Replace CNPG external DB with rancher-backup operator
Deploy Cluster / Terraform (push) Successful in 48s
Deploy Cluster / Ansible (push) Successful in 6m5s
Rancher 2.x uses embedded etcd, not an external PostgreSQL database.
The CATTLE_DB_CATTLE_* env vars are Rancher v1 only and were ignored.

- Remove all CNPG (CloudNativePG) cluster, operator, and related configs
- Remove external DB env vars from Rancher HelmRelease
- Remove rancher-db-password ExternalSecret
- Add rancher-backup operator HelmRelease (v106.0.2+up8.1.0)
- Add B2 credentials ExternalSecret for backup storage
- Add recurring Backup CR (daily at 03:00, 7 day retention)
- Add commented-out Restore CR for rebuild recovery
- Update Flux dependency graph accordingly
2026-03-29 21:53:16 +00:00
micqdf efdf13976a fix: Handle missing 'online' field in Tailscale API response
Deploy Cluster / Terraform (push) Successful in 2m12s
Deploy Cluster / Ansible (push) Successful in 9m19s
2026-03-29 13:52:23 +00:00
micqdf 5269884408 feat: Auto-cleanup stale Tailscale devices before cluster boot
Deploy Cluster / Terraform (push) Successful in 2m17s
Deploy Cluster / Ansible (push) Failing after 6m35s
Adds tailscale-cleanup Ansible role that uses the Tailscale API to
delete offline devices matching reserved hostnames (e.g. rancher).
Runs during site.yml before Finalize to prevent hostname collisions
like rancher-1 on rebuild.

Requires TAILSCALE_API_KEY (API access token) passed as extra var.
2026-03-29 11:47:53 +00:00
micqdf 6e5b0518be feat: Add kubeconfig refresh script and fix Ansible Finalize to use public IP
Deploy Cluster / Terraform (push) Successful in 53s
Deploy Cluster / Ansible (push) Successful in 5m25s
- scripts/refresh-kubeconfig.sh fetches a fresh kubeconfig from CP1
- Ansible site.yml Finalize step now uses public IP instead of Tailscale
  hostname for the kubeconfig server address
- Updated AGENTS.md with kubeconfig refresh instructions
2026-03-29 03:31:36 +00:00
micqdf 905d069e91 fix: Add serverName to CNPG externalClusters for B2 recovery
Deploy Cluster / Terraform (push) Successful in 49s
Deploy Cluster / Ansible (push) Successful in 5m22s
CNPG uses the external cluster name (b2-backup) as the barman server
name by default, but the backups were stored under server name rancher-db.
2026-03-29 03:22:19 +00:00
micqdf 25ba4b7115 fix: Add skipEmptyWalArchiveCheck annotation and B2 secret healthcheck to CNPG
Deploy Cluster / Terraform (push) Successful in 49s
Deploy Cluster / Ansible (push) Successful in 5m22s
- Skip WAL archive emptiness check so recovery works when restoring over
  an existing backup archive in B2
- Add healthCheck for b2-credentials secret in CNPG kustomization to
  prevent recovery from starting before ExternalSecret has synced
2026-03-29 03:15:23 +00:00
micqdf 6a593fd559 feat: Add B2 recovery bootstrap to CNPG cluster
Deploy Cluster / Terraform (push) Successful in 2m6s
Deploy Cluster / Ansible (push) Successful in 8m16s
2026-03-29 00:22:24 +00:00
micqdf 936f54a1b5 fix: Restore canonical Rancher tailnet hostname
Deploy Cluster / Terraform (push) Successful in 48s
Deploy Cluster / Ansible (push) Successful in 6m1s
2026-03-29 00:00:39 +00:00
micqdf c9df11e65f fix: Align Rancher tailnet hostname with live proxy
Deploy Cluster / Terraform (push) Successful in 49s
Deploy Cluster / Ansible (push) Successful in 6m1s
2026-03-28 23:47:09 +00:00
micqdf a3c238fda9 fix: Apply Rancher server URL after chart install
Deploy Cluster / Terraform (push) Successful in 2m43s
Deploy Cluster / Ansible (push) Successful in 10m39s
2026-03-28 23:12:59 +00:00
micqdf a15fa50302 fix: Use Doppler-backed Rancher bootstrap password
Deploy Cluster / Terraform (push) Successful in 49s
Deploy Cluster / Ansible (push) Successful in 5m43s
2026-03-28 22:51:38 +00:00
micqdf 0f4f0b09fb fix: Add Rancher DB password ExternalSecret
Deploy Cluster / Terraform (push) Successful in 49s
Deploy Cluster / Ansible (push) Successful in 5m42s
2026-03-28 22:42:05 +00:00
micqdf 4c002a870c fix: Remove invalid Rancher server-url manifest
Deploy Cluster / Terraform (push) Successful in 51s
Deploy Cluster / Ansible (push) Has been cancelled
2026-03-28 22:39:31 +00:00
micqdf 43d11ac7e6 docs: Add agent guidance and sync Rancher docs
Deploy Cluster / Terraform (push) Successful in 2m33s
Deploy Cluster / Ansible (push) Successful in 9m44s
2026-03-28 22:13:37 +00:00
micqdf 8c5edcf0a1 fix: Set Rancher server URL to tailnet hostname
Deploy Cluster / Terraform (push) Successful in 1m0s
Deploy Cluster / Ansible (push) Successful in 6m27s
2026-03-28 04:07:44 +00:00
micqdf a81da0d178 feat: Expose Rancher via Tailscale hostname
Deploy Cluster / Terraform (push) Successful in 52s
Deploy Cluster / Ansible (push) Successful in 6m42s
2026-03-28 03:59:02 +00:00
micqdf 2a72527c79 fix: Switch Traefik from LoadBalancer to NodePort, remove unused Hetzner LB
Deploy Cluster / Terraform (push) Successful in 49s
Deploy Cluster / Ansible (push) Successful in 6m25s
2026-03-28 03:21:19 +00:00
micqdf 7cb3b84ecb feat: Replace custom pgdump job with CNPG ScheduledBackup
Deploy Cluster / Terraform (push) Successful in 1m30s
Deploy Cluster / Ansible (push) Has been cancelled
2026-03-28 03:15:39 +00:00