Commit Graph

351 Commits

micqdf 31e95eb227 fix: pre-pull Flux controllers before bootstrap rollout
Deploy Cluster / Terraform (push) Successful in 28s
Deploy Cluster / Ansible (push) Failing after 16m39s
2026-04-23 20:36:57 +00:00
micqdf 12675417bd fix: use correct namespace and deployment name for ESO rollout check
Deploy Cluster / Terraform (push) Successful in 1m36s
Deploy Cluster / Ansible (push) Failing after 40m40s
The ESO deployment is named external-secrets-external-secrets in the
external-secrets namespace, not external-secrets in kube-system.
2026-04-23 19:00:15 +00:00
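A minimal sketch of the corrected check as an Ansible task — the deployment name and namespace come from the commit message, while the kubeconfig path and timeout are assumptions:

```yaml
- name: Wait for the ESO controller deployment to become available
  ansible.builtin.command: >
    kubectl --kubeconfig /etc/rancher/k3s/k3s.yaml
    -n external-secrets rollout status
    deployment/external-secrets-external-secrets --timeout=600s
  changed_when: false
```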
micqdf 8e081ddfda fix: wait on ESO deployment directly instead of Flux Kustomization status
Deploy Cluster / Terraform (push) Successful in 29s
Deploy Cluster / Ansible (push) Failing after 19m8s
The addon-external-secrets Flux Kustomization was timing out during bootstrap
because image pulls on fresh Proxmox VMs are slow. The critical dependency is
the ESO deployment being available for the Doppler ClusterSecretStore. Replace
the Kustomization readiness check with direct checks for ESO CRD establishment
and deployment rollout, which are the actual prerequisites for the next step.
2026-04-23 07:32:19 +00:00
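The CRD half of that check might look like the sketch below; the CRD names are the upstream ESO names and the timeout is an assumption:

```yaml
- name: Wait for the ESO CRDs to be established
  ansible.builtin.command: >
    kubectl wait --for=condition=Established
    crd/externalsecrets.external-secrets.io
    crd/clustersecretstores.external-secrets.io
    --timeout=300s
  changed_when: false
```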
micqdf 4b7517c9c5 fix: health-check external-secrets addon via HelmRelease only
Deploy Cluster / Terraform (push) Successful in 27s
Deploy Cluster / Ansible (push) Failing after 17m22s
The external-secrets Kustomization was still using wait=true, which made Flux
hold the addon in a failed state whenever the HelmRepository hit transient fetch
errors, even though the HelmRelease and runtime controller deployments were
healthy. Switch it to an explicit HelmRelease health check like the other
helm-backed addons.
2026-04-23 07:11:21 +00:00
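The pattern described here maps to a Flux Kustomization roughly like the following sketch; the repo path is an assumption:

```yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: addon-external-secrets
  namespace: flux-system
spec:
  interval: 10m
  path: ./infrastructure/addons/external-secrets  # assumed path
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system
  wait: false  # do not gate readiness on every child object
  healthChecks:  # readiness == the HelmRelease itself
    - apiVersion: helm.toolkit.fluxcd.io/v2
      kind: HelmRelease
      name: external-secrets
      namespace: external-secrets
```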
micqdf f9bc53723f fix: make image pre-pull roles fully best effort
Deploy Cluster / Terraform (push) Successful in 27s
Deploy Cluster / Ansible (push) Failing after 22m46s
The pre-pull roles were still blocking the playbook because they retried until
success and exhausted their retry budget during registry TLS timeouts. Keep the
image pulls as opportunistic cache warmers, but never let them fail the
bootstrap; log any missed images instead.
2026-04-23 06:41:21 +00:00
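A best-effort pre-pull in Ansible might look like this sketch; the prepull_images variable is hypothetical:

```yaml
- name: Warm the containerd image cache (never fail the bootstrap)
  ansible.builtin.command: "k3s crictl pull {{ item }}"
  loop: "{{ prepull_images }}"
  register: prepull
  failed_when: false    # opportunistic: a miss must not stop the play
  changed_when: false

- name: Log any images the pre-pull missed
  ansible.builtin.debug:
    msg: "pre-pull missed: {{ item.item }}"
  loop: "{{ prepull.results }}"
  when: item.rc != 0
```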
micqdf ee6417c18e fix: pre-pull core bootstrap images on cp1 before Flux bootstrap
Deploy Cluster / Terraform (push) Successful in 28s
Deploy Cluster / Ansible (push) Has been cancelled
Fresh clusters were repeatedly timing out while kubelet pulled the pause image,
k3s packaged component images, and Flux controller images onto the first
control plane. Pre-pull the core control-plane bootstrap images into
containerd on cp-1 so Flux and packaged addons start from a warm cache instead
of racing registry TLS timeouts.
2026-04-23 05:55:14 +00:00
micqdf 1156dc0203 fix: pre-pull kube-vip images before waiting for VIP
Deploy Cluster / Terraform (push) Successful in 29s
Deploy Cluster / Ansible (push) Failing after 43m31s
The primary control plane was stalling because kubelet still had to pull both
the Rancher pause image and the kube-vip image before the DaemonSet pod could
become Ready. Pre-pull those images into containerd, extend the readiness wait,
and emit pod diagnostics if kube-vip still does not come up.
2026-04-23 03:55:52 +00:00
micqdf 4151027e01 fix: clean stale Tailscale node devices before bootstrap
Deploy Cluster / Terraform (push) Successful in 1m40s
Deploy Cluster / Ansible (push) Failing after 14m30s
Run the Tailscale cleanup role against the cluster hostnames before any node
reconnects to the tailnet. This removes stale offline cp/worker devices from
previous rebuilds so replacement VMs can reclaim their original hostnames
instead of getting -1 suffixes.
2026-04-23 03:25:17 +00:00
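Cleanups like this are typically driven through the Tailscale v2 API. A sketch — the tailscale_api_key and cluster_hostnames variables are hypothetical, and staleness filtering is simplified to a hostname match:

```yaml
- name: List tailnet devices
  ansible.builtin.uri:
    url: https://api.tailscale.com/api/v2/tailnet/-/devices
    headers:
      Authorization: "Bearer {{ tailscale_api_key }}"
  register: tailnet_devices

- name: Delete stale devices that collide with cluster hostnames
  ansible.builtin.uri:
    url: "https://api.tailscale.com/api/v2/device/{{ item.id }}"
    method: DELETE
    headers:
      Authorization: "Bearer {{ tailscale_api_key }}"
    status_code: [200, 204]
  loop: "{{ tailnet_devices.json.devices }}"
  when: item.hostname in cluster_hostnames
```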
micqdf 9269e9df1b docs: add guide for deploying app repos to the cluster
Deploy Cluster / Terraform (push) Successful in 1m36s
Deploy Cluster / Ansible (push) Has been cancelled
Document the recommended two-repo model for application delivery, including
Flux attachment objects, Doppler/ExternalSecret wiring, Tailscale service
exposure, and the steps for enabling the suspended apps layer.
2026-04-23 02:43:00 +00:00
micqdf d9374bc209 fix: remove duplicate wait keys from helm addon kustomizations
Deploy Cluster / Terraform (push) Successful in 29s
Deploy Cluster / Ansible (push) Has been cancelled
The repo-only Kustomization healthCheck change accidentally left the original
wait:true keys in the Rancher and Rancher backup Kustomizations, which broke
the infrastructure kustomize build. Remove the duplicate keys so Flux can
apply the HelmRelease-only health checks cleanly.
2026-04-23 02:20:57 +00:00
micqdf c570a476b5 fix: make helm-based addon kustomizations health-check HelmReleases only
Deploy Cluster / Terraform (push) Successful in 29s
Deploy Cluster / Ansible (push) Has been cancelled
These addon Kustomizations were using wait=true, which made Flux treat transient
HelmRepository fetch timeouts as addon failures even when the HelmRelease and
runtime workloads were healthy. Switch the affected Kustomizations to explicit
HelmRelease healthChecks so readiness reflects the actual deployed platform
state instead of repository fetch flakiness.
2026-04-23 02:15:45 +00:00
micqdf a7f11ccf94 fix: give Rancher more time to pass startup probe during upgrades
Deploy Cluster / Terraform (push) Successful in 28s
Deploy Cluster / Ansible (push) Successful in 18m59s
Rancher needs longer than the chart default 2-minute startup probe budget on
this cluster while it restores local catalogs and finishes API startup. Extend
the startup probe failure threshold so Helm upgrades can complete instead of
restarting the new pod before it becomes ready.
2026-04-23 01:44:25 +00:00
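Assuming the Rancher chart exposes startup probe settings as values, the extension would be a HelmRelease values fragment along these lines (thresholds illustrative):

```yaml
spec:
  values:
    startupProbe:
      periodSeconds: 10
      failureThreshold: 60   # ~10 minutes instead of the ~2-minute chart default
```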
micqdf a7d540ca65 fix: stop forcing Flux releases during deploy bootstrap
Deploy Cluster / Terraform (push) Successful in 32s
Deploy Cluster / Ansible (push) Successful in 21m12s
Remove the HelmRelease reset/force annotations from the deploy workflow now
that the cluster can converge on its own. The runtime waits remain, but CI no
longer re-triggers Rancher and NFS churn on every bootstrap attempt.
2026-04-23 00:35:31 +00:00
micqdf 098bd98876 fix: wait on Rancher and storage runtime objects during bootstrap
Deploy Cluster / Terraform (push) Successful in 26s
Deploy Cluster / Ansible (push) Failing after 25m19s
Flux can leave HelmRelease and Kustomization conditions stale after transient
chart fetch or image pull failures even when the underlying workloads recover.
Switch the deploy workflow to wait on the concrete runtime resources we care
about: the NFS provisioner deployment and StorageClass, Rancher deployment,
webhook, cert-manager issuer/certificate, and the rancher-backup deployment.
2026-04-22 18:41:09 +00:00
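Waiting on concrete runtime objects rather than Flux conditions might look like this sketch; the Rancher names are chart defaults, while the NFS namespace, deployment, and StorageClass names are assumptions:

```yaml
- name: Wait for the runtime deployments to roll out
  ansible.builtin.command: >
    kubectl -n {{ item.ns }} rollout status
    deployment/{{ item.name }} --timeout=900s
  loop:
    - { ns: cattle-system, name: rancher }
    - { ns: cattle-system, name: rancher-webhook }
    - { ns: cattle-resources-system, name: rancher-backup }
    - { ns: kube-system, name: nfs-subdir-external-provisioner }  # assumed
  changed_when: false

- name: Confirm the NFS StorageClass exists
  ansible.builtin.command: kubectl get storageclass nfs-client  # assumed name
  changed_when: false
```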
micqdf 55d7b8201e fix: make Rancher image pre-pull best effort and disable managed SUC
Deploy Cluster / Terraform (push) Successful in 27s
Deploy Cluster / Ansible (push) Failing after 32m19s
Docker Hub TLS handshakes are too flaky to make pre-pulling a hard bootstrap
requirement. Treat image pre-pull as opportunistic and disable Rancher's
managed system-upgrade-controller feature so that image is removed from the
critical install path while Rancher and its webhook converge.
2026-04-22 11:33:13 +00:00
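If the feature is toggled through the chart's features value, the HelmRelease fragment might read as below; the flag name is an assumption, not confirmed by the commit:

```yaml
spec:
  values:
    features: "managed-system-upgrade-controller=false"  # assumed flag name
```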
micqdf 9c0523e880 fix: pre-pull Rancher images and reset Rancher release during bootstrap
Deploy Cluster / Terraform (push) Successful in 28s
Deploy Cluster / Ansible (push) Failing after 27m30s
Rancher installs were stalling on transient Docker Hub TLS handshake timeouts
for the Rancher shell, webhook, and system-upgrade-controller images. Pre-pull the
required images onto all nodes after k3s comes up, extend the Rancher HelmRelease
timeout, and reset/force the Rancher HelmRelease before waiting on addon-rancher
so bootstrap can recover from stale failed remediation state.
2026-04-22 11:00:54 +00:00
micqdf 8372d562ad fix: reset and force nfs helmrelease during bootstrap
Deploy Cluster / Terraform (push) Successful in 29s
Deploy Cluster / Ansible (push) Failing after 20m22s
When the NFS storage HelmRelease has already entered a failed remediation state,
a plain reconcile request is not enough to clear the stale failure counters.
Send requestedAt, resetAt, and forceAt together so helm-controller retries the
release cleanly before the workflow waits on addon-nfs-storage.
2026-04-22 10:35:32 +00:00
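All three annotations are sent with the same token; a sketch, with the HelmRelease name and namespace as assumptions:

```yaml
- name: Reset and force the NFS HelmRelease before waiting on it
  ansible.builtin.shell: |
    ts="$(date -u +%Y-%m-%dT%H:%M:%SZ)"
    kubectl -n kube-system annotate helmrelease/nfs-storage \
      reconcile.fluxcd.io/requestedAt="$ts" \
      reconcile.fluxcd.io/resetAt="$ts" \
      reconcile.fluxcd.io/forceAt="$ts" \
      --overwrite
```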
micqdf 1bb11dfe3a fix: force nfs storage reconcile during flux bootstrap
Deploy Cluster / Terraform (push) Successful in 27s
Deploy Cluster / Ansible (push) Failing after 19m0s
The NFS HelmRelease can remain in a failed state from an earlier bootstrap
attempt even after the backing NFS export is corrected and the pod becomes
healthy. Request a fresh reconcile of the HelmRelease and addon kustomization
before waiting on addon-nfs-storage so the bootstrap step can observe the
recovered state.
2026-04-22 10:08:20 +00:00
micqdf 624cd5aab6 fix: point NFS provisioner at active Proxmox host export
Deploy Cluster / Terraform (push) Successful in 27s
Deploy Cluster / Ansible (push) Failing after 18m51s
The cluster nodes can reach the exported NFS path on 10.27.27.239, not
10.27.27.22. Update the storage addon and repo note so the NFS provisioner
mounts the live export and Flux health checks can converge.
2026-04-22 09:46:01 +00:00
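Assuming the addon wraps the nfs-subdir-external-provisioner chart, the change is a values fragment like this; the export path is an assumption:

```yaml
spec:
  values:
    nfs:
      server: 10.27.27.239   # the reachable Proxmox host export
      path: /export/k8s      # assumed export path
```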
micqdf 71bdc6a709 fix: extend Flux bootstrap timeouts on fresh clusters
Deploy Cluster / Terraform (push) Successful in 26s
Deploy Cluster / Ansible (push) Failing after 18m44s
Fresh Proxmox clusters need longer for the Flux controller rollouts and first
GitRepository/Kustomization reconciliations, especially while images are still
being pulled onto the control plane. Increase the bootstrap wait windows so CI
does not fail while the controllers are still converging.
2026-04-22 08:36:27 +00:00
micqdf 714f20417b fix: tolerate control-plane taint when pinning Flux to cp1
Deploy Cluster / Terraform (push) Successful in 28s
Deploy Cluster / Ansible (push) Failing after 10m19s
Flux bootstrap patches the controllers onto k8s-cluster-cp-1, but the
control-plane node is tainted NoSchedule. Add the matching toleration in both
the checked-in patch manifest and the bootstrap workflow so the controllers can
actually schedule and roll out on cp-1.
2026-04-22 05:05:15 +00:00
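Flux's documented way of patching all controllers at once fits this fix. A sketch of the flux-system kustomization — the labels are the Flux defaults, the node name is from the commit:

```yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - gotk-components.yaml
  - gotk-sync.yaml
patches:
  - patch: |
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: all   # placeholder; selection happens via target below
      spec:
        template:
          spec:
            nodeSelector:
              kubernetes.io/hostname: k8s-cluster-cp-1
            tolerations:
              - key: node-role.kubernetes.io/control-plane
                operator: Exists
                effect: NoSchedule
    target:
      kind: Deployment
      labelSelector: app.kubernetes.io/part-of=flux
```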
micqdf c32bec34bc fix: quote kube-vip readiness jsonpath in bootstrap role
Deploy Cluster / Terraform (push) Successful in 27s
Deploy Cluster / Ansible (push) Failing after 10m11s
The local kube-vip readiness probe used an unquoted jsonpath predicate,
which made kubectl treat Ready as an identifier instead of a string. Use a
quoted jsonpath via shell so bootstrap can detect the primary kube-vip pod
properly before waiting on the API VIP.
2026-04-22 04:41:48 +00:00
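The quoting fix amounts to running the predicate through the shell with the string literal intact. A sketch — the label selector and retry budget are assumptions:

```yaml
- name: Detect a Ready kube-vip pod on the primary control plane
  ansible.builtin.shell: >
    kubectl -n kube-system get pods
    -l app.kubernetes.io/name=kube-vip
    --field-selector spec.nodeName=k8s-cluster-cp-1
    -o jsonpath='{.items[0].status.conditions[?(@.type=="Ready")].status}'
  register: kubevip_ready
  until: kubevip_ready.stdout == "True"
  retries: 30
  delay: 10
  changed_when: false
  failed_when: false   # the until loop decides success
```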
micqdf 6519a7673d fix: wait for kube-vip on primary node during bootstrap
Deploy Cluster / Terraform (push) Successful in 28s
Deploy Cluster / Ansible (push) Failing after 9m11s
The kube-vip DaemonSet is applied before the secondary control planes join,
so waiting for a full DaemonSet rollout blocks bootstrap on nodes that do not
exist in the cluster yet. Wait only for the primary node's kube-vip pod and
then verify the VIP is reachable on 6443.
2026-04-22 04:29:29 +00:00
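The VIP reachability check can be a plain TCP wait; a sketch, with kube_vip_address as a hypothetical variable:

```yaml
- name: Verify the API VIP answers on 6443
  ansible.builtin.wait_for:
    host: "{{ kube_vip_address }}"
    port: 6443
    timeout: 300
```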
micqdf d1c31cdb91 fix: rely on k3s service readiness instead of installer exit code
Deploy Cluster / Terraform (push) Successful in 27s
Deploy Cluster / Ansible (push) Failing after 8m9s
The k3s install script can return non-zero while systemd is still bringing the
service up, especially on worker agents. Do not fail immediately on the
installer command; wait for the service to become active and only emit
install diagnostics if the later readiness check fails.
2026-04-22 04:14:31 +00:00
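A sketch of the decoupled pattern — the installer staging path and retry budget are assumptions, and worker agents would poll k3s-agent instead:

```yaml
- name: Run the k3s installer (exit code is advisory only)
  ansible.builtin.command: /tmp/k3s-install.sh
  register: k3s_install
  failed_when: false

- name: Wait for the k3s service to become active
  ansible.builtin.command: systemctl is-active k3s
  register: k3s_active
  until: k3s_active.stdout == "active"
  retries: 30
  delay: 10
  changed_when: false
  failed_when: false   # readiness is judged by the until loop
```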
micqdf b3e88712bd fix: derive cluster network interface from host facts
Deploy Cluster / Terraform (push) Successful in 28s
Deploy Cluster / Ansible (push) Failing after 12m32s
The Proxmox Ubuntu clones are exposing their primary NIC as eth0, not ens18.
Use ansible_default_ipv4.interface for k3s flannel and kube-vip so bootstrap
tracks the actual interface name instead of a guessed template default.
2026-04-22 03:50:03 +00:00
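In variable terms the fix is a one-liner per consumer; the variable names here are assumptions:

```yaml
k3s_flannel_iface: "{{ ansible_default_ipv4.interface }}"   # eth0 on these clones
kube_vip_interface: "{{ ansible_default_ipv4.interface }}"
```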
micqdf 06366ee5e6 fix: accept cloud-init exit code 2 after first boot
Deploy Cluster / Terraform (push) Successful in 28s
Deploy Cluster / Ansible (push) Failing after 6m2s
Ubuntu cloud-init returns exit code 2 for some completed boots even when the
status output is 'done'. Treat that as a successful wait state so Ansible can
continue into the package install phase instead of aborting early.
2026-04-22 03:40:55 +00:00
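A sketch of the tolerant wait — cloud-init documents exit code 2 as a completed-but-degraded boot:

```yaml
- name: Wait for cloud-init to finish first boot
  ansible.builtin.command: cloud-init status --wait
  register: cloud_init
  changed_when: false
  failed_when: cloud_init.rc not in [0, 2]   # 2 == done with recoverable errors
```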
micqdf 9a2d213114 fix: wait for cloud-init before package install during bootstrap
Deploy Cluster / Terraform (push) Successful in 29s
Deploy Cluster / Ansible (push) Failing after 2m36s
Fresh Ubuntu cloud-init clones still hold apt and dpkg locks during first boot,
which caused the Ansible common role to fail before the control plane could
finish bootstrap. Wait for cloud-init, increase apt lock timeouts, and skip the
final kubeconfig rewrite when no kubeconfig was fetched yet.
2026-04-22 03:34:53 +00:00
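The lock half of the fix might look like this; the package list variable is hypothetical, while lock_timeout is the apt module's own parameter:

```yaml
- name: Install base packages once the first-boot apt/dpkg locks clear
  ansible.builtin.apt:
    name: "{{ common_packages }}"
    state: present
    update_cache: true
    lock_timeout: 600
```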
micqdf 9482a0f551 fix: skip clone storage override for linked Proxmox clones
Deploy Cluster / Terraform (push) Successful in 1m43s
Deploy Cluster / Ansible (push) Failing after 6m24s
The bpg/proxmox provider rejects clone.datastore_id when creating linked
clones. Only pass the target datastore when full clones are enabled so the
linked-clone baseline can provision from template 9000 successfully.
2026-04-22 03:22:50 +00:00
micqdf 5c53b8e06e fix: normalize Proxmox endpoint and stop dashboards self-trigger
Deploy Cluster / Terraform (push) Failing after 53s
Deploy Cluster / Ansible (push) Has been skipped
Accept Proxmox API endpoints with or without /api2/json in CI and local
tfvars, and avoid running the dashboards workflow just because its own
workflow file changed during platform migrations.
2026-04-22 03:13:22 +00:00
micqdf b1dae28aa5 feat: migrate cluster baseline from Hetzner to Proxmox
Deploy Cluster / Terraform (push) Failing after 52s
Deploy Cluster / Ansible (push) Has been skipped
Deploy Grafana Content / Grafana Content (push) Failing after 1m37s
Replace Hetzner infrastructure and cloud-provider assumptions with Proxmox
VM clones, kube-vip API HA, and NFS-backed storage. Update bootstrap,
Flux addons, CI workflows, and docs to target the new private Proxmox
baseline while preserving the existing Tailscale, Doppler, Flux, Rancher,
and B2 backup flows.
2026-04-22 03:02:13 +00:00
micqdf 6c6b9d20ca update README
Deploy Cluster / Ansible (push) Has been cancelled
Deploy Cluster / Terraform (push) Has been cancelled
2026-04-22 01:14:21 +00:00
micqdf c3a2f25c94 docs: record validated Rancher restore drill
Deploy Cluster / Terraform (push) Successful in 2m11s
Deploy Cluster / Ansible (push) Successful in 10m9s
Update the baseline to treat Rancher backup and restore validation as part
of the accepted platform state, and capture the successful live drill run
performed on 2026-04-18.
2026-04-18 21:27:42 +00:00
micqdf 7385c2263e fix: add tailnet smoke checks and move Tailscale operator to stable
Deploy Cluster / Terraform (push) Successful in 49s
Deploy Cluster / Ansible (push) Successful in 5m55s
Add a post-deploy smoke test that validates Tailscale DNS, proxy readiness,
reachability, and service responses for Rancher, Grafana, and Prometheus.
Move the operator to the stable Helm repo/version and align the baseline docs
with the current HA private-only architecture.
2026-04-18 19:59:13 +00:00
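A smoke test in this shape can be a handful of uri probes. A sketch — the URLs, schemes, and accepted status codes are assumptions, with the Prometheus port taken from the port-alignment commit further down:

```yaml
- name: Smoke-test exposed tailnet services
  ansible.builtin.uri:
    url: "{{ item }}"
    status_code: [200, 302, 401]   # auth redirects still prove the proxy path
  loop:
    - https://rancher.silverside-gopher.ts.net
    - https://grafana.silverside-gopher.ts.net
    - http://prometheus.silverside-gopher.ts.net:9090
```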
micqdf 60f466ab98 remove Weave GitOps addon
Deploy Cluster / Terraform (push) Successful in 41s
Deploy Cluster / Ansible (push) Successful in 5m37s
Drop the Flux UI addon and its Tailscale exposure because the UI lags the
current Flux APIs and reports misleading HelmRelease errors. Keep Flux managed
through the controllers themselves and use Rancher or the flux CLI for access.
2026-04-18 18:44:55 +00:00
micqdf b20356e9fe fix: only clean stale Tailscale names before proxies exist
Deploy Cluster / Terraform (push) Failing after 51s
Deploy Cluster / Ansible (push) Has been skipped
The Tailscale cleanup role was deleting reserved service hostnames on later
deploy runs, which removed the live Rancher/Grafana/Prometheus/Flux proxy
nodes from the tailnet. Skip cleanup whenever the current cluster already has
those Tailscale services, while still allowing cleanup on fresh rebuilds.
2026-04-18 18:16:27 +00:00
micqdf 2ba6b6a896 fix: remove unused Flux CLI install from deploy workflow
Deploy Cluster / Terraform (push) Successful in 49s
Deploy Cluster / Ansible (push) Successful in 5m40s
The deploy pipeline never uses the flux binary after installation, so the
GitHub release download only adds a flaky failure point. Remove the step and
keep the bootstrap path kubectl-only.
2026-04-18 17:45:59 +00:00
micqdf 9126de1423 fix: Align Prometheus external URL with Tailscale service port
Deploy Cluster / Terraform (push) Successful in 48s
Deploy Cluster / Ansible (push) Failing after 4m52s
Prometheus is exposed on port 9090 through the Tailscale LoadBalancer
service, so the configured external URL and repo docs should match the
actual address users reach after rebuilds.
2026-04-18 17:11:16 +00:00
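For kube-prometheus-stack this is a single values path; a fragment, with the scheme assumed:

```yaml
spec:
  values:
    prometheus:
      prometheusSpec:
        externalUrl: http://prometheus.silverside-gopher.ts.net:9090
```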
micqdf 4532b9ed74 chore: trigger rebuild
Deploy Cluster / Terraform (push) Successful in 2m8s
Deploy Cluster / Ansible (push) Successful in 12m54s
2026-04-18 06:09:54 +00:00
micqdf 68dbd2e5b7 fix: Reserve Tailscale service hostnames and tag exposed proxies
Deploy Cluster / Terraform (push) Successful in 53s
Deploy Cluster / Ansible (push) Successful in 6m3s
Reserve grafana/prometheus/flux alongside rancher during rebuild cleanup so
stale tailnet devices do not force -1 hostnames. Tag the exposed Tailscale
services so operator-managed proxies are provisioned with explicit prod/service
tags from the tailnet policy.
2026-04-18 05:48:26 +00:00
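With the Tailscale operator, the hostname and tags ride on Service annotations. A sketch — the namespace, selector, ports, and exact tag names are assumptions:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: grafana-tailscale
  namespace: monitoring
  annotations:
    tailscale.com/hostname: grafana
    tailscale.com/tags: "tag:prod,tag:service"
spec:
  type: LoadBalancer
  loadBalancerClass: tailscale
  selector:
    app.kubernetes.io/name: grafana
  ports:
    - port: 80
      targetPort: 3000   # Grafana's default container port
```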
micqdf ceefcc3b29 cleanup: Remove obsolete port-forwarding, deferred Traefik files, and CI workaround
Deploy Cluster / Terraform (push) Successful in 2m21s
Deploy Cluster / Ansible (push) Successful in 13m9s
- Remove ansible/roles/private-access/ (replaced by Tailscale LB services)
- Remove deferred observability ingress/traefik files (replaced by direct Tailscale LBs)
- Remove orphaned kustomization-traefik-config.yaml (no backing directory)
- Simplify CI: remove SA patch + job deletion workaround for rancher-backup
  (now handled by postRenderer in HelmRelease)
- Update AGENTS.md to reflect current architecture
2026-04-02 01:21:23 +00:00
micqdf 0d339b3163 fix: Use rancher/kubectl image for rancher-backup hook
Deploy Cluster / Terraform (push) Successful in 53s
Deploy Cluster / Ansible (push) Successful in 5m41s
The bitnami/kubectl:1.34 tag doesn't exist; rancher/kubectl is already
available in the cluster's image cache.
2026-04-02 01:00:27 +00:00
micqdf 30ccf13c82 fix: Use postRenderer to replace broken kuberlr-kubectl image in rancher-backup hook
Deploy Cluster / Terraform (push) Successful in 55s
Deploy Cluster / Ansible (push) Has been cancelled
The chart's post-install hook hardcodes rancher/kuberlr-kubectl which
can't download kubectl. Use Flux postRenderers to patch the job image
to bitnami/kubectl at render time.
2026-04-02 00:51:50 +00:00
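A Flux postRenderer that swaps a hook image could look like this JSON6902 sketch; the Job name is an assumption, and the commit above this one replaced the image with rancher/kubectl:

```yaml
spec:
  postRenderers:
    - kustomize:
        patches:
          - target:
              kind: Job
              name: rancher-backup-patch-sa   # assumed hook job name
            patch: |
              - op: replace
                path: /spec/template/spec/containers/0/image
                value: bitnami/kubectl
```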
micqdf 75e3604f30 fix: Skip post-install hooks for rancher-backup HelmRelease
Deploy Cluster / Terraform (push) Successful in 57s
Deploy Cluster / Ansible (push) Has been cancelled
The chart's post-install hook uses rancher/kuberlr-kubectl which fails
to download kubectl. The SA automountServiceAccountToken is managed
manually, so the hook is unnecessary.
2026-04-02 00:45:03 +00:00
micqdf e4235a6e58 fix: Correct Flux UI pod selector labels to match deployed weave-gitops labels
Deploy Cluster / Terraform (push) Successful in 51s
Deploy Cluster / Ansible (push) Successful in 20m36s
Actual labels are app.kubernetes.io/name=weave-gitops and
app.kubernetes.io/instance=flux-system-weave-gitops.
2026-04-01 02:08:12 +00:00
micqdf ea2d534171 fix: Use admin.existingSecret for Grafana creds from Doppler
Deploy Cluster / Terraform (push) Successful in 50s
Deploy Cluster / Ansible (push) Successful in 20m42s
Revert to the idiomatic Grafana chart approach. ExternalSecret creates the
secret with admin-user/admin-password keys before Grafana's first start
on fresh cluster creation.
2026-04-01 01:41:49 +00:00
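The wiring described here is an ExternalSecret that materializes the chart's expected keys. A sketch — the namespace, store name, and Doppler key names are assumptions, while admin-user/admin-password are the Grafana chart defaults:

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: grafana-admin
  namespace: monitoring
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: doppler
  target:
    name: grafana-admin-credentials   # referenced by admin.existingSecret
  data:
    - secretKey: admin-user
      remoteRef:
        key: GRAFANA_ADMIN_USER
    - secretKey: admin-password
      remoteRef:
        key: GRAFANA_ADMIN_PASSWORD
```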
micqdf a1b9fe6aa6 fix: Use Flux valuesFrom to inject Doppler Grafana creds as Helm values
Deploy Cluster / Terraform (push) Successful in 49s
Deploy Cluster / Ansible (push) Successful in 20m38s
Switch from admin.existingSecret to valuesFrom so Flux reads the
Doppler-managed secret and injects credentials as standard Helm values.
2026-03-31 23:40:54 +00:00
micqdf 33765657ec fix: Correct pod selectors for Prometheus and Flux Tailscale services, use Doppler for Grafana creds
Deploy Cluster / Terraform (push) Successful in 50s
Deploy Cluster / Ansible (push) Successful in 21m0s
Prometheus needs the operator.prometheus.io/name label selector. Flux UI pods
are labeled gitops-server, not weave-gitops. Grafana now reads admin creds
from Doppler via ExternalSecret instead of hardcoded values.
2026-03-31 22:54:57 +00:00
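The Prometheus selector fix maps to a Service fragment like this; the Prometheus object name, and therefore the label value, is an assumption:

```yaml
spec:
  type: LoadBalancer
  loadBalancerClass: tailscale
  selector:
    operator.prometheus.io/name: kube-prometheus-stack-prometheus
  ports:
    - port: 9090
      targetPort: 9090
```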
micqdf b8f64fa952 feat: Expose Grafana, Prometheus, and Flux UI via Tailscale LoadBalancer services
Deploy Cluster / Terraform (push) Successful in 55s
Deploy Cluster / Ansible (push) Successful in 20m47s
Replace Ansible port-forwarding + tailscale serve with direct Tailscale LB
services matching the existing Rancher pattern. Each service gets its own
tailnet hostname (grafana/prometheus/flux.silverside-gopher.ts.net).
2026-03-31 08:53:28 +00:00
micqdf 569d741751 push
Deploy Cluster / Terraform (push) Successful in 2m37s
Deploy Cluster / Ansible (push) Successful in 25m37s
2026-03-31 02:46:55 +00:00
micqdf 89e53d9ec9 fix: Handle restricted B2 keys and safe JSON parsing in restore step
Deploy Cluster / Terraform (push) Successful in 52s
Deploy Cluster / Ansible (push) Successful in 20m48s
2026-03-31 01:43:04 +00:00