The repo-only Kustomization healthCheck change accidentally left the original
wait:true keys in the Rancher and Rancher backup Kustomizations, which broke
the infrastructure kustomize build. Remove the duplicate keys so Flux can
apply the HelmRelease-only health checks cleanly.
These addon Kustomizations were using wait=true, which made Flux treat transient
HelmRepository fetch timeouts as addon failures even when the HelmRelease and
runtime workloads were healthy. Switch the affected Kustomizations to explicit
HelmRelease healthChecks so readiness reflects the actual deployed platform
state instead of repository fetch flakiness.
Rancher needs longer than the chart default 2-minute startup probe budget on
this cluster while it restores local catalogs and finishes API startup. Extend
the startup probe failure threshold so Helm upgrades can complete instead of
restarting the new pod before it becomes ready.
Remove the HelmRelease reset/force annotations from the deploy workflow now
that the cluster can converge on its own. The runtime waits remain, but CI no
longer re-triggers Rancher and NFS churn on every bootstrap attempt.
Flux can leave HelmRelease and Kustomization conditions stale after transient
chart fetch or image pull failures even when the underlying workloads recover.
Switch the deploy workflow to wait on the concrete runtime resources we care
about: the NFS provisioner deployment and StorageClass, Rancher deployment,
webhook, cert-manager issuer/certificate, and the rancher-backup deployment.
Docker Hub TLS handshakes are too flaky to make pre-pulling a hard bootstrap
requirement. Treat image pre-pull as opportunistic and disable Rancher's
managed system-upgrade-controller feature so that image is removed from the
critical install path while Rancher and its webhook converge.
Rancher installs were stalling on transient Docker Hub TLS handshake timeouts
for rancher shell, webhook, and system-upgrade-controller images. Pre-pull the
required images onto all nodes after k3s comes up, extend the Rancher HelmRelease
timeout, and reset/force the Rancher HelmRelease before waiting on addon-rancher
so bootstrap can recover from stale failed remediation state.
When the NFS storage HelmRelease has already entered a failed remediation state,
a plain reconcile request is not enough to clear the stale failure counters.
Send requestedAt, resetAt, and forceAt together so helm-controller retries the
release cleanly before the workflow waits on addon-nfs-storage.
The NFS HelmRelease can remain in a failed state from an earlier bootstrap
attempt even after the backing NFS export is corrected and the pod becomes
healthy. Request a fresh reconcile of the HelmRelease and addon kustomization
before waiting on addon-nfs-storage so the bootstrap step can observe the
recovered state.
The cluster nodes can reach the exported NFS path on 10.27.27.239, not
10.27.27.22. Update the storage addon and repo note so the NFS provisioner
mounts the live export and Flux health checks can converge.
Fresh Proxmox clusters need longer for the Flux controller rollouts and first
GitRepository/Kustomization reconciliations, especially while images are still
being pulled onto the control plane. Increase the bootstrap wait windows so CI
does not fail while the controllers are still converging.
Flux bootstrap patches the controllers onto k8s-cluster-cp-1, but the
control-plane node is tainted NoSchedule. Add the matching toleration in both
the checked-in patch manifest and the bootstrap workflow so the controllers can
actually schedule and roll out on cp-1.
The local kube-vip readiness probe used an unquoted jsonpath predicate,
which made kubectl treat Ready as an identifier instead of a string. Use a
quoted jsonpath via shell so bootstrap can detect the primary kube-vip pod
properly before waiting on the API VIP.
The kube-vip DaemonSet is applied before the secondary control planes join,
so waiting for a full DaemonSet rollout blocks bootstrap on nodes that do not
exist in the cluster yet. Wait only for the primary node's kube-vip pod and
then verify the VIP is reachable on 6443.
The k3s install script can return non-zero while systemd is still bringing the
service up, especially on worker agents. Do not fail immediately on the
installer command; wait for the service to become active and only emit
install diagnostics if the later readiness check fails.
The Proxmox Ubuntu clones are exposing their primary NIC as eth0, not ens18.
Use ansible_default_ipv4.interface for k3s flannel and kube-vip so bootstrap
tracks the actual interface name instead of a guessed template default.
Ubuntu cloud-init returns exit code 2 for some completed boots even when the
status output is 'done'. Treat that as a successful wait state so Ansible can
continue into the package install phase instead of aborting early.
Fresh Ubuntu cloud-init clones still hold apt and dpkg locks during first boot,
which caused the Ansible common role to fail before the control plane could
finish bootstrap. Wait for cloud-init, increase apt lock timeouts, and skip the
final kubeconfig rewrite when no kubeconfig was fetched yet.
The bpg/proxmox provider rejects clone.datastore_id when creating linked
clones. Only pass the target datastore when full clones are enabled so the
linked-clone baseline can provision from template 9000 successfully.
Accept Proxmox API endpoints with or without /api2/json in CI and local
tfvars, and avoid running the dashboards workflow just because its own
workflow file changed during platform migrations.
Replace Hetzner infrastructure and cloud-provider assumptions with Proxmox
VM clones, kube-vip API HA, and NFS-backed storage. Update bootstrap,
Flux addons, CI workflows, and docs to target the new private Proxmox
baseline while preserving the existing Tailscale, Doppler, Flux, Rancher,
and B2 backup flows.
Update the baseline to treat Rancher backup and restore validation as part
of the accepted platform state, and capture the successful live drill run
performed on 2026-04-18.
Add a post-deploy smoke test that validates Tailscale DNS, proxy readiness,
reachability, and service responses for Rancher, Grafana, and Prometheus.
Move the operator to the stable Helm repo/version and align the baseline docs
with the current HA private-only architecture.
Drop the Flux UI addon and its Tailscale exposure because the UI lags the
current Flux APIs and reports misleading HelmRelease errors. Keep Flux managed
through the controllers themselves and use Rancher or the flux CLI for access.
The Tailscale cleanup role was deleting reserved service hostnames on later
deploy runs, which removed the live Rancher/Grafana/Prometheus/Flux proxy
nodes from the tailnet. Skip cleanup whenever the current cluster already has
those Tailscale services, while still allowing cleanup on fresh rebuilds.
The deploy pipeline never uses the flux binary after installation, so the
GitHub release download only adds a flaky failure point. Remove the step and
keep the bootstrap path kubectl-only.
Prometheus is exposed on port 9090 through the Tailscale LoadBalancer
service, so the configured external URL and repo docs should match the
actual address users reach after rebuilds.
Reserve grafana/prometheus/flux alongside rancher during rebuild cleanup so
stale tailnet devices do not force -1 hostnames. Tag the exposed Tailscale
services so operator-managed proxies are provisioned with explicit prod/service
tags from the tailnet policy.
The chart's post-install hook hardcodes rancher/kuberlr-kubectl which
can't download kubectl. Use Flux postRenderers to patch the job image
to bitnami/kubectl at render time.
The chart's post-install hook uses rancher/kuberlr-kubectl which fails
to download kubectl. The SA automountServiceAccountToken is managed
manually, so the hook is unnecessary.
Revert to idiomatic Grafana chart approach. ExternalSecret creates the
secret with admin-user/admin-password keys before Grafana's first start
on fresh cluster creation.
Prometheus needs operator.prometheus.io/name label selector. Flux UI pods
are labeled gitops-server not weave-gitops. Grafana now reads admin creds
from Doppler via ExternalSecret instead of hardcoded values.
Replace Ansible port-forwarding + tailscale serve with direct Tailscale LB
services matching the existing Rancher pattern. Each service gets its own
tailnet hostname (grafana/prometheus/flux.silverside-gopher.ts.net).
- Wait for Rancher and rancher-backup operator to be ready
- Patch default SA in cattle-resources-system (fixes post-install hook failure)
- Clean up failed patch-sa jobs
- Force reconcile rancher-backup HelmRelease
- Find latest backup from B2 using Backblaze API
- Create Restore CR to restore Rancher state from latest backup
- Wait for restore to complete before continuing
The S3 config caused the operator to try downloading kubectl, which fails in the container.
S3 credentials are correctly configured in the Backup CR and ExternalSecret instead.
Rancher now manages its own TLS (no longer tls:external), so it serves
HTTPS on port 443. The Tailscale LoadBalancer needs to expose both
HTTP (80) and HTTPS (443) targeting the corresponding container ports.