The NFS HelmRelease can remain in a failed state from an earlier bootstrap
attempt even after the backing NFS export is corrected and the pod becomes
healthy. Request a fresh reconcile of the HelmRelease and addon kustomization
before waiting on addon-nfs-storage so the bootstrap step can observe the
recovered state.
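A minimal, kubectl-only sketch of that nudge (the Kustomization/HelmRelease names and namespaces are assumptions; `reconcile.fluxcd.io/requestedAt` is Flux's standard manual-reconcile annotation):

```yaml
# Hypothetical bootstrap step; resource names and namespaces are assumptions.
- name: Request a fresh reconcile of the NFS addon
  run: |
    ts="$(date -u +%Y-%m-%dT%H:%M:%SZ)"
    kubectl -n flux-system annotate kustomization addon-nfs-storage \
      "reconcile.fluxcd.io/requestedAt=${ts}" --overwrite
    kubectl -n nfs-provisioner annotate helmrelease nfs-subdir-external-provisioner \
      "reconcile.fluxcd.io/requestedAt=${ts}" --overwrite
    kubectl -n flux-system wait kustomization/addon-nfs-storage \
      --for=condition=Ready --timeout=10m
```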
The cluster nodes can reach the exported NFS path on 10.27.27.239, not
10.27.27.22. Update the storage addon and repo note so the NFS provisioner
mounts the live export and Flux health checks can converge.
Fresh Proxmox clusters need longer for the Flux controller rollouts and first
GitRepository/Kustomization reconciliations, especially while images are still
being pulled onto the control plane. Increase the bootstrap wait windows so CI
does not fail while the controllers are still converging.
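For illustration, the widened waits could look roughly like this in the bootstrap workflow (controller names are the standard Flux deployments; the timeout values are assumptions):

```yaml
# Sketch of the longer wait windows; timeouts are illustrative, not measured.
- name: Wait for Flux controllers to roll out
  run: |
    for d in source-controller kustomize-controller helm-controller notification-controller; do
      kubectl -n flux-system rollout status "deployment/${d}" --timeout=10m
    done
- name: Wait for the first GitRepository/Kustomization reconciliation
  run: |
    kubectl -n flux-system wait gitrepository/flux-system --for=condition=Ready --timeout=10m
    kubectl -n flux-system wait kustomization/flux-system --for=condition=Ready --timeout=15m
```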
Flux bootstrap patches the controllers onto k8s-cluster-cp-1, but the
control-plane node is tainted NoSchedule. Add the matching toleration in both
the checked-in patch manifest and the bootstrap workflow so the controllers can
actually schedule and roll out on cp-1.
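The toleration itself is the standard control-plane one; a sketch of how it could sit in the checked-in kustomize patch (the label selector and placeholder name are assumptions):

```yaml
# Sketch of the patch that lets the Flux controllers tolerate the cp-1 taint.
patches:
  - patch: |
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: all
      spec:
        template:
          spec:
            tolerations:
              - key: node-role.kubernetes.io/control-plane
                operator: Exists
                effect: NoSchedule
    target:
      kind: Deployment
      labelSelector: app.kubernetes.io/part-of=flux
```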
The local kube-vip readiness probe used an unquoted jsonpath predicate,
which made kubectl treat Ready as an identifier instead of a string. Use a
quoted jsonpath via shell so bootstrap can detect the primary kube-vip pod
properly before waiting on the API VIP.
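A sketch of the quoted form as an Ansible task (the pod label, node name, and retry counts are assumptions; the point is that the shell single quotes preserve the double quotes around Ready):

```yaml
# Hypothetical readiness check for the primary kube-vip pod.
- name: Check the primary kube-vip pod reports Ready
  ansible.builtin.shell: >
    kubectl -n kube-system get pods
    -l app.kubernetes.io/name=kube-vip
    --field-selector spec.nodeName=k8s-cluster-cp-1
    -o jsonpath='{.items[0].status.conditions[?(@.type=="Ready")].status}'
  register: kube_vip_ready
  until: kube_vip_ready.stdout == "True"
  retries: 30
  delay: 10
  changed_when: false
```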
The kube-vip DaemonSet is applied before the secondary control planes join,
so waiting for a full DaemonSet rollout blocks bootstrap on nodes that do not
exist in the cluster yet. Wait only for the primary node's kube-vip pod and
then verify the VIP is reachable on 6443.
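The follow-up VIP probe can stay equally small; a sketch (the VIP variable name is an assumption):

```yaml
# Hypothetical check that the API VIP is answering before bootstrap continues.
- name: Verify the API VIP answers on 6443
  ansible.builtin.wait_for:
    host: "{{ kube_vip_address }}"
    port: 6443
    timeout: 300
```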
The k3s install script can return non-zero while systemd is still bringing the
service up, especially on worker agents. Do not fail immediately on the
installer command; wait for the service to become active and only emit
install diagnostics if the later readiness check fails.
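Roughly, the tolerant sequence could look like this (the install command path, service name variable, and retry counts are assumptions):

```yaml
# Sketch: tolerate a non-zero installer exit, then gate on the systemd unit.
- name: Run the k3s install script (non-zero exit tolerated here)
  ansible.builtin.command: /tmp/k3s-install.sh
  register: k3s_install
  failed_when: false

- name: Wait for the k3s service to become active
  ansible.builtin.command: systemctl is-active "{{ k3s_service_name }}"
  register: k3s_active
  until: k3s_active.rc == 0
  retries: 30
  delay: 10
  changed_when: false
  ignore_errors: true

- name: Surface installer diagnostics only if the service never came up
  ansible.builtin.fail:
    msg: "k3s install output: {{ k3s_install.stdout | default('') }} {{ k3s_install.stderr | default('') }}"
  when: k3s_active is failed
```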
The Proxmox Ubuntu clones are exposing their primary NIC as eth0, not ens18.
Use ansible_default_ipv4.interface for k3s flannel and kube-vip so bootstrap
tracks the actual interface name instead of a guessed template default.
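A sketch of the variable wiring (the left-hand variable names are assumptions; `ansible_default_ipv4.interface` is the gathered fact):

```yaml
# Derive the NIC from facts instead of hardcoding ens18.
k3s_flannel_iface: "{{ ansible_default_ipv4.interface }}"
kube_vip_interface: "{{ ansible_default_ipv4.interface }}"
```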
Ubuntu cloud-init returns exit code 2 for some completed boots even when the
status output is 'done'. Treat that as a successful wait state so Ansible can
continue into the package install phase instead of aborting early.
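A sketch of the adjusted wait task (name and placement are assumptions):

```yaml
# cloud-init status --wait can exit 2 on a completed boot; accept 0 or 2.
- name: Wait for cloud-init to finish first boot
  ansible.builtin.command: cloud-init status --wait
  register: cloud_init_wait
  changed_when: false
  failed_when: cloud_init_wait.rc not in [0, 2]
```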
Fresh Ubuntu cloud-init clones still hold apt and dpkg locks during first boot,
which caused the Ansible common role to fail before the control plane could
finish bootstrap. Wait for cloud-init, increase apt lock timeouts, and skip the
final kubeconfig rewrite when no kubeconfig was fetched yet.
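The lock handling in the common role might look roughly like this (the package variable and timeout value are illustrative):

```yaml
# Sketch: let apt wait out the first-boot dpkg/apt locks instead of failing.
- name: Install base packages once apt/dpkg locks are released
  ansible.builtin.apt:
    name: "{{ common_packages }}"
    state: present
    update_cache: true
    lock_timeout: 600
```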
The bpg/proxmox provider rejects clone.datastore_id when creating linked
clones. Only pass the target datastore when full clones are enabled so the
linked-clone baseline can provision from template 9000 successfully.
Accept Proxmox API endpoints with or without /api2/json in CI and local
tfvars, and avoid running the dashboards workflow just because its own
workflow file changed during platform migrations.
Replace Hetzner infrastructure and cloud-provider assumptions with Proxmox
VM clones, kube-vip API HA, and NFS-backed storage. Update bootstrap,
Flux addons, CI workflows, and docs to target the new private Proxmox
baseline while preserving the existing Tailscale, Doppler, Flux, Rancher,
and B2 backup flows.
Update the baseline to treat Rancher backup and restore validation as part
of the accepted platform state, and capture the successful live drill run
performed on 2026-04-18.
Add a post-deploy smoke test that validates Tailscale DNS, proxy readiness,
reachability, and service responses for Rancher, Grafana, and Prometheus.
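One of the service checks could be as simple as this (the hostname follows the tailnet pattern noted later in this log; endpoint and retry counts are assumptions):

```yaml
# Hypothetical smoke check that Grafana answers over the tailnet.
- name: Check Grafana responds over the tailnet
  ansible.builtin.uri:
    url: https://grafana.silverside-gopher.ts.net/api/health
    status_code: 200
  register: grafana_health
  until: grafana_health.status == 200
  retries: 10
  delay: 15
```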
Move the operator to the stable Helm repo/version and align the baseline docs
with the current HA private-only architecture.
Drop the Flux UI addon and its Tailscale exposure because the UI lags the
current Flux APIs and reports misleading HelmRelease errors. Keep Flux managed
through the controllers themselves and use Rancher or the flux CLI for access.
The Tailscale cleanup role was deleting reserved service hostnames on later
deploy runs, which removed the live Rancher/Grafana/Prometheus/Flux proxy
nodes from the tailnet. Skip cleanup whenever the current cluster already has
those Tailscale services, while still allowing cleanup on fresh rebuilds.
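A rough sketch of the guard (the `loadBalancerClass: tailscale` filter matches how the operator exposes Services; task names and the registered variable are assumptions):

```yaml
# Hypothetical guard: skip cleanup when Tailscale-exposed Services already exist.
- name: Check whether Tailscale-exposed services already exist
  ansible.builtin.shell: >
    kubectl get svc --all-namespaces
    -o jsonpath='{.items[?(@.spec.loadBalancerClass=="tailscale")].metadata.name}'
  register: ts_services
  changed_when: false
  failed_when: false

- name: Clean up reserved tailnet hostnames only on fresh rebuilds
  ansible.builtin.include_role:
    name: tailscale-cleanup
  when: ts_services.stdout | trim | length == 0
```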
The deploy pipeline never uses the flux binary after installation, so the
GitHub release download only adds a flaky failure point. Remove the step and
keep the bootstrap path kubectl-only.
Prometheus is exposed on port 9090 through the Tailscale LoadBalancer
service, so the configured external URL and repo docs should match the
actual address users reach after rebuilds.
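Assuming the stack is kube-prometheus-stack, the aligned value would sit roughly here (chart, scheme, and hostname are assumptions):

```yaml
# Sketch of the external URL matching the Tailscale-exposed 9090 endpoint.
prometheus:
  prometheusSpec:
    externalUrl: http://prometheus.silverside-gopher.ts.net:9090
```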
Reserve grafana/prometheus/flux alongside rancher during rebuild cleanup so
stale tailnet devices do not force hostnames with a "-1" suffix. Tag the exposed Tailscale
services so operator-managed proxies are provisioned with explicit prod/service
tags from the tailnet policy.
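A sketch of one exposed Service with the hostname and tag annotations (namespace, selector, and the exact tag names are assumptions; the annotations are the Tailscale operator's standard ones):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: grafana-tailscale
  namespace: monitoring
  annotations:
    tailscale.com/hostname: grafana
    tailscale.com/tags: tag:prod,tag:service
spec:
  type: LoadBalancer
  loadBalancerClass: tailscale
  selector:
    app.kubernetes.io/name: grafana
  ports:
    - name: http
      port: 80
      targetPort: 3000
```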
The chart's post-install hook hardcodes rancher/kuberlr-kubectl, which
can't download kubectl. Use Flux postRenderers to patch the hook job's image
to bitnami/kubectl at render time.
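A minimal sketch of the render-time patch in the HelmRelease spec (the hook Job and container names follow the patch-sa naming mentioned elsewhere in this log and are assumptions):

```yaml
# Hypothetical postRenderers block swapping the hook image at render time.
postRenderers:
  - kustomize:
      patches:
        - target:
            kind: Job
            name: rancher-backup-patch-sa-job
          patch: |
            apiVersion: batch/v1
            kind: Job
            metadata:
              name: rancher-backup-patch-sa-job
            spec:
              template:
                spec:
                  containers:
                    - name: rancher-backup-patch-sa
                      image: bitnami/kubectl:latest
```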
The chart's post-install hook uses rancher/kuberlr-kubectl, which fails
to download kubectl. The service account's automountServiceAccountToken is
managed manually, so the hook is unnecessary.
Revert to the idiomatic Grafana chart approach. The ExternalSecret creates the
secret with admin-user/admin-password keys before Grafana's first start
on fresh cluster creation.
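A sketch of the shape (store name, namespace, and Doppler key names are assumptions; the admin-user/admin-password target keys are the Grafana chart's defaults):

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: grafana-admin-credentials
  namespace: monitoring
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: doppler
    kind: ClusterSecretStore
  target:
    name: grafana-admin-credentials
  data:
    - secretKey: admin-user
      remoteRef:
        key: GRAFANA_ADMIN_USER
    - secretKey: admin-password
      remoteRef:
        key: GRAFANA_ADMIN_PASSWORD
```

With the chart's admin.existingSecret pointed at that Secret, the default userKey/passwordKey line up with these keys.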
Prometheus needs the operator.prometheus.io/name label selector. Flux UI pods
are labeled gitops-server, not weave-gitops. Grafana now reads admin creds
from Doppler via an ExternalSecret instead of hardcoded values.
Replace Ansible port-forwarding + tailscale serve with direct Tailscale LB
services matching the existing Rancher pattern. Each service gets its own
tailnet hostname (grafana/prometheus/flux.silverside-gopher.ts.net).
- Wait for Rancher and rancher-backup operator to be ready
- Patch default SA in cattle-resources-system (fixes post-install hook failure)
- Clean up failed patch-sa jobs
- Force reconcile rancher-backup HelmRelease
- Find latest backup from B2 using Backblaze API
- Create Restore CR to restore Rancher state from latest backup (sketched below)
- Wait for restore to complete before continuing
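For reference, the Restore CR created in that step has roughly this shape (the credentials Secret name, bucket, and endpoint are placeholders; the real backup filename is resolved from B2 at run time):

```yaml
apiVersion: resources.cattle.io/v1
kind: Restore
metadata:
  name: restore-from-latest
spec:
  backupFilename: <latest-backup-from-b2>.tar.gz
  storageLocation:
    s3:
      credentialSecretName: rancher-backup-b2
      credentialSecretNamespace: cattle-resources-system
      bucketName: <b2-bucket>
      folder: rancher
      endpoint: <b2-s3-endpoint>
```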
The S3 config caused the operator to try downloading kubectl, which fails in the container.
S3 credentials are correctly configured in the Backup CR and ExternalSecret instead.
Rancher now manages its own TLS (no longer tls:external), so it serves
HTTPS on port 443. The Tailscale LoadBalancer needs to expose both
HTTP (80) and HTTPS (443) targeting the corresponding container ports.
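The exposed Service then carries both port mappings, roughly like this (the Service name is an assumption; `app: rancher` is the label the Rancher chart puts on its pods):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: rancher-tailscale
  namespace: cattle-system
  annotations:
    tailscale.com/hostname: rancher
spec:
  type: LoadBalancer
  loadBalancerClass: tailscale
  selector:
    app: rancher
  ports:
    - name: http
      port: 80
      targetPort: 80
    - name: https
      port: 443
      targetPort: 443
```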
The Backup and Restore CRs need the rancher-backup CRDs to exist first.
Moved them to a separate kustomization that depends on the operator being ready.
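A sketch of the split (Kustomization names and the repo path are assumptions):

```yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: rancher-backup-resources
  namespace: flux-system
spec:
  dependsOn:
    - name: rancher-backup-operator
  interval: 10m
  path: ./kubernetes/addons/rancher-backup/resources
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system
```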
With Tailscale LoadBalancer, TLS is not actually terminated at the edge.
The Tailscale proxy does TCP passthrough, so Rancher must serve its own
TLS certs. Setting tls: external caused Rancher to listen HTTP-only,
which broke HTTPS access through Tailscale.
Rancher 2.x uses embedded etcd, not an external PostgreSQL database.
The CATTLE_DB_CATTLE_* env vars are Rancher v1 only and were ignored.
- Remove all CNPG (CloudNativePG) cluster, operator, and related configs
- Remove external DB env vars from Rancher HelmRelease
- Remove rancher-db-password ExternalSecret
- Add rancher-backup operator HelmRelease (v106.0.2+up8.1.0)
- Add B2 credentials ExternalSecret for backup storage
- Add recurring Backup CR (daily at 03:00, 7 day retention; sketched below)
- Add commented-out Restore CR for rebuild recovery
- Update Flux dependency graph accordingly
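The recurring Backup CR referenced above has roughly this shape (the credentials Secret name, bucket, folder, and endpoint are placeholders; schedule and retention match the list):

```yaml
apiVersion: resources.cattle.io/v1
kind: Backup
metadata:
  name: rancher-daily
spec:
  schedule: "0 3 * * *"
  retentionCount: 7
  resourceSetName: rancher-resource-set
  storageLocation:
    s3:
      credentialSecretName: rancher-backup-b2
      credentialSecretNamespace: cattle-resources-system
      bucketName: <b2-bucket>
      folder: rancher
      endpoint: <b2-s3-endpoint>
```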
Adds tailscale-cleanup Ansible role that uses the Tailscale API to
delete offline devices matching reserved hostnames (e.g. rancher).
Runs during site.yml before Finalize to prevent hostname collisions
like rancher-1 on rebuild.
Requires TAILSCALE_API_KEY (API access token) passed as an extra var.
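The core of the role might look roughly like this (endpoint paths follow the public Tailscale v2 API; variable names are assumptions, and the real role additionally skips devices that are still online):

```yaml
# Hypothetical cleanup tasks against the Tailscale API.
- name: List devices in the tailnet
  ansible.builtin.uri:
    url: https://api.tailscale.com/api/v2/tailnet/-/devices
    headers:
      Authorization: "Bearer {{ tailscale_api_key }}"
    return_content: true
  register: tailnet_devices

- name: Delete devices that hold reserved hostnames
  ansible.builtin.uri:
    url: "https://api.tailscale.com/api/v2/device/{{ item.id }}"
    method: DELETE
    headers:
      Authorization: "Bearer {{ tailscale_api_key }}"
    status_code: [200, 204]
  loop: "{{ tailnet_devices.json.devices }}"
  when: item.hostname in tailscale_reserved_hostnames
  loop_control:
    label: "{{ item.hostname }}"
```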
- scripts/refresh-kubeconfig.sh fetches a fresh kubeconfig from CP1
- Ansible site.yml Finalize step now uses public IP instead of Tailscale
hostname for the kubeconfig server address
- Updated AGENTS.md with kubeconfig refresh instructions
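The Finalize rewrite amounts to pointing the fetched kubeconfig's server field at CP1, roughly as below (paths and the host variable are assumptions; 127.0.0.1:6443 is what the k3s-generated kubeconfig contains by default):

```yaml
# Hypothetical Finalize task rewriting the kubeconfig server address.
- name: Point the fetched kubeconfig at CP1's public address
  ansible.builtin.replace:
    path: "{{ kubeconfig_local_path }}"
    regexp: 'https://127\.0\.0\.1:6443'
    replace: "https://{{ hostvars['k8s-cluster-cp-1'].ansible_host }}:6443"
  delegate_to: localhost
```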