# Network Stabilization Plan

## Goal

Make destroy/rebuild deploys reliable without hiding real network failures behind runner-side image archives or one-off manual intervention.

## Current Symptoms

- Registry pulls intermittently fail from cluster nodes with TLS handshake timeouts.
- Failures have appeared across GHCR, Docker Hub, Quay, registry.k8s.io, and redirected blob hosts.
- Doppler API calls from External Secrets intermittently time out.
- Flux OCIRepository objects can show transient upstream failures even when cached artifacts are sufficient for successful Helm releases.
- Lowering node MTU to 1400 improved kube-vip and some image pulls but did not eliminate the issue.

## Working Hypothesis

The remaining instability is likely egress path behavior from the VM subnet, especially PMTUD/MSS/NAT/firewall handling. The same timeout pattern appears across unrelated upstream services, which points away from a single registry, chart, or Kubernetes component.

## Phase 1: Prove The Network Root Cause

Run repeatable probes from the Proxmox host, cp1, and one worker.

- Test registry and API endpoints with repeated `curl` timing checks.
- Test known flaky pulls with repeated `crictl pull` attempts.
- Test Doppler API reachability from a node.
- Compare Proxmox host egress against VM egress.
- Check path MTU behavior with tools such as `tracepath` where available.
- Record node MTU, default route, DNS resolver, and selected remote IPs during tests.
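
As a complement to the `curl` timing checks, the sketch below times only the TCP connect and TLS handshake for each target endpoint in this plan, since the handshake is where the timeouts are reported. It is an illustration under assumptions, not the project's probe tooling: the function names (`probe_tls`, `summarize`, `run_probes`), the default of 20 attempts, and the 10-second timeout are all hypothetical choices.

```python
import socket
import ssl
import statistics
import time
from urllib.parse import urlsplit

# Endpoints copied from this plan's target list.
ENDPOINTS = [
    "https://ghcr.io/v2/",
    "https://auth.docker.io/token",
    "https://registry-1.docker.io/v2/",
    "https://quay.io/v2/",
    "https://registry.k8s.io/v2/",
    "https://api.doppler.com/v3/projects",
]


def probe_tls(url, timeout=10.0):
    """Time one TCP connect plus TLS handshake; network errors are recorded, not raised."""
    parts = urlsplit(url)
    host, port = parts.hostname, parts.port or 443
    context = ssl.create_default_context()
    remote_ip = None
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout) as raw:
            remote_ip = raw.getpeername()[0]  # record which remote IP was selected
            with context.wrap_socket(raw, server_hostname=host):
                pass  # the handshake completes inside wrap_socket
        ok, error = True, None
    except OSError as exc:  # covers timeouts, DNS failures, and ssl.SSLError
        ok, error = False, repr(exc)
    return {"url": url, "ok": ok, "ip": remote_ip,
            "seconds": time.monotonic() - start, "error": error}


def summarize(results):
    """Group probe results by URL: attempts, failures, median and max success latency."""
    summary = {}
    for r in results:
        entry = summary.setdefault(r["url"], {"attempts": 0, "failures": 0, "times": []})
        entry["attempts"] += 1
        if r["ok"]:
            entry["times"].append(r["seconds"])
        else:
            entry["failures"] += 1
    for entry in summary.values():
        times = entry.pop("times")
        entry["median_s"] = statistics.median(times) if times else None
        entry["max_s"] = max(times) if times else None
    return summary


def run_probes(urls=ENDPOINTS, attempts=20, pause=1.0):
    """Probe every URL `attempts` times, pausing between rounds, then summarize."""
    results = []
    for _ in range(attempts):
        results.extend(probe_tls(url) for url in urls)
        time.sleep(pause)
    return summarize(results)
```

Calling `run_probes()` on the Proxmox host, cp1, and a worker yields per-endpoint failure counts that can be compared directly. A gap between host and VM results would support the egress-path hypothesis, while similar failure rates on both would point upstream of the VM subnet.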

Target endpoints:

- `https://ghcr.io/v2/`
- `https://auth.docker.io/token`
- `https://registry-1.docker.io/v2/`
- `https://quay.io/v2/`
- `https://registry.k8s.io/v2/`
- `https://api.doppler.com/v3/projects`

Known useful test images:

- `ghcr.io/fluxcd/helm-controller:v1.5.1`
- `oci.external-secrets.io/external-secrets/external-secrets:v2.1.0`
- `docker.io/rancher/mirrored-library-busybox:1.37.0`
- `ghcr.io/tailscale/tailscale:v1.96.5`
- `quay.io/prometheus/node-exporter:v1.8.2`

## Phase 2: Fix The Network Layer

Prefer a network fix before adding more application-level retries.

- Verify whether the gateway/firewall allows ICMP fragmentation-needed messages.
- Add TCP MSS clamping on the gateway/firewall for the Kubernetes VM subnet.
- Start with an MSS value derived from the working path MTU, then reduce only if tests still fail.
- Keep VM MTU at `1400` unless tests prove a better value.
- Re-run the Phase 1 probes after each network change.

Success criteria:

- Repeated registry token and manifest requests succeed without TLS handshake timeouts.
- Repeated image pulls succeed from cp1 and at least one worker.
- Doppler API calls from the cluster succeed consistently enough that External Secrets does not flap for long periods.

## Phase 3: Reduce External Registry Dependence

If network fixes do not fully stabilize pulls, add a local registry mirror or pull-through cache on the private network.

- Run the mirror close to the cluster, reachable from `10.27.27.0/24`.
- Configure K3s/containerd via `/etc/rancher/k3s/registries.yaml`.
- Mirror or cache high-risk bootstrap and addon images.
- Keep direct upstream pulls as fallback, but make the mirror the primary path.
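
To make the `registries.yaml` step concrete, here is a minimal sketch that renders a `mirrors` section for K3s from a registry-to-mirror mapping. The mirror addresses and ports are placeholders invented for illustration (the plan only states that the mirror must be reachable from `10.27.27.0/24`), and the one-endpoint-per-upstream layout assumes a pull-through cache that proxies a single upstream per instance.

```python
# Hypothetical mirror endpoints: address and ports are placeholders, not values
# from this plan. One endpoint per upstream registry is an assumption.
MIRRORS = {
    "docker.io": "http://10.27.27.5:5001",
    "ghcr.io": "http://10.27.27.5:5002",
    "quay.io": "http://10.27.27.5:5003",
    "registry.k8s.io": "http://10.27.27.5:5004",
}


def render_registries_yaml(mirrors):
    """Render the `mirrors` section of a K3s registries.yaml as text."""
    lines = ["mirrors:"]
    for registry in sorted(mirrors):
        lines.append(f'  "{registry}":')
        lines.append("    endpoint:")
        lines.append(f'      - "{mirrors[registry]}"')
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Print for review; writing to /etc/rancher/k3s/registries.yaml and
    # restarting K3s on each node would be handled by existing tooling.
    print(render_registries_yaml(MIRRORS), end="")
```

Listing the mirror as the configured endpoint makes it the first path tried. K3s documentation describes the upstream registry as a default endpoint that is still attempted as a last resort unless that behavior is disabled, which would match the fallback goal here, but this should be confirmed against the K3s version in use.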

Priority image groups:

- K3s bootstrap images
- kube-vip
- Flux controllers
- External Secrets
- Tailscale operator and proxy image
- Rancher and Rancher support images
- Traefik
- cert-manager
- observability stack images
- NFS and helper images

## Phase 4: Keep Secrets From Blocking The Flux Graph

External Secrets should stay the runtime secret source, but Flux should not require live Doppler validation for unrelated graph progress.

- Keep `ClusterSecretStore` application decoupled from Flux health checks.
- Keep explicit workflow checks for generated Kubernetes `Secret` objects where bootstrap needs them.
- Continue using `external-secrets.io/force-sync` for critical bootstrap secrets.
- Prefer checking generated Kubernetes secrets over checking live Doppler readiness in broad post-deploy gates.

## Phase 5: Tighten Workflow Diagnostics

Keep the current green deploy path, but improve failure output.

- Print image pull failures grouped by image and node.
- Print Flux source failures separately from HelmRelease readiness.
- Print External Secrets and Doppler status only in secret-related gates.
- Print node MTU, default route, and DNS resolver when registry pulls fail.
- Treat cached OCI artifacts as acceptable when the dependent HelmRelease is already Ready.

## Recommended Order

1. Run Phase 1 probes and capture evidence.
2. Add or adjust gateway TCP MSS clamping.
3. Re-run Phase 1 probes and one full destroy/rebuild.
4. Add a local registry mirror only if registry pulls remain flaky.
5. Simplify retry-heavy workflow logic after the network path is stable.

## Current Mitigations Already In Place

- Node MTU is set to `1400` by Ansible.
- Bootstrap image pre-pulls use direct node pulls with retries.
- Critical bootstrap images are pre-pulled before Flux/addons need them.
- Doppler store health no longer blocks the Flux graph.
- Rancher bootstrap secrets are force-synced and checked explicitly.
- Traefik Helm release has longer timeouts and more retries.
- Post-deploy health checks verify Flux, Helm releases, storage, and pod health.
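
For reference when choosing the Phase 2 clamp value, the starting MSS follows from the path MTU by subtracting the minimum IP and TCP header sizes. The sketch below works that arithmetic for the current node MTU of `1400`. It assumes headers without options and no extra encapsulation on the egress path, so the real working value may be lower; the function name `mss_for_mtu` is illustrative.

```python
IPV4_HEADER = 20  # bytes, minimum IPv4 header without options
IPV6_HEADER = 40  # bytes, fixed IPv6 header
TCP_HEADER = 20   # bytes, minimum TCP header without options


def mss_for_mtu(path_mtu, ipv6=False):
    """Return the largest TCP payload that fits in one packet of size path_mtu."""
    ip_header = IPV6_HEADER if ipv6 else IPV4_HEADER
    mss = path_mtu - ip_header - TCP_HEADER
    if mss <= 0:
        raise ValueError(f"path MTU {path_mtu} is too small for TCP over IP")
    return mss


if __name__ == "__main__":
    print(mss_for_mtu(1400))             # 1360 for IPv4
    print(mss_for_mtu(1400, ipv6=True))  # 1340 for IPv6
```

If Phase 1 probes show a working path MTU below 1400, the same subtraction applied to that measured value gives the next clamp value to try.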