Network Stabilization Plan

Goal

Make destroy/rebuild deploys reliable without hiding real network failures behind runner-side image archives or one-off manual intervention.

Current Symptoms

Registry pulls intermittently fail from cluster nodes with TLS handshake timeouts.
Failures have appeared across GHCR, Docker Hub, Quay, registry.k8s.io, and redirected blob hosts.
Doppler API calls from External Secrets intermittently time out.
Flux OCIRepository objects can show transient upstream failures even when cached artifacts are sufficient for successful Helm releases.
Lowering node MTU to 1400 improved kube-vip and some image pulls but did not eliminate the issue.

Working Hypothesis

The remaining instability most likely comes from the egress path out of the VM subnet, in particular path MTU discovery (PMTUD), TCP MSS, NAT, or firewall handling. The same timeout pattern appears across unrelated upstream services, which points away from a single registry, chart, or Kubernetes component.

Phase 1: Prove The Network Root Cause

Run repeatable probes from the Proxmox host, cp1, and one worker.

Test registry and API endpoints with repeated curl timing checks.
Test known flaky pulls with repeated crictl pull attempts.
Test Doppler API reachability from a node.
Compare Proxmox host egress against VM egress.
Check path MTU behavior with tools such as tracepath where available.
Record node MTU, default route, DNS resolver, and selected remote IPs during tests.

Target endpoints:

https://ghcr.io/v2/
https://auth.docker.io/token
https://registry-1.docker.io/v2/
https://quay.io/v2/
https://registry.k8s.io/v2/
https://api.doppler.com/v3/projects
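
The repeated timing checks can be sketched in Python as an alternative to a curl loop. This is a minimal sketch, not the project's actual probe: the function names, attempt count, and timeout are assumptions. Any HTTP response (including the 401 that registry /v2/ endpoints normally return without credentials) is counted as reachable, because it shows that DNS, TCP, and the TLS handshake completed.

```python
import http.client
import statistics
import time
import urllib.error
import urllib.request


def probe_once(url, timeout=10.0):
    """Return (ok, seconds, detail) for a single HTTPS request."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            ok, detail = True, f"http {resp.status}"
    except urllib.error.HTTPError as exc:
        # The server answered, so the network path and TLS handshake worked.
        ok, detail = True, f"http {exc.code}"
    except (OSError, http.client.HTTPException) as exc:
        # Covers URLError, TLS errors, resets, and socket timeouts.
        ok, detail = False, f"{type(exc).__name__}: {exc}"
    return ok, time.monotonic() - start, detail


def summarize(samples):
    """Condense (ok, seconds, detail) samples into a small report."""
    ok_times = [secs for ok, secs, _ in samples if ok]
    failures = [detail for ok, _, detail in samples if not ok]
    return {
        "attempts": len(samples),
        "failures": len(failures),
        "median_s": statistics.median(ok_times) if ok_times else None,
        "max_s": max(ok_times) if ok_times else None,
        "failure_details": failures,
    }


def probe_many(url, attempts=20, pause=1.0, probe=probe_once):
    """Probe one URL repeatedly and return the summary."""
    samples = []
    for _ in range(attempts):
        samples.append(probe(url))
        time.sleep(pause)
    return summarize(samples)
```

Running probe_many("https://ghcr.io/v2/") on the Proxmox host, cp1, and a worker with the same parameters gives directly comparable failure counts and latencies for the host-versus-VM egress comparison.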

Known useful test images:

ghcr.io/fluxcd/helm-controller:v1.5.1
oci.external-secrets.io/external-secrets/external-secrets:v2.1.0
docker.io/rancher/mirrored-library-busybox:1.37.0
ghcr.io/tailscale/tailscale:v1.96.5
quay.io/prometheus/node-exporter:v1.8.2
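
A repeated crictl pull test against these images might look like the following sketch. It assumes crictl is on the node's PATH and that the script runs with enough privilege to reach the container runtime; the attempt count and timeout are illustrative. A repeat pull of an image that is already on the node mostly re-checks the manifest, so remove the image between attempts (for example with crictl rmi) if the layer download is what needs testing.

```python
import subprocess


def pull_attempts(image, attempts=5, timeout=300, run=subprocess.run):
    """Pull one image repeatedly with crictl and record each outcome."""
    results = []
    for attempt in range(1, attempts + 1):
        try:
            proc = run(
                ["crictl", "pull", image],
                capture_output=True,
                text=True,
                timeout=timeout,
            )
            ok = proc.returncode == 0
            error = "" if ok else proc.stderr.strip()
        except subprocess.TimeoutExpired:
            ok, error = False, f"timed out after {timeout}s"
        results.append({"image": image, "attempt": attempt, "ok": ok, "error": error})
    return results


def failure_rate(results):
    """Fraction of attempts that failed, or 0.0 for an empty list."""
    if not results:
        return 0.0
    return sum(1 for r in results if not r["ok"]) / len(results)
```

The run parameter exists only so the loop can be exercised without a container runtime; on a node the default subprocess.run is used.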

Phase 2: Fix The Network Layer

Prefer a network fix before adding more application-level retries.

Verify whether the gateway/firewall allows ICMP fragmentation-needed messages (ICMP type 3, code 4), which PMTUD depends on.
Add TCP MSS clamping on the gateway/firewall for the Kubernetes VM subnet.
Start with an MSS value derived from the working path MTU, then reduce only if tests still fail.
Keep VM MTU at 1400 unless tests prove a better value.
Re-run the Phase 1 probes after each network change.
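
The starting MSS follows from the path MTU by subtracting the IP and TCP header sizes. The sketch below shows the arithmetic, assuming headers without options; the function name is illustrative, and the path MTU fed into it should be whatever the Phase 1 probes show actually works, which may be lower than the node MTU.

```python
IPV4_HEADER = 20  # bytes, IPv4 header without options
IPV6_HEADER = 40  # bytes, fixed IPv6 header
TCP_HEADER = 20   # bytes, TCP header without options


def mss_for_mtu(path_mtu, ipv6=False):
    """Return the TCP MSS that fits inside the given path MTU."""
    ip_header = IPV6_HEADER if ipv6 else IPV4_HEADER
    mss = path_mtu - ip_header - TCP_HEADER
    if mss <= 0:
        raise ValueError(f"path MTU {path_mtu} is too small for TCP")
    return mss


print(mss_for_mtu(1400))             # 1360 for IPv4
print(mss_for_mtu(1400, ipv6=True))  # 1340 for IPv6
```

With the current node MTU of 1400, an IPv4 clamp of 1360 is the natural first value to try. If the measured path MTU turns out smaller, recompute from that number rather than stepping the clamp down by guesswork.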

Success criteria:

Repeated registry token and manifest requests succeed without TLS handshake timeouts.
Repeated image pulls succeed from cp1 and at least one worker.
Doppler API calls from the cluster succeed consistently enough that External Secrets does not flap for long periods.

Phase 3: Reduce External Registry Dependence

If network fixes do not fully stabilize pulls, add a local registry mirror or pull-through cache on the private network.

Run the mirror close to the cluster, reachable from 10.27.27.0/24.
Configure K3s/containerd via /etc/rancher/k3s/registries.yaml.
Mirror or cache high-risk bootstrap and addon images.
Keep direct upstream pulls as fallback, but make the mirror the primary path.
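
As a rough illustration, the mirrors section of registries.yaml could be generated from a mapping of upstream registry to mirror endpoint. The mirror addresses and ports below are hypothetical placeholders inside 10.27.27.0/24, and one endpoint per upstream is assumed because many pull-through caches proxy a single upstream per instance. K3s is understood to try the default upstream endpoint after the configured mirror endpoints fail, which would cover the fallback requirement, but that behavior should be confirmed against the K3s version in use.

```python
def render_registries_yaml(endpoints):
    """Render the mirrors section of a K3s registries.yaml as text.

    endpoints maps an upstream registry host to its mirror URL.
    """
    lines = ["mirrors:"]
    for registry, mirror_url in endpoints.items():
        lines.append(f'  "{registry}":')
        lines.append("    endpoint:")
        lines.append(f'      - "{mirror_url}"')
    return "\n".join(lines) + "\n"


# Hypothetical mirror endpoints; replace with the real mirror addresses.
EXAMPLE_ENDPOINTS = {
    "docker.io": "http://10.27.27.10:5000",
    "ghcr.io": "http://10.27.27.10:5001",
    "quay.io": "http://10.27.27.10:5002",
}

print(render_registries_yaml(EXAMPLE_ENDPOINTS))
```

The rendered text would be written to /etc/rancher/k3s/registries.yaml on each node, followed by a K3s restart so containerd picks up the change.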

Priority image groups:

K3s bootstrap images
kube-vip
Flux controllers
External Secrets
Tailscale operator and proxy image
Rancher and Rancher support images
Traefik
cert-manager
observability stack images
NFS and helper images

Phase 4: Keep Secrets From Blocking The Flux Graph

External Secrets should stay the runtime secret source, but Flux should not require live Doppler validation for unrelated graph progress.

Keep ClusterSecretStore application decoupled from Flux health checks.
Keep explicit workflow checks for generated Kubernetes Secret objects where bootstrap needs them.
Continue using external-secrets.io/force-sync for critical bootstrap secrets.
Prefer checking generated Kubernetes secrets over checking live Doppler readiness in broad post-deploy gates.
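
A gate that checks the generated Secret rather than live Doppler readiness can be as small as the sketch below. It parses the JSON printed by kubectl get secret NAME -n NAMESPACE -o json and reports which required keys are absent or empty. The function name is illustrative, and the key names a caller passes in depend on what each bootstrap step needs.

```python
import base64
import json


def missing_secret_keys(secret_json, required_keys):
    """Return the required keys that are absent or empty in a Secret.

    secret_json is the text output of `kubectl get secret ... -o json`.
    """
    data = json.loads(secret_json).get("data") or {}
    missing = []
    for key in required_keys:
        encoded = data.get(key, "")
        # Secret values are base64 encoded; an empty decode means no value.
        if not encoded or not base64.b64decode(encoded):
            missing.append(key)
    return missing
```

An empty result means the Secret is usable for bootstrap even if the ClusterSecretStore is reporting a transient Doppler timeout at that moment.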

Phase 5: Tighten Workflow Diagnostics

Keep the current green deploy path, but improve failure output.

Print image pull failures grouped by image and node.
Print Flux source failures separately from HelmRelease readiness.
Print External Secrets and Doppler status only in secret-related gates.
Print node MTU, default route, and DNS resolver when registry pulls fail.
Treat cached OCI artifacts as acceptable when the dependent HelmRelease is already Ready.
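
Two of these diagnostics reduce to small pure functions, sketched here under assumptions: pull failures are taken to be already collected as dictionaries with image and node fields (how they are collected is out of scope), and readiness is passed in as booleans.

```python
from collections import Counter


def group_pull_failures(events):
    """Count pull failures per (image, node), most frequent first."""
    counts = Counter((e["image"], e["node"]) for e in events)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def format_pull_report(events):
    """Render the grouped failures as one line per image and node."""
    return "\n".join(
        f"{count:3d}  {image}  on {node}"
        for (image, node), count in group_pull_failures(events)
    )


def source_failure_blocks(source_ready, release_ready):
    """A failing Flux source only blocks if its HelmRelease is not Ready.

    When the HelmRelease is already Ready, the cached artifact was enough.
    """
    return not source_ready and not release_ready
```

Grouping by image and node makes it easier to tell a single bad node from a registry-wide problem, and the readiness rule keeps a transient OCIRepository error from failing a deploy whose releases are healthy.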

Recommended Order

Run Phase 1 probes and capture evidence.
Add or adjust gateway TCP MSS clamping.
Re-run Phase 1 probes and one full destroy/rebuild.
Add a local registry mirror only if registry pulls remain flaky.
Simplify retry-heavy workflow logic after the network path is stable.

Current Mitigations Already In Place

Node MTU is set to 1400 by Ansible.
Bootstrap image pre-pulls use direct node pulls with retries.
Critical bootstrap images are pre-pulled before Flux/addons need them.
Doppler store health no longer blocks the Flux graph.
Rancher bootstrap secrets are force-synced and checked explicitly.
Traefik Helm release has longer timeouts and more retries.
Post-deploy health checks verify Flux, Helm releases, storage, and pod health.
