Network Stabilization Plan

Goal

Make destroy/rebuild deploys reliable without hiding real network failures behind runner-side image archives or one-off manual intervention.

Current Symptoms

  • Registry pulls intermittently fail from cluster nodes with TLS handshake timeouts.
  • Failures have appeared across GHCR, Docker Hub, Quay, registry.k8s.io, and redirected blob hosts.
  • Doppler API calls from External Secrets intermittently time out.
  • Flux OCIRepository objects can show transient upstream failures even when cached artifacts are sufficient for successful Helm releases.
  • Lowering node MTU to 1400 improved kube-vip and some image pulls but did not eliminate the issue.

Working Hypothesis

The remaining instability is likely egress path behavior from the VM subnet, especially PMTUD/MSS/NAT/firewall handling. The same timeout pattern appears across unrelated upstream services, which points away from a single registry, chart, or Kubernetes component.

Phase 1: Prove The Network Root Cause

Run repeatable probes from the Proxmox host, cp1, and one worker.

  • Test registry and API endpoints with repeated curl timing checks.
  • Test known flaky pulls with repeated crictl pull attempts.
  • Test Doppler API reachability from a node.
  • Compare Proxmox host egress against VM egress.
  • Check path MTU behavior with tools such as tracepath where available.
  • Record node MTU, default route, DNS resolver, and selected remote IPs during tests.
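
To make the recorded context comparable between runs, the MTU, default route, and resolver can be captured in a fixed shape. The following is a minimal sketch, assuming Linux nodes where `ip route show default` and `/etc/resolv.conf` are available; the function names are illustrative and not part of the existing workflow.

```python
import re
from pathlib import Path


def parse_nameservers(resolv_conf_text):
    """Return nameserver addresses from resolv.conf-style text."""
    servers = []
    for line in resolv_conf_text.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "nameserver":
            servers.append(parts[1])
    return servers


def parse_default_route(ip_route_text):
    """Extract gateway and device from `ip route show default` output."""
    match = re.search(r"^default via (\S+) dev (\S+)", ip_route_text, re.MULTILINE)
    if match is None:
        return None
    return {"gateway": match.group(1), "device": match.group(2)}


def read_mtu(device, sys_root="/sys/class/net"):
    """Read the interface MTU from sysfs (Linux only)."""
    return int(Path(sys_root, device, "mtu").read_text().strip())
```

Storing these three values next to every probe result makes it easier to tell a path problem apart from a node that drifted from the expected configuration.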

Target endpoints:

  • https://ghcr.io/v2/
  • https://auth.docker.io/token
  • https://registry-1.docker.io/v2/
  • https://quay.io/v2/
  • https://registry.k8s.io/v2/
  • https://api.doppler.com/v3/projects
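
A repeated timing check against these endpoints could look like the sketch below. It uses only the Python standard library instead of curl, which is a substitution made for illustration. The attempt count and timeout are assumed values, and the `fetch` parameter is a hypothetical hook so the loop can be exercised without network access. Unauthenticated requests to registry and API endpoints are expected to return 401 or similar; any HTTP status still shows that the TCP and TLS handshakes completed.

```python
import time
import urllib.error
import urllib.request


def fetch_once(url, timeout):
    """Issue one GET and return the HTTP status code."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # An HTTP error status (for example 401) still means the handshake worked.
        return err.code


def probe(url, attempts=20, timeout=10.0, fetch=fetch_once):
    """Run repeated requests and record status, error, and duration per attempt."""
    results = []
    for _ in range(attempts):
        started = time.monotonic()
        try:
            status, error = fetch(url, timeout), None
        except Exception as exc:  # timeouts, resets, DNS failures
            status, error = None, f"{type(exc).__name__}: {exc}"
        results.append(
            {"status": status, "error": error, "seconds": time.monotonic() - started}
        )
    return results


def summarize(results):
    """Count failed attempts and report the slowest one."""
    failures = [r for r in results if r["error"] is not None]
    return {
        "attempts": len(results),
        "failures": len(failures),
        "slowest_seconds": max((r["seconds"] for r in results), default=0.0),
    }
```

Running the same loop from the Proxmox host and from a VM gives two summaries per endpoint that can be compared directly.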

Known useful test images:

  • ghcr.io/fluxcd/helm-controller:v1.5.1
  • oci.external-secrets.io/external-secrets/external-secrets:v2.1.0
  • docker.io/rancher/mirrored-library-busybox:1.37.0
  • ghcr.io/tailscale/tailscale:v1.96.5
  • quay.io/prometheus/node-exporter:v1.8.2
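
The repeated pull test can be sketched as a small loop around `crictl pull`. This assumes `crictl` is on the node's PATH and already configured for the K3s containerd socket; the round count and timeout are illustrative, and the `pull` parameter is a hypothetical hook for dry runs.

```python
import subprocess


def run_crictl_pull(image, timeout):
    """Run `crictl pull IMAGE` and return (exit_code, combined_output)."""
    try:
        done = subprocess.run(
            ["crictl", "pull", image],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return 124, "timed out"
    return done.returncode, (done.stdout + done.stderr).strip()


def repeat_pulls(images, rounds=5, timeout=300, pull=run_crictl_pull):
    """Pull every image several times and collect failure output per image."""
    failures = {image: [] for image in images}
    for _ in range(rounds):
        for image in images:
            code, output = pull(image, timeout)
            if code != 0:
                failures[image].append(output)
    return failures
```

Once an image is present locally, later rounds mostly exercise the manifest request rather than the blob downloads. Removing the image between rounds (for example with `crictl rmi`) would make each round a full pull, at the cost of longer test runs.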

Phase 2: Fix The Network Layer

Prefer a network fix before adding more application-level retries.

  • Verify whether the gateway/firewall allows ICMP fragmentation-needed messages.
  • Add TCP MSS clamping on the gateway/firewall for the Kubernetes VM subnet.
  • Start with an MSS value derived from the working path MTU, then reduce only if tests still fail.
  • Keep VM MTU at 1400 unless tests prove a better value.
  • Re-run the Phase 1 probes after each network change.
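
As a worked example of deriving the starting MSS: the TCP MSS is the path MTU minus the IP header (20 bytes for IPv4, 40 for IPv6) and the 20-byte TCP header, so a 1400-byte path MTU gives 1360 for IPv4. The sketch below also builds an iptables rule for that value. It assumes the gateway is a Linux box using iptables and that 10.27.27.0/24 is the Kubernetes VM subnet; neither is confirmed by this plan, and a different firewall would need its own equivalent.

```python
def mss_for_mtu(path_mtu, ipv6=False):
    """MSS = path MTU minus the IP header and the 20-byte TCP header."""
    ip_header = 40 if ipv6 else 20
    tcp_header = 20
    return path_mtu - ip_header - tcp_header


def iptables_clamp_rule(subnet, mss):
    """Build an iptables command that clamps the MSS on SYNs leaving the subnet.

    Assumption: a Linux gateway with iptables and the TCPMSS target available.
    """
    return [
        "iptables", "-t", "mangle", "-A", "FORWARD",
        "-s", subnet,
        "-p", "tcp", "--tcp-flags", "SYN,RST", "SYN",
        "-j", "TCPMSS", "--set-mss", str(mss),
    ]


if __name__ == "__main__":
    # Print the command for review instead of applying it.
    print(" ".join(iptables_clamp_rule("10.27.27.0/24", mss_for_mtu(1400))))
```

Clamping the SYN sent by a VM lowers the segment size the remote side uses toward the VM, which is the direction that carries image layers.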

Success criteria:

  • Repeated registry token and manifest requests succeed without TLS handshake timeouts.
  • Repeated image pulls succeed from cp1 and at least one worker.
  • Doppler API calls from the cluster succeed consistently enough that External Secrets does not flap for long periods.

Phase 3: Reduce External Registry Dependence

If network fixes do not fully stabilize pulls, add a local registry mirror or pull-through cache on the private network.

  • Run the mirror close to the cluster, reachable from 10.27.27.0/24.
  • Configure K3s/containerd via /etc/rancher/k3s/registries.yaml.
  • Mirror or cache high-risk bootstrap and addon images.
  • Keep direct upstream pulls as fallback, but make the mirror the primary path.
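
The mirror configuration could be generated rather than hand-edited. The sketch below renders the `mirrors` section of `registries.yaml` from a mapping of upstream registry host to mirror endpoint. The mirror addresses in the usage example are hypothetical placeholders on the private network, and the one-endpoint-per-upstream layout is an assumption about how the mirror would be deployed.

```python
def render_registries_yaml(mirrors):
    """Render the `mirrors` section of a K3s registries.yaml.

    `mirrors` maps an upstream registry host to the mirror endpoint URL
    that should be tried first for that host.
    """
    lines = ["mirrors:"]
    for host in sorted(mirrors):
        lines.append(f'  "{host}":')
        lines.append("    endpoint:")
        lines.append(f'      - "{mirrors[host]}"')
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical mirror addresses; replace with the real mirror.
    print(render_registries_yaml({
        "docker.io": "http://10.27.27.5:5000",
        "ghcr.io": "http://10.27.27.5:5001",
    }))
```

Whether upstream remains available as a fallback depends on the K3s configuration, in particular on the default registry endpoint not being disabled, so this should be verified on a node before relying on it.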

Priority image groups:

  • K3s bootstrap images
  • kube-vip
  • Flux controllers
  • External Secrets
  • Tailscale operator and proxy image
  • Rancher and Rancher support images
  • Traefik
  • cert-manager
  • observability stack images
  • NFS and helper images

Phase 4: Keep Secrets From Blocking The Flux Graph

External Secrets should stay the runtime secret source, but Flux should not require live Doppler validation for unrelated graph progress.

  • Keep ClusterSecretStore application decoupled from Flux health checks.
  • Keep explicit workflow checks for generated Kubernetes Secret objects where bootstrap needs them.
  • Continue using external-secrets.io/force-sync for critical bootstrap secrets.
  • Prefer checking generated Kubernetes secrets over checking live Doppler readiness in broad post-deploy gates.
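
A gate that checks the generated Kubernetes Secret instead of live Doppler readiness could be sketched as follows. It assumes `kubectl` is available to the workflow. The secret name, namespace, and key names are supplied by the caller and are not taken from this repository, and `get_secret` is a hypothetical hook for testing.

```python
import base64
import json
import subprocess


def kubectl_get_secret(namespace, name):
    """Fetch a Secret as JSON via kubectl and return (exit_code, stdout)."""
    done = subprocess.run(
        ["kubectl", "get", "secret", name, "-n", namespace, "-o", "json"],
        capture_output=True,
        text=True,
    )
    return done.returncode, done.stdout


def missing_secret_keys(namespace, name, required_keys, get_secret=kubectl_get_secret):
    """Return the required keys that are absent or empty in the Secret."""
    code, stdout = get_secret(namespace, name)
    if code != 0:
        # Secret not found or API unreachable: treat every key as missing.
        return sorted(required_keys)
    data = json.loads(stdout).get("data") or {}
    missing = []
    for key in required_keys:
        if not base64.b64decode(data.get(key, "")):
            missing.append(key)
    return sorted(missing)
```

An empty result means the bootstrap can proceed even if the Doppler API is unreachable at that moment, because the material it needs is already in the cluster.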

Phase 5: Tighten Workflow Diagnostics

Keep the current green deploy path, but improve failure output.

  • Print image pull failures grouped by image and node.
  • Print Flux source failures separately from HelmRelease readiness.
  • Print External Secrets and Doppler status only in secret-related gates.
  • Print node MTU, default route, and DNS resolver when registry pulls fail.
  • Treat cached OCI artifacts as acceptable when the dependent HelmRelease is already Ready.
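
Grouping pull failures by image and node can be sketched as below. The input is assumed to be the `items` list from `kubectl get events -A -o json`, and the message pattern and the `source.host` field are assumptions about how the kubelet reports pull failures. Both should be checked against real event output from this cluster.

```python
import re
from collections import defaultdict

# Assumed kubelet message shape: Failed to pull image "<ref>": <reason>
PULL_FAILURE = re.compile(r'Failed to pull image "([^"]+)"')


def group_pull_failures(events):
    """Return {image: {node: failure_count}} from Kubernetes event items."""
    grouped = defaultdict(lambda: defaultdict(int))
    for event in events:
        match = PULL_FAILURE.search(event.get("message") or "")
        if match is None:
            continue
        node = (event.get("source") or {}).get("host") or "unknown"
        grouped[match.group(1)][node] += event.get("count") or 1
    return {image: dict(nodes) for image, nodes in grouped.items()}
```

An image that fails on every node points toward the upstream or the shared egress path, while one that fails on a single node points toward that node.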

Recommended Order

  1. Run Phase 1 probes and capture evidence.
  2. Add or adjust gateway TCP MSS clamping.
  3. Re-run Phase 1 probes and one full destroy/rebuild.
  4. Add a local registry mirror only if registry pulls remain flaky.
  5. Simplify retry-heavy workflow logic after the network path is stable.

Current Mitigations Already In Place

  • Node MTU is set to 1400 by Ansible.
  • Bootstrap image pre-pulls use direct node pulls with retries.
  • Critical bootstrap images are pre-pulled before Flux/addons need them.
  • Doppler store health no longer blocks the Flux graph.
  • Rancher bootstrap secrets are force-synced and checked explicitly.
  • Traefik Helm release has longer timeouts and more retries.
  • Post-deploy health checks verify Flux, Helm releases, storage, and pod health.