docs: add network stabilization plan
# Network Stabilization Plan

## Goal

Make destroy/rebuild deploys reliable without hiding real network failures behind runner-side image archives or one-off manual intervention.

## Current Symptoms

- Registry pulls intermittently fail from cluster nodes with TLS handshake timeouts.
- Failures have appeared across GHCR, Docker Hub, Quay, registry.k8s.io, and redirected blob hosts.
- Doppler API calls from External Secrets intermittently time out.
- Flux OCIRepository objects can show transient upstream failures even when cached artifacts are sufficient for successful Helm releases.
- Lowering node MTU to 1400 improved kube-vip and some image pulls but did not eliminate the issue.

## Working Hypothesis

The remaining instability is likely egress path behavior from the VM subnet, especially PMTUD/MSS/NAT/firewall handling. The same timeout pattern appears across unrelated upstream services, which points away from a single registry, chart, or Kubernetes component.

## Phase 1: Prove The Network Root Cause

Run repeatable probes from the Proxmox host, cp1, and one worker.

- Test registry and API endpoints with repeated `curl` timing checks.
- Test known flaky pulls with repeated `crictl pull` attempts.
- Test Doppler API reachability from a node.
- Compare Proxmox host egress against VM egress.
- Check path MTU behavior with tools such as `tracepath` where available.
- Record node MTU, default route, DNS resolver, and selected remote IPs during tests.

Target endpoints:

- `https://ghcr.io/v2/`
- `https://auth.docker.io/token`
- `https://registry-1.docker.io/v2/`
- `https://quay.io/v2/`
- `https://registry.k8s.io/v2/`
- `https://api.doppler.com/v3/projects`

Known useful test images:

- `ghcr.io/fluxcd/helm-controller:v1.5.1`
- `oci.external-secrets.io/external-secrets/external-secrets:v2.1.0`
- `docker.io/rancher/mirrored-library-busybox:1.37.0`
- `ghcr.io/tailscale/tailscale:v1.96.5`
- `quay.io/prometheus/node-exporter:v1.8.2`
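The repeated `curl` and `crictl pull` checks above could be wrapped in a small helper so that every vantage point (Proxmox host, cp1, a worker) produces comparable numbers. The following is a minimal sketch, not part of the existing workflow: the function names, the 15-second timeout, and the tuple layout are assumptions chosen for illustration. It relies only on curl's `-w` write-out variables and on `crictl pull`; on a K3s node the binary may need to be invoked as `k3s crictl` instead.

```python
import subprocess

# Per-attempt timings printed by curl: TCP connect, TLS handshake done,
# total time, and HTTP status code.
CURL_FORMAT = "%{time_connect} %{time_appconnect} %{time_total} %{http_code}"


def probe_url(url, timeout_s=15, run=subprocess.run):
    """Run one curl request and return (ok, tls_seconds, total_seconds, http_code)."""
    cmd = ["curl", "-sS", "-o", "/dev/null", "--max-time", str(timeout_s),
           "-w", CURL_FORMAT, url]
    result = run(cmd, capture_output=True, text=True)
    fields = result.stdout.split()
    if len(fields) != 4:
        return (False, 0.0, 0.0, "000")
    # A 401 from a registry /v2/ endpoint still proves the TLS path works,
    # so only curl's exit code (not the HTTP status) decides success.
    return (result.returncode == 0, float(fields[1]), float(fields[2]), fields[3])


def pull_image(image, run=subprocess.run):
    """Pull one image with crictl and return (ok, stderr_text)."""
    result = run(["crictl", "pull", image], capture_output=True, text=True)
    return (result.returncode == 0, result.stderr.strip())


def summarize_probes(samples):
    """Count failures and report the slowest TLS handshake and total time."""
    return {
        "attempts": len(samples),
        "failures": sum(1 for ok, _, _, _ in samples if not ok),
        "slowest_tls_s": max((tls for _, tls, _, _ in samples), default=0.0),
        "slowest_total_s": max((total for _, _, total, _ in samples), default=0.0),
    }


def count_tls_timeouts(pull_results):
    """Count failed pulls whose error text mentions a TLS handshake timeout."""
    return sum(1 for ok, err in pull_results
               if not ok and "TLS handshake timeout" in err)
```

A run such as `summarize_probes([probe_url("https://ghcr.io/v2/") for _ in range(20)])`, executed from each vantage point, gives a failure count and worst-case handshake time that can be compared between host egress and VM egress. Repeated pulls of an image that is already present mostly exercise the manifest path, so removing the image between attempts may be needed to exercise blob downloads as well.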

## Phase 2: Fix The Network Layer

Prefer a network fix before adding more application-level retries.

- Verify whether the gateway/firewall allows ICMP fragmentation-needed messages.
- Add TCP MSS clamping on the gateway/firewall for the Kubernetes VM subnet.
- Start with an MSS value derived from the working path MTU, then reduce only if tests still fail.
- Keep VM MTU at `1400` unless tests prove a better value.
- Re-run the Phase 1 probes after each network change.

Success criteria:

- Repeated registry token and manifest requests succeed without TLS handshake timeouts.
- Repeated image pulls succeed from cp1 and at least one worker.
- Doppler API calls from the cluster succeed consistently enough that External Secrets does not flap for long periods.
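As a worked example of deriving the starting MSS from a path MTU: the TCP MSS is the path MTU minus the IP header and the TCP header (20 + 20 bytes for IPv4 without options, 40 + 20 for IPv6). The sketch below uses `1400` only because it is the current node MTU; the measured path MTU from Phase 1 should replace it. The rule renderer assumes the gateway is a Linux box using iptables, which this plan does not state, so treat that part as illustrative for other firewalls.

```python
IPV4_HEADER_BYTES = 20  # IPv4 header without options
IPV6_HEADER_BYTES = 40  # fixed IPv6 header
TCP_HEADER_BYTES = 20   # TCP header without options


def mss_for_mtu(path_mtu, ipv6=False):
    """Largest TCP payload that fits in one packet of size path_mtu."""
    ip_header = IPV6_HEADER_BYTES if ipv6 else IPV4_HEADER_BYTES
    return path_mtu - ip_header - TCP_HEADER_BYTES


def iptables_clamp_rule(subnet, mss):
    """Render an iptables rule that rewrites the MSS on forwarded SYN packets.

    Assumption: a Linux gateway with the TCPMSS target available.
    """
    return (f"iptables -t mangle -A FORWARD -s {subnet} -p tcp "
            f"--tcp-flags SYN,RST SYN -j TCPMSS --set-mss {mss}")
```

With a 1400-byte path, `mss_for_mtu(1400)` gives 1360 for IPv4 and `mss_for_mtu(1400, ipv6=True)` gives 1340. Clamping the MSS in the outbound SYN limits the segment size the remote registry sends back, which is the direction that matters for image downloads.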

## Phase 3: Reduce External Registry Dependence

If network fixes do not fully stabilize pulls, add a local registry mirror or pull-through cache on the private network.

- Run the mirror close to the cluster, reachable from `10.27.27.0/24`.
- Configure K3s/containerd via `/etc/rancher/k3s/registries.yaml`.
- Mirror or cache high-risk bootstrap and addon images.
- Keep direct upstream pulls as fallback, but make the mirror the primary path.

Priority image groups:

- K3s bootstrap images
- kube-vip
- Flux controllers
- External Secrets
- Tailscale operator and proxy image
- Rancher and Rancher support images
- Traefik
- cert-manager
- observability stack images
- NFS and helper images
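To illustrate the `registries.yaml` step, the sketch below renders a `mirrors:` section that maps each upstream registry to a mirror endpoint. The mirror hostnames and ports are hypothetical placeholders, and the helper name is invented for this example. A simple pull-through cache typically proxies a single upstream, which is why the sketch takes one endpoint per registry rather than one shared URL. K3s is expected to fall back to the upstream registry when a mirror endpoint fails, but that behavior should be confirmed against the K3s version in use.

```python
def render_registries_yaml(mirror_endpoints):
    """Render the mirrors section of /etc/rancher/k3s/registries.yaml.

    mirror_endpoints maps an upstream registry host (for example "docker.io")
    to the URL of the local mirror that should be tried first.
    """
    lines = ["mirrors:"]
    for registry in sorted(mirror_endpoints):
        lines.append(f"  {registry}:")
        lines.append("    endpoint:")
        lines.append(f'      - "{mirror_endpoints[registry]}"')
    return "\n".join(lines) + "\n"


# Hypothetical mirror addresses; replace with the real mirror on the private network.
EXAMPLE_MIRRORS = {
    "docker.io": "https://mirror.example.internal:5000",
    "ghcr.io": "https://mirror.example.internal:5001",
}
```

Calling `render_registries_yaml(EXAMPLE_MIRRORS)` yields text that Ansible could template onto each node; K3s reads the file at startup, so a service restart is needed after changes.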

## Phase 4: Keep Secrets From Blocking The Flux Graph

External Secrets should stay the runtime secret source, but Flux should not require live Doppler validation for unrelated graph progress.

- Keep `ClusterSecretStore` application decoupled from Flux health checks.
- Keep explicit workflow checks for generated Kubernetes `Secret` objects where bootstrap needs them.
- Continue using `external-secrets.io/force-sync` for critical bootstrap secrets.
- Prefer checking generated Kubernetes secrets over checking live Doppler readiness in broad post-deploy gates.
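A gate that checks the generated `Secret` rather than Doppler readiness could look roughly like the sketch below. It inspects a Secret object already parsed from `kubectl get secret <name> -o json` and reports which required keys are absent or empty. The function name and the key names in the usage note are hypothetical; the actual bootstrap secrets and keys are not listed in this plan.

```python
import base64


def missing_secret_keys(secret, required_keys):
    """Return the required keys that are absent or empty in a Secret object.

    secret is the dict parsed from `kubectl get secret ... -o json`;
    values under "data" are base64-encoded by the Kubernetes API.
    """
    data = secret.get("data") or {}
    missing = []
    for key in required_keys:
        encoded = data.get(key, "")
        if not encoded or not base64.b64decode(encoded):
            missing.append(key)
    return missing
```

A workflow step would fail only when `missing_secret_keys(...)` returns a non-empty list, for example for a hypothetical `bootstrapPassword` key, so a transient Doppler timeout does not block the gate once the Secret has been materialized.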

## Phase 5: Tighten Workflow Diagnostics

Keep the current green deploy path, but improve failure output.

- Print image pull failures grouped by image and node.
- Print Flux source failures separately from HelmRelease readiness.
- Print External Secrets and Doppler status only in secret-related gates.
- Print node MTU, default route, and DNS resolver when registry pulls fail.
- Treat cached OCI artifacts as acceptable when the dependent HelmRelease is already Ready.
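Grouping image pull failures by image and node, as proposed above, could be sketched as follows. The input is the `items` list from `kubectl get events -A -o json`. Two assumptions are made here: that kubelet pull failures carry `reason: Failed` with a message containing `image "<ref>"`, and that the node name is in `source.host`. Both match common kubelet output but should be checked against real events from this cluster.

```python
import re
from collections import defaultdict

# Matches the quoted image reference in messages such as:
#   Failed to pull image "ghcr.io/org/app:tag": rpc error: ...
IMAGE_RE = re.compile(r'image "([^"]+)"')


def group_pull_failures(events):
    """Return {image: {node: failure_count}} from a list of Event dicts."""
    grouped = defaultdict(lambda: defaultdict(int))
    for event in events:
        if event.get("reason") != "Failed":
            continue
        match = IMAGE_RE.search(event.get("message", ""))
        if not match:
            continue  # for example "Error: ErrImagePull", which names no image
        node = (event.get("source") or {}).get("host", "unknown")
        # Repeated events are aggregated by Kubernetes into a count field.
        grouped[match.group(1)][node] += event.get("count") or 1
    return {image: dict(nodes) for image, nodes in grouped.items()}
```

Printing this mapping when a deploy fails shows at a glance whether failures cluster on one node (a node-level problem) or spread across nodes and registries (consistent with the egress-path hypothesis).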

## Recommended Order

1. Run Phase 1 probes and capture evidence.
2. Add or adjust gateway TCP MSS clamping.
3. Re-run Phase 1 probes and one full destroy/rebuild.
4. Add a local registry mirror only if registry pulls remain flaky.
5. Simplify retry-heavy workflow logic after the network path is stable.

## Current Mitigations Already In Place

- Node MTU is set to `1400` by Ansible.
- Bootstrap image pre-pulls use direct node pulls with retries.
- Critical bootstrap images are pre-pulled before Flux/addons need them.
- Doppler store health no longer blocks the Flux graph.
- Rancher bootstrap secrets are force-synced and checked explicitly.
- Traefik Helm release has longer timeouts and more retries.
- Post-deploy health checks verify Flux, Helm releases, storage, and pod health.