docs: add network stabilization plan
@@ -5,7 +5,6 @@ Compact repo guidance for OpenCode sessions. Trust executable sources over docs
## Read First

- Highest-value sources: `.gitea/workflows/deploy.yml`, `.gitea/workflows/destroy.yml`, `terraform/main.tf`, `terraform/variables.tf`, `terraform/servers.tf`, `ansible/site.yml`, `ansible/inventory.tmpl`, `clusters/prod/flux-system/`, `infrastructure/addons/kustomization.yaml`.
- `STABLE_BASELINE.md` still contains stale Rancher backup/restore references; current workflows and addon manifests do not deploy or restore `rancher-backup`.

## Baseline

@@ -0,0 +1,120 @@
# Network Stabilization Plan

## Goal

Make destroy/rebuild deploys reliable without hiding real network failures behind runner-side image archives or one-off manual intervention.

## Current Symptoms

- Registry pulls from cluster nodes intermittently fail with TLS handshake timeouts.
- Failures have appeared across GHCR, Docker Hub, Quay, registry.k8s.io, and redirected blob hosts.
- Doppler API calls from External Secrets intermittently time out.
- Flux OCIRepository objects can show transient upstream failures even when cached artifacts are sufficient for successful Helm releases.
- Lowering node MTU to 1400 improved kube-vip and some image pulls but did not eliminate the issue.

## Working Hypothesis

The remaining instability most likely comes from egress-path behavior on the VM subnet, especially PMTUD, MSS, NAT, or firewall handling. The same timeout pattern appears across unrelated upstream services, which points away from any single registry, chart, or Kubernetes component.

## Phase 1: Prove The Network Root Cause

Run repeatable probes from the Proxmox host, cp1, and one worker; sketches of the environment capture and the probe loop appear after the bullet and image lists below.

- Test registry and API endpoints with repeated `curl` timing checks.
- Test known flaky pulls with repeated `crictl pull` attempts.
- Test Doppler API reachability from a node.
- Compare Proxmox host egress against VM egress.
- Check path MTU behavior with tools such as `tracepath` where available.
- Record node MTU, default route, DNS resolver, and selected remote IPs during tests.
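
A minimal sketch of the environment capture described in the last bullet; the interface name `eth0` and the use of systemd-resolved are assumptions about the Ubuntu cloud-init nodes:

```bash
#!/usr/bin/env bash
# Record local network context so probe results can be compared across nodes and runs.
date -u +%FT%TZ
ip -o link show dev eth0 | grep -o 'mtu [0-9]*'        # node MTU (assumed interface name)
ip route show default                                   # default route / gateway
resolvectl status | grep -m1 'Current DNS Server'       # resolver (assumes systemd-resolved)
tracepath -n ghcr.io | tail -n 3                        # path MTU estimate toward a registry
```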

Target endpoints:

- `https://ghcr.io/v2/`
- `https://auth.docker.io/token`
- `https://registry-1.docker.io/v2/`
- `https://quay.io/v2/`
- `https://registry.k8s.io/v2/`
- `https://api.doppler.com/v3/projects`

Known useful test images:

- `ghcr.io/fluxcd/helm-controller:v1.5.1`
- `oci.external-secrets.io/external-secrets/external-secrets:v2.1.0`
- `docker.io/rancher/mirrored-library-busybox:1.37.0`
- `ghcr.io/tailscale/tailscale:v1.96.5`
- `quay.io/prometheus/node-exporter:v1.8.2`
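
A repeatable probe sketch over the endpoints and images above. Run counts and the timeout cap are illustrative; some endpoints return 401/403 without credentials, which is fine because the point is whether the TLS handshake completes quickly. `crictl` is assumed to be the k3s-provided binary run as root on a node.

```bash
#!/usr/bin/env bash
# Probe HTTPS endpoints and image pulls repeatedly; report timing and failures.
endpoints=(
  https://ghcr.io/v2/ https://auth.docker.io/token https://registry-1.docker.io/v2/
  https://quay.io/v2/ https://registry.k8s.io/v2/ https://api.doppler.com/v3/projects
)
images=(
  ghcr.io/fluxcd/helm-controller:v1.5.1
  docker.io/rancher/mirrored-library-busybox:1.37.0
  quay.io/prometheus/node-exporter:v1.8.2
)
for url in "${endpoints[@]}"; do
  for i in $(seq 1 10); do
    # -w prints connect/TLS/total timing; a hung TLS handshake hits the 15s cap.
    curl -s -o /dev/null --max-time 15 \
      -w "$url try=$i connect=%{time_connect}s tls=%{time_appconnect}s total=%{time_total}s http=%{http_code}\n" \
      "$url" || echo "$url try=$i FAILED"
  done
done
for img in "${images[@]}"; do
  for i in $(seq 1 3); do
    crictl pull "$img" >/dev/null && echo "$img try=$i OK" || echo "$img try=$i FAILED"
  done
done
```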

## Phase 2: Fix The Network Layer

Prefer a network fix before adding more application-level retries.

- Verify whether the gateway/firewall allows ICMP fragmentation-needed messages.
- Add TCP MSS clamping on the gateway/firewall for the Kubernetes VM subnet (see the rule sketch after this list).
- Start with an MSS value derived from the working path MTU, then reduce only if tests still fail.
- Keep VM MTU at `1400` unless tests prove a better value.
- Re-run the Phase 1 probes after each network change.
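
A minimal clamping sketch, assuming a Linux-based gateway with iptables; an MTU of 1400 implies an MSS of 1360 (MTU minus 40 bytes of IP and TCP headers). Firewall appliances expose the same setting through their own configuration instead of raw rules.

```bash
# Clamp MSS on forwarded SYNs from the Kubernetes VM subnet so remote peers
# never send segments larger than the path can carry.
iptables -t mangle -A FORWARD -s 10.27.27.0/24 -p tcp --tcp-flags SYN,RST SYN \
  -j TCPMSS --set-mss 1360
# Alternative: clamp to whatever path MTU the kernel has discovered.
iptables -t mangle -A FORWARD -s 10.27.27.0/24 -p tcp --tcp-flags SYN,RST SYN \
  -j TCPMSS --clamp-mss-to-pmtu
```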

Success criteria:

- Repeated registry token and manifest requests succeed without TLS handshake timeouts.
- Repeated image pulls succeed from cp1 and at least one worker.
- Doppler API calls from the cluster succeed consistently enough that External Secrets does not flap for long periods.

## Phase 3: Reduce External Registry Dependence

If network fixes do not fully stabilize pulls, add a local registry mirror or pull-through cache on the private network.

- Run the mirror close to the cluster, reachable from `10.27.27.0/24`.
- Configure K3s/containerd via `/etc/rancher/k3s/registries.yaml` (see the sketch after this list).
- Mirror or cache high-risk bootstrap and addon images.
- Keep direct upstream pulls as fallback, but make the mirror the primary path.
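
A minimal sketch of the node-side configuration, assuming a hypothetical mirror at `10.27.27.50:5000`; containerd should fall back to the upstream registry when the mirror endpoint fails, which preserves the fallback behavior described above. A single host is shown for brevity; a plain registry pull-through cache proxies one upstream, so separate ports or a multi-upstream mirror may be needed in practice.

```bash
# Write the K3s registry mirror config and restart K3s so containerd picks it up.
cat >/etc/rancher/k3s/registries.yaml <<'EOF'
mirrors:
  docker.io:
    endpoint:
      - "http://10.27.27.50:5000"   # hypothetical local mirror address
  ghcr.io:
    endpoint:
      - "http://10.27.27.50:5000"
  quay.io:
    endpoint:
      - "http://10.27.27.50:5000"
  registry.k8s.io:
    endpoint:
      - "http://10.27.27.50:5000"
EOF
systemctl restart k3s        # use k3s-agent on worker nodes
```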

Priority image groups:

- K3s bootstrap images
- kube-vip
- Flux controllers
- External Secrets
- Tailscale operator and proxy image
- Rancher and Rancher support images
- Traefik
- cert-manager
- observability stack images
- NFS and helper images

## Phase 4: Keep Secrets From Blocking The Flux Graph

External Secrets should stay the runtime secret source, but Flux should not require live Doppler validation for unrelated graph progress.

- Keep `ClusterSecretStore` application decoupled from Flux health checks.
- Keep explicit workflow checks for generated Kubernetes `Secret` objects where bootstrap needs them.
- Continue using `external-secrets.io/force-sync` for critical bootstrap secrets (see the sketch after this list).
- Prefer checking generated Kubernetes secrets over checking live Doppler readiness in broad post-deploy gates.
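
A minimal sketch of the force-sync pattern; the namespace and ExternalSecret name are placeholders, not the repo's actual resource names:

```bash
# Stamp the force-sync annotation with the current time to trigger an immediate
# reconcile of a critical bootstrap secret, then wait for the generated Secret.
kubectl -n cattle-system annotate externalsecret rancher-bootstrap \
  external-secrets.io/force-sync="$(date +%s)" --overwrite
kubectl -n cattle-system wait externalsecret/rancher-bootstrap \
  --for=condition=Ready --timeout=120s
```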

## Phase 5: Tighten Workflow Diagnostics

Keep the current green deploy path, but improve failure output.

- Print image pull failures grouped by image and node (see the sketch after this list).
- Print Flux source failures separately from HelmRelease readiness.
- Print External Secrets and Doppler status only in secret-related gates.
- Print node MTU, default route, and DNS resolver when registry pulls fail.
- Treat cached OCI artifacts as acceptable when the dependent HelmRelease is already Ready.
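
One way to produce the grouped pull-failure output, assuming `jq` is available on the runner; this is an illustrative sketch, not the workflow's current implementation:

```bash
# List every container stuck in an image pull error, grouped by node and image.
kubectl get pods -A -o json | jq -r '
  .items[] as $pod
  | ($pod.status.containerStatuses // [])[]
  | select(.state.waiting.reason == "ImagePullBackOff" or .state.waiting.reason == "ErrImagePull")
  | "\($pod.spec.nodeName)\t\(.image)"
' | sort | uniq -c | sort -rn
```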

## Recommended Order

1. Run Phase 1 probes and capture evidence.
2. Add or adjust gateway TCP MSS clamping.
3. Re-run Phase 1 probes and one full destroy/rebuild.
4. Add a local registry mirror only if registry pulls remain flaky.
5. Simplify retry-heavy workflow logic after the network path is stable.

## Current Mitigations Already In Place

- Node MTU is set to `1400` by Ansible.
- Bootstrap image pre-pulls use direct node pulls with retries (the retry pattern is sketched after this list).
- Critical bootstrap images are pre-pulled before Flux/addons need them.
- Doppler store health no longer blocks the Flux graph.
- Rancher bootstrap secrets are force-synced and checked explicitly.
- Traefik Helm release has longer timeouts and more retries.
- Post-deploy health checks verify Flux, Helm releases, storage, and pod health.
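
A simplified sketch of the pre-pull pattern; the image list and retry counts here are illustrative rather than the exact values in the Ansible or workflow code:

```bash
# Pull each critical bootstrap image directly on the node, retrying with a pause
# so a transient TLS handshake failure does not abort the bootstrap.
images=(
  ghcr.io/fluxcd/helm-controller:v1.5.1
  ghcr.io/tailscale/tailscale:v1.96.5
  docker.io/rancher/mirrored-library-busybox:1.37.0
)
for img in "${images[@]}"; do
  for attempt in 1 2 3 4 5; do
    if crictl pull "$img"; then
      break
    fi
    echo "pull of $img failed (attempt $attempt), retrying" >&2
    sleep 15
  done
done
```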
@@ -1,73 +0,0 @@
# Stable Private-Only Baseline

This document defines the current engineering target for this repository.

## Topology

- 3 control planes (HA etcd cluster)
- 5 workers
- kube-vip API VIP (`10.27.27.40`)
- private Proxmox/LAN network (`10.27.27.0/24`)
- Tailscale operator access and service exposure
- Rancher exposed through Tailscale (`rancher.silverside-gopher.ts.net`)
- Grafana exposed through Tailscale (`grafana.silverside-gopher.ts.net`)
- Prometheus exposed through Tailscale (`prometheus.silverside-gopher.ts.net:9090`)
- `apps` Kustomization suspended by default

## In Scope

- Terraform infrastructure bootstrap
- Ansible k3s bootstrap on Ubuntu cloud-init VMs
- **HA control plane (3 nodes with etcd quorum)**
- **kube-vip for Kubernetes API HA**
- **NFS-backed persistent volumes via `nfs-subdir-external-provisioner`**
- Flux core reconciliation
- External Secrets Operator with Doppler
- Tailscale private access and smoke-check validation
- cert-manager
- Rancher and rancher-backup
- Rancher backup/restore validation
- Observability stack (Grafana, Prometheus, Loki, Promtail)
- Persistent volume provisioning validated

## Deferred for Later Phases

- app workloads in `apps/`

## Out of Scope

- public ingress or DNS
- public TLS
- app workloads
- cross-region / multi-cluster disaster recovery strategy
- upgrade strategy

## Phase Gates

1. Terraform apply completes for HA topology (3 CP, 5 workers, 1 VIP).
2. Primary control plane bootstraps with `--cluster-init`.
3. kube-vip advertises `10.27.27.40:6443` from the control-plane set (a spot-check sketch follows this list).
4. Secondary control planes join via the kube-vip endpoint.
5. Workers join successfully via the kube-vip endpoint.
6. etcd reports 3 healthy members.
7. Flux source and infrastructure reconciliation are healthy.
8. **NFS provisioner deploys and creates `flash-nfs` StorageClass**.
9. **PVC provisioning tested and working**.
10. External Secrets sync required secrets.
11. Tailscale private access works for Rancher, Grafana, and Prometheus.
12. CI smoke checks pass for Tailscale DNS resolution, `tailscale ping`, and HTTP reachability.
13. A fresh Rancher backup can be created and restored successfully.
14. Terraform destroy succeeds cleanly or via workflow retry.
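
A minimal spot-check sketch for a few of these gates, run from any machine with the cluster kubeconfig; only the VIP and the `flash-nfs` StorageClass name come from the baseline above, the rest is generic:

```bash
# Gate 3: the kube-vip VIP terminates TLS for the API (any HTTP response proves the path).
curl -ks https://10.27.27.40:6443/version
# Gates 4-5: all control planes and workers have joined and are Ready.
kubectl get nodes -o wide
# Gate 7: Flux source and infrastructure reconciliation report Ready.
kubectl get kustomizations.kustomize.toolkit.fluxcd.io -A
# Gate 8: the NFS provisioner created the flash-nfs StorageClass.
kubectl get storageclass flash-nfs
```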

## Success Criteria

Success requires two consecutive HA rebuilds passing all phase gates with no manual fixes, no manual `kubectl` patching, and no manual Tailscale proxy recreation.

## Validated Drills

- 2026-04-18: live Rancher backup/restore drill succeeded on the current cluster.
- A fresh one-time backup was created, restored back onto the same cluster, and post-restore validation confirmed:
  - all nodes remained `Ready`
  - Flux infrastructure stayed healthy
  - Rancher backup/restore resources reported `Completed`
  - Rancher, Grafana, and Prometheus remained reachable through the Tailscale smoke checks