From 8375333ac58609cbbb4327d392cd2629dc8e6f4c Mon Sep 17 00:00:00 2001
From: MichaelFisher1997
Date: Sat, 2 May 2026 22:20:52 +0000
Subject: [PATCH] docs: add network stabilization plan

---
 AGENTS.md                     |   1 -
 NETWORK_STABILIZATION_PLAN.md | 120 ++++++++++++++++++++++++++++++++++
 STABLE_BASELINE.md            |  73 ---------------------
 3 files changed, 120 insertions(+), 74 deletions(-)
 create mode 100644 NETWORK_STABILIZATION_PLAN.md
 delete mode 100644 STABLE_BASELINE.md

diff --git a/AGENTS.md b/AGENTS.md
index 56aee71..f5c973b 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -5,7 +5,6 @@ Compact repo guidance for OpenCode sessions. Trust executable sources over docs
 ## Read First
 - Highest-value sources: `.gitea/workflows/deploy.yml`, `.gitea/workflows/destroy.yml`, `terraform/main.tf`, `terraform/variables.tf`, `terraform/servers.tf`, `ansible/site.yml`, `ansible/inventory.tmpl`, `clusters/prod/flux-system/`, `infrastructure/addons/kustomization.yaml`.
-- `STABLE_BASELINE.md` still contains stale Rancher backup/restore references; current workflows and addon manifests do not deploy or restore `rancher-backup`.
 
 ## Baseline
 
diff --git a/NETWORK_STABILIZATION_PLAN.md b/NETWORK_STABILIZATION_PLAN.md
new file mode 100644
index 0000000..731023a
--- /dev/null
+++ b/NETWORK_STABILIZATION_PLAN.md
@@ -0,0 +1,120 @@
+# Network Stabilization Plan
+
+## Goal
+
+Make destroy/rebuild deploys reliable without hiding real network failures behind runner-side image archives or one-off manual intervention.
+
+## Current Symptoms
+
+- Registry pulls intermittently fail from cluster nodes with TLS handshake timeouts.
+- Failures have appeared across GHCR, Docker Hub, Quay, registry.k8s.io, and redirected blob hosts.
+- Doppler API calls from External Secrets intermittently time out.
+- Flux OCIRepository objects can show transient upstream failures even when cached artifacts are sufficient for successful Helm releases.
+- Lowering node MTU to 1400 improved kube-vip and some image pulls but did not eliminate the issue.
+
+## Working Hypothesis
+
+The remaining instability likely stems from egress-path behavior on the VM subnet, especially path MTU discovery (PMTUD), MSS, NAT, and firewall handling. The same timeout pattern appears across unrelated upstream services, which points away from a single registry, chart, or Kubernetes component.
+
+## Phase 1: Prove The Network Root Cause
+
+Run repeatable probes from the Proxmox host, cp1, and one worker.
+
+- Test registry and API endpoints with repeated `curl` timing checks.
+- Test known flaky pulls with repeated `crictl pull` attempts.
+- Test Doppler API reachability from a node.
+- Compare Proxmox host egress against VM egress.
+- Check path MTU behavior with tools such as `tracepath` where available.
+- Record node MTU, default route, DNS resolver, and selected remote IPs during tests.
+
+Target endpoints:
+
+- `https://ghcr.io/v2/`
+- `https://auth.docker.io/token`
+- `https://registry-1.docker.io/v2/`
+- `https://quay.io/v2/`
+- `https://registry.k8s.io/v2/`
+- `https://api.doppler.com/v3/projects`
+
+Known useful test images:
+
+- `ghcr.io/fluxcd/helm-controller:v1.5.1`
+- `oci.external-secrets.io/external-secrets/external-secrets:v2.1.0`
+- `docker.io/rancher/mirrored-library-busybox:1.37.0`
+- `ghcr.io/tailscale/tailscale:v1.96.5`
+- `quay.io/prometheus/node-exporter:v1.8.2`
+
+## Phase 2: Fix The Network Layer
+
+Prefer a network fix before adding more application-level retries.
+
+- Verify whether the gateway/firewall allows ICMP fragmentation-needed messages.
+- Add TCP MSS clamping on the gateway/firewall for the Kubernetes VM subnet.
+- Start with an MSS value derived from the working path MTU, then reduce it only if tests still fail.
+- Keep VM MTU at `1400` unless tests prove a better value.
+- Re-run the Phase 1 probes after each network change.
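+
+As a sketch, MSS clamping could look like the following, assuming a Linux gateway with an iptables-based firewall (the rules and the fallback MSS of `1360` are illustrative, derived as the current VM MTU `1400` minus 40 bytes of IPv4 + TCP headers; adapt to the actual firewall):
+
+```sh
+# Clamp the MSS of forwarded TCP SYN packets from the VM subnet to the
+# discovered path MTU, so endpoints never negotiate oversized segments.
+iptables -t mangle -A FORWARD -s 10.27.27.0/24 -p tcp \
+  --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu
+
+# Alternative: pin an explicit MSS if PMTUD itself is unreliable on the
+# egress path. 1400 (VM MTU) - 20 (IPv4) - 20 (TCP) = 1360.
+iptables -t mangle -A FORWARD -s 10.27.27.0/24 -p tcp \
+  --tcp-flags SYN,RST SYN -j TCPMSS --set-mss 1360
+```
+
+Only one of the two rules is needed; prefer `--clamp-mss-to-pmtu` first, since it tracks the real path MTU instead of a hand-picked value.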
+
+Success criteria:
+
+- Repeated registry token and manifest requests succeed without TLS handshake timeouts.
+- Repeated image pulls succeed from cp1 and at least one worker.
+- Doppler API calls from the cluster succeed consistently enough that External Secrets does not flap for long periods.
+
+## Phase 3: Reduce External Registry Dependence
+
+If network fixes do not fully stabilize pulls, add a local registry mirror or pull-through cache on the private network.
+
+- Run the mirror close to the cluster, reachable from `10.27.27.0/24`.
+- Configure K3s/containerd via `/etc/rancher/k3s/registries.yaml`.
+- Mirror or cache high-risk bootstrap and addon images.
+- Keep direct upstream pulls as a fallback, but make the mirror the primary path.
+
+Priority image groups:
+
+- K3s bootstrap images
+- kube-vip
+- Flux controllers
+- External Secrets
+- Tailscale operator and proxy image
+- Rancher and Rancher support images
+- Traefik
+- cert-manager
+- observability stack images
+- NFS and helper images
+
+## Phase 4: Keep Secrets From Blocking The Flux Graph
+
+External Secrets should stay the runtime secret source, but Flux should not require live Doppler validation for unrelated graph progress.
+
+- Keep `ClusterSecretStore` application decoupled from Flux health checks.
+- Keep explicit workflow checks for generated Kubernetes `Secret` objects where bootstrap needs them.
+- Continue using `external-secrets.io/force-sync` for critical bootstrap secrets.
+- Prefer checking generated Kubernetes secrets over checking live Doppler readiness in broad post-deploy gates.
+
+## Phase 5: Tighten Workflow Diagnostics
+
+Keep the current green deploy path, but improve failure output.
+
+- Print image pull failures grouped by image and node.
+- Print Flux source failures separately from HelmRelease readiness.
+- Print External Secrets and Doppler status only in secret-related gates.
+- Print node MTU, default route, and DNS resolver when registry pulls fail.
+- Treat cached OCI artifacts as acceptable when the dependent HelmRelease is already Ready.
+
+## Recommended Order
+
+1. Run Phase 1 probes and capture evidence.
+2. Add or adjust gateway TCP MSS clamping.
+3. Re-run Phase 1 probes and one full destroy/rebuild.
+4. Add a local registry mirror only if registry pulls remain flaky.
+5. Simplify retry-heavy workflow logic after the network path is stable.
+
+## Current Mitigations Already In Place
+
+- Node MTU is set to `1400` by Ansible.
+- Bootstrap image pre-pulls use direct node pulls with retries.
+- Critical bootstrap images are pre-pulled before Flux/addons need them.
+- Doppler store health no longer blocks the Flux graph.
+- Rancher bootstrap secrets are force-synced and checked explicitly.
+- Traefik Helm release has longer timeouts and more retries.
+- Post-deploy health checks verify Flux, Helm releases, storage, and pod health.

diff --git a/STABLE_BASELINE.md b/STABLE_BASELINE.md
deleted file mode 100644
index d6889cf..0000000
--- a/STABLE_BASELINE.md
+++ /dev/null
@@ -1,73 +0,0 @@
-# Stable Private-Only Baseline
-
-This document defines the current engineering target for this repository.
-
-## Topology
-
-- 3 control planes (HA etcd cluster)
-- 5 workers
-- kube-vip API VIP (`10.27.27.40`)
-- private Proxmox/LAN network (`10.27.27.0/24`)
-- Tailscale operator access and service exposure
-- Rancher exposed through Tailscale (`rancher.silverside-gopher.ts.net`)
-- Grafana exposed through Tailscale (`grafana.silverside-gopher.ts.net`)
-- Prometheus exposed through Tailscale (`prometheus.silverside-gopher.ts.net:9090`)
-- `apps` Kustomization suspended by default
-
-## In Scope
-
-- Terraform infrastructure bootstrap
-- Ansible k3s bootstrap on Ubuntu cloud-init VMs
-- **HA control plane (3 nodes with etcd quorum)**
-- **kube-vip for Kubernetes API HA**
-- **NFS-backed persistent volumes via `nfs-subdir-external-provisioner`**
-- Flux core reconciliation
-- External Secrets Operator with Doppler
-- Tailscale private access and smoke-check validation
-- cert-manager
-- Rancher and rancher-backup
-- Rancher backup/restore validation
-- Observability stack (Grafana, Prometheus, Loki, Promtail)
-- Persistent volume provisioning validated
-
-## Deferred for Later Phases
-
-- app workloads in `apps/`
-
-## Out of Scope
-
-- public ingress or DNS
-- public TLS
-- app workloads
-- cross-region / multi-cluster disaster recovery strategy
-- upgrade strategy
-
-## Phase Gates
-
-1. Terraform apply completes for HA topology (3 CP, 5 workers, 1 VIP).
-2. Primary control plane bootstraps with `--cluster-init`.
-3. kube-vip advertises `10.27.27.40:6443` from the control-plane set.
-4. Secondary control planes join via the kube-vip endpoint.
-5. Workers join successfully via the kube-vip endpoint.
-7. etcd reports 3 healthy members.
-8. Flux source and infrastructure reconciliation are healthy.
-9. **NFS provisioner deploys and creates `flash-nfs` StorageClass**.
-10. **PVC provisioning tested and working**.
-11. External Secrets sync required secrets.
-12. Tailscale private access works for Rancher, Grafana, and Prometheus.
-13. CI smoke checks pass for Tailscale DNS resolution, `tailscale ping`, and HTTP reachability.
-14. A fresh Rancher backup can be created and restored successfully.
-15. Terraform destroy succeeds cleanly or via workflow retry.
-
-## Success Criteria
-
-Success requires two consecutive HA rebuilds passing all phase gates with no manual fixes, no manual `kubectl` patching, and no manual Tailscale proxy recreation.
-
-## Validated Drills
-
-- 2026-04-18: live Rancher backup/restore drill succeeded on the current cluster.
-- A fresh one-time backup was created, restored back onto the same cluster, and post-restore validation confirmed:
-  - all nodes remained `Ready`
-  - Flux infrastructure stayed healthy
-  - Rancher backup/restore resources reported `Completed`
-  - Rancher, Grafana, and Prometheus remained reachable through the Tailscale smoke checks