# Proxmox Kubernetes Cluster
Private HA K3s cluster on Proxmox, provisioned by Terraform, bootstrapped by Ansible, and reconciled by Flux.
## Architecture
| Component | Current Baseline |
|-----------|------------------|
| **Control plane** | 3 Proxmox VMs, VMIDs `200-202`, IPs `10.27.27.30-32`, 2 vCPU / 4 GiB / 32 GiB |
| **Workers** | 5 Proxmox VMs, VMIDs `210-214`, IPs `10.27.27.41-45`, 4 vCPU / 8 GiB / 64 GiB |
| **Kubernetes** | K3s `v1.34.6+k3s1`, HA embedded etcd, kube-vip API VIP `10.27.27.40` |
| **Proxmox** | Node `flex`, template VMID `9000`, datastore `Flash`, bridge `vmbr0` |
| **Storage** | Raw-manifest `nfs-subdir-external-provisioner`, `10.27.27.239:/TheFlash/k8s-nfs`, default StorageClass `flash-nfs` |
| **GitOps** | Flux source `platform` on branch `main`; `apps` Kustomization is intentionally suspended |
| **Private access** | Tailscale operator exposes Rancher, Grafana, and Prometheus; no public ingress baseline |
| **Runtime secrets** | Doppler service token bootstraps External Secrets Operator |
K3s is pinned because Rancher chart `2.13.3` requires Kubernetes `<1.35.0-0`.
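A quick way to confirm the pin on a live cluster, as a sketch; it assumes the kubeconfig written by the bootstrap and a HelmRelease named `rancher` in `flux-system`, both as described in this README:
```bash
export KUBECONFIG=$(pwd)/outputs/kubeconfig
kubectl get nodes -o wide                        # every node should report v1.34.6+k3s1
kubectl -n flux-system get helmrelease rancher   # chart version should stay on 2.13.3
```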
## Prerequisites
- Terraform `>= 1.0`.
- Ansible with Python `jinja2` and `pyyaml`.
- `kubectl` for local verification.
- Proxmox API token for the `bpg/proxmox` provider.
- S3-compatible bucket for Terraform state, currently Backblaze B2.
- SSH key pair available to Terraform and Ansible, defaulting to `~/.ssh/infra` and `~/.ssh/infra.pub`.
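If the key pair does not exist yet, generate it with:
```bash
ssh-keygen -t ed25519 -C "k8s@proxmox" -f ~/.ssh/infra
```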
Expected Proxmox inputs:
| Setting | Value |
|---------|-------|
| Endpoint | `https://100.105.0.115:8006/` |
| Node | `flex` |
| Clone source | Template VMID `9000` (`ubuntu-2404-k8s-template`) |
| Storage | `Flash` |
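Creating the token on the Proxmox host looks roughly like the sketch below. It assumes the `terraform-prov@pve` user already exists and that the broad `Administrator` role is acceptable; verify flag names with `pveum help` on your PVE version before running it:
```bash
# Rough sketch: create an API token for the Terraform user and grant it rights.
# Run on the Proxmox node; narrow the role to match your least-privilege policy.
pveum user token add terraform-prov@pve k8s-cluster --privsep 0
pveum acl modify / --tokens 'terraform-prov@pve!k8s-cluster' --roles Administrator
```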
## Local Setup
Create local variables from the example:
```bash
cp terraform.tfvars.example terraform.tfvars
```
Important defaults in `terraform.tfvars.example`:
```hcl
proxmox_endpoint         = "https://100.105.0.115:8006/"
proxmox_api_token_id     = "terraform-prov@pve!k8s-cluster"
proxmox_api_token_secret = "your-proxmox-api-token-secret"
ssh_public_key           = "~/.ssh/infra.pub"
ssh_private_key          = "~/.ssh/infra"
s3_access_key            = "your-backblaze-key-id"
s3_secret_key            = "your-backblaze-application-key"
s3_endpoint              = "https://s3.eu-central-003.backblazeb2.com"
s3_bucket                = "k8s-terraform-state"
tailscale_auth_key       = "tskey-auth-..."
tailscale_tailnet        = "yourtailnet.ts.net"
kube_api_vip             = "10.27.27.40"
```
Initialize Terraform with backend credentials:
```bash
terraform -chdir=terraform init \
-backend-config="endpoint=<s3-endpoint>" \
-backend-config="bucket=<s3-bucket>" \
-backend-config="region=auto" \
-backend-config="access_key=<s3-access-key>" \
-backend-config="secret_key=<s3-secret-key>" \
-backend-config="skip_requesting_account_id=true"
```
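If you prefer a file over CLI flags, the same keys can live in an untracked `backend.hcl` (values below are placeholders):
```bash
cat > terraform/backend.hcl << 'EOF'
endpoint                   = "https://s3.eu-central-003.backblazeb2.com"
bucket                     = "k8s-terraform-state"
region                     = "auto"
access_key                 = "your-backblaze-key-id"
secret_key                 = "your-backblaze-application-key"
skip_requesting_account_id = true
EOF
# keep backend.hcl out of git; it holds state-store credentials
terraform -chdir=terraform init -backend-config=backend.hcl
```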
## Common Commands
Terraform:
```bash
terraform -chdir=terraform fmt -recursive
terraform -chdir=terraform validate
terraform -chdir=terraform plan -var-file=../terraform.tfvars
terraform -chdir=terraform apply -var-file=../terraform.tfvars
```
Ansible setup:
```bash
ansible-galaxy collection install -r ansible/requirements.yml
cd ansible
python3 generate_inventory.py
ansible-playbook site.yml --syntax-check
```
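A quick connectivity check before the full bootstrap, as a sketch; it assumes `ansible/ansible.cfg` points at the generated inventory, otherwise pass `-i <inventory>` explicitly:
```bash
cd ansible
ansible-inventory --list | head      # confirm the generated hosts and groups
ansible all -m ping                  # verify SSH access and Python on every node
```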
Manual Ansible bootstrap uses the same extra vars as the deploy workflow:
```bash
cd ansible
ansible-playbook site.yml \
-e "tailscale_auth_key=$TAILSCALE_AUTH_KEY" \
-e "tailscale_tailnet=$TAILSCALE_TAILNET" \
-e "tailscale_oauth_client_id=$TAILSCALE_OAUTH_CLIENT_ID" \
-e "tailscale_oauth_client_secret=$TAILSCALE_OAUTH_CLIENT_SECRET" \
-e "doppler_hetznerterra_service_token=$DOPPLER_HETZNERTERRA_SERVICE_TOKEN" \
-e "tailscale_api_key=${TAILSCALE_API_KEY:-}" \
-e "grafana_admin_password=${GRAFANA_ADMIN_PASSWORD:-}" \
-e "cluster_name=k8s-cluster"
```
Flux/Kustomize verification:
```bash
kubectl kustomize infrastructure/addons/<addon>
kubectl kustomize infrastructure/addons
kubectl kustomize clusters/prod/flux-system
```
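Once a cluster is reachable, an optional stricter check is to pipe the rendered manifests through a server-side dry run (nothing is applied):
```bash
export KUBECONFIG=$(pwd)/outputs/kubeconfig
kubectl kustomize infrastructure/addons | kubectl apply --dry-run=server -f -
```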
Refresh kubeconfig after rebuilds:
```bash
scripts/refresh-kubeconfig.sh 10.27.27.30
export KUBECONFIG=$(pwd)/outputs/kubeconfig
kubectl get nodes
```
Run the tailnet smoke check from cp1:
```bash
ssh ubuntu@10.27.27.30 'bash -s' < scripts/smoke-check-tailnet-services.sh
```
## Gitea CI/CD
The supported full rebuild path is the Gitea deploy workflow.
| Workflow | Trigger | Purpose |
|----------|---------|---------|
| `.gitea/workflows/deploy.yml` | PR to `main`, push to `main`, manual dispatch | PRs run Terraform plan; pushes run Terraform apply, Ansible bootstrap, Flux bootstrap, addon gates, health checks, and tailnet smoke checks |
| `.gitea/workflows/destroy.yml` | Manual dispatch with `confirm: destroy` | Terraform destroy with retries; no Rancher backup gate |
| `.gitea/workflows/dashboards.yml` | Grafana content changes or manual dispatch | Fast Grafana datasource/dashboard update through `ansible/dashboards.yml` |
Deploy and destroy share `concurrency.group: prod-cluster` so they do not run at the same time.
Deploy sequence on push to `main`:
1. Terraform fmt/init/validate/plan/apply.
2. Cleanup/retry around known transient Proxmox clone and disk-update failures.
3. Generate Ansible inventory from Terraform outputs.
4. Prepare critical image archives with `skopeo` on the runner (see the sketch after this list).
5. Run `ansible/site.yml` to bootstrap nodes, K3s, kube-vip, prerequisite secrets, and kubeconfig.
6. Apply Flux CRDs/controllers and the `clusters/prod/flux-system` graph.
7. Gate cert-manager, External Secrets, Tailscale, NFS, Rancher, and observability.
8. Run post-deploy health checks and Tailscale service smoke checks.
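Step 4 above uses `skopeo`; a minimal sketch of that kind of archive preparation (the image and path here are illustrative, not the workflow's actual list):
```bash
# Illustrative only: copy one critical image into a local archive that can later
# be imported on the nodes. The real workflow builds its own image list and paths.
skopeo copy docker://docker.io/rancher/rancher:v2.13.3 oci-archive:/tmp/rancher.tar
```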
Required Gitea secrets:
| Secret | Description |
|--------|-------------|
| `PROXMOX_ENDPOINT` | Proxmox API endpoint, for example `https://100.105.0.115:8006/` |
| `PROXMOX_API_TOKEN_ID` | Proxmox API token ID |
| `PROXMOX_API_TOKEN_SECRET` | Proxmox API token secret |
| `S3_ACCESS_KEY` | S3/Backblaze access key for Terraform state |
| `S3_SECRET_KEY` | S3/Backblaze secret key for Terraform state |
| `S3_ENDPOINT` | S3 endpoint, for example `https://s3.eu-central-003.backblazeb2.com` |
| `S3_BUCKET` | Terraform state bucket, for example `k8s-terraform-state` |
| `TAILSCALE_AUTH_KEY` | Tailscale auth key for node bootstrap |
| `TAILSCALE_TAILNET` | Tailnet domain, for example `silverside-gopher.ts.net` |
| `TAILSCALE_OAUTH_CLIENT_ID` | Tailscale OAuth client ID for the Kubernetes operator |
| `TAILSCALE_OAUTH_CLIENT_SECRET` | Tailscale OAuth client secret for the Kubernetes operator |
| `TAILSCALE_API_KEY` | Optional API key used to delete stale offline reserved devices before service proxies exist |
| `DOPPLER_HETZNERTERRA_SERVICE_TOKEN` | Doppler service token for runtime cluster secrets |
| `GRAFANA_ADMIN_PASSWORD` | Optional Grafana admin password |
| `SSH_PUBLIC_KEY` | SSH public key content |
| `SSH_PRIVATE_KEY` | SSH private key content |
## GitOps Graph
This repo uses Flux for continuous reconciliation after Terraform + Ansible bootstrap.
Flux entrypoint:
```text
clusters/prod/flux-system/
├── gotk-components.yaml
├── gitrepository-platform.yaml
├── kustomization-infrastructure.yaml
└── kustomization-apps.yaml # suspend: true
```
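With the kubeconfig exported, the graph can be inspected with the `flux` CLI; `apps` is expected to show as suspended until it is enabled deliberately:
```bash
flux get sources git
flux get kustomizations        # "apps" should show as suspended
flux get helmreleases -A
# flux resume kustomization apps   # only when workload rollout is explicitly wanted
```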
Active infrastructure addons from `infrastructure/addons/kustomization.yaml`:
- `addon-nfs-storage`
- `addon-external-secrets`
- `addon-cert-manager`
- `addon-tailscale-operator`
- `addon-tailscale-proxyclass`
- `traefik` HelmRelease manifests applied directly by the top-level infrastructure Kustomization
- `addon-observability`
- `addon-observability-content`
- `addon-rancher`
- `addon-rancher-config`
Detailed phase gates and success criteria live in `STABLE_BASELINE.md`.
Chart/source strategy:
- Vendored charts are intentional: `cert-manager`, `traefik`, `kube-prometheus-stack`, `tailscale-operator`, and `rancher` live under `infrastructure/charts/`.
- External Secrets, Loki, and Promtail use Flux `OCIRepository` sources.
- NFS storage is raw Kubernetes manifests, not a Helm chart.
- Rancher backup/restore is not part of the current live graph.
Doppler bootstrap details:
- `ansible/roles/doppler-bootstrap` creates the `external-secrets` namespace and the Doppler token secret only.
- The deploy workflow creates `ClusterSecretStore/doppler-hetznerterra` after ESO CRDs and webhook endpoints exist.
- The checked-in `infrastructure/addons/external-secrets/clustersecretstore-doppler-hetznerterra.yaml` is not included by the addon kustomization.
- Doppler project: `hetznerterra`
- Initial auth: service token via `DOPPLER_HETZNERTERRA_SERVICE_TOKEN`
- First synced secrets:
- `GRAFANA_ADMIN_PASSWORD`
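A minimal post-bootstrap check that the Doppler path is wired up, as a sketch; resource names follow this README, so adjust if the store name differs:
```bash
kubectl -n external-secrets get pods                 # operator, webhook, cert-controller Ready
kubectl get clustersecretstore doppler-hetznerterra  # status should report the store as ready
kubectl get externalsecrets -A                       # synced secrets, e.g. the Grafana password
```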
## Access URLs
| Service | URL |
|---------|-----|
| Rancher | `https://rancher.silverside-gopher.ts.net/` |
| Grafana | `http://grafana.silverside-gopher.ts.net/` |
| Prometheus | `http://prometheus.silverside-gopher.ts.net:9090/` |
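From a tailnet-connected machine, a quick reachability check (the smoke-check script does the same more thoroughly):
```bash
curl -skI https://rancher.silverside-gopher.ts.net/ | head -n1
curl -sI  http://grafana.silverside-gopher.ts.net/ | head -n1
curl -sI  http://prometheus.silverside-gopher.ts.net:9090/ | head -n1
```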
## Observability Stack
Flux deploys a lightweight observability stack in the `observability` namespace:
- `kube-prometheus-stack` (Prometheus + Grafana)
- `loki`
- `promtail`
Grafana content is managed as code via ConfigMaps in `infrastructure/addons/observability-content/`.
Grafana and Prometheus are exposed through dedicated Tailscale LoadBalancer services when the Tailscale Kubernetes Operator is healthy.
Preferred private access:
- Grafana: `http://grafana.silverside-gopher.ts.net/`
- Prometheus: `http://prometheus.silverside-gopher.ts.net:9090/`
Fallback port-forward from a tailnet-connected machine:
```bash
export KUBECONFIG=$(pwd)/outputs/kubeconfig
kubectl -n observability port-forward svc/kube-prometheus-stack-grafana 3000:80
kubectl -n observability port-forward svc/kube-prometheus-stack-prometheus 9090:9090
```
Then open Grafana at http://127.0.0.1:3000 and Prometheus at http://127.0.0.1:9090.
Grafana user is `admin`; the password comes from the `GRAFANA_ADMIN_PASSWORD` Doppler secret or the workflow-provided fallback.
## Operations
Scale workers by updating `terraform.tfvars` counts, IP lists, and VMID lists together. If node names or VMIDs change, also update the hard-coded retry cleanup target map in `.gitea/workflows/deploy.yml`.
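After editing the variables, roll the change out the same way as a rebuild:
```bash
terraform -chdir=terraform plan -var-file=../terraform.tfvars    # review added/removed VMs first
terraform -chdir=terraform apply -var-file=../terraform.tfvars
cd ansible && python3 generate_inventory.py
ansible-playbook site.yml    # plus the extra vars shown under "Common Commands"
```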
Upgrade K3s by changing the role defaults in `ansible/roles/k3s-server/defaults/main.yml` and `ansible/roles/k3s-agent/defaults/main.yml`. Check Rancher chart compatibility before moving to a Kubernetes minor outside `<1.35.0-0`.
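A small sketch of the upgrade flow; the exact variable name inside the role defaults is not assumed here, so locate it first:
```bash
# Find the pinned version in both role defaults, bump it, redeploy, then verify node versions.
grep -rn "v1.34" ansible/roles/k3s-server/defaults/main.yml ansible/roles/k3s-agent/defaults/main.yml
kubectl get nodes -o custom-columns=NAME:.metadata.name,VERSION:.status.nodeInfo.kubeletVersion
```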
Destroy through the Gitea `Destroy` workflow with `confirm: destroy`, or locally with:
```bash
terraform -chdir=terraform destroy -var-file=../terraform.tfvars
```
## Troubleshooting
Check K3s from cp1:
```bash
ssh ubuntu@10.27.27.30 'sudo k3s kubectl get nodes -o wide'
ssh ubuntu@10.27.27.30 'sudo journalctl -u k3s -n 120 --no-pager'
```
Check Flux and Rancher:
```bash
kubectl -n flux-system get gitrepositories,kustomizations,helmreleases,ocirepositories
kubectl -n flux-system describe helmrelease rancher
kubectl -n cattle-system get pods,deploy -o wide
```
Check Tailscale services:
```bash
kubectl -n tailscale-system get pods
kubectl -n cattle-system get svc rancher-tailscale
kubectl -n observability get svc grafana-tailscale prometheus-tailscale
kubectl -n observability describe svc grafana-tailscale | grep TailscaleProxyReady
kubectl -n observability describe svc prometheus-tailscale | grep TailscaleProxyReady
```
If `TailscaleProxyReady=False`, check:
```bash
kubectl -n tailscale-system logs deployment/operator --tail=100
```
Common cause: the OAuth client is missing the required tag or scope permissions.
If local `kubectl` falls back to `localhost:8080`, refresh `outputs/kubeconfig` with `scripts/refresh-kubeconfig.sh 10.27.27.30`.
## Security Notes
- Kubernetes API HA is provided by kube-vip on `10.27.27.40`
- Never commit `terraform.tfvars`, kubeconfigs, private keys, `outputs/`, or real secret values.
- Terraform/bootstrap/CI secrets stay in Gitea Actions secrets.
- Runtime cluster secrets are sourced from Doppler through External Secrets.
- This repo does not manage Proxmox/LAN firewalls or public ingress.
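A quick local guard before committing, assuming the sensitive paths are covered by `.gitignore`:
```bash
# Warn if any sensitive path would actually be committed.
for p in terraform.tfvars outputs/; do
  git check-ignore -q "$p" || echo "WARNING: $p is not ignored by git"
done
```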
## License