# Proxmox Kubernetes Cluster

Private HA K3s cluster on Proxmox, provisioned by Terraform, bootstrapped by Ansible, and reconciled by Flux.

## Architecture

| Component | Current Baseline |
|-----------|------------------|
| **Control plane** | 3 Proxmox VMs, VMIDs `200-202`, IPs `10.27.27.30-32`, 2 vCPU / 4 GiB / 32 GiB |
| **Workers** | 5 Proxmox VMs, VMIDs `210-214`, IPs `10.27.27.41-45`, 4 vCPU / 8 GiB / 64 GiB |
| **Kubernetes** | K3s `v1.34.6+k3s1`, HA embedded etcd, kube-vip API VIP `10.27.27.40` |
| **Proxmox** | Node `flex`, template VMID `9000`, datastore `Flash`, bridge `vmbr0` |
| **Storage** | Raw-manifest `nfs-subdir-external-provisioner`, `10.27.27.239:/TheFlash/k8s-nfs`, default StorageClass `flash-nfs` |
| **GitOps** | Flux source `platform` on branch `main`; `apps` Kustomization is intentionally suspended |
| **Private access** | Tailscale operator exposes Rancher, Grafana, and Prometheus; no public ingress baseline |
| **Runtime secrets** | Doppler service token bootstraps External Secrets Operator |

K3s is pinned because Rancher chart `2.13.3` requires Kubernetes `<1.35.0-0`.

## Prerequisites

- Terraform `>= 1.0`.
- Ansible with Python `jinja2` and `pyyaml`.
- `kubectl` for local verification.
- Proxmox API token for the `bpg/proxmox` provider.
- S3-compatible bucket for Terraform state, currently Backblaze B2.
- SSH key pair available to Terraform and Ansible, defaulting to `~/.ssh/infra` and `~/.ssh/infra.pub`.
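
A quick preflight can catch a missing tool before Terraform or Ansible fails partway through. The following is a minimal sketch and not part of this repo: the `missing_tools` helper name is hypothetical, and it only checks that each CLI is on `PATH`, not which version is installed.

```bash
# Hypothetical helper: print the name of every CLI that is not on PATH.
# Returns non-zero if at least one tool is missing.
missing_tools() {
  local missing=0 tool
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "$tool"
      missing=1
    fi
  done
  return "$missing"
}

# Usage, with the tool list taken from the prerequisites above:
#   missing_tools terraform ansible-playbook kubectl || echo "install the tools listed above"
# The Python modules can be checked separately (pyyaml imports as "yaml"):
#   python3 -c 'import jinja2, yaml'
```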

Expected Proxmox inputs:

| Setting | Value |
|---------|-------|
| Endpoint | `https://100.105.0.115:8006/` |
| Node | `flex` |
| Clone source | Template VMID `9000` (`ubuntu-2404-k8s-template`) |
| Storage | `Flash` |

## Local Setup

Create local variables from the example:

```bash
cp terraform.tfvars.example terraform.tfvars
```

Important defaults in `terraform.tfvars.example`:

```hcl
proxmox_endpoint         = "https://100.105.0.115:8006/"
proxmox_api_token_id     = "terraform-prov@pve!k8s-cluster"
proxmox_api_token_secret = "your-proxmox-api-token-secret"

ssh_public_key  = "~/.ssh/infra.pub"
ssh_private_key = "~/.ssh/infra"

s3_access_key = "your-backblaze-key-id"
s3_secret_key = "your-backblaze-application-key"
s3_endpoint   = "https://s3.eu-central-003.backblazeb2.com"
s3_bucket     = "k8s-terraform-state"

tailscale_tailnet = "yourtailnet.ts.net"
kube_api_vip      = "10.27.27.40"
```

Initialize Terraform with backend credentials:

```bash
terraform -chdir=terraform init \
  -backend-config="endpoint=<s3-endpoint>" \
  -backend-config="bucket=<s3-bucket>" \
  -backend-config="region=auto" \
  -backend-config="access_key=<s3-access-key>" \
  -backend-config="secret_key=<s3-secret-key>" \
  -backend-config="skip_requesting_account_id=true"
```

## Common Commands

Terraform:

```bash
terraform -chdir=terraform fmt -recursive
terraform -chdir=terraform validate
terraform -chdir=terraform plan -var-file=../terraform.tfvars
terraform -chdir=terraform apply -var-file=../terraform.tfvars
```

Ansible setup:

```bash
ansible-galaxy collection install -r ansible/requirements.yml
cd ansible
python3 generate_inventory.py
ansible-playbook site.yml --syntax-check
```

Manual Ansible bootstrap uses the same extra vars as the deploy workflow:

```bash
cd ansible
ansible-playbook site.yml \
  -e "tailscale_auth_key=$TAILSCALE_AUTH_KEY" \
  -e "tailscale_tailnet=$TAILSCALE_TAILNET" \
  -e "tailscale_oauth_client_id=$TAILSCALE_OAUTH_CLIENT_ID" \
  -e "tailscale_oauth_client_secret=$TAILSCALE_OAUTH_CLIENT_SECRET" \
  -e "doppler_hetznerterra_service_token=$DOPPLER_HETZNERTERRA_SERVICE_TOKEN" \
  -e "tailscale_api_key=${TAILSCALE_API_KEY:-}" \
  -e "grafana_admin_password=${GRAFANA_ADMIN_PASSWORD:-}" \
  -e "cluster_name=k8s-cluster"
```

Flux/Kustomize verification:

```bash
kubectl kustomize infrastructure/addons/<addon>
kubectl kustomize infrastructure/addons
kubectl kustomize clusters/prod/flux-system
```

Refresh kubeconfig after rebuilds:

```bash
scripts/refresh-kubeconfig.sh 10.27.27.30
export KUBECONFIG="$(pwd)/outputs/kubeconfig"
kubectl get nodes
```

Run the tailnet smoke check from cp1:

```bash
ssh ubuntu@10.27.27.30 'bash -s' < scripts/smoke-check-tailnet-services.sh
```

## Gitea CI/CD

The supported full rebuild path is the Gitea deploy workflow.

| Workflow | Trigger | Purpose |
|----------|---------|---------|
| `.gitea/workflows/deploy.yml` | PR to `main`, push to `main`, manual dispatch | PRs run Terraform plan; pushes run Terraform apply, Ansible bootstrap, Flux bootstrap, addon gates, health checks, and tailnet smoke checks |
| `.gitea/workflows/destroy.yml` | Manual dispatch with `confirm: destroy` | Terraform destroy with retries; no Rancher backup gate |
| `.gitea/workflows/dashboards.yml` | Grafana content changes or manual dispatch | Fast Grafana datasource/dashboard update through `ansible/dashboards.yml` |

Deploy and destroy share `concurrency.group: prod-cluster` so they do not run at the same time.

Deploy sequence on push to `main`:

1. Terraform fmt/init/validate/plan/apply.
2. Cleanup/retry around known transient Proxmox clone and disk-update failures.
3. Generate Ansible inventory from Terraform outputs.
4. Prepare critical image archives with `skopeo` on the runner.
5. Run `ansible/site.yml` to bootstrap nodes, K3s, kube-vip, prerequisite secrets, and kubeconfig.
6. Apply Flux CRDs/controllers and the `clusters/prod/flux-system` graph.
7. Gate cert-manager, External Secrets, Tailscale, NFS, Rancher, and observability.
8. Run post-deploy health checks and Tailscale service smoke checks.
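
The retry in step 2 can be pictured as a bounded retry loop around a command. The sketch below is a generic illustration, not the logic in `deploy.yml`: the `retry` helper and its attempt and delay values are assumptions, and it omits the Proxmox cleanup that the workflow performs between attempts.

```bash
# Hypothetical helper: run a command up to $1 times, sleeping $2 seconds
# between attempts. Returns 0 on the first success, 1 if every attempt fails.
retry() {
  local attempts="$1" delay="$2" n=1
  shift 2
  until "$@"; do
    if [ "$n" -ge "$attempts" ]; then
      echo "failed after $n attempts: $*" >&2
      return 1
    fi
    n=$((n + 1))
    sleep "$delay"
  done
}

# Usage (attempt count and delay are illustrative):
#   retry 3 30 terraform -chdir=terraform apply -auto-approve -var-file=../terraform.tfvars
```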

Required Gitea secrets:

| Secret | Description |
|--------|-------------|
| `PROXMOX_ENDPOINT` | Proxmox API endpoint, for example `https://100.105.0.115:8006/` |
| `PROXMOX_API_TOKEN_ID` | Proxmox API token ID |
| `PROXMOX_API_TOKEN_SECRET` | Proxmox API token secret |
| `S3_ACCESS_KEY` | S3/Backblaze access key for Terraform state |
| `S3_SECRET_KEY` | S3/Backblaze secret key for Terraform state |
| `S3_ENDPOINT` | S3 endpoint, for example `https://s3.eu-central-003.backblazeb2.com` |
| `S3_BUCKET` | Terraform state bucket, for example `k8s-terraform-state` |
| `TAILSCALE_AUTH_KEY` | Tailscale auth key for node bootstrap |
| `TAILSCALE_TAILNET` | Tailnet domain, for example `silverside-gopher.ts.net` |
| `TAILSCALE_OAUTH_CLIENT_ID` | Tailscale OAuth client ID for the Kubernetes operator |
| `TAILSCALE_OAUTH_CLIENT_SECRET` | Tailscale OAuth client secret for the Kubernetes operator |
| `TAILSCALE_API_KEY` | Optional API key used to delete stale offline reserved devices before service proxies exist |
| `DOPPLER_HETZNERTERRA_SERVICE_TOKEN` | Doppler service token for runtime cluster secrets |
| `GRAFANA_ADMIN_PASSWORD` | Optional Grafana admin password |
| `SSH_PUBLIC_KEY` | SSH public key content |
| `SSH_PRIVATE_KEY` | SSH private key content |

## GitOps Graph

Flux entrypoint:

```text
clusters/prod/flux-system/
├── gotk-components.yaml
├── gitrepository-platform.yaml
├── kustomization-infrastructure.yaml
└── kustomization-apps.yaml # suspend: true
```

Active infrastructure addons from `infrastructure/addons/kustomization.yaml`:

- `addon-nfs-storage`
- `addon-external-secrets`
- `addon-cert-manager`
- `addon-tailscale-operator`
- `addon-tailscale-proxyclass`
- `traefik` HelmRelease manifests applied directly by the top-level infrastructure Kustomization
- `addon-observability`
- `addon-observability-content`
- `addon-rancher`
- `addon-rancher-config`

Chart/source strategy:

- Vendored charts are intentional: `cert-manager`, `traefik`, `kube-prometheus-stack`, `tailscale-operator`, and `rancher` live under `infrastructure/charts/`.
- External Secrets, Loki, and Promtail use Flux `OCIRepository` sources.
- NFS storage is raw Kubernetes manifests, not a Helm chart.
- Rancher backup/restore is not part of the current live graph.

Doppler bootstrap details:

- `ansible/roles/doppler-bootstrap` creates the `external-secrets` namespace and the Doppler token secret only.
- The deploy workflow creates `ClusterSecretStore/doppler-hetznerterra` after ESO CRDs and webhook endpoints exist.
- The checked-in `infrastructure/addons/external-secrets/clustersecretstore-doppler-hetznerterra.yaml` is not included by the addon kustomization.
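
The ordering described in the second bullet could be sketched roughly as follows. This is an illustration, not the workflow's actual step: the function name, the `external-secrets-webhook` Service name, and the retry budget are assumptions, and the manifest path is passed in by the caller rather than assumed.

```bash
# Hypothetical helper: apply a ClusterSecretStore manifest only after the
# ESO CRD is established and the webhook Service has at least one endpoint.
apply_cluster_secret_store() {
  local manifest="$1"
  local ns="${2:-external-secrets}"               # namespace created by the Doppler bootstrap role
  local webhook="${3:-external-secrets-webhook}"  # ASSUMED Service name; adjust to the deployed chart
  local tries=0 ips

  # Wait for the ClusterSecretStore CRD to be served by the API server.
  kubectl wait --for=condition=established \
    crd/clustersecretstores.external-secrets.io --timeout=300s || return 1

  # Poll until the webhook Service has a ready endpoint address.
  until ips="$(kubectl -n "$ns" get endpoints "$webhook" \
      -o jsonpath='{.subsets[*].addresses[*].ip}' 2>/dev/null)" && [ -n "$ips" ]; do
    tries=$((tries + 1))
    [ "$tries" -ge 30 ] && return 1
    sleep 10
  done

  kubectl apply -f "$manifest"
}

# Usage (the manifest path is whatever file defines the store):
#   apply_cluster_secret_store path/to/clustersecretstore.yaml
```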

## Access URLs

| Service | URL |
|---------|-----|
| Rancher | `https://rancher.silverside-gopher.ts.net/` |
| Grafana | `http://grafana.silverside-gopher.ts.net/` |
| Prometheus | `http://prometheus.silverside-gopher.ts.net:9090/` |

Fallback port-forward from a tailnet-connected machine:

```bash
export KUBECONFIG="$(pwd)/outputs/kubeconfig"
kubectl -n observability port-forward svc/kube-prometheus-stack-grafana 3000:80
kubectl -n observability port-forward svc/kube-prometheus-stack-prometheus 9090:9090
```

The Grafana user is `admin`; the password comes from the `GRAFANA_ADMIN_PASSWORD` Doppler secret or the workflow-provided fallback.

## Operations

Scale workers by updating `terraform.tfvars` counts, IP lists, and VMID lists together. If node names or VMIDs change, also update the hard-coded retry cleanup target map in `.gitea/workflows/deploy.yml`.
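
Because the count and both lists must change together, a small consistency check before `terraform plan` can catch a mismatch early. The helper below is a hypothetical sketch: it takes the values as plain arguments and does not read `terraform.tfvars`, whose variable names are not shown here.

```bash
# Hypothetical helper: succeed only when the worker count matches the
# number of entries in the space-separated IP and VMID lists.
check_worker_inputs() {
  local count="$1" ips="$2" vmids="$3" n_ips n_vmids
  # Word-split each list into positional parameters to count its entries.
  set -- $ips;   n_ips=$#
  set -- $vmids; n_vmids=$#
  if [ "$count" -ne "$n_ips" ] || [ "$count" -ne "$n_vmids" ]; then
    echo "mismatch: count=$count ips=$n_ips vmids=$n_vmids" >&2
    return 1
  fi
}

# Usage with the current baseline of five workers:
#   check_worker_inputs 5 \
#     "10.27.27.41 10.27.27.42 10.27.27.43 10.27.27.44 10.27.27.45" \
#     "210 211 212 213 214"
```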

Upgrade K3s by changing the role defaults in `ansible/roles/k3s-server/defaults/main.yml` and `ansible/roles/k3s-agent/defaults/main.yml`. Check Rancher chart compatibility before moving to a Kubernetes minor outside `<1.35.0-0`.
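
The compatibility check can be approximated by comparing the Kubernetes minor version of a candidate K3s release against the chart's upper bound. The helper name below is hypothetical, and the sketch only handles `1.x` versions; it is not a full semver constraint evaluator.

```bash
# Hypothetical helper: succeed when a K3s version such as v1.34.6+k3s1
# has Kubernetes major 1 and a minor version below the given limit.
k3s_minor_below() {
  local version="${1#v}" limit="$2" major minor
  major="${version%%.*}"   # text before the first dot
  minor="${version#*.}"    # drop "major."
  minor="${minor%%.*}"     # keep text up to the next dot
  [ "$major" -eq 1 ] && [ "$minor" -lt "$limit" ]
}

# Usage against the current Rancher constraint (<1.35.0-0):
#   k3s_minor_below v1.34.6+k3s1 35 && echo "within the supported range"
```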

Destroy through the Gitea `Destroy` workflow with `confirm: destroy`, or locally with:

```bash
terraform -chdir=terraform destroy -var-file=../terraform.tfvars
```

## Troubleshooting

Check K3s from cp1:

```bash
ssh ubuntu@10.27.27.30 'sudo k3s kubectl get nodes -o wide'
ssh ubuntu@10.27.27.30 'sudo journalctl -u k3s -n 120 --no-pager'
```

Check Flux and Rancher:

```bash
kubectl -n flux-system get gitrepositories,kustomizations,helmreleases,ocirepositories
kubectl -n flux-system describe helmrelease rancher
kubectl -n cattle-system get pods,deploy -o wide
```

Check Tailscale services:

```bash
kubectl -n tailscale-system get pods
kubectl -n cattle-system get svc rancher-tailscale
kubectl -n observability get svc grafana-tailscale prometheus-tailscale
kubectl -n cattle-system describe svc rancher-tailscale | grep TailscaleProxyReady
kubectl -n observability describe svc grafana-tailscale | grep TailscaleProxyReady
kubectl -n observability describe svc prometheus-tailscale | grep TailscaleProxyReady
```

If local `kubectl` falls back to `localhost:8080`, refresh `outputs/kubeconfig` with `scripts/refresh-kubeconfig.sh 10.27.27.30`.
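
The `localhost:8080` fallback typically means `kubectl` found no usable kubeconfig. A small guard, sketched below with a hypothetical helper name, refuses to export `KUBECONFIG` when the file is missing or empty, so the problem surfaces before any `kubectl` call.

```bash
# Hypothetical helper: export KUBECONFIG only if the file exists and is
# non-empty; otherwise print a hint and return non-zero.
use_repo_kubeconfig() {
  local path="${1:-$(pwd)/outputs/kubeconfig}"
  if [ ! -s "$path" ]; then
    echo "missing kubeconfig: $path (run scripts/refresh-kubeconfig.sh 10.27.27.30)" >&2
    return 1
  fi
  export KUBECONFIG="$path"
}

# Usage from the repository root:
#   use_repo_kubeconfig && kubectl get nodes
```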

## Security Notes

- Never commit `terraform.tfvars`, kubeconfigs, private keys, `outputs/`, or real secret values.
- Terraform/bootstrap/CI secrets stay in Gitea Actions secrets.
- Runtime cluster secrets are sourced from Doppler through External Secrets.
- This repo does not manage Proxmox/LAN firewalls or public ingress.

## License

MIT