Proxmox Kubernetes Cluster
Production-ready private Kubernetes cluster on Proxmox using Terraform, Ansible, and Flux.
Architecture
| Component | Details |
|---|---|
| Control Plane | 3x Proxmox VMs (2 vCPU / 4 GiB / 32 GiB) |
| Workers | 5x Proxmox VMs (4 vCPU / 8 GiB / 64 GiB) |
| K8s | k3s (latest, HA) |
| Addons | NFS provisioner + Prometheus + Grafana + Loki + Rancher |
| Access | SSH/API and private services restricted to Tailnet |
| Bootstrap | Terraform + Ansible + Flux |
Prerequisites
1. Proxmox API Token
Create an API token for the Proxmox VE user used by Terraform. The repo expects the bpg/proxmox provider with:
- endpoint: https://100.105.0.115:8006/
- node: flex
- clone source: template 9000 (ubuntu-2404-k8s-template)
- auth: API token
2. Backblaze B2 Bucket (for Terraform State)
- Go to Backblaze B2
- Click Create a Bucket
- Set bucket name: k8s-terraform-state (must be globally unique)
- Choose Private access
- Click Create Bucket
- Create application key:
  - Go to App Keys → Add a New Application Key
  - Name: terraform-state
  - Allow access to: k8s-terraform-state bucket only
  - Type: Read and Write
  - Copy keyID (access key) and applicationKey (secret key)
- Note your bucket's S3 endpoint (e.g., https://s3.eu-central-003.backblazeb2.com)
3. SSH Key Pair
ssh-keygen -t ed25519 -C "k8s@proxmox" -f ~/.ssh/infra
4. Local Tools
Install Terraform, Ansible, Python 3, kubectl, and the flux CLI on the machine you run the bootstrap from.
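A quick way to confirm the toolchain is present (a minimal sketch; versions are whatever your install provides):
# Confirm the local tools used by this repo are installed
terraform version
ansible --version
python3 --version
kubectl version --client
flux --version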
Setup
1. Clone Repository
git clone <your-gitea-repo>/HetznerTerra.git
cd HetznerTerra
2. Configure Variables
cp terraform.tfvars.example terraform.tfvars
Edit terraform.tfvars:
proxmox_endpoint = "https://100.105.0.115:8006/"
proxmox_api_token_id = "terraform-prov@pve!k8s-cluster"
proxmox_api_token_secret = "your-proxmox-token-secret"
ssh_public_key = "~/.ssh/infra.pub"
ssh_private_key = "~/.ssh/infra"
s3_access_key = "your-backblaze-key-id"
s3_secret_key = "your-backblaze-application-key"
s3_endpoint = "https://s3.eu-central-003.backblazeb2.com"
s3_bucket = "k8s-terraform-state"
tailscale_auth_key = "tskey-auth-..."
tailscale_tailnet = "yourtailnet.ts.net"
kube_api_vip = "10.27.27.40"
3. Initialize Terraform
cd terraform
# Create backend config file (or use CLI args)
cat > backend.hcl << EOF
endpoint = "https://s3.eu-central-003.backblazeb2.com"
bucket = "k8s-terraform-state"
access_key = "your-backblaze-key-id"
secret_key = "your-backblaze-application-key"
skip_requesting_account_id = true
EOF
terraform init -backend-config=backend.hcl
4. Plan and Apply
terraform plan -var-file=../terraform.tfvars
terraform apply -var-file=../terraform.tfvars
5. Generate Ansible Inventory
cd ../ansible
python3 generate_inventory.py
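To sanity-check the generated inventory before bootstrapping, something like the following should work (a sketch; it assumes ansible.cfg points Ansible at the generated inventory and SSH key):
# List the hosts Ansible will target
ansible-inventory --list
# Confirm SSH connectivity to every node
ansible all -m ping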
6. Bootstrap Cluster
ansible-playbook site.yml
7. Get Kubeconfig
export KUBECONFIG=$(pwd)/outputs/kubeconfig
kubectl get nodes
Use scripts/refresh-kubeconfig.sh <cp1-public-ip> to refresh kubeconfig against the primary control-plane public IP after rebuilds.
Gitea CI/CD
This repository includes Gitea workflows for:
- deploy: End-to-end Terraform + Ansible + Flux bootstrap + restore + health checks
- destroy: Cluster teardown with backup-aware cleanup
- dashboards: Fast workflow that updates Grafana datasources/dashboards only
Required Gitea Secrets
Set these in your Gitea repository settings (Settings → Secrets → Actions):
| Secret | Description |
|---|---|
| PROXMOX_ENDPOINT | Proxmox API endpoint (for example https://100.105.0.115:8006/) |
| PROXMOX_API_TOKEN_ID | Proxmox API token ID |
| PROXMOX_API_TOKEN_SECRET | Proxmox API token secret |
| S3_ACCESS_KEY | Backblaze B2 keyID |
| S3_SECRET_KEY | Backblaze B2 applicationKey |
| S3_ENDPOINT | Backblaze S3 endpoint (e.g., https://s3.eu-central-003.backblazeb2.com) |
| S3_BUCKET | S3 bucket name (e.g., k8s-terraform-state) |
| TAILSCALE_AUTH_KEY | Tailscale auth key for node bootstrap |
| TAILSCALE_TAILNET | Tailnet domain (e.g., yourtailnet.ts.net) |
| TAILSCALE_OAUTH_CLIENT_ID | Tailscale OAuth client ID for the Kubernetes Operator |
| TAILSCALE_OAUTH_CLIENT_SECRET | Tailscale OAuth client secret for the Kubernetes Operator |
| DOPPLER_HETZNERTERRA_SERVICE_TOKEN | Doppler service token for hetznerterra runtime secrets |
| GRAFANA_ADMIN_PASSWORD | Optional admin password for Grafana (auto-generated if unset) |
| SSH_PUBLIC_KEY | SSH public key content |
| SSH_PRIVATE_KEY | SSH private key content |
GitOps (Flux)
This repo uses Flux for continuous reconciliation after Terraform + Ansible bootstrap.
Stable private-only baseline
The current default target is the HA private baseline:
- 3 control plane nodes
- 5 worker nodes
- private Proxmox network only
- Tailscale for operator and service access
- Flux-managed platform addons with apps suspended by default
Detailed phase gates and success criteria live in STABLE_BASELINE.md.
This is the default until rebuilds are consistently green. Public ingress and app-layer expansion come later.
Runtime secrets
Runtime cluster secrets are moving to Doppler + External Secrets Operator.
- Doppler project: hetznerterra
- Initial auth: service token via DOPPLER_HETZNERTERRA_SERVICE_TOKEN
- First synced secrets: GRAFANA_ADMIN_PASSWORD
Terraform/bootstrap secrets remain in Gitea Actions secrets and are not managed by Doppler.
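As an illustration of the intended pattern (not a manifest taken from this repo), a Doppler-synced secret through External Secrets Operator could look roughly like this; the store and target names are hypothetical placeholders:
# Hypothetical sketch: sync GRAFANA_ADMIN_PASSWORD from Doppler via External Secrets Operator
kubectl apply -f - <<EOF
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: grafana-admin-password
  namespace: observability
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: doppler-hetznerterra   # placeholder store name
  target:
    name: grafana-admin-credentials   # placeholder target Secret name
  data:
    - secretKey: admin-password
      remoteRef:
        key: GRAFANA_ADMIN_PASSWORD
EOF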
Repository layout
- clusters/prod/: cluster entrypoint and Flux reconciliation objects
- clusters/prod/flux-system/: GitRepository source and top-level Kustomization graph
- infrastructure/: infrastructure addon reconciliation graph
- infrastructure/addons/*: per-addon manifests for Flux-managed cluster addons
- apps/: application workload layer (currently scaffolded)
Reconciliation graph
- infrastructure (top-level)
  - addon-nfs-storage
  - addon-tailscale-operator
  - addon-observability
  - addon-observability-content depends on addon-observability
- apps depends on infrastructure
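To inspect this graph on a live cluster, the flux CLI can list the Kustomizations and their status (assuming they live in the flux-system namespace, as is typical for Flux):
# Show reconciliation status of the Flux Kustomizations
export KUBECONFIG=$(pwd)/outputs/kubeconfig
flux get kustomizations --all-namespaces
# Walk the objects owned by the top-level infrastructure Kustomization
flux tree kustomization infrastructure -n flux-system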
Bootstrap notes
- Install Flux controllers in flux-system.
- Create the Flux deploy key/secret named flux-system in the flux-system namespace.
- Apply clusters/prod/flux-system/ once to establish the source + reconciliation graph (a manual sketch of these steps follows below).
- Bootstrap-only Ansible creates prerequisite secrets; Flux manages addon lifecycle after bootstrap.
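A rough manual equivalent of those steps, with the repo URL and deploy key path as placeholders you substitute:
# Manual sketch of the Flux bootstrap steps above (URL and key path are placeholders)
flux install
flux create secret git flux-system \
  --namespace=flux-system \
  --url=ssh://git@<your-gitea-host>/<owner>/HetznerTerra.git \
  --private-key-file=<path-to-deploy-key>
kubectl apply -k clusters/prod/flux-system/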
Current addon status
- Core infrastructure addons are Flux-managed from infrastructure/addons/.
- Active Flux addons for the current baseline: addon-nfs-storage, addon-cert-manager, addon-external-secrets, addon-tailscale-operator, addon-tailscale-proxyclass, addon-observability, addon-observability-content, addon-rancher, addon-rancher-config, addon-rancher-backup, addon-rancher-backup-config.
- apps remains suspended until workload rollout is explicitly enabled.
- Ansible is limited to cluster bootstrap, prerequisite secret creation, pre-proxy Tailscale cleanup, and kubeconfig finalization.
- Weave GitOps / Flux UI is no longer deployed; use Rancher or the flux CLI for Flux operations.
Rancher access
- Rancher is private-only and exposed through Tailscale at https://rancher.silverside-gopher.ts.net/.
- Rancher and the Kubernetes API stay private; kube-vip provides the API VIP on the LAN.
- Rancher stores state in embedded etcd; no external database is used.
Stable baseline acceptance
A rebuild is considered successful only when all of the following pass without manual intervention:
- Terraform create succeeds for the default 3 control planes and 5 workers.
- Ansible bootstrap succeeds end-to-end.
- All nodes become Ready.
- Flux core reconciliation is healthy.
- External Secrets Operator is ready.
- Tailscale operator is ready.
- Tailnet smoke checks pass for Rancher, Grafana, and Prometheus.
- Terraform destroy succeeds cleanly or succeeds after workflow retries.
Observability Stack
Flux deploys a lightweight observability stack in the observability namespace:
- kube-prometheus-stack (Prometheus + Grafana)
- loki
- promtail
Grafana content is managed as code via ConfigMaps in infrastructure/addons/observability-content/.
Grafana and Prometheus are exposed through dedicated Tailscale LoadBalancer services when the Tailscale Kubernetes Operator is healthy.
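For reference, the Tailscale Kubernetes Operator picks up Services of type LoadBalancer that use the tailscale loadBalancerClass; a minimal hypothetical example (not one of this repo's manifests, names are placeholders):
# Hypothetical example of exposing a Service on the tailnet via the operator
kubectl apply -f - <<EOF
apiVersion: v1
kind: Service
metadata:
  name: example-tailscale
  namespace: observability
  annotations:
    tailscale.com/hostname: example   # becomes example.<tailnet>.ts.net
spec:
  type: LoadBalancer
  loadBalancerClass: tailscale
  selector:
    app: example
  ports:
    - port: 80
      targetPort: 3000
EOF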
Access Grafana and Prometheus
Preferred private access:
- Grafana: http://grafana.silverside-gopher.ts.net/
- Prometheus: http://prometheus.silverside-gopher.ts.net:9090/
Fallback (port-forward from a tailnet-connected machine):
export KUBECONFIG=$(pwd)/outputs/kubeconfig
kubectl -n observability port-forward svc/kube-prometheus-stack-grafana 3000:80
kubectl -n observability port-forward svc/kube-prometheus-stack-prometheus 9090:9090
Then open:
- Grafana: http://127.0.0.1:3000
- Prometheus: http://127.0.0.1:9090
Grafana user: admin
Grafana password: the value of the GRAFANA_ADMIN_PASSWORD secret (or the generated value shown in the Ansible output)
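If the generated password is no longer at hand, it can usually be read back from the Grafana admin secret in the cluster; the secret name below assumes the kube-prometheus-stack chart defaults:
# Read the Grafana admin password back from the cluster (assumes chart-default secret name)
export KUBECONFIG=$(pwd)/outputs/kubeconfig
kubectl -n observability get secret kube-prometheus-stack-grafana \
  -o jsonpath='{.data.admin-password}' | base64 -d; echo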
Verify Tailscale exposure
export KUBECONFIG=$(pwd)/outputs/kubeconfig
kubectl -n tailscale-system get pods
kubectl -n cattle-system get svc rancher-tailscale
kubectl -n observability get svc grafana-tailscale prometheus-tailscale
kubectl -n cattle-system describe svc rancher-tailscale | grep TailscaleProxyReady
kubectl -n observability describe svc grafana-tailscale | grep TailscaleProxyReady
kubectl -n observability describe svc prometheus-tailscale | grep TailscaleProxyReady
If TailscaleProxyReady=False, check:
kubectl -n tailscale-system logs deployment/operator --tail=100
Common cause: OAuth client missing tag/scopes permissions.
Fast dashboard iteration workflow
Use the Deploy Grafana Content workflow when changing dashboard/data source templates.
It avoids full cluster provisioning and only applies Grafana content resources:
- ansible/roles/observability-content/templates/grafana-datasources.yaml.j2
- ansible/roles/observability-content/templates/grafana-dashboard-k8s-overview.yaml.j2
- ansible/dashboards.yml
File Structure
.
├── terraform/
│ ├── main.tf
│ ├── variables.tf
│ ├── servers.tf
│ ├── outputs.tf
│ └── backend.tf
├── ansible/
│ ├── inventory.tmpl
│ ├── generate_inventory.py
│ ├── site.yml
│ ├── roles/
│ │ ├── common/
│ │ ├── k3s-server/
│ │ ├── k3s-agent/
│ │ ├── addon-secrets-bootstrap/
│ │ ├── observability-content/
│ │ └── observability/
│ └── ansible.cfg
├── .gitea/
│ └── workflows/
│ ├── terraform.yml
│ ├── ansible.yml
│ └── dashboards.yml
├── outputs/
├── terraform.tfvars.example
└── README.md
Firewall Rules
This repo no longer manages cloud firewalls. Access control is expected to be handled on your LAN infrastructure and through Tailscale.
Important cluster-local ports still in use:
| Port | Source | Purpose |
|---|---|---|
| 22 | Admin hosts / CI | SSH |
| 6443 | 10.27.27.0/24 + VIP | Kubernetes API |
| 9345 | 10.27.27.0/24 | k3s Supervisor |
| 2379 | 10.27.27.0/24 | etcd Client |
| 2380 | 10.27.27.0/24 | etcd Peer |
| 8472/udp | 10.27.27.0/24 | Flannel VXLAN |
| 10250 | 10.27.27.0/24 | Kubelet |
Operations
Scale Workers
Edit terraform.tfvars:
worker_count = 5
Then:
terraform apply
ansible-playbook site.yml
Upgrade k3s
ansible-playbook site.yml -t upgrade
Destroy Cluster
terraform destroy
Troubleshooting
Check k3s Logs
ssh ubuntu@<control-plane-ip> sudo journalctl -u k3s -f
Reset k3s
ansible-playbook site.yml -t reset
Security Notes
- Control plane has HA (3 nodes, can survive 1 failure)
- Kubernetes API HA is provided by kube-vip on 10.27.27.40
- Rotate API tokens regularly
- Use network policies in Kubernetes (a starter example is sketched below)
- Enable audit logging for production
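A common baseline is a default-deny ingress policy per namespace, with explicit allows layered on top; a minimal hypothetical example:
# Hypothetical baseline: deny all ingress in a namespace until explicitly allowed
kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: default   # apply per namespace as needed
spec:
  podSelector: {}
  policyTypes:
    - Ingress
EOF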
License
MIT