Proxmox Kubernetes Cluster

Production-ready private Kubernetes cluster on Proxmox using Terraform, Ansible, and Flux.

Architecture

Component     | Details
Control Plane | 3x Proxmox VMs (2 vCPU / 4 GiB RAM / 32 GiB disk)
Workers       | 5x Proxmox VMs (4 vCPU / 8 GiB RAM / 64 GiB disk)
Kubernetes    | k3s (latest, HA)
Addons        | NFS provisioner + Prometheus + Grafana + Loki + Rancher
Access        | SSH/API and private services restricted to the Tailnet
Bootstrap     | Terraform + Ansible + Flux

Prerequisites

1. Proxmox API Token

Create an API token for the Proxmox VE user used by Terraform. The repo expects the bpg/proxmox provider with:

  • endpoint: https://100.105.0.115:8006/
  • node: flex
  • clone source: template 9000 (ubuntu-2404-k8s-template)
  • auth: API token
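
A minimal sketch of creating that token with pveum on the Proxmox host, assuming the user and token names used in the terraform.tfvars example later in this README; granting the built-in Administrator role is a simplification, so scope it down to your own policy:

# Run on the Proxmox node (e.g. "flex"); adjust names to taste.
pveum user add terraform-prov@pve
pveum aclmod / -user terraform-prov@pve -role Administrator   # simplification; narrow for production
pveum user token add terraform-prov@pve k8s-cluster --privsep 0
# -> token ID: terraform-prov@pve!k8s-cluster (the secret is printed once)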

2. Backblaze B2 Bucket (for Terraform State)

  1. Go to Backblaze B2
  2. Click Create a Bucket
  3. Set bucket name: k8s-terraform-state (must be globally unique)
  4. Choose Private access
  5. Click Create Bucket
  6. Create application key:
    • Go to App Keys → Add a New Application Key
    • Name: terraform-state
    • Allow access to: k8s-terraform-state bucket only
    • Type: Read and Write
    • Copy keyID (access key) and applicationKey (secret key)
  7. Note your bucket's S3 endpoint (e.g., https://s3.eu-central-003.backblazeb2.com)
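
Optionally, sanity-check the new key against the bucket with any S3-compatible client before wiring it into Terraform; for example with the AWS CLI, using the endpoint and names from the steps above:

AWS_ACCESS_KEY_ID=your-backblaze-key-id \
AWS_SECRET_ACCESS_KEY=your-backblaze-application-key \
aws s3 ls s3://k8s-terraform-state --endpoint-url https://s3.eu-central-003.backblazeb2.com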

3. SSH Key Pair

ssh-keygen -t ed25519 -C "k8s@proxmox" -f ~/.ssh/infra

4. Local Tools
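
The setup below uses Terraform, Ansible, Python 3, and kubectl; a quick way to confirm they are installed:

terraform version
ansible --version
python3 --version
kubectl version --client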

Setup

1. Clone Repository

git clone <your-gitea-repo>/HetznerTerra.git
cd HetznerTerra

2. Configure Variables

cp terraform.tfvars.example terraform.tfvars

Edit terraform.tfvars:

proxmox_endpoint         = "https://100.105.0.115:8006/"
proxmox_api_token_id     = "terraform-prov@pve!k8s-cluster"
proxmox_api_token_secret = "your-proxmox-token-secret"

ssh_public_key  = "~/.ssh/infra.pub"
ssh_private_key = "~/.ssh/infra"

s3_access_key = "your-backblaze-key-id"
s3_secret_key = "your-backblaze-application-key"
s3_endpoint   = "https://s3.eu-central-003.backblazeb2.com"
s3_bucket     = "k8s-terraform-state"

tailscale_auth_key = "tskey-auth-..."
tailscale_tailnet  = "yourtailnet.ts.net"

kube_api_vip = "10.27.27.40"

3. Initialize Terraform

cd terraform

# Create backend config file (or use CLI args)
cat > backend.hcl << EOF
endpoint                    = "https://s3.eu-central-003.backblazeb2.com"
bucket                      = "k8s-terraform-state"
access_key                  = "your-backblaze-key-id"
secret_key                  = "your-backblaze-application-key"
skip_requesting_account_id  = true
EOF

terraform init -backend-config=backend.hcl

4. Plan and Apply

terraform plan -var-file=../terraform.tfvars
terraform apply -var-file=../terraform.tfvars

5. Generate Ansible Inventory

cd ../ansible
python3 generate_inventory.py
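
Optionally, confirm the generated inventory resolves and the nodes are reachable over SSH before bootstrapping (this assumes ansible.cfg points at the generated inventory file):

ansible all -m ping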

6. Bootstrap Cluster

ansible-playbook site.yml

7. Get Kubeconfig

export KUBECONFIG=$(pwd)/outputs/kubeconfig
kubectl get nodes

Use scripts/refresh-kubeconfig.sh <cp1-public-ip> to refresh kubeconfig against the primary control-plane public IP after rebuilds.

Gitea CI/CD

This repository includes Gitea workflows for:

  • deploy: End-to-end Terraform + Ansible + Flux bootstrap + restore + health checks
  • destroy: Cluster teardown with backup-aware cleanup
  • dashboards: Fast workflow that updates Grafana datasources/dashboards only

Required Gitea Secrets

Set these in your Gitea repository settings (Settings → Secrets → Actions):

Secret | Description
PROXMOX_ENDPOINT | Proxmox API endpoint (for example https://100.105.0.115:8006/)
PROXMOX_API_TOKEN_ID | Proxmox API token ID
PROXMOX_API_TOKEN_SECRET | Proxmox API token secret
S3_ACCESS_KEY | Backblaze B2 keyID
S3_SECRET_KEY | Backblaze B2 applicationKey
S3_ENDPOINT | Backblaze S3 endpoint (e.g., https://s3.eu-central-003.backblazeb2.com)
S3_BUCKET | S3 bucket name (e.g., k8s-terraform-state)
TAILSCALE_AUTH_KEY | Tailscale auth key for node bootstrap
TAILSCALE_TAILNET | Tailnet domain (e.g., yourtailnet.ts.net)
TAILSCALE_OAUTH_CLIENT_ID | Tailscale OAuth client ID for the Kubernetes Operator
TAILSCALE_OAUTH_CLIENT_SECRET | Tailscale OAuth client secret for the Kubernetes Operator
DOPPLER_HETZNERTERRA_SERVICE_TOKEN | Doppler service token for hetznerterra runtime secrets
GRAFANA_ADMIN_PASSWORD | Optional admin password for Grafana (auto-generated if unset)
SSH_PUBLIC_KEY | SSH public key content
SSH_PRIVATE_KEY | SSH private key content

GitOps (Flux)

This repo uses Flux for continuous reconciliation after Terraform + Ansible bootstrap.

Stable private-only baseline

The current default target is the HA private baseline:

  • 3 control plane nodes
  • 5 worker nodes
  • private Proxmox network only
  • Tailscale for operator and service access
  • Flux-managed platform addons with apps suspended by default

Detailed phase gates and success criteria live in STABLE_BASELINE.md.

This is the default until rebuilds are consistently green. High availability, public ingress, and app-layer expansion come later.

Runtime secrets

Runtime cluster secrets are moving to Doppler + External Secrets Operator.

  • Doppler project: hetznerterra
  • Initial auth: service token via DOPPLER_HETZNERTERRA_SERVICE_TOKEN
  • First synced secrets:
    • GRAFANA_ADMIN_PASSWORD

Terraform/bootstrap secrets remain in Gitea Actions secrets and are not managed by Doppler.
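
As a rough sketch of how the first synced secret could be modelled once a Doppler-backed (Cluster)SecretStore exists; the resource names and namespace here are illustrative, not the repo's actual manifests:

apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: grafana-admin            # illustrative name
  namespace: observability
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: doppler                # assumes a Doppler-backed store is defined elsewhere
  target:
    name: grafana-admin
  data:
    - secretKey: admin-password
      remoteRef:
        key: GRAFANA_ADMIN_PASSWORD   # key in the Doppler "hetznerterra" project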

Repository layout

  • clusters/prod/: cluster entrypoint and Flux reconciliation objects
  • clusters/prod/flux-system/: GitRepository source and top-level Kustomization graph
  • infrastructure/: infrastructure addon reconciliation graph
  • infrastructure/addons/*: per-addon manifests for Flux-managed cluster addons
  • apps/: application workload layer (currently scaffolded)

Reconciliation graph

  • infrastructure (top-level)
    • addon-nfs-storage
    • addon-tailscale-operator
    • addon-observability
    • addon-observability-content depends on addon-observability
  • apps depends on infrastructure
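
For reference, an edge such as "addon-observability-content depends on addon-observability" is expressed with dependsOn in a Flux Kustomization; a trimmed sketch, with the interval and path as illustrative values rather than copies from the repo:

apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: addon-observability-content
  namespace: flux-system
spec:
  interval: 10m
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system
  path: ./infrastructure/addons/observability-content   # illustrative path
  dependsOn:
    - name: addon-observability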

Bootstrap notes

  1. Install Flux controllers in flux-system.
  2. Create the Flux deploy key/secret named flux-system in flux-system namespace.
  3. Apply clusters/prod/flux-system/ once to establish source + reconciliation graph.
  4. Bootstrap-only Ansible creates prerequisite secrets; Flux manages addon lifecycle after bootstrap.
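
A condensed sketch of those steps with the flux CLI and kubectl; the repository URL and deploy key path are placeholders:

# Step 1: install the Flux controllers
flux install --namespace=flux-system
# Step 2: deploy key secret used by the GitRepository source
flux create secret git flux-system \
  --namespace=flux-system \
  --url=ssh://git@<your-gitea-host>/<org>/HetznerTerra.git \
  --private-key-file=./flux-deploy-key
# Step 3: establish the source and top-level Kustomization graph
kubectl apply -k clusters/prod/flux-system/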

Current addon status

  • Core infrastructure addons are Flux-managed from infrastructure/addons/.
  • Active Flux addons for the current baseline: addon-nfs-storage, addon-cert-manager, addon-external-secrets, addon-tailscale-operator, addon-tailscale-proxyclass, addon-observability, addon-observability-content, addon-rancher, addon-rancher-config, addon-rancher-backup, addon-rancher-backup-config.
  • apps remains suspended until workload rollout is explicitly enabled.
  • Ansible is limited to cluster bootstrap, prerequisite secret creation, pre-proxy Tailscale cleanup, and kubeconfig finalization.
  • Weave GitOps / Flux UI is no longer deployed; use Rancher or the flux CLI for Flux operations.
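
Typical flux CLI equivalents for day-to-day operations against this layout:

flux get kustomizations -A                                 # addon reconciliation status
flux reconcile kustomization infrastructure --with-source  # force a sync of the addon graph
flux resume kustomization apps                             # only once workload rollout is enabled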

Rancher access

  • Rancher is private-only and exposed through Tailscale at https://rancher.silverside-gopher.ts.net/.
  • Rancher and the Kubernetes API stay private; kube-vip provides the API VIP on the LAN.
  • Rancher stores state in embedded etcd; no external database is used.

Stable baseline acceptance

A rebuild is considered successful only when all of the following pass without manual intervention:

  • Terraform create succeeds for the default 3 control planes and 5 workers.
  • Ansible bootstrap succeeds end-to-end.
  • All nodes become Ready.
  • Flux core reconciliation is healthy.
  • External Secrets Operator is ready.
  • Tailscale operator is ready.
  • Tailnet smoke checks pass for Rancher, Grafana, and Prometheus.
  • Terraform destroy succeeds cleanly or succeeds after workflow retries.
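
A manual spot-check along the same lines; the tailnet hostnames come from the access sections below, and the external-secrets grep is an assumption about where ESO runs:

export KUBECONFIG=$(pwd)/outputs/kubeconfig
kubectl get nodes                                  # all Ready
flux check                                         # Flux core healthy
kubectl -n tailscale-system get pods               # Tailscale operator up
kubectl get pods -A | grep external-secrets        # ESO pods running (namespace assumed)
# Tailnet smoke checks, run from a tailnet-connected machine:
curl -skI https://rancher.silverside-gopher.ts.net/ | head -n1
curl -sI  http://grafana.silverside-gopher.ts.net/ | head -n1
curl -sI  http://prometheus.silverside-gopher.ts.net:9090/ | head -n1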

Observability Stack

Flux deploys a lightweight observability stack in the observability namespace:

  • kube-prometheus-stack (Prometheus + Grafana)
  • loki
  • promtail

Grafana content is managed as code via ConfigMaps in infrastructure/addons/observability-content/.

Grafana and Prometheus are exposed through dedicated Tailscale LoadBalancer services when the Tailscale Kubernetes Operator is healthy.
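
For reference, the Tailscale operator picks up a Service when it is type LoadBalancer with the tailscale load-balancer class; a trimmed sketch of what such a Service can look like (the repo's actual manifests live under infrastructure/addons/, and the selector and hostname annotation here are assumptions):

apiVersion: v1
kind: Service
metadata:
  name: grafana-tailscale
  namespace: observability
  annotations:
    tailscale.com/hostname: grafana        # assumed; becomes grafana.<tailnet>.ts.net
spec:
  type: LoadBalancer
  loadBalancerClass: tailscale
  selector:
    app.kubernetes.io/name: grafana        # illustrative selector
  ports:
    - port: 80
      targetPort: 3000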

Access Grafana and Prometheus

Preferred private access:

  • Grafana: http://grafana.silverside-gopher.ts.net/
  • Prometheus: http://prometheus.silverside-gopher.ts.net:9090/

Fallback (port-forward from a tailnet-connected machine):

Run from a tailnet-connected machine:

export KUBECONFIG=$(pwd)/outputs/kubeconfig

kubectl -n observability port-forward svc/kube-prometheus-stack-grafana 3000:80
kubectl -n observability port-forward svc/kube-prometheus-stack-prometheus 9090:9090

Then open:

  • Grafana: http://localhost:3000/
  • Prometheus: http://localhost:9090/

Grafana user: admin
Grafana password: value of the GRAFANA_ADMIN_PASSWORD secret (or the generated value shown in the Ansible output)

Verify Tailscale exposure

export KUBECONFIG=$(pwd)/outputs/kubeconfig

kubectl -n tailscale-system get pods
kubectl -n cattle-system get svc rancher-tailscale
kubectl -n observability get svc grafana-tailscale prometheus-tailscale
kubectl -n cattle-system describe svc rancher-tailscale | grep TailscaleProxyReady
kubectl -n observability describe svc grafana-tailscale | grep TailscaleProxyReady
kubectl -n observability describe svc prometheus-tailscale | grep TailscaleProxyReady

If TailscaleProxyReady=False, check:

kubectl -n tailscale-system logs deployment/operator --tail=100

Common cause: OAuth client missing tag/scopes permissions.

Fast dashboard iteration workflow

Use the Deploy Grafana Content workflow when changing dashboard/data source templates. It avoids full cluster provisioning and only applies Grafana content resources:

  • ansible/roles/observability-content/templates/grafana-datasources.yaml.j2
  • ansible/roles/observability-content/templates/grafana-dashboard-k8s-overview.yaml.j2
  • ansible/dashboards.yml
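
Outside CI, running the dashboards playbook directly should apply the same content (assuming ansible/dashboards.yml is a standalone playbook, as the workflow suggests):

cd ansible
ansible-playbook dashboards.yml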

File Structure

.
├── terraform/
│   ├── main.tf
│   ├── variables.tf
│   ├── servers.tf
│   ├── outputs.tf
│   └── backend.tf
├── ansible/
│   ├── inventory.tmpl
│   ├── generate_inventory.py
│   ├── site.yml
│   ├── roles/
│   │   ├── common/
│   │   ├── k3s-server/
│   │   ├── k3s-agent/
│   │   ├── addon-secrets-bootstrap/
│   │   ├── observability-content/
│   │   └── observability/
│   └── ansible.cfg
├── .gitea/
│   └── workflows/
│       ├── terraform.yml
│       ├── ansible.yml
│       └── dashboards.yml
├── outputs/
├── terraform.tfvars.example
└── README.md

Firewall Rules

This repo no longer manages cloud firewalls. Access control is expected to be handled on your LAN infrastructure and through Tailscale.

Important cluster-local ports still in use:

Port     | Source              | Purpose
22       | Admin hosts / CI    | SSH
6443     | 10.27.27.0/24 + VIP | Kubernetes API
9345     | 10.27.27.0/24       | k3s Supervisor
2379     | 10.27.27.0/24       | etcd Client
2380     | 10.27.27.0/24       | etcd Peer
8472/udp | 10.27.27.0/24       | Flannel VXLAN
10250    | 10.27.27.0/24       | Kubelet

Operations

Scale Workers

Edit terraform.tfvars:

worker_count = 5

Then:

terraform apply
ansible-playbook site.yml
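
The new workers should register shortly after the playbook finishes:

export KUBECONFIG=$(pwd)/outputs/kubeconfig
kubectl get nodes -o wide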

Upgrade k3s

ansible-playbook site.yml -t upgrade

Destroy Cluster

terraform destroy

Troubleshooting

Check k3s Logs

ssh ubuntu@<control-plane-ip> sudo journalctl -u k3s -f

Reset k3s

ansible-playbook site.yml -t reset

Security Notes

  • The control plane is highly available (3 nodes; tolerates one node failure)
  • Kubernetes API HA is provided by kube-vip on 10.27.27.40
  • Rotate API tokens regularly
  • Use network policies in Kubernetes
  • Enable audit logging for production

License

MIT
