Hetzner Kubernetes Cluster
Production-ready Kubernetes cluster on Hetzner Cloud using Terraform and Ansible.
Architecture
| Component | Details |
|---|---|
| Control Plane | 3x CX23 (HA) |
| Workers | 4x CX33 |
| Total Cost | €28.93/mo |
| K8s | k3s (latest, HA) |
| Addons | Hetzner CCM + CSI + Prometheus + Grafana + Loki |
| Access | SSH/API restricted to Tailnet |
| Bootstrap | Terraform + Ansible |
Cluster Resources
- 22 vCPU total (6 CP + 16 workers)
- 44 GB RAM total (12 CP + 32 workers)
- 440 GB SSD storage
- 140 TB bandwidth allocation
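The totals above can be reproduced from per-node figures. The sketch below is a sanity check only: the per-node specs are assumptions inferred from the totals in this README, not values read from a Hetzner price list, and `cluster_totals` is a hypothetical helper.

```python
# Per-node specs are ASSUMPTIONS inferred from the totals listed above.
NODE_SPECS = {
    "cx23": {"vcpu": 2, "ram_gb": 4, "disk_gb": 40, "traffic_tb": 20},
    "cx33": {"vcpu": 4, "ram_gb": 8, "disk_gb": 80, "traffic_tb": 20},
}

# Node counts from the Architecture table: 3 control plane, 4 workers.
NODE_COUNTS = {"cx23": 3, "cx33": 4}


def cluster_totals(specs, counts):
    """Sum each resource across all nodes of every type."""
    totals = {"vcpu": 0, "ram_gb": 0, "disk_gb": 0, "traffic_tb": 0}
    for node_type, count in counts.items():
        for key in totals:
            totals[key] += specs[node_type][key] * count
    return totals


totals = cluster_totals(NODE_SPECS, NODE_COUNTS)
print(totals)
```

Changing `NODE_COUNTS` gives a quick estimate of the resources available after scaling.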
Prerequisites
1. Hetzner Cloud API Token
- Go to Hetzner Cloud Console
- Select your project (or create a new one)
- Navigate to Security → API Tokens
- Click Generate API Token
- Set description: k8s-cluster-terraform
- Select permissions: Read & Write
- Click Generate API Token
- Copy the token immediately - it won't be shown again!
2. Backblaze B2 Bucket (for Terraform State)
- Go to Backblaze B2
- Click Create a Bucket
- Set bucket name: k8s-terraform-state (must be globally unique)
- Choose Private access
- Click Create Bucket
- Create application key:
  - Go to App Keys → Add a New Application Key
  - Name: terraform-state
  - Allow access to: k8s-terraform-state bucket only
  - Type: Read and Write
  - Copy keyID (access key) and applicationKey (secret key)
- Note your bucket's S3 endpoint (e.g., https://s3.eu-central-003.backblazeb2.com)
3. SSH Key Pair
ssh-keygen -t ed25519 -C "k8s@hetzner" -f ~/.ssh/hetzner_k8s
4. Local Tools
Setup
1. Clone Repository
git clone <your-gitea-repo>/HetznerTerra.git
cd HetznerTerra
2. Configure Variables
cp terraform.tfvars.example terraform.tfvars
Edit terraform.tfvars:
hcloud_token = "your-hetzner-api-token"
ssh_public_key = "~/.ssh/hetzner_k8s.pub"
ssh_private_key = "~/.ssh/hetzner_k8s"
s3_access_key = "your-backblaze-key-id"
s3_secret_key = "your-backblaze-application-key"
s3_endpoint = "https://s3.eu-central-003.backblazeb2.com"
s3_bucket = "k8s-terraform-state"
tailscale_auth_key = "tskey-auth-..."
tailscale_tailnet = "yourtailnet.ts.net"
restrict_api_ssh_to_tailnet = true
tailnet_cidr = "100.64.0.0/10"
enable_nodeport_public = false
allowed_ssh_ips = []
allowed_api_ips = []
3. Initialize Terraform
cd terraform
# Create backend config file (or use CLI args)
cat > backend.hcl << EOF
endpoint = "https://s3.eu-central-003.backblazeb2.com"
bucket = "k8s-terraform-state"
access_key = "your-backblaze-key-id"
secret_key = "your-backblaze-application-key"
skip_requesting_account_id = true
EOF
terraform init -backend-config=backend.hcl
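Writing the keys into backend.hcl by hand leaves secrets in your shell history or editor buffers. As an alternative, the sketch below renders the same five settings from environment variables. The variable names mirror the Gitea secret names used later in this README; `render_backend_hcl` is a hypothetical helper, and the code assumes the values contain no double quotes.

```python
import os


def render_backend_hcl(env):
    """Build the backend.hcl text from a mapping of environment variables."""
    settings = {
        "endpoint": env["S3_ENDPOINT"],
        "bucket": env["S3_BUCKET"],
        "access_key": env["S3_ACCESS_KEY"],
        "secret_key": env["S3_SECRET_KEY"],
    }
    lines = [f'{key} = "{value}"' for key, value in settings.items()]
    # Same flag as in the hand-written backend.hcl above.
    lines.append("skip_requesting_account_id = true")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Raises KeyError if one of the four variables is not exported.
    with open("backend.hcl", "w", encoding="utf-8") as handle:
        handle.write(render_backend_hcl(os.environ))
```

Run it from the terraform directory before `terraform init -backend-config=backend.hcl`, and keep backend.hcl out of version control.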
4. Plan and Apply
terraform plan -var-file=../terraform.tfvars
terraform apply -var-file=../terraform.tfvars
5. Generate Ansible Inventory
cd ../ansible
python3 generate_inventory.py
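The repository's generate_inventory.py is not reproduced in this README. As a rough illustration of what such a step typically does, the sketch below turns the JSON printed by `terraform output -json` into an INI inventory. The output names `control_plane_ips` and `worker_ips` and the `workers` group name are assumptions; only the `control_plane` group is mentioned elsewhere in this document.

```python
import json


def build_inventory(tf_output_json):
    """Convert `terraform output -json` text into an INI-style inventory."""
    outputs = json.loads(tf_output_json)
    # Each Terraform output is an object with a "value" field.
    groups = {
        "control_plane": outputs["control_plane_ips"]["value"],  # assumed name
        "workers": outputs["worker_ips"]["value"],  # assumed name
    }
    lines = []
    for group, hosts in groups.items():
        lines.append(f"[{group}]")
        lines.extend(hosts)
        lines.append("")  # blank line between groups
    return "\n".join(lines)


sample = json.dumps({
    "control_plane_ips": {"value": ["10.0.0.2"]},
    "worker_ips": {"value": ["10.0.0.10"]},
})
print(build_inventory(sample))
```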
6. Bootstrap Cluster
ansible-playbook site.yml
7. Get Kubeconfig
export KUBECONFIG=$(pwd)/outputs/kubeconfig
kubectl get nodes
The kubeconfig endpoint is rewritten to the primary control-plane tailnet hostname (k8s-cluster-cp-1.<your-tailnet>).
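For illustration, an endpoint rewrite of this kind can be done with a single substitution on the `server:` line. This is a sketch, not the playbook's actual implementation: `rewrite_server` is a hypothetical helper, and the port 6443 default is taken from the firewall table in this README.

```python
import re


def rewrite_server(kubeconfig_text, hostname, port=6443):
    """Point every `server:` entry in a kubeconfig at the given host."""
    return re.sub(
        r"(?m)^([ \t]*server:[ \t]*)https://\S+$",
        # A function replacement avoids backslash surprises in the hostname.
        lambda match: f"{match.group(1)}https://{hostname}:{port}",
        kubeconfig_text,
    )


original = "clusters:\n- cluster:\n    server: https://127.0.0.1:6443\n  name: default\n"
print(rewrite_server(original, "k8s-cluster-cp-1.example.ts.net"))
```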
Gitea CI/CD
This repository includes Gitea workflows for:
- terraform-plan: Runs on PRs, shows planned changes
- terraform-apply: Runs on main branch after merge
- ansible-deploy: Runs after terraform apply
- dashboards: Fast workflow that updates Grafana datasources/dashboards only
Required Gitea Secrets
Set these in your Gitea repository settings (Settings → Secrets → Actions):
| Secret | Description |
|---|---|
| HCLOUD_TOKEN | Hetzner Cloud API token |
| S3_ACCESS_KEY | Backblaze B2 keyID |
| S3_SECRET_KEY | Backblaze B2 applicationKey |
| S3_ENDPOINT | Backblaze S3 endpoint (e.g., https://s3.eu-central-003.backblazeb2.com) |
| S3_BUCKET | S3 bucket name (e.g., k8s-terraform-state) |
| TAILSCALE_AUTH_KEY | Tailscale auth key for node bootstrap |
| TAILSCALE_TAILNET | Tailnet domain (e.g., yourtailnet.ts.net) |
| TAILSCALE_OAUTH_CLIENT_ID | Tailscale OAuth client ID for Kubernetes Operator |
| TAILSCALE_OAUTH_CLIENT_SECRET | Tailscale OAuth client secret for Kubernetes Operator |
| GRAFANA_ADMIN_PASSWORD | Optional admin password for Grafana (auto-generated if unset) |
| RUNNER_ALLOWED_CIDRS | Optional CIDR list for CI runner access if you choose to pass it via tfvars/secrets |
| SSH_PUBLIC_KEY | SSH public key content |
| SSH_PRIVATE_KEY | SSH private key content |
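A workflow that starts with a missing secret usually fails late and with an unhelpful message. The sketch below shows one way to fail fast by checking the environment up front. It assumes the secrets are exposed to the job as environment variables of the same name; `missing_secrets` is a hypothetical helper, and the two secrets marked optional above are deliberately left out.

```python
import os
import sys

# Non-optional secrets from the table above.
REQUIRED_SECRETS = [
    "HCLOUD_TOKEN",
    "S3_ACCESS_KEY",
    "S3_SECRET_KEY",
    "S3_ENDPOINT",
    "S3_BUCKET",
    "TAILSCALE_AUTH_KEY",
    "TAILSCALE_TAILNET",
    "TAILSCALE_OAUTH_CLIENT_ID",
    "TAILSCALE_OAUTH_CLIENT_SECRET",
    "SSH_PUBLIC_KEY",
    "SSH_PRIVATE_KEY",
]


def missing_secrets(env):
    """Return the required names that are unset or empty in `env`."""
    return [name for name in REQUIRED_SECRETS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_secrets(os.environ)
    if missing:
        sys.exit("Missing secrets: " + ", ".join(missing))
```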
GitOps (Flux)
This repo now includes a Flux GitOps layout for phased migration from imperative Ansible applies to continuous reconciliation.
Repository layout
- clusters/prod/: cluster entrypoint and Flux reconciliation objects
- clusters/prod/flux-system/: GitRepository source and top-level Kustomization graph
- infrastructure/: infrastructure addon reconciliation graph
- infrastructure/addons/*: per-addon manifests (observability + observability-content migrated)
- apps/: application workload layer (currently scaffolded)
Reconciliation graph
- infrastructure (top-level)
  - addon-ccm
  - addon-csi depends on addon-ccm
  - addon-tailscale-operator
  - addon-observability
  - addon-observability-content depends on addon-observability
- apps depends on infrastructure
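The "depends on" edges in this graph determine the order in which objects can become ready. The sketch below models only those edges with Python's standard `graphlib` module to show one valid order. It illustrates the ordering constraint and is not how Flux schedules reconciliation internally.

```python
from graphlib import TopologicalSorter

# Each key maps to the set of objects it depends on (from the graph above).
DEPENDS_ON = {
    "infrastructure": set(),
    "addon-ccm": set(),
    "addon-csi": {"addon-ccm"},
    "addon-tailscale-operator": set(),
    "addon-observability": set(),
    "addon-observability-content": {"addon-observability"},
    "apps": {"infrastructure"},
}

# static_order() yields every object after all of its dependencies.
order = list(TopologicalSorter(DEPENDS_ON).static_order())
print(order)
```

Objects with no edge between them, such as addon-ccm and addon-observability, have no required relative order.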
Bootstrap notes
- Install Flux controllers in flux-system.
- Create the Flux deploy key/secret named flux-system in the flux-system namespace.
- Apply clusters/prod/flux-system/ once to establish source + reconciliation graph.
- Unsuspend addon Kustomization objects one-by-one as each addon is migrated from Ansible.
Current migration status
- addon-observability-content is now GitOps-managed from infrastructure/addons/observability-content/.
- addon-observability is now GitOps-managed from infrastructure/addons/observability/ using Flux HelmRelease resources for:
  - kube-prometheus-stack
  - loki
  - promtail
- Remaining addons stay suspended until migrated.
- During transition, avoid applying Grafana content from both Flux and Ansible at the same time.
Ansible site.yml now skips the observability and observability-content roles when observability_gitops_enabled=true (the default).
Observability Stack
Flux deploys a lightweight observability stack in the observability namespace:
- kube-prometheus-stack (Prometheus + Grafana)
- loki
- promtail
Grafana content is managed as code via ConfigMaps in infrastructure/addons/observability-content/ (Flux), migrated from ansible/roles/observability-content/.
Grafana and Prometheus are exposed through a single Tailscale front door backed by Traefik when the Tailscale Kubernetes Operator is healthy.
Access Grafana and Prometheus
Preferred private access:
- Grafana: http://k8s-cluster-cp-1.<your-tailnet>:30080/
- Prometheus: http://k8s-cluster-cp-1.<your-tailnet>:30990/
- Flux UI: http://k8s-cluster-cp-1.<your-tailnet>:30901/
This access path is bootstrapped automatically by Ansible on control_plane[0] using persistent kubectl port-forward systemd services plus tailscale serve, so it survives cluster rebuilds.
Fallback: run each port-forward in its own terminal on a tailnet-connected machine (the command stays in the foreground):
export KUBECONFIG=$(pwd)/outputs/kubeconfig
kubectl -n observability port-forward svc/kube-prometheus-stack-grafana 3000:80
kubectl -n observability port-forward svc/kube-prometheus-stack-prometheus 9090:9090
Then open:
- Grafana: http://127.0.0.1:3000
- Prometheus: http://127.0.0.1:9090
Grafana user: admin
Grafana password: value of GRAFANA_ADMIN_PASSWORD secret (or the generated value shown by Ansible output)
Verify Tailscale exposure
export KUBECONFIG=$(pwd)/outputs/kubeconfig
kubectl -n tailscale-system get pods
kubectl -n observability get svc kube-prometheus-stack-grafana kube-prometheus-stack-prometheus
kubectl -n observability describe svc kube-prometheus-stack-grafana | grep TailscaleProxyReady
kubectl -n observability describe svc kube-prometheus-stack-prometheus | grep TailscaleProxyReady
If TailscaleProxyReady=False, check:
kubectl -n tailscale-system logs deployment/operator --tail=100
Common cause: the Tailscale OAuth client is missing the required tags or scopes.
Fast dashboard iteration workflow
Use the Deploy Grafana Content workflow when changing dashboard/data source templates.
It avoids full cluster provisioning and only applies Grafana content resources:
- ansible/roles/observability-content/templates/grafana-datasources.yaml.j2
- ansible/roles/observability-content/templates/grafana-dashboard-k8s-overview.yaml.j2
- ansible/dashboards.yml
File Structure
.
├── terraform/
│ ├── main.tf
│ ├── variables.tf
│ ├── network.tf
│ ├── firewall.tf
│ ├── ssh.tf
│ ├── servers.tf
│ ├── outputs.tf
│ └── backend.tf
├── ansible/
│ ├── inventory.tmpl
│ ├── generate_inventory.py
│ ├── site.yml
│ ├── roles/
│ │ ├── common/
│ │ ├── k3s-server/
│ │ ├── k3s-agent/
│ │ ├── ccm/
│ │ ├── csi/
│ │ ├── tailscale-operator/
│ │ ├── observability-content/
│ │ └── observability/
│ └── ansible.cfg
├── .gitea/
│ └── workflows/
│ ├── terraform.yml
│ ├── ansible.yml
│ └── dashboards.yml
├── outputs/
├── terraform.tfvars.example
└── README.md
Firewall Rules
| Port | Source | Purpose |
|---|---|---|
| 22 | Tailnet CIDR | SSH |
| 6443 | Tailnet CIDR + internal | Kubernetes API |
| 41641/udp | Any | Tailscale WireGuard |
| 9345 | 10.0.0.0/16 | k3s Supervisor (HA join) |
| 2379 | 10.0.0.0/16 | etcd Client |
| 2380 | 10.0.0.0/16 | etcd Peer |
| 8472 | 10.0.0.0/16 | Flannel VXLAN |
| 10250 | 10.0.0.0/16 | Kubelet |
| 30000-32767 | Optional | NodePorts (disabled by default) |
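To reason about whether a given source address should reach a port, the table can be encoded as data. The sketch below uses the standard `ipaddress` module. It is a simplified model and not the Terraform firewall definition: it ignores the TCP/UDP distinction, assumes "internal" means 10.0.0.0/16, uses the tailnet_cidr value from terraform.tfvars, and treats NodePorts as closed (the default).

```python
import ipaddress

TAILNET = ipaddress.ip_network("100.64.0.0/10")  # tailnet_cidr from tfvars
INTERNAL = ipaddress.ip_network("10.0.0.0/16")   # assumed meaning of "internal"
ANY = ipaddress.ip_network("0.0.0.0/0")

# Port -> networks allowed to connect, following the table above.
RULES = {
    22: [TAILNET],
    6443: [TAILNET, INTERNAL],
    41641: [ANY],
    9345: [INTERNAL],
    2379: [INTERNAL],
    2380: [INTERNAL],
    8472: [INTERNAL],
    10250: [INTERNAL],
}


def is_allowed(port, source_ip):
    """Return True if `source_ip` falls in a network allowed for `port`."""
    address = ipaddress.ip_address(source_ip)
    return any(address in network for network in RULES.get(port, []))


print(is_allowed(22, "100.101.102.103"))  # tailnet address reaching SSH
print(is_allowed(22, "203.0.113.5"))      # public address reaching SSH
```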
Operations
Scale Workers
Edit terraform.tfvars:
worker_count = 5
Then:
cd terraform && terraform apply -var-file=../terraform.tfvars
cd ../ansible && python3 generate_inventory.py && ansible-playbook site.yml
Upgrade k3s
ansible-playbook site.yml -t upgrade
Destroy Cluster
cd terraform && terraform destroy -var-file=../terraform.tfvars
Troubleshooting
Check k3s Logs
ssh root@<control-plane-ip> journalctl -u k3s -f
Reset k3s
ansible-playbook site.yml -t reset
Costs Breakdown
| Resource | Quantity | Unit Price | Monthly |
|---|---|---|---|
| CX23 (Control Plane) | 3 | €2.99 | €8.97 |
| CX33 (Workers) | 4 | €4.99 | €19.96 |
| Backblaze B2 | ~1 GB | Free (first 10GB) | €0.00 |
| Total | | | €28.93/mo |
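The total follows directly from the unit prices in the table. The sketch below recomputes it and shows the effect of changing worker_count. It uses `Decimal` so the cents stay exact, and `monthly_cost` is a hypothetical helper; the prices are the ones quoted in this README and may change.

```python
from decimal import Decimal

# Unit prices in EUR per month, as listed in the table above.
UNIT_PRICE_EUR = {"cx23": Decimal("2.99"), "cx33": Decimal("4.99")}


def monthly_cost(control_plane_count, worker_count):
    """Server cost per month; Backblaze B2 is within the free tier."""
    return (
        control_plane_count * UNIT_PRICE_EUR["cx23"]
        + worker_count * UNIT_PRICE_EUR["cx33"]
    )


print(monthly_cost(3, 4))  # current layout
print(monthly_cost(3, 5))  # after scaling to worker_count = 5
```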
Security Notes
- Control plane has HA (3 nodes, can survive 1 failure)
- Consider adding Hetzner load balancer for API server
- Rotate API tokens regularly
- Use network policies in Kubernetes
- Enable audit logging for production
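The "can survive 1 failure" figure comes from majority quorum: assuming the HA control plane uses k3s embedded etcd (the firewall table opens the etcd ports), the cluster stays writable only while more than half of the members are up. A minimal sketch of that arithmetic, with `etcd_fault_tolerance` as a hypothetical helper:

```python
def etcd_fault_tolerance(members):
    """Number of members that can fail while a majority remains."""
    quorum = members // 2 + 1
    return members - quorum


for size in (1, 3, 4, 5):
    print(size, "members tolerate", etcd_fault_tolerance(size), "failure(s)")
```

An even member count adds no tolerance over the next lower odd count, which is why control planes are usually sized at 3 or 5 nodes.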
License
MIT