
Hetzner Kubernetes Cluster

Production-ready Kubernetes cluster on Hetzner Cloud using Terraform and Ansible.

Architecture

| Component | Details |
|---|---|
| Control Plane | 3× CX23 (HA) |
| Workers | 4× CX33 |
| Total Cost | €28.93/mo |
| K8s | k3s (latest, HA) |
| Addons | Hetzner CCM + CSI + Prometheus + Grafana + Loki |
| Access | SSH/API and Rancher UI restricted to Tailnet |
| Bootstrap | Terraform + Ansible |

Cluster Resources

  • 22 vCPU total (6 CP + 16 workers)
  • 44 GB RAM total (12 CP + 32 workers)
  • 440 GB SSD storage
  • 140 TB bandwidth allocation
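The totals above follow from per-node specs. A quick arithmetic check (the per-type figures are inferred from the totals in this README, so treat them as assumptions rather than official Hetzner specs):

```python
# Sanity-check the cluster resource totals from per-node specs.
# Specs inferred from this README's totals, not taken from Hetzner docs.
cx23 = {"vcpu": 2, "ram_gb": 4, "ssd_gb": 40, "traffic_tb": 20}  # control plane
cx33 = {"vcpu": 4, "ram_gb": 8, "ssd_gb": 80, "traffic_tb": 20}  # worker

cp_count, worker_count = 3, 4

totals = {key: cp_count * cx23[key] + worker_count * cx33[key] for key in cx23}
print(totals)  # {'vcpu': 22, 'ram_gb': 44, 'ssd_gb': 440, 'traffic_tb': 140}
```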

Prerequisites

1. Hetzner Cloud API Token

  1. Go to Hetzner Cloud Console
  2. Select your project (or create a new one)
  3. Navigate to Security → API Tokens
  4. Click Generate API Token
  5. Set description: k8s-cluster-terraform
  6. Select permissions: Read & Write
  7. Click Generate API Token
  8. Copy the token immediately - it won't be shown again!

2. Backblaze B2 Bucket (for Terraform State)

  1. Go to Backblaze B2
  2. Click Create a Bucket
  3. Set bucket name: k8s-terraform-state (must be globally unique)
  4. Choose Private access
  5. Click Create Bucket
  6. Create application key:
    • Go to App Keys → Add a New Application Key
    • Name: terraform-state
    • Allow access to: k8s-terraform-state bucket only
    • Type: Read and Write
    • Copy keyID (access key) and applicationKey (secret key)
  7. Note your bucket's S3 endpoint (e.g., https://s3.eu-central-003.backblazeb2.com)

3. SSH Key Pair

ssh-keygen -t ed25519 -C "k8s@hetzner" -f ~/.ssh/hetzner_k8s

4. Local Tools

You'll need Terraform, Ansible, kubectl, Git, and Python 3 installed locally; all are used by the steps below.

Setup

1. Clone Repository

git clone <your-gitea-repo>/HetznerTerra.git
cd HetznerTerra

2. Configure Variables

cp terraform.tfvars.example terraform.tfvars

Edit terraform.tfvars:

hcloud_token = "your-hetzner-api-token"

ssh_public_key  = "~/.ssh/hetzner_k8s.pub"
ssh_private_key = "~/.ssh/hetzner_k8s"

s3_access_key = "your-backblaze-key-id"
s3_secret_key = "your-backblaze-application-key"
s3_endpoint   = "https://s3.eu-central-003.backblazeb2.com"
s3_bucket     = "k8s-terraform-state"

tailscale_auth_key = "tskey-auth-..."
tailscale_tailnet  = "yourtailnet.ts.net"

restrict_api_ssh_to_tailnet = true
tailnet_cidr                = "100.64.0.0/10"
enable_nodeport_public      = false

allowed_ssh_ips = []
allowed_api_ips = []

3. Initialize Terraform

cd terraform

# Create backend config file (or use CLI args)
cat > backend.hcl << EOF
endpoint                    = "https://s3.eu-central-003.backblazeb2.com"
bucket                      = "k8s-terraform-state"
access_key                  = "your-backblaze-key-id"
secret_key                  = "your-backblaze-application-key"
skip_requesting_account_id  = true
EOF

terraform init -backend-config=backend.hcl

4. Plan and Apply

terraform plan -var-file=../terraform.tfvars
terraform apply -var-file=../terraform.tfvars

5. Generate Ansible Inventory

cd ../ansible
python3 generate_inventory.py
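generate_inventory.py turns Terraform outputs into an Ansible inventory. A minimal sketch of the idea (the output names `control_plane_ips` and `worker_ips` and the INI layout are illustrative; the real script may differ):

```python
def build_inventory(outputs: dict) -> str:
    """Render an INI-style Ansible inventory from `terraform output -json` data."""
    lines = ["[control_plane]"]
    lines += outputs["control_plane_ips"]["value"]
    lines += ["", "[workers]"]
    lines += outputs["worker_ips"]["value"]
    return "\n".join(lines) + "\n"

# `terraform output -json` emits {"name": {"value": ...}, ...}; sample shape:
sample = {
    "control_plane_ips": {"value": ["10.0.1.1"]},
    "worker_ips": {"value": ["10.0.2.1", "10.0.2.2"]},
}
print(build_inventory(sample))
```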

6. Bootstrap Cluster

ansible-playbook site.yml

7. Get Kubeconfig

export KUBECONFIG=$(pwd)/outputs/kubeconfig
kubectl get nodes

Kubeconfig endpoint is rewritten to the primary control-plane tailnet hostname (k8s-cluster-cp-1.<your-tailnet>).
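The rewrite amounts to swapping the host in the kubeconfig's `server:` entry while keeping the port. A sketch of that transformation (the real logic lives in the bootstrap tooling; this regex-based version is illustrative):

```python
import re

def rewrite_server(kubeconfig: str, host: str) -> str:
    """Point the kubeconfig's server entry at the tailnet hostname, keeping the port."""
    return re.sub(
        r"(server:\s+https://)[^:\s]+(:\d+)",
        rf"\g<1>{host}\g<2>",
        kubeconfig,
    )

sample = "    server: https://10.0.1.1:6443"
print(rewrite_server(sample, "k8s-cluster-cp-1.yourtailnet.ts.net"))
# →     server: https://k8s-cluster-cp-1.yourtailnet.ts.net:6443
```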

Gitea CI/CD

This repository includes Gitea workflows for:

  • terraform-plan: Runs on PRs, shows planned changes
  • terraform-apply: Runs on main branch after merge
  • ansible-deploy: Runs after terraform apply
  • dashboards: Fast workflow that updates Grafana datasources/dashboards only

Required Gitea Secrets

Set these in your Gitea repository settings (Settings → Actions → Secrets):

| Secret | Description |
|---|---|
| HCLOUD_TOKEN | Hetzner Cloud API token |
| S3_ACCESS_KEY | Backblaze B2 keyID |
| S3_SECRET_KEY | Backblaze B2 applicationKey |
| S3_ENDPOINT | Backblaze S3 endpoint (e.g., https://s3.eu-central-003.backblazeb2.com) |
| S3_BUCKET | S3 bucket name (e.g., k8s-terraform-state) |
| TAILSCALE_AUTH_KEY | Tailscale auth key for node bootstrap |
| TAILSCALE_TAILNET | Tailnet domain (e.g., yourtailnet.ts.net) |
| TAILSCALE_OAUTH_CLIENT_ID | Tailscale OAuth client ID for the Kubernetes Operator |
| TAILSCALE_OAUTH_CLIENT_SECRET | Tailscale OAuth client secret for the Kubernetes Operator |
| DOPPLER_HETZNERTERRA_SERVICE_TOKEN | Doppler service token for hetznerterra runtime secrets |
| GRAFANA_ADMIN_PASSWORD | Optional admin password for Grafana (auto-generated if unset) |
| RUNNER_ALLOWED_CIDRS | Optional CIDR list for CI runner access, if passed via tfvars/secrets |
| SSH_PUBLIC_KEY | SSH public key content |
| SSH_PRIVATE_KEY | SSH private key content |
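A small preflight check in the spirit of this table, assuming the secrets are exported as environment variables in the runner (the names come from the table; the check itself is an illustrative sketch, not part of the workflows):

```python
import os

# Required secrets from the table above (optional ones omitted).
REQUIRED = [
    "HCLOUD_TOKEN", "S3_ACCESS_KEY", "S3_SECRET_KEY", "S3_ENDPOINT",
    "S3_BUCKET", "TAILSCALE_AUTH_KEY", "TAILSCALE_TAILNET",
    "SSH_PUBLIC_KEY", "SSH_PRIVATE_KEY",
]

def missing_secrets(env) -> list:
    """Return required secret names that are absent or empty."""
    return [name for name in REQUIRED if not env.get(name)]

# In a CI runner, os.environ would carry the secrets; here we just report.
print("missing:", missing_secrets(os.environ))
```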

GitOps (Flux)

This repo uses Flux for continuous reconciliation after Terraform + Ansible bootstrap.

Stable private-only baseline

The current default target is a deliberately simplified baseline:

  • 1 control plane node
  • 2 worker nodes
  • private Hetzner network only
  • Tailscale for operator access
  • Flux-managed core addons only

Detailed phase gates and success criteria live in STABLE_BASELINE.md.

This is the default until rebuilds are consistently green. High availability, public ingress, and app-layer expansion come later.

Runtime secrets

Runtime cluster secrets are moving to Doppler + External Secrets Operator.

  • Doppler project: hetznerterra
  • Initial auth: service token via DOPPLER_HETZNERTERRA_SERVICE_TOKEN
  • First synced secrets:
    • GRAFANA_ADMIN_PASSWORD
    • WEAVE_GITOPS_ADMIN_USERNAME
    • WEAVE_GITOPS_ADMIN_PASSWORD_BCRYPT_HASH

Terraform/bootstrap secrets remain in Gitea Actions secrets and are not managed by Doppler.

Repository layout

  • clusters/prod/: cluster entrypoint and Flux reconciliation objects
  • clusters/prod/flux-system/: GitRepository source and top-level Kustomization graph
  • infrastructure/: infrastructure addon reconciliation graph
  • infrastructure/addons/*: per-addon manifests for Flux-managed cluster addons
  • apps/: application workload layer (currently scaffolded)

Reconciliation graph

  • infrastructure (top-level)
    • addon-ccm
    • addon-csi depends on addon-ccm
    • addon-tailscale-operator
    • addon-observability
    • addon-observability-content depends on addon-observability
  • apps depends on infrastructure

Bootstrap notes

  1. Install Flux controllers in flux-system.
  2. Create the Flux deploy key/secret named flux-system in flux-system namespace.
  3. Apply clusters/prod/flux-system/ once to establish source + reconciliation graph.
  4. Bootstrap-only Ansible creates prerequisite secrets; Flux manages addon lifecycle after bootstrap.

Current addon status

  • Core infrastructure addons are Flux-managed from infrastructure/addons/.
  • Active Flux addons for stable baseline: addon-tailscale-operator, addon-tailscale-proxyclass, addon-external-secrets.
  • Deferred addons: addon-ccm, addon-csi, addon-observability, addon-observability-content (to be added after baseline is stable).
  • Ansible is limited to cluster bootstrap, private-access setup, and prerequisite secret creation for Flux-managed addons.
  • addon-flux-ui is optional for the stable-baseline phase and is not a blocker for rebuild success.

Rancher access

  • Rancher is private-only and exposed through Tailscale at https://rancher.silverside-gopher.ts.net/dashboard/.
  • The public Hetzner load balancer path is not used for Rancher.
  • Rancher uses the CNPG-backed PostgreSQL cluster in cnpg-cluster.

Stable baseline acceptance

A rebuild is considered successful only when all of the following pass without manual intervention:

  • Terraform create succeeds for the default 1 control plane and 2 workers.
  • Ansible bootstrap succeeds end-to-end.
  • All nodes become Ready.
  • Flux core reconciliation is healthy.
  • External Secrets Operator is ready.
  • Tailscale operator is ready.
  • Terraform destroy succeeds cleanly or succeeds after workflow retries.

Note: Observability stack (Grafana/Prometheus) is deferred and will be added once the core platform baseline is stable.

Observability Stack

Flux deploys a lightweight observability stack in the observability namespace:

  • kube-prometheus-stack (Prometheus + Grafana)
  • loki
  • promtail

Grafana content is managed as code via ConfigMaps in infrastructure/addons/observability-content/.

Grafana and Prometheus are exposed through a single Tailscale front door backed by Traefik when the Tailscale Kubernetes Operator is healthy.

Access Grafana and Prometheus

Preferred private access:

  • Grafana: http://k8s-cluster-cp-1.<your-tailnet>:30080/
  • Prometheus: http://k8s-cluster-cp-1.<your-tailnet>:30990/
  • Flux UI: http://k8s-cluster-cp-1.<your-tailnet>:30901/

This access path is bootstrapped automatically by Ansible on control_plane[0] using persistent kubectl port-forward systemd services plus tailscale serve, so it survives cluster rebuilds.

Fallback (port-forward from a tailnet-connected machine):

Run from a tailnet-connected machine:

export KUBECONFIG=$(pwd)/outputs/kubeconfig

kubectl -n observability port-forward svc/kube-prometheus-stack-grafana 3000:80
kubectl -n observability port-forward svc/kube-prometheus-stack-prometheus 9090:9090

Then open:

  • Grafana: http://localhost:3000 (user: admin; password: the value of the GRAFANA_ADMIN_PASSWORD secret, or the generated value shown in the Ansible output)
  • Prometheus: http://localhost:9090

Verify Tailscale exposure

export KUBECONFIG=$(pwd)/outputs/kubeconfig

kubectl -n tailscale-system get pods
kubectl -n observability get svc kube-prometheus-stack-grafana kube-prometheus-stack-prometheus
kubectl -n observability describe svc kube-prometheus-stack-grafana | grep TailscaleProxyReady
kubectl -n observability describe svc kube-prometheus-stack-prometheus | grep TailscaleProxyReady

If TailscaleProxyReady=False, check:

kubectl -n tailscale-system logs deployment/operator --tail=100

Common cause: OAuth client missing tag/scopes permissions.

Fast dashboard iteration workflow

Use the Deploy Grafana Content workflow when changing dashboard/data source templates. It avoids full cluster provisioning and only applies Grafana content resources:

  • ansible/roles/observability-content/templates/grafana-datasources.yaml.j2
  • ansible/roles/observability-content/templates/grafana-dashboard-k8s-overview.yaml.j2
  • ansible/dashboards.yml

File Structure

.
├── terraform/
│   ├── main.tf
│   ├── variables.tf
│   ├── network.tf
│   ├── firewall.tf
│   ├── ssh.tf
│   ├── servers.tf
│   ├── outputs.tf
│   └── backend.tf
├── ansible/
│   ├── inventory.tmpl
│   ├── generate_inventory.py
│   ├── site.yml
│   ├── roles/
│   │   ├── common/
│   │   ├── k3s-server/
│   │   ├── k3s-agent/
│   │   ├── addon-secrets-bootstrap/
│   │   ├── observability-content/
│   │   └── observability/
│   └── ansible.cfg
├── .gitea/
│   └── workflows/
│       ├── terraform.yml
│       ├── ansible.yml
│       └── dashboards.yml
├── outputs/
├── terraform.tfvars.example
└── README.md

Firewall Rules

| Port | Source | Purpose |
|---|---|---|
| 22 | Tailnet CIDR | SSH |
| 6443 | Tailnet CIDR + internal | Kubernetes API |
| 41641/udp | Any | Tailscale WireGuard |
| 9345 | 10.0.0.0/16 | k3s Supervisor (HA join) |
| 2379 | 10.0.0.0/16 | etcd Client |
| 2380 | 10.0.0.0/16 | etcd Peer |
| 8472 | 10.0.0.0/16 | Flannel VXLAN |
| 10250 | 10.0.0.0/16 | Kubelet |
| 30000-32767 | Optional | NodePorts (disabled by default) |
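The rules above can be sanity-checked with the stdlib `ipaddress` module. A sketch, assuming "Tailnet CIDR" is the 100.64.0.0/10 range from terraform.tfvars and omitting NodePorts (disabled by default):

```python
import ipaddress

# (port, allowed source network) pairs from the firewall table
RULES = [
    (22, "100.64.0.0/10"),     # SSH from tailnet
    (6443, "100.64.0.0/10"),   # Kubernetes API from tailnet
    (6443, "10.0.0.0/16"),     # Kubernetes API, internal
    (41641, "0.0.0.0/0"),      # Tailscale WireGuard (UDP)
    (9345, "10.0.0.0/16"),     # k3s supervisor (HA join)
    (2379, "10.0.0.0/16"),     # etcd client
    (2380, "10.0.0.0/16"),     # etcd peer
    (8472, "10.0.0.0/16"),     # Flannel VXLAN
    (10250, "10.0.0.0/16"),    # kubelet
]

def is_allowed(port: int, src_ip: str) -> bool:
    """True if any rule permits traffic to `port` from `src_ip`."""
    ip = ipaddress.ip_address(src_ip)
    return any(port == p and ip in ipaddress.ip_network(net)
               for p, net in RULES)

print(is_allowed(6443, "100.64.0.5"))  # True  (tailnet)
print(is_allowed(22, "203.0.113.7"))   # False (public internet)
```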

Operations

Scale Workers

Edit terraform.tfvars:

worker_count = 5

Then:

terraform apply
ansible-playbook site.yml

Upgrade k3s

ansible-playbook site.yml -t upgrade

Destroy Cluster

terraform destroy

Troubleshooting

Check k3s Logs

ssh root@<control-plane-ip> journalctl -u k3s -f

Reset k3s

ansible-playbook site.yml -t reset

Costs Breakdown

| Resource | Quantity | Unit Price | Monthly |
|---|---|---|---|
| CX23 (Control Plane) | 3 | €2.99 | €8.97 |
| CX33 (Workers) | 4 | €4.99 | €19.96 |
| Backblaze B2 | ~1 GB | Free (first 10 GB) | €0.00 |
| **Total** | | | **€28.93/mo** |
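The total checks out against the unit prices:

```python
# Monthly cost (euros) from the unit prices above.
control_plane = 3 * 2.99  # CX23
workers = 4 * 4.99        # CX33
b2_state = 0.00           # Backblaze B2, within free tier

total = round(control_plane + workers + b2_state, 2)
print(f"€{total}/mo")  # €28.93/mo
```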

Security Notes

  • Control plane has HA (3 nodes, can survive 1 failure)
  • Consider adding Hetzner load balancer for API server
  • Rotate API tokens regularly
  • Use network policies in Kubernetes
  • Enable audit logging for production

License

MIT
