
Hetzner Kubernetes Cluster

Production-ready Kubernetes cluster on Hetzner Cloud using Terraform and Ansible.

Architecture

Component      Details
Control Plane  3x CX23 (HA)
Workers        4x CX33
Total Cost     €28.93/mo
K8s            k3s (latest, HA)
Addons         Hetzner CCM + CSI + Prometheus + Grafana + Loki
Access         SSH/API restricted to Tailnet
Bootstrap      Terraform + Ansible

Cluster Resources

  • 22 vCPU total (6 CP + 16 workers)
  • 44 GB RAM total (12 CP + 32 workers)
  • 440 GB SSD storage
  • 140 TB bandwidth allocation
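The totals above can be re-derived from per-node specs. Note the per-node numbers below are assumptions inferred from the totals (CX23 = 2 vCPU / 4 GB / 40 GB / 20 TB, CX33 = 4 vCPU / 8 GB / 80 GB / 20 TB), not stated elsewhere in this README:

```shell
# Sanity-check the cluster totals from assumed per-node specs.
cp=3; workers=4
echo "vCPU:         $(( cp*2  + workers*4 ))"    # 22
echo "RAM (GB):     $(( cp*4  + workers*8 ))"    # 44
echo "SSD (GB):     $(( cp*40 + workers*80 ))"   # 440
echo "Traffic (TB): $(( (cp + workers)*20 ))"    # 140
```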

Prerequisites

1. Hetzner Cloud API Token

  1. Go to Hetzner Cloud Console
  2. Select your project (or create a new one)
  3. Navigate to Security → API Tokens
  4. Click Generate API Token
  5. Set description: k8s-cluster-terraform
  6. Select permissions: Read & Write
  7. Click Generate API Token
  8. Copy the token immediately - it won't be shown again!

2. Backblaze B2 Bucket (for Terraform State)

  1. Go to Backblaze B2
  2. Click Create a Bucket
  3. Set bucket name: k8s-terraform-state (must be globally unique)
  4. Choose Private access
  5. Click Create Bucket
  6. Create application key:
    • Go to App Keys → Add a New Application Key
    • Name: terraform-state
    • Allow access to: k8s-terraform-state bucket only
    • Type: Read and Write
    • Copy keyID (access key) and applicationKey (secret key)
  7. Note your bucket's S3 endpoint (e.g., https://s3.eu-central-003.backblazeb2.com)

3. SSH Key Pair

ssh-keygen -t ed25519 -C "k8s@hetzner" -f ~/.ssh/hetzner_k8s

4. Local Tools
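This section's tool list isn't spelled out, but the steps below use at least Terraform, Ansible, kubectl, and Python 3. A quick preflight check (tool names inferred from the workflow in this README):

```shell
# Report which of the required CLI tools are on PATH.
for tool in terraform ansible-playbook kubectl python3; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "ok:      $tool"
  else
    echo "missing: $tool"
  fi
done
```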

Setup

1. Clone Repository

git clone <your-gitea-repo>/HetznerTerra.git
cd HetznerTerra

2. Configure Variables

cp terraform.tfvars.example terraform.tfvars

Edit terraform.tfvars:

hcloud_token = "your-hetzner-api-token"

ssh_public_key  = "~/.ssh/hetzner_k8s.pub"
ssh_private_key = "~/.ssh/hetzner_k8s"

s3_access_key = "your-backblaze-key-id"
s3_secret_key = "your-backblaze-application-key"
s3_endpoint   = "https://s3.eu-central-003.backblazeb2.com"
s3_bucket     = "k8s-terraform-state"

tailscale_auth_key = "tskey-auth-..."
tailscale_tailnet  = "yourtailnet.ts.net"

restrict_api_ssh_to_tailnet = true
tailnet_cidr                = "100.64.0.0/10"
enable_nodeport_public      = false

allowed_ssh_ips = []
allowed_api_ips = []

3. Initialize Terraform

cd terraform

# Create backend config file (or use CLI args)
cat > backend.hcl << EOF
endpoint                    = "https://s3.eu-central-003.backblazeb2.com"
bucket                      = "k8s-terraform-state"
access_key                  = "your-backblaze-key-id"
secret_key                  = "your-backblaze-application-key"
skip_requesting_account_id  = true
EOF

terraform init -backend-config=backend.hcl

4. Plan and Apply

terraform plan -var-file=../terraform.tfvars
terraform apply -var-file=../terraform.tfvars

5. Generate Ansible Inventory

cd ../ansible
python3 generate_inventory.py

6. Bootstrap Cluster

ansible-playbook site.yml

7. Get Kubeconfig

export KUBECONFIG=$(pwd)/outputs/kubeconfig
kubectl get nodes

Kubeconfig endpoint is rewritten to the primary control-plane tailnet hostname (k8s-cluster-cp-1.<your-tailnet>).
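To confirm the rewrite took effect, inspect the kubeconfig's server field. A self-contained sketch using a fabricated kubeconfig with a hypothetical tailnet hostname; against the real cluster, grep outputs/kubeconfig instead:

```shell
# Fabricate a minimal kubeconfig to illustrate the check
# (hostname example.ts.net is a placeholder).
cat > /tmp/demo-kubeconfig <<'EOF'
apiVersion: v1
kind: Config
clusters:
- name: k8s-cluster
  cluster:
    server: https://k8s-cluster-cp-1.example.ts.net:6443
EOF

# The server field should point at the primary control-plane tailnet name.
grep 'server:' /tmp/demo-kubeconfig
```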

Gitea CI/CD

This repository includes Gitea workflows for:

  • terraform-plan: Runs on PRs, shows planned changes
  • terraform-apply: Runs on main branch after merge
  • ansible-deploy: Runs after terraform apply
  • dashboards: Fast workflow that updates Grafana datasources/dashboards only

Required Gitea Secrets

Set these in your Gitea repository settings (Settings → Actions → Secrets):

Secret                              Description
HCLOUD_TOKEN                        Hetzner Cloud API token
S3_ACCESS_KEY                       Backblaze B2 keyID
S3_SECRET_KEY                       Backblaze B2 applicationKey
S3_ENDPOINT                         Backblaze S3 endpoint (e.g., https://s3.eu-central-003.backblazeb2.com)
S3_BUCKET                           S3 bucket name (e.g., k8s-terraform-state)
TAILSCALE_AUTH_KEY                  Tailscale auth key for node bootstrap
TAILSCALE_TAILNET                   Tailnet domain (e.g., yourtailnet.ts.net)
TAILSCALE_OAUTH_CLIENT_ID           Tailscale OAuth client ID for the Kubernetes Operator
TAILSCALE_OAUTH_CLIENT_SECRET       Tailscale OAuth client secret for the Kubernetes Operator
DOPPLER_HETZNERTERRA_SERVICE_TOKEN  Doppler service token for hetznerterra runtime secrets
GRAFANA_ADMIN_PASSWORD              Optional admin password for Grafana (auto-generated if unset)
RUNNER_ALLOWED_CIDRS                Optional CIDR list for CI runner access if you choose to pass it via tfvars/secrets
SSH_PUBLIC_KEY                      SSH public key content
SSH_PRIVATE_KEY                     SSH private key content

GitOps (Flux)

This repo uses Flux for continuous reconciliation after Terraform + Ansible bootstrap.

Stable private-only baseline

The current default target is a deliberately simplified baseline:

  • 1 control plane node
  • 2 worker nodes
  • private Hetzner network only
  • Tailscale for operator access
  • Flux-managed core addons only

Detailed phase gates and success criteria live in STABLE_BASELINE.md.

This is the default until rebuilds are consistently green. High availability, public ingress, and app-layer expansion come later.

Runtime secrets

Runtime cluster secrets are moving to Doppler + External Secrets Operator.

  • Doppler project: hetznerterra
  • Initial auth: service token via DOPPLER_HETZNERTERRA_SERVICE_TOKEN
  • First synced secrets:
    • GRAFANA_ADMIN_PASSWORD
    • WEAVE_GITOPS_ADMIN_USERNAME
    • WEAVE_GITOPS_ADMIN_PASSWORD_BCRYPT_HASH

Terraform/bootstrap secrets remain in Gitea Actions secrets and are not managed by Doppler.
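For orientation, a Doppler-backed sync via External Secrets Operator looks roughly like the manifest below. This is a sketch, not a file from this repo: the store name doppler-hetznerterra and the target secret name are assumptions; only GRAFANA_ADMIN_PASSWORD comes from the list above.

```shell
# Print an illustrative ExternalSecret manifest (not applied to any cluster).
cat <<'EOF'
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: grafana-admin          # hypothetical name
  namespace: observability
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: doppler-hetznerterra # assumed store name, not from the repo
  target:
    name: grafana-admin
  data:
    - secretKey: admin-password
      remoteRef:
        key: GRAFANA_ADMIN_PASSWORD
EOF
```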

Repository layout

  • clusters/prod/: cluster entrypoint and Flux reconciliation objects
  • clusters/prod/flux-system/: GitRepository source and top-level Kustomization graph
  • infrastructure/: infrastructure addon reconciliation graph
  • infrastructure/addons/*: per-addon manifests for Flux-managed cluster addons
  • apps/: application workload layer (currently scaffolded)

Reconciliation graph

  • infrastructure (top-level)
    • addon-ccm
    • addon-csi depends on addon-ccm
    • addon-tailscale-operator
    • addon-observability
    • addon-observability-content depends on addon-observability
  • apps depends on infrastructure
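The dependency edges above map to spec.dependsOn on Flux Kustomization objects. A minimal sketch of one edge (the path and interval are assumptions, not taken from the repo):

```shell
# Print an illustrative Flux Kustomization with a dependsOn edge
# (addon-csi waits for addon-ccm, matching the graph above).
cat <<'EOF'
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: addon-csi
  namespace: flux-system
spec:
  dependsOn:
    - name: addon-ccm
  interval: 10m                        # assumed
  path: ./infrastructure/addons/csi    # assumed layout
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system
EOF
```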

Bootstrap notes

  1. Install Flux controllers in flux-system.
  2. Create the Flux deploy key/secret named flux-system in flux-system namespace.
  3. Apply clusters/prod/flux-system/ once to establish source + reconciliation graph.
  4. Bootstrap-only Ansible creates prerequisite secrets; Flux manages addon lifecycle after bootstrap.
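Step 2's deploy-key secret conventionally carries identity, identity.pub, and known_hosts keys (Flux's standard SSH layout). A sketch of its shape, with placeholder values rather than real key material:

```shell
# Print the expected shape of the flux-system deploy-key secret.
# Values are placeholders; populate them from your generated deploy key.
cat <<'EOF'
apiVersion: v1
kind: Secret
metadata:
  name: flux-system
  namespace: flux-system
stringData:
  identity: <private-key>
  identity.pub: <public-key>
  known_hosts: <git-host-key>
EOF
```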

Current addon status

  • Core infrastructure addons are Flux-managed from infrastructure/addons/.
  • Active Flux addons for stable baseline: addon-tailscale-operator, addon-tailscale-proxyclass, addon-external-secrets.
  • Deferred addons: addon-ccm, addon-csi, addon-observability, addon-observability-content (to be added after baseline is stable).
  • Ansible is limited to cluster bootstrap, private-access setup, and prerequisite secret creation for Flux-managed addons.
  • addon-flux-ui is optional for the stable-baseline phase and is not a blocker for rebuild success.

Stable baseline acceptance

A rebuild is considered successful only when all of the following pass without manual intervention:

  • Terraform create succeeds for the default 1 control plane and 2 workers.
  • Ansible bootstrap succeeds end-to-end.
  • All nodes become Ready.
  • Flux core reconciliation is healthy.
  • External Secrets Operator is ready.
  • Tailscale operator is ready.
  • Terraform destroy succeeds cleanly or succeeds after workflow retries.

Note: Observability stack (Grafana/Prometheus) is deferred and will be added once the core platform baseline is stable.

Observability Stack

Flux deploys a lightweight observability stack in the observability namespace:

  • kube-prometheus-stack (Prometheus + Grafana)
  • loki
  • promtail

Grafana content is managed as code via ConfigMaps in infrastructure/addons/observability-content/.

Grafana and Prometheus are exposed through a single Tailscale front door backed by Traefik when the Tailscale Kubernetes Operator is healthy.

Access Grafana and Prometheus

Preferred private access:

  • Grafana: http://k8s-cluster-cp-1.<your-tailnet>:30080/
  • Prometheus: http://k8s-cluster-cp-1.<your-tailnet>:30990/
  • Flux UI: http://k8s-cluster-cp-1.<your-tailnet>:30901/

This access path is bootstrapped automatically by Ansible on control_plane[0] using persistent kubectl port-forward systemd services plus tailscale serve, so it survives cluster rebuilds.

Fallback (port-forward from a tailnet-connected machine):

Run from a tailnet-connected machine:

export KUBECONFIG=$(pwd)/outputs/kubeconfig

kubectl -n observability port-forward svc/kube-prometheus-stack-grafana 3000:80
kubectl -n observability port-forward svc/kube-prometheus-stack-prometheus 9090:9090

Then open http://localhost:3000 (Grafana) and http://localhost:9090 (Prometheus).

Grafana user: admin
Grafana password: value of the GRAFANA_ADMIN_PASSWORD secret (or the generated value shown in the Ansible output)
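If you need to recover the password from the cluster, it can be read from the chart's admin secret. The secret name below assumes kube-prometheus-stack defaults; adjust it if your Helm release name differs:

```shell
# Read the Grafana admin password from the in-cluster secret
# (requires cluster access via the kubeconfig above).
kubectl -n observability get secret kube-prometheus-stack-grafana \
  -o jsonpath='{.data.admin-password}' | base64 -d && echo
```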

Verify Tailscale exposure

export KUBECONFIG=$(pwd)/outputs/kubeconfig

kubectl -n tailscale-system get pods
kubectl -n observability get svc kube-prometheus-stack-grafana kube-prometheus-stack-prometheus
kubectl -n observability describe svc kube-prometheus-stack-grafana | grep TailscaleProxyReady
kubectl -n observability describe svc kube-prometheus-stack-prometheus | grep TailscaleProxyReady

If TailscaleProxyReady=False, check:

kubectl -n tailscale-system logs deployment/operator --tail=100

Common cause: OAuth client missing tag/scopes permissions.

Fast dashboard iteration workflow

Use the Deploy Grafana Content workflow when changing dashboard/data source templates. It avoids full cluster provisioning and only applies Grafana content resources:

  • ansible/roles/observability-content/templates/grafana-datasources.yaml.j2
  • ansible/roles/observability-content/templates/grafana-dashboard-k8s-overview.yaml.j2
  • ansible/dashboards.yml

File Structure

.
├── terraform/
│   ├── main.tf
│   ├── variables.tf
│   ├── network.tf
│   ├── firewall.tf
│   ├── ssh.tf
│   ├── servers.tf
│   ├── outputs.tf
│   └── backend.tf
├── ansible/
│   ├── inventory.tmpl
│   ├── generate_inventory.py
│   ├── site.yml
│   ├── roles/
│   │   ├── common/
│   │   ├── k3s-server/
│   │   ├── k3s-agent/
│   │   ├── addon-secrets-bootstrap/
│   │   ├── observability-content/
│   │   └── observability/
│   └── ansible.cfg
├── .gitea/
│   └── workflows/
│       ├── terraform.yml
│       ├── ansible.yml
│       └── dashboards.yml
├── outputs/
├── terraform.tfvars.example
└── README.md

Firewall Rules

Port         Source                   Purpose
22           Tailnet CIDR             SSH
6443         Tailnet CIDR + internal  Kubernetes API
41641/udp    Any                      Tailscale WireGuard
9345         10.0.0.0/16              k3s Supervisor (HA join)
2379         10.0.0.0/16              etcd Client
2380         10.0.0.0/16              etcd Peer
8472         10.0.0.0/16              Flannel VXLAN
10250        10.0.0.0/16              Kubelet
30000-32767  Optional                 NodePorts (disabled by default)

Operations

Scale Workers

Edit terraform.tfvars:

worker_count = 5

Then:

terraform apply
ansible-playbook site.yml

Upgrade k3s

ansible-playbook site.yml -t upgrade

Destroy Cluster

terraform destroy

Troubleshooting

Check k3s Logs

ssh root@<control-plane-ip> journalctl -u k3s -f

Reset k3s

ansible-playbook site.yml -t reset

Costs Breakdown

Resource              Quantity  Unit Price          Monthly
CX23 (Control Plane)  3         €2.99               €8.97
CX33 (Workers)        4         €4.99               €19.96
Backblaze B2          ~1 GB     Free (first 10 GB)  €0.00
Total                                               €28.93/mo
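The monthly total can be re-derived from the line items:

```shell
# Re-compute the monthly total from per-item prices.
awk 'BEGIN {
  cp      = 3 * 2.99   # CX23 control-plane nodes
  workers = 4 * 4.99   # CX33 workers
  printf "CP: %.2f  Workers: %.2f  Total: %.2f\n", cp, workers, cp + workers
}'
# → CP: 8.97  Workers: 19.96  Total: 28.93
```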

Security Notes

  • Control plane has HA (3 nodes, can survive 1 failure)
  • Consider adding Hetzner load balancer for API server
  • Rotate API tokens regularly
  • Use network policies in Kubernetes
  • Enable audit logging for production

License

MIT
