
Hetzner Kubernetes Cluster

Production-ready Kubernetes cluster on Hetzner Cloud using Terraform and Ansible.

Architecture

| Component | Details |
|---|---|
| Control Plane | 3× CX23 (HA) |
| Workers | 4× CX33 |
| Total Cost | €28.93/mo |
| K8s | k3s (latest, HA) |
| Addons | Hetzner CCM + CSI + Prometheus + Grafana + Loki |
| Access | SSH/API and Rancher UI restricted to Tailnet |
| Bootstrap | Terraform + Ansible |

Cluster Resources

  • 22 vCPU total (6 CP + 16 workers)
  • 44 GB RAM total (12 CP + 32 workers)
  • 440 GB SSD storage
  • 140 TB bandwidth allocation
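The totals above follow from per-node specs. A quick arithmetic check (the per-type figures are inferred from the totals in this README, so treat them as assumptions rather than official Hetzner specs):

```python
# Sanity-check the cluster resource totals from per-node specs.
# Specs inferred from this README's totals, not taken from Hetzner docs.
cx23 = {"vcpu": 2, "ram_gb": 4, "ssd_gb": 40, "traffic_tb": 20}  # control plane
cx33 = {"vcpu": 4, "ram_gb": 8, "ssd_gb": 80, "traffic_tb": 20}  # worker

cp_count, worker_count = 3, 4

totals = {key: cp_count * cx23[key] + worker_count * cx33[key] for key in cx23}
print(totals)  # {'vcpu': 22, 'ram_gb': 44, 'ssd_gb': 440, 'traffic_tb': 140}
```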

Prerequisites

1. Hetzner Cloud API Token

  1. Go to Hetzner Cloud Console
  2. Select your project (or create a new one)
  3. Navigate to Security → API Tokens
  4. Click Generate API Token
  5. Set description: k8s-cluster-terraform
  6. Select permissions: Read & Write
  7. Click Generate API Token
  8. Copy the token immediately - it won't be shown again!

2. Backblaze B2 Bucket (for Terraform State)

  1. Go to Backblaze B2
  2. Click Create a Bucket
  3. Set bucket name: k8s-terraform-state (must be globally unique)
  4. Choose Private access
  5. Click Create Bucket
  6. Create application key:
    • Go to App Keys → Add a New Application Key
    • Name: terraform-state
    • Allow access to: k8s-terraform-state bucket only
    • Type: Read and Write
    • Copy keyID (access key) and applicationKey (secret key)
  7. Note your bucket's S3 endpoint (e.g., https://s3.eu-central-003.backblazeb2.com)

3. SSH Key Pair

ssh-keygen -t ed25519 -C "k8s@hetzner" -f ~/.ssh/hetzner_k8s

4. Local Tools

You'll need Terraform, Ansible, kubectl, Git, and Python 3 installed locally; all are used by the steps below.

Setup

1. Clone Repository

git clone <your-gitea-repo>/HetznerTerra.git
cd HetznerTerra

2. Configure Variables

cp terraform.tfvars.example terraform.tfvars

Edit terraform.tfvars:

hcloud_token = "your-hetzner-api-token"

ssh_public_key  = "~/.ssh/hetzner_k8s.pub"
ssh_private_key = "~/.ssh/hetzner_k8s"

s3_access_key = "your-backblaze-key-id"
s3_secret_key = "your-backblaze-application-key"
s3_endpoint   = "https://s3.eu-central-003.backblazeb2.com"
s3_bucket     = "k8s-terraform-state"

tailscale_auth_key = "tskey-auth-..."
tailscale_tailnet  = "yourtailnet.ts.net"

restrict_api_ssh_to_tailnet = true
tailnet_cidr                = "100.64.0.0/10"
enable_nodeport_public      = false

allowed_ssh_ips = []
allowed_api_ips = []

3. Initialize Terraform

cd terraform

# Create backend config file (or use CLI args)
cat > backend.hcl << EOF
endpoint                    = "https://s3.eu-central-003.backblazeb2.com"
bucket                      = "k8s-terraform-state"
access_key                  = "your-backblaze-key-id"
secret_key                  = "your-backblaze-application-key"
skip_requesting_account_id  = true
EOF

terraform init -backend-config=backend.hcl

4. Plan and Apply

terraform plan -var-file=../terraform.tfvars
terraform apply -var-file=../terraform.tfvars

5. Generate Ansible Inventory

cd ../ansible
python3 generate_inventory.py
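generate_inventory.py turns Terraform outputs into an Ansible inventory. A minimal sketch of the idea (the output names `control_plane_ips` and `worker_ips` and the INI layout are illustrative; the real script may differ):

```python
def build_inventory(outputs: dict) -> str:
    """Render an INI-style Ansible inventory from `terraform output -json` data."""
    lines = ["[control_plane]"]
    lines += outputs["control_plane_ips"]["value"]
    lines += ["", "[workers]"]
    lines += outputs["worker_ips"]["value"]
    return "\n".join(lines) + "\n"

# `terraform output -json` emits {"name": {"value": ...}, ...}; sample shape:
sample = {
    "control_plane_ips": {"value": ["10.0.1.1"]},
    "worker_ips": {"value": ["10.0.2.1", "10.0.2.2"]},
}
print(build_inventory(sample))
```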

6. Bootstrap Cluster

ansible-playbook site.yml

7. Get Kubeconfig

export KUBECONFIG=$(pwd)/outputs/kubeconfig
kubectl get nodes

Kubeconfig endpoint is rewritten to the primary control-plane tailnet hostname (k8s-cluster-cp-1.<your-tailnet>).
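The rewrite amounts to swapping the host in the kubeconfig's `server:` entry while keeping the port. A sketch of that transformation (the real logic lives in the bootstrap tooling; this regex-based version is illustrative):

```python
import re

def rewrite_server(kubeconfig: str, host: str) -> str:
    """Point the kubeconfig's server entry at the tailnet hostname, keeping the port."""
    return re.sub(
        r"(server:\s+https://)[^:\s]+(:\d+)",
        rf"\g<1>{host}\g<2>",
        kubeconfig,
    )

sample = "    server: https://10.0.1.1:6443"
print(rewrite_server(sample, "k8s-cluster-cp-1.yourtailnet.ts.net"))
# →     server: https://k8s-cluster-cp-1.yourtailnet.ts.net:6443
```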

Gitea CI/CD

This repository includes Gitea workflows for:

  • terraform-plan: Runs on PRs, shows planned changes
  • terraform-apply: Runs on main branch after merge
  • ansible-deploy: Runs after terraform apply
  • dashboards: Fast workflow that updates Grafana datasources/dashboards only

Required Gitea Secrets

Set these in your Gitea repository settings (Settings → Actions → Secrets):

| Secret | Description |
|---|---|
| HCLOUD_TOKEN | Hetzner Cloud API token |
| S3_ACCESS_KEY | Backblaze B2 keyID |
| S3_SECRET_KEY | Backblaze B2 applicationKey |
| S3_ENDPOINT | Backblaze S3 endpoint (e.g., https://s3.eu-central-003.backblazeb2.com) |
| S3_BUCKET | S3 bucket name (e.g., k8s-terraform-state) |
| TAILSCALE_AUTH_KEY | Tailscale auth key for node bootstrap |
| TAILSCALE_TAILNET | Tailnet domain (e.g., yourtailnet.ts.net) |
| TAILSCALE_OAUTH_CLIENT_ID | Tailscale OAuth client ID for the Kubernetes Operator |
| TAILSCALE_OAUTH_CLIENT_SECRET | Tailscale OAuth client secret for the Kubernetes Operator |
| DOPPLER_HETZNERTERRA_SERVICE_TOKEN | Doppler service token for hetznerterra runtime secrets |
| GRAFANA_ADMIN_PASSWORD | Optional admin password for Grafana (auto-generated if unset) |
| RUNNER_ALLOWED_CIDRS | Optional CIDR list for CI runner access, if passed via tfvars/secrets |
| SSH_PUBLIC_KEY | SSH public key content |
| SSH_PRIVATE_KEY | SSH private key content |
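A small preflight check in the spirit of this table, assuming the secrets are exported as environment variables in the runner (the names come from the table; the check itself is an illustrative sketch, not part of the workflows):

```python
import os

# Required secrets from the table above (optional ones omitted).
REQUIRED = [
    "HCLOUD_TOKEN", "S3_ACCESS_KEY", "S3_SECRET_KEY", "S3_ENDPOINT",
    "S3_BUCKET", "TAILSCALE_AUTH_KEY", "TAILSCALE_TAILNET",
    "SSH_PUBLIC_KEY", "SSH_PRIVATE_KEY",
]

def missing_secrets(env) -> list:
    """Return required secret names that are absent or empty."""
    return [name for name in REQUIRED if not env.get(name)]

# In a CI runner, os.environ would carry the secrets; here we just report.
print("missing:", missing_secrets(os.environ))
```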

GitOps (Flux)

This repo uses Flux for continuous reconciliation after Terraform + Ansible bootstrap.

Stable private-only baseline

The current default target is a deliberately simplified baseline:

  • 1 control plane node
  • 2 worker nodes
  • private Hetzner network only
  • Tailscale for operator access
  • Flux-managed core addons only

Detailed phase gates and success criteria live in STABLE_BASELINE.md.

This is the default until rebuilds are consistently green. High availability, public ingress, and app-layer expansion come later.

Runtime secrets

Runtime cluster secrets are moving to Doppler + External Secrets Operator.

  • Doppler project: hetznerterra
  • Initial auth: service token via DOPPLER_HETZNERTERRA_SERVICE_TOKEN
  • First synced secrets:
    • GRAFANA_ADMIN_PASSWORD
    • WEAVE_GITOPS_ADMIN_USERNAME
    • WEAVE_GITOPS_ADMIN_PASSWORD_BCRYPT_HASH

Terraform/bootstrap secrets remain in Gitea Actions secrets and are not managed by Doppler.

Repository layout

  • clusters/prod/: cluster entrypoint and Flux reconciliation objects
  • clusters/prod/flux-system/: GitRepository source and top-level Kustomization graph
  • infrastructure/: infrastructure addon reconciliation graph
  • infrastructure/addons/*: per-addon manifests for Flux-managed cluster addons
  • apps/: application workload layer (currently scaffolded)

Reconciliation graph

  • infrastructure (top-level)
    • addon-ccm
    • addon-csi depends on addon-ccm
    • addon-tailscale-operator
    • addon-observability
    • addon-observability-content depends on addon-observability
  • apps depends on infrastructure

Bootstrap notes

  1. Install Flux controllers in flux-system.
  2. Create the Flux deploy key/secret named flux-system in flux-system namespace.
  3. Apply clusters/prod/flux-system/ once to establish source + reconciliation graph.
  4. Bootstrap-only Ansible creates prerequisite secrets; Flux manages addon lifecycle after bootstrap.

Current addon status

  • Core infrastructure addons are Flux-managed from infrastructure/addons/.
  • Active Flux addons for stable baseline: addon-tailscale-operator, addon-tailscale-proxyclass, addon-external-secrets.
  • Deferred addons: addon-ccm, addon-csi, addon-observability, addon-observability-content (to be added after baseline is stable).
  • Ansible is limited to cluster bootstrap, private-access setup, and prerequisite secret creation for Flux-managed addons.
  • addon-flux-ui is optional for the stable-baseline phase and is not a blocker for rebuild success.

Rancher access

  • Rancher is private-only and exposed through Tailscale at https://rancher.silverside-gopher.ts.net/dashboard/.
  • The public Hetzner load balancer path is not used for Rancher.
  • Rancher uses the CNPG-backed PostgreSQL cluster in cnpg-cluster.

Stable baseline acceptance

A rebuild is considered successful only when all of the following pass without manual intervention:

  • Terraform create succeeds for the default 1 control plane and 2 workers.
  • Ansible bootstrap succeeds end-to-end.
  • All nodes become Ready.
  • Flux core reconciliation is healthy.
  • External Secrets Operator is ready.
  • Tailscale operator is ready.
  • Terraform destroy succeeds cleanly or succeeds after workflow retries.

Note: Observability stack (Grafana/Prometheus) is deferred and will be added once the core platform baseline is stable.

Observability Stack

Flux deploys a lightweight observability stack in the observability namespace:

  • kube-prometheus-stack (Prometheus + Grafana)
  • loki
  • promtail

Grafana content is managed as code via ConfigMaps in infrastructure/addons/observability-content/.

Grafana and Prometheus are exposed through a single Tailscale front door backed by Traefik when the Tailscale Kubernetes Operator is healthy.

Access Grafana and Prometheus

Preferred private access:

  • Grafana: http://k8s-cluster-cp-1.<your-tailnet>:30080/
  • Prometheus: http://k8s-cluster-cp-1.<your-tailnet>:30990/
  • Flux UI: http://k8s-cluster-cp-1.<your-tailnet>:30901/

This access path is bootstrapped automatically by Ansible on control_plane[0] using persistent kubectl port-forward systemd services plus tailscale serve, so it survives cluster rebuilds.

Fallback (port-forward from a tailnet-connected machine):

Run from a tailnet-connected machine:

export KUBECONFIG=$(pwd)/outputs/kubeconfig

kubectl -n observability port-forward svc/kube-prometheus-stack-grafana 3000:80
kubectl -n observability port-forward svc/kube-prometheus-stack-prometheus 9090:9090

Then open:

  • Grafana: http://localhost:3000 (user: admin; password: the value of the GRAFANA_ADMIN_PASSWORD secret, or the generated value shown in the Ansible output)
  • Prometheus: http://localhost:9090

Verify Tailscale exposure

export KUBECONFIG=$(pwd)/outputs/kubeconfig

kubectl -n tailscale-system get pods
kubectl -n observability get svc kube-prometheus-stack-grafana kube-prometheus-stack-prometheus
kubectl -n observability describe svc kube-prometheus-stack-grafana | grep TailscaleProxyReady
kubectl -n observability describe svc kube-prometheus-stack-prometheus | grep TailscaleProxyReady

If TailscaleProxyReady=False, check:

kubectl -n tailscale-system logs deployment/operator --tail=100

Common cause: OAuth client missing tag/scopes permissions.

Fast dashboard iteration workflow

Use the Deploy Grafana Content workflow when changing dashboard/data source templates. It avoids full cluster provisioning and only applies Grafana content resources:

  • ansible/roles/observability-content/templates/grafana-datasources.yaml.j2
  • ansible/roles/observability-content/templates/grafana-dashboard-k8s-overview.yaml.j2
  • ansible/dashboards.yml

File Structure

.
├── terraform/
│   ├── main.tf
│   ├── variables.tf
│   ├── network.tf
│   ├── firewall.tf
│   ├── ssh.tf
│   ├── servers.tf
│   ├── outputs.tf
│   └── backend.tf
├── ansible/
│   ├── inventory.tmpl
│   ├── generate_inventory.py
│   ├── site.yml
│   ├── roles/
│   │   ├── common/
│   │   ├── k3s-server/
│   │   ├── k3s-agent/
│   │   ├── addon-secrets-bootstrap/
│   │   ├── observability-content/
│   │   └── observability/
│   └── ansible.cfg
├── .gitea/
│   └── workflows/
│       ├── terraform.yml
│       ├── ansible.yml
│       └── dashboards.yml
├── outputs/
├── terraform.tfvars.example
└── README.md

Firewall Rules

| Port | Source | Purpose |
|---|---|---|
| 22 | Tailnet CIDR | SSH |
| 6443 | Tailnet CIDR + internal | Kubernetes API |
| 41641/udp | Any | Tailscale WireGuard |
| 9345 | 10.0.0.0/16 | k3s Supervisor (HA join) |
| 2379 | 10.0.0.0/16 | etcd Client |
| 2380 | 10.0.0.0/16 | etcd Peer |
| 8472 | 10.0.0.0/16 | Flannel VXLAN |
| 10250 | 10.0.0.0/16 | Kubelet |
| 30000-32767 | Optional | NodePorts (disabled by default) |
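The rules above can be sanity-checked with the stdlib `ipaddress` module. A sketch, assuming "Tailnet CIDR" is the 100.64.0.0/10 range from terraform.tfvars and omitting NodePorts (disabled by default):

```python
import ipaddress

# (port, allowed source network) pairs from the firewall table
RULES = [
    (22, "100.64.0.0/10"),     # SSH from tailnet
    (6443, "100.64.0.0/10"),   # Kubernetes API from tailnet
    (6443, "10.0.0.0/16"),     # Kubernetes API, internal
    (41641, "0.0.0.0/0"),      # Tailscale WireGuard (UDP)
    (9345, "10.0.0.0/16"),     # k3s supervisor (HA join)
    (2379, "10.0.0.0/16"),     # etcd client
    (2380, "10.0.0.0/16"),     # etcd peer
    (8472, "10.0.0.0/16"),     # Flannel VXLAN
    (10250, "10.0.0.0/16"),    # kubelet
]

def is_allowed(port: int, src_ip: str) -> bool:
    """True if any rule permits traffic to `port` from `src_ip`."""
    ip = ipaddress.ip_address(src_ip)
    return any(port == p and ip in ipaddress.ip_network(net)
               for p, net in RULES)

print(is_allowed(6443, "100.64.0.5"))  # True  (tailnet)
print(is_allowed(22, "203.0.113.7"))   # False (public internet)
```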

Operations

Scale Workers

Edit terraform.tfvars:

worker_count = 5

Then:

terraform apply
ansible-playbook site.yml

Upgrade k3s

ansible-playbook site.yml -t upgrade

Destroy Cluster

terraform destroy

Troubleshooting

Check k3s Logs

ssh root@<control-plane-ip> journalctl -u k3s -f

Reset k3s

ansible-playbook site.yml -t reset

Costs Breakdown

| Resource | Quantity | Unit Price | Monthly |
|---|---|---|---|
| CX23 (Control Plane) | 3 | €2.99 | €8.97 |
| CX33 (Workers) | 4 | €4.99 | €19.96 |
| Backblaze B2 | ~1 GB | Free (first 10 GB) | €0.00 |
| **Total** | | | **€28.93/mo** |
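The total checks out against the unit prices:

```python
# Monthly cost (euros) from the unit prices above.
control_plane = 3 * 2.99  # CX23
workers = 4 * 4.99        # CX33
b2_state = 0.00           # Backblaze B2, within free tier

total = round(control_plane + workers + b2_state, 2)
print(f"€{total}/mo")  # €28.93/mo
```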

Security Notes

  • Control plane has HA (3 nodes, can survive 1 failure)
  • Consider adding Hetzner load balancer for API server
  • Rotate API tokens regularly
  • Use network policies in Kubernetes
  • Enable audit logging for production

License

MIT
