2026-02-28 20:24:55 +00:00
# Hetzner Kubernetes Cluster
Production-ready Kubernetes cluster on Hetzner Cloud using Terraform and Ansible.
## Architecture
| Component | Details |
|-----------|---------|
| **Control Plane** | 3x CX23 (HA) |
| **Workers** | 3x CX33 |
| **K8s** | k3s (latest, HA) |
| **Addons** | Hetzner CCM + CSI + Prometheus + Grafana + Loki |
| **Access** | SSH/API and private services restricted to Tailnet |
| **Bootstrap** | Terraform + Ansible + Flux |
## Prerequisites
### 1. Hetzner Cloud API Token
1. Go to [Hetzner Cloud Console](https://console.hetzner.com/)
2. Select your project (or create a new one)
3. Navigate to **Security** → **API Tokens**
4. Click **Generate API Token**
5. Set description: `k8s-cluster-terraform`
6. Select permissions: **Read & Write**
7. Click **Generate API Token**
8. **Copy the token immediately** - it won't be shown again!
### 2. Backblaze B2 Bucket (for Terraform State)
1. Go to [Backblaze B2](https://secure.backblaze.com/b2_buckets.htm)
2. Click **Create a Bucket**
3. Set bucket name: `k8s-terraform-state` (must be globally unique)
4. Choose **Private** access
5. Click **Create Bucket**
6. Create application key:
   - Go to **App Keys** → **Add a New Application Key**
   - Name: `terraform-state`
   - Allow access to: `k8s-terraform-state` bucket only
   - Type: **Read and Write**
   - Copy **keyID** (access key) and **applicationKey** (secret key)
7. Note your bucket's S3 endpoint (e.g., `https://s3.eu-central-003.backblazeb2.com`)
### 3. SSH Key Pair
``` bash
ssh-keygen -t ed25519 -C "k8s@hetzner" -f ~/.ssh/hetzner_k8s
```
### 4. Local Tools
- [Terraform](https://terraform.io/downloads) >= 1.0
- [Ansible](https://docs.ansible.com/ansible/latest/installation_guide/intro_installation.html) >= 2.9
- Python 3 with `jinja2` and `pyyaml`
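Before starting, it can help to confirm that the tools listed above are on your `PATH`. The following is a minimal sketch: `check_tools` is a hypothetical helper that is not part of this repository, and it only checks that each command exists, not its version.

```bash
# Hypothetical helper (not part of this repo): report which commands are missing.
# Returns 0 if every command is found on PATH, 1 otherwise.
check_tools() {
  missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

For example, `check_tools terraform ansible-playbook python3 kubectl` prints one `missing:` line per absent tool and returns non-zero if any are absent.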
## Setup
### 1. Clone Repository
``` bash
git clone <your-gitea-repo>/HetznerTerra.git
cd HetznerTerra
```
### 2. Configure Variables
``` bash
cp terraform.tfvars.example terraform.tfvars
```
Edit `terraform.tfvars`:
``` hcl
hcloud_token = "your-hetzner-api-token"
ssh_public_key = "~/.ssh/hetzner_k8s.pub"
ssh_private_key = "~/.ssh/hetzner_k8s"
s3_access_key = "your-backblaze-key-id"
s3_secret_key = "your-backblaze-application-key"
s3_endpoint = "https://s3.eu-central-003.backblazeb2.com"
s3_bucket = "k8s-terraform-state"
tailscale_auth_key = "tskey-auth-..."
tailscale_tailnet = "yourtailnet.ts.net"
restrict_api_ssh_to_tailnet = true
tailnet_cidr = "100.64.0.0/10"
enable_nodeport_public = false
allowed_ssh_ips = [ ]
allowed_api_ips = [ ]
```
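A forgotten variable usually only surfaces during `terraform plan`. As a quick pre-check, the sketch below lists required keys that have no assignment in a tfvars file. `missing_tfvars_keys` is a hypothetical helper, not part of this repository, and which keys count as required is an assumption you should adjust to match `variables.tf`.

```bash
# Hypothetical helper (not part of this repo): print each required key that
# has no `key = ...` assignment in the given tfvars file.
missing_tfvars_keys() {
  file=$1
  shift
  for key in "$@"; do
    grep -Eq "^[[:space:]]*${key}[[:space:]]*=" "$file" || echo "$key"
  done
}
```

For example, `missing_tfvars_keys terraform.tfvars hcloud_token s3_bucket tailscale_auth_key` prints nothing when all three keys are set.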
### 3. Initialize Terraform
``` bash
cd terraform
# Create backend config file (or use CLI args)
cat > backend.hcl << EOF
endpoint = "https://s3.eu-central-003.backblazeb2.com"
bucket = "k8s-terraform-state"
access_key = "your-backblaze-key-id"
secret_key = "your-backblaze-application-key"
skip_requesting_account_id = true
EOF
terraform init -backend-config=backend.hcl
```
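The CLI-argument alternative mentioned in the comment above can be sketched as follows. `terraform init` accepts repeated `-backend-config="KEY=VALUE"` arguments; the `backend_config_args` helper is hypothetical, and the environment variable names simply mirror the Gitea secrets listed in this README (whether the workflows pass them this way is an assumption).

```bash
# Sketch: emit one -backend-config argument per line from environment variables.
# Variable names mirror the Gitea secrets in this README; this helper itself is
# hypothetical and not part of the repo.
backend_config_args() {
  printf '%s\n' \
    "-backend-config=endpoint=${S3_ENDPOINT}" \
    "-backend-config=bucket=${S3_BUCKET}" \
    "-backend-config=access_key=${S3_ACCESS_KEY}" \
    "-backend-config=secret_key=${S3_SECRET_KEY}" \
    "-backend-config=skip_requesting_account_id=true"
}
```

With the variables exported, `terraform init $(backend_config_args)` would pass them through, assuming none of the values contain whitespace. This keeps the keys out of a file on disk, although they remain visible in the process list while the command runs.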
### 4. Plan and Apply
``` bash
terraform plan -var-file=../terraform.tfvars
terraform apply -var-file=../terraform.tfvars
```
### 5. Generate Ansible Inventory
``` bash
cd ../ansible
python3 generate_inventory.py
```
### 6. Bootstrap Cluster
``` bash
ansible-playbook site.yml
```
### 7. Get Kubeconfig
``` bash
export KUBECONFIG=$(pwd)/outputs/kubeconfig
kubectl get nodes
```
Use `scripts/refresh-kubeconfig.sh <cp1-public-ip>` to refresh kubeconfig against the primary control-plane public IP after rebuilds.
## Gitea CI/CD
This repository includes Gitea workflows for:
- **deploy**: End-to-end Terraform + Ansible + Flux bootstrap + restore + health checks
- **destroy**: Cluster teardown with backup-aware cleanup
- **dashboards**: Fast workflow that updates Grafana datasources/dashboards only
### Required Gitea Secrets
Set these in your Gitea repository settings (**Settings** → **Secrets** → **Actions**):
| Secret | Description |
|--------|-------------|
| `HCLOUD_TOKEN` | Hetzner Cloud API token |
| `S3_ACCESS_KEY` | Backblaze B2 keyID |
| `S3_SECRET_KEY` | Backblaze B2 applicationKey |
| `S3_ENDPOINT` | Backblaze S3 endpoint (e.g., `https://s3.eu-central-003.backblazeb2.com`) |
| `S3_BUCKET` | S3 bucket name (e.g., `k8s-terraform-state`) |
| `TAILSCALE_AUTH_KEY` | Tailscale auth key for node bootstrap |
| `TAILSCALE_TAILNET` | Tailnet domain (e.g., `yourtailnet.ts.net`) |
| `TAILSCALE_OAUTH_CLIENT_ID` | Tailscale OAuth client ID for Kubernetes Operator |
| `TAILSCALE_OAUTH_CLIENT_SECRET` | Tailscale OAuth client secret for Kubernetes Operator |
| `DOPPLER_HETZNERTERRA_SERVICE_TOKEN` | Doppler service token for `hetznerterra` runtime secrets |
| `GRAFANA_ADMIN_PASSWORD` | Optional admin password for Grafana (auto-generated if unset) |
| `RUNNER_ALLOWED_CIDRS` | Optional CIDR list for CI runner access if you choose to pass it via tfvars/secrets |
| `SSH_PUBLIC_KEY` | SSH public key content |
| `SSH_PRIVATE_KEY` | SSH private key content |
## GitOps (Flux)
This repo uses Flux for continuous reconciliation after Terraform + Ansible bootstrap.
### Stable private-only baseline
The current default target is the HA private baseline:
- `3` control plane nodes
- `3` worker nodes
- private Hetzner network only
- Tailscale for operator and service access
- Flux-managed platform addons with `apps` suspended by default
Detailed phase gates and success criteria live in `STABLE_BASELINE.md`.
This is the default until rebuilds are consistently green. Public ingress and app-layer expansion come later.
### Runtime secrets
Runtime cluster secrets are moving to Doppler + External Secrets Operator.
- Doppler project: `hetznerterra`
- Initial auth: service token via `DOPPLER_HETZNERTERRA_SERVICE_TOKEN`
- First synced secrets:
  - `GRAFANA_ADMIN_PASSWORD`
Terraform/bootstrap secrets remain in Gitea Actions secrets and are not managed by Doppler.
### Repository layout
- `clusters/prod/`: cluster entrypoint and Flux reconciliation objects
- `clusters/prod/flux-system/`: `GitRepository` source and top-level `Kustomization` graph
- `infrastructure/`: infrastructure addon reconciliation graph
- `infrastructure/addons/*`: per-addon manifests for Flux-managed cluster addons
- `apps/`: application workload layer (currently scaffolded)
### Reconciliation graph
- `infrastructure` (top-level)
  - `addon-ccm`
  - `addon-csi` depends on `addon-ccm`
  - `addon-tailscale-operator`
  - `addon-observability`
  - `addon-observability-content` depends on `addon-observability`
- `apps` depends on `infrastructure`
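Flux resolves this ordering itself through each `Kustomization`'s `spec.dependsOn`, so nothing below is needed at runtime. Purely as an illustration of what the "depends on" edges imply, this sketch feeds them to `tsort` to print one valid apply order. The `reconcile_order` function is hypothetical and not part of this repository.

```bash
# Sketch: derive one valid apply order from the "depends on" edges above.
# Each input line is "<dependency> <dependent>"; a repeated name declares a
# node that has no ordering constraint.
reconcile_order() {
  tsort <<'EOF'
addon-ccm addon-csi
addon-observability addon-observability-content
addon-tailscale-operator addon-tailscale-operator
infrastructure apps
EOF
}
```

Any order `tsort` prints places `addon-ccm` before `addon-csi`, `addon-observability` before `addon-observability-content`, and `infrastructure` before `apps`; the relative position of unrelated entries is not significant.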
### Bootstrap notes
1. Install Flux controllers in `flux-system`.
2. Create the Flux deploy key/secret named `flux-system` in the `flux-system` namespace.
3. Apply `clusters/prod/flux-system/` once to establish the source and reconciliation graph.
4. Bootstrap-only Ansible creates prerequisite secrets; Flux manages addon lifecycle after bootstrap.
### Current addon status
- Core infrastructure addons are Flux-managed from `infrastructure/addons/`.
- Active Flux addons for the current baseline: `addon-ccm`, `addon-csi`, `addon-cert-manager`, `addon-external-secrets`, `addon-tailscale-operator`, `addon-tailscale-proxyclass`, `addon-observability`, `addon-observability-content`, `addon-rancher`, `addon-rancher-config`, `addon-rancher-backup`, `addon-rancher-backup-config`.
- `apps` remains suspended until workload rollout is explicitly enabled.
- Ansible is limited to cluster bootstrap, prerequisite secret creation, pre-proxy Tailscale cleanup, and kubeconfig finalization.
- Weave GitOps / Flux UI is no longer deployed; use Rancher or the `flux` CLI for Flux operations.
### Rancher access
- Rancher is private-only and exposed through Tailscale at `https://rancher.silverside-gopher.ts.net/`.
- The public Hetzner load balancer path is not used for Rancher.
- Rancher stores state in embedded etcd; no external database is used.
### Stable baseline acceptance
A rebuild is considered successful only when all of the following pass without manual intervention:
- Terraform create succeeds for the default `3` control planes and `3` workers.
- Ansible bootstrap succeeds end-to-end.
- All nodes become `Ready`.
- Flux core reconciliation is healthy.
- External Secrets Operator is ready.
- Tailscale operator is ready.
- Tailnet smoke checks pass for Rancher, Grafana, and Prometheus.
- Terraform destroy succeeds cleanly or succeeds after workflow retries.
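The "all nodes `Ready`" criterion can be checked mechanically. The sketch below counts nodes whose `STATUS` column is anything other than `Ready`. `count_not_ready` is a hypothetical helper, not something the repository's workflows are known to use, and it deliberately treats cordoned nodes (`Ready,SchedulingDisabled`) as not ready.

```bash
# Sketch: count nodes whose STATUS column is not exactly "Ready".
# Reads `kubectl get nodes --no-headers` output on stdin
# (columns: NAME STATUS ROLES AGE VERSION).
count_not_ready() {
  awk '$2 != "Ready" { n++ } END { print n + 0 }'
}
```

For example, `kubectl get nodes --no-headers | count_not_ready` prints `0` once every node reports `Ready`.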
## Observability Stack
Flux deploys a lightweight observability stack in the `observability` namespace:
- `kube-prometheus-stack` (Prometheus + Grafana)
- `loki`
- `promtail`
Grafana content is managed as code via ConfigMaps in `infrastructure/addons/observability-content/`.
Grafana and Prometheus are exposed through dedicated Tailscale LoadBalancer services when the Tailscale Kubernetes Operator is healthy.
### Access Grafana and Prometheus
Preferred private access:
- Grafana: `http://grafana.silverside-gopher.ts.net/`
- Prometheus: `http://prometheus.silverside-gopher.ts.net:9090/`
Fallback (port-forward from a tailnet-connected machine):
``` bash
export KUBECONFIG=$(pwd)/outputs/kubeconfig
# Each port-forward blocks until interrupted; run them in separate terminals.
kubectl -n observability port-forward svc/kube-prometheus-stack-grafana 3000:80
kubectl -n observability port-forward svc/kube-prometheus-stack-prometheus 9090:9090
```
Then open:
- Grafana: http://127.0.0.1:3000
- Prometheus: http://127.0.0.1:9090
Grafana credentials:
- User: `admin`
- Password: value of the `GRAFANA_ADMIN_PASSWORD` secret (or the generated value shown in the Ansible output)
### Verify Tailscale exposure
``` bash
export KUBECONFIG=$(pwd)/outputs/kubeconfig
kubectl -n tailscale-system get pods
kubectl -n cattle-system get svc rancher-tailscale
kubectl -n observability get svc grafana-tailscale prometheus-tailscale
kubectl -n cattle-system describe svc rancher-tailscale | grep TailscaleProxyReady
kubectl -n observability describe svc grafana-tailscale | grep TailscaleProxyReady
kubectl -n observability describe svc prometheus-tailscale | grep TailscaleProxyReady
```
If `TailscaleProxyReady=False`, check:
``` bash
kubectl -n tailscale-system logs deployment/operator --tail=100
```
Common cause: the OAuth client is missing the required tag or scope permissions.
### Fast dashboard iteration workflow
Use the `Deploy Grafana Content` workflow when changing dashboard/data source templates.
It avoids full cluster provisioning and only applies Grafana content resources:
- `ansible/roles/observability-content/templates/grafana-datasources.yaml.j2`
- `ansible/roles/observability-content/templates/grafana-dashboard-k8s-overview.yaml.j2`
- `ansible/dashboards.yml`
## File Structure
```
.
├── terraform/
│   ├── main.tf
│   ├── variables.tf
│   ├── network.tf
│   ├── firewall.tf
│   ├── ssh.tf
│   ├── servers.tf
│   ├── outputs.tf
│   └── backend.tf
├── ansible/
│   ├── inventory.tmpl
│   ├── generate_inventory.py
│   ├── site.yml
│   ├── roles/
│   │   ├── common/
│   │   ├── k3s-server/
│   │   ├── k3s-agent/
│   │   ├── addon-secrets-bootstrap/
│   │   ├── observability-content/
│   │   └── observability/
│   └── ansible.cfg
├── .gitea/
│   └── workflows/
│       ├── terraform.yml
│       ├── ansible.yml
│       └── dashboards.yml
├── outputs/
├── terraform.tfvars.example
└── README.md
```
## Firewall Rules
| Port | Source | Purpose |
|------|--------|---------|
| 22 | Tailnet CIDR | SSH |
| 6443 | Tailnet CIDR + internal | Kubernetes API |
| 41641/udp | Any | Tailscale WireGuard |
| 9345 | 10.0.0.0/16 | k3s Supervisor (HA join) |
| 2379 | 10.0.0.0/16 | etcd Client |
| 2380 | 10.0.0.0/16 | etcd Peer |
| 8472/udp | 10.0.0.0/16 | Flannel VXLAN |
| 10250 | 10.0.0.0/16 | Kubelet |
| 30000-32767 | Optional | NodePorts (disabled by default) |
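When debugging why a connection is accepted or dropped, it is useful to know which source range an address belongs to. The sketch below tests whether an IPv4 address falls inside a CIDR block such as the tailnet range `100.64.0.0/10` or the private network `10.0.0.0/16`. Both helpers are hypothetical, not part of this repository, and they assume well-formed input without leading zeros in the octets.

```bash
# Sketch: succeed if an IPv4 address falls inside a CIDR block.
ip_to_int() {
  # Split the dotted quad on "." and pack the four octets into one integer.
  old_ifs=$IFS
  IFS=.
  set -- $1
  IFS=$old_ifs
  echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

in_cidr() {
  ip=$(ip_to_int "$1")
  net=$(ip_to_int "${2%/*}")   # network address before the slash
  bits=${2#*/}                 # prefix length after the slash
  mask=$(( (0xFFFFFFFF << (32 - $bits)) & 0xFFFFFFFF ))
  [ $(( $ip & $mask )) -eq $(( $net & $mask )) ]
}
```

For example, `in_cidr 100.100.1.2 100.64.0.0/10` succeeds, while `in_cidr 10.0.3.7 100.64.0.0/10` fails.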
## Operations
### Scale Workers
Edit `terraform.tfvars` :
``` hcl
worker_count = 5
```
Then:
``` bash
terraform apply
ansible-playbook site.yml
```
### Upgrade k3s
``` bash
ansible-playbook site.yml -t upgrade
```
### Destroy Cluster
``` bash
terraform destroy
```
## Troubleshooting
### Check k3s Logs
``` bash
ssh root@<control-plane-ip> journalctl -u k3s -f
```
### Reset k3s
``` bash
ansible-playbook site.yml -t reset
```
## Costs Breakdown
| Resource | Quantity | Unit Price | Monthly |
|----------|----------|------------|---------|
| CX23 (Control Plane) | 3 | €2.99 | €8.97 |
| CX33 (Workers) | 3 | €4.99 | €14.97 |
| Backblaze B2 | ~1 GB | Free (first 10 GB) | €0.00 |
| **Total** | | | **€23.94/mo** |
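To re-estimate the monthly total after changing node counts, the arithmetic can be sketched as below. It uses the unit prices from the table above (treat them as a snapshot, since Hetzner pricing changes) and integer cents to avoid floating-point rounding. Both functions are hypothetical helpers, not part of this repository.

```bash
# Sketch: monthly cost in euro cents from node counts.
# Unit prices are copied from the table above: CX23 = 299 cents, CX33 = 499 cents.
monthly_cost_cents() {
  control_planes=$1
  workers=$2
  echo $(( $control_planes * 299 + $workers * 499 ))
}

# Format an amount in cents as euros with two decimals.
format_eur() {
  printf '%d.%02d\n' $(( $1 / 100 )) $(( $1 % 100 ))
}
```

For example, `format_eur "$(monthly_cost_cents 3 3)"` prints `23.94` for the default 3 control planes and 3 workers, and `format_eur "$(monthly_cost_cents 3 5)"` prints `33.92` after scaling to 5 workers.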
## Security Notes
- The control plane is HA (3 nodes; it tolerates the loss of 1 node)
- Consider adding Hetzner load balancer for API server
- Rotate API tokens regularly
- Use network policies in Kubernetes
- Enable audit logging for production
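The single-failure tolerance of the 3-node control plane follows from etcd's majority quorum rule, assuming the cluster uses k3s's embedded etcd as the etcd firewall rules suggest. A minimal sketch of the arithmetic (the `tolerated_failures` helper is hypothetical):

```bash
# etcd stays available while a majority (quorum) of members is healthy.
# quorum = n / 2 + 1 (integer division); tolerated failures = n - quorum.
tolerated_failures() {
  echo $(( $1 - ($1 / 2 + 1) ))
}
```

`tolerated_failures 3` prints `1` and `tolerated_failures 5` prints `2`. An even member count adds no tolerance (`tolerated_failures 4` also prints `1`), which is why odd control-plane sizes are the usual choice.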
## License
MIT