2026-02-28 20:24:55 +00:00
# Hetzner Kubernetes Cluster
Production-ready Kubernetes cluster on Hetzner Cloud using Terraform and Ansible.
## Architecture
| Component | Details |
|-----------|---------|
| **Control Plane** | 3x CX23 (HA) |
| **Workers** | 4x CX33 |
| **Total Cost** | €28.93/mo |
| **K8s** | k3s (latest, HA) |
| **Addons** | Hetzner CCM + CSI + Prometheus + Grafana + Loki |
| **Access** | SSH/API restricted to Tailnet |
| **Bootstrap** | Terraform + Ansible |
### Cluster Resources
- 22 vCPU total (6 CP + 16 workers)
- 44 GB RAM total (12 CP + 32 workers)
- 440 GB SSD storage
- 140 TB bandwidth allocation
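
The totals above can be re-derived from node counts, which helps when the node mix changes. This is a minimal sketch: the per-node figures (2 vCPU / 4 GB / 40 GB for CX23, 4 vCPU / 8 GB / 80 GB for CX33, 20 TB traffic per server) are assumptions inferred from the totals listed here, not values taken from the Terraform code.

```bash
# Per-node specs are assumptions inferred from the totals listed above.
cp_count=3;     cp_vcpu=2;     cp_ram_gb=4;     cp_disk_gb=40
worker_count=4; worker_vcpu=4; worker_ram_gb=8; worker_disk_gb=80
traffic_tb_per_node=20

node_count=$((cp_count + worker_count))
total_vcpu=$((cp_count * cp_vcpu + worker_count * worker_vcpu))
total_ram=$((cp_count * cp_ram_gb + worker_count * worker_ram_gb))
total_disk=$((cp_count * cp_disk_gb + worker_count * worker_disk_gb))
total_traffic=$((node_count * traffic_tb_per_node))

summary="${total_vcpu} vCPU, ${total_ram} GB RAM, ${total_disk} GB SSD, ${total_traffic} TB traffic"
echo "$summary"
# prints: 22 vCPU, 44 GB RAM, 440 GB SSD, 140 TB traffic
```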
## Prerequisites
### 1. Hetzner Cloud API Token
1. Go to [Hetzner Cloud Console](https://console.hetzner.com/)
2. Select your project (or create a new one)
3. Navigate to **Security** → **API Tokens**
4. Click **Generate API Token**
5. Set description: `k8s-cluster-terraform`
6. Select permissions: **Read & Write**
7. Click **Generate API Token**
8. **Copy the token immediately** - it won't be shown again!
### 2. Backblaze B2 Bucket (for Terraform State)
1. Go to [Backblaze B2](https://secure.backblaze.com/b2_buckets.htm)
2. Click **Create a Bucket**
3. Set bucket name: `k8s-terraform-state` (must be globally unique)
4. Choose **Private** access
5. Click **Create Bucket**
6. Create application key:
   - Go to **App Keys** → **Add a New Application Key**
   - Name: `terraform-state`
   - Allow access to: `k8s-terraform-state` bucket only
   - Type: **Read and Write**
   - Copy **keyID** (access key) and **applicationKey** (secret key)
7. Note your bucket's S3 endpoint (e.g., `https://s3.eu-central-003.backblazeb2.com`)
### 3. SSH Key Pair
```bash
ssh-keygen -t ed25519 -C "k8s@hetzner" -f ~/.ssh/hetzner_k8s
```
### 4. Local Tools
- [Terraform](https://terraform.io/downloads) >= 1.0
- [Ansible](https://docs.ansible.com/ansible/latest/installation_guide/intro_installation.html) >= 2.9
- Python 3 with `jinja2` and `pyyaml`
## Setup
### 1. Clone Repository
```bash
git clone <your-gitea-repo>/HetznerTerra.git
cd HetznerTerra
```
### 2. Configure Variables
```bash
cp terraform.tfvars.example terraform.tfvars
```
Edit `terraform.tfvars`:
```hcl
hcloud_token = "your-hetzner-api-token"
ssh_public_key = "~/.ssh/hetzner_k8s.pub"
ssh_private_key = "~/.ssh/hetzner_k8s"
s3_access_key = "your-backblaze-key-id"
s3_secret_key = "your-backblaze-application-key"
s3_endpoint = "https://s3.eu-central-003.backblazeb2.com"
s3_bucket = "k8s-terraform-state"
tailscale_auth_key = "tskey-auth-..."
tailscale_tailnet = "yourtailnet.ts.net"
restrict_api_ssh_to_tailnet = true
tailnet_cidr = "100.64.0.0/10"
enable_nodeport_public = false
allowed_ssh_ips = []
allowed_api_ips = []
```
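Before running Terraform, it can help to confirm that no expected variable was dropped from `terraform.tfvars`. The helper below is a hypothetical sketch, not part of the repository; its list of required names is an assumption taken from the example above.

```bash
# Report any expected variable that is missing from a tfvars file.
# The list of names is an assumption based on the example tfvars above.
check_tfvars() {
  tfvars_file=$1
  missing=""
  for name in hcloud_token ssh_public_key ssh_private_key \
              s3_access_key s3_secret_key s3_endpoint s3_bucket \
              tailscale_auth_key tailscale_tailnet; do
    # Match "name = ..." at the start of a line, ignoring leading whitespace.
    if ! grep -Eq "^[[:space:]]*${name}[[:space:]]*=" "$tfvars_file"; then
      missing="${missing}${name} "
    fi
  done
  if [ -n "$missing" ]; then
    echo "missing: ${missing% }"
    return 1
  fi
  echo "ok"
}
```

Usage: `check_tfvars terraform.tfvars` prints `ok` when every name is present, or a `missing: ...` line and a non-zero exit status otherwise.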
### 3. Initialize Terraform
```bash
cd terraform
# Create backend config file (or use CLI args)
cat > backend.hcl << EOF
endpoint = "https://s3.eu-central-003.backblazeb2.com"
bucket = "k8s-terraform-state"
access_key = "your-backblaze-key-id"
secret_key = "your-backblaze-application-key"
skip_requesting_account_id = true
EOF
terraform init -backend-config=backend.hcl
```
### 4. Plan and Apply
```bash
terraform plan -var-file=../terraform.tfvars
terraform apply -var-file=../terraform.tfvars
```
### 5. Generate Ansible Inventory
```bash
cd ../ansible
python3 generate_inventory.py
```
### 6. Bootstrap Cluster
```bash
ansible-playbook site.yml
```
### 7. Get Kubeconfig
```bash
export KUBECONFIG=$(pwd)/outputs/kubeconfig
kubectl get nodes
```
The kubeconfig endpoint is rewritten to the primary control-plane tailnet hostname (`k8s-cluster-cp-1.<your-tailnet>`).
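
To illustrate what that rewrite amounts to, here is a minimal sketch that swaps the host in a kubeconfig `server:` line while keeping the scheme and port. `rewrite_kubeconfig_server` is a hypothetical helper for illustration only; the repository's Ansible roles perform the actual rewrite, and their implementation may differ.

```bash
# Print a kubeconfig with the API server host replaced, keeping scheme and port.
# Hypothetical helper; the repository's Ansible roles do the real rewrite.
rewrite_kubeconfig_server() {
  kubeconfig_file=$1
  new_host=$2
  sed -E "s#^([[:space:]]*server: https://)[^:/]+(:[0-9]+)?#\1${new_host}\2#" "$kubeconfig_file"
}
```

For example, `rewrite_kubeconfig_server outputs/kubeconfig "k8s-cluster-cp-1.yourtailnet.ts.net" > kubeconfig.tailnet` writes a copy that points at the tailnet hostname and leaves the original file untouched.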
## Gitea CI/CD
This repository includes Gitea workflows for:
- **terraform-plan**: Runs on PRs, shows planned changes
- **terraform-apply**: Runs on main branch after merge
- **ansible-deploy**: Runs after terraform apply
- **dashboards**: Fast workflow that updates Grafana datasources/dashboards only
### Required Gitea Secrets
Set these in your Gitea repository settings (**Settings** → **Secrets** → **Actions**):
| Secret | Description |
|--------|-------------|
| `HCLOUD_TOKEN` | Hetzner Cloud API token |
| `S3_ACCESS_KEY` | Backblaze B2 keyID |
| `S3_SECRET_KEY` | Backblaze B2 applicationKey |
| `S3_ENDPOINT` | Backblaze S3 endpoint (e.g., `https://s3.eu-central-003.backblazeb2.com`) |
| `S3_BUCKET` | S3 bucket name (e.g., `k8s-terraform-state`) |
| `TAILSCALE_AUTH_KEY` | Tailscale auth key for node bootstrap |
| `TAILSCALE_TAILNET` | Tailnet domain (e.g., `yourtailnet.ts.net`) |
| `TAILSCALE_OAUTH_CLIENT_ID` | Tailscale OAuth client ID for Kubernetes Operator |
| `TAILSCALE_OAUTH_CLIENT_SECRET` | Tailscale OAuth client secret for Kubernetes Operator |
| `DOPPLER_HETZNERTERRA_SERVICE_TOKEN` | Doppler service token for `hetznerterra` runtime secrets |
| `GRAFANA_ADMIN_PASSWORD` | Optional admin password for Grafana (auto-generated if unset) |
| `RUNNER_ALLOWED_CIDRS` | Optional CIDR list for CI runner access if you choose to pass it via tfvars/secrets |
| `SSH_PUBLIC_KEY` | SSH public key content |
| `SSH_PRIVATE_KEY` | SSH private key content |
## GitOps (Flux)
This repo uses Flux for continuous reconciliation after Terraform + Ansible bootstrap.
### Stable private-only baseline
The current default target is a deliberately simplified baseline:
- `1` control plane node
- `2` worker nodes
- private Hetzner network only
- Tailscale for operator access
- Flux-managed core addons only
Detailed phase gates and success criteria live in `STABLE_BASELINE.md`.
This is the default until rebuilds are consistently green. High availability, public ingress, and app-layer expansion come later.
### Runtime secrets
Runtime cluster secrets are moving to Doppler + External Secrets Operator.
- Doppler project: `hetznerterra`
- Initial auth: service token via `DOPPLER_HETZNERTERRA_SERVICE_TOKEN`
- First synced secrets:
  - `GRAFANA_ADMIN_PASSWORD`
  - `WEAVE_GITOPS_ADMIN_USERNAME`
  - `WEAVE_GITOPS_ADMIN_PASSWORD_BCRYPT_HASH`
Terraform/bootstrap secrets remain in Gitea Actions secrets and are not managed by Doppler.
### Repository layout
- `clusters/prod/`: cluster entrypoint and Flux reconciliation objects
- `clusters/prod/flux-system/`: `GitRepository` source and top-level `Kustomization` graph
- `infrastructure/`: infrastructure addon reconciliation graph
- `infrastructure/addons/*`: per-addon manifests for Flux-managed cluster addons
- `apps/`: application workload layer (currently scaffolded)
### Reconciliation graph
- `infrastructure` (top-level)
  - `addon-ccm`
  - `addon-csi` depends on `addon-ccm`
  - `addon-tailscale-operator`
  - `addon-observability`
  - `addon-observability-content` depends on `addon-observability`
- `apps` depends on `infrastructure`
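
One way to read the graph is as an ordering constraint. The sketch below feeds only the "depends on" edges stated above to `tsort`, which prints one valid ordering. It models the listed dependencies, not how Flux schedules reconciliation internally, and the `reconcile_order` function name is hypothetical.

```bash
# Each "A B" pair means A must come before B.
# An "X X" pair lists an item that has no stated dependency.
reconcile_order() {
  tsort <<'EOF'
addon-ccm addon-csi
addon-observability addon-observability-content
addon-tailscale-operator addon-tailscale-operator
infrastructure apps
EOF
}
reconcile_order
```

The exact output order can vary between `tsort` implementations, but `addon-ccm` always precedes `addon-csi`, `addon-observability` precedes `addon-observability-content`, and `infrastructure` precedes `apps`.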
### Bootstrap notes
1. Install Flux controllers in `flux-system`.
2. Create the Flux deploy key/secret named `flux-system` in the `flux-system` namespace.
3. Apply `clusters/prod/flux-system/` once to establish the source and reconciliation graph.
4. Bootstrap-only Ansible creates prerequisite secrets; Flux manages addon lifecycle after bootstrap.
### Current addon status
- Core infrastructure addons are Flux-managed from `infrastructure/addons/`.
- Active Flux addons for the stable baseline: `addon-tailscale-operator`, `addon-tailscale-proxyclass`, `addon-external-secrets`.
- Deferred addons: `addon-ccm`, `addon-csi`, `addon-observability`, `addon-observability-content` (to be added after the baseline is stable).
- Ansible is limited to cluster bootstrap, private-access setup, and prerequisite secret creation for Flux-managed addons.
- `addon-flux-ui` is optional for the stable-baseline phase and is not a blocker for rebuild success.
### Stable baseline acceptance
A rebuild is considered successful only when all of the following pass without manual intervention:
- Terraform create succeeds for the default `1` control plane and `2` workers.
- Ansible bootstrap succeeds end-to-end.
- All nodes become `Ready`.
- Flux core reconciliation is healthy.
- External Secrets Operator is ready.
- Tailscale operator is ready.
- Terraform destroy succeeds cleanly or succeeds after workflow retries.
_Note: Observability stack (Grafana/Prometheus) is deferred and will be added once the core platform baseline is stable._
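
The cluster-side acceptance checks can be scripted roughly as follows. This is a sketch under assumptions: `check_baseline` is a hypothetical helper, the timeouts are arbitrary, the External Secrets check only confirms that the `ExternalSecret` CRD is established (used here as a proxy for operator readiness), and the `tailscale-system` namespace and `deployment/operator` name are taken from the troubleshooting commands elsewhere in this README.

```bash
# Run the cluster-side acceptance checks in order; stop at the first failure.
# Hypothetical helper; timeouts and the CRD-based ESO check are assumptions.
check_baseline() {
  # All nodes report Ready.
  kubectl wait --for=condition=Ready nodes --all --timeout=300s &&
  # Flux prerequisites and controllers are healthy.
  flux check &&
  # List Kustomization status for review (does not fail on NotReady items).
  flux get kustomizations --all-namespaces &&
  # External Secrets Operator CRD is installed and established.
  kubectl wait --for=condition=Established crd/externalsecrets.external-secrets.io --timeout=120s &&
  # Tailscale operator deployment has rolled out.
  kubectl -n tailscale-system rollout status deployment/operator --timeout=120s
}
```

Run `check_baseline` with `KUBECONFIG` pointing at `outputs/kubeconfig`; a non-zero exit status means at least one check failed. The Terraform create/destroy and Ansible steps are outside what this function covers.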
## Observability Stack
Flux deploys a lightweight observability stack in the `observability` namespace:
- `kube-prometheus-stack` (Prometheus + Grafana)
- `loki`
- `promtail`
Grafana content is managed as code via ConfigMaps in `infrastructure/addons/observability-content/`.
Grafana and Prometheus are exposed through a single Tailscale front door backed by Traefik when the Tailscale Kubernetes Operator is healthy.
### Access Grafana and Prometheus
Preferred private access:
- Grafana: `http://k8s-cluster-cp-1.<your-tailnet>:30080/`
- Prometheus: `http://k8s-cluster-cp-1.<your-tailnet>:30990/`
- Flux UI: `http://k8s-cluster-cp-1.<your-tailnet>:30901/`
This access path is bootstrapped automatically by Ansible on `control_plane[0]` using persistent `kubectl port-forward` systemd services plus `tailscale serve`, so it survives cluster rebuilds.

Fallback (port-forward from a tailnet-connected machine):
```bash
export KUBECONFIG=$(pwd)/outputs/kubeconfig
# Each port-forward blocks, so run the first in the background (or use two terminals).
kubectl -n observability port-forward svc/kube-prometheus-stack-grafana 3000:80 &
kubectl -n observability port-forward svc/kube-prometheus-stack-prometheus 9090:9090
```
Then open:
- Grafana: http://127.0.0.1:3000
- Prometheus: http://127.0.0.1:9090
Grafana user: `admin`
Grafana password: value of `GRAFANA_ADMIN_PASSWORD` secret (or the generated value shown by Ansible output)
### Verify Tailscale exposure
```bash
export KUBECONFIG=$(pwd)/outputs/kubeconfig
kubectl -n tailscale-system get pods
kubectl -n observability get svc kube-prometheus-stack-grafana kube-prometheus-stack-prometheus
kubectl -n observability describe svc kube-prometheus-stack-grafana | grep TailscaleProxyReady
kubectl -n observability describe svc kube-prometheus-stack-prometheus | grep TailscaleProxyReady
```
If `TailscaleProxyReady=False`, check:
```bash
kubectl -n tailscale-system logs deployment/operator --tail=100
```
Common cause: OAuth client missing tag/scopes permissions.
### Fast dashboard iteration workflow
Use the `Deploy Grafana Content` workflow when changing dashboard/data source templates.
It avoids full cluster provisioning and only applies Grafana content resources:
- `ansible/roles/observability-content/templates/grafana-datasources.yaml.j2`
- `ansible/roles/observability-content/templates/grafana-dashboard-k8s-overview.yaml.j2`
- `ansible/dashboards.yml`
## File Structure
```
.
├── terraform/
│   ├── main.tf
│   ├── variables.tf
│   ├── network.tf
│   ├── firewall.tf
│   ├── ssh.tf
│   ├── servers.tf
│   ├── outputs.tf
│   └── backend.tf
├── ansible/
│   ├── inventory.tmpl
│   ├── generate_inventory.py
│   ├── site.yml
│   ├── roles/
│   │   ├── common/
│   │   ├── k3s-server/
│   │   ├── k3s-agent/
│   │   ├── addon-secrets-bootstrap/
│   │   ├── observability-content/
│   │   └── observability/
│   └── ansible.cfg
├── .gitea/
│   └── workflows/
│       ├── terraform.yml
│       ├── ansible.yml
│       └── dashboards.yml
├── outputs/
├── terraform.tfvars.example
└── README.md
```
## Firewall Rules
| Port | Source | Purpose |
|------|--------|---------|
| 22 | Tailnet CIDR | SSH |
| 6443 | Tailnet CIDR + internal | Kubernetes API |
| 41641/udp | Any | Tailscale WireGuard |
| 9345 | 10.0.0.0/16 | k3s Supervisor (HA join) |
| 2379 | 10.0.0.0/16 | etcd Client |
| 2380 | 10.0.0.0/16 | etcd Peer |
| 8472 | 10.0.0.0/16 | Flannel VXLAN |
| 10250 | 10.0.0.0/16 | Kubelet |
| 30000-32767 | Optional | NodePorts (disabled by default) |
## Operations
### Scale Workers
Edit `terraform.tfvars` :
```hcl
worker_count = 5
```
Then:
```bash
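# Run from the repository root, mirroring the Setup steps.
(cd terraform && terraform apply -var-file=../terraform.tfvars)
(cd ansible && python3 generate_inventory.py && ansible-playbook site.yml)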
```
### Upgrade k3s
```bash
ansible-playbook site.yml -t upgrade
```
### Destroy Cluster
```bash
terraform destroy
```
## Troubleshooting
### Check k3s Logs
```bash
ssh root@<control-plane-ip> journalctl -u k3s -f
```
### Reset k3s
```bash
ansible-playbook site.yml -t reset
```
## Costs Breakdown
| Resource | Quantity | Unit Price | Monthly |
|----------|----------|------------|---------|
| CX23 (Control Plane) | 3 | €2.99 | €8.97 |
| CX33 (Workers) | 4 | €4.99 | €19.96 |
| Backblaze B2 | ~1 GB | Free (first 10GB) | €0.00 |
| **Total** | | | **€28.93/mo** |
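
To estimate the bill for a different node mix (for example after scaling workers), the table's arithmetic can be sketched as below. The unit prices are copied from the table; `monthly_cost` is a hypothetical helper. It works in euro cents to avoid floating-point math and ignores Backblaze B2, which the table lists as free at this usage.

```bash
# Monthly cost from node counts, using the unit prices in the table above
# (CX23 = 299 cents, CX33 = 499 cents). Works in cents to stay in integers.
monthly_cost() {
  cp_nodes=$1
  worker_nodes=$2
  cents=$((cp_nodes * 299 + worker_nodes * 499))
  printf '€%d.%02d/mo\n' $((cents / 100)) $((cents % 100))
}
monthly_cost 3 4
# prints: €28.93/mo
```

With five workers, `monthly_cost 3 5` prints `€33.92/mo`.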
## Security Notes
- Control plane has HA (3 nodes, can survive 1 failure)
- Consider adding Hetzner load balancer for API server
- Rotate API tokens regularly
- Use network policies in Kubernetes
- Enable audit logging for production
## License
MIT