2026-02-28 20:24:55 +00:00
# Hetzner Kubernetes Cluster
Production-ready Kubernetes cluster on Hetzner Cloud using Terraform and Ansible.
## Architecture
| Component | Details |
|-----------|---------|
| **Control Plane** | 3x CX23 (HA) |
| **Workers** | 4x CX33 |
| **Total Cost** | €28.93/mo |
| **K8s** | k3s (latest, HA) |
| **Addons** | Hetzner CCM + CSI + Prometheus + Grafana + Loki |
| **Access** | SSH/API restricted to Tailnet |
| **Bootstrap** | Terraform + Ansible |
### Cluster Resources
- 22 vCPU total (6 CP + 16 workers)
- 44 GB RAM total (12 CP + 32 workers)
- 440 GB SSD storage
- 140 TB bandwidth allocation
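
The totals above follow from the per-node specs. A minimal sketch of the arithmetic, assuming each CX23 provides 2 vCPU / 4 GB RAM / 40 GB SSD, each CX33 provides 4 vCPU / 8 GB RAM / 80 GB SSD, and every server includes 20 TB of traffic. These per-node figures are inferred from the totals and are not stated in this README:

```python
# Per-node specs are assumptions inferred from the totals listed above.
NODE_SPECS = {
    "cx23": {"vcpu": 2, "ram_gb": 4, "disk_gb": 40, "traffic_tb": 20},
    "cx33": {"vcpu": 4, "ram_gb": 8, "disk_gb": 80, "traffic_tb": 20},
}
NODE_COUNTS = {"cx23": 3, "cx33": 4}  # 3 control-plane nodes, 4 workers


def cluster_totals(specs, counts):
    """Sum each resource across all nodes of every server type."""
    totals = {}
    for server_type, count in counts.items():
        for resource, amount in specs[server_type].items():
            totals[resource] = totals.get(resource, 0) + amount * count
    return totals


if __name__ == "__main__":
    print(cluster_totals(NODE_SPECS, NODE_COUNTS))
```

Changing `NODE_COUNTS` gives a quick estimate of capacity before editing `worker_count`.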
## Prerequisites
### 1. Hetzner Cloud API Token
1. Go to [Hetzner Cloud Console](https://console.hetzner.com/)
2. Select your project (or create a new one)
3. Navigate to **Security** → **API Tokens**
4. Click **Generate API Token**
5. Set description: `k8s-cluster-terraform`
6. Select permissions: **Read & Write**
7. Click **Generate API Token**
8. **Copy the token immediately** - it won't be shown again!
### 2. Backblaze B2 Bucket (for Terraform State)
1. Go to [Backblaze B2](https://secure.backblaze.com/b2_buckets.htm)
2. Click **Create a Bucket**
3. Set bucket name: `k8s-terraform-state` (must be globally unique)
4. Choose **Private** access
5. Click **Create Bucket**
6. Create application key:
   - Go to **App Keys** → **Add a New Application Key**
   - Name: `terraform-state`
   - Allow access to: `k8s-terraform-state` bucket only
   - Type: **Read and Write**
   - Copy **keyID** (access key) and **applicationKey** (secret key)
7. Note your bucket's S3 endpoint (e.g., `https://s3.eu-central-003.backblazeb2.com`)
### 3. SSH Key Pair
```bash
ssh-keygen -t ed25519 -C "k8s@hetzner" -f ~/.ssh/hetzner_k8s
```
### 4. Local Tools
- [Terraform](https://terraform.io/downloads) >= 1.0
- [Ansible](https://docs.ansible.com/ansible/latest/installation_guide/intro_installation.html) >= 2.9
- Python 3 with `jinja2` and `pyyaml`
## Setup
### 1. Clone Repository
```bash
git clone <your-gitea-repo>/HetznerTerra.git
cd HetznerTerra
```
### 2. Configure Variables
```bash
cp terraform.tfvars.example terraform.tfvars
```
Edit `terraform.tfvars`:
```hcl
hcloud_token = "your-hetzner-api-token"
ssh_public_key = "~/.ssh/hetzner_k8s.pub"
ssh_private_key = "~/.ssh/hetzner_k8s"
s3_access_key = "your-backblaze-key-id"
s3_secret_key = "your-backblaze-application-key"
s3_endpoint = "https://s3.eu-central-003.backblazeb2.com"
s3_bucket = "k8s-terraform-state"
tailscale_auth_key = "tskey-auth-..."
tailscale_tailnet = "yourtailnet.ts.net"
restrict_api_ssh_to_tailnet = true
tailnet_cidr = "100.64.0.0/10"
enable_nodeport_public = false
allowed_ssh_ips = []
allowed_api_ips = []
```
### 3. Initialize Terraform
```bash
cd terraform
# Create backend config file (or use CLI args)
cat > backend.hcl << EOF
endpoint = "https://s3.eu-central-003.backblazeb2.com"
bucket = "k8s-terraform-state"
access_key = "your-backblaze-key-id"
secret_key = "your-backblaze-application-key"
skip_requesting_account_id = true
EOF
terraform init -backend-config=backend.hcl
```
### 4. Plan and Apply
```bash
terraform plan -var-file=../terraform.tfvars
terraform apply -var-file=../terraform.tfvars
```
### 5. Generate Ansible Inventory
```bash
cd ../ansible
python3 generate_inventory.py
```
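
`generate_inventory.py` is not reproduced in this README. As a rough illustration of what such a step typically does, the sketch below turns a parsed `terraform output -json` document into an INI-style inventory. The output names `control_plane_ips` and `worker_ips`, the group names, and the host naming are all hypothetical; the real script renders `inventory.tmpl` and may differ.

```python
import json


def render_inventory(tf_outputs):
    """Build an INI-style Ansible inventory from parsed `terraform output -json`."""
    # Hypothetical output names; adjust to whatever outputs.tf actually defines.
    control_plane = tf_outputs["control_plane_ips"]["value"]
    workers = tf_outputs["worker_ips"]["value"]
    lines = ["[control_plane]"]
    lines += [f"cp-{i} ansible_host={ip}" for i, ip in enumerate(control_plane, start=1)]
    lines += ["", "[workers]"]
    lines += [f"worker-{i} ansible_host={ip}" for i, ip in enumerate(workers, start=1)]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Example input shaped like `terraform output -json`; addresses are placeholders.
    sample = json.loads(
        '{"control_plane_ips": {"value": ["10.0.0.2"]},'
        ' "worker_ips": {"value": ["10.0.0.10", "10.0.0.11"]}}'
    )
    print(render_inventory(sample), end="")
```

Keeping the rendering in a pure function makes it easy to check against sample Terraform outputs before running the playbook.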
### 6. Bootstrap Cluster
```bash
ansible-playbook site.yml
```
### 7. Get Kubeconfig
```bash
# Run from the repository root, where outputs/ lives (see File Structure)
export KUBECONFIG="$(pwd)/outputs/kubeconfig"
kubectl get nodes
```
The kubeconfig endpoint is rewritten to the primary control-plane tailnet hostname (`k8s-cluster-cp-1.<your-tailnet>`).
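
To make the endpoint rewrite concrete, here is a minimal sketch of the transformation on an already-parsed kubeconfig. The function name `rewrite_server` is hypothetical and the playbook may implement this differently; port 6443 is the Kubernetes API port listed under Firewall Rules, and `https://127.0.0.1:6443` is the address k3s writes into its default kubeconfig.

```python
import copy


def rewrite_server(kubeconfig, hostname, port=6443):
    """Return a copy of a parsed kubeconfig with every cluster pointed at hostname."""
    updated = copy.deepcopy(kubeconfig)
    for entry in updated.get("clusters", []):
        entry["cluster"]["server"] = f"https://{hostname}:{port}"
    return updated


if __name__ == "__main__":
    # Minimal kubeconfig fragment as it would look after YAML parsing.
    original = {
        "clusters": [
            {"name": "default", "cluster": {"server": "https://127.0.0.1:6443"}}
        ]
    }
    rewritten = rewrite_server(original, "k8s-cluster-cp-1.yourtailnet.ts.net")
    print(rewritten["clusters"][0]["cluster"]["server"])
```

Because the rewritten endpoint is a tailnet hostname, `kubectl` only works from a machine that is connected to the tailnet.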
## Gitea CI/CD
This repository includes Gitea workflows for:
- **terraform-plan**: Runs on PRs, shows planned changes
- **terraform-apply**: Runs on main branch after merge
- **ansible-deploy**: Runs after terraform apply
- **dashboards**: Fast workflow that updates Grafana datasources/dashboards only
### Required Gitea Secrets
Set these in your Gitea repository settings (**Settings** → **Secrets** → **Actions**):
| Secret | Description |
|--------|-------------|
| `HCLOUD_TOKEN` | Hetzner Cloud API token |
| `S3_ACCESS_KEY` | Backblaze B2 keyID |
| `S3_SECRET_KEY` | Backblaze B2 applicationKey |
| `S3_ENDPOINT` | Backblaze S3 endpoint (e.g., `https://s3.eu-central-003.backblazeb2.com`) |
| `S3_BUCKET` | S3 bucket name (e.g., `k8s-terraform-state`) |
| `TAILSCALE_AUTH_KEY` | Tailscale auth key for node bootstrap |
| `TAILSCALE_TAILNET` | Tailnet domain (e.g., `yourtailnet.ts.net`) |
| `TAILSCALE_OAUTH_CLIENT_ID` | Tailscale OAuth client ID for Kubernetes Operator |
| `TAILSCALE_OAUTH_CLIENT_SECRET` | Tailscale OAuth client secret for Kubernetes Operator |
| `GRAFANA_ADMIN_PASSWORD` | Optional admin password for Grafana (auto-generated if unset) |
| `RUNNER_ALLOWED_CIDRS` | Optional CIDR list for CI runner access if you choose to pass it via tfvars/secrets |
| `SSH_PUBLIC_KEY` | SSH public key content |
| `SSH_PRIVATE_KEY` | SSH private key content |
## GitOps (Flux)
This repo now includes a Flux GitOps layout for phased migration from imperative Ansible applies to continuous reconciliation.
### Repository layout
- `clusters/prod/`: cluster entrypoint and Flux reconciliation objects
- `clusters/prod/flux-system/`: `GitRepository` source and top-level `Kustomization` graph
- `infrastructure/`: infrastructure addon reconciliation graph
- `infrastructure/addons/*`: per-addon manifests (observability + observability-content migrated)
- `apps/`: application workload layer (currently scaffolded)
### Reconciliation graph
- `infrastructure` (top-level)
  - `addon-ccm`
  - `addon-csi` depends on `addon-ccm`
  - `addon-tailscale-operator`
  - `addon-observability`
  - `addon-observability-content` depends on `addon-observability`
- `apps` depends on `infrastructure`
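
Flux resolves this ordering itself through `dependsOn` on the `Kustomization` objects. Purely as an illustration, the sketch below encodes the "depends on" edges listed above and computes one valid reconciliation order with Python's standard `graphlib`. Treating `infrastructure` as a node without dependencies is a simplification, since it is the parent that applies the addon `Kustomization` objects.

```python
from graphlib import TopologicalSorter

# Each key lists the Kustomizations it depends on, as stated in the graph above.
DEPENDS_ON = {
    "addon-ccm": set(),
    "addon-csi": {"addon-ccm"},
    "addon-tailscale-operator": set(),
    "addon-observability": set(),
    "addon-observability-content": {"addon-observability"},
    "infrastructure": set(),
    "apps": {"infrastructure"},
}


def reconcile_order(depends_on):
    """Return one valid order in which every dependency precedes its dependents."""
    return list(TopologicalSorter(depends_on).static_order())


if __name__ == "__main__":
    for name in reconcile_order(DEPENDS_ON):
        print(name)
```

Several orders are valid; the only guarantees are that `addon-ccm` precedes `addon-csi`, `addon-observability` precedes `addon-observability-content`, and `infrastructure` precedes `apps`.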
### Bootstrap notes
1. Install Flux controllers in `flux-system`.
2. Create the Flux deploy key/secret named `flux-system` in the `flux-system` namespace.
3. Apply `clusters/prod/flux-system/` once to establish source + reconciliation graph.
4. Unsuspend addon `Kustomization` objects one-by-one as each addon is migrated from Ansible.
### Current migration status
- `addon-observability-content` is now GitOps-managed from `infrastructure/addons/observability-content/`.
- `addon-observability` is now GitOps-managed from `infrastructure/addons/observability/` using Flux `HelmRelease` resources for:
  - `kube-prometheus-stack`
  - `loki`
  - `promtail`
- Remaining addons stay suspended until migrated.
- During the transition, avoid applying Grafana content from both Flux and Ansible at the same time.

Ansible `site.yml` skips the `observability` and `observability-content` roles when `observability_gitops_enabled=true`, which is the default.
## Observability Stack
The Ansible playbook deploys a lightweight observability stack in the `observability` namespace:
- `kube-prometheus-stack` (Prometheus + Grafana)
- `loki`
- `promtail`

Grafana content is managed as code via ConfigMaps in `infrastructure/addons/observability-content/` (Flux), migrated from `ansible/roles/observability-content/`.

Services are kept internal by default, with optional declarative Tailscale exposure when the Tailscale Kubernetes Operator is healthy.
### Access Grafana and Prometheus
Preferred (when Tailscale Operator is healthy):
- Grafana: `http://grafana` (or `http://grafana.<your-tailnet>`)
- Prometheus: `http://prometheus` (or `http://prometheus.<your-tailnet>`)

Fallback (port-forward from a tailnet-connected machine):
```bash
export KUBECONFIG=$(pwd)/outputs/kubeconfig
# Each port-forward runs in the foreground; use a separate terminal for each
kubectl -n observability port-forward svc/kube-prometheus-stack-grafana 3000:80
kubectl -n observability port-forward svc/kube-prometheus-stack-prometheus 9090:9090
```
Then open:
- Grafana: http://127.0.0.1:3000
- Prometheus: http://127.0.0.1:9090
Grafana user: `admin`
Grafana password: value of `GRAFANA_ADMIN_PASSWORD` secret (or the generated value shown by Ansible output)
### Verify Tailscale exposure
```bash
export KUBECONFIG=$(pwd)/outputs/kubeconfig
kubectl -n tailscale-system get pods
kubectl -n observability get svc kube-prometheus-stack-grafana kube-prometheus-stack-prometheus
kubectl -n observability describe svc kube-prometheus-stack-grafana | grep TailscaleProxyReady
kubectl -n observability describe svc kube-prometheus-stack-prometheus | grep TailscaleProxyReady
```
If `TailscaleProxyReady=False`, check the operator logs:
```bash
kubectl -n tailscale-system logs deployment/operator --tail=100
```
Common cause: the OAuth client is missing the required tags or scopes.
### Fast dashboard iteration workflow
Use the `Deploy Grafana Content` workflow when changing dashboard/data source templates.
It avoids full cluster provisioning and only applies Grafana content resources:
- `ansible/roles/observability-content/templates/grafana-datasources.yaml.j2`
- `ansible/roles/observability-content/templates/grafana-dashboard-k8s-overview.yaml.j2`
- `ansible/dashboards.yml`
## File Structure
```
.
├── terraform/
│   ├── main.tf
│   ├── variables.tf
│   ├── network.tf
│   ├── firewall.tf
│   ├── ssh.tf
│   ├── servers.tf
│   ├── outputs.tf
│   └── backend.tf
├── ansible/
│   ├── inventory.tmpl
│   ├── generate_inventory.py
│   ├── site.yml
│   ├── roles/
│   │   ├── common/
│   │   ├── k3s-server/
│   │   ├── k3s-agent/
│   │   ├── ccm/
│   │   ├── csi/
│   │   ├── tailscale-operator/
│   │   ├── observability-content/
│   │   └── observability/
│   └── ansible.cfg
├── .gitea/
│   └── workflows/
│       ├── terraform.yml
│       ├── ansible.yml
│       └── dashboards.yml
├── outputs/
├── terraform.tfvars.example
└── README.md
```
## Firewall Rules
| Port | Source | Purpose |
|------|--------|---------|
| 22 | Tailnet CIDR | SSH |
| 6443 | Tailnet CIDR + internal | Kubernetes API |
| 41641/udp | Any | Tailscale WireGuard |
| 9345 | 10.0.0.0/16 | k3s Supervisor (HA join) |
| 2379 | 10.0.0.0/16 | etcd Client |
| 2380 | 10.0.0.0/16 | etcd Peer |
| 8472 | 10.0.0.0/16 | Flannel VXLAN |
| 10250 | 10.0.0.0/16 | Kubelet |
| 30000-32767 | Optional | NodePorts (disabled by default) |
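
As a way to reason about the table, the sketch below models the default rules and checks whether a given source address may reach a port. It is an illustration only, not the Terraform firewall definition: protocols are ignored, "internal" is assumed to mean the `10.0.0.0/16` private network, the tailnet range is the `tailnet_cidr` value from `terraform.tfvars`, and NodePorts are left out because they are disabled by default.

```python
from ipaddress import ip_address, ip_network

TAILNET = ip_network("100.64.0.0/10")  # tailnet_cidr from terraform.tfvars
PRIVATE = ip_network("10.0.0.0/16")    # internal network from the table
ANYWHERE = ip_network("0.0.0.0/0")

# Port -> allowed source networks (protocols ignored for brevity).
RULES = {
    22: [TAILNET],
    6443: [TAILNET, PRIVATE],
    41641: [ANYWHERE],
    9345: [PRIVATE],
    2379: [PRIVATE],
    2380: [PRIVATE],
    8472: [PRIVATE],
    10250: [PRIVATE],
}


def is_allowed(port, source):
    """Return True if `source` may reach `port` under the default rules."""
    addr = ip_address(source)
    return any(addr in net for net in RULES.get(port, []))


if __name__ == "__main__":
    print(is_allowed(22, "100.100.1.1"))   # tailnet address reaching SSH
    print(is_allowed(22, "203.0.113.5"))   # public address reaching SSH
```

Any port missing from `RULES` is treated as closed, which mirrors `enable_nodeport_public = false`.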
## Operations
### Scale Workers
Edit `terraform.tfvars` :
```hcl
worker_count = 5
```
Then:
```bash
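terraform apply -var-file=../terraform.tfvars  # run from terraform/
python3 generate_inventory.py                  # run from ansible/ so the new node is in the inventory
ansible-playbook site.yml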
```
### Upgrade k3s
```bash
ansible-playbook site.yml -t upgrade
```
### Destroy Cluster
```bash
terraform destroy -var-file=../terraform.tfvars  # run from terraform/
```
## Troubleshooting
### Check k3s Logs
```bash
ssh root@<control-plane-ip> journalctl -u k3s -f
```
### Reset k3s
```bash
ansible-playbook site.yml -t reset
```
## Costs Breakdown
| Resource | Quantity | Unit Price | Monthly |
|----------|----------|------------|---------|
| CX23 (Control Plane) | 3 | €2.99 | €8.97 |
| CX33 (Workers) | 4 | €4.99 | €19.96 |
| Backblaze B2 | ~1 GB | Free (first 10 GB) | €0.00 |
| **Total** | | | **€28.93/mo** |
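
The total can be reproduced, and re-estimated for other cluster sizes, with a few lines of arithmetic. This is a sketch using the unit prices from the table; the helper name `monthly_cost` is hypothetical, and Backblaze B2 is assumed to stay within its free tier.

```python
from decimal import Decimal

# Unit prices in EUR per month, taken from the table above.
CONTROL_PLANE_PRICE = Decimal("2.99")  # CX23
WORKER_PRICE = Decimal("4.99")         # CX33


def monthly_cost(control_plane_count, worker_count):
    """Return the monthly server cost in EUR; B2 state storage is assumed free."""
    return CONTROL_PLANE_PRICE * control_plane_count + WORKER_PRICE * worker_count


if __name__ == "__main__":
    print(monthly_cost(3, 4))  # current cluster
    print(monthly_cost(3, 5))  # after scaling to worker_count = 5
```

`Decimal` is used instead of floats so the cents add up exactly.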
## Security Notes
- Control plane has HA (3 nodes, can survive 1 failure)
- Consider adding Hetzner load balancer for API server
- Rotate API tokens regularly
- Use network policies in Kubernetes
- Enable audit logging for production
## License
MIT