Files
HetznerTerra/README.md
T
micqdf 60f466ab98
Deploy Cluster / Terraform (push) Successful in 41s
Deploy Cluster / Ansible (push) Successful in 5m37s
remove Weave GitOps addon
Drop the Flux UI addon and its Tailscale exposure because the UI lags the
current Flux APIs and reports misleading HelmRelease errors. Keep Flux managed
through the controllers themselves and use Rancher or the flux CLI for access.
2026-04-18 18:44:55 +00:00

438 lines
13 KiB
Markdown

# Hetzner Kubernetes Cluster
Production-ready Kubernetes cluster on Hetzner Cloud using Terraform and Ansible.
## Architecture
| Component | Details |
|-----------|---------|
| **Control Plane** | 3x CX23 (HA) |
| **Workers** | 4x CX33 |
| **Total Cost** | €28.93/mo |
| **K8s** | k3s (latest, HA) |
| **Addons** | Hetzner CCM + CSI + Prometheus + Grafana + Loki |
| **Access** | SSH/API and Rancher UI restricted to Tailnet |
| **Bootstrap** | Terraform + Ansible |
### Cluster Resources
- 22 vCPU total (6 CP + 16 workers)
- 44 GB RAM total (12 CP + 32 workers)
- 440 GB SSD storage
- 140 TB bandwidth allocation
## Prerequisites
### 1. Hetzner Cloud API Token
1. Go to [Hetzner Cloud Console](https://console.hetzner.com/)
2. Select your project (or create a new one)
3. Navigate to **Security****API Tokens**
4. Click **Generate API Token**
5. Set description: `k8s-cluster-terraform`
6. Select permissions: **Read & Write**
7. Click **Generate API Token**
8. **Copy the token immediately** - it won't be shown again!
### 2. Backblaze B2 Bucket (for Terraform State)
1. Go to [Backblaze B2](https://secure.backblaze.com/b2_buckets.htm)
2. Click **Create a Bucket**
3. Set bucket name: `k8s-terraform-state` (must be globally unique)
4. Choose **Private** access
5. Click **Create Bucket**
6. Create application key:
- Go to **App Keys****Add a New Application Key**
- Name: `terraform-state`
- Allow access to: `k8s-terraform-state` bucket only
- Type: **Read and Write**
- Copy **keyID** (access key) and **applicationKey** (secret key)
7. Note your bucket's S3 endpoint (e.g., `https://s3.eu-central-003.backblazeb2.com`)
### 3. SSH Key Pair
```bash
ssh-keygen -t ed25519 -C "k8s@hetzner" -f ~/.ssh/hetzner_k8s
```
### 4. Local Tools
- [Terraform](https://terraform.io/downloads) >= 1.0
- [Ansible](https://docs.ansible.com/ansible/latest/installation_guide/intro_installation.html) >= 2.9
- Python 3 with `jinja2` and `pyyaml`
## Setup
### 1. Clone Repository
```bash
git clone <your-gitea-repo>/HetznerTerra.git
cd HetznerTerra
```
### 2. Configure Variables
```bash
cp terraform.tfvars.example terraform.tfvars
```
Edit `terraform.tfvars`:
```hcl
hcloud_token = "your-hetzner-api-token"
ssh_public_key = "~/.ssh/hetzner_k8s.pub"
ssh_private_key = "~/.ssh/hetzner_k8s"
s3_access_key = "your-backblaze-key-id"
s3_secret_key = "your-backblaze-application-key"
s3_endpoint = "https://s3.eu-central-003.backblazeb2.com"
s3_bucket = "k8s-terraform-state"
tailscale_auth_key = "tskey-auth-..."
tailscale_tailnet = "yourtailnet.ts.net"
restrict_api_ssh_to_tailnet = true
tailnet_cidr = "100.64.0.0/10"
enable_nodeport_public = false
allowed_ssh_ips = []
allowed_api_ips = []
```
### 3. Initialize Terraform
```bash
cd terraform
# Create backend config file (or use CLI args)
cat > backend.hcl << EOF
endpoint = "https://s3.eu-central-003.backblazeb2.com"
bucket = "k8s-terraform-state"
access_key = "your-backblaze-key-id"
secret_key = "your-backblaze-application-key"
skip_requesting_account_id = true
EOF
terraform init -backend-config=backend.hcl
```
### 4. Plan and Apply
```bash
terraform plan -var-file=../terraform.tfvars
terraform apply -var-file=../terraform.tfvars
```
### 5. Generate Ansible Inventory
```bash
cd ../ansible
python3 generate_inventory.py
```
### 6. Bootstrap Cluster
```bash
ansible-playbook site.yml
```
### 7. Get Kubeconfig
```bash
export KUBECONFIG=$(pwd)/outputs/kubeconfig
kubectl get nodes
```
Kubeconfig endpoint is rewritten to the primary control-plane tailnet hostname (`k8s-cluster-cp-1.<your-tailnet>`).
## Gitea CI/CD
This repository includes Gitea workflows for:
- **terraform-plan**: Runs on PRs, shows planned changes
- **terraform-apply**: Runs on main branch after merge
- **ansible-deploy**: Runs after terraform apply
- **dashboards**: Fast workflow that updates Grafana datasources/dashboards only
### Required Gitea Secrets
Set these in your Gitea repository settings (**Settings** → **Secrets****Actions**):
| Secret | Description |
|--------|-------------|
| `HCLOUD_TOKEN` | Hetzner Cloud API token |
| `S3_ACCESS_KEY` | Backblaze B2 keyID |
| `S3_SECRET_KEY` | Backblaze B2 applicationKey |
| `S3_ENDPOINT` | Backblaze S3 endpoint (e.g., `https://s3.eu-central-003.backblazeb2.com`) |
| `S3_BUCKET` | S3 bucket name (e.g., `k8s-terraform-state`) |
| `TAILSCALE_AUTH_KEY` | Tailscale auth key for node bootstrap |
| `TAILSCALE_TAILNET` | Tailnet domain (e.g., `yourtailnet.ts.net`) |
| `TAILSCALE_OAUTH_CLIENT_ID` | Tailscale OAuth client ID for Kubernetes Operator |
| `TAILSCALE_OAUTH_CLIENT_SECRET` | Tailscale OAuth client secret for Kubernetes Operator |
| `DOPPLER_HETZNERTERRA_SERVICE_TOKEN` | Doppler service token for `hetznerterra` runtime secrets |
| `GRAFANA_ADMIN_PASSWORD` | Optional admin password for Grafana (auto-generated if unset) |
| `RUNNER_ALLOWED_CIDRS` | Optional CIDR list for CI runner access if you choose to pass it via tfvars/secrets |
| `SSH_PUBLIC_KEY` | SSH public key content |
| `SSH_PRIVATE_KEY` | SSH private key content |
## GitOps (Flux)
This repo uses Flux for continuous reconciliation after Terraform + Ansible bootstrap.
### Stable private-only baseline
The current default target is a deliberately simplified baseline:
- `1` control plane node
- `2` worker nodes
- private Hetzner network only
- Tailscale for operator access
- Flux-managed core addons only
Detailed phase gates and success criteria live in `STABLE_BASELINE.md`.
This is the default until rebuilds are consistently green. High availability, public ingress, and app-layer expansion come later.
### Runtime secrets
Runtime cluster secrets are moving to Doppler + External Secrets Operator.
- Doppler project: `hetznerterra`
- Initial auth: service token via `DOPPLER_HETZNERTERRA_SERVICE_TOKEN`
- First synced secrets:
- `GRAFANA_ADMIN_PASSWORD`
Terraform/bootstrap secrets remain in Gitea Actions secrets and are not managed by Doppler.
### Repository layout
- `clusters/prod/`: cluster entrypoint and Flux reconciliation objects
- `clusters/prod/flux-system/`: `GitRepository` source and top-level `Kustomization` graph
- `infrastructure/`: infrastructure addon reconciliation graph
- `infrastructure/addons/*`: per-addon manifests for Flux-managed cluster addons
- `apps/`: application workload layer (currently scaffolded)
### Reconciliation graph
- `infrastructure` (top-level)
- `addon-ccm`
- `addon-csi` depends on `addon-ccm`
- `addon-tailscale-operator`
- `addon-observability`
- `addon-observability-content` depends on `addon-observability`
- `apps` depends on `infrastructure`
### Bootstrap notes
1. Install Flux controllers in `flux-system`.
2. Create the Flux deploy key/secret named `flux-system` in `flux-system` namespace.
3. Apply `clusters/prod/flux-system/` once to establish source + reconciliation graph.
4. Bootstrap-only Ansible creates prerequisite secrets; Flux manages addon lifecycle after bootstrap.
### Current addon status
- Core infrastructure addons are Flux-managed from `infrastructure/addons/`.
- Active Flux addons for stable baseline: `addon-tailscale-operator`, `addon-tailscale-proxyclass`, `addon-external-secrets`.
- Deferred addons: `addon-ccm`, `addon-csi`, `addon-observability`, `addon-observability-content` (to be added after baseline is stable).
- Ansible is limited to cluster bootstrap, private-access setup, and prerequisite secret creation for Flux-managed addons.
- Weave GitOps / Flux UI is no longer deployed; use Rancher or the `flux` CLI for Flux operations.
### Rancher access
- Rancher is private-only and exposed through Tailscale at `https://rancher.silverside-gopher.ts.net/dashboard/`.
- The public Hetzner load balancer path is not used for Rancher.
- Rancher uses the CNPG-backed PostgreSQL cluster in `cnpg-cluster`.
### Stable baseline acceptance
A rebuild is considered successful only when all of the following pass without manual intervention:
- Terraform create succeeds for the default `1` control plane and `2` workers.
- Ansible bootstrap succeeds end-to-end.
- All nodes become `Ready`.
- Flux core reconciliation is healthy.
- External Secrets Operator is ready.
- Tailscale operator is ready.
- Terraform destroy succeeds cleanly or succeeds after workflow retries.
_Note: Observability stack (Grafana/Prometheus) is deferred and will be added once the core platform baseline is stable._
## Observability Stack
Flux deploys a lightweight observability stack in the `observability` namespace:
- `kube-prometheus-stack` (Prometheus + Grafana)
- `loki`
- `promtail`
Grafana content is managed as code via ConfigMaps in `infrastructure/addons/observability-content/`.
Grafana and Prometheus are exposed through dedicated Tailscale LoadBalancer services when the Tailscale Kubernetes Operator is healthy.
### Access Grafana and Prometheus
Preferred private access:
- Grafana: `http://grafana.silverside-gopher.ts.net/`
- Prometheus: `http://prometheus.silverside-gopher.ts.net:9090/`
Fallback (port-forward from a tailnet-connected machine):
Run from a tailnet-connected machine:
```bash
export KUBECONFIG=$(pwd)/outputs/kubeconfig
kubectl -n observability port-forward svc/kube-prometheus-stack-grafana 3000:80
kubectl -n observability port-forward svc/kube-prometheus-stack-prometheus 9090:9090
```
Then open:
- Grafana: http://127.0.0.1:3000
- Prometheus: http://127.0.0.1:9090
Grafana user: `admin`
Grafana password: value of `GRAFANA_ADMIN_PASSWORD` secret (or the generated value shown by Ansible output)
### Verify Tailscale exposure
```bash
export KUBECONFIG=$(pwd)/outputs/kubeconfig
kubectl -n tailscale-system get pods
kubectl -n observability get svc kube-prometheus-stack-grafana kube-prometheus-stack-prometheus
kubectl -n observability describe svc kube-prometheus-stack-grafana | grep TailscaleProxyReady
kubectl -n observability describe svc kube-prometheus-stack-prometheus | grep TailscaleProxyReady
```
If `TailscaleProxyReady=False`, check:
```bash
kubectl -n tailscale-system logs deployment/operator --tail=100
```
Common cause: OAuth client missing tag/scopes permissions.
### Fast dashboard iteration workflow
Use the `Deploy Grafana Content` workflow when changing dashboard/data source templates.
It avoids full cluster provisioning and only applies Grafana content resources:
- `ansible/roles/observability-content/templates/grafana-datasources.yaml.j2`
- `ansible/roles/observability-content/templates/grafana-dashboard-k8s-overview.yaml.j2`
- `ansible/dashboards.yml`
## File Structure
```
.
├── terraform/
│ ├── main.tf
│ ├── variables.tf
│ ├── network.tf
│ ├── firewall.tf
│ ├── ssh.tf
│ ├── servers.tf
│ ├── outputs.tf
│ └── backend.tf
├── ansible/
│ ├── inventory.tmpl
│ ├── generate_inventory.py
│ ├── site.yml
│ ├── roles/
│ │ ├── common/
│ │ ├── k3s-server/
│ │ ├── k3s-agent/
│ │ ├── addon-secrets-bootstrap/
│ │ ├── observability-content/
│ │ └── observability/
│ └── ansible.cfg
├── .gitea/
│ └── workflows/
│ ├── terraform.yml
│ ├── ansible.yml
│ └── dashboards.yml
├── outputs/
├── terraform.tfvars.example
└── README.md
```
## Firewall Rules
| Port | Source | Purpose |
|------|--------|---------|
| 22 | Tailnet CIDR | SSH |
| 6443 | Tailnet CIDR + internal | Kubernetes API |
| 41641/udp | Any | Tailscale WireGuard |
| 9345 | 10.0.0.0/16 | k3s Supervisor (HA join) |
| 2379 | 10.0.0.0/16 | etcd Client |
| 2380 | 10.0.0.0/16 | etcd Peer |
| 8472 | 10.0.0.0/16 | Flannel VXLAN |
| 10250 | 10.0.0.0/16 | Kubelet |
| 30000-32767 | Optional | NodePorts (disabled by default) |
## Operations
### Scale Workers
Edit `terraform.tfvars`:
```hcl
worker_count = 5
```
Then:
```bash
terraform apply
ansible-playbook site.yml
```
### Upgrade k3s
```bash
ansible-playbook site.yml -t upgrade
```
### Destroy Cluster
```bash
terraform destroy
```
## Troubleshooting
### Check k3s Logs
```bash
ssh root@<control-plane-ip> journalctl -u k3s -f
```
### Reset k3s
```bash
ansible-playbook site.yml -t reset
```
## Costs Breakdown
| Resource | Quantity | Unit Price | Monthly |
|----------|----------|------------|---------|
| CX23 (Control Plane) | 3 | €2.99 | €8.97 |
| CX33 (Workers) | 4 | €4.99 | €19.96 |
| Backblaze B2 | ~1 GB | Free (first 10GB) | €0.00 |
| **Total** | | | **€28.93/mo** |
## Security Notes
- Control plane has HA (3 nodes, can survive 1 failure)
- Consider adding Hetzner load balancer for API server
- Rotate API tokens regularly
- Use network policies in Kubernetes
- Enable audit logging for production
## License
MIT