# Hetzner Kubernetes Cluster

Production-ready Kubernetes cluster on Hetzner Cloud using Terraform and Ansible.

## Architecture

| Component | Details |
|-----------|---------|
| **Control Plane** | 3x CX23 (HA) |
| **Workers** | 4x CX33 |
| **Total Cost** | €28.93/mo |
| **K8s** | k3s (latest, HA) |
| **Addons** | Hetzner CCM + CSI + Prometheus + Grafana + Loki |
| **Access** | SSH/API restricted to Tailnet |
| **Bootstrap** | Terraform + Ansible |

### Cluster Resources

- 22 vCPU total (6 CP + 16 workers)
- 44 GB RAM total (12 CP + 32 workers)
- 440 GB SSD storage
- 140 TB bandwidth allocation
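
The totals above can be sanity-checked from the per-node specs (CX23: 2 vCPU / 4 GB, CX33: 4 vCPU / 8 GB — inferred here from the stated totals, so verify against current Hetzner plans):

```shell
# Recompute cluster totals: 3 control-plane nodes + 4 workers
awk 'BEGIN {
  printf "vCPU: %d\n", 3*2 + 4*4   # 6 CP + 16 worker vCPU
  printf "RAM: %d GB\n", 3*4 + 4*8 # 12 CP + 32 worker GB
}'
```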

## Prerequisites

### 1. Hetzner Cloud API Token

1. Go to [Hetzner Cloud Console](https://console.hetzner.com/)
2. Select your project (or create a new one)
3. Navigate to **Security** → **API Tokens**
4. Click **Generate API Token**
5. Set description: `k8s-cluster-terraform`
6. Select permissions: **Read & Write**
7. Click **Generate API Token**
8. **Copy the token immediately** - it won't be shown again!
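
Before wiring the token into Terraform, you can optionally sanity-check it against the public Hetzner Cloud API (any authenticated GET works; `/v1/locations` is a cheap one):

```shell
# Verify the token authenticates; prints "token OK" on success.
if [ -z "${HCLOUD_TOKEN:-}" ]; then
  echo "HCLOUD_TOKEN is not set"
else
  curl -sf -H "Authorization: Bearer $HCLOUD_TOKEN" \
    https://api.hetzner.cloud/v1/locations >/dev/null && echo "token OK"
fi
```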

### 2. Backblaze B2 Bucket (for Terraform State)

1. Go to [Backblaze B2](https://secure.backblaze.com/b2_buckets.htm)
2. Click **Create a Bucket**
3. Set bucket name: `k8s-terraform-state` (must be globally unique)
4. Choose **Private** access
5. Click **Create Bucket**
6. Create application key:
   - Go to **App Keys** → **Add a New Application Key**
   - Name: `terraform-state`
   - Allow access to: `k8s-terraform-state` bucket only
   - Type: **Read and Write**
   - Copy **keyID** (access key) and **applicationKey** (secret key)
7. Note your bucket's S3 endpoint (e.g., `https://s3.eu-central-003.backblazeb2.com`)
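
If you have the AWS CLI installed, you can optionally confirm the key pair can list the bucket through B2's S3-compatible endpoint (bucket name and endpoint as in the examples above — substitute your own):

```shell
# List the state bucket using the B2 application key as S3 credentials.
if [ -z "${S3_ACCESS_KEY:-}" ] || [ -z "${S3_SECRET_KEY:-}" ]; then
  echo "S3 credentials are not set"
else
  AWS_ACCESS_KEY_ID="$S3_ACCESS_KEY" AWS_SECRET_ACCESS_KEY="$S3_SECRET_KEY" \
    aws s3 ls "s3://k8s-terraform-state" \
    --endpoint-url "https://s3.eu-central-003.backblazeb2.com"
fi
```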

### 3. SSH Key Pair

```bash
ssh-keygen -t ed25519 -C "k8s@hetzner" -f ~/.ssh/hetzner_k8s
```

### 4. Local Tools

- [Terraform](https://terraform.io/downloads) >= 1.0
- [Ansible](https://docs.ansible.com/ansible/latest/installation_guide/intro_installation.html) >= 2.9
- Python 3 with `jinja2` and `pyyaml`

## Setup

### 1. Clone Repository

```bash
git clone <your-gitea-repo>/HetznerTerra.git
cd HetznerTerra
```

### 2. Configure Variables

```bash
cp terraform.tfvars.example terraform.tfvars
```

Edit `terraform.tfvars`:

```hcl
hcloud_token = "your-hetzner-api-token"

ssh_public_key  = "~/.ssh/hetzner_k8s.pub"
ssh_private_key = "~/.ssh/hetzner_k8s"

s3_access_key = "your-backblaze-key-id"
s3_secret_key = "your-backblaze-application-key"
s3_endpoint   = "https://s3.eu-central-003.backblazeb2.com"
s3_bucket     = "k8s-terraform-state"

tailscale_auth_key = "tskey-auth-..."
tailscale_tailnet  = "yourtailnet.ts.net"

restrict_api_ssh_to_tailnet = true
tailnet_cidr                = "100.64.0.0/10"
enable_nodeport_public      = false

allowed_ssh_ips = []
allowed_api_ips = []
```

### 3. Initialize Terraform

```bash
cd terraform

# Create backend config file (or use CLI args)
cat > backend.hcl << EOF
endpoint                   = "https://s3.eu-central-003.backblazeb2.com"
bucket                     = "k8s-terraform-state"
access_key                 = "your-backblaze-key-id"
secret_key                 = "your-backblaze-application-key"
skip_requesting_account_id = true
EOF

terraform init -backend-config=backend.hcl
```

### 4. Plan and Apply

```bash
terraform plan -var-file=../terraform.tfvars
terraform apply -var-file=../terraform.tfvars
```

### 5. Generate Ansible Inventory

```bash
cd ../ansible
python3 generate_inventory.py
```
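
Before the full bootstrap, a quick connectivity check against the generated inventory can save a failed playbook run. This sketch assumes the generator writes `inventory.ini` in the current directory — adjust the filename to whatever `generate_inventory.py` actually produces:

```shell
# Ping every host in the generated inventory over SSH.
if [ -f inventory.ini ]; then
  ansible all -i inventory.ini -m ping
else
  echo "inventory.ini not found; run generate_inventory.py first"
fi
```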

### 6. Bootstrap Cluster

```bash
ansible-playbook site.yml
```

### 7. Get Kubeconfig

```bash
export KUBECONFIG=$(pwd)/outputs/kubeconfig
kubectl get nodes
```

The kubeconfig endpoint is rewritten to the primary control-plane tailnet hostname (`k8s-cluster-cp-1.<your-tailnet>`).
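
Conceptually, the endpoint rewrite is a simple substitution of the private API address with the tailnet name. The demo below runs against a dummy file with an example tailnet name (the real rewrite is done by the bootstrap tooling, not this command):

```shell
# Simulate the kubeconfig endpoint rewrite on a throwaway file.
printf 'server: https://10.0.0.2:6443\n' > /tmp/kubeconfig-demo
sed -i 's|https://[0-9.]*:6443|https://k8s-cluster-cp-1.example.ts.net:6443|' /tmp/kubeconfig-demo
cat /tmp/kubeconfig-demo
```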

## Gitea CI/CD

This repository includes Gitea workflows for:

- **terraform-plan**: Runs on PRs, shows planned changes
- **terraform-apply**: Runs on main branch after merge
- **ansible-deploy**: Runs after terraform apply
- **dashboards**: Fast workflow that updates Grafana datasources/dashboards only

### Required Gitea Secrets

Set these in your Gitea repository settings (**Settings** → **Secrets** → **Actions**):

| Secret | Description |
|--------|-------------|
| `HCLOUD_TOKEN` | Hetzner Cloud API token |
| `S3_ACCESS_KEY` | Backblaze B2 keyID |
| `S3_SECRET_KEY` | Backblaze B2 applicationKey |
| `S3_ENDPOINT` | Backblaze S3 endpoint (e.g., `https://s3.eu-central-003.backblazeb2.com`) |
| `S3_BUCKET` | S3 bucket name (e.g., `k8s-terraform-state`) |
| `TAILSCALE_AUTH_KEY` | Tailscale auth key for node bootstrap |
| `TAILSCALE_TAILNET` | Tailnet domain (e.g., `yourtailnet.ts.net`) |
| `TAILSCALE_OAUTH_CLIENT_ID` | Tailscale OAuth client ID for the Kubernetes Operator |
| `TAILSCALE_OAUTH_CLIENT_SECRET` | Tailscale OAuth client secret for the Kubernetes Operator |
| `DOPPLER_HETZNERTERRA_SERVICE_TOKEN` | Doppler service token for `hetznerterra` runtime secrets |
| `GRAFANA_ADMIN_PASSWORD` | Optional admin password for Grafana (auto-generated if unset) |
| `RUNNER_ALLOWED_CIDRS` | Optional CIDR list for CI runner access if you choose to pass it via tfvars/secrets |
| `SSH_PUBLIC_KEY` | SSH public key content |
| `SSH_PRIVATE_KEY` | SSH private key content |

## GitOps (Flux)

This repository includes a Flux GitOps layout for a phased migration from imperative Ansible applies to continuous reconciliation.

### Runtime secrets

Runtime cluster secrets are moving to Doppler + External Secrets Operator.

- Doppler project: `hetznerterra`
- Initial auth: service token via `DOPPLER_HETZNERTERRA_SERVICE_TOKEN`
- First synced secrets:
  - `GRAFANA_ADMIN_PASSWORD`
  - `WEAVE_GITOPS_ADMIN_USERNAME`
  - `WEAVE_GITOPS_ADMIN_PASSWORD_BCRYPT_HASH`

Terraform/bootstrap secrets remain in Gitea Actions secrets and are not managed by Doppler.
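
With External Secrets Operator's Doppler provider, the sync is declared roughly as below. This is a hedged sketch, not the repo's actual manifests — resource names, namespace, and the token Secret name are illustrative:

```yaml
apiVersion: external-secrets.io/v1beta1
kind: SecretStore
metadata:
  name: doppler            # illustrative name
  namespace: observability # illustrative namespace
spec:
  provider:
    doppler:
      auth:
        secretRef:
          dopplerToken:
            name: doppler-token   # Secret holding the service token
            key: dopplerToken
---
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: grafana-admin
  namespace: observability
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: SecretStore
    name: doppler
  target:
    name: grafana-admin           # resulting Kubernetes Secret
  data:
    - secretKey: admin-password
      remoteRef:
        key: GRAFANA_ADMIN_PASSWORD   # Doppler secret name
```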

### Repository layout

- `clusters/prod/`: cluster entrypoint and Flux reconciliation objects
- `clusters/prod/flux-system/`: `GitRepository` source and top-level `Kustomization` graph
- `infrastructure/`: infrastructure addon reconciliation graph
- `infrastructure/addons/*`: per-addon manifests for Flux-managed cluster addons
- `apps/`: application workload layer (currently scaffolded)

### Reconciliation graph

- `infrastructure` (top-level)
  - `addon-ccm`
  - `addon-csi` depends on `addon-ccm`
  - `addon-tailscale-operator`
  - `addon-observability`
  - `addon-observability-content` depends on `addon-observability`
- `apps` depends on `infrastructure`
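
In Flux, edges like "`addon-csi` depends on `addon-ccm`" are expressed with `dependsOn` on a `Kustomization`. A minimal sketch (the `path` and interval are assumptions, not copied from this repo):

```yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: addon-csi
  namespace: flux-system
spec:
  interval: 10m
  path: ./infrastructure/addons/csi   # assumed path
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system
  dependsOn:
    - name: addon-ccm   # CSI waits until the CCM Kustomization is ready
```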

### Bootstrap notes

1. Install Flux controllers in `flux-system`.
2. Create the Flux deploy key/secret named `flux-system` in the `flux-system` namespace.
3. Apply `clusters/prod/flux-system/` once to establish the source + reconciliation graph.
4. Bootstrap-only Ansible creates prerequisite secrets; Flux manages the addon lifecycle after bootstrap.

### Current migration status

- `addon-observability-content` is GitOps-managed from `infrastructure/addons/observability-content/`.
- `addon-observability` is GitOps-managed from `infrastructure/addons/observability/` using Flux `HelmRelease` resources for:
  - `kube-prometheus-stack`
  - `loki`
  - `promtail`
- Remaining addons stay suspended until migrated.
- During the transition, avoid applying Grafana content from both Flux and Ansible at the same time.

Ansible `site.yml` skips the `observability` and `observability-content` roles by default when `observability_gitops_enabled=true` (the default).

## Observability Stack

Flux deploys a lightweight observability stack in the `observability` namespace:

- `kube-prometheus-stack` (Prometheus + Grafana)
- `loki`
- `promtail`

Grafana content is managed as code via ConfigMaps in `infrastructure/addons/observability-content/` (Flux), migrated from `ansible/roles/observability-content/`.

Grafana and Prometheus are exposed through a single Tailscale front door backed by Traefik when the Tailscale Kubernetes Operator is healthy.

### Access Grafana and Prometheus

Preferred private access:

- Grafana: `http://k8s-cluster-cp-1.<your-tailnet>:30080/`
- Prometheus: `http://k8s-cluster-cp-1.<your-tailnet>:30990/`
- Flux UI: `http://k8s-cluster-cp-1.<your-tailnet>:30901/`

This access path is bootstrapped automatically by Ansible on `control_plane[0]` using persistent `kubectl port-forward` systemd services plus `tailscale serve`, so it survives cluster rebuilds.
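
For orientation, one of those persistent port-forward units might look roughly like this. This is a hypothetical sketch — the unit name, binary path, and port are assumptions; the real units are templated by Ansible:

```ini
# /etc/systemd/system/grafana-port-forward.service (hypothetical)
[Unit]
Description=kubectl port-forward for Grafana
After=network-online.target k3s.service

[Service]
# k3s ships its kubeconfig at /etc/rancher/k3s/k3s.yaml
ExecStart=/usr/local/bin/kubectl --kubeconfig /etc/rancher/k3s/k3s.yaml \
  -n observability port-forward --address 0.0.0.0 \
  svc/kube-prometheus-stack-grafana 30080:80
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
```

`tailscale serve` (or plain tailnet access to the node) then exposes the forwarded port on the node's tailnet hostname.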

Fallback (port-forward from any tailnet-connected machine):

```bash
export KUBECONFIG=$(pwd)/outputs/kubeconfig

kubectl -n observability port-forward svc/kube-prometheus-stack-grafana 3000:80
kubectl -n observability port-forward svc/kube-prometheus-stack-prometheus 9090:9090
```

Then open:

- Grafana: http://127.0.0.1:3000
- Prometheus: http://127.0.0.1:9090

Grafana user: `admin`
Grafana password: value of the `GRAFANA_ADMIN_PASSWORD` secret (or the generated value shown in the Ansible output)

### Verify Tailscale exposure

```bash
export KUBECONFIG=$(pwd)/outputs/kubeconfig

kubectl -n tailscale-system get pods
kubectl -n observability get svc kube-prometheus-stack-grafana kube-prometheus-stack-prometheus
kubectl -n observability describe svc kube-prometheus-stack-grafana | grep TailscaleProxyReady
kubectl -n observability describe svc kube-prometheus-stack-prometheus | grep TailscaleProxyReady
```

If `TailscaleProxyReady=False`, check the operator logs:

```bash
kubectl -n tailscale-system logs deployment/operator --tail=100
```

Common cause: the OAuth client is missing tag/scope permissions.

### Fast dashboard iteration workflow

Use the `Deploy Grafana Content` workflow when changing dashboard/datasource templates. It avoids full cluster provisioning and only applies Grafana content resources:

- `ansible/roles/observability-content/templates/grafana-datasources.yaml.j2`
- `ansible/roles/observability-content/templates/grafana-dashboard-k8s-overview.yaml.j2`
- `ansible/dashboards.yml`

## File Structure

```
.
├── terraform/
│   ├── main.tf
│   ├── variables.tf
│   ├── network.tf
│   ├── firewall.tf
│   ├── ssh.tf
│   ├── servers.tf
│   ├── outputs.tf
│   └── backend.tf
├── ansible/
│   ├── inventory.tmpl
│   ├── generate_inventory.py
│   ├── site.yml
│   ├── roles/
│   │   ├── common/
│   │   ├── k3s-server/
│   │   ├── k3s-agent/
│   │   ├── addon-secrets-bootstrap/
│   │   ├── observability-content/
│   │   └── observability/
│   └── ansible.cfg
├── .gitea/
│   └── workflows/
│       ├── terraform.yml
│       ├── ansible.yml
│       └── dashboards.yml
├── outputs/
├── terraform.tfvars.example
└── README.md
```

## Firewall Rules

| Port | Source | Purpose |
|------|--------|---------|
| 22 | Tailnet CIDR | SSH |
| 6443 | Tailnet CIDR + internal | Kubernetes API |
| 41641/udp | Any | Tailscale WireGuard |
| 9345 | 10.0.0.0/16 | k3s supervisor (HA join) |
| 2379 | 10.0.0.0/16 | etcd client |
| 2380 | 10.0.0.0/16 | etcd peer |
| 8472 | 10.0.0.0/16 | Flannel VXLAN |
| 10250 | 10.0.0.0/16 | Kubelet |
| 30000-32767 | Optional | NodePorts (disabled by default) |
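
In `firewall.tf`, rows of this table translate into `rule` blocks on an `hcloud_firewall` resource. A hedged sketch showing the first and third rows (resource name and variable wiring are assumptions, not copied from the repo):

```hcl
resource "hcloud_firewall" "cluster" {
  name = "k8s-cluster"

  # SSH, restricted to the tailnet CIDR
  rule {
    direction  = "in"
    protocol   = "tcp"
    port       = "22"
    source_ips = [var.tailnet_cidr] # e.g. 100.64.0.0/10
  }

  # Tailscale WireGuard, open to any source
  rule {
    direction  = "in"
    protocol   = "udp"
    port       = "41641"
    source_ips = ["0.0.0.0/0", "::/0"]
  }
}
```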

## Operations

### Scale Workers

Edit `terraform.tfvars`:

```hcl
worker_count = 5
```

Then:

```bash
terraform apply -var-file=../terraform.tfvars
ansible-playbook site.yml
```

### Upgrade k3s

```bash
ansible-playbook site.yml -t upgrade
```

### Destroy Cluster

```bash
terraform destroy -var-file=../terraform.tfvars
```

## Troubleshooting

### Check k3s Logs

```bash
ssh root@<control-plane-ip> journalctl -u k3s -f
```

### Reset k3s

```bash
ansible-playbook site.yml -t reset
```

## Costs Breakdown

| Resource | Quantity | Unit Price | Monthly |
|----------|----------|------------|---------|
| CX23 (Control Plane) | 3 | €2.99 | €8.97 |
| CX33 (Workers) | 4 | €4.99 | €19.96 |
| Backblaze B2 | ~1 GB | Free (first 10 GB) | €0.00 |
| **Total** | | | **€28.93/mo** |
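
The total follows directly from the line items:

```shell
# 3x CX23 at EUR 2.99 plus 4x CX33 at EUR 4.99
awk 'BEGIN { printf "EUR %.2f/mo\n", 3*2.99 + 4*4.99 }'
# -> EUR 28.93/mo
```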

## Security Notes

- Control plane is HA (3 nodes; the etcd quorum survives 1 node failure)
- Consider adding a Hetzner load balancer in front of the API server
- Rotate API tokens regularly
- Use network policies in Kubernetes
- Enable audit logging for production

## License

MIT