# Hetzner Kubernetes Cluster

Production-ready Kubernetes cluster on Hetzner Cloud using Terraform and Ansible.

## Architecture

| Component | Details |
|-----------|---------|
| **Control Plane** | 3x CX23 (HA) |
| **Workers** | 4x CX33 |
| **Total Cost** | €28.93/mo |
| **K8s** | k3s (latest, HA) |
| **Addons** | Hetzner CCM + CSI + Prometheus + Grafana + Loki |
| **Access** | SSH/API and Rancher UI restricted to the tailnet |
| **Bootstrap** | Terraform + Ansible |

### Cluster Resources

- 22 vCPU total (6 control plane + 16 workers)
- 44 GB RAM total (12 GB control plane + 32 GB workers)
- 440 GB SSD storage
- 140 TB bandwidth allocation
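The aggregate figures can be derived from the node counts. A quick sketch, with per-node specs assumed from Hetzner's CX23/CX33 listings (2 vCPU / 4 GB / 40 GB and 4 vCPU / 8 GB / 80 GB respectively):

```shell
# Recompute cluster totals from node counts and assumed per-node specs.
cp=3; workers=4
vcpu=$((cp * 2 + workers * 4))    # control plane + worker vCPUs
ram=$((cp * 4 + workers * 8))     # GB
ssd=$((cp * 40 + workers * 80))   # GB
echo "vCPU=$vcpu RAM=${ram}GB SSD=${ssd}GB"   # → vCPU=22 RAM=44GB SSD=440GB
```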
## Prerequisites

### 1. Hetzner Cloud API Token

1. Go to [Hetzner Cloud Console](https://console.hetzner.com/)
2. Select your project (or create a new one)
3. Navigate to **Security** → **API Tokens**
4. Click **Generate API Token**
5. Set the description: `k8s-cluster-terraform`
6. Select permissions: **Read & Write**
7. Click **Generate API Token**
8. **Copy the token immediately** - it won't be shown again!

### 2. Backblaze B2 Bucket (for Terraform State)

1. Go to [Backblaze B2](https://secure.backblaze.com/b2_buckets.htm)
2. Click **Create a Bucket**
3. Set the bucket name: `k8s-terraform-state` (must be globally unique)
4. Choose **Private** access
5. Click **Create Bucket**
6. Create an application key:
   - Go to **App Keys** → **Add a New Application Key**
   - Name: `terraform-state`
   - Allow access to: the `k8s-terraform-state` bucket only
   - Type: **Read and Write**
   - Copy the **keyID** (access key) and **applicationKey** (secret key)
7. Note your bucket's S3 endpoint (e.g., `https://s3.eu-central-003.backblazeb2.com`)

### 3. SSH Key Pair

```bash
ssh-keygen -t ed25519 -C "k8s@hetzner" -f ~/.ssh/hetzner_k8s
```

### 4. Local Tools

- [Terraform](https://terraform.io/downloads) >= 1.0
- [Ansible](https://docs.ansible.com/ansible/latest/installation_guide/intro_installation.html) >= 2.9
- Python 3 with `jinja2` and `pyyaml`
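Before starting, it can help to confirm these tools are actually on `PATH`. A minimal preflight sketch (tool names taken from the list above):

```shell
# check_tool prints "ok <name>" or "missing <name>" based on a PATH lookup.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "ok $1"
  else
    echo "missing $1"
  fi
}

# Run against the tools this repo needs:
for t in terraform ansible-playbook python3; do
  check_tool "$t"
done
```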
## Setup

### 1. Clone Repository

```bash
git clone <your-gitea-repo>/HetznerTerra.git
cd HetznerTerra
```

### 2. Configure Variables

```bash
cp terraform.tfvars.example terraform.tfvars
```

Edit `terraform.tfvars`:

```hcl
hcloud_token = "your-hetzner-api-token"

ssh_public_key  = "~/.ssh/hetzner_k8s.pub"
ssh_private_key = "~/.ssh/hetzner_k8s"

s3_access_key = "your-backblaze-key-id"
s3_secret_key = "your-backblaze-application-key"
s3_endpoint   = "https://s3.eu-central-003.backblazeb2.com"
s3_bucket     = "k8s-terraform-state"

tailscale_auth_key = "tskey-auth-..."
tailscale_tailnet  = "yourtailnet.ts.net"

restrict_api_ssh_to_tailnet = true
tailnet_cidr                = "100.64.0.0/10"
enable_nodeport_public      = false

allowed_ssh_ips = []
allowed_api_ips = []
```

### 3. Initialize Terraform

```bash
cd terraform

# Create a backend config file (or pass these as CLI arguments)
cat > backend.hcl << EOF
endpoint                   = "https://s3.eu-central-003.backblazeb2.com"
bucket                     = "k8s-terraform-state"
access_key                 = "your-backblaze-key-id"
secret_key                 = "your-backblaze-application-key"
skip_requesting_account_id = true
EOF

terraform init -backend-config=backend.hcl
```

### 4. Plan and Apply

```bash
terraform plan -var-file=../terraform.tfvars
terraform apply -var-file=../terraform.tfvars
```

### 5. Generate Ansible Inventory

```bash
cd ../ansible
python3 generate_inventory.py
```

### 6. Bootstrap Cluster

```bash
ansible-playbook site.yml
```

### 7. Get Kubeconfig

```bash
export KUBECONFIG=$(pwd)/outputs/kubeconfig
kubectl get nodes
```

The kubeconfig endpoint is rewritten to the primary control plane's tailnet hostname (`k8s-cluster-cp-1.<your-tailnet>`).
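The rewrite itself amounts to a hostname substitution in the kubeconfig's `server:` field. An illustrative sketch with sample data (the real hostname comes from your tailnet; the bootstrap does this for you):

```shell
# Sample kubeconfig line pointing at a private IP endpoint:
line='    server: https://10.0.1.1:6443'
# Swap the host for the tailnet name, keeping the scheme and port:
echo "$line" | sed 's#https://[^:]*:6443#https://k8s-cluster-cp-1.yourtailnet.ts.net:6443#'
```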
## Gitea CI/CD

This repository includes Gitea workflows for:

- **terraform-plan**: runs on PRs and shows planned changes
- **terraform-apply**: runs on the main branch after merge
- **ansible-deploy**: runs after terraform apply
- **dashboards**: fast workflow that updates Grafana datasources/dashboards only

### Required Gitea Secrets

Set these in your Gitea repository settings (**Settings** → **Secrets** → **Actions**):

| Secret | Description |
|--------|-------------|
| `HCLOUD_TOKEN` | Hetzner Cloud API token |
| `S3_ACCESS_KEY` | Backblaze B2 keyID |
| `S3_SECRET_KEY` | Backblaze B2 applicationKey |
| `S3_ENDPOINT` | Backblaze S3 endpoint (e.g., `https://s3.eu-central-003.backblazeb2.com`) |
| `S3_BUCKET` | S3 bucket name (e.g., `k8s-terraform-state`) |
| `TAILSCALE_AUTH_KEY` | Tailscale auth key for node bootstrap |
| `TAILSCALE_TAILNET` | Tailnet domain (e.g., `yourtailnet.ts.net`) |
| `TAILSCALE_OAUTH_CLIENT_ID` | Tailscale OAuth client ID for the Kubernetes Operator |
| `TAILSCALE_OAUTH_CLIENT_SECRET` | Tailscale OAuth client secret for the Kubernetes Operator |
| `DOPPLER_HETZNERTERRA_SERVICE_TOKEN` | Doppler service token for `hetznerterra` runtime secrets |
| `GRAFANA_ADMIN_PASSWORD` | Optional admin password for Grafana (auto-generated if unset) |
| `RUNNER_ALLOWED_CIDRS` | Optional CIDR list for CI runner access, if passed via tfvars/secrets |
| `SSH_PUBLIC_KEY` | SSH public key content |
| `SSH_PRIVATE_KEY` | SSH private key content |
## GitOps (Flux)

This repo uses Flux for continuous reconciliation after the Terraform + Ansible bootstrap.

### Stable private-only baseline

The current default target is a deliberately simplified baseline:

- 1 control plane node
- 2 worker nodes
- private Hetzner network only
- Tailscale for operator access
- Flux-managed core addons only

Detailed phase gates and success criteria live in `STABLE_BASELINE.md`.

This is the default until rebuilds are consistently green. High availability, public ingress, and app-layer expansion come later.

### Runtime secrets

Runtime cluster secrets are moving to Doppler + External Secrets Operator.

- Doppler project: `hetznerterra`
- Initial auth: service token via `DOPPLER_HETZNERTERRA_SERVICE_TOKEN`
- First synced secrets:
  - `GRAFANA_ADMIN_PASSWORD`

Terraform/bootstrap secrets remain in Gitea Actions secrets and are not managed by Doppler.
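The Doppler-to-cluster sync is driven by `ExternalSecret` objects. A sketch of what the first synced secret might look like (the store name and target `Secret` name are assumptions for illustration, not taken from this repo):

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: grafana-admin
  namespace: observability
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: doppler-hetznerterra   # assumed store name
  target:
    name: grafana-admin          # assumed target Secret name
  data:
    - secretKey: admin-password
      remoteRef:
        key: GRAFANA_ADMIN_PASSWORD
```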
### Repository layout

- `clusters/prod/`: cluster entrypoint and Flux reconciliation objects
- `clusters/prod/flux-system/`: `GitRepository` source and top-level `Kustomization` graph
- `infrastructure/`: infrastructure addon reconciliation graph
- `infrastructure/addons/*`: per-addon manifests for Flux-managed cluster addons
- `apps/`: application workload layer (currently scaffolded)

### Reconciliation graph

- `infrastructure` (top-level)
  - `addon-ccm`
  - `addon-csi` depends on `addon-ccm`
  - `addon-tailscale-operator`
  - `addon-observability`
  - `addon-observability-content` depends on `addon-observability`
- `apps` depends on `infrastructure`
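The edges in this graph are expressed with Flux's `dependsOn`. A hedged sketch of what one edge might look like (field names from the Flux `Kustomization` API; the `path` is an assumption mirroring the layout above):

```yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: addon-csi
  namespace: flux-system
spec:
  interval: 10m
  path: ./infrastructure/addons/csi   # assumed path
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system
  dependsOn:
    - name: addon-ccm   # csi reconciles only after ccm is ready
```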
### Bootstrap notes
|
|
|
|
1. Install Flux controllers in `flux-system`.
|
|
2. Create the Flux deploy key/secret named `flux-system` in `flux-system` namespace.
|
|
3. Apply `clusters/prod/flux-system/` once to establish source + reconciliation graph.
|
|
4. Bootstrap-only Ansible creates prerequisite secrets; Flux manages addon lifecycle after bootstrap.
|
|
|
|
### Current addon status
|
|
|
|
- Core infrastructure addons are Flux-managed from `infrastructure/addons/`.
|
|
- Active Flux addons for stable baseline: `addon-tailscale-operator`, `addon-tailscale-proxyclass`, `addon-external-secrets`.
|
|
- Deferred addons: `addon-ccm`, `addon-csi`, `addon-observability`, `addon-observability-content` (to be added after baseline is stable).
|
|
- Ansible is limited to cluster bootstrap, private-access setup, and prerequisite secret creation for Flux-managed addons.
|
|
- Weave GitOps / Flux UI is no longer deployed; use Rancher or the `flux` CLI for Flux operations.
|
|
|
|
### Rancher access
|
|
|
|
- Rancher is private-only and exposed through Tailscale at `https://rancher.silverside-gopher.ts.net/dashboard/`.
|
|
- The public Hetzner load balancer path is not used for Rancher.
|
|
- Rancher uses the CNPG-backed PostgreSQL cluster in `cnpg-cluster`.
|
|
|
|
### Stable baseline acceptance
|
|
|
|
A rebuild is considered successful only when all of the following pass without manual intervention:
|
|
|
|
- Terraform create succeeds for the default `1` control plane and `2` workers.
|
|
- Ansible bootstrap succeeds end-to-end.
|
|
- All nodes become `Ready`.
|
|
- Flux core reconciliation is healthy.
|
|
- External Secrets Operator is ready.
|
|
- Tailscale operator is ready.
|
|
- Terraform destroy succeeds cleanly or succeeds after workflow retries.
|
|
|
|
_Note: Observability stack (Grafana/Prometheus) is deferred and will be added once the core platform baseline is stable._
|
|
|
|
## Observability Stack

Flux deploys a lightweight observability stack in the `observability` namespace:

- `kube-prometheus-stack` (Prometheus + Grafana)
- `loki`
- `promtail`

Grafana content is managed as code via ConfigMaps in `infrastructure/addons/observability-content/`.

Grafana and Prometheus are exposed through dedicated Tailscale LoadBalancer services when the Tailscale Kubernetes Operator is healthy.

### Access Grafana and Prometheus

Preferred private access:

- Grafana: `http://grafana.silverside-gopher.ts.net/`
- Prometheus: `http://prometheus.silverside-gopher.ts.net:9090/`

Fallback: port-forward from a tailnet-connected machine:

```bash
export KUBECONFIG=$(pwd)/outputs/kubeconfig

kubectl -n observability port-forward svc/kube-prometheus-stack-grafana 3000:80
kubectl -n observability port-forward svc/kube-prometheus-stack-prometheus 9090:9090
```

Then open:

- Grafana: http://127.0.0.1:3000
- Prometheus: http://127.0.0.1:9090

Grafana user: `admin`
Grafana password: the value of the `GRAFANA_ADMIN_PASSWORD` secret (or the generated value shown in the Ansible output)
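The password lives in a Kubernetes `Secret`, and `Secret` values are base64-encoded at rest. Decoding works like this (sample value shown; the real lookup would be a `kubectl get secret` in the `observability` namespace, with the secret name depending on the Helm release):

```shell
# Kubernetes stores Secret data base64-encoded; decode a sample value:
encoded="cGFzc3dvcmQxMjM="
echo "$encoded" | base64 -d
```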
### Verify Tailscale exposure

```bash
export KUBECONFIG=$(pwd)/outputs/kubeconfig

kubectl -n tailscale-system get pods
kubectl -n observability get svc kube-prometheus-stack-grafana kube-prometheus-stack-prometheus
kubectl -n observability describe svc kube-prometheus-stack-grafana | grep TailscaleProxyReady
kubectl -n observability describe svc kube-prometheus-stack-prometheus | grep TailscaleProxyReady
```

If `TailscaleProxyReady=False`, check the operator logs:

```bash
kubectl -n tailscale-system logs deployment/operator --tail=100
```

A common cause is an OAuth client missing tag/scope permissions.

### Fast dashboard iteration workflow

Use the `Deploy Grafana Content` workflow when changing dashboard or datasource templates.
It avoids full cluster provisioning and only applies the Grafana content resources:

- `ansible/roles/observability-content/templates/grafana-datasources.yaml.j2`
- `ansible/roles/observability-content/templates/grafana-dashboard-k8s-overview.yaml.j2`
- `ansible/dashboards.yml`
## File Structure

```
.
├── terraform/
│   ├── main.tf
│   ├── variables.tf
│   ├── network.tf
│   ├── firewall.tf
│   ├── ssh.tf
│   ├── servers.tf
│   ├── outputs.tf
│   └── backend.tf
├── ansible/
│   ├── inventory.tmpl
│   ├── generate_inventory.py
│   ├── site.yml
│   ├── roles/
│   │   ├── common/
│   │   ├── k3s-server/
│   │   ├── k3s-agent/
│   │   ├── addon-secrets-bootstrap/
│   │   ├── observability-content/
│   │   └── observability/
│   └── ansible.cfg
├── .gitea/
│   └── workflows/
│       ├── terraform.yml
│       ├── ansible.yml
│       └── dashboards.yml
├── outputs/
├── terraform.tfvars.example
└── README.md
```
## Firewall Rules

| Port | Source | Purpose |
|------|--------|---------|
| 22 | Tailnet CIDR | SSH |
| 6443 | Tailnet CIDR + internal | Kubernetes API |
| 41641/udp | Any | Tailscale WireGuard |
| 9345 | 10.0.0.0/16 | k3s supervisor (HA join) |
| 2379 | 10.0.0.0/16 | etcd client |
| 2380 | 10.0.0.0/16 | etcd peer |
| 8472 | 10.0.0.0/16 | Flannel VXLAN |
| 10250 | 10.0.0.0/16 | Kubelet |
| 30000-32767 | Optional | NodePorts (disabled by default) |
## Operations

### Scale Workers

Edit `terraform.tfvars`:

```hcl
worker_count = 5
```

Then:

```bash
terraform apply
ansible-playbook site.yml
```

### Upgrade k3s

```bash
ansible-playbook site.yml -t upgrade
```

### Destroy Cluster

```bash
terraform destroy
```
## Troubleshooting

### Check k3s Logs

```bash
ssh root@<control-plane-ip> journalctl -u k3s -f
```

### Reset k3s

```bash
ansible-playbook site.yml -t reset
```
## Costs Breakdown

| Resource | Quantity | Unit Price | Monthly |
|----------|----------|------------|---------|
| CX23 (control plane) | 3 | €2.99 | €8.97 |
| CX33 (workers) | 4 | €4.99 | €19.96 |
| Backblaze B2 | ~1 GB | free (first 10 GB) | €0.00 |
| **Total** | | | **€28.93/mo** |
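The total can be sanity-checked from the per-node prices in the table. A quick sketch using integer cents to avoid floating point:

```shell
# Monthly total in euro cents: 3 control plane nodes + 4 workers.
cp_total=$((3 * 299))
worker_total=$((4 * 499))
total=$((cp_total + worker_total))
echo "EUR $((total / 100)).$((total % 100))/mo"   # → EUR 28.93/mo
```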
## Security Notes

- The control plane is HA (3 nodes; it can survive one node failure)
- Consider adding a Hetzner load balancer in front of the API server
- Rotate API tokens regularly
- Use network policies in Kubernetes
- Enable audit logging for production

## License

MIT