# Hetzner Kubernetes Cluster

Production-ready Kubernetes cluster on Hetzner Cloud using Terraform and Ansible.

## Architecture

| Component | Details |
|-----------|---------|
| **Control Plane** | 3x CX23 (HA) |
| **Workers** | 4x CX33 |
| **Total Cost** | €28.93/mo |
| **K8s** | k3s (latest, HA) |
| **Addons** | Hetzner CCM + CSI + Prometheus + Grafana + Loki |
| **Access** | SSH/API and Rancher UI restricted to Tailnet |
| **Bootstrap** | Terraform + Ansible |

### Cluster Resources

- 22 vCPU total (6 CP + 16 workers)
- 44 GB RAM total (12 CP + 32 workers)
- 440 GB SSD storage
- 140 TB bandwidth allocation

## Prerequisites

### 1. Hetzner Cloud API Token

1. Go to the [Hetzner Cloud Console](https://console.hetzner.cloud/)
2. Select your project (or create a new one)
3. Navigate to **Security** → **API Tokens**
4. Click **Generate API Token**
5. Set the description: `k8s-cluster-terraform`
6. Select permissions: **Read & Write**
7. Click **Generate API Token**
8. **Copy the token immediately** - it won't be shown again!

### 2. Backblaze B2 Bucket (for Terraform State)

1. Go to [Backblaze B2](https://secure.backblaze.com/b2_buckets.htm)
2. Click **Create a Bucket**
3. Set the bucket name: `k8s-terraform-state` (must be globally unique)
4. Choose **Private** access
5. Click **Create Bucket**
6. Create an application key:
   - Go to **App Keys** → **Add a New Application Key**
   - Name: `terraform-state`
   - Allow access to: the `k8s-terraform-state` bucket only
   - Type: **Read and Write**
   - Copy the **keyID** (access key) and **applicationKey** (secret key)
7. Note your bucket's S3 endpoint (e.g., `https://s3.eu-central-003.backblazeb2.com`)

### 3. SSH Key Pair

```bash
ssh-keygen -t ed25519 -C "k8s@hetzner" -f ~/.ssh/hetzner_k8s
```

### 4. Local Tools

- [Terraform](https://terraform.io/downloads) >= 1.0
- [Ansible](https://docs.ansible.com/ansible/latest/installation_guide/intro_installation.html) >= 2.9
- Python 3 with `jinja2` and `pyyaml`

## Setup

### 1. Clone Repository

```bash
git clone /HetznerTerra.git
cd HetznerTerra
```

### 2. Configure Variables

```bash
cp terraform.tfvars.example terraform.tfvars
```

Edit `terraform.tfvars`:

```hcl
hcloud_token    = "your-hetzner-api-token"
ssh_public_key  = "~/.ssh/hetzner_k8s.pub"
ssh_private_key = "~/.ssh/hetzner_k8s"

s3_access_key = "your-backblaze-key-id"
s3_secret_key = "your-backblaze-application-key"
s3_endpoint   = "https://s3.eu-central-003.backblazeb2.com"
s3_bucket     = "k8s-terraform-state"

tailscale_auth_key = "tskey-auth-..."
tailscale_tailnet  = "yourtailnet.ts.net"

restrict_api_ssh_to_tailnet = true
tailnet_cidr                = "100.64.0.0/10"
enable_nodeport_public      = false
allowed_ssh_ips             = []
allowed_api_ips             = []
```

### 3. Initialize Terraform

```bash
cd terraform

# Create a backend config file (or pass these settings as CLI args)
cat > backend.hcl << EOF
endpoint                   = "https://s3.eu-central-003.backblazeb2.com"
bucket                     = "k8s-terraform-state"
access_key                 = "your-backblaze-key-id"
secret_key                 = "your-backblaze-application-key"
skip_requesting_account_id = true
EOF

terraform init -backend-config=backend.hcl
```

### 4. Plan and Apply

```bash
terraform plan -var-file=../terraform.tfvars
terraform apply -var-file=../terraform.tfvars
```

### 5. Generate Ansible Inventory

```bash
cd ../ansible
python3 generate_inventory.py
```

### 6. Bootstrap Cluster

```bash
ansible-playbook site.yml
```

### 7. Get Kubeconfig

```bash
export KUBECONFIG=$(pwd)/outputs/kubeconfig
kubectl get nodes
```

The kubeconfig endpoint is rewritten to the primary control plane's tailnet hostname (`k8s-cluster-cp-1.`).
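For reference, here is a minimal sketch of the core logic behind `generate_inventory.py`. It assumes Terraform exposes `control_plane_ips` and `worker_ips` list outputs and that hosts are named `k8s-cluster-cp-N` / `k8s-cluster-worker-N`; check `outputs.tf` for the actual output names. The real script renders `inventory.tmpl` with Jinja2 rather than building strings by hand.

```python
# Sketch only. Assumed (not confirmed by this repo): output names
# `control_plane_ips` / `worker_ips`, and the host naming scheme below.
def build_inventory(outputs: dict) -> str:
    """Render a minimal INI-style Ansible inventory from Terraform output JSON."""
    cp_ips = outputs["control_plane_ips"]["value"]
    worker_ips = outputs["worker_ips"]["value"]
    lines = ["[control_plane]"]
    lines += [f"k8s-cluster-cp-{i} ansible_host={ip}" for i, ip in enumerate(cp_ips, 1)]
    lines += ["", "[workers]"]
    lines += [f"k8s-cluster-worker-{i} ansible_host={ip}" for i, ip in enumerate(worker_ips, 1)]
    return "\n".join(lines) + "\n"

# Demo with sample data; the real script would parse `terraform output -json`.
sample = {
    "control_plane_ips": {"value": ["10.0.1.1"]},
    "worker_ips": {"value": ["10.0.2.1", "10.0.2.2"]},
}
print(build_inventory(sample))
```

The generated file is what `ansible-playbook site.yml` consumes in the next step.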
## Gitea CI/CD

This repository includes Gitea workflows for:

- **terraform-plan**: runs on PRs and shows planned changes
- **terraform-apply**: runs on the main branch after merge
- **ansible-deploy**: runs after terraform apply
- **dashboards**: fast workflow that updates Grafana datasources/dashboards only

### Required Gitea Secrets

Set these in your Gitea repository settings (**Settings** → **Secrets** → **Actions**):

| Secret | Description |
|--------|-------------|
| `HCLOUD_TOKEN` | Hetzner Cloud API token |
| `S3_ACCESS_KEY` | Backblaze B2 keyID |
| `S3_SECRET_KEY` | Backblaze B2 applicationKey |
| `S3_ENDPOINT` | Backblaze S3 endpoint (e.g., `https://s3.eu-central-003.backblazeb2.com`) |
| `S3_BUCKET` | S3 bucket name (e.g., `k8s-terraform-state`) |
| `TAILSCALE_AUTH_KEY` | Tailscale auth key for node bootstrap |
| `TAILSCALE_TAILNET` | Tailnet domain (e.g., `yourtailnet.ts.net`) |
| `TAILSCALE_OAUTH_CLIENT_ID` | Tailscale OAuth client ID for the Kubernetes Operator |
| `TAILSCALE_OAUTH_CLIENT_SECRET` | Tailscale OAuth client secret for the Kubernetes Operator |
| `DOPPLER_HETZNERTERRA_SERVICE_TOKEN` | Doppler service token for `hetznerterra` runtime secrets |
| `GRAFANA_ADMIN_PASSWORD` | Optional admin password for Grafana (auto-generated if unset) |
| `RUNNER_ALLOWED_CIDRS` | Optional CIDR list for CI runner access if you choose to pass it via tfvars/secrets |
| `SSH_PUBLIC_KEY` | SSH public key content |
| `SSH_PRIVATE_KEY` | SSH private key content |

## GitOps (Flux)

This repo uses Flux for continuous reconciliation after the Terraform + Ansible bootstrap.

### Stable private-only baseline

The current default target is a deliberately simplified baseline:

- 1 control plane node
- 2 worker nodes
- private Hetzner network only
- Tailscale for operator access
- Flux-managed core addons only

Detailed phase gates and success criteria live in `STABLE_BASELINE.md`. This is the default until rebuilds are consistently green.
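In `terraform.tfvars`, the baseline sizing would look something like the sketch below. `worker_count` is the variable this README uses elsewhere; `control_plane_count` is an assumed name, so check `variables.tf` for the real one.

```hcl
# Stable-baseline node counts (see the Scale Workers section for changing these).
# `control_plane_count` is an illustrative name, not confirmed by variables.tf.
control_plane_count = 1
worker_count        = 2
```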
High availability, public ingress, and app-layer expansion come later.

### Runtime secrets

Runtime cluster secrets are moving to Doppler + External Secrets Operator.

- Doppler project: `hetznerterra`
- Initial auth: service token via `DOPPLER_HETZNERTERRA_SERVICE_TOKEN`
- First synced secrets:
  - `GRAFANA_ADMIN_PASSWORD`
  - `WEAVE_GITOPS_ADMIN_USERNAME`
  - `WEAVE_GITOPS_ADMIN_PASSWORD_BCRYPT_HASH`

Terraform/bootstrap secrets remain in Gitea Actions secrets and are not managed by Doppler.

### Repository layout

- `clusters/prod/`: cluster entrypoint and Flux reconciliation objects
- `clusters/prod/flux-system/`: `GitRepository` source and top-level `Kustomization` graph
- `infrastructure/`: infrastructure addon reconciliation graph
- `infrastructure/addons/*`: per-addon manifests for Flux-managed cluster addons
- `apps/`: application workload layer (currently scaffolded)

### Reconciliation graph

- `infrastructure` (top-level)
  - `addon-ccm`
  - `addon-csi` depends on `addon-ccm`
  - `addon-tailscale-operator`
  - `addon-observability`
  - `addon-observability-content` depends on `addon-observability`
- `apps` depends on `infrastructure`

### Bootstrap notes

1. Install the Flux controllers in `flux-system`.
2. Create the Flux deploy key/secret named `flux-system` in the `flux-system` namespace.
3. Apply `clusters/prod/flux-system/` once to establish the source + reconciliation graph.
4. Bootstrap-only Ansible creates prerequisite secrets; Flux manages the addon lifecycle after bootstrap.

### Current addon status

- Core infrastructure addons are Flux-managed from `infrastructure/addons/`.
- Active Flux addons for the stable baseline: `addon-tailscale-operator`, `addon-tailscale-proxyclass`, `addon-external-secrets`.
- Deferred addons: `addon-ccm`, `addon-csi`, `addon-observability`, `addon-observability-content` (to be added after the baseline is stable).
- Ansible is limited to cluster bootstrap, private-access setup, and prerequisite secret creation for Flux-managed addons.
- `addon-flux-ui` is optional for the stable-baseline phase and is not a blocker for rebuild success.

### Rancher access

- Rancher is private-only and exposed through Tailscale at `https://rancher-1.silverside-gopher.ts.net/dashboard/`.
- The public Hetzner load balancer path is not used for Rancher.
- Rancher uses the CNPG-backed PostgreSQL cluster in `cnpg-cluster`.

### Stable baseline acceptance

A rebuild is considered successful only when all of the following pass without manual intervention:

- Terraform create succeeds for the default 1 control plane and 2 workers.
- Ansible bootstrap succeeds end-to-end.
- All nodes become `Ready`.
- Flux core reconciliation is healthy.
- External Secrets Operator is ready.
- The Tailscale operator is ready.
- Terraform destroy succeeds cleanly, or succeeds after workflow retries.

_Note: The observability stack (Grafana/Prometheus) is deferred and will be added once the core platform baseline is stable._

## Observability Stack

Flux deploys a lightweight observability stack in the `observability` namespace:

- `kube-prometheus-stack` (Prometheus + Grafana)
- `loki`
- `promtail`

Grafana content is managed as code via ConfigMaps in `infrastructure/addons/observability-content/`. Grafana and Prometheus are exposed through a single Tailscale front door backed by Traefik when the Tailscale Kubernetes Operator is healthy.

### Access Grafana and Prometheus

Preferred private access:

- Grafana: `http://k8s-cluster-cp-1.:30080/`
- Prometheus: `http://k8s-cluster-cp-1.:30990/`
- Flux UI: `http://k8s-cluster-cp-1.:30901/`

This access path is bootstrapped automatically by Ansible on `control_plane[0]` using persistent `kubectl port-forward` systemd services plus `tailscale serve`, so it survives cluster rebuilds.
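As an illustration of that mechanism, one of the persistent port-forward units might look like the sketch below. The unit name, file paths, and port mapping are assumptions; the Ansible role defines the real ones.

```ini
# Hypothetical unit: /etc/systemd/system/grafana-port-forward.service
[Unit]
Description=Persistent kubectl port-forward for Grafana
After=k3s.service

[Service]
Environment=KUBECONFIG=/etc/rancher/k3s/k3s.yaml
ExecStart=/usr/local/bin/kubectl -n observability port-forward \
  --address 0.0.0.0 svc/kube-prometheus-stack-grafana 30080:80
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
```

`tailscale serve` then fronts the forwarded local ports on the node's tailnet hostname, which is why the URLs above use `k8s-cluster-cp-1.` directly.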
Fallback: port-forward from a tailnet-connected machine:

```bash
export KUBECONFIG=$(pwd)/outputs/kubeconfig
kubectl -n observability port-forward svc/kube-prometheus-stack-grafana 3000:80
kubectl -n observability port-forward svc/kube-prometheus-stack-prometheus 9090:9090
```

Then open:

- Grafana: http://127.0.0.1:3000
- Prometheus: http://127.0.0.1:9090

Grafana user: `admin`
Grafana password: the value of the `GRAFANA_ADMIN_PASSWORD` secret (or the generated value shown in the Ansible output)

### Verify Tailscale exposure

```bash
export KUBECONFIG=$(pwd)/outputs/kubeconfig
kubectl -n tailscale-system get pods
kubectl -n observability get svc kube-prometheus-stack-grafana kube-prometheus-stack-prometheus
kubectl -n observability describe svc kube-prometheus-stack-grafana | grep TailscaleProxyReady
kubectl -n observability describe svc kube-prometheus-stack-prometheus | grep TailscaleProxyReady
```

If `TailscaleProxyReady=False`, check:

```bash
kubectl -n tailscale-system logs deployment/operator --tail=100
```

A common cause is an OAuth client missing tag/scope permissions.

### Fast dashboard iteration workflow

Use the `Deploy Grafana Content` workflow when changing dashboard/datasource templates. It avoids full cluster provisioning and only applies Grafana content resources:

- `ansible/roles/observability-content/templates/grafana-datasources.yaml.j2`
- `ansible/roles/observability-content/templates/grafana-dashboard-k8s-overview.yaml.j2`
- `ansible/dashboards.yml`

## File Structure

```
.
├── terraform/
│   ├── main.tf
│   ├── variables.tf
│   ├── network.tf
│   ├── firewall.tf
│   ├── ssh.tf
│   ├── servers.tf
│   ├── outputs.tf
│   └── backend.tf
├── ansible/
│   ├── inventory.tmpl
│   ├── generate_inventory.py
│   ├── site.yml
│   ├── roles/
│   │   ├── common/
│   │   ├── k3s-server/
│   │   ├── k3s-agent/
│   │   ├── addon-secrets-bootstrap/
│   │   ├── observability-content/
│   │   └── observability/
│   └── ansible.cfg
├── .gitea/
│   └── workflows/
│       ├── terraform.yml
│       ├── ansible.yml
│       └── dashboards.yml
├── outputs/
├── terraform.tfvars.example
└── README.md
```

## Firewall Rules

| Port | Source | Purpose |
|------|--------|---------|
| 22 | Tailnet CIDR | SSH |
| 6443 | Tailnet CIDR + internal | Kubernetes API |
| 41641/udp | Any | Tailscale WireGuard |
| 9345 | 10.0.0.0/16 | k3s supervisor (HA join) |
| 2379 | 10.0.0.0/16 | etcd client |
| 2380 | 10.0.0.0/16 | etcd peer |
| 8472 | 10.0.0.0/16 | Flannel VXLAN |
| 10250 | 10.0.0.0/16 | Kubelet |
| 30000-32767 | Optional | NodePorts (disabled by default) |

## Operations

### Scale Workers

Edit `terraform.tfvars`:

```hcl
worker_count = 5
```

Then:

```bash
terraform apply
ansible-playbook site.yml
```

### Upgrade k3s

```bash
ansible-playbook site.yml -t upgrade
```

### Destroy Cluster

```bash
terraform destroy
```

## Troubleshooting

### Check k3s Logs

```bash
ssh root@
journalctl -u k3s -f
```

### Reset k3s

```bash
ansible-playbook site.yml -t reset
```

## Cost Breakdown

| Resource | Quantity | Unit Price | Monthly |
|----------|----------|------------|---------|
| CX23 (control plane) | 3 | €2.99 | €8.97 |
| CX33 (workers) | 4 | €4.99 | €19.96 |
| Backblaze B2 | ~1 GB | Free (first 10 GB) | €0.00 |
| **Total** | | | **€28.93/mo** |

## Security Notes

- The control plane is HA (3 nodes; it can survive 1 failure)
- Consider adding a Hetzner load balancer for the API server
- Rotate API tokens regularly
- Use network policies in Kubernetes
- Enable audit logging for production

## License

MIT