feat: migrate cluster baseline from Hetzner to Proxmox
Deploy Cluster / Terraform (push) Failing after 52s
Deploy Cluster / Ansible (push) Has been skipped
Deploy Grafana Content / Grafana Content (push) Failing after 1m37s

Replace Hetzner infrastructure and cloud-provider assumptions with Proxmox
VM clones, kube-vip API HA, and NFS-backed storage. Update bootstrap,
Flux addons, CI workflows, and docs to target the new private Proxmox
baseline while preserving the existing Tailscale, Doppler, Flux, Rancher,
and B2 backup flows.
2026-04-22 03:02:13 +00:00
parent 6c6b9d20ca
commit b1dae28aa5
40 changed files with 577 additions and 784 deletions
+41 -56
@@ -1,30 +1,28 @@
# Hetzner Kubernetes Cluster
# Proxmox Kubernetes Cluster
Production-ready Kubernetes cluster on Hetzner Cloud using Terraform and Ansible.
Production-ready private Kubernetes cluster on Proxmox using Terraform, Ansible, and Flux.
## Architecture
| Component | Details |
|-----------|---------|
| **Control Plane** | 3x CX23 (HA) |
| **Workers** | 3x CX33 |
| **Control Plane** | 3x Proxmox VMs (2 vCPU / 4 GiB / 32 GiB) |
| **Workers** | 5x Proxmox VMs (4 vCPU / 8 GiB / 64 GiB) |
| **K8s** | k3s (latest, HA) |
| **Addons** | Hetzner CCM + CSI + Prometheus + Grafana + Loki |
| **Addons** | NFS provisioner + Prometheus + Grafana + Loki + Rancher |
| **Access** | SSH/API and private services restricted to Tailnet |
| **Bootstrap** | Terraform + Ansible + Flux |
## Prerequisites
### 1. Hetzner Cloud API Token
### 1. Proxmox API Token
1. Go to [Hetzner Cloud Console](https://console.hetzner.com/)
2. Select your project (or create a new one)
3. Navigate to **Security** → **API Tokens**
4. Click **Generate API Token**
5. Set description: `k8s-cluster-terraform`
6. Select permissions: **Read & Write**
7. Click **Generate API Token**
8. **Copy the token immediately** - it won't be shown again!
Create an API token for the Proxmox VE user used by Terraform. The repo expects the `bpg/proxmox` provider with:
- endpoint: `https://100.105.0.115:8006/`
- node: `flex`
- clone source: template `9000` (`ubuntu-2404-k8s-template`)
- auth: API token
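A minimal provider block matching those expectations might look like this (a sketch against the `bpg/proxmox` provider; the variable names mirror `terraform.tfvars` below, and `insecure = true` is only an assumption for a self-signed Proxmox certificate):

```hcl
terraform {
  required_providers {
    proxmox = {
      source  = "bpg/proxmox"
      version = ">= 0.60"
    }
  }
}

provider "proxmox" {
  endpoint  = var.proxmox_endpoint # https://100.105.0.115:8006/
  # bpg/proxmox expects the token as "user@realm!tokenname=secret"
  api_token = "${var.proxmox_api_token_id}=${var.proxmox_api_token_secret}"
  insecure  = true # assumption: self-signed PVE certificate

  ssh {
    agent    = true
    username = "root"
  }
}
```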
### 2. Backblaze B2 Bucket (for Terraform State)
@@ -44,7 +42,7 @@ Production-ready Kubernetes cluster on Hetzner Cloud using Terraform and Ansible
### 3. SSH Key Pair
```bash
ssh-keygen -t ed25519 -C "k8s@hetzner" -f ~/.ssh/hetzner_k8s
ssh-keygen -t ed25519 -C "k8s@proxmox" -f ~/.ssh/infra
```
### 4. Local Tools
@@ -71,10 +69,12 @@ cp terraform.tfvars.example terraform.tfvars
Edit `terraform.tfvars`:
```hcl
hcloud_token = "your-hetzner-api-token"
proxmox_endpoint = "https://100.105.0.115:8006/"
proxmox_api_token_id = "terraform-prov@pve!k8s-cluster"
proxmox_api_token_secret = "your-proxmox-token-secret"
ssh_public_key = "~/.ssh/hetzner_k8s.pub"
ssh_private_key = "~/.ssh/hetzner_k8s"
ssh_public_key = "~/.ssh/infra.pub"
ssh_private_key = "~/.ssh/infra"
s3_access_key = "your-backblaze-key-id"
s3_secret_key = "your-backblaze-application-key"
@@ -84,12 +84,7 @@ s3_bucket = "k8s-terraform-state"
tailscale_auth_key = "tskey-auth-..."
tailscale_tailnet = "yourtailnet.ts.net"
restrict_api_ssh_to_tailnet = true
tailnet_cidr = "100.64.0.0/10"
enable_nodeport_public = false
allowed_ssh_ips = []
allowed_api_ips = []
kube_api_vip = "10.27.27.40"
```
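The Proxmox endpoint above (`100.105.0.115`) sits inside Tailscale's CGNAT range `100.64.0.0/10`. A small self-contained bash sketch (hypothetical helper, not part of the repo) for sanity-checking whether an address falls in a CIDR:

```shell
#!/usr/bin/env bash

# ip_to_int: convert a dotted quad to a 32-bit integer
ip_to_int() {
  local IFS=. a b c d
  read -r a b c d <<< "$1"
  echo $(( (a << 24) | (b << 16) | (c << 8) | d ))
}

# in_cidr IP CIDR: exit 0 if IP is inside CIDR
in_cidr() {
  local ip net bits mask
  ip=$(ip_to_int "$1")
  net="${2%/*}"; bits="${2#*/}"
  mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
  [ $(( ip & mask )) -eq $(( $(ip_to_int "$net") & mask )) ]
}

in_cidr 100.105.0.115 100.64.0.0/10 && echo "in tailnet"     # prints: in tailnet
in_cidr 10.27.27.40   100.64.0.0/10 || echo "not in tailnet" # prints: not in tailnet
```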
### 3. Initialize Terraform
@@ -152,7 +147,9 @@ Set these in your Gitea repository settings (**Settings** → **Secrets** → **
| Secret | Description |
|--------|-------------|
| `HCLOUD_TOKEN` | Hetzner Cloud API token |
| `PROXMOX_ENDPOINT` | Proxmox API endpoint (for example `https://100.105.0.115:8006/`) |
| `PROXMOX_API_TOKEN_ID` | Proxmox API token ID |
| `PROXMOX_API_TOKEN_SECRET` | Proxmox API token secret |
| `S3_ACCESS_KEY` | Backblaze B2 keyID |
| `S3_SECRET_KEY` | Backblaze B2 applicationKey |
| `S3_ENDPOINT` | Backblaze S3 endpoint (e.g., `https://s3.eu-central-003.backblazeb2.com`) |
@@ -163,7 +160,6 @@ Set these in your Gitea repository settings (**Settings** → **Secrets** → **
| `TAILSCALE_OAUTH_CLIENT_SECRET` | Tailscale OAuth client secret for Kubernetes Operator |
| `DOPPLER_HETZNERTERRA_SERVICE_TOKEN` | Doppler service token for `hetznerterra` runtime secrets |
| `GRAFANA_ADMIN_PASSWORD` | Optional admin password for Grafana (auto-generated if unset) |
| `RUNNER_ALLOWED_CIDRS` | Optional CIDR list for CI runner access if you choose to pass it via tfvars/secrets |
| `SSH_PUBLIC_KEY` | SSH public key content |
| `SSH_PRIVATE_KEY` | SSH private key content |
@@ -176,8 +172,8 @@ This repo uses Flux for continuous reconciliation after Terraform + Ansible boot
The current default target is the HA private baseline:
- `3` control plane nodes
- `3` worker nodes
- private Hetzner network only
- `5` worker nodes
- private Proxmox network only
- Tailscale for operator and service access
- Flux-managed platform addons with `apps` suspended by default
@@ -207,8 +203,7 @@ Terraform/bootstrap secrets remain in Gitea Actions secrets and are not managed
### Reconciliation graph
- `infrastructure` (top-level)
- `addon-ccm`
- `addon-csi` depends on `addon-ccm`
- `addon-nfs-storage`
- `addon-tailscale-operator`
- `addon-observability`
- `addon-observability-content` depends on `addon-observability`
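The "depends on" edges in this graph map to Flux's `dependsOn` field. A sketch of what the `addon-observability-content` Kustomization could look like (the `path` and `interval` values are illustrative assumptions, not copied from the repo):

```yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: addon-observability-content
  namespace: flux-system
spec:
  dependsOn:
    - name: addon-observability   # reconciled only after this is healthy
  interval: 10m
  path: ./infrastructure/addons/observability-content
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system
```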
@@ -224,7 +219,7 @@ Terraform/bootstrap secrets remain in Gitea Actions secrets and are not managed
### Current addon status
- Core infrastructure addons are Flux-managed from `infrastructure/addons/`.
- Active Flux addons for the current baseline: `addon-ccm`, `addon-csi`, `addon-cert-manager`, `addon-external-secrets`, `addon-tailscale-operator`, `addon-tailscale-proxyclass`, `addon-observability`, `addon-observability-content`, `addon-rancher`, `addon-rancher-config`, `addon-rancher-backup`, `addon-rancher-backup-config`.
- Active Flux addons for the current baseline: `addon-nfs-storage`, `addon-cert-manager`, `addon-external-secrets`, `addon-tailscale-operator`, `addon-tailscale-proxyclass`, `addon-observability`, `addon-observability-content`, `addon-rancher`, `addon-rancher-config`, `addon-rancher-backup`, `addon-rancher-backup-config`.
- `apps` remains suspended until workload rollout is explicitly enabled.
- Ansible is limited to cluster bootstrap, prerequisite secret creation, pre-proxy Tailscale cleanup, and kubeconfig finalization.
- Weave GitOps / Flux UI is no longer deployed; use Rancher or the `flux` CLI for Flux operations.
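With no UI deployed, routine Flux operations go through the CLI. For example (standard `flux` CLI commands, using this repo's Kustomization names; these require a kubeconfig pointed at the cluster):

```bash
flux get kustomizations -n flux-system          # overall reconciliation status
flux reconcile kustomization infrastructure \
  --with-source                                  # force a sync from Git
flux suspend kustomization apps                  # keep workloads paused
flux resume kustomization apps                   # enable workload rollout
```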
@@ -232,14 +227,14 @@ Terraform/bootstrap secrets remain in Gitea Actions secrets and are not managed
### Rancher access
- Rancher is private-only and exposed through Tailscale at `https://rancher.silverside-gopher.ts.net/`.
- The public Hetzner load balancer path is not used for Rancher.
- Rancher and the Kubernetes API stay private; kube-vip provides the API VIP on the LAN.
- Rancher stores state in embedded etcd; no external database is used.
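kube-vip's control-plane mode is typically configured through environment variables on a static pod on each control plane. A hedged sketch of the relevant settings for the `10.27.27.40` VIP (image tag and exact manifest layout are assumptions; the real manifest is generated during bootstrap):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: kube-vip
  namespace: kube-system
spec:
  hostNetwork: true
  containers:
    - name: kube-vip
      image: ghcr.io/kube-vip/kube-vip:v0.8.0  # assumed tag
      args: ["manager"]
      env:
        - name: address
          value: "10.27.27.40"        # kube_api_vip from terraform.tfvars
        - name: port
          value: "6443"
        - name: cp_enable             # control-plane VIP mode
          value: "true"
        - name: vip_arp               # ARP-based failover on the LAN
          value: "true"
        - name: vip_leaderelection    # one node holds the VIP at a time
          value: "true"
      securityContext:
        capabilities:
          add: ["NET_ADMIN", "NET_RAW"]
```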
### Stable baseline acceptance
A rebuild is considered successful only when all of the following pass without manual intervention:
- Terraform create succeeds for the default `3` control planes and `3` workers.
- Terraform create succeeds for the default `3` control planes and `5` workers.
- Ansible bootstrap succeeds end-to-end.
- All nodes become `Ready`.
- Flux core reconciliation is healthy.
@@ -323,9 +318,6 @@ It avoids full cluster provisioning and only applies Grafana content resources:
├── terraform/
│ ├── main.tf
│ ├── variables.tf
│ ├── network.tf
│ ├── firewall.tf
│ ├── ssh.tf
│ ├── servers.tf
│ ├── outputs.tf
│ └── backend.tf
@@ -353,17 +345,19 @@ It avoids full cluster provisioning and only applies Grafana content resources:
## Firewall Rules
This repo no longer manages cloud firewalls. Access control is expected to be handled on your LAN infrastructure and through Tailscale.
Important cluster-local ports still in use:
| Port | Source | Purpose |
|------|--------|---------|
| 22 | Tailnet CIDR | SSH |
| 6443 | Tailnet CIDR + internal | Kubernetes API |
| 41641/udp | Any | Tailscale WireGuard |
| 9345 | 10.0.0.0/16 | k3s Supervisor (HA join) |
| 2379 | 10.0.0.0/16 | etcd Client |
| 2380 | 10.0.0.0/16 | etcd Peer |
| 8472 | 10.0.0.0/16 | Flannel VXLAN |
| 10250 | 10.0.0.0/16 | Kubelet |
| 30000-32767 | Optional | NodePorts (disabled by default) |
| 22 | Admin hosts / CI | SSH |
| 6443 | 10.27.27.0/24 + VIP | Kubernetes API |
| 9345 | 10.27.27.0/24 | k3s Supervisor |
| 2379 | 10.27.27.0/24 | etcd Client |
| 2380 | 10.27.27.0/24 | etcd Peer |
| 8472/udp | 10.27.27.0/24 | Flannel VXLAN |
| 10250 | 10.27.27.0/24 | Kubelet |
## Operations
@@ -399,7 +393,7 @@ terraform destroy
### Check k3s Logs
```bash
ssh root@<control-plane-ip> journalctl -u k3s -f
ssh ubuntu@<control-plane-ip> sudo journalctl -u k3s -f
```
### Reset k3s
@@ -408,19 +402,10 @@ ssh root@<control-plane-ip> journalctl -u k3s -f
ansible-playbook site.yml -t reset
```
## Costs Breakdown
| Resource | Quantity | Unit Price | Monthly |
|----------|----------|------------|---------|
| CX23 (Control Plane) | 3 | €2.99 | €8.97 |
| CX33 (Workers) | 4 | €4.99 | €19.96 |
| Backblaze B2 | ~1 GB | Free (first 10GB) | €0.00 |
| **Total** | | | **€28.93/mo** |
## Security Notes
- Control plane has HA (3 nodes, can survive 1 failure)
- Consider adding Hetzner load balancer for API server
- Kubernetes API HA is provided by kube-vip on `10.27.27.40`
- Rotate API tokens regularly
- Use network policies in Kubernetes
- Enable audit logging for production