# Proxmox Kubernetes Cluster
Production-ready private Kubernetes cluster on Proxmox using Terraform, Ansible, and Flux.
## Architecture
| Component | Details |
|-----------|---------|
| **Control Plane** | 3x Proxmox VMs (2 vCPU / 4 GiB / 32 GiB) |
| **Workers** | 5x Proxmox VMs (4 vCPU / 8 GiB / 64 GiB) |
| **K8s** | k3s (latest, HA) |
| **Addons** | NFS provisioner + Prometheus + Grafana + Loki + Rancher |
| **Access** | SSH/API and private services restricted to Tailnet |
| **Bootstrap** | Terraform + Ansible + Flux |
## Prerequisites
### 1. Proxmox API Token
Create an API token for the Proxmox VE user used by Terraform. The repo expects the `bpg/proxmox` provider with:
- endpoint: `https://100.105.0.115:8006/`
- node: `flex`
- clone source: template `9000` (`ubuntu-2404-k8s-template`)
- auth: API token
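For reference, the token can be created from the Proxmox host shell with the standard `pveum` CLI. This is a sketch: the user and token names match the `terraform.tfvars` example below, but the `PVEAdmin` role here is an assumption; scope the permissions down to what the `bpg/proxmox` provider actually needs.

```shell
# Run on the Proxmox host as root.
# Create the Terraform user (realm: pve)
pveum user add terraform-prov@pve

# Grant access -- PVEAdmin is broad; narrow the role if desired
pveum aclmod / -user terraform-prov@pve -role PVEAdmin

# Create the API token; --privsep 0 lets it inherit the user's permissions.
# The resulting token ID is terraform-prov@pve!k8s-cluster
pveum user token add terraform-prov@pve k8s-cluster --privsep 0
```

Copy the secret printed by the last command; it is shown only once.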
### 2. Backblaze B2 Bucket (for Terraform State)
1. Go to [Backblaze B2](https://secure.backblaze.com/b2_buckets.htm)
2. Click **Create a Bucket**
3. Set bucket name: `k8s-terraform-state` (must be globally unique)
4. Choose **Private** access
5. Click **Create Bucket**
6. Create application key:
   - Go to **App Keys** → **Add a New Application Key**
   - Name: `terraform-state`
   - Allow access to: `k8s-terraform-state` bucket only
   - Type: **Read and Write**
   - Copy **keyID** (access key) and **applicationKey** (secret key)
7. Note your bucket's S3 endpoint (e.g., `https://s3.eu-central-003.backblazeb2.com`)
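Once the key exists, the credentials can be sanity-checked from a workstation via B2's S3-compatible API (an optional check; assumes the AWS CLI is installed):

```shell
# B2 keyID/applicationKey map directly to AWS-style credentials
export AWS_ACCESS_KEY_ID="your-backblaze-key-id"
export AWS_SECRET_ACCESS_KEY="your-backblaze-application-key"

# An empty listing (no error) confirms the key can reach the state bucket
aws s3 ls s3://k8s-terraform-state \
  --endpoint-url https://s3.eu-central-003.backblazeb2.com
```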
### 3. SSH Key Pair
```bash
ssh-keygen -t ed25519 -C "k8s@proxmox" -f ~/.ssh/infra
```
### 4. Local Tools
- [Terraform](https://terraform.io/downloads) >= 1.0
- [Ansible](https://docs.ansible.com/ansible/latest/installation_guide/intro_installation.html) >= 2.9
- Python 3 with `jinja2` and `pyyaml`
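A quick sanity check of the toolchain before starting (a convenience snippet, not part of the repo):

```shell
# Report any missing CLI tools
for tool in terraform ansible-playbook python3; do
  command -v "$tool" >/dev/null || echo "missing: $tool"
done

# Confirm the Python templating dependencies are importable
python3 -c "import jinja2, yaml" && echo "python deps ok"
```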
## Setup
### 1. Clone Repository
```bash
git clone <your-gitea-repo>/HetznerTerra.git
cd HetznerTerra
```
### 2. Configure Variables
```bash
cp terraform.tfvars.example terraform.tfvars
```
Edit `terraform.tfvars`:
```hcl
proxmox_endpoint         = "https://100.105.0.115:8006/"
proxmox_api_token_id     = "terraform-prov@pve!k8s-cluster"
proxmox_api_token_secret = "your-proxmox-token-secret"

ssh_public_key  = "~/.ssh/infra.pub"
ssh_private_key = "~/.ssh/infra"

s3_access_key = "your-backblaze-key-id"
s3_secret_key = "your-backblaze-application-key"
s3_endpoint   = "https://s3.eu-central-003.backblazeb2.com"
s3_bucket     = "k8s-terraform-state"

tailscale_auth_key = "tskey-auth-..."
tailscale_tailnet  = "yourtailnet.ts.net"

kube_api_vip = "10.27.27.40"
```
### 3. Initialize Terraform
```bash
cd terraform
# Create backend config file (or use CLI args)
cat > backend.hcl << EOF
endpoint                   = "https://s3.eu-central-003.backblazeb2.com"
bucket                     = "k8s-terraform-state"
access_key                 = "your-backblaze-key-id"
secret_key                 = "your-backblaze-application-key"
skip_requesting_account_id = true
EOF
terraform init -backend-config=backend.hcl
```
### 4. Plan and Apply
```bash
terraform plan -var-file=../terraform.tfvars
terraform apply -var-file=../terraform.tfvars
```
### 5. Generate Ansible Inventory
```bash
cd ../ansible
python3 generate_inventory.py
```
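The generated inventory depends on the Terraform outputs; a hypothetical result (group names, host names, and IPs here are illustrative only) might look like:

```
[control_plane]
cp1 ansible_host=10.27.27.11
cp2 ansible_host=10.27.27.12
cp3 ansible_host=10.27.27.13

[workers]
worker1 ansible_host=10.27.27.21
worker2 ansible_host=10.27.27.22
```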
### 6. Bootstrap Cluster
```bash
ansible-playbook site.yml
```
### 7. Get Kubeconfig
```bash
export KUBECONFIG=$(pwd)/outputs/kubeconfig
kubectl get nodes
```
Use `scripts/refresh-kubeconfig.sh <cp1-public-ip>` to refresh kubeconfig against the primary control-plane public IP after rebuilds.
## Gitea CI/CD
This repository includes Gitea workflows for:
- **deploy**: End-to-end Terraform + Ansible + Flux bootstrap + restore + health checks
- **destroy**: Cluster teardown with backup-aware cleanup
- **dashboards**: Fast workflow that updates Grafana datasources/dashboards only
### Required Gitea Secrets
Set these in your Gitea repository settings (**Settings** → **Secrets** → **Actions**):
| Secret | Description |
|--------|-------------|
| `PROXMOX_ENDPOINT` | Proxmox API endpoint (for example `https://100.105.0.115:8006/`) |
| `PROXMOX_API_TOKEN_ID` | Proxmox API token ID |
| `PROXMOX_API_TOKEN_SECRET` | Proxmox API token secret |
| `S3_ACCESS_KEY` | Backblaze B2 keyID |
| `S3_SECRET_KEY` | Backblaze B2 applicationKey |
| `S3_ENDPOINT` | Backblaze S3 endpoint (e.g., `https://s3.eu-central-003.backblazeb2.com`) |
| `S3_BUCKET` | S3 bucket name (e.g., `k8s-terraform-state`) |
| `TAILSCALE_AUTH_KEY` | Tailscale auth key for node bootstrap |
| `TAILSCALE_TAILNET` | Tailnet domain (e.g., `yourtailnet.ts.net`) |
| `TAILSCALE_OAUTH_CLIENT_ID` | Tailscale OAuth client ID for Kubernetes Operator |
| `TAILSCALE_OAUTH_CLIENT_SECRET` | Tailscale OAuth client secret for Kubernetes Operator |
| `DOPPLER_HETZNERTERRA_SERVICE_TOKEN` | Doppler service token for `hetznerterra` runtime secrets |
| `GRAFANA_ADMIN_PASSWORD` | Optional admin password for Grafana (auto-generated if unset) |
| `SSH_PUBLIC_KEY` | SSH public key content |
| `SSH_PRIVATE_KEY` | SSH private key content |
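For reference, a workflow step consumes these like any GitHub-style Actions secret. The step below is illustrative only (the step name and `TF_VAR_` mapping are assumptions, not taken from this repo's workflows):

```yaml
- name: Terraform apply
  env:
    TF_VAR_proxmox_api_token_secret: ${{ secrets.PROXMOX_API_TOKEN_SECRET }}
    AWS_ACCESS_KEY_ID: ${{ secrets.S3_ACCESS_KEY }}
    AWS_SECRET_ACCESS_KEY: ${{ secrets.S3_SECRET_KEY }}
  run: terraform apply -auto-approve -var-file=../terraform.tfvars
```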
## GitOps (Flux)
This repo uses Flux for continuous reconciliation after Terraform + Ansible bootstrap.
### Stable private-only baseline
The current default target is the HA private baseline:

- `3` control plane nodes
- `5` worker nodes
- private Proxmox network only
- Tailscale for operator and service access
- Flux-managed platform addons with `apps` suspended by default

Detailed phase gates and success criteria live in `STABLE_BASELINE.md`.
This is the default until rebuilds are consistently green. Public ingress and app-layer expansion come later.
### Runtime secrets
Runtime cluster secrets are moving to Doppler + External Secrets Operator.
- Doppler project: `hetznerterra`
- Initial auth: service token via `DOPPLER_HETZNERTERRA_SERVICE_TOKEN`
- First synced secrets:
- `GRAFANA_ADMIN_PASSWORD`
Terraform/bootstrap secrets remain in Gitea Actions secrets and are not managed by Doppler.
### Repository layout
- `clusters/prod/`: cluster entrypoint and Flux reconciliation objects
- `clusters/prod/flux-system/`: `GitRepository` source and top-level `Kustomization` graph
- `infrastructure/`: infrastructure addon reconciliation graph
- `infrastructure/addons/*`: per-addon manifests for Flux-managed cluster addons
- `apps/`: application workload layer (currently scaffolded)
### Reconciliation graph
- `infrastructure` (top-level)
  - `addon-nfs-storage`
  - `addon-tailscale-operator`
  - `addon-observability`
  - `addon-observability-content` depends on `addon-observability`
- `apps` depends on `infrastructure`
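Each edge in this graph is expressed with `spec.dependsOn` on a Flux `Kustomization`. A sketch for one edge (the path and interval are illustrative; only the names come from the list above):

```yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: addon-observability-content
  namespace: flux-system
spec:
  interval: 10m
  path: ./infrastructure/addons/observability-content
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system
  dependsOn:
    - name: addon-observability   # blocks reconciliation until this is ready
```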
### Bootstrap notes
1. Install Flux controllers in `flux-system`.
2. Create the Flux deploy key/secret named `flux-system` in the `flux-system` namespace.
3. Apply `clusters/prod/flux-system/` once to establish source + reconciliation graph.
4. Bootstrap-only Ansible creates prerequisite secrets; Flux manages addon lifecycle after bootstrap.
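After bootstrap, reconciliation health can be checked with the `flux` CLI:

```shell
# Controllers and CRDs healthy?
flux check

# Source and Kustomization status
flux get sources git -n flux-system
flux get kustomizations -n flux-system

# Force an immediate reconcile of the top-level graph after a push
flux reconcile kustomization infrastructure --with-source
```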
### Current addon status
- Core infrastructure addons are Flux-managed from `infrastructure/addons/`.
- Active Flux addons for the current baseline: `addon-nfs-storage`, `addon-cert-manager`, `addon-external-secrets`, `addon-tailscale-operator`, `addon-tailscale-proxyclass`, `addon-observability`, `addon-observability-content`, `addon-rancher`, `addon-rancher-config`, `addon-rancher-backup`, `addon-rancher-backup-config`.
- `apps` remains suspended until workload rollout is explicitly enabled.
- Ansible is limited to cluster bootstrap, prerequisite secret creation, pre-proxy Tailscale cleanup, and kubeconfig finalization.
- Weave GitOps / Flux UI is no longer deployed; use Rancher or the `flux` CLI for Flux operations.
### Rancher access
- Rancher is private-only and exposed through Tailscale at `https://rancher.silverside-gopher.ts.net/`.
- Rancher and the Kubernetes API stay private; kube-vip provides the API VIP on the LAN.
- Rancher stores state in embedded etcd; no external database is used.
### Stable baseline acceptance
A rebuild is considered successful only when all of the following pass without manual intervention:
- Terraform create succeeds for the default `3` control planes and `5` workers.
- Ansible bootstrap succeeds end-to-end.
- All nodes become `Ready` .
- Flux core reconciliation is healthy.
- External Secrets Operator is ready.
- Tailscale operator is ready.
- Tailnet smoke checks pass for Rancher, Grafana, and Prometheus.
- Terraform destroy succeeds cleanly or succeeds after workflow retries.
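These gates can be spot-checked manually (an illustrative sequence; the operator namespaces are assumptions based on common defaults):

```shell
export KUBECONFIG=$(pwd)/outputs/kubeconfig

# All nodes Ready
kubectl get nodes

# Flux core reconciliation healthy
flux get kustomizations -n flux-system

# Operator pods up (namespaces assumed)
kubectl -n external-secrets get pods
kubectl -n tailscale-system get pods

# Tailnet smoke checks -- run from a tailnet-connected machine
curl -fsk https://rancher.silverside-gopher.ts.net/ >/dev/null && echo "rancher ok"
curl -fs http://grafana.silverside-gopher.ts.net/ >/dev/null && echo "grafana ok"
```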
## Observability Stack
Flux deploys a lightweight observability stack in the `observability` namespace:
- `kube-prometheus-stack` (Prometheus + Grafana)
- `loki`
- `promtail`
Grafana content is managed as code via ConfigMaps in `infrastructure/addons/observability-content/`.
Grafana and Prometheus are exposed through dedicated Tailscale LoadBalancer services when the Tailscale Kubernetes Operator is healthy.
### Access Grafana and Prometheus
Preferred private access:
- Grafana: `http://grafana.silverside-gopher.ts.net/`
- Prometheus: `http://prometheus.silverside-gopher.ts.net:9090/`
Fallback (port-forward from a tailnet-connected machine):
```bash
export KUBECONFIG=$(pwd)/outputs/kubeconfig
kubectl -n observability port-forward svc/kube-prometheus-stack-grafana 3000:80
kubectl -n observability port-forward svc/kube-prometheus-stack-prometheus 9090:9090
```
Then open:
- Grafana: http://127.0.0.1:3000
- Prometheus: http://127.0.0.1:9090
Grafana user: `admin`
Grafana password: value of `GRAFANA_ADMIN_PASSWORD` secret (or the generated value shown by Ansible output)
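If the password was auto-generated and the Ansible output is no longer at hand, it can usually be recovered from the Grafana admin secret (the secret name below follows kube-prometheus-stack chart defaults and is an assumption):

```shell
export KUBECONFIG=$(pwd)/outputs/kubeconfig

# Decode the stored admin password
kubectl -n observability get secret kube-prometheus-stack-grafana \
  -o jsonpath='{.data.admin-password}' | base64 -d; echo
```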
### Verify Tailscale exposure
```bash
export KUBECONFIG=$(pwd)/outputs/kubeconfig
kubectl -n tailscale-system get pods
kubectl -n cattle-system get svc rancher-tailscale
kubectl -n observability get svc grafana-tailscale prometheus-tailscale
kubectl -n cattle-system describe svc rancher-tailscale | grep TailscaleProxyReady
kubectl -n observability describe svc grafana-tailscale | grep TailscaleProxyReady
kubectl -n observability describe svc prometheus-tailscale | grep TailscaleProxyReady
```
If `TailscaleProxyReady=False`, check:
```bash
kubectl -n tailscale-system logs deployment/operator --tail=100
```
Common cause: OAuth client missing tag/scopes permissions.
### Fast dashboard iteration workflow
Use the `Deploy Grafana Content` workflow when changing dashboard/data source templates.
It avoids full cluster provisioning and only applies Grafana content resources:
- `ansible/roles/observability-content/templates/grafana-datasources.yaml.j2`
- `ansible/roles/observability-content/templates/grafana-dashboard-k8s-overview.yaml.j2`
- `ansible/dashboards.yml`
## File Structure
```
.
├── terraform/
│   ├── main.tf
│   ├── variables.tf
│   ├── servers.tf
│   ├── outputs.tf
│   └── backend.tf
├── ansible/
│   ├── inventory.tmpl
│   ├── generate_inventory.py
│   ├── site.yml
│   ├── roles/
│   │   ├── common/
│   │   ├── k3s-server/
│   │   ├── k3s-agent/
│   │   ├── addon-secrets-bootstrap/
│   │   ├── observability-content/
│   │   └── observability/
│   └── ansible.cfg
├── .gitea/
│   └── workflows/
│       ├── terraform.yml
│       ├── ansible.yml
│       └── dashboards.yml
├── outputs/
├── terraform.tfvars.example
└── README.md
```
## Firewall Rules
This repo no longer manages cloud firewalls. Access control is expected to be handled on your LAN infrastructure and through Tailscale.
Important cluster-local ports still in use:
| Port | Source | Purpose |
|------|--------|---------|
| 22 | Admin hosts / CI | SSH |
| 6443 | 10.27.27.0/24 + VIP | Kubernetes API |
| 9345 | 10.27.27.0/24 | k3s Supervisor |
| 2379 | 10.27.27.0/24 | etcd Client |
| 2380 | 10.27.27.0/24 | etcd Peer |
| 8472/udp | 10.27.27.0/24 | Flannel VXLAN |
| 10250 | 10.27.27.0/24 | Kubelet |
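From a host on the `10.27.27.0/24` network, these ports can be spot-checked with `nc` (a convenience sketch; the node IP is illustrative):

```shell
# Probe key cluster ports on one control-plane node
NODE=10.27.27.11   # illustrative node IP
for port in 22 6443 9345 2379 2380 10250; do
  nc -z -w 2 "$NODE" "$port" && echo "open: $port" || echo "closed: $port"
done
```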
## Operations
### Scale Workers
Edit `terraform.tfvars`:
```hcl
worker_count = 7  # e.g., scale up from the default 5
```
Then:
```bash
terraform apply
ansible-playbook site.yml
```
### Upgrade k3s
```bash
ansible-playbook site.yml -t upgrade
```
### Destroy Cluster
```bash
terraform destroy
```
## Troubleshooting
### Check k3s Logs
```bash
ssh ubuntu@<control-plane-ip> sudo journalctl -u k3s -f
```
### Reset k3s
```bash
ansible-playbook site.yml -t reset
```
## Security Notes
- Control plane is HA (3 nodes; tolerates 1 node failure)
- Kubernetes API HA is provided by kube-vip on `10.27.27.40`
- Rotate API tokens regularly
- Use network policies in Kubernetes
- Enable audit logging for production
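As a concrete starting point for the network-policy recommendation, a generic per-namespace default-deny baseline (not a manifest from this repo; the namespace is hypothetical):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: my-app            # hypothetical namespace
spec:
  podSelector: {}              # matches every pod in the namespace
  policyTypes:
    - Ingress                  # all ingress denied unless allowed elsewhere
```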
## License
MIT