
Proxmox Kubernetes Cluster

Production-ready private Kubernetes cluster on Proxmox using Terraform, Ansible, and Flux.

Architecture

Component      Details
Control Plane  3x Proxmox VMs (2 vCPU / 4 GiB / 32 GiB)
Workers        5x Proxmox VMs (4 vCPU / 8 GiB / 64 GiB)
K8s            k3s (latest, HA)
Addons         NFS provisioner + Prometheus + Grafana + Loki + Rancher
Access         SSH/API and private services restricted to Tailnet
Bootstrap      Terraform + Ansible + Flux

Prerequisites

1. Proxmox API Token

Create an API token for the Proxmox VE user used by Terraform. The repo expects the bpg/proxmox provider with:

  • endpoint: https://100.105.0.115:8006/
  • node: flex
  • clone source: template 9000 (ubuntu-2404-k8s-template)
  • auth: API token
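
One way to create the user and token on the Proxmox host is with pveum; a minimal sketch (the user and token name match the terraform.tfvars example below; Administrator is a deliberately broad role for simplicity):

pveum user add terraform-prov@pve
pveum aclmod / -user terraform-prov@pve -role Administrator
pveum user token add terraform-prov@pve k8s-cluster --privsep 0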

2. Backblaze B2 Bucket (for Terraform State)

  1. Go to Backblaze B2
  2. Click Create a Bucket
  3. Set bucket name: k8s-terraform-state (must be globally unique)
  4. Choose Private access
  5. Click Create Bucket
  6. Create application key:
    • Go to App Keys → Add a New Application Key
    • Name: terraform-state
    • Allow access to: k8s-terraform-state bucket only
    • Type: Read and Write
    • Copy keyID (access key) and applicationKey (secret key)
  7. Note your bucket's S3 endpoint (e.g., https://s3.eu-central-003.backblazeb2.com)
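
To verify the key before wiring it into Terraform, an S3-compatible listing via the aws CLI should succeed (a sketch, assuming the endpoint and bucket above):

AWS_ACCESS_KEY_ID=your-backblaze-key-id \
AWS_SECRET_ACCESS_KEY=your-backblaze-application-key \
aws s3 ls s3://k8s-terraform-state --endpoint-url https://s3.eu-central-003.backblazeb2.com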

3. SSH Key Pair

ssh-keygen -t ed25519 -C "k8s@proxmox" -f ~/.ssh/infra

4. Local Tools

Install locally: terraform, ansible (ansible-playbook), python3, kubectl, and the flux CLI.

Setup

1. Clone Repository

git clone <your-gitea-repo>/HetznerTerra.git
cd HetznerTerra

2. Configure Variables

cp terraform.tfvars.example terraform.tfvars

Edit terraform.tfvars:

proxmox_endpoint         = "https://100.105.0.115:8006/"
proxmox_api_token_id     = "terraform-prov@pve!k8s-cluster"
proxmox_api_token_secret = "your-proxmox-token-secret"

ssh_public_key  = "~/.ssh/infra.pub"
ssh_private_key = "~/.ssh/infra"

s3_access_key = "your-backblaze-key-id"
s3_secret_key = "your-backblaze-application-key"
s3_endpoint   = "https://s3.eu-central-003.backblazeb2.com"
s3_bucket     = "k8s-terraform-state"

tailscale_auth_key = "tskey-auth-..."
tailscale_tailnet  = "yourtailnet.ts.net"

kube_api_vip = "10.27.27.40"

3. Initialize Terraform

cd terraform

# Create backend config file (or use CLI args)
cat > backend.hcl << EOF
endpoint                    = "https://s3.eu-central-003.backblazeb2.com"
bucket                      = "k8s-terraform-state"
access_key                  = "your-backblaze-key-id"
secret_key                  = "your-backblaze-application-key"
skip_requesting_account_id  = true
EOF

terraform init -backend-config=backend.hcl

4. Plan and Apply

terraform plan -var-file=../terraform.tfvars
terraform apply -var-file=../terraform.tfvars

5. Generate Ansible Inventory

cd ../ansible
python3 generate_inventory.py
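
Before bootstrapping, you can sanity-check the generated inventory and SSH connectivity (assuming ansible.cfg points at the generated inventory):

ansible all -m ping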

6. Bootstrap Cluster

ansible-playbook site.yml

7. Get Kubeconfig

# from the repo root
export KUBECONFIG=$(pwd)/outputs/kubeconfig
kubectl get nodes

Use scripts/refresh-kubeconfig.sh <cp1-public-ip> to refresh kubeconfig against the primary control-plane public IP after rebuilds.

Gitea CI/CD

This repository includes Gitea workflows for:

  • deploy: End-to-end Terraform + Ansible + Flux bootstrap + restore + health checks
  • destroy: Cluster teardown with backup-aware cleanup
  • dashboards: Fast workflow that updates Grafana datasources/dashboards only

Required Gitea Secrets

Set these in your Gitea repository settings (Settings → Actions → Secrets):

Secret                              Description
PROXMOX_ENDPOINT                    Proxmox API endpoint (e.g., https://100.105.0.115:8006/)
PROXMOX_API_TOKEN_ID                Proxmox API token ID
PROXMOX_API_TOKEN_SECRET            Proxmox API token secret
S3_ACCESS_KEY                       Backblaze B2 keyID
S3_SECRET_KEY                       Backblaze B2 applicationKey
S3_ENDPOINT                         Backblaze S3 endpoint (e.g., https://s3.eu-central-003.backblazeb2.com)
S3_BUCKET                           S3 bucket name (e.g., k8s-terraform-state)
TAILSCALE_AUTH_KEY                  Tailscale auth key for node bootstrap
TAILSCALE_TAILNET                   Tailnet domain (e.g., yourtailnet.ts.net)
TAILSCALE_OAUTH_CLIENT_ID           Tailscale OAuth client ID for the Kubernetes Operator
TAILSCALE_OAUTH_CLIENT_SECRET       Tailscale OAuth client secret for the Kubernetes Operator
DOPPLER_HETZNERTERRA_SERVICE_TOKEN  Doppler service token for hetznerterra runtime secrets
GRAFANA_ADMIN_PASSWORD              Optional admin password for Grafana (auto-generated if unset)
SSH_PUBLIC_KEY                      SSH public key content
SSH_PRIVATE_KEY                     SSH private key content

GitOps (Flux)

This repo uses Flux for continuous reconciliation after Terraform + Ansible bootstrap.

Stable private-only baseline

The current default target is the HA private baseline:

  • 3 control plane nodes
  • 5 worker nodes
  • private Proxmox network only
  • Tailscale for operator and service access
  • Flux-managed platform addons with apps suspended by default

Detailed phase gates and success criteria live in STABLE_BASELINE.md.

This is the default until rebuilds are consistently green. High availability, public ingress, and app-layer expansion come later.

Runtime secrets

Runtime cluster secrets are moving to Doppler + External Secrets Operator.

  • Doppler project: hetznerterra
  • Initial auth: service token via DOPPLER_HETZNERTERRA_SERVICE_TOKEN
  • First synced secrets:
    • GRAFANA_ADMIN_PASSWORD

Terraform/bootstrap secrets remain in Gitea Actions secrets and are not managed by Doppler.
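
As an illustration of the ESO wiring, a SecretStore using the Doppler provider plus an ExternalSecret for GRAFANA_ADMIN_PASSWORD might look like this (a sketch; the namespace and resource names are hypothetical, not necessarily the repo's actual manifests):

kubectl apply -f - <<'EOF'
apiVersion: external-secrets.io/v1beta1
kind: SecretStore
metadata:
  name: doppler
  namespace: observability
spec:
  provider:
    doppler:
      auth:
        secretRef:
          dopplerToken:
            name: doppler-token        # created from DOPPLER_HETZNERTERRA_SERVICE_TOKEN
            key: dopplerToken
---
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: grafana-admin
  namespace: observability
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: SecretStore
    name: doppler
  target:
    name: grafana-admin
  data:
    - secretKey: admin-password
      remoteRef:
        key: GRAFANA_ADMIN_PASSWORD
EOF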

Repository layout

  • clusters/prod/: cluster entrypoint and Flux reconciliation objects
  • clusters/prod/flux-system/: GitRepository source and top-level Kustomization graph
  • infrastructure/: infrastructure addon reconciliation graph
  • infrastructure/addons/*: per-addon manifests for Flux-managed cluster addons
  • apps/: application workload layer (currently scaffolded)

Reconciliation graph

  • infrastructure (top-level)
    • addon-nfs-storage
    • addon-tailscale-operator
    • addon-observability
    • addon-observability-content depends on addon-observability
  • apps depends on infrastructure
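
You can inspect this graph live with the flux CLI, for example:

flux get kustomizations -A
flux tree kustomization infrastructure -n flux-system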

Bootstrap notes

  1. Install Flux controllers in flux-system.
  2. Create the Flux deploy key/secret named flux-system in flux-system namespace.
  3. Apply clusters/prod/flux-system/ once to establish source + reconciliation graph.
  4. Bootstrap-only Ansible creates prerequisite secrets; Flux manages addon lifecycle after bootstrap.
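
A minimal sketch of steps 1-3 with the flux CLI and kubectl (the Git URL is a placeholder and the deploy key path is hypothetical):

flux install
flux create secret git flux-system \
  --namespace=flux-system \
  --url=ssh://git@<your-gitea-host>/<you>/HetznerTerra.git \
  --private-key-file=./deploy-key
kubectl apply -k clusters/prod/flux-system/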

Current addon status

  • Core infrastructure addons are Flux-managed from infrastructure/addons/.
  • Active Flux addons for the current baseline: addon-nfs-storage, addon-cert-manager, addon-external-secrets, addon-tailscale-operator, addon-tailscale-proxyclass, addon-observability, addon-observability-content, addon-rancher, addon-rancher-config, addon-rancher-backup, addon-rancher-backup-config.
  • apps remains suspended until workload rollout is explicitly enabled.
  • Ansible is limited to cluster bootstrap, prerequisite secret creation, pre-proxy Tailscale cleanup, and kubeconfig finalization.
  • Weave GitOps / Flux UI is no longer deployed; use Rancher or the flux CLI for Flux operations.
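
For example, enabling the app layer later is a single flux CLI call, and suspending it again is symmetric:

flux resume kustomization apps -n flux-system
flux suspend kustomization apps -n flux-system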

Rancher access

  • Rancher is private-only and exposed through Tailscale at https://rancher.silverside-gopher.ts.net/.
  • Rancher and the Kubernetes API stay private; kube-vip provides the API VIP on the LAN.
  • Rancher stores state in embedded etcd; no external database is used.

Stable baseline acceptance

A rebuild is considered successful only when all of the following pass without manual intervention:

  • Terraform create succeeds for the default 3 control planes and 5 workers.
  • Ansible bootstrap succeeds end-to-end.
  • All nodes become Ready.
  • Flux core reconciliation is healthy.
  • External Secrets Operator is ready.
  • Tailscale operator is ready.
  • Tailnet smoke checks pass for Rancher, Grafana, and Prometheus.
  • Terraform destroy succeeds cleanly or succeeds after workflow retries.
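
A quick manual spot check of the same criteria from a tailnet-connected machine (a sketch, not the workflow's exact health checks):

kubectl get nodes
flux check
flux get kustomizations -A
curl -kfsS -o /dev/null https://rancher.silverside-gopher.ts.net/ && echo rancher ok
curl -fsS -o /dev/null http://grafana.silverside-gopher.ts.net/ && echo grafana ok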

Observability Stack

Flux deploys a lightweight observability stack in the observability namespace:

  • kube-prometheus-stack (Prometheus + Grafana)
  • loki
  • promtail

Grafana content is managed as code via ConfigMaps in infrastructure/addons/observability-content/.

Grafana and Prometheus are exposed through dedicated Tailscale LoadBalancer services when the Tailscale Kubernetes Operator is healthy.

Access Grafana and Prometheus

Preferred private access:

  • Grafana: http://grafana.silverside-gopher.ts.net/
  • Prometheus: http://prometheus.silverside-gopher.ts.net:9090/

Fallback: port-forward from a tailnet-connected machine:

export KUBECONFIG=$(pwd)/outputs/kubeconfig

kubectl -n observability port-forward svc/kube-prometheus-stack-grafana 3000:80
kubectl -n observability port-forward svc/kube-prometheus-stack-prometheus 9090:9090

Then open http://localhost:3000 (Grafana) and http://localhost:9090 (Prometheus).

Grafana user: admin
Grafana password: value of the GRAFANA_ADMIN_PASSWORD secret (or the generated value shown in the Ansible output)

Verify Tailscale exposure

export KUBECONFIG=$(pwd)/outputs/kubeconfig

kubectl -n tailscale-system get pods
kubectl -n cattle-system get svc rancher-tailscale
kubectl -n observability get svc grafana-tailscale prometheus-tailscale
kubectl -n cattle-system describe svc rancher-tailscale | grep TailscaleProxyReady
kubectl -n observability describe svc grafana-tailscale | grep TailscaleProxyReady
kubectl -n observability describe svc prometheus-tailscale | grep TailscaleProxyReady

If TailscaleProxyReady=False, check:

kubectl -n tailscale-system logs deployment/operator --tail=100

Common cause: OAuth client missing tag/scopes permissions.

Fast dashboard iteration workflow

Use the Deploy Grafana Content workflow when changing dashboard/data source templates. It avoids full cluster provisioning and only applies Grafana content resources:

  • ansible/roles/observability-content/templates/grafana-datasources.yaml.j2
  • ansible/roles/observability-content/templates/grafana-dashboard-k8s-overview.yaml.j2
  • ansible/dashboards.yml
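
For local iteration outside CI, running the playbook directly should apply the same content (assuming a current inventory and kubeconfig):

cd ansible
ansible-playbook dashboards.yml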

File Structure

.
├── terraform/
│   ├── main.tf
│   ├── variables.tf
│   ├── servers.tf
│   ├── outputs.tf
│   └── backend.tf
├── ansible/
│   ├── inventory.tmpl
│   ├── generate_inventory.py
│   ├── site.yml
│   ├── roles/
│   │   ├── common/
│   │   ├── k3s-server/
│   │   ├── k3s-agent/
│   │   ├── addon-secrets-bootstrap/
│   │   ├── observability-content/
│   │   └── observability/
│   └── ansible.cfg
├── .gitea/
│   └── workflows/
│       ├── terraform.yml
│       ├── ansible.yml
│       └── dashboards.yml
├── outputs/
├── terraform.tfvars.example
└── README.md

Firewall Rules

This repo no longer manages cloud firewalls. Access control is expected to be handled on your LAN infrastructure and through Tailscale.

Important cluster-local ports still in use:

Port      Source               Purpose
22        Admin hosts / CI     SSH
6443      10.27.27.0/24 + VIP  Kubernetes API
9345      10.27.27.0/24        k3s Supervisor
2379      10.27.27.0/24        etcd Client
2380      10.27.27.0/24        etcd Peer
8472/udp  10.27.27.0/24        Flannel VXLAN
10250     10.27.27.0/24        Kubelet

Operations

Scale Workers

Edit terraform.tfvars:

worker_count = 5

Then:

terraform apply -var-file=../terraform.tfvars
ansible-playbook site.yml

Upgrade k3s

ansible-playbook site.yml -t upgrade

Destroy Cluster

terraform destroy -var-file=../terraform.tfvars

Troubleshooting

Check k3s Logs

ssh ubuntu@<control-plane-ip> sudo journalctl -u k3s -f

Reset k3s

ansible-playbook site.yml -t reset

Security Notes

  • Control plane has HA (3 nodes, can survive 1 failure)
  • Kubernetes API HA is provided by kube-vip on 10.27.27.40
  • Rotate API tokens regularly
  • Use network policies in Kubernetes (see the sketch after this list)
  • Enable audit logging for production
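
As a starting point for network policies, a default-deny ingress policy per namespace is a common baseline (a sketch; the "apps" namespace is hypothetical):

kubectl apply -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: apps
spec:
  podSelector: {}
  policyTypes:
    - Ingress
EOF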

License

MIT
