Proxmox Kubernetes Cluster

Private HA K3s cluster on Proxmox, provisioned by Terraform, bootstrapped by Ansible, and reconciled by Flux.

Architecture

Component        Current Baseline
Control plane    3 Proxmox VMs, VMIDs 200-202, IPs 10.27.27.30-32, 2 vCPU / 4 GiB RAM / 32 GiB disk each
Workers          5 Proxmox VMs, VMIDs 210-214, IPs 10.27.27.41-45, 4 vCPU / 8 GiB RAM / 64 GiB disk each
Kubernetes       K3s v1.34.6+k3s1, HA embedded etcd, kube-vip API VIP 10.27.27.40
Proxmox          Node flex, template VMID 9000, datastore Flash, bridge vmbr0
Storage          Raw-manifest nfs-subdir-external-provisioner, 10.27.27.239:/TheFlash/k8s-nfs, default StorageClass flash-nfs
GitOps           Flux GitRepository platform on branch main; the apps Kustomization is intentionally suspended
Private access   Tailscale operator exposes Rancher, Grafana, and Prometheus; no public ingress in the baseline
Runtime secrets  Doppler service token bootstraps the External Secrets Operator

K3s is pinned because Rancher chart 2.13.3 requires Kubernetes <1.35.0-0.

Prerequisites

  • Terraform >= 1.0.
  • Ansible, with the Python packages jinja2 and pyyaml.
  • kubectl for local verification.
  • Proxmox API token for the bpg/proxmox provider.
  • S3-compatible bucket for Terraform state, currently Backblaze B2.
  • SSH key pair available to Terraform and Ansible, defaulting to ~/.ssh/infra and ~/.ssh/infra.pub.
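
If the key pair does not exist yet, it can be generated locally first; a minimal sketch (the ed25519 type and the comment string are illustrative choices, not repo requirements):

# creates ~/.ssh/infra and ~/.ssh/infra.pub, matching the Terraform/Ansible defaults
ssh-keygen -t ed25519 -f ~/.ssh/infra -C "proxmox-k8s-cluster"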

Expected Proxmox inputs:

Setting       Value
Endpoint      https://100.105.0.115:8006/
Node          flex
Clone source  Template VMID 9000 (ubuntu-2404-k8s-template)
Storage       Flash

Local Setup

Create local variables from the example:

cp terraform.tfvars.example terraform.tfvars

Important defaults in terraform.tfvars.example:

proxmox_endpoint         = "https://100.105.0.115:8006/"
proxmox_api_token_id     = "terraform-prov@pve!k8s-cluster"
proxmox_api_token_secret = "your-proxmox-api-token-secret"

ssh_public_key  = "~/.ssh/infra.pub"
ssh_private_key = "~/.ssh/infra"

s3_access_key = "your-backblaze-key-id"
s3_secret_key = "your-backblaze-application-key"
s3_endpoint   = "https://s3.eu-central-003.backblazeb2.com"
s3_bucket     = "k8s-terraform-state"

tailscale_tailnet = "yourtailnet.ts.net"
kube_api_vip      = "10.27.27.40"

Initialize Terraform with backend credentials:

terraform -chdir=terraform init \
  -backend-config="endpoint=<s3-endpoint>" \
  -backend-config="bucket=<s3-bucket>" \
  -backend-config="region=auto" \
  -backend-config="access_key=<s3-access-key>" \
  -backend-config="secret_key=<s3-secret-key>" \
  -backend-config="skip_requesting_account_id=true"

Common Commands

Terraform:

terraform -chdir=terraform fmt -recursive
terraform -chdir=terraform validate
terraform -chdir=terraform plan -var-file=../terraform.tfvars
terraform -chdir=terraform apply -var-file=../terraform.tfvars

Ansible setup:

ansible-galaxy collection install -r ansible/requirements.yml
cd ansible
python3 generate_inventory.py
ansible-playbook site.yml --syntax-check

Manual Ansible bootstrap uses the same extra vars as the deploy workflow:

cd ansible
ansible-playbook site.yml \
  -e "tailscale_auth_key=$TAILSCALE_AUTH_KEY" \
  -e "tailscale_tailnet=$TAILSCALE_TAILNET" \
  -e "tailscale_oauth_client_id=$TAILSCALE_OAUTH_CLIENT_ID" \
  -e "tailscale_oauth_client_secret=$TAILSCALE_OAUTH_CLIENT_SECRET" \
  -e "doppler_hetznerterra_service_token=$DOPPLER_HETZNERTERRA_SERVICE_TOKEN" \
  -e "tailscale_api_key=${TAILSCALE_API_KEY:-}" \
  -e "grafana_admin_password=${GRAFANA_ADMIN_PASSWORD:-}" \
  -e "cluster_name=k8s-cluster"

Flux/Kustomize verification:

kubectl kustomize infrastructure/addons/<addon>
kubectl kustomize infrastructure/addons
kubectl kustomize clusters/prod/flux-system
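
If the flux CLI is installed (optional; the repo itself only needs kubectl here), reconciliation state can be cross-checked directly:

flux get sources git -n flux-system
flux get kustomizations -n flux-system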

Refresh kubeconfig after rebuilds:

scripts/refresh-kubeconfig.sh 10.27.27.30
export KUBECONFIG=$(pwd)/outputs/kubeconfig
kubectl get nodes

Run the tailnet smoke check from cp1:

ssh ubuntu@10.27.27.30 'bash -s' < scripts/smoke-check-tailnet-services.sh

Gitea CI/CD

The supported full rebuild path is the Gitea deploy workflow.

.gitea/workflows/deploy.yml
  Trigger: PR to main, push to main, or manual dispatch
  Purpose: PRs run Terraform plan; pushes run Terraform apply, Ansible bootstrap, Flux bootstrap, addon gates, health checks, and tailnet smoke checks

.gitea/workflows/destroy.yml
  Trigger: manual dispatch with confirm: destroy
  Purpose: Terraform destroy with retries; no Rancher backup gate

.gitea/workflows/dashboards.yml
  Trigger: Grafana content changes or manual dispatch
  Purpose: fast Grafana datasource/dashboard update through ansible/dashboards.yml

Deploy and destroy share concurrency.group: prod-cluster so they do not run at the same time.
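
In workflow terms this is a shared concurrency block, roughly (an illustrative excerpt, not the exact file contents):

# present in both deploy.yml and destroy.yml
concurrency:
  group: prod-cluster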

Deploy sequence on push to main:

  1. Terraform fmt/init/validate/plan/apply.
  2. Cleanup/retry around known transient Proxmox clone and disk-update failures.
  3. Generate Ansible inventory from Terraform outputs.
  4. Prepare critical image archives with skopeo on the runner.
  5. Run ansible/site.yml to bootstrap nodes, K3s, kube-vip, prerequisite secrets, and kubeconfig.
  6. Apply Flux CRDs/controllers and the clusters/prod/flux-system graph.
  7. Gate cert-manager, External Secrets, Tailscale, NFS, Rancher, and observability (see the sketch after this list).
  8. Run post-deploy health checks and Tailscale service smoke checks.
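
A gate in step 7 is essentially a readiness wait; a sketch of the shape (illustrative commands, not the workflow's exact steps):

# block until an addon's workloads report Available before moving to the next gate
kubectl -n cert-manager wait deploy --all --for=condition=Available --timeout=300s
kubectl -n cattle-system rollout status deploy/rancher --timeout=600s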

Required Gitea secrets:

Secret                              Description
PROXMOX_ENDPOINT                    Proxmox API endpoint, for example https://100.105.0.115:8006/
PROXMOX_API_TOKEN_ID                Proxmox API token ID
PROXMOX_API_TOKEN_SECRET            Proxmox API token secret
S3_ACCESS_KEY                       S3/Backblaze access key for Terraform state
S3_SECRET_KEY                       S3/Backblaze secret key for Terraform state
S3_ENDPOINT                         S3 endpoint, for example https://s3.eu-central-003.backblazeb2.com
S3_BUCKET                           Terraform state bucket, for example k8s-terraform-state
TAILSCALE_AUTH_KEY                  Tailscale auth key for node bootstrap
TAILSCALE_TAILNET                   Tailnet domain, for example silverside-gopher.ts.net
TAILSCALE_OAUTH_CLIENT_ID           Tailscale OAuth client ID for the Kubernetes operator
TAILSCALE_OAUTH_CLIENT_SECRET       Tailscale OAuth client secret for the Kubernetes operator
TAILSCALE_API_KEY                   Optional API key used to delete stale offline reserved devices before service proxies exist
DOPPLER_HETZNERTERRA_SERVICE_TOKEN  Doppler service token for runtime cluster secrets
GRAFANA_ADMIN_PASSWORD              Optional Grafana admin password
SSH_PUBLIC_KEY                      SSH public key content
SSH_PRIVATE_KEY                     SSH private key content

GitOps Graph

Flux entrypoint:

clusters/prod/flux-system/
├── gotk-components.yaml
├── gitrepository-platform.yaml
├── kustomization-infrastructure.yaml
└── kustomization-apps.yaml  # suspend: true
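
Because apps is suspended on purpose, Flux skipping it is expected rather than an error; this can be confirmed from the API (assuming the Kustomization object is named apps):

kubectl -n flux-system get kustomization apps -o jsonpath='{.spec.suspend}'
# prints: true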

Active infrastructure addons from infrastructure/addons/kustomization.yaml:

  • addon-nfs-storage
  • addon-external-secrets
  • addon-cert-manager
  • addon-tailscale-operator
  • addon-tailscale-proxyclass
  • traefik (HelmRelease manifests applied directly by the top-level infrastructure Kustomization)
  • addon-observability
  • addon-observability-content
  • addon-rancher
  • addon-rancher-config

Chart/source strategy:

  • Vendored charts are intentional: cert-manager, traefik, kube-prometheus-stack, tailscale-operator, and rancher live under infrastructure/charts/.
  • External Secrets, Loki, and Promtail use Flux OCIRepository sources.
  • NFS storage is raw Kubernetes manifests, not a Helm chart.
  • Rancher backup/restore is not part of the current live graph.

Doppler bootstrap details:

  • ansible/roles/doppler-bootstrap creates the external-secrets namespace and the Doppler token secret only.
  • The deploy workflow creates ClusterSecretStore/doppler-hetznerterra after ESO CRDs and webhook endpoints exist; see the sketch below.
  • The checked-in infrastructure/addons/external-secrets/clustersecretstore-doppler-hetznerterra.yaml is not included by the addon kustomization.
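
The ordering in the second point matters because ESO's admission webhook must be serving before a ClusterSecretStore can be applied; a minimal sketch of that sequence (the deployment name external-secrets-webhook is the chart default and an assumption here):

kubectl -n external-secrets rollout status deploy/external-secrets-webhook --timeout=180s
kubectl apply -f infrastructure/addons/external-secrets/clustersecretstore-doppler-hetznerterra.yaml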

Access URLs

Service     URL
Rancher     https://rancher.silverside-gopher.ts.net/
Grafana     http://grafana.silverside-gopher.ts.net/
Prometheus  http://prometheus.silverside-gopher.ts.net:9090/

Fallback port-forward from a tailnet-connected machine:

export KUBECONFIG=$(pwd)/outputs/kubeconfig
kubectl -n observability port-forward svc/kube-prometheus-stack-grafana 3000:80
kubectl -n observability port-forward svc/kube-prometheus-stack-prometheus 9090:9090

The Grafana user is admin; the password comes from the GRAFANA_ADMIN_PASSWORD Doppler secret, or from the workflow-provided fallback.
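
To read the effective password back, the kube-prometheus-stack chart keeps it in a Secret by default (the secret name below is the chart default and an assumption here):

kubectl -n observability get secret kube-prometheus-stack-grafana \
  -o jsonpath='{.data.admin-password}' | base64 -d; echo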

Operations

Scale workers by updating terraform.tfvars counts, IP lists, and VMID lists together. If node names or VMIDs change, also update the hard-coded retry cleanup target map in .gitea/workflows/deploy.yml.
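
For example, growing from five to six workers touches three values in lockstep (the variable names below are hypothetical; match them to the ones in terraform/variables.tf):

# hypothetical terraform.tfvars excerpt
worker_count = 6
worker_vmids = [210, 211, 212, 213, 214, 215]
worker_ips   = ["10.27.27.41", "10.27.27.42", "10.27.27.43", "10.27.27.44", "10.27.27.45", "10.27.27.46"]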

Upgrade K3s by changing the role defaults in ansible/roles/k3s-server/defaults/main.yml and ansible/roles/k3s-agent/defaults/main.yml. Check Rancher chart compatibility before moving to a Kubernetes minor release outside the current <1.35.0-0 constraint.
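
The pin itself is one value per role, along these lines (the variable name is hypothetical; check the defaults files for the real one):

# ansible/roles/k3s-server/defaults/main.yml (mirrored in k3s-agent)
k3s_version: "v1.34.6+k3s1"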

Destroy through the Gitea Destroy workflow with confirm: destroy, or locally with:

terraform -chdir=terraform destroy -var-file=../terraform.tfvars

Troubleshooting

Check K3s from cp1:

ssh ubuntu@10.27.27.30 'sudo k3s kubectl get nodes -o wide'
ssh ubuntu@10.27.27.30 'sudo journalctl -u k3s -n 120 --no-pager'

Check Flux and Rancher:

kubectl -n flux-system get gitrepositories,kustomizations,helmreleases,ocirepositories
kubectl -n flux-system describe helmrelease rancher
kubectl -n cattle-system get pods,deploy -o wide

Check Tailscale services:

kubectl -n tailscale-system get pods
kubectl -n cattle-system get svc rancher-tailscale
kubectl -n observability get svc grafana-tailscale prometheus-tailscale
kubectl -n cattle-system describe svc rancher-tailscale | grep TailscaleProxyReady
kubectl -n observability describe svc grafana-tailscale | grep TailscaleProxyReady
kubectl -n observability describe svc prometheus-tailscale | grep TailscaleProxyReady

If local kubectl falls back to localhost:8080, refresh outputs/kubeconfig with scripts/refresh-kubeconfig.sh 10.27.27.30.
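
A quick way to confirm which API endpoint the current kubeconfig points at, before and after the refresh:

kubectl config view --minify -o jsonpath='{.clusters[0].cluster.server}'; echo
# expect the cluster API endpoint (for example the kube-vip VIP on port 6443), not localhost:8080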

Security Notes

  • Never commit terraform.tfvars, kubeconfigs, private keys, outputs/, or real secret values.
  • Terraform/bootstrap/CI secrets stay in Gitea Actions secrets.
  • Runtime cluster secrets are sourced from Doppler through External Secrets.
  • This repo does not manage Proxmox/LAN firewalls or public ingress.

License

MIT
