Deploying a Highly Available Kubernetes Cluster on Proxmox with Terraform and Talos OS

A highly available Kubernetes cluster in a homelab setup creates opportunities to test distributed systems, automation, and failure recovery under real-world conditions. This guide walks through one approach to building such a cluster using Proxmox for virtualization, Terraform for provisioning, and Talos OS for running the Kubernetes nodes.

This setup provides declarative infrastructure and immutable operating systems, eliminating the need for traditional Linux administration—no SSH, no shell, and no drifting configuration. The result is a consistent, secure, and maintainable cluster architecture suitable for long-term experimentation or light production use.

All the code used in this guide, including Terraform modules, Talos patches, and Ansible playbooks, is available in the following repository: GitHub - blog-resources. This makes it easier to follow along or adapt the configuration to your own environment.


Why Run Kubernetes on Proxmox

Proxmox provides a powerful abstraction over hardware. It allows centralized management of virtual machines with native support for snapshots, backups, and live migration. These capabilities simplify maintenance workflows—especially for multi-node Kubernetes clusters where availability and recovery matter.

Virtualization also decouples hardware from workload placement. A Kubernetes node can be relocated across physical hosts or replaced entirely without touching internal configuration. In heterogeneous environments or when gradually expanding a homelab, this flexibility reduces friction.

Networking in Proxmox supports complex topologies without needing to rewire hardware or manually configure Linux networking. Virtual machines can attach to multiple NICs, bridges, or tagged VLAN interfaces, making it easier to separate control-plane, storage, and ingress traffic at the infrastructure level.

This setup combines infrastructure abstraction from Proxmox, provisioning logic from Terraform, and cluster integrity from Talos—offering a clean and automated deployment path for a resilient Kubernetes environment.


What Talos OS Brings to Kubernetes

Talos is a Kubernetes-native operating system that emphasizes simplicity, security, and immutability. Unlike traditional Linux distributions, it ships with no SSH daemon, no interactive shell, and no package manager. All system interaction happens through a dedicated API using talosctl, which connects securely to a node's management endpoint.

This architecture brings several advantages. Talos is minimal by design and loads into memory as an ephemeral OS. Because it treats the OS and Kubernetes as a single unit, there is no need to manage these components independently. The entire lifecycle of a node—from configuration to upgrades—is declarative, API-driven, and reproducible.

Immutability eliminates configuration drift. Nodes do not accept runtime changes outside of the declared configuration files. This aligns well with infrastructure-as-code workflows and prevents untracked state changes. Talos supports a wide range of platforms—including bare metal, Proxmox, public clouds, and nested VM environments—making it versatile for consistent deployments.

Talos also integrates security into its core. With no exposed SSH ports or unnecessary services, its attack surface is smaller than that of general-purpose operating systems. All API traffic is mutually authenticated and encrypted with TLS, and system extensions allow adding only explicitly whitelisted functionality.

Operating and upgrading a Talos-based cluster becomes predictable and standardized. Configuration is stored in YAML and applied via the Talos API. Bootstrapping Kubernetes is built into the OS, eliminating the need for kubeadm or external provisioning tools.

There are trade-offs. Without SSH access, node-level debugging requires a different approach. Most troubleshooting must be done via talosctl, and more advanced inspection may involve kubectl debug to launch temporary containers. While Talos provides strong guarantees around automation and consistency, its documentation is still evolving and may not always cover edge cases.
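In practice, node-level inspection that would normally happen over SSH is approximated with talosctl and ephemeral debug containers. A minimal sketch (node name and IP below are placeholders):

# Inspect kernel and service logs through the Talos API
talosctl -n 192.168.70.191 dmesg
talosctl -n 192.168.70.191 logs kubelet

# Launch a temporary privileged pod on a node via Kubernetes
kubectl debug node/talos-worker-1 -it --image=busybox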

Talos fits well for those who value repeatability, minimal drift, and secure automation in Kubernetes operations. It replaces traditional systems management with modern, declarative tooling suited for fully automated infrastructure.


Creating a Proxmox Cluster

Proxmox clusters consist of multiple nodes joined under a shared management layer. Each node runs the Proxmox VE hypervisor and communicates over a trusted control network. The cluster configuration allows scheduling virtual machines across nodes, performing HA failover, and managing resources from a unified UI or API.

Before creating a cluster, each node must be installed with Proxmox VE and connected to a common management network. If the initial setup of Proxmox is still pending, including basic installation and configuration of a single node, see the previous guide: Building a Home Virtualization Server with Proxmox.

Once the base system is installed, one node initializes the cluster through the web interface. From the primary node, navigate to the Datacenter view, open the Cluster tab, and select Create Cluster. After creation, Proxmox provides a join command with embedded tokens and cluster secrets.

Each additional node joins the cluster by opening the same Cluster menu, selecting Join Cluster, and pasting the previously generated join string. All nodes must be in a clean state before joining—no existing VMs, templates, or custom storage pools configured. After the join process completes, the nodes appear under a single Datacenter view in the Proxmox UI, ready for workload distribution and shared management.
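The same steps can also be scripted from a node's shell with the pvecm tool; a sketch, assuming the cluster name and hostname below (both are examples):

# On the first node: create the cluster
pvecm create homelab

# On each additional node: join via the first node's address
pvecm add pve1.lab

# Verify quorum and membership
pvecm status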


Provisioning Kubernetes Nodes with Terraform

With the Proxmox cluster in place, the next step provisions the virtual machines that will serve as Kubernetes control plane and worker nodes. Terraform handles this workflow using the Telmate Proxmox provider, allowing infrastructure to be described and versioned as code.

Configuring Terraform Access to Proxmox

Proxmox requires an API token for Terraform to authenticate. The token is created under Datacenter → Permissions → API Tokens for the root@pam user, with Privilege Separation disabled so the token inherits the user's full permissions.

Store the credentials in a file named secrets.auto.tfvars:

pm_api_url = "https://pve1.lab:8006/api2/json"
pm_username = "root@pam"
pm_api_token_id = "terraform"
pm_api_token_secret = "your_api_token_secret"

Configure the provider in your main Terraform module:

terraform {
  required_providers {
    proxmox = {
      source = "Telmate/proxmox"
      version = "3.0.2-rc03"
    }
  }
}

provider "proxmox" {
  pm_api_url          = var.pm_api_url
  pm_user             = var.pm_username
  pm_api_token_id     = var.pm_api_token_id
  pm_api_token_secret = var.pm_api_token_secret
}
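
The four variables referenced by the provider must also be declared. A minimal variables.tf might look like the following, with the token secret marked sensitive so Terraform redacts it from plan output:

variable "pm_api_url" {
  type = string
}

variable "pm_username" {
  type = string
}

variable "pm_api_token_id" {
  type = string
}

variable "pm_api_token_secret" {
  type      = string
  sensitive = true
}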

Declaring Control Plane Node Configuration

Each Talos control plane node is described using structured variables. The configuration includes Proxmox host assignment, VM IDs, resource allocation, and network interface definitions.

variable "talos_control_configuration" {
  type = list(object({
    pm_node   = string
    vmid      = number
    vm_name   = string
    cpu_cores = number
    memory    = number
    disk_size = string
    networks  = list(object({
      id      = number
      macaddr = string
      tag     = number
    }))
  }))
}

An example list defines three nodes across different Proxmox hosts:

talos_control_configuration = [
  {
    pm_node   = "pve1"
    vmid      = 1110
    vm_name   = "talos-control-1"
    cpu_cores = 2
    memory    = 4096
    disk_size = "50G"
    networks = [
      { id = 0, macaddr = "BC:24:13:5F:39:D1", tag = 70 }
    ]
  },
  {
    pm_node   = "pve2"
    vmid      = 2110
    vm_name   = "talos-control-2"
    cpu_cores = 2
    memory    = 4096
    disk_size = "50G"
    networks = [
      { id = 0, macaddr = "BC:24:13:5F:39:D2", tag = 70 }
    ]
  },
  {
    pm_node   = "pve3"
    vmid      = 3110
    vm_name   = "talos-control-3"
    cpu_cores = 2
    memory    = 4096
    disk_size = "50G"
    networks = [
      { id = 0, macaddr = "BC:24:13:5F:39:D3", tag = 70 }
    ]
  }
]

Creating Talos VMs

The virtual machines are provisioned with a single proxmox_vm_qemu resource that loops through the list of control node definitions:

resource "proxmox_vm_qemu" "talos_control" {
  for_each = { for index, config in var.talos_control_configuration : config.vmid => config }

  target_node = each.value.pm_node
  vmid        = each.value.vmid
  name        = each.value.vm_name
  description = "Talos control plane node ${each.value.vmid}"
  agent       = 1
  onboot      = true
  vm_state    = "running"

  memory = each.value.memory
  cpu {
    cores   = each.value.cpu_cores
    sockets = 1
    type    = "host"
  }

  ipconfig0 = "ip=dhcp"
  skip_ipv6 = true

  dynamic "network" {
    for_each = each.value.networks
    content {
      id      = network.value.id
      model   = "virtio"
      bridge  = "vmbr0"
      macaddr = network.value.macaddr
      tag     = network.value.tag
    }
  }

  scsihw = "virtio-scsi-single"
  boot   = "order=scsi0;ide2"
  disks {
    scsi {
      scsi0 {
        disk {
          storage = "local-lvm"
          size    = each.value.disk_size
        }
      }
    }
    ide {
      ide2 {
        cdrom {
          iso = var.talos_iso_file
        }
      }
    }
  }

  lifecycle {
    ignore_changes = [disk, vm_state]
  }

  tags = "kubernetes,control"
}

Worker nodes can be declared in the same way, with their own variable list and a kubernetes,worker tag. After defining the resources, the following commands apply the configuration:

pushd terraform
terraform plan -var-file=values.tfvars -out=create-plan
terraform apply create-plan
popd
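
If the apply succeeds, the new VMs appear in Terraform state and boot from the Talos ISO on their assigned hosts. A quick check (resource names match the examples above):

terraform state list | grep proxmox_vm_qemu

# On each Proxmox node, list the locally running VMs
qm list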

Installing Talos and Bootstrapping Kubernetes with Ansible

Talos replaces the traditional Linux node installation and configuration workflow with an API-driven model. The operating system is deployed by applying declarative configuration files through talosctl, and the entire lifecycle—including Kubernetes bootstrap, OS upgrades, and optional extensions—is managed without direct access to the host.

The playbook is not idempotent and should be executed only once, when the cluster is first created.

A set of variables defines the layout of the cluster, including the IPs of control and worker nodes, the cluster name, and where configuration files are rendered.

vars:
  cluster_name: "test-cluster"
  output_directory: "./rendered"
  talos_configuration_location: "rendered/talosconfig"
  talos_control_ip: "192.168.70.191"
  talos_worker_ips:
    - "192.168.70.194"
    - "192.168.70.195"
    - "192.168.70.196"
  talos_control_ips:
    - "192.168.70.191"
    - "192.168.70.192"
    - "192.168.70.193"

Cluster Secrets Generation

Talos clusters rely on a secrets bundle to secure internal communication and encrypt sensitive resources. These secrets are generated once using talosctl gen secrets. The playbook checks for the presence of secrets.yaml, and if it doesn’t exist, it creates it:

- name: Check if secrets file exists
  stat:
    path: "secrets.yaml"
  register: result

- name: Generate cluster secrets
  command: talosctl gen secrets
  when: not result.stat.exists

The generated secrets.yaml holds the cluster's root certificates and keys; it must be kept safe and unchanged for the lifetime of the cluster.
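
As an aside, the same guard can be expressed more compactly with the command module's creates argument, which skips the task when the file already exists:

- name: Generate cluster secrets
  command: talosctl gen secrets
  args:
    creates: secrets.yaml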


Generating Talos Configuration with Patches

The Talos cluster is configured using base configuration files created with talosctl gen config, which produces control plane and worker node definitions.

Two patch files are applied to customize the generated YAML:

common.yaml contains shared configuration for all nodes. It disables predictable interface naming, enables server certificate rotation, and installs the Kubelet Serving Certificate Approver:

# common.yaml
---
machine:
  install:
    extraKernelArgs:
      - net.ifnames=0
  kubelet:
    extraArgs:
      rotate-server-certificates: true

cluster:
  extraManifests:
    - https://raw.githubusercontent.com/alex1989hu/kubelet-serving-cert-approver/main/deploy/standalone-install.yaml
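
Once the cluster is running, the effect of these settings is visible as kubelet serving CertificateSigningRequests, which the approver should move to Approved,Issued:

kubectl get csr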

proxmox-machine.yaml ensures compatibility with Proxmox by explicitly specifying the disk device:

# proxmox-machine.yaml
---
machine:
  install:
    disk: /dev/sda
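
If the disk name is uncertain, it can be checked against a booted but unconfigured node, which answers API queries in maintenance mode (IP is an example):

talosctl get disks -n 192.168.70.191 --insecure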

The playbook applies these patches in order. First, the base files are created:

- name: Generate base configuration files
  command: >
    talosctl gen config --force {{ cluster_name }}
    "https://{{ talos_control_ip }}:6443"
    --with-secrets secrets.yaml
    --config-patch @../talos/patches/common.yaml
    --output {{ output_directory }}/

Then they are patched individually for control and worker nodes:

- name: Generate proxmox control configuration
  command: >
    talosctl machineconfig patch {{ output_directory }}/controlplane.yaml
    --patch @../talos/patches/proxmox-machine.yaml
    -o {{ output_directory }}/proxmox-control.yaml

- name: Generate proxmox worker configuration
  command: >
    talosctl machineconfig patch {{ output_directory }}/worker.yaml
    --patch @../talos/patches/proxmox-machine.yaml
    -o {{ output_directory }}/proxmox-worker.yaml

Applying Configuration to Nodes

Before configuration can be applied, the local talosctl context must point to the control plane endpoint:

- name: Update talos configuration file
  shell: >
    talosctl config --talosconfig {{ talos_configuration_location }}
    endpoint {{ talos_control_ip }}

Configuration is applied to each node using talosctl apply-config. The --insecure flag is required at this stage because the nodes are still in maintenance mode and do not yet hold the cluster's certificates. Once the configuration lands, Talos installs the OS image to disk and reboots automatically.

- name: Apply configuration file to control nodes
  shell: >
    talosctl apply-config --talosconfig {{ talos_configuration_location }}
    -f {{ output_directory }}/proxmox-control.yaml -n {{ item }} --insecure
  with_items: "{{ talos_control_ips }}"

- name: Apply configuration file to worker nodes
  shell: >
    talosctl apply-config --talosconfig {{ talos_configuration_location }}
    -f {{ output_directory }}/proxmox-worker.yaml -n {{ item }} --insecure
  with_items: "{{ talos_worker_ips }}"

- name: Wait until configuration files are applied
  pause:
    minutes: 5

Bootstrapping the Kubernetes Control Plane

The first control plane node must initialize the etcd datastore and bring up the Kubernetes control plane. This operation needs to be performed exactly once in the cluster’s lifecycle.

- name: Bootstrap etcd server node
  shell: >
    talosctl bootstrap --talosconfig {{ talos_configuration_location }}
    -n {{ talos_control_ip }} -e {{ talos_control_ip }}

- name: Wait until bootstrap is completed
  pause:
    minutes: 5

After bootstrapping, the control plane becomes active and accessible over the configured endpoint.
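
Rather than relying on fixed pauses, readiness can also be polled explicitly; talosctl health blocks until the cluster reports healthy or the timeout expires:

talosctl health --talosconfig rendered/talosconfig \
  -n 192.168.70.191 -e 192.168.70.191 --wait-timeout 10m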


Installing System Extensions via Talos Factory

To enable QEMU guest integration, a Talos image must be built with the required system extension. This is defined in the customization.yaml file:

# customization.yaml
customization:
  systemExtensions:
    officialExtensions:
      - siderolabs/qemu-guest-agent

The playbook sends this definition to the Talos Factory API and retrieves a custom image ID. Note the jq -r flag, which strips the surrounding JSON quotes so the ID can be embedded in an image reference:

- name: Get extensions image id
  shell: >
    curl -X POST --data-binary @../talos/customization.yaml
    https://factory.talos.dev/schematics | jq -r '.id'
  register: image_id

An installer image reference is then constructed from the schematic ID and the Talos version:

- name: Set image url
  set_fact:
    image_url: "factory.talos.dev/nocloud-installer/{{ image_id.stdout }}:v1.10.5"

Then the image is applied to each control and worker node using talosctl upgrade. The command blocks until a node has rebooted into the new image, so no artificial waits are needed.

- name: Upgrade image for control nodes
  shell: >
    talosctl upgrade --talosconfig {{ talos_configuration_location }}
    -n {{ item }} -i {{ image_url }} -e {{ talos_control_ip }} --preserve
  with_items: "{{ talos_control_ips }}"

- name: Upgrade image for worker nodes
  shell: >
    talosctl upgrade --talosconfig {{ talos_configuration_location }}
    -n {{ item }} -i {{ image_url }} -e {{ talos_control_ip }} --preserve
  with_items: "{{ talos_worker_ips }}"

Retrieving Kubernetes Kubeconfig

Once Talos has bootstrapped the cluster and all nodes are healthy, the kubeconfig can be pulled from the control plane:

- name: Retrieve kubectl file
  shell: >
    talosctl kubeconfig --talosconfig {{ talos_configuration_location }}
    -n {{ talos_control_ip }} -e {{ talos_control_ip }} .

Finally, execute the playbook:

pushd ansible
ansible-playbook install.yaml
popd

The retrieved kubeconfig grants administrative access to the Kubernetes cluster via kubectl. Verifying the cluster is straightforward:

export KUBECONFIG=`pwd`/ansible/kubeconfig
kubectl get nodes
kubectl get pods -n kube-system

Final Thoughts

This architecture combines the flexibility of virtualized infrastructure with the safety and consistency of an immutable operating system. Each layer—Proxmox, Terraform, and Talos—plays a distinct role in building a repeatable, declarative, and resilient Kubernetes cluster.

Talos enforces a disciplined operational model. It removes many debugging conveniences and replaces them with structured APIs and tools. In return, it offers reproducibility, minimal drift, and deep integration with Kubernetes as a system.

This workflow creates a stable foundation for testing high-availability clusters, experimenting with upgrades, or hosting lightweight production workloads in a homelab.

All code used in this guide is available at:

GitHub - blog-resources.

Happy engineering!