My 7-Node Proxmox Cluster Pfannkuchen: 300 Threads, 33TB RAM, and A Whole Lot Of Learning

Introduction

Building a homelab that scales to enterprise levels presents unique challenges that few virtualization enthusiasts ever encounter. When I decided to create “Pfannkuchen” – my 7-node Proxmox cluster with 300 threads and 33TB of RAM – I thought I was simply expanding my home infrastructure. What I discovered was a journey through the complexities of distributed systems, resource management, and the practical limitations of consumer-grade networking.

This isn’t just another homelab story. Pfannkuchen represents the intersection of enterprise-grade hardware and the realities of home networking, where budget constraints meet ambitious technical goals. Throughout this comprehensive guide, I’ll share the architecture decisions, configuration challenges, and hard-earned lessons that transformed a collection of servers into a cohesive, high-performance virtualization platform.

Whether you’re planning your first homelab or looking to scale your existing infrastructure, the insights from this 300-thread behemoth will help you make informed decisions about hardware selection, network design, and cluster management. Let’s dive into the architecture that makes Pfannkuchen tick.

Understanding Proxmox Virtualization

Proxmox VE (Virtual Environment) is an open-source server virtualization platform that combines two powerful virtualization technologies: KVM for full virtualization and LXC for container-based virtualization. Unlike many hypervisors that focus solely on virtual machines, Proxmox provides a unified management interface for both VMs and containers, making it particularly well-suited for complex homelab environments.

The platform’s cluster capabilities allow multiple physical hosts to work together as a single system, providing high availability, live migration, and centralized management. Each node in a Proxmox cluster maintains a copy of the cluster configuration, ensuring that the system remains operational even if individual nodes fail. This redundancy is crucial for production environments but adds complexity to the initial setup.
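The redundancy rests on simple majority voting: the cluster stays operational (quorate) only while more than half of the votes are reachable. A quick sketch of the arithmetic for a cluster of this size:

```shell
# Quorum for an n-node cluster (one vote per node): floor(n/2) + 1
nodes=7
quorum=$(( nodes / 2 + 1 ))
echo "A ${nodes}-node cluster stays quorate with ${quorum} nodes up"
# Which also tells you how many simultaneous failures are survivable
tolerated=$(( nodes - quorum ))
echo "Tolerated simultaneous node failures: ${tolerated}"
```

For Pfannkuchen's seven nodes that works out to a quorum of four, so up to three nodes can fail before the cluster freezes writes to the shared configuration.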

Proxmox’s web-based interface, built on the same technology stack as many enterprise management tools, provides intuitive access to complex virtualization features. The platform supports ZFS, Ceph, and various storage backends, giving administrators flexibility in how they architect their storage solutions. For Pfannkuchen, this meant I could leverage both local storage on some nodes and network-attached storage on others, creating a heterogeneous but functional cluster.

Hardware Architecture Overview

The Pfannkuchen cluster consists of seven distinct nodes, each serving a specific purpose in the overall architecture. The hardware selection was driven by availability, budget constraints, and the need for diverse capabilities across the cluster.

Node 1 and Node 3 form the backbone of the cluster, each equipped with dual Intel Xeon Gold 6226 processors providing 48 threads per node. With 768GB of RAM each, these nodes handle the most memory-intensive workloads. Both connect to a Dell PowerStore 1000T SAN via 10GbE networking, providing shared storage for critical VMs and containers.

Node 2 serves as a bridge between the high-performance nodes and more modest hardware. The Intel i7-14700 with 28 threads and 96GB of RAM connects to a Synology NAS via NFS, demonstrating how even mid-range hardware can contribute meaningfully to a heterogeneous cluster.

The remaining four nodes (Nodes 4-7) complete the cluster with varying specifications, creating a resource pool that can handle everything from lightweight services to demanding applications. This diversity allows for optimal workload distribution and provides redundancy across different hardware configurations.

Prerequisites and Planning

Before embarking on a cluster of this scale, several critical prerequisites must be addressed. The network infrastructure alone requires careful planning, as 300 threads and 33TB of RAM generate significant network traffic during normal operations.

Network Requirements:

  • 10GbE backbone between high-performance nodes
  • 1GbE connectivity for less demanding nodes
  • Dedicated management network for cluster communication
  • Proper VLAN segmentation for security
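One way to implement the VLAN segmentation above is a VLAN-aware bridge in /etc/network/interfaces; a sketch, where the interface name, addresses, and VLAN IDs 10 and 20 are purely illustrative:

```
# /etc/network/interfaces -- VLAN-aware bridge (illustrative values)
auto vmbr0
iface vmbr0 inet static
    address 192.168.1.10/24
    gateway 192.168.1.1
    bridge-ports eno1
    bridge-stp off
    bridge-fd 0
    bridge-vlan-aware yes
    bridge-vids 10 20
```

With this in place, individual VMs get a VLAN tag on their virtual NIC and the bridge handles the tagging.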

Storage Considerations:

  • Shared storage for VM disk images and ISO files
  • Local storage for node-specific configurations
  • Backup storage with sufficient capacity
  • Network-attached storage for less critical data

Power and Cooling:

  • Sufficient power distribution across circuits
  • UPS protection for all critical nodes
  • Adequate cooling for continuous operation
  • Monitoring for temperature and power consumption

Software Dependencies:

  • Proxmox VE 8.x or later
  • Compatible network switches and routers
  • Time synchronization (NTP) across all nodes
  • DNS resolution for cluster communication
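Time synchronization deserves special attention: corosync, which carries cluster communication, misbehaves when node clocks drift apart. A sketch of a drift check that parses `chronyc tracking`-style output (a captured sample line stands in for live output here, and the 50 ms threshold is my own rule of thumb):

```shell
# Parse the system-time offset from chrony's tracking report and flag
# nodes that have drifted too far. Sample output for illustration:
sample='System time     : 0.000123456 seconds fast of NTP time'
offset=$(echo "$sample" | awk -F: '{print $2}' | awk '{print $1}')
# Flag anything over 50 ms -- large drift breaks cluster communication
awk -v off="$offset" 'BEGIN { exit (off < 0.05) ? 0 : 1 }' \
    && echo "clock OK (${offset}s offset)" \
    || echo "WARNING: clock drift ${offset}s"
```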

Installation and Initial Setup

The installation process for a multi-node Proxmox cluster requires careful attention to detail and a systematic approach. Each node must be prepared individually before they can be joined into a cohesive cluster.

Node Preparation

Begin by installing Proxmox VE on each physical server. The installation process is straightforward but requires attention to storage configuration. For nodes with local storage, ensure the root filesystem is properly partitioned and has sufficient space for system files and VM images.

# Initial network configuration on each node
ip addr add 192.168.1.10/24 dev eno1
ip link set eno1 up

After the initial installation, update the system packages and install any necessary drivers for your specific hardware:

# Update Proxmox and install necessary packages
apt update && apt upgrade -y
apt install open-iscsi lvm2 bridge-utils

Cluster Formation

The cluster formation process requires careful sequencing so that all nodes can communicate properly. Start by creating the cluster on the first node (Proxmox has no permanent master; once joined, any node can manage the cluster):

# Initialize the cluster on the first node
pvecm create pfannkuchen-cluster

Add subsequent nodes to the cluster, ensuring network connectivity between all members:

# Run on each joining node, pointing at an existing cluster member
pvecm add 192.168.1.10

Storage Configuration

Configure shared storage for VM images and ISO files. For the Dell PowerStore integration:

# Set the iSCSI initiator name this node presents to the SAN
echo "InitiatorName=iqn.2005-10.org.debian:01:pfannkuchen" > /etc/iscsi/initiatorname.iscsi
systemctl restart open-iscsi

For NFS storage from the Synology NAS:

# Mount NFS share for storage
mkdir -p /mnt/nfs-storage
mount -t nfs 192.168.1.20:/volume1/proxmox /mnt/nfs-storage
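Rather than mounting by hand, Proxmox can manage the NFS share itself, so it is registered cluster-wide and remounted after reboots. A sketch, assuming the same server and export as above (the storage ID and content types are illustrative):

```shell
# Register the Synology export as a Proxmox-managed NFS storage
pvesm add nfs synology-nfs --server 192.168.1.20 \
    --export /volume1/proxmox --content images,iso
```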

Advanced Configuration and Optimization

With the basic cluster operational, the focus shifts to optimization and advanced configuration. This phase involves fine-tuning performance, implementing security measures, and establishing monitoring capabilities.

Network Optimization

The network configuration requires careful planning to handle the cluster’s traffic patterns. Implement VLANs for different traffic types:

# /etc/network/interfaces example
auto vmbr0
iface vmbr0 inet static
    address 192.168.1.10/24
    gateway 192.168.1.1
    bridge_ports eno1
    bridge_stp off
    bridge_fd 0

Configure separate networks for management, VM traffic, and storage:

# Create an additional bridge for traffic segregation
brctl addbr vmbr1
ip addr add 10.0.0.10/24 dev vmbr1
ip link set vmbr1 up

Storage Performance Tuning

Optimize storage performance based on the underlying hardware:

# Carve out an SSD-backed cache pool and attach it to the data LV
lvcreate -L 100G -n cachepool pfannkuchen-vg
lvconvert --type cache-pool pfannkuchen-vg/cachepool
lvconvert --type cache --cachepool pfannkuchen-vg/cachepool pfannkuchen-vg/root

Implement ZFS compression and deduplication where appropriate:

# Enable inline compression (a cheap win on most workloads)
zfs set compression=lz4 storage-pool
# Deduplication is RAM-hungry (roughly 5 GB per TB of unique data);
# enable it only on datasets where duplicate blocks are expected
zfs set dedup=on storage-pool

Resource Management

Organize guests into resource pools for management and permissions, and apply per-VM limits so no single guest can starve the rest (pool membership itself does not enforce resource limits):

# Create a resource pool and assign guests to it
pvesh create /pools -poolid production -comment "Production workloads"
pvesh set /pools/production -vms 100,101

Set up CPU pinning for performance-critical VMs:

# Give the VM four cores, cap it at four cores' worth of CPU time,
# and pin it to physical cores 0-3 (-affinity requires PVE 7.3+)
qm set 100 -cores 4 -cpulimit 4 -affinity 0-3
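To confirm that pinning actually took effect, you can read a process's allowed-CPU list straight out of /proc rather than trusting the configuration; a sketch, shown here against the current shell for illustration:

```shell
# Which CPUs may this process run on? Read its cpuset from /proc.
# (/proc/self refers to the inspecting process itself; substitute the
# KVM process PID of the pinned VM to check a guest.)
allowed=$(grep Cpus_allowed_list /proc/self/status | awk '{print $2}')
echo "This process is allowed on CPUs: ${allowed}"
```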

Daily Operations and Management

Operating a 7-node cluster requires established procedures for routine tasks and emergency situations. The complexity of Pfannkuchen demands both automation and careful manual oversight.

Routine Maintenance

Establish a maintenance schedule that minimizes disruption to running workloads:

# Live migration of VMs before maintenance
qm migrate 100 node2 --online

Regular health checks should be automated:

#!/bin/bash
# Health check: ping every node and confirm the cluster service is up
for node in node1 node2 node3 node4 node5 node6 node7; do
    ping -c 3 "$node" > /dev/null || echo "WARN: $node unreachable"
    ssh "$node" "systemctl is-active --quiet pve-cluster" \
        || echo "WARN: pve-cluster down on $node"
done
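The health check can go one step further and parse `pvecm status` for the quorate flag, which catches split-brain situations that a simple ping misses. A sketch, with a captured sample standing in for live output:

```shell
# Alert when the cluster reports it has lost quorum. In production,
# replace the sample with: status=$(pvecm status)
status='Quorum information
------------------
Nodes:            7
Quorate:          Yes'
if echo "$status" | grep -q '^Quorate:.*Yes'; then
    echo "cluster is quorate"
else
    echo "ALERT: cluster lost quorum" >&2
fi
```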

Backup Strategy

Implement a comprehensive backup strategy that accounts for the cluster’s scale:

#!/bin/bash
# Automated VM backup script (vzdump is Proxmox's backup tool; it
# names and timestamps the archives itself)
for vmid in $(qm list | awk 'NR>1 {print $1}'); do
    vzdump "$vmid" --dumpdir /mnt/backups --mode snapshot
done
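At this scale a retention policy matters as much as the backups themselves, or the backup volume fills within weeks. A minimal pruning sketch, assuming the /mnt/backups path from the script above and a 14-day window of my own choosing:

```shell
# Prune backup archives older than the retention window
BACKUP_DIR=/mnt/backups
RETENTION_DAYS=14
find "$BACKUP_DIR" -name '*.tar' -mtime +"$RETENTION_DAYS" -print -delete
```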

Configure backup storage with sufficient capacity:

# Register the backup directory as storage of content type "backup"
pvesm add dir backup-storage --path /mnt/backups --content backup

Monitoring and Alerting

Set up comprehensive monitoring to track cluster health and performance:

# Install and configure monitoring tools
apt install netdata
systemctl enable --now netdata

Configure Proxmox’s built-in monitoring:

# Check quorum state and cluster membership
pvecm status
# Adjust the expected vote count only when nodes are intentionally removed
pvecm expected 7

Troubleshooting and Common Issues

Even with careful planning, issues arise in complex systems. Understanding common problems and their solutions is essential for maintaining cluster stability.

Network Connectivity Issues

Network problems can affect cluster communication and VM performance:

# Diagnose network connectivity (10 probes, numeric output, AS lookup)
mtr -wzn -c 10 192.168.1.10

Check bridge configuration:

# Verify bridge status
brctl show
ip addr show vmbr0

Storage Performance Problems

Storage bottlenecks can severely impact VM performance:

# Monitor storage I/O
iostat -x 5
zpool iostat -v 5

Check for storage errors:

# Check storage health
smartctl -a /dev/sda
zpool status -v

High Availability Failures

HA failures require immediate attention:

# Check HA status
ha-manager status
ha-manager config

Review cluster logs:

# Check cluster logs
journalctl -u pve-cluster
grep -i error /var/log/syslog

Performance Optimization Techniques

Maximizing the performance of a 300-thread, 33TB RAM cluster requires understanding both hardware capabilities and software optimization techniques.

CPU Optimization

Fine-tune CPU allocation and scheduling:

# Set the performance governor on every core (a glob after ">" is an
# ambiguous redirect when it matches multiple files, so loop instead)
for gov in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
    echo performance > "$gov"
done

Configure CPU affinity for critical services:

# Pin each Proxmox daemon process to the first eight cores
# (taskset takes one PID at a time, pgrep may return several)
for pid in $(pgrep pvedaemon); do
    taskset -pc 0-7 "$pid"
done

Memory Management

Optimize memory usage across the cluster:

# Configure hugepages for better performance
sysctl vm.nr_hugepages=1024
echo "vm.nr_hugepages=1024" >> /etc/sysctl.conf
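How large should `vm.nr_hugepages` be? With the default 2 MiB hugepages on x86_64, the count is simply the guest memory you want backed, in MiB, divided by two. A sizing sketch (the 2 GiB example guest is illustrative):

```shell
# pages = guest memory (MiB) / hugepage size (MiB)
guest_mem_mib=2048          # e.g. one 2 GiB VM backed by hugepages
page_size_mib=2             # default x86_64 hugepage size
pages=$(( guest_mem_mib / page_size_mib ))
echo "vm.nr_hugepages=${pages}"
```

Note that hugepage memory is reserved up front and unavailable to ordinary processes, so size it to the VMs that actually use it.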

Monitor memory pressure:

# Check memory usage
free -h
grep -E "(MemTotal|MemFree|MemAvailable)" /proc/meminfo
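For automated monitoring, MemAvailable is the number worth alerting on, since it accounts for reclaimable caches. A sketch of a threshold check (the 1 GiB limit is my own arbitrary choice):

```shell
# Alert when available memory falls below a threshold, reading
# MemAvailable (in kB) straight from /proc/meminfo
avail_kb=$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo)
threshold_kb=$(( 1 * 1024 * 1024 ))   # 1 GiB, illustrative
if [ "$avail_kb" -lt "$threshold_kb" ]; then
    echo "WARNING: only ${avail_kb} kB available"
else
    echo "memory OK: ${avail_kb} kB available"
fi
```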

Storage Optimization

Implement storage tiering for optimal performance:

# Create storage tiers
lvcreate -L 200G -n ssd-tier pfannkuchen-vg
lvcreate -L 2T -n hdd-tier pfannkuchen-vg

Configure caching strategies:

# Set up a write-back dm-cache. The table takes a metadata device, a
# cache device, and the origin device, in that order; here a small
# additional LV (ssd-meta) holds the cache metadata.
dmsetup create cached-hdd --table \
    "0 $(blockdev --getsz /dev/pfannkuchen-vg/hdd-tier) cache \
    /dev/pfannkuchen-vg/ssd-meta /dev/pfannkuchen-vg/ssd-tier \
    /dev/pfannkuchen-vg/hdd-tier 512 1 writeback default 0"

Security Hardening

Security considerations become more complex with larger clusters. Implement comprehensive security measures to protect your infrastructure.

Network Security

Segment network traffic and implement firewall rules:

# Basic iptables policy: allow loopback and established flows first,
# then SSH (22), the Proxmox web UI (8006) and iSCSI (3260)
iptables -A INPUT -i lo -j ACCEPT
iptables -A INPUT -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT
iptables -A INPUT -p tcp --dport 22 -j ACCEPT
iptables -A INPUT -p tcp --dport 8006 -j ACCEPT
iptables -A INPUT -p tcp --dport 3260 -j ACCEPT
iptables -P INPUT DROP

Implement VLAN security:

# Create VLAN interfaces for isolation (vconfig is deprecated; use ip link)
ip link add link eno1 name eno1.10 type vlan id 10
ip link add link eno1 name eno1.20 type vlan id 20

Access Control

Implement role-based access control:

# Create custom roles
pvesh create /access/roles -roleid vm-admin -privs "VM.PowerMgmt VM.Console VM.ConfigOptions VM.ConfigCDROM VM.ConfigCloudinit VM.Audit"

Configure two-factor authentication:

# PAM-based 2FA for SSH logins (TOTP for the Proxmox web UI is
# configured separately, under Datacenter -> Permissions -> Two Factor)
apt install libpam-google-authenticator
google-authenticator --time-based --disallow-reuse --force

Audit Logging

Enable comprehensive audit logging:

# Configure auditd to watch the cluster configuration directory
apt install auditd
echo "-w /etc/pve -p wa" >> /etc/audit/rules.d/audit.rules
systemctl restart auditd

Conclusion

Building and operating Pfannkuchen, my 7-node Proxmox cluster with 300 threads and 33TB of RAM, has been an extraordinary learning experience. What began as an ambitious homelab project evolved into a comprehensive exploration of virtualization at scale, network architecture, and system administration best practices.

The journey taught me that successful cluster management requires more than just technical knowledge – it demands careful planning, continuous monitoring, and the willingness to adapt when things don’t go as expected. From network optimization to security hardening, every aspect of Pfannkuchen required thoughtful consideration and iterative improvement.

For those considering similar projects, remember that starting small and scaling gradually often proves more successful than attempting enterprise-level infrastructure from day one. The principles learned from managing Pfannkuchen – resource management, network design, and operational procedures – apply equally to smaller deployments, just with fewer nodes to manage.

The real value of this project lies not in the raw specifications but in the deep understanding of how distributed systems work together to provide reliable, high-performance virtualization services. Whether you’re building your first homelab or planning to scale your existing infrastructure, the lessons from Pfannkuchen will serve you well on your journey through the fascinating world of virtualization and cluster management.

For further learning, I recommend exploring the official Proxmox documentation, community forums, and experimenting with different configurations in your own environment. The world of virtualization is constantly evolving, and there’s always more to learn.

This post is licensed under CC BY 4.0 by the author.