
Rolling Out A New Home Datacenter

Introduction

The modern home datacenter represents the ultimate proving ground for infrastructure engineers. When a Reddit user recently shared specs for their 11-node EPYC-powered homelab - complete with 100Gbe networking and custom UPS infrastructure - it sparked both admiration and curiosity. This 12kW beast (costing more than some cars) exemplifies how advanced self-hosted infrastructure has become.

For DevOps professionals and system administrators, home datacenters serve multiple critical functions:

  1. Risk-Free Experimentation: Test cutting-edge technologies without enterprise constraints
  2. Career Development: Master infrastructure automation at scale
  3. Cost Optimization: Replace cloud expenses with owned hardware
  4. Specialized Workloads: Run high-performance computing tasks impractical in shared environments

This guide dissects the complete lifecycle of deploying professional-grade infrastructure in residential environments. We’ll cover:

  • Hardware selection balancing performance and power efficiency
  • Enterprise-grade virtualization and orchestration
  • Network architecture for multi-tenant workloads
  • Power and cooling considerations for dense deployments
  • Operational practices borrowed from hyperscale environments

By the end, you’ll understand how to design, deploy, and maintain infrastructure that rivals commercial offerings - all within your own four walls.

Understanding Home Datacenter Fundamentals

What Constitutes a Modern Home Datacenter?

Unlike traditional “homelabs” consisting of repurposed consumer hardware, contemporary home datacenters implement enterprise patterns at reduced scale:

Core Components:

  • Compute Nodes: EPYC/Ryzen or Xeon Scalable processors with ECC RAM
  • Storage: NVMe-backed Ceph clusters or ZFS arrays
  • Networking: 25/100Gbe spine-leaf topologies
  • Power: Dual-conversion UPS with generator backup

Functional Requirements:

  • Hyperconverged infrastructure capabilities
  • Infrastructure-as-Code (IaC) management
  • Observability pipeline with metrics/logs/tracing
  • Automated recovery from hardware failures

Evolution of Residential Infrastructure

The home datacenter movement has progressed through distinct phases:

Era          | Typical Hardware                | Key Enablers
2000-2010    | Decommissioned enterprise gear  | VMware ESXi Free, Proxmox VE
2010-2018    | Custom whitebox servers         | Kubernetes, OpenStack
2018-Present | ARM SBC clusters + HEDT systems | Microservers, 100Gbe NICs

This evolution parallels cloud infrastructure development, enabling individuals to implement patterns like:

  • GitOps: ArgoCD managing cluster state (see the manifest sketch after this list)
  • Service Meshes: Istio/Linkerd for internal traffic
  • HPC Workloads: MPI clusters for scientific computing
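
For the GitOps item, a minimal Argo CD Application manifest gives a sense of what "managing cluster state" looks like in practice; the repository URL, path, and names below are placeholders rather than a reference to any specific setup:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: homelab-core          # placeholder application name
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/homelab-gitops.git   # placeholder repository
    targetRevision: main
    path: clusters/home
  destination:
    server: https://kubernetes.default.svc
    namespace: default
  syncPolicy:
    automated:
      prune: true
      selfHeal: true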

Performance/Cost Analysis

The referenced Reddit configuration demonstrates professional-grade capabilities:

2x AMD EPYC 9B14 (96C/192T)
768GB DDR5 @ 4800MHz
4x 4TB U.2 NVMe
2x 100Gbe ConnectX-4

Performance Considerations:

  • Throughput: 200Gbe per node enables NVMe-oF/RDMA workloads
  • Compute Density: 2,112 vCPUs per rack (11 nodes × 192 threads)
  • Storage: 176TB raw NVMe (11 × 16TB) with 1M+ IOPS capability

Cost Drivers:

  • DDR5 RDIMMs at ~$800 per 64GB module
  • Enterprise U.2 SSDs at ~$1,500 per 4TB
  • 100Gbe NICs at ~$800/port

Use Case Spectrum

Contrary to “Pi-Hole” jokes, serious home datacenters typically host:

  1. Hyperconverged Infrastructure
    Proxmox/Ceph clusters providing VM + block storage services

  2. Machine Learning
    Kubeflow pipelines with GPU/TPU acceleration

  3. Media Processing
    FFmpeg transcoding farms for 8K video (see the example after this list)

  4. Development Environments
    Ephemeral Kubernetes namespaces per developer

  5. Security Labs
    Network intrusion detection with Suricata/Zeek
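
As an example of the media-processing item above, a single GPU-accelerated transcode job might look like the following; the file names are placeholders, and hevc_nvenc assumes an NVIDIA GPU with NVENC support:

# Transcode an 8K source to HEVC with NVENC hardware encoding (paths are placeholders)
ffmpeg -hwaccel cuda -i input-8k.mov \
  -c:v hevc_nvenc -preset p5 -b:v 40M \
  -c:a copy output-8k-hevc.mkv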

Prerequisites

Hardware Requirements

For a production-equivalent deployment:

Minimum Specifications:

  • CPU: AMD EPYC 7003/9004 or Intel Xeon Scalable (16C+)
  • RAM: 256GB ECC DDR4/DDR5 per node
  • Storage: 2x NVMe (OS) + 4-8x SSD/NVMe (data)
  • Networking: Dual 25Gbe+ NICs (SFP28/QSFP28)

Power Infrastructure:

  • UPS: Double-conversion >10kVA (e.g., Eaton 9PX)
  • PDU: Metered/switchable 30A+
  • Circuit: Dedicated 240V L6-30R

Cooling:

  • Per-rack heat load calculation: BTU/hr = Total Watts × 3.41
  • Example: 12kW × 3.41 = 40,920 BTU/hr (requires 3.5-ton AC)

Software Requirements

Base OS:

  • Proxmox VE 8.x
  • Rocky Linux 9.x
  • Ubuntu Server 22.04 LTS

Orchestration:

  • Kubernetes 1.28+ (with Cilium CNI)
  • HashiCorp Nomad 1.6+

Management:

  • Terraform 1.5+ with Libvirt/Proxmox provider
  • Ansible Core 2.14+

Network Planning

VLAN Architecture:

VLAN 10: Management (SSH/API)  
VLAN 20: Storage (iSCSI/Ceph)  
VLAN 30: VM Data  
VLAN 40: DMZ Services  

Firewall Rules:

  • Default deny all
  • Whitelist intra-cluster communication
  • Rate limit public services
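
A minimal nftables sketch of this policy looks roughly as follows; the subnets and ports are examples tied to the VLAN plan above, not a drop-in ruleset:

# /etc/nftables.conf (sketch; subnets and ports are examples)
table inet filter {
  chain input {
    type filter hook input priority 0; policy drop;
    ct state established,related accept
    iifname "lo" accept
    # Whitelist intra-cluster traffic from the management and storage VLANs
    ip saddr { 192.168.10.0/24, 192.168.20.0/24 } accept
    # Rate-limit new connections to public-facing DMZ services
    tcp dport { 80, 443 } ct state new limit rate 50/second accept
  }
}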

Pre-Installation Checklist

  1. Validate hardware compatibility (PCIe bifurcation, DIMM population)
  2. Update all BMC/IPMI firmware
  3. Burn-in test components (72h memtest86+, fio storage tests; see the fio example after this list)
  4. Document MAC addresses for DHCP reservations
  5. Configure switch ports (MTU 9000, LACP)
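
For the burn-in step, a representative fio run against a raw data device might look like this; the device path and runtime are examples, and writing to a raw device destroys any data on it:

# 1-hour random read/write burn-in against a raw NVMe device (DESTRUCTIVE to /dev/nvme0n1)
fio --name=burnin --filename=/dev/nvme0n1 \
  --rw=randrw --bs=4k --iodepth=64 --numjobs=4 \
  --direct=1 --time_based --runtime=3600 --group_reporting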

Installation & Configuration

Bare Metal Provisioning

PXE Boot Infrastructure:

# Configure dnsmasq for PXE
dhcp-range=192.168.1.50,192.168.1.150,12h
dhcp-boot=pxelinux.0,pxeserver,192.168.1.10
enable-tftp
tftp-root=/var/lib/tftpboot

IPMI Configuration:

# Set BMC credentials on Supermicro nodes
ipmitool -I lanplus -H $BMC_IP -U ADMIN -P ADMIN user set name 2 admin
ipmitool -I lanplus -H $BMC_IP -U ADMIN -P ADMIN user set password 2 $STRONG_PASSWORD
ipmitool -I lanplus -H $BMC_IP -U admin -P $STRONG_PASSWORD sol payload enable 1 2

Hypervisor Deployment

Proxmox VE Installation:

# Download ISO
wget https://enterprise.proxmox.com/iso/proxmox-ve_8.0-2.iso

# Create ZFS RAID10
zpool create -f -o ashift=12 tank \
  mirror /dev/disk/by-id/nvme-Samsung_SSD_1 \
         /dev/disk/by-id/nvme-Samsung_SSD_2 \
  mirror /dev/disk/by-id/nvme-Samsung_SSD_3 \
         /dev/disk/by-id/nvme-Samsung_SSD_4

Kernel Parameter Tuning:

# /etc/kernel/cmdline (must stay on a single line)
root=ZFS=rpool/ROOT/pve-1 boot=zfs intel_iommu=on iommu=pt cgroup_enable=memory swapaccount=1 mitigations=off
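
On a ZFS-root Proxmox install booted via systemd-boot, edits to /etc/kernel/cmdline only take effect after the boot entries are refreshed; a quick way to apply them:

# Sync the updated command line to the boot partitions, then reboot
proxmox-boot-tool refresh
reboot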

Cluster Orchestration

Kubernetes Bootstrap with kubeadm:

# Install prerequisites
apt-get install -y apt-transport-https ca-certificates curl gpg
mkdir -p /etc/apt/keyrings

# Add the pkgs.k8s.io repository (the legacy apt.kubernetes.io repo is deprecated and frozen)
curl -fsSL https://pkgs.k8s.io/core:/stable:/v1.28/deb/Release.key | gpg --dearmor -o /etc/apt/keyrings/kubernetes-apt-keyring.gpg
echo 'deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/v1.28/deb/ /' > /etc/apt/sources.list.d/kubernetes.list

# Install components
apt-get update && apt-get install -y kubelet=1.28.5-1.1 kubeadm=1.28.5-1.1 kubectl=1.28.5-1.1

# Initialize control plane
kubeadm init --pod-network-cidr=10.244.0.0/16 \
  --control-plane-endpoint=cluster.home.datacenter:6443 \
  --upload-certs

Cilium CNI Configuration:

apiVersion: cilium.io/v2
kind: CiliumClusterwideNetworkPolicy
metadata:
  name: clustermesh
spec:
  endpointSelector: {}
  egress:
  - toEntities:
    - cluster
  - toEndpoints:
    - matchLabels:
        k8s:io.kubernetes.pod.namespace: kube-system
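
The policy above assumes Cilium is already running as the cluster CNI. One way to get there is the cilium CLI; the version and values here are illustrative rather than prescriptive:

# Install Cilium via its CLI (version and values are examples; match them to your cluster)
cilium install --version 1.14.5 --set ipam.mode=kubernetes

# Wait for the agent and operator to report ready
cilium status --wait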

Network Fabric Implementation

FRRouting BGP Configuration:

! /etc/frr/frr.conf
router bgp 64512
 bgp router-id 192.168.100.1
 neighbor SPINE peer-group
 neighbor SPINE remote-as 64512
 neighbor 192.168.100.2 peer-group SPINE
 neighbor 192.168.100.3 peer-group SPINE
 !
 address-family ipv4 unicast
  network 10.10.0.0/24
 exit-address-family

VXLAN Overlay:

# Create VXLAN interface
ip link add vxlan100 type vxlan \
  id 100 \
  local 192.168.100.1 \
  dev bond0 \
  dstport 4789

# Add to bridge and bring both interfaces up
brctl addbr br-vxlan100
brctl addif br-vxlan100 vxlan100
ip link set vxlan100 up
ip link set br-vxlan100 up

Optimization & Tuning

Hardware-Specific Tuning

AMD EPYC Power Management:

# Set performance governor
cpupower frequency-set -g performance

# Disable C-states
for i in $(ls -d /sys/devices/system/cpu/cpu*/cpuidle/state*); do 
  echo 1 > $i/disable
done

NVMe Optimization:

# Set I/O scheduler
echo none > /sys/block/nvme0n1/queue/scheduler

# Raise the arbitration burst (NVMe feature 0x01) to its maximum
nvme set-feature /dev/nvme0 -f 1 -v 0x00ff

Kubernetes Performance

Pod Density Optimization:

# kubelet-config.yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
maxPods: 500
kubeAPIQPS: 100
kubeAPIBurst: 200
serializeImagePulls: false

Container Runtime Tuning:

# /etc/containerd/config.toml
[plugins."io.containerd.grpc.v1.cri".containerd]
  snapshotter = "stargz"
  disable_snapshot_annotations = false

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
  runtime_type = "io.containerd.runc.v2"
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
    SystemdCgroup = true

Security Hardening

SSH Bastion Configuration:

# /etc/ssh/sshd_config.d/99-hardened.conf
PermitRootLogin prohibit-password
PasswordAuthentication no
AllowAgentForwarding no
X11Forwarding no
MaxAuthTries 3
MaxSessions 2
ClientAliveInterval 300
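
Before reloading sshd with a hardened drop-in like this, it is worth a syntax check so a typo does not lock out remote access; a minimal example (the service is named ssh on Debian/Ubuntu and sshd on RHEL-family systems):

# Validate the configuration, then reload only if the test passes
sshd -t && systemctl reload sshd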

Pod Security Admission:

PodSecurityPolicy was removed in Kubernetes 1.25, so on a 1.28 cluster the built-in Pod Security Standards are enforced with namespace labels instead:

apiVersion: v1
kind: Namespace
metadata:
  name: workloads   # namespace name is an example
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/enforce-version: v1.28
    pod-security.kubernetes.io/warn: restricted

Operational Management

Infrastructure-as-Code Workflow

Terraform Proxmox Provider:

resource "proxmox_vm_qemu" "k8s_worker" {
  count       = 11
  name        = "worker-${count.index}"
  target_node = "proxmox01"

  clone = "ubuntu2204-template"

  cores   = 48
  sockets = 2
  memory  = 262144

  network {
    model  = "virtio"
    bridge = "vmbr0"
  }

  disk {
    storage = "nvme-pool"
    type    = "scsi"
    size    = "4T"
  }
}

Ansible Hardware Inventory:

# inventory/hardware.yml
all:
  children:
    epyc_nodes:
      hosts:
        node01:
          bmc_ip: 192.168.100.101
        node02:
          bmc_ip: 192.168.100.102
    storage_nodes:
      hosts:
        stor01:
          jbod_count: 2
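
As a sketch of how the bmc_ip variables can be consumed, the hypothetical playbook below power-cycles a node through its BMC using ipmitool from the control host; the playbook name and the bmc_password vault variable are assumptions:

# power-cycle.yml (sketch; assumes ipmitool on the control node and bmc_password in Ansible Vault)
- hosts: epyc_nodes
  gather_facts: false
  connection: local
  tasks:
    - name: Power-cycle the node via its BMC
      ansible.builtin.command: >
        ipmitool -I lanplus -H {{ bmc_ip }}
        -U admin -P {{ bmc_password }}
        chassis power cycle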

Monitoring Stack

Prometheus Node Exporter:

# prometheus-node-exporter-daemonset.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
spec:
  selector:
    matchLabels:
      app: node-exporter
  template:
    metadata:
      labels:
        app: node-exporter
    spec:
      containers:
      - name: node-exporter
        image: quay.io/prometheus/node-exporter:v1.7.0
        args:
        - --web.listen-address=0.0.0.0:9100
        - --collector.textfile.directory=/var/lib/node_exporter
        - --collector.netdev.device-exclude=^(lo|veth.*)$
        - --collector.nvme
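
For Prometheus to actually scrape these exporters, a node-role service-discovery job is a common pattern; this is a sketch, with the job name and port assumed to match the DaemonSet above:

# prometheus.yml (sketch): scrape node_exporter on every Kubernetes node at port 9100
scrape_configs:
  - job_name: node-exporter
    kubernetes_sd_configs:
      - role: node
    relabel_configs:
      - source_labels: [__address__]
        regex: '(.*):\d+'
        target_label: __address__
        replacement: '${1}:9100'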

Grafana Dashboard Panel Query:

{
  "interval": "30s",
  "queries": [
    {
      "refId": "A",
      "expr": "sum(rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) by (instance)",
      "legendFormat": ""
    }
  ]
}

Backup Strategy

Proxmox Backup Server:

# Create a one-off backup of all guests to a Proxmox Backup Server datastore
# (the storage name "pbs" is an example; recurring jobs are defined under Datacenter -> Backup)
vzdump --all 1 --storage pbs --mode snapshot --mailnotification failure

This post is licensed under CC BY 4.0 by the author.