She May Not Be Pretty But This Rack Saved My Business 150K This Year

1. INTRODUCTION

The cloud cost explosion is real. When your Kubernetes cluster grows to hundreds of nodes and your database handles millions of daily transactions, the cloud bill can quickly become your largest operational expense. This is the story of how a battle-tested homelab infrastructure - housed in an unassuming server rack - delivered 600+ production Kubernetes pods and handled intense database workloads while saving over $150,000 annually compared to equivalent cloud infrastructure.

For DevOps engineers and system administrators managing resource-intensive workloads, the cloud vs. on-premises decision isn’t binary. As demonstrated by real-world numbers from a 2-person operation:

  • AWS Cloud Cost: $180,000/year
  • Homelab OPEX: $12,000/year (excluding power and connectivity)
  • Hardware CAPEX: ~$30,000 (mostly in 2024)

Even after adding $1,500/month for electricity and $500/month for ISP service (roughly $24,000/year on top of the baseline OPEX), the savings remain compelling. This guide explores how modern self-hosted infrastructure, when properly architected, can deliver enterprise-grade reliability at a fraction of cloud costs while maintaining DevOps best practices.

You’ll learn:

  1. Hardware selection strategies for high-density Kubernetes workloads
  2. Kubernetes optimization techniques for bare metal
  3. Database performance tuning for write-heavy workloads
  4. Cost analysis methodologies for CAPEX/OPEX comparisons
  5. Failure domain management in self-hosted environments

2. UNDERSTANDING HOMELAB INFRASTRUCTURE FOR PRODUCTION WORKLOADS

What is Production-Grade Homelab Infrastructure?

A homelab transitions from hobby project to production infrastructure when it meets three criteria:

  1. Availability Requirements: 99.9%+ uptime (no more than ~8.8 hours of downtime per year) for business-critical services
  2. Performance Demands: Sustained throughput matching cloud equivalents
  3. Operational Rigor: Enterprise-grade monitoring, backups, and security

Key Components of the Cost-Saving Rack

The referenced $150k-saving infrastructure consists of:

| Component           | Specification        | Quantity | Purpose                       |
| ------------------- | -------------------- | -------- | ----------------------------- |
| EPYC 7742 servers   | 64C/128T, 512GB RAM  | 3        | Kubernetes worker nodes       |
| Storage server      | 24x NVMe in ZFS mirror | 1      | Database & persistent storage |
| Top-of-rack switch  | 100Gbps L3 capability | 1       | Cluster networking            |
| UPS system          | 3000VA with NMC      | 2        | Clean power delivery          |

Why This Beats Cloud for Specific Workloads

Database-Intensive Workloads:
Cloud databases charge premium prices for I/O operations. A single db.r6g.8xlarge instance (32 vCPU, 256GB RAM) costs ~$2,300/month in AWS. Our NVMe-packed storage server handles an equivalent workload for ~$200/month in electricity.

Kubernetes at Scale:
600 pods distributed across 3 EPYC servers (192 physical cores, 1.5TB RAM) would require ~20 c6g.8xlarge EC2 instances ($15,000+/month). Our bare-metal solution runs on ~$300/month of power.
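
As a sanity check on that power figure, here is a back-of-the-envelope estimate; the 500W average draw per server and $0.25/kWh rate are assumptions, not measured values:

```bash
# Rough monthly power cost for the 3 worker nodes
# (500W average draw per server and $0.25/kWh are assumed figures)
watts_per_server=500
servers=3
price_per_kwh=0.25
echo "scale=2; $watts_per_server * $servers / 1000 * 24 * 30 * $price_per_kwh" | bc
# => ~270 USD/month, in line with the ~$300 figure above
```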

Tradeoffs and Considerations

Pros:

  • 8-10x cost savings for I/O-heavy workloads
  • Full hardware visibility for performance tuning
  • No egress charges for data-intensive applications

Cons:

  • Upfront capital expenditure ($20k-$50k)
  • Requires specialized sysadmin skills
  • Limited geographic redundancy

3. PREREQUISITES

Hardware Requirements

Minimum Production Specification:

  • CPU: AMD EPYC 7002 series or Intel Xeon Scalable (Gen 2+)
  • RAM: 256GB ECC DDR4 per node minimum
  • Storage: NVMe with power-loss protection
  • Networking: 10Gbps+ switching with LACP support

Software Stack

| Component  | Version | Notes                                      |
| ---------- | ------- | ------------------------------------------ |
| Kubernetes | 1.27+   | k3s, for its low operational overhead      |
| Ceph       | 17.2+   | Or Longhorn for distributed storage        |
| Prometheus | 2.45+   | With VictoriaMetrics for long-term storage |
| PostgreSQL | 15.x    | TimescaleDB for time-series workloads      |

Network Considerations

  1. Dedicated Subnets:
    Separate VLANs for:
    • Kubernetes pod network
    • Storage replication
    • Management interface
  2. Bandwidth Planning:

     ```
     Required Bandwidth = (Peak IOPS * Avg I/O Size) / 0.8   # allows ~20% TCP/protocol overhead
     Example: 50,000 IOPS * 16KB = 800 MB/s = 6.4 Gbps
              6.4 Gbps / 0.8 = 8 Gbps -> 10Gbps NIC minimum
     ```

Pre-Installation Checklist

  1. Validate RAM ECC functionality
  2. Burn-in test all SSDs (badblocks -wsv /dev/nvme0n1; destructive, wipes the drive)
  3. Configure IPMI/BMC for out-of-band management (see the sketch after this list)
  4. Setup UPS network shutdown triggers
  5. Document physical rack locations and power circuits
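
For checklist item 3, a minimal ipmitool sketch; the channel number, addresses, and user ID below are illustrative and vary by board:

```bash
# Assign a static address to the BMC on LAN channel 1 (channel number varies by board)
ipmitool lan set 1 ipsrc static
ipmitool lan set 1 ipaddr 192.168.30.21
ipmitool lan set 1 netmask 255.255.255.0
ipmitool lan set 1 defgw ipaddr 192.168.30.1

# Set a strong password for the admin user (user ID 2 on many boards)
ipmitool user set password 2 'CHANGE_ME'

# Verify the configuration
ipmitool lan print 1
```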

4. INSTALLATION & SETUP

Base OS Configuration (Ubuntu 22.04 LTS)

Kernel Tuning for Database Servers:

```conf
# /etc/sysctl.d/99-production.conf
vm.swappiness=1
vm.dirty_ratio=40
vm.dirty_background_ratio=10
kernel.numa_balancing=0
```
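
Files in /etc/sysctl.d/ are applied at boot; to load them immediately and spot-check one value:

```bash
sudo sysctl --system    # reload all sysctl configuration files now
sysctl vm.swappiness    # confirm the running value
```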

Disable Transparent Hugepages (THP):

```bash
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
```
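
These echo commands take effect immediately but do not survive a reboot. One way to persist them, sketched here assuming the standard sysfs paths, is a oneshot systemd unit:

```ini
# /etc/systemd/system/disable-thp.service (a minimal sketch)
[Unit]
Description=Disable Transparent Hugepages
After=sysinit.target

[Service]
Type=oneshot
ExecStart=/bin/sh -c 'echo never > /sys/kernel/mm/transparent_hugepage/enabled'
ExecStart=/bin/sh -c 'echo never > /sys/kernel/mm/transparent_hugepage/defrag'
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target
```

Enable it with `systemctl daemon-reload && systemctl enable --now disable-thp`.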

Kubernetes Cluster Setup with k3s

Control Plane Initialization:

```bash
curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION=v1.27.4+k3s1 \
  sh -s - server \
  --cluster-init \
  --disable servicelb \
  --disable traefik \
  --write-kubeconfig-mode 644 \
  --kubelet-arg="cpu-manager-policy=static" \
  --kubelet-arg="system-reserved=cpu=1000m,memory=2Gi"  # static policy requires a non-zero CPU reservation
```

Worker Node Join:

```bash
# Note: K3S_TOKEN and K3S_URL must prefix sh (after the pipe), not curl,
# or the install script never sees them
curl -sfL https://get.k3s.io | \
  K3S_TOKEN=SECRET_TOKEN \
  K3S_URL=https://control-plane:6443 \
  INSTALL_K3S_VERSION=v1.27.4+k3s1 \
  sh -s - agent \
  --kubelet-arg="cpu-manager-policy=static" \
  --kubelet-arg="system-reserved=cpu=1000m,memory=2Gi"
```
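
After each agent joins, confirm it registers and optionally label workers (the node name worker-01 is hypothetical):

```bash
kubectl get nodes -o wide                                            # all nodes should report Ready
kubectl label node worker-01 node-role.kubernetes.io/worker=true    # illustrative node name
```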

Storage Configuration

ZFS Pool Creation for NVMe Storage:

```bash
# One mirrored pair shown; repeat "mirror diskA diskB" groups to stripe across more pairs
zpool create -f tank mirror \
  /dev/disk/by-id/nvme-SAMSUNG_MZQL27T6HBLA-000AA_XXXXXX \
  /dev/disk/by-id/nvme-SAMSUNG_MZQL27T6HBLA-000AA_YYYYYY
zfs set compression=zstd-9 tank
zfs set atime=off tank
zfs create tank/db               # dataset must exist before tuning it
zfs set recordsize=1M tank/db
```
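
Periodic scrubs and snapshots are cheap insurance on a pool like this. A minimal cron sketch, with an illustrative schedule and no retention policy (tools such as sanoid handle pruning better):

```bash
# /etc/cron.d/zfs-maintenance (illustrative schedule; % must be escaped in crontab)
# Hourly snapshot of the database dataset
0 * * * *  root  /usr/sbin/zfs snapshot tank/db@auto-$(date +\%Y\%m\%d-\%H\%M)
# Monthly scrub to catch silent corruption
0 3 1 * *  root  /usr/sbin/zpool scrub tank
```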

Network Fabric Setup

MetalLB Layer 2 Configuration:

```yaml
# metallb-config.yaml
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: production-pool
  namespace: metallb-system
spec:
  addresses:
  - 192.168.10.100-192.168.10.200
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: l2-advert
  namespace: metallb-system
```
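
After applying the pool, a quick end-to-end check that MetalLB hands out addresses (the lb-test deployment is illustrative):

```bash
kubectl apply -f metallb-config.yaml

# Expose a throwaway deployment and confirm an address from the pool is assigned
kubectl create deployment lb-test --image=nginx
kubectl expose deployment lb-test --port=80 --type=LoadBalancer
kubectl get svc lb-test    # EXTERNAL-IP should fall within 192.168.10.100-200
```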

5. CONFIGURATION & OPTIMIZATION

Kubernetes Scheduler Tuning

Pod Affinity for NUMA Locality:

```yaml
# numa-affinity.yaml (pod spec fragment)
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: topology.kubernetes.io/zone   # zone label repurposed here per NUMA domain
          operator: In
          values:
          - numa0
  podAntiAffinity:
    # Prefer spreading high-io-db pods across physical hosts
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - high-io-db
        topologyKey: kubernetes.io/hostname
```
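
One caveat worth pairing with the scheduler tuning: the `cpu-manager-policy=static` kubelet flag set during installation only grants exclusive cores to pods in the Guaranteed QoS class with whole-number CPU requests. A matching resources stanza might look like:

```yaml
# Container resources for exclusive CPU pinning under the static CPU manager
# (requests must equal limits, and CPU must be a whole number)
resources:
  requests:
    cpu: "4"
    memory: 16Gi
  limits:
    cpu: "4"
    memory: 16Gi
```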

Database Performance Tuning

PostgreSQL 15 Configuration:

```conf
# postgresql.conf
shared_buffers = 64GB
effective_cache_size = 192GB
work_mem = 1GB
maintenance_work_mem = 2GB
max_worker_processes = 64
max_parallel_workers_per_gather = 16
max_parallel_workers = 48
wal_buffers = 16MB
wal_compression = on
bgwriter_delay = 10ms
bgwriter_lru_multiplier = 2.0
```
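
Most of these parameters require a full restart rather than a reload. A quick apply-and-verify sequence, assuming Ubuntu's packaged service name:

```bash
sudo systemctl restart postgresql@15-main    # Ubuntu unit name; adjust to your install
sudo -u postgres psql -c 'SHOW shared_buffers;'
# Anything still waiting on a restart?
sudo -u postgres psql -c 'SELECT name, setting FROM pg_settings WHERE pending_restart;'
```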

Power Efficiency Measures

CPU Governor Settings:

```bash
# Set performance governor for database nodes
cpupower frequency-set -g performance

# Set ondemand governor for worker nodes
cpupower frequency-set -g ondemand
```
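
Like the THP setting, governor changes reset at boot, so verify them after any restart:

```bash
cpupower frequency-info -p    # show the currently active policy/governor

# Check every core at once
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor | sort | uniq -c
```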

6. USAGE & OPERATIONS

Daily Monitoring Checklist

  1. Cluster Health:

     ```bash
     kubectl get nodes -o=jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}'
     ```

  2. Storage Utilization:

     ```bash
     zpool list -v
     zfs list -t snapshot
     ```

  3. Database Performance:

     ```sql
     SELECT * FROM pg_stat_activity WHERE state != 'idle';
     SELECT * FROM pg_stat_bgwriter;
     ```

Backup Strategy

Velero Configuration for Kubernetes State:

```bash
velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.7.0 \
  --bucket velero-backups \
  --backup-location-config region=minio,s3ForcePathStyle="true",s3Url=http://minio.local:9000 \
  --use-node-agent    # replaces the deprecated --use-restic in Velero 1.10+
```
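
With the server installed, recurring backups are one command (the schedule, namespace exclusion, and 30-day TTL are illustrative):

```bash
# Nightly backup at 02:00, kept for 30 days
velero schedule create nightly \
  --schedule="0 2 * * *" \
  --exclude-namespaces kube-system \
  --ttl 720h
```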

PostgreSQL Continuous Archiving:

```bash
# Take a full base backup and push it to object storage
wal-g backup-push /var/lib/postgresql/15/main

# Restore the latest base backup into an (empty) data directory
wal-g backup-fetch /var/lib/postgresql/15/main LATEST
```
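
For the continuous-archiving half, PostgreSQL must be told to ship completed WAL segments through wal-g. A minimal postgresql.conf sketch; the S3 prefix is illustrative, and wal-g additionally needs storage credentials in its environment:

```conf
# postgresql.conf: ship every completed WAL segment to object storage
archive_mode = on
archive_command = 'wal-g wal-push %p'
archive_timeout = 60

# wal-g reads its target from the environment, e.g.
#   WALG_S3_PREFIX=s3://pg-backups/main   (illustrative bucket path)
```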

7. TROUBLESHOOTING

Common Hardware Issues

Diagnosing Memory Errors:

```bash
# Check the BMC event log for memory/ECC events
ipmitool sel elist | grep -i -E 'memory|ecc'

# Test memory stability (16GB region, 12 passes)
memtester 16G 12
```
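
The kernel's EDAC subsystem exposes per-controller error counters that complement the BMC event log (paths assume the EDAC driver for your platform is loaded):

```bash
# Correctable (ce_count) and uncorrectable (ue_count) errors per memory controller
grep -H . /sys/devices/system/edac/mc/mc*/ce_count
grep -H . /sys/devices/system/edac/mc/mc*/ue_count
```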

Storage Degradation:

```bash
# ZFS scrub status
zpool status -v

# NVMe health check
nvme smart-log /dev/nvme0
```

Kubernetes Debugging

Pod Failure Diagnosis:

```bash
kubectl describe pod $POD_NAME
kubectl logs $POD_NAME --previous
kubectl debug -it $POD_NAME --image=busybox
```

Network Connectivity Issues:

```bash
# Check kube-proxy rules
iptables-save | grep <service-ip>

# Validate Calico BGP status (if running Calico in place of k3s's default flannel)
calicoctl node status
```

8. CONCLUSION

This homelab infrastructure demonstrates that with careful planning and technical execution, self-hosted environments can deliver cloud-scale performance at dramatically lower costs. The $150k annual savings comes not just from avoiding cloud markup, but from:

  1. Hardware Optimization: Matching exact workload requirements
  2. Software Tuning: Eliminating virtualization overhead
  3. Operational Discipline: Proactive monitoring and maintenance

For teams running I/O-intensive or high-throughput workloads, a hybrid approach using carefully selected bare metal infrastructure can provide the best balance of cost, performance, and control. While cloud services remain essential for global distribution and managed services, strategic use of self-hosted infrastructure creates significant competitive advantage through cost optimization.

Further Reading:

  1. Kubernetes Production Best Practices
  2. ZFS Tuning Guide
  3. PostgreSQL Optimization

When your cloud bill starts approaching six figures, it’s time to ask: could the answer be in your own rack?

This post is licensed under CC BY 4.0 by the author.