Finally Upgraded My Homelab

Introduction

For years, my homelab was powered by a humble 4-node Raspberry Pi cluster – an excellent platform for learning Kubernetes and Terraform fundamentals, but plagued by the inherent limitations of MicroSD storage. Random node failures, filesystem corruption during power outages, and sluggish I/O created operational headaches that undermined the reliability of my home automation. When TrueNAS SCALE 25.10 arrived with native Kubernetes integration, it became clear: the SD card experiment had served its purpose. It was time for enterprise-grade infrastructure.

This guide documents my journey from Raspberry Pi nodes to a purpose-built homelab with TrueNAS Scale, Kubernetes, and robust storage. We’ll dissect:

  1. Hardware selection – Why the DeskPi RackMate T1 and 2.5G PoE networking?
  2. Storage architecture – Separating apps (SSD) from bulk storage (HDD)
  3. Software stack – TrueNAS Scale’s Kubernetes integration versus vanilla K8s
  4. Operational gains – How ZFS and hardware redundancy eliminated my SD card woes

Whether you’re battling unreliable nodes or planning a homelab overhaul, this deep-dive into infrastructure modernization delivers actionable insights for DevOps engineers and sysadmins managing self-hosted environments.

Understanding the Homelab Evolution

What Defines a Modern Homelab?

Homelabs have evolved from hobbyist playgrounds to production-like environments for testing cloud-native technologies. Key characteristics include:

  • Self-healing infrastructure: Kubernetes, Proxmox HA, ZFS redundancy
  • Automated provisioning: Terraform, Ansible, cloud-init
  • Persistent storage: NAS/SAN solutions replacing USB drives
  • Enterprise networking: VLANs, firewalls, load balancers

The Problem with SD Cards in Production

MicroSD cards – while convenient for Pi clusters – introduce critical failure points:

| Failure Mode | Impact | Mitigation in New Build |
| --- | --- | --- |
| Write endurance | Corrupted root FS | M.2 SSD with wear leveling |
| I/O bottlenecks | Slow PVC provisioning | NVMe-backed RWO volumes |
| Power loss corruption | Manual fsck recovery | ZFS transaction groups + UPS |
| Limited capacity | Frequent garbage collection | 32TB raw storage + compression |

Why TrueNAS Scale for Kubernetes?

TrueNAS SCALE 25.10 merges Linux-based NAS management with an integrated Kubernetes (k3s) runtime:

[TrueNAS SCALE Architecture]
  ├── Debian 12 Base
  ├── ZFS 2.2.2
  ├── Kubernetes 1.28
  └── Web UI → K3s API Proxy

Key Advantages:

  • Direct PV Provisioning: Create ZVOLs or datasets as PersistentVolumes
  • GPU Passthrough: Assign NVIDIA cards to worker nodes via UI
  • Integrated Load Balancer: MetalLB integration for on-prem services
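
As a quick taste of that last point, any Service of type LoadBalancer gets an address from the MetalLB pool configured during setup. A minimal sketch (the nginx Deployment here is a placeholder):

# Expose a hypothetical nginx Deployment through MetalLB
kubectl expose deployment nginx --port=80 --type=LoadBalancer
kubectl get svc nginx   # EXTERNAL-IP is drawn from 10.10.40.100-150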

Comparison to Alternatives:

| Solution | Storage | K8s Management | Learning Curve |
| --- | --- | --- | --- |
| TrueNAS SCALE | Native ZFS | GUI + CLI | Moderate |
| Rancher + Harvester | Ceph | Full UI | High |
| MicroK8s + OpenZFS | Manual | CLI only | Low |

Prerequisites

Hardware Specifications

My build targets 24/7 operation with <150W power draw:

# hardware.yml
motherboard: MSI MPG B650I Edge (DDR5, PCIe 5.0)
cpu: AMD Ryzen 5 7600 (6C/12T, 65W TDP)
ram: 64GB Kingston Fury DDR5-5200 (ECC Unbuffered)
storage:
  boot: 500GB WD Red SN700 NVMe
  apps: 2x2TB Samsung 870 QVO SATA SSD (Mirror)
  data: 4x8TB Seagate IronWolf HDD (RAIDZ1)
networking: 
  main: 2.5Gbps RJ45 (Intel I225-V)
  secondary: 1Gbps (Realtek RTL8125)
power: Corsair SF750 Platinum (750W)

Critical Considerations:

  • ECC RAM: Not officially validated on consumer AM5, but unbuffered ECC lets ZFS catch in-memory corruption before it reaches disk
  • HDD Choice: CMR drives mandatory for ZFS (avoid SMR models such as the older WD Red line)
  • Power Supply: 80+ Platinum efficiency for 24/7 operation
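
Before trusting any drive to a pool, it's worth confirming the exact model with smartctl and cross-checking it against the vendor's CMR/SMR listings; the device paths below are illustrative:

# Print the model string of each candidate disk
for d in /dev/sd{a..d}; do
  sudo smartctl -i "$d" | grep "Device Model"
done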

Network Pre-Configuration

Before installing TrueNAS:

# On a Ubiquiti EdgeRouter (EdgeOS CLI)
# Create storage and Kubernetes VLANs
configure
set interfaces switch switch0 vif 30 description "Storage"
set interfaces switch switch0 vif 30 address 10.10.30.1/24
set interfaces switch switch0 vif 40 description "Kubernetes"
set interfaces switch switch0 vif 40 address 10.10.40.1/24
commit
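
A quick sanity check from a tagged port confirms both gateways answer before proceeding:

# Both VLAN gateways should respond before installing TrueNAS
ping -c 3 10.10.30.1
ping -c 3 10.10.40.1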

Software Requirements

| Component | Version | Notes |
| --- | --- | --- |
| TrueNAS SCALE | 25.10 | Requires UEFI boot with Secure Boot off |
| Kubernetes | 1.28 | Via TrueNAS Apps (k3s) |
| Terraform | 1.8.2 | State stored on S3-compatible MinIO |

Installation & Configuration

TrueNAS SCALE Base Setup

  1. Flash Installer:
    
    # Linux
    sudo dd if=truenas-scale-25.10.iso of=/dev/sdX bs=4M status=progress conv=fsync
    
  2. Web UI Initialization:
    • Set admin IP on dedicated VLAN (10.10.30.5/24)
    • Disable root password, enforce SSH key authentication:
      
      ssh-copy-id -i ~/.ssh/truenas_ed25519 admin@10.10.30.5
      
  3. ZFS Pool Creation:
    
    # CLI alternative to UI
    zpool create -f -o ashift=12 tank raidz1 \
      /dev/disk/by-id/ata-ST8000VN004-2M2101_ABCD1234 \
      /dev/disk/by-id/ata-ST8000VN004-2M2101_EFGH5678 \
      /dev/disk/by-id/ata-ST8000VN004-2M2101_IJKL9012 \
      /dev/disk/by-id/ata-ST8000VN004-2M2101_MNOP3456
    zfs set compression=lz4 tank
    zfs set atime=off tank
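
With the pool built, verify the layout and properties before putting any data on it:

# Confirm pool health and the properties just set
zpool status tank
zfs get compression,atime tank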
    

Kubernetes via TrueNAS Apps

TrueNAS uses a modified k3s distribution:

  1. Enable Kubernetes:
    • System Settings → Apps → Enable
    • Select VLAN 40 (10.10.40.0/24)
    • Configure Load Balancer IP Range (10.10.40.100-10.10.40.150)
  2. Storage Classes:
    
    # apps.yaml
    apiVersion: storage.k8s.io/v1
    kind: StorageClass
    metadata:
      name: truenas-iscsi-lz4
    provisioner: csi.truenas.com
    parameters:
      compression: lz4
      aclmode: passthrough
      sync: standard
    volumeBindingMode: WaitForFirstConsumer
    
  3. Node Configuration:
    
    # Access k3s cluster
    truenas-cli kubernetes kubeconfig > ~/.kube/homelab-config
    kubectl get nodes -o wide
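
With the kubeconfig exported, a throwaway claim is a useful smoke test for the storage class. Because of WaitForFirstConsumer the PVC stays Pending until a pod mounts it, so this sketch (names are illustrative) binds it with a short-lived busybox pod:

# Create a 1Gi test claim and a pod that mounts it
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-smoke-test
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: truenas-iscsi-lz4
  resources:
    requests:
      storage: 1Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: pvc-tester
spec:
  restartPolicy: Never
  containers:
    - name: tester
      image: busybox
      command: ["sh", "-c", "df -h /data && sleep 5"]
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: pvc-smoke-test
EOF
kubectl get pvc pvc-smoke-test   # should report Bound once the pod schedules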
    

Terraform Bootstrap

Initialize Terraform with remote state:

# main.tf
terraform {
  backend "s3" {
    bucket = "terraform-state"
    key    = "homelab/network"
    region = "main"
    endpoints = {
      s3 = "https://minio.mydomain.net"
    }
    use_path_style              = true
    skip_credentials_validation = true
    skip_region_validation      = true
    skip_requesting_account_id  = true
  }
}
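
The backend picks up MinIO credentials from the standard AWS environment variables (the key values below are placeholders):

# MinIO credentials via the AWS-compatible env vars
export AWS_ACCESS_KEY_ID="minio-access-key"
export AWS_SECRET_ACCESS_KEY="minio-secret-key"
terraform init   # creates homelab/network in the terraform-state bucket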

Apply initial config:

export TF_VAR_nas_ip="10.10.30.5"
terraform apply -target=module.nfs_storage

Optimization & Hardening

ZFS Tuning for Mixed Workloads

Adjust ARC limits for Kubernetes + SMB/CIFS:

# /etc/modprobe.d/zfs.conf
options zfs zfs_arc_min=2147483648  # 2GB floor
options zfs zfs_arc_max=34359738368 # 32GB max (50% of 64GB RAM)
options zfs l2arc_write_boost=400000000 # Burst to L2ARC (SSD)
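
The modprobe file only applies at module load; the same knobs are writable at runtime, so the limits can take effect immediately (keep the file as the persistent source of truth, since TrueNAS updates may reset live values):

# Apply the new ARC limits without a reboot (values in bytes)
echo 2147483648  | sudo tee /sys/module/zfs/parameters/zfs_arc_min
echo 34359738368 | sudo tee /sys/module/zfs/parameters/zfs_arc_max
cat /sys/module/zfs/parameters/zfs_arc_max   # confirm the live value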

Kubernetes Security Policies

  1. Pod Security Admission:

    # psa.yaml
    apiVersion: apiserver.config.k8s.io/v1
    kind: AdmissionConfiguration
    plugins:
      - name: PodSecurity
        configuration:
          apiVersion: pod-security.admission.config.k8s.io/v1
          kind: PodSecurityConfiguration
          defaults:
            enforce: "restricted"
            enforce-version: "latest"
          exemptions:
            usernames: ["system:serviceaccount:kube-system:*"]

  2. Network Policies:

    # default-deny.yaml
    apiVersion: networking.k8s.io/v1
    kind: NetworkPolicy
    metadata:
      name: default-deny
      namespace: automation
    spec:
      podSelector: {}
      policyTypes:
        - Ingress
        - Egress
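
Neither file is active on its own: the NetworkPolicy applies like any manifest, while the admission configuration must be handed to the API server at startup. On plain k3s that is a kube-apiserver argument (the path below is illustrative; on TrueNAS the middleware manages the k3s flags, so verify where they persist):

# Apply the default-deny policy to the automation namespace
kubectl apply -f default-deny.yaml
# k3s example: point the API server at the PSA config
#   k3s server --kube-apiserver-arg="admission-control-config-file=/etc/rancher/k3s/psa.yaml"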

Day-to-Day Operations

Monitoring Stack

Deploy Prometheus with TrueNAS storage:

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.storageClassName=truenas-iscsi-lz4

Key metrics to alert on:

  • zfs_arc_size / zfs_arc_max > 0.8 – ARC pressure
  • kube_pod_status_ready{condition="false"} – CrashLoopBackoff detection
  • node_filesystem_avail_bytes{mountpoint="/mnt/tank"} / node_filesystem_size_bytes < 0.2 – Storage capacity
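
The ARC check translates directly into a PrometheusRule that kube-prometheus-stack will pick up. Metric names vary by exporter; this sketch assumes node_exporter's ZFS collector (node_zfs_arc_size, node_zfs_arc_c_max):

cat <<'EOF' | kubectl apply -f -
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: zfs-arc-pressure
  namespace: monitoring
  labels:
    release: prometheus   # matches the chart's default rule selector
spec:
  groups:
    - name: zfs
      rules:
        - alert: ZfsArcPressure
          expr: node_zfs_arc_size / node_zfs_arc_c_max > 0.8
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: ZFS ARC above 80% of its configured maximum
EOF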

Backup Strategy

  1. ZFS Snapshots:
    
    # Daily recursive snapshots
    zfs snapshot -r tank@$(date +%Y%m%d)
       
    # Offsite replication
    zfs send tank/apps@20240501 | ssh backup-host "zfs recv backup/apps"
    
  2. Velero for Kubernetes:
    
    velero install \
      --provider aws \
      --plugins velero/velero-plugin-for-aws:v1.8.0 \
      --bucket velero-backups \
      --secret-file ./credentials \
      --backup-location-config region=default,s3ForcePathStyle="true",s3Url=https://minio.mydomain.net
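
With Velero installed, ad-hoc backups matter less than a schedule; cron syntax plus a TTL covers retention (the schedule name and namespace scope here are illustrative):

# Nightly backup of the automation namespace, retained for 30 days
velero schedule create nightly-automation \
  --include-namespaces automation \
  --schedule="0 2 * * *" \
  --ttl 720h
velero backup get   # backups appear after the first scheduled run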
    

Troubleshooting Guide

Common Issues and Solutions

Problem: ZFS pool shows DEGRADED state
Diagnosis:

zpool status -v
  pool: tank
 state: DEGRADED
  scan: scrub in progress since Wed May  1 12:00:00 2024
        1.14T scanned at 1.2G/s, 256G issued at 300M/s
        0B repaired, 21.96% done

Solution: Replace faulted drive using Web UI → Storage → Pools → Status → Replace
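
The same replacement works from the shell; these device paths reuse the illustrative serials from the pool-creation step:

# Swap the faulted disk for the new one, then watch the resilver
zpool replace tank \
  /dev/disk/by-id/ata-ST8000VN004-2M2101_EFGH5678 \
  /dev/disk/by-id/ata-ST8000VN004-2M2101_NEW00001
zpool status tank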

Problem: Kubernetes pods stuck in Pending
Diagnosis:

kubectl describe pod $POD_NAME
  Events:
    Type     Reason            Age   From               Message
    ----     ------            ----  ----               -------
    Warning  FailedScheduling  5m    default-scheduler  0/1 nodes are available: 1 Insufficient cpu.

Solution: Lower the pod's resource requests or add node capacity; if the message cites taints or selectors instead, add matching tolerations or fix the nodeSelector labels

Performance Tuning

Slow NFS Exports:

# /etc/nfs.conf
[nfsd]
threads=16
# Disable UDP, force TCP
udp=n
tcp=y

[mountd]
manage-gids=true
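
On SCALE's Debian base, restarting the NFS server picks up the changes; note that TrueNAS regenerates parts of this file from the web UI, so prefer the UI's NFS service settings where the options overlap:

sudo systemctl restart nfs-server
cat /proc/fs/nfsd/threads   # should now read 16
rpcinfo -p | grep nfs       # TCP-only registrations expected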

High Kubernetes API Latency:

# Edit k3s systemd service
ExecStart=/usr/local/bin/k3s server \
  --kube-apiserver-arg="default-watch-cache-size=2000" \
  --kube-apiserver-arg="watch-cache-sizes=services#500,endpointslices#1000"

Conclusion

Migrating from Raspberry Pis to a purpose-built TrueNAS SCALE homelab eliminated the reliability issues inherent in SD card-based infrastructure while unlocking enterprise-grade capabilities:

  1. Storage Resilience: ZFS scrubbing detects bit rot before it impacts data
  2. Kubernetes Efficiency: Direct PVC provisioning vs. Longhorn overhead
  3. Operational Simplicity: Unified UI for storage and container management

Next Steps:

  • Implement Istio for service mesh across on-prem workloads
  • Test Ceph integration for hyperconverged storage
  • Explore GPU partitioning for AI workloads

The investment in proper homelab infrastructure pays dividends through reduced maintenance overhead and a platform capable of testing production-grade workflows. Whether you’re running home automation or staging cloud migrations, eliminating storage bottlenecks is the foundation for reliable self-hosting.

This post is licensed under CC BY 4.0 by the author.