Government Surplus Find: Resurrecting HPC Hardware for DevOps Infrastructure

1. Introduction

The Reddit post describing a $150 acquisition of 2,700 lbs of decommissioned HPC hardware from Oak Ridge National Laboratory perfectly illustrates both the opportunity and challenge of government surplus finds. For DevOps engineers and sysadmins building homelabs or self-hosted infrastructure, such discoveries represent potential gold mines of enterprise-grade hardware at scrap metal prices—if you can overcome the technical hurdles.

This scenario raises critical questions about:

  • Power requirements (480V three-phase vs residential 240V)
  • Proprietary hardware compatibility
  • Thermal management of high-density compute
  • Repurposing specialized accelerators (Xeon Phi, Tesla M60)
  • Enterprise component integration into modern DevOps workflows

In this comprehensive guide, we’ll examine:

  1. Technical assessment of surplus HPC components
  2. Power conversion strategies for home use
  3. Modern DevOps applications for legacy hardware
  4. Performance optimization techniques
  5. Real-world integration case studies

For engineers operating self-hosted environments, mastering these skills bridges the gap between enterprise-grade infrastructure and budget constraints while providing unparalleled hands-on experience with specialized hardware.

2. Understanding Government Surplus HPC Hardware

2.1 What is Government Surplus Computing Equipment?

Federal agencies like DOE national labs regularly decommission HPC systems through platforms like GSA Auctions. These typically include:

  • Compute nodes with enterprise CPUs (Xeon E5/E7 series)
  • Specialized accelerators (NVIDIA Tesla, Intel Xeon Phi)
  • High-speed interconnects (InfiniBand, Omni-Path)
  • Enterprise storage arrays
  • Custom cooling solutions

2.2 Technical Specifications Breakdown

Based on the Reddit post description, the hardware likely includes:

Component      Specifications                                  Modern Equivalent
CPUs           Dual Xeon E5-2600 v3/v4 (Haswell/Broadwell)     Xeon Silver 4210R
Memory         256GB DDR3 ECC REG                              DDR4 3200MHz RDIMM
GPUs           NVIDIA Tesla M60 (compute-focused)              NVIDIA T4
Coprocessors   Intel Xeon Phi 7120P (61 cores)                 Intel GPU Max Series
Storage        Likely SAS3 SSDs or NVMe                        U.2 NVMe Gen4
Networking     10GbE or InfiniBand QDR/FDR                     100Gb EDR InfiniBand

2.3 Advantages of Surplus HPC Hardware

  1. Cost Efficiency: 90-95% discount vs. new hardware
  2. Educational Value: Hands-on experience with enterprise architectures
  3. Performance Density: compute-per-dollar ratio remains competitive
  4. Sustainability: Prevents e-waste through reuse

2.4 Challenges and Limitations

  1. Power Requirements: 480V three-phase vs residential 240V split-phase
  2. Thermal Design Power (TDP): Typical 1U server = 400-800W
  3. Component Obsolescence: DDR3 memory, PCIe Gen3 limitations
  4. Driver/Software Support: Legacy accelerators may lack modern CUDA support

2.5 Modern DevOps Use Cases

Despite their age, these components remain viable for:

  • CI/CD build farms (CPU-intensive compilation)
  • Machine learning experimentation (TensorFlow 1.x/PyTorch legacy support)
  • Distributed storage testing (Ceph, MinIO)
  • Network stress testing
  • Hypervisor benchmarking

3. Prerequisites

3.1 Hardware Requirements

Minimum for Partial Utilization:

  • 240V 30A circuit (NEMA L6-30 receptacle)
  • 2000VA UPS with pure sine wave output
  • 4U+ server chassis with 120mm+ fan support
  • PCIe risers for GPU installation

Recommended for Full Rack:

  • Step-down transformer (480V→208V)
  • 30A 208V PDU
  • Liquid cooling loop (as an alternative to the stock 40mm fans)
  • Noise-dampened rack/enclosure

3.2 Software Compatibility

Component        Supported OS         Driver Requirements
Tesla M60        Linux Kernel 4.4+    NVIDIA 450.80.02+
Xeon Phi 7120P   RHEL/CentOS 7        MPSS 3.8.1
Intel C600 PCH   Any x86_64 OS        SATA/AHCI drivers

3.3 Safety Considerations

  1. Electrical Safety: Always verify input voltage before powering on
  2. Lifting Protocols: 2+ people for blades >30 lbs
  3. ESD Protection: Use grounded wrist straps
  4. Thermal Monitoring: IPMI setup mandatory before full load

3.4 Pre-Installation Checklist

  1. Verify component compatibility matrix
  2. Confirm BIOS/UEFI support for intended OS
  3. Prepare firmware recovery media
  4. Test PSUs with variable AC source
  5. Establish IPMI/KVM access (quick check below)
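
For item 5, a minimal reachability check with ipmitool looks like the following; the BMC address, credentials, and channel number are placeholders for whatever your nodes actually use.

# Placeholder BMC address and credentials -- substitute your own
BMC_IP=192.168.1.120

# Confirm the BMC answers and report chassis power state
ipmitool -I lanplus -H "$BMC_IP" -U admin -P admin chassis power status

# Print the BMC's network configuration (channel 1 is typical, but varies by vendor)
ipmitool -I lanplus -H "$BMC_IP" -U admin -P admin lan print 1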

4. Installation & Setup

4.1 Power Conversion Setup

For 480V→240V conversion:

# Calculate power requirements per node
ipmitool -H $BMC_IP -U admin -P admin dcmi power reading

# Transformer sizing example (3 nodes):
# Each node: 750W nominal, 900W peak
# Total: 3 × 900W = 2700W → 3kVA transformer minimum
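
To size against measured rather than nominal draw, a short loop can sum live DCMI readings across the nodes. The BMC addresses below are placeholders, and the awk field assumes ipmitool's usual "Instantaneous power reading" output line.

# Sum instantaneous power readings across several nodes
total=0
for bmc in 10.0.0.11 10.0.0.12 10.0.0.13; do
  watts=$(ipmitool -H "$bmc" -U admin -P admin dcmi power reading \
          | awk '/Instantaneous power reading/ {print $4}')
  echo "$bmc: ${watts}W"
  total=$((total + watts))
done
echo "Total instantaneous draw: ${total}W"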

4.2 Base Operating System Installation

Debian 11 (Bullseye) Minimal Install:

# Enable non-free repositories
sudo tee /etc/apt/sources.list.d/non-free.list <<EOF
deb http://deb.debian.org/debian bullseye main contrib non-free
EOF

# Refresh package lists, then install the standard kernel plus non-free firmware
# (the cloud kernel is trimmed for VMs and drops many bare-metal drivers)
sudo apt update
sudo apt install linux-image-amd64 firmware-misc-nonfree
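
After the first boot on the new kernel, it is worth checking that no firmware blobs failed to load and that every PCI device has a bound driver; a quick pass:

# Look for firmware the kernel tried and failed to load
sudo dmesg | grep -iE "firmware.*(fail|error|missing)"

# List PCI devices with their bound kernel drivers to spot anything unclaimed
lspci -k | grep -A2 -iE "ethernet|infiniband|3d|vga"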

4.3 Accelerator Drivers Installation

NVIDIA Tesla M60:

# Add the NVIDIA container toolkit repository (this provides nvidia-container-toolkit;
# the GPU driver itself comes from the Debian non-free repository enabled earlier)
curl -fsSL https://nvidia.github.io/nvidia-docker/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://nvidia.github.io/libnvidia-container/stable/deb/amd64 /" | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

# Refresh package lists, then install the driver and CUDA toolkit
sudo apt-get update
sudo apt-get install nvidia-driver nvidia-cuda-toolkit
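
Once the driver loads, a quick sanity check plus persistence mode (which keeps the driver initialized between jobs) is a sensible first step. Note that some M60s ship in graphics mode and need NVIDIA's separate gpumodeswitch tool flipped to compute mode before CUDA workloads will run.

# Confirm both GPUs on each M60 are visible to the driver
nvidia-smi -L

# Keep the driver loaded between workloads
sudo nvidia-smi -pm 1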

Intel Xeon Phi Coprocessor:

# Download the MPSS stack (officially supported on RHEL/CentOS 7 -- see section 3.2)
wget https://registrationcenter-download.intel.com/akdlm/IRC_NAS/19114/l_mpss_3.8.1_supp_offline_public.tgz

# Install dependencies
sudo apt install libtbb2 dkms libnuma1 openssh-client

# Unpack and install MPSS
tar xzf l_mpss_3.8.1_supp_offline_public.tgz
cd l_mpss_3.8.1*/
sudo ./INSTALL.sh --default
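
Assuming MPSS installed cleanly, the coprocessor service has to be running before the cards become usable; micctrl and micinfo ship with the MPSS stack and make a quick bring-up check:

# Start the MPSS service and confirm mic0 reaches the "online" state
sudo systemctl start mpss
sudo micctrl --status

# Dump card details (firmware version, core count, memory)
micinfo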

4.4 Infrastructure-as-Code Provisioning

Ansible Playbook for Node Setup:

---
- name: Configure HPC node
  hosts: all
  become: yes
  tasks:
    - name: Set performance governor
      ansible.builtin.shell: |
        echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

    - name: Configure hugepages
      sysctl:
        name: vm.nr_hugepages
        value: 1024
        state: present
        reload: yes

    - name: Install NVIDIA Docker runtime
      apt:
        name: nvidia-container-toolkit
        state: present
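
A minimal way to run this against the nodes, assuming the playbook is saved as hpc-node.yml (a hypothetical name) and SSH key access is already in place:

# Hypothetical inventory -- replace hostnames and addresses with your own nodes
cat > hpc-nodes.ini <<EOF
[hpc]
node01 ansible_host=10.0.0.21
node02 ansible_host=10.0.0.22
EOF

# Dry-run first, then apply
ansible-playbook -i hpc-nodes.ini hpc-node.yml --check
ansible-playbook -i hpc-nodes.ini hpc-node.yml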

5. Configuration & Optimization

5.1 BIOS Tuning Recommendations

Critical Settings:

  • NUMA: Enabled
  • Power Policy: Maximum Performance
  • C-States: Disabled
  • Turbo Boost: Enabled
  • VT-d/IOMMU: Enabled for GPU passthrough (verification sketch below)
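
Whether the IOMMU actually came up can be confirmed from Linux after the BIOS change (intel_iommu=on may also be needed on the kernel command line); a rough check:

# Confirm the kernel sees an active IOMMU
sudo dmesg | grep -iE "DMAR|IOMMU"

# List IOMMU groups -- a GPU intended for passthrough should sit in its own group
for g in /sys/kernel/iommu_groups/*/devices/*; do
  echo "group $(basename "$(dirname "$(dirname "$g")")"): $(basename "$g")"
done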

5.2 Thermal Management

Override the automatic fan curve with a static fan speed via IPMI (raw commands are vendor-specific; the ones below are commonly documented for Dell BMCs):

# Disable automatic fan control
ipmitool -H $BMC_IP -U admin -P admin raw 0x30 0x30 0x01 0x00

# Set a static fan speed as a hex percentage -- pick one
ipmitool -H $BMC_IP -U admin -P admin raw 0x30 0x30 0x02 0xff 0x28  # 40%
ipmitool -H $BMC_IP -U admin -P admin raw 0x30 0x30 0x02 0xff 0x4b  # 75%
ipmitool -H $BMC_IP -U admin -P admin raw 0x30 0x30 0x02 0xff 0x64  # 100%
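
Because a static override bypasses the BMC's own thermal protection, a small watchdog that reverts to automatic control when a sensor runs hot is a sensible companion. This is only a sketch: the 75°C threshold and the "re-enable" raw command (again Dell-style) are assumptions to adapt for your hardware.

# Revert to automatic fan control if any temperature sensor exceeds 75C
while sleep 30; do
  max=$(ipmitool -H "$BMC_IP" -U admin -P admin sensor list \
        | awk -F'|' '/Temp/ {gsub(/ /,"",$2); if ($2+0 > m) m=$2+0} END {print m}')
  if [ -n "$max" ] && [ "${max%.*}" -ge 75 ]; then
    ipmitool -H "$BMC_IP" -U admin -P admin raw 0x30 0x30 0x01 0x01  # re-enable auto control
    break
  fi
done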

5.3 Kubernetes Device Plugin Configuration

NVIDIA GPU Plugin DaemonSet:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      containers:
      - image: nvcr.io/nvidia/k8s-device-plugin:v0.12.3
        name: nvidia-device-plugin-ctr
        env:
        - name: FAIL_ON_INIT_ERROR
          value: "false"
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins
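
With the plugin registered, pods request GPUs through the nvidia.com/gpu resource. A throwaway smoke-test pod, applied here via a heredoc and reusing the same CUDA image as section 6.2:

# Schedule a one-shot pod that claims a single GPU and prints nvidia-smi output
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:11.0-base
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF

kubectl logs -f gpu-smoke-test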

5.4 Security Hardening Checklist

  1. Disable IPMI default credentials (see the snippet after this list)
  2. Implement VLAN segmentation
  3. Enable UEFI Secure Boot
  4. Configure TPM-backed disk encryption
  5. Restrict PCIe device access via IOMMU groups
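
For item 1, the stock BMC account can be rotated with ipmitool; user ID 2 is commonly the built-in administrator, but verify against the list output before changing anything.

# List BMC accounts on channel 1 to identify the default admin user ID
ipmitool -H "$BMC_IP" -U admin -P admin user list 1

# Set a new password for user ID 2 (adjust the ID to match the listing above)
ipmitool -H "$BMC_IP" -U admin -P admin user set password 2 'NewStrongPassphrase'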

6. Usage & Operations

6.1 Monitoring Stack Configuration

Prometheus Exporter Setup:

# Node Exporter for host-level metrics (node_exporter has no IPMI, GPU, or Xeon Phi
# collectors -- those come from separate exporters such as ipmi_exporter and dcgm-exporter)
docker run -d \
  --name node_exporter \
  --net="host" \
  --pid="host" \
  -v "/:/host:ro,rslave" \
  quay.io/prometheus/node-exporter:latest \
  --path.rootfs=/host \
  --collector.hwmon \
  --collector.nfs
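
A minimal prometheus.yml tying the exporters together might look like the following; hostnames are placeholders, and the extra ports are the defaults used by dcgm-exporter (GPU metrics) and ipmi_exporter (BMC sensors).

# Minimal scrape configuration (hostnames and job layout are illustrative)
cat > prometheus.yml <<EOF
scrape_configs:
  - job_name: node
    static_configs:
      - targets: ['node01:9100', 'node02:9100']
  - job_name: dcgm
    static_configs:
      - targets: ['node01:9400']
  - job_name: ipmi
    static_configs:
      - targets: ['node01:9290']
EOF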

6.2 Containerized Workload Example

CUDA Development Container:

docker run -it --gpus all \
  -e NVIDIA_DRIVER_CAPABILITIES=compute,utility \
  nvidia/cuda:11.0-base nvidia-smi

# Expected Output:
# +-----------------------------------------------------------------------------+
# | NVIDIA-SMI 450.80.02    Driver Version: 450.80.02    CUDA Version: 11.0     |
# |-------------------------------+----------------------+----------------------+
# | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
# | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
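
Tying this back to the legacy machine-learning use case from section 2.5, one of the last TensorFlow 1.x GPU images makes a quick end-to-end check that CUDA inside a container can see the M60s; the image tag here is just one workable example.

# Should print True if the container runtime exposes the GPUs correctly
docker run --rm --gpus all tensorflow/tensorflow:1.15.5-gpu-py3 \
  python -c "import tensorflow as tf; print(tf.test.is_gpu_available())"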

6.3 Maintenance Procedures

Periodic Maintenance Checklist:

  1. Reapply thermal paste (roughly every 2 years)
  2. Clean air filters/radiators
  3. Validate backup power systems
  4. Reseat PCIe cards and inspect slots/connectors for wear
  5. Update firmware via vendor ISOs

7. Troubleshooting

7.1 Common Issues and Solutions

Problem: GPUs not detected after reboot
Fix: Check PCIe slot power limits in BIOS

Problem: Xeon Phi cards failing to initialize
Fix: Update MPSS stack and reset coprocessor:

sudo micctrl --resetdefaults mic0

Problem: System instability under load
Diagnosis:

# Check power supply metrics
ipmitool dcmi power reading
# Validate thermal margins
ipmitool sensor list | grep -E "Temp|Power"

7.2 Performance Tuning Diagnostics

Identify NUMA Issues:

numactl --hardware
lstopo-no-graphics
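
If numactl reports more than one node, pinning a heavy job to a single socket and its local memory is often the quickest win; for example:

# Run a build pinned to NUMA node 0's cores and memory
numactl --cpunodebind=0 --membind=0 make -j"$(nproc)"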

GPU Utilization Analysis:

nvidia-smi --query-gpu=utilization.gpu,utilization.memory --format=csv

8. Conclusion

The $150 Oak Ridge surplus find exemplifies how DevOps engineers can leverage decommissioned HPC hardware to build powerful self-hosted infrastructure. While requiring specialized knowledge in power systems, thermal management, and legacy hardware support, these systems provide unparalleled opportunities for:

  1. Experimenting with large-scale distributed systems
  2. Developing hardware-accelerated workloads
  3. Understanding enterprise-grade infrastructure constraints
  4. Building cost-effective training environments

For those embarking on similar projects, prioritize:

  • Electrical safety certifications
  • Component compatibility matrices
  • Long-term maintenance planning
  • Realistic performance expectations

When approached with proper technical diligence, government surplus hardware transforms from obsolete equipment into powerful DevOps learning platforms that bridge the gap between academic knowledge and production-grade infrastructure demands.

This post is licensed under CC BY 4.0 by the author.