Government Surplus Find: Resurrecting HPC Hardware for DevOps Infrastructure

1. Introduction

The Reddit post describing a $150 acquisition of 2,700 lbs of decommissioned HPC hardware from Oak Ridge National Laboratory perfectly illustrates both the opportunity and challenge of government surplus finds. For DevOps engineers and sysadmins building homelabs or self-hosted infrastructure, such discoveries represent potential gold mines of enterprise-grade hardware at scrap metal prices—if you can overcome the technical hurdles.

This scenario raises critical questions about:

  • Power requirements (480V three-phase vs residential 240V)
  • Proprietary hardware compatibility
  • Thermal management of high-density compute
  • Repurposing specialized accelerators (Xeon Phi, Tesla M60)
  • Enterprise component integration into modern DevOps workflows

In this comprehensive guide, we’ll examine:

  1. Technical assessment of surplus HPC components
  2. Power conversion strategies for home use
  3. Modern DevOps applications for legacy hardware
  4. Performance optimization techniques
  5. Real-world integration case studies

For engineers operating self-hosted environments, mastering these skills bridges the gap between enterprise-grade infrastructure and budget constraints while providing unparalleled hands-on experience with specialized hardware.

2. Understanding Government Surplus HPC Hardware

2.1 What is Government Surplus Computing Equipment?

Federal agencies like DOE national labs regularly decommission HPC systems through platforms like GSA Auctions. These typically include:

  • Compute nodes with enterprise CPUs (Xeon E5/E7 series)
  • Specialized accelerators (NVIDIA Tesla, Intel Xeon Phi)
  • High-speed interconnects (InfiniBand, Omni-Path)
  • Enterprise storage arrays
  • Custom cooling solutions

2.2 Technical Specifications Breakdown

Based on the Reddit post description, the hardware likely includes:

Component      Specifications                                  Modern Equivalent
CPUs           Dual Xeon E5-2600 v3/v4 (Haswell/Broadwell)     Xeon Silver 4210R
Memory         256GB DDR3 ECC REG                              DDR4 3200MHz RDIMM
GPUs           NVIDIA Tesla M60 (compute-focused)              NVIDIA T4
Coprocessors   Intel Xeon Phi 7120P (61 cores)                 Intel GPU Max Series
Storage        Likely SAS3 SSDs or NVMe                        U.2 NVMe Gen4
Networking     10GbE or InfiniBand QDR/FDR                     100Gb EDR InfiniBand

2.3 Advantages of Surplus HPC Hardware

  1. Cost Efficiency: 90-95% discount vs. new hardware
  2. Educational Value: Hands-on experience with enterprise architectures
  3. Performance Density: compute-per-dollar ratio remains competitive
  4. Sustainability: Prevents e-waste through reuse

2.4 Challenges and Limitations

  1. Power Requirements: 480V three-phase vs residential 240V split-phase
  2. Thermal Design Power (TDP): Typical 1U server = 400-800W
  3. Component Obsolescence: DDR3 memory, PCIe Gen3 limitations
  4. Driver/Software Support: Legacy accelerators may lack modern CUDA support

2.5 Modern DevOps Use Cases

Despite their age, these components remain viable for:

  • CI/CD build farms (CPU-intensive compilation)
  • Machine learning experimentation (TensorFlow 1.x/PyTorch legacy support)
  • Distributed storage testing (Ceph, MinIO)
  • Network stress testing
  • Hypervisor benchmarking

3. Prerequisites

3.1 Hardware Requirements

Minimum for Partial Utilization:

  • 240V 30A circuit (NEMA L6-30 receptacle)
  • 2000VA UPS with pure sine wave output
  • 4U+ server chassis with 120mm+ fan support
  • PCIe risers for GPU installation

Recommended for Full Rack:

  • Step-down transformer (480V→208V)
  • 30A 208V PDU
  • Liquid cooling loop (as an alternative to the stock 40mm fans)
  • Noise-dampened rack/enclosure

3.2 Software Compatibility

Component        Supported OS         Driver Requirements
Tesla M60        Linux Kernel 4.4+    NVIDIA 450.80.02+
Xeon Phi 7120P   RHEL/CentOS 7        MPSS 3.8.1
Intel C600 PCH   Any x86_64 OS        SATA/AHCI drivers

3.3 Safety Considerations

  1. Electrical Safety: Always verify input voltage before powering on
  2. Lifting Protocols: 2+ people for blades >30 lbs
  3. ESD Protection: Use grounded wrist straps
  4. Thermal Monitoring: IPMI setup mandatory before full load

3.4 Pre-Installation Checklist

  1. Verify component compatibility matrix
  2. Confirm BIOS/UEFI support for intended OS
  3. Prepare firmware recovery media
  4. Test PSUs with variable AC source
  5. Establish IPMI/KVM access (quick check below)
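
For item 5, a minimal reachability check with ipmitool looks like the following; the BMC address, credentials, and channel number are placeholders for whatever your nodes actually use.

# Placeholder BMC address and credentials -- substitute your own
BMC_IP=192.168.1.120

# Confirm the BMC answers and report chassis power state
ipmitool -I lanplus -H "$BMC_IP" -U admin -P admin chassis power status

# Print the BMC's network configuration (channel 1 is typical, but varies by vendor)
ipmitool -I lanplus -H "$BMC_IP" -U admin -P admin lan print 1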

4. Installation & Setup

4.1 Power Conversion Setup

For 480V→240V conversion:

# Calculate power requirements per node
ipmitool -H $BMC_IP -U admin -P admin dcmi power reading

# Transformer sizing example (3 nodes):
# Each node: 750W nominal, 900W peak
# Total: 3 × 900W = 2700W → 3kVA transformer minimum
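
To size against measured rather than nominal draw, a short loop can sum live DCMI readings across the nodes. The BMC addresses below are placeholders, and the awk field assumes ipmitool's usual "Instantaneous power reading" output line.

# Sum instantaneous power readings across several nodes
total=0
for bmc in 10.0.0.11 10.0.0.12 10.0.0.13; do
  watts=$(ipmitool -H "$bmc" -U admin -P admin dcmi power reading \
          | awk '/Instantaneous power reading/ {print $4}')
  echo "$bmc: ${watts}W"
  total=$((total + watts))
done
echo "Total instantaneous draw: ${total}W"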

4.2 Base Operating System Installation

Debian 11 (Bullseye) Minimal Install:

# Enable non-free repositories
sudo tee /etc/apt/sources.list.d/non-free.list <<EOF
deb http://deb.debian.org/debian bullseye main contrib non-free
EOF

# Refresh package lists, then install the standard kernel plus non-free firmware
# (the cloud kernel is trimmed for VMs and drops many bare-metal drivers)
sudo apt update
sudo apt install linux-image-amd64 firmware-misc-nonfree
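
After the first boot on the new kernel, it is worth checking that no firmware blobs failed to load and that every PCI device has a bound driver; a quick pass:

# Look for firmware the kernel tried and failed to load
sudo dmesg | grep -iE "firmware.*(fail|error|missing)"

# List PCI devices with their bound kernel drivers to spot anything unclaimed
lspci -k | grep -A2 -iE "ethernet|infiniband|3d|vga"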

4.3 Accelerator Drivers Installation

NVIDIA Tesla M60:

# Add the NVIDIA container toolkit repository (this provides nvidia-container-toolkit;
# the GPU driver itself comes from the Debian non-free repository enabled earlier)
curl -fsSL https://nvidia.github.io/nvidia-docker/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://nvidia.github.io/libnvidia-container/stable/deb/amd64 /" | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

# Refresh package lists, then install the driver and CUDA toolkit
sudo apt-get update
sudo apt-get install nvidia-driver nvidia-cuda-toolkit
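
Once the driver loads, a quick sanity check plus persistence mode (which keeps the driver initialized between jobs) is a sensible first step. Note that some M60s ship in graphics mode and need NVIDIA's separate gpumodeswitch tool flipped to compute mode before CUDA workloads will run.

# Confirm both GPUs on each M60 are visible to the driver
nvidia-smi -L

# Keep the driver loaded between workloads
sudo nvidia-smi -pm 1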

Intel Xeon Phi Coprocessor:

# Download the MPSS stack (officially supported on RHEL/CentOS 7 -- see section 3.2)
wget https://registrationcenter-download.intel.com/akdlm/IRC_NAS/19114/l_mpss_3.8.1_supp_offline_public.tgz

# Install dependencies
sudo apt install libtbb2 dkms libnuma1 openssh-client

# Unpack and install MPSS
tar xzf l_mpss_3.8.1_supp_offline_public.tgz
cd l_mpss_3.8.1*/
sudo ./INSTALL.sh --default
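
Assuming MPSS installed cleanly, the coprocessor service has to be running before the cards become usable; micctrl and micinfo ship with the MPSS stack and make a quick bring-up check:

# Start the MPSS service and confirm mic0 reaches the "online" state
sudo systemctl start mpss
sudo micctrl --status

# Dump card details (firmware version, core count, memory)
micinfo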

4.4 Infrastructure-as-Code Provisioning

Ansible Playbook for Node Setup:

---
- name: Configure HPC node
  hosts: all
  become: yes
  tasks:
    - name: Set performance governor
      ansible.builtin.shell: |
        echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

    - name: Configure hugepages
      sysctl:
        name: vm.nr_hugepages
        value: 1024
        state: present
        reload: yes

    - name: Install NVIDIA Docker runtime
      apt:
        name: nvidia-container-toolkit
        state: present
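
A minimal way to run this against the nodes, assuming the playbook is saved as hpc-node.yml (a hypothetical name) and SSH key access is already in place:

# Hypothetical inventory -- replace hostnames and addresses with your own nodes
cat > hpc-nodes.ini <<EOF
[hpc]
node01 ansible_host=10.0.0.21
node02 ansible_host=10.0.0.22
EOF

# Dry-run first, then apply
ansible-playbook -i hpc-nodes.ini hpc-node.yml --check
ansible-playbook -i hpc-nodes.ini hpc-node.yml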

5. Configuration & Optimization

5.1 BIOS Tuning Recommendations

Critical Settings:

  • NUMA: Enabled
  • Power Policy: Maximum Performance
  • C-States: Disabled
  • Turbo Boost: Enabled
  • VT-d/IOMMU: Enabled for GPU passthrough (verification sketch below)
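
Whether the IOMMU actually came up can be confirmed from Linux after the BIOS change (intel_iommu=on may also be needed on the kernel command line); a rough check:

# Confirm the kernel sees an active IOMMU
sudo dmesg | grep -iE "DMAR|IOMMU"

# List IOMMU groups -- a GPU intended for passthrough should sit in its own group
for g in /sys/kernel/iommu_groups/*/devices/*; do
  echo "group $(basename "$(dirname "$(dirname "$g")")"): $(basename "$g")"
done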

5.2 Thermal Management

Override the automatic fan curve with a static fan speed via IPMI (raw commands are vendor-specific; the ones below are commonly documented for Dell BMCs):

# Disable automatic fan control
ipmitool -H $BMC_IP -U admin -P admin raw 0x30 0x30 0x01 0x00

# Set a static fan speed as a hex percentage -- pick one
ipmitool -H $BMC_IP -U admin -P admin raw 0x30 0x30 0x02 0xff 0x28  # 40%
ipmitool -H $BMC_IP -U admin -P admin raw 0x30 0x30 0x02 0xff 0x4b  # 75%
ipmitool -H $BMC_IP -U admin -P admin raw 0x30 0x30 0x02 0xff 0x64  # 100%
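
Because a static override bypasses the BMC's own thermal protection, a small watchdog that reverts to automatic control when a sensor runs hot is a sensible companion. This is only a sketch: the 75°C threshold and the "re-enable" raw command (again Dell-style) are assumptions to adapt for your hardware.

# Revert to automatic fan control if any temperature sensor exceeds 75C
while sleep 30; do
  max=$(ipmitool -H "$BMC_IP" -U admin -P admin sensor list \
        | awk -F'|' '/Temp/ {gsub(/ /,"",$2); if ($2+0 > m) m=$2+0} END {print m}')
  if [ -n "$max" ] && [ "${max%.*}" -ge 75 ]; then
    ipmitool -H "$BMC_IP" -U admin -P admin raw 0x30 0x30 0x01 0x01  # re-enable auto control
    break
  fi
done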

5.3 Kubernetes Device Plugin Configuration

NVIDIA GPU Plugin DaemonSet:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      containers:
      - image: nvcr.io/nvidia/k8s-device-plugin:v0.12.3
        name: nvidia-device-plugin-ctr
        env:
        - name: FAIL_ON_INIT_ERROR
          value: "false"
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins
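
With the plugin registered, pods request GPUs through the nvidia.com/gpu resource. A throwaway smoke-test pod, applied here via a heredoc and reusing the same CUDA image as section 6.2:

# Schedule a one-shot pod that claims a single GPU and prints nvidia-smi output
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:11.0-base
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF

kubectl logs -f gpu-smoke-test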

5.4 Security Hardening Checklist

  1. Disable IPMI default credentials (see the snippet after this list)
  2. Implement VLAN segmentation
  3. Enable UEFI Secure Boot
  4. Configure TPM-backed disk encryption
  5. Restrict PCIe device access via IOMMU groups
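
For item 1, the stock BMC account can be rotated with ipmitool; user ID 2 is commonly the built-in administrator, but verify against the list output before changing anything.

# List BMC accounts on channel 1 to identify the default admin user ID
ipmitool -H "$BMC_IP" -U admin -P admin user list 1

# Set a new password for user ID 2 (adjust the ID to match the listing above)
ipmitool -H "$BMC_IP" -U admin -P admin user set password 2 'NewStrongPassphrase'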

6. Usage & Operations

6.1 Monitoring Stack Configuration

Prometheus Exporter Setup:

# Node Exporter for host-level metrics (node_exporter has no IPMI, GPU, or Xeon Phi
# collectors -- those come from separate exporters such as ipmi_exporter and dcgm-exporter)
docker run -d \
  --name node_exporter \
  --net="host" \
  --pid="host" \
  -v "/:/host:ro,rslave" \
  quay.io/prometheus/node-exporter:latest \
  --path.rootfs=/host \
  --collector.hwmon \
  --collector.nfs
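
A minimal prometheus.yml tying the exporters together might look like the following; hostnames are placeholders, and the extra ports are the defaults used by dcgm-exporter (GPU metrics) and ipmi_exporter (BMC sensors).

# Minimal scrape configuration (hostnames and job layout are illustrative)
cat > prometheus.yml <<EOF
scrape_configs:
  - job_name: node
    static_configs:
      - targets: ['node01:9100', 'node02:9100']
  - job_name: dcgm
    static_configs:
      - targets: ['node01:9400']
  - job_name: ipmi
    static_configs:
      - targets: ['node01:9290']
EOF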

6.2 Containerized Workload Example

CUDA Development Container:

docker run -it --gpus all \
  -e NVIDIA_DRIVER_CAPABILITIES=compute,utility \
  nvidia/cuda:11.0-base nvidia-smi

# Expected Output:
# +-----------------------------------------------------------------------------+
# | NVIDIA-SMI 450.80.02    Driver Version: 450.80.02    CUDA Version: 11.0     |
# |-------------------------------+----------------------+----------------------+
# | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
# | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
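
Tying this back to the legacy machine-learning use case from section 2.5, one of the last TensorFlow 1.x GPU images makes a quick end-to-end check that CUDA inside a container can see the M60s; the image tag here is just one workable example.

# Should print True if the container runtime exposes the GPUs correctly
docker run --rm --gpus all tensorflow/tensorflow:1.15.5-gpu-py3 \
  python -c "import tensorflow as tf; print(tf.test.is_gpu_available())"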

6.3 Maintenance Procedures

Periodic Maintenance Checklist:

  1. Reapply thermal paste (roughly every 2 years)
  2. Clean air filters/radiators
  3. Validate backup power systems
  4. Reseat PCIe cards and inspect slots/connectors for wear
  5. Update firmware via vendor ISOs

7. Troubleshooting

7.1 Common Issues and Solutions

Problem: GPUs not detected after reboot
Fix: Check PCIe slot power limits in BIOS

Problem: Xeon Phi cards failing to initialize
Fix: Update MPSS stack and reset coprocessor:

sudo micctrl --resetdefaults mic0

Problem: System instability under load
Diagnosis:

# Check power supply metrics
ipmitool dcmi power reading
# Validate thermal margins
ipmitool sensor list | grep -E "Temp|Power"

7.2 Performance Tuning Diagnostics

Identify NUMA Issues:

numactl --hardware
lstopo-no-graphics
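
If numactl reports more than one node, pinning a heavy job to a single socket and its local memory is often the quickest win; for example:

# Run a build pinned to NUMA node 0's cores and memory
numactl --cpunodebind=0 --membind=0 make -j"$(nproc)"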

GPU Utilization Analysis:

nvidia-smi --query-gpu=utilization.gpu,utilization.memory --format=csv

8. Conclusion

The $150 Oak Ridge surplus find exemplifies how DevOps engineers can leverage decommissioned HPC hardware to build powerful self-hosted infrastructure. While requiring specialized knowledge in power systems, thermal management, and legacy hardware support, these systems provide unparalleled opportunities for:

  1. Experimenting with large-scale distributed systems
  2. Developing hardware-accelerated workloads
  3. Understanding enterprise-grade infrastructure constraints
  4. Building cost-effective training environments

For those embarking on similar projects, prioritize:

  • Electrical safety certifications
  • Component compatibility matrices
  • Long-term maintenance planning
  • Realistic performance expectations

When approached with proper technical diligence, government surplus hardware transforms from obsolete equipment into powerful DevOps learning platforms that bridge the gap between academic knowledge and production-grade infrastructure demands.

This post is licensed under CC BY 4.0 by the author.