I’m Somewhat New At This: A DevOps Veteran’s Guide to Responsible Infrastructure Management

Introduction

The Reddit post showcasing a server cobbled together from spare parts and Erector Set components perfectly encapsulates a rite of passage in our field. This “Mad Max” approach to infrastructure, while temporarily functional, highlights critical challenges faced by engineers managing systems at any scale.

Homelabs and self-hosted environments serve as vital proving grounds for DevOps professionals. They offer safe spaces to experiment, fail, and learn. However, the gap between “it works on my machine” and production-grade infrastructure represents one of the most significant transitions in a DevOps career.

This guide aims to bridge that gap by showing how to turn scrappy prototypes into resilient systems. We’ll examine:

  1. Core principles of infrastructure management
  2. Systematic approaches to service deployment
  3. Security hardening techniques
  4. Operational best practices
  5. Scaling considerations

Whether you’re running a Raspberry Pi cluster or managing enterprise-grade hardware, these fundamentals separate temporary solutions from sustainable systems. Our journey begins by understanding what we’re actually building.

Understanding Infrastructure Management

What Is Modern Infrastructure Management?

Infrastructure management encompasses the processes, tools, and methodologies used to:

  • Provision computing resources
  • Maintain system availability
  • Ensure security compliance
  • Optimize performance
  • Manage costs
  • Enable scalability

In the context of our Erector Set server example, proper management would transform that temporary solution into a reliable platform.

Key Evolution Points

| Period | Characteristics | Tools |
|---|---|---|
| Pre-2010 | Physical hardware, manual configuration | Rack diagrams, shell scripts |
| 2010-2015 | Virtualization dominance | VMware, Hyper-V, Puppet |
| 2015-2020 | Cloud-native emergence | AWS, Terraform, Kubernetes |
| 2020-Present | Hybrid infrastructure, GitOps | ArgoCD, Crossplane, Pulumi |

Core Principles

  1. Idempotency: Applying the same configuration any number of times yields the same end state, regardless of initial conditions
  2. Immutable Infrastructure: Replace rather than modify running systems
  3. Declarative Configuration: Define desired state, not implementation steps
  4. Observability: Comprehensive monitoring, logging, and tracing
  5. Security by Design: Principle of least privilege at all layers
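
To make the first principle concrete, here is a minimal sketch of an idempotent operation in shell. The helper name `ensure_line` is hypothetical and used only for illustration; configuration management tools such as Ansible apply the same idea at a larger scale.

```shell
#!/usr/bin/env bash
# ensure_line (hypothetical helper): append a line to a file only when it is
# not already present, so repeated runs converge on the same file content.
ensure_line() {
  local file=$1 line=$2
  touch "$file"
  # -x: match the whole line, -F: fixed string, -q: no output
  if ! grep -qxF -- "$line" "$file"; then
    printf '%s\n' "$line" >> "$file"
  fi
}

# Running the helper twice leaves a single copy of the setting.
demo_file=$(mktemp)
ensure_line "$demo_file" "net.ipv4.tcp_syncookies=1"
ensure_line "$demo_file" "net.ipv4.tcp_syncookies=1"
wc -l < "$demo_file"
```

A plain `echo ... >> file` would add a duplicate on every run, which is the kind of drift idempotent tooling is meant to avoid.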

When DIY Makes Sense

| Scenario | Temporary Solution | Managed Approach |
|---|---|---|
| Proof of Concept | Bare metal prototyping | Cloud credits |
| Learning Environment | Scrap hardware lab | Free-tier cloud services |
| Legacy System Support | Temporary workaround | Official vendor support |

Prerequisites for Responsible Infrastructure

Hardware Considerations

While our Reddit friend used spare parts, sustainable infrastructure requires:

| Component | Minimum Recommendation | Production Standard |
|---|---|---|
| CPU | x86_64, 4 cores | EPYC/Xeon, 8+ cores |
| RAM | 8GB ECC | 32GB+ ECC |
| Storage | 256GB SSD | RAID 10 NVMe SSDs |
| Network | 1GbE | 10GbE with redundancy |
| Power | Single PSU | Dual redundant PSUs |

Software Foundation

  1. Operating System: Ubuntu 22.04 LTS (Linux kernel 5.15+)
  2. Virtualization: KVM/QEMU with libvirt
  3. Orchestration: Docker Engine 24.0+ with the containerd runtime
  4. Configuration Management: Ansible Core 2.15+

Security Pre-Checks

Before installation:

# Verify UEFI Secure Boot status
sudo mokutil --sb-state

# Check hardware virtualization support
LC_ALL=C lscpu | grep Virtualization

# Confirm kernel protection mechanisms
grep -e "slub_debug=" -e "page_poison=1" /proc/cmdline
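
The kernel check above prints nothing when neither option is set, which is easy to misread. As a rough sketch, the same check can report each parameter explicitly. The helper name `has_boot_param` is hypothetical, and the optional second argument exists only so the function can be pointed at a sample file.

```shell
#!/usr/bin/env bash
# has_boot_param (hypothetical helper): succeeds when the kernel command line
# contains a parameter that starts with the given prefix.
has_boot_param() {
  local prefix=$1 cmdline_file=${2:-/proc/cmdline}
  local -a params
  local param
  # Split the single-line command line into individual parameters
  read -r -a params < "$cmdline_file" || true
  for param in "${params[@]}"; do
    case $param in
      "$prefix"*) return 0 ;;
    esac
  done
  return 1
}

for p in slub_debug= page_poison=1; do
  if has_boot_param "$p"; then
    echo "$p present"
  else
    echo "$p not set"
  fi
done
```

Matching on the start of each parameter avoids false positives from substrings elsewhere on the command line.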

Network Requirements

| Port | Protocol | Purpose | Security |
|---|---|---|---|
| 22 | TCP | SSH Access | Key-based only |
| 80 | TCP | HTTP | Redirect to 443 |
| 443 | TCP | HTTPS | TLS 1.3 only |
| 9090 | TCP | Monitoring | VPN-only access |
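
One way to turn this port table into firewall policy is with `ufw`, sketched below. The script only prints the commands so they can be reviewed first. The `VPN_SUBNET` value is an assumed placeholder, and the SSH key and TLS requirements from the table are enforced in the SSH and web server configuration, not by the firewall.

```shell
#!/usr/bin/env bash
# Print (do not execute) ufw commands matching the port table.
# VPN_SUBNET is an assumed placeholder; replace it with your VPN's range.
VPN_SUBNET=${VPN_SUBNET:-10.8.0.0/24}

print_firewall_rules() {
  echo "ufw default deny incoming"
  echo "ufw default allow outgoing"
  local port
  for port in 22 80 443; do
    echo "ufw allow ${port}/tcp"
  done
  # Monitoring is reachable from the VPN range only
  echo "ufw allow from ${VPN_SUBNET} to any port 9090 proto tcp"
}

print_firewall_rules
```

After reviewing the output, the rules could be applied with `print_firewall_rules | sudo sh`, followed by `sudo ufw enable`.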

Installation & Configuration

Base OS Deployment

For Ubuntu 22.04 automated installs:

# Generate encrypted password for installer (mkpasswd ships in the whois package)
mkpasswd --method=SHA-512 --rounds=4096

# Create automated install config
# The quoted 'EOF' stops the shell from expanding the $ fields in the hash
cat > user-data <<'EOF'
#cloud-config
autoinstall:
  version: 1
  identity:
    hostname: devops-node01
    password: "$6$rounds=4096$NJy4rBpQHC$e..."
    username: sysadmin
EOF

# Build the seed image that the installer reads
sudo apt install cloud-image-utils
cloud-localds ./seed.img user-data
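
Before building the seed image, a quick sanity check on `user-data` can catch a mangled file, for example a password hash that lost its `$` fields to shell expansion. The following is a minimal sketch; `check_user_data` is a hypothetical helper, not part of cloud-init or the Ubuntu installer.

```shell
#!/usr/bin/env bash
# check_user_data (hypothetical helper): basic structural checks on an
# autoinstall user-data file. It does not validate the full schema.
check_user_data() {
  local file=$1
  # cloud-init treats the file as cloud-config only if the first line says so
  if [ "$(head -n 1 "$file")" != "#cloud-config" ]; then
    echo "missing #cloud-config header" >&2
    return 1
  fi
  if ! grep -q '^autoinstall:' "$file"; then
    echo "missing autoinstall section" >&2
    return 1
  fi
  # SHA-512 crypt hashes start with $6$
  if ! grep -Eq 'password: "\$6\$' "$file"; then
    echo "password is not a SHA-512 crypt hash" >&2
    return 1
  fi
}

# Usage: check_user_data user-data && cloud-localds ./seed.img user-data
```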

Docker Engine Setup

Avoid conflicts with legacy Docker installations:

# Clean removal of old versions
sudo apt remove docker docker-engine docker.io containerd runc

# Repository setup
sudo apt update
sudo apt install ca-certificates curl gnupg
sudo install -m 0755 -d /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
sudo chmod a+r /etc/apt/keyrings/docker.gpg

# Configure repository
echo \
  "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu \
  $(. /etc/os-release && echo "$VERSION_CODENAME") stable" | \
  sudo tee /etc/apt/sources.list.d/docker.list > /dev/null

# Install components
sudo apt update
sudo apt install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin

# Verify installation
sudo docker run --rm hello-world
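
Since this guide assumes Docker Engine 24.0 or newer, it can help to check the installed version explicitly. The sketch below relies on GNU `sort -V` for version ordering; `version_ge` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# version_ge (hypothetical helper): true when $1 >= $2 in version order.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

required="24.0"
if command -v docker >/dev/null 2>&1; then
  # Empty when the daemon is unreachable or the user lacks socket access
  installed=$(docker version --format '{{.Server.Version}}' 2>/dev/null)
  if [ -n "$installed" ] && version_ge "$installed" "$required"; then
    echo "Docker Engine $installed meets the $required+ requirement"
  else
    echo "Docker Engine ${installed:-unknown} does not meet $required+"
  fi
fi
```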

Security Hardening

  1. Container Isolation:
    
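    # Run nginx with a read-only root filesystem and minimal capabilities.
    # Content trust is a client-side setting, so it is set for the docker CLI
    # rather than passed into the container with -e.
    sudo env DOCKER_CONTENT_TRUST=1 docker run -it --rm \
      --name secured-container \
      --read-only \
      --tmpfs /run:rw,noexec,nosuid,size=65536k \
      --tmpfs /var/cache/nginx:rw,noexec,nosuid,size=65536k \
      --security-opt no-new-privileges \
      --cap-drop ALL \
      --cap-add CHOWN --cap-add SETGID --cap-add SETUID --cap-add NET_BIND_SERVICE \
      nginx:alpine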
    
  2. Daemon Configuration (/etc/docker/daemon.json):
    
    {
      "userns-remap": "default",
      "log-driver": "json-file",
      "log-opts": {
        "max-size": "10m",
        "max-file": "3"
      },
      "default-ulimits": {
        "nofile": {
          "Name": "nofile",
          "Hard": 65535,
          "Soft": 65535
        }
      },
      "icc": false,
      "live-restore": true,
      "no-new-privileges": true
    }
    

Configuration & Optimization

Network Stack Tuning

# Apply kernel parameters (tee runs as root; a plain >> redirect would not)
sudo tee -a /etc/sysctl.conf > /dev/null <<EOF
net.core.rmem_max=16777216
net.core.wmem_max=16777216
net.ipv4.tcp_rmem=4096 87380 16777216
net.ipv4.tcp_wmem=4096 65536 16777216
net.ipv4.tcp_congestion_control=cubic
net.ipv4.tcp_syncookies=1
net.ipv4.tcp_max_syn_backlog=8192
EOF

# Reload settings
sudo sysctl -p
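
After reloading, it is worth confirming that the running kernel reports the intended values. The sketch below compares `key=value` lines against the files under `/proc/sys`. The helper name `verify_sysctl` and its optional second argument (an alternative root directory, handy for dry runs) are assumptions for illustration.

```shell
#!/usr/bin/env bash
# verify_sysctl (hypothetical helper): compare "key=value" lines in a file
# with the live values under /proc/sys, or under another root for dry runs.
verify_sysctl() {
  local conf=$1 root=${2:-/proc/sys}
  local key want have status=0
  while IFS='=' read -r key want; do
    # Skip blank lines and comments
    case $key in ''|'#'*) continue ;; esac
    # net.core.rmem_max maps to net/core/rmem_max; the kernel separates
    # multi-value entries with tabs, so whitespace is normalized
    have=$(cat "$root/${key//.//}" 2>/dev/null | tr -s '[:space:]' ' ' | sed 's/ $//')
    if [ "$have" != "$want" ]; then
      echo "MISMATCH $key: want '$want', have '$have'"
      status=1
    fi
  done < "$conf"
  return "$status"
}

# Usage: verify_sysctl my-tuning.conf   (a file holding only the lines above)
```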

Storage Optimization

ZFS configuration for containers:

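# Create optimized zpool
sudo zpool create -f \
  -O compression=zstd-9 \
  -O atime=off \
  -O recordsize=1M \
  -O logbias=throughput \
  docker-pool \
  mirror /dev/sda /dev/sdb \
  mirror /dev/sdc /dev/sdd

# Create the dataset for Docker and mount it at Docker's data root
# (run with Docker stopped and /var/lib/docker empty)
sudo zfs create -o mountpoint=/var/lib/docker docker-pool/docker

# Configure Docker storage driver
# (this overwrites daemon.json; merge with any existing settings first)
sudo mkdir -p /etc/docker
sudo tee /etc/docker/daemon.json > /dev/null <<EOF
{
  "storage-driver": "zfs",
  "storage-opts": ["zfs.fsname=docker-pool/docker"]
}
EOF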

Monitoring Stack

Prometheus configuration for host metrics:

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 30s

scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100']

  - job_name: 'cadvisor'
    static_configs:
      - targets: ['localhost:8080']

  - job_name: 'docker'
    static_configs:
      - targets: ['localhost:9323']
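
A scrape job only produces data if its endpoint answers. The sketch below pulls the `host:port` targets out of `prometheus.yml` and probes each `/metrics` path with `curl`. The helper names are hypothetical, and the parsing is a simple pattern match that assumes the single-line `targets: [...]` layout used above. Note that the `docker` job needs the daemon's `metrics-addr` option set (for example to `127.0.0.1:9323`) in `/etc/docker/daemon.json`.

```shell
#!/usr/bin/env bash
# list_targets (hypothetical helper): print every host:port found on
# "- targets: [...]" lines of a Prometheus config.
list_targets() {
  grep -E '^[[:space:]]*-[[:space:]]*targets:' "$1" \
    | grep -oE '[A-Za-z0-9._-]+:[0-9]+'
}

# check_targets (hypothetical helper): probe each target's /metrics endpoint.
check_targets() {
  local target
  for target in $(list_targets "$1"); do
    if curl -fsS --max-time 3 "http://$target/metrics" >/dev/null 2>&1; then
      echo "UP   $target"
    else
      echo "DOWN $target"
    fi
  done
}

# Usage: check_targets prometheus.yml
```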

Usage & Operations

Daily Management

  1. Container Inspection:

    # List containers with proper formatting
    docker ps --format "table {{.ID}}\t{{.Names}}\t{{.Status}}\t{{.Ports}}"

    # Inspect container security settings
    # (SecurityOpt is assumed to be the field of interest)
    docker inspect "$CONTAINER_ID" --format '{{json .HostConfig.SecurityOpt}}'

  2. Resource Monitoring:

    # Live container metrics
    docker stats --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}\t{{.MemPerc}}"

    # Generate resource reports (fields assumed to mirror the table above)
    docker stats --no-stream --format \
      "{{.Name}};{{.CPUPerc}};{{.MemUsage}};{{.MemPerc}}" > container_stats.csv
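
A resource report becomes more useful with a simple filter on top of it. The sketch below assumes `container_stats.csv` holds semicolon-separated rows in the order name, CPU percent, memory usage, memory percent. Both that column layout and the helper name `high_cpu` are assumptions.

```shell
#!/usr/bin/env bash
# high_cpu (hypothetical helper): print the names of containers whose CPU
# column exceeds a threshold. Assumes rows of "name;cpu%;mem usage;mem%".
high_cpu() {
  local file=$1 threshold=${2:-80}
  awk -F';' -v limit="$threshold" '
    { cpu = $2; sub(/%/, "", cpu) }   # strip the percent sign
    cpu + 0 > limit + 0 { print $1 }  # numeric comparison
  ' "$file"
}

# Usage: high_cpu container_stats.csv 80
```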

Backup Strategy

  1. Stateful Container Backup:

    # Create consistent snapshot: stop the container, archive the host paths
    # of its mounts (assumed to be the data worth keeping), then restart it
    backup="/backups/$CONTAINER_NAMES-$(date +%s).tar.gz"
    docker stop "$CONTAINER_NAMES"
    tar czvf "$backup" \
      $(docker inspect --format='{{range .Mounts}}{{.Source}} {{end}}' "$CONTAINER_NAMES")
    docker start "$CONTAINER_NAMES"

    # Verify backup integrity
    if ! tar tzf "$backup" >/dev/null; then
      echo "Backup verification failed" | mail -s "Backup Alert" admin@example.com
    fi

  2. Versioned Configuration Backups:

    # Mirror infrastructure repos
    # (--mirror sets up the refspec that a later fetch needs; --bare alone does not)
    git clone --mirror ssh://git@github.com/yourorg/infra.git /backups/infra.git

    # Refresh the mirror daily and repack it
    cd /backups/infra.git
    git fetch --all
    git reflog expire --expire=now --all
    git gc --prune=now --aggressive
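
Timestamped archives accumulate quickly, so a backup routine usually needs a retention step as well. The following is a minimal sketch using `find`. The 14-day default and the helper name `prune_backups` are assumptions, not a recommendation for any particular retention policy.

```shell
#!/usr/bin/env bash
# prune_backups (hypothetical helper): delete .tar.gz archives in a directory
# that were last modified more than N full days ago, printing each path removed.
prune_backups() {
  local dir=$1 days=${2:-14}
  # -mtime +N counts whole 24-hour periods, so +14 means 15 days or older
  find "$dir" -maxdepth 1 -type f -name '*.tar.gz' -mtime +"$days" -print -delete
}

# Usage: prune_backups /backups 14
```

Running it with `-print` and without `-delete` first is a cheap way to preview what would be removed.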

Troubleshooting

Diagnostic Toolkit

  1. Network Analysis:

    # Container network inspection (the capture is saved to the current directory)
    docker run -it --rm --net container:$CONTAINER_NAMES \
      -v "$PWD":/captures nicolaka/netshoot \
      tcpdump -ni eth0 -w /captures/capture.pcap

    # DNS resolution testing
    docker run --rm --dns 8.8.8.8 alpine nslookup google.com

  2. Performance Diagnostics:

    # Linux perf tool in container context
    docker run --rm -it --privileged --pid=host alpine sh -c \
      'apk add perf && perf record -F 99 -a -g -- sleep 30'

  3. Log Correlation:
    
    # Aggregate container logs
    docker logs $CONTAINER_NAMES 2>&1 | grep -E 'ERR|WARN|CRIT' \
      | jq 'select(.level >= 30)' -c
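
When logs mix JSON and plain-text lines, `jq` stops at the first line it cannot parse. A rougher but more forgiving first pass is to count severity keywords. This sketch assumes the same `ERR`, `WARN`, and `CRIT` markers as the command above; `count_severities` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# count_severities (hypothetical helper): read log lines on stdin and print
# one "LEVEL count" line for each severity keyword that occurs.
count_severities() {
  grep -oE 'ERR|WARN|CRIT' | sort | uniq -c | awk '{ print $2, $1 }'
}

# Usage: docker logs "$CONTAINER_NAMES" 2>&1 | count_severities
```

Because the match is a plain substring, `ERROR` counts as `ERR`, and a line containing two keywords is counted twice.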
    

Common Issues

  1. Storage Driver Conflicts:
    
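    # Reset Docker storage. WARNING: this deletes all images, containers, and volumes
    sudo systemctl stop docker docker.socket
    # The glob must expand as root, since /var/lib/docker is not readable by other users
    sudo sh -c 'rm -rf /var/lib/docker/*'
    sudo systemctl start docker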
    
  2. DNS Resolution Failures:
    
    # Force Docker DNS configuration
    docker run --dns 1.1.1.1 --dns-search example.com alpine cat /etc/resolv.conf
    
  3. Resource Exhaustion:
    
    # Identify container memory hogs
    docker stats --no-stream --format \
      "{{.Name}}: {{.MemUsage}}" | sort -k2 -h
    

Conclusion

Our journey from Erector Set servers to professional infrastructure highlights key DevOps principles: reproducibility through automation, resilience through redundancy, and security through design. While makeshift solutions serve their purpose in learning environments, production systems demand disciplined approaches.

Key takeaways:

  1. Immutable, Idempotent Infrastructure: Treat servers as disposable cattle, not pets
  2. Observability First: You can’t manage what you can’t measure
  3. Security by Default: Least privilege applies at all layers
  4. Automation as Documentation: Executable runbooks > tribal knowledge

The difference between temporary and professional infrastructure isn’t the hardware cost; it’s the operational discipline applied to whatever resources you control. Whether managing a homelab or enterprise cluster, these principles remain universally applicable.

This post is licensed under CC BY 4.0 by the author.