I Forgot To Put The: A DevOps Cautionary Tale About Infrastructure Management

1. Introduction

That sinking feeling when your terminal cursor blinks innocently after you’ve just executed:

rm -rf ./*

…only to realize a half-second too late that you forgot the critical directory path in your command. This visceral experience - memorialized in countless Reddit horror stories and sysadmin war stories - represents one of the most universal yet preventable disasters in infrastructure management.

In self-hosted environments and homelabs where systems administrators often operate without enterprise-grade safeguards, the consequences of a mistyped command can be catastrophic. One errant space character or misplaced wildcard can erase years of carefully curated configurations, media libraries, or application data. As DevOps professionals managing critical infrastructure, we must engineer resilience against these human errors.

This comprehensive guide examines:

  • The technical anatomy of destructive commands like rm -rf
  • System hardening techniques to prevent accidental deletions
  • Recovery strategies when disaster strikes
  • Architectural approaches to make infrastructure self-healing
  • Enterprise-grade safeguards you can implement in homelabs

We’ll turn that gut-wrenching “I forgot to put the…” moment from a career lowlight into a learning opportunity, using practical system administration techniques that professional DevOps teams rely on.

2. Understanding the Problem Space

2.1 The Anatomy of Disaster

At its core, the infamous rm -rf incident combines three critical elements:

  1. Recursive Deletion (-r):
    Operates on directories and their contents recursively

  2. Force Flag (-f):
    Suppresses confirmation prompts and ignores nonexistent files

  3. Path Specification:
    The make-or-break component where a single character can determine infrastructure fate

When combined with shell globbing patterns (*, .*, /*), these parameters become increasingly dangerous:

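| Command | Effect | Risk Level |
|---|---|---|
| rm -rf /home/user/* | Deletes all visible files and directories in /home/user | High |
| rm -rf /home/user/.* | Deletes hidden entries (dotfiles) in /home/user | High |
| rm -rf /home/user /* | Stray space makes /* a separate argument: deletes /home/user and every top-level directory under / | Critical |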
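
The shell, not rm, expands the wildcard, so rm only ever receives a finished argument list. The following is a minimal sketch of that behaviour, assuming a throwaway directory from mktemp; the helper name show_args is hypothetical and nothing outside the scratch directory is deleted.

```shell
#!/usr/bin/env bash
# show_args (hypothetical helper): print each argument the shell would hand
# to a command, one per line, instead of running anything destructive.
show_args() { printf '%s\n' "$@"; }

# Build a scratch directory so nothing real is touched.
demo_dir="$(mktemp -d)"
mkdir "$demo_dir/user"
touch "$demo_dir/user/a.txt" "$demo_dir/user/b.txt"

echo "Intended (glob attached to the path):"
show_args "$demo_dir"/user/*

echo "Typo (stray space before the glob), number of arguments:"
# The path and /* are now separate words; /* expands to every top-level
# entry of the root filesystem.
show_args "$demo_dir"/user /* | wc -l

# Remove only the scratch directory created above.
rm -rf -- "$demo_dir"
```

Swapping rm for a harmless printer like this is the same idea as the "echo first" habit: it shows what the command would receive before anything is removed.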

2.2 Why Homelabs Are Vulnerable

Self-hosted environments present unique risk factors:

  1. Lack of Enterprise Safeguards:
    No enterprise backup systems, change controls, or approval workflows

  2. Mixed Criticality Workloads:
    Personal media shares coexist with critical services like home automation

  3. Experimental Mindset:
    Frequent system modifications increase “command fatigue”

  4. Resource Constraints:
    Limited hardware for implementing robust redundancy

2.3 Historical Context

The rm -rf disaster has deep roots in Unix history:

  • 1971: First rm command appears in Unix Version 1
  • 1979: Version 7 Unix documents rm with the -f, -r, and -i options
  • 1990s: GNU fileutils (later merged into coreutils) provides the rm used on most Linux systems
  • 2006: A Google employee reportedly deletes a production index with rm -rf /*
  • 2012: Linus Torvalds reportedly advocates making rm -rf / require extra flags
  • 2018: Kubernetes CVE-2018-1002105, a privilege-escalation flaw in the API server, shows how escalated access can lead to destructive commands

3. Prerequisites for Safe Operations

3.1 System Requirements

Implement these foundational safeguards before any dangerous operations:

  1. User Account Restrictions:
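    # Create a separate admin account instead of working as root
    sudo useradd -m -s /bin/bash admin
    sudo usermod -aG sudo admin
    # Edit sudoers safely and add, for example:
    #   admin ALL=(ALL) ALL, !/usr/bin/rm
    # A "!command" exclusion after ALL is easy to bypass (copy rm, start a
    # shell), so treat it as a speed bump, not a security boundary.
    sudo visudo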
    
  2. Filesystem Selection:
    Use copy-on-write filesystems with snapshots:
    • ZFS (zfs snapshot tank/home@daily)
    • Btrfs (btrfs subvolume snapshot /home /home/.snapshots/daily)
  3. Mandatory Access Control:
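    # /etc/apparmor.d/usr.bin.rm (sketch; test in complain mode before enforcing)
    #include <tunables/global>
    /usr/bin/rm {
      # Shared libraries and other basics rm needs in order to run
      #include <abstractions/base>
      # Allow and log deletions under /home
      audit /home/** rw,
      # Refuse hidden files, top-level entries, and system directories
      deny /**/.* w,
      deny /* w,
      deny /bin/** w,
      deny /sbin/** w,
      deny /usr/** w,
    }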
    

3.2 Pre-Execution Checklist

Before executing any destructive commands:

  1. Verify current directory with pwd && ls
  2. Test command with echo rm -rf * first
  3. Confirm backups are current (borg list /backup)
  4. Open second terminal session as fail-safe
  5. Set filesystem immutable flag (chattr +i critical_file)
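
Parts of this checklist can be folded into a small wrapper so they are not skipped under pressure. The following is a sketch only: confirm_delete is a hypothetical helper name, and it covers just the "where am I, what will go, are you sure" steps, not the backup check.

```shell
#!/usr/bin/env bash
# confirm_delete (hypothetical helper): show the working directory and the
# expanded target list, then require the operator to type "yes".
confirm_delete() {
    if [ "$#" -eq 0 ]; then
        echo "confirm_delete: no targets given" >&2
        return 2
    fi
    echo "Working directory: $(pwd)" >&2
    echo "About to delete $# item(s):" >&2
    printf '  %s\n' "$@" >&2
    printf 'Type "yes" to continue: ' >&2
    local reply
    # Treat end-of-input the same as a refusal.
    read -r reply || reply=""
    if [ "$reply" = "yes" ]; then
        rm -rf -- "$@"
    else
        echo "Aborted; nothing was deleted." >&2
        return 1
    fi
}
```

Because the glob is expanded before the function is called, the list it prints is exactly what rm would receive, for example: confirm_delete ./build/*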

4. System Hardening Against Accidental Deletion

4.1 Aliasing Dangerous Commands

Modify shell configuration to prevent footguns:

# ~/.bashrc or ~/.zshrc
alias rm='rm -I --preserve-root'
alias chmod='chmod --preserve-root'
alias chown='chown --preserve-root'
alias mv='mv -i'
alias cp='cp -i'
alias ln='ln -i'
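
Aliases like these only take effect in interactive shells, so scripts need their own guard. One common pattern is the ${var:?} expansion, which aborts when a variable is unset or empty instead of letting the path collapse to "/*". A minimal sketch, assuming a hypothetical helper named clean_dir:

```shell
#!/usr/bin/env bash
# clean_dir (hypothetical helper): empty a directory's visible contents
# without ever letting an empty variable turn the path into "/*".
clean_dir() {
    local target="${1:-}"
    # ${target:?} fails if target is unset or empty. It would exit a
    # non-interactive shell, so it is evaluated in a subshell here.
    ( : "${target:?clean_dir: target directory is empty}" ) || return 1
    # Refuse the root directory explicitly.
    if [ "$target" = "/" ]; then
        echo "clean_dir: refusing to operate on /" >&2
        return 1
    fi
    if [ ! -d "$target" ]; then
        echo "clean_dir: not a directory: $target" >&2
        return 1
    fi
    # The second :? is a last line of defence right next to the glob.
    rm -rf -- "${target:?}"/*
}
```

This sketch leaves dotfiles in place and does not normalise paths such as "//", so it is a starting point rather than a complete guard.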

4.2 Implementing a Trash System

Replace rm with trash-cli for recoverable deletions:

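# Install on Debian/Ubuntu
sudo apt install trash-cli

# Alias rm for interactive shells (use at your own risk: the trash-cli
# documentation advises against aliasing rm)
echo 'alias rm="trash-put"' >> ~/.bashrc

# List trashed files, then restore interactively
trash-list
trash-restore   # prompts for the number of the item to restore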
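
On systems where trash-cli is not available, the same idea can be approximated in a few lines of shell. This is a sketch under stated assumptions, not a replacement for trash-cli: soft_delete and the TRASH_DIR location are hypothetical names, and files with the same basename in one call would overwrite each other.

```shell
#!/usr/bin/env bash
# TRASH_DIR is an assumed holding location; override it as needed.
TRASH_DIR="${TRASH_DIR:-$HOME/.local/share/soft-trash}"

# soft_delete (hypothetical helper): move targets into a timestamped
# holding directory instead of unlinking them, and print that directory.
soft_delete() {
    local stamp dest
    stamp="$(date +%Y%m%d-%H%M%S)"
    # The shell PID keeps two calls within the same second apart.
    dest="$TRASH_DIR/$stamp.$$"
    mkdir -p "$dest" || return 1
    # mv keeps the data recoverable until the holding directory is purged.
    mv -- "$@" "$dest"/ || return 1
    echo "$dest"
}
```

If TRASH_DIR sits on a different filesystem than the target, mv falls back to copy-and-delete, which is slower and needs free space on the destination.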

4.3 Kernel-Level Protections

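Record changes to critical directories with auditd, which uses the kernel audit subsystem rather than inotify. This logs deletions for later review; it does not block them: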

# Monitor deletions in critical directories
sudo auditctl -w /home -p wa -k home_dir_changes
sudo auditctl -w /etc -p wa -k etc_changes

# Generate reports
sudo ausearch -k home_dir_changes -ts today

5. Recovery Strategies When Prevention Fails

5.1 Filesystem-Specific Recovery

Ext4:

# 1. Immediately unmount the filesystem (or remount it read-only) to stop writes
sudo umount /dev/sda1

# 2. Use extundelete; the output directory must be on a different filesystem
sudo extundelete /dev/sda1 --restore-all --output-dir /recovery

ZFS:

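# List available snapshots
zfs list -t snapshot

# Roll back to the last known good state. This discards every change made
# after the snapshot; add -r if newer snapshots exist and may be destroyed.
zfs rollback tank/home@autosnap_2023-10-15_04:00:00_daily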

5.2 Forensic Recovery Tools

  1. TestDisk:
    Recovers partition tables and boot sectors
    
    sudo testdisk /dev/sdb
    
  2. PhotoRec:
    File carving for 300+ file types
    
    sudo photorec /dev/sdb
    
  3. Scalpel:
    Customizable file carver
    # /etc/scalpel/scalpel.conf: extension, case-sensitive, max size, header, footer
    jpg y 200000000 \xff\xd8\xff\xe0\x00\x10 \xff\xd9
    

5.3 Backup Restoration Workflows

Implement the 3-2-1 backup rule (three copies of the data, on two different media, with one copy offsite) and verify the backups:

# BorgBackup example
borg create --stats --progress /backup::'{hostname}-{now}' ~

# Verify backup integrity
borg check /backup

# Restore the most recent archive into the current directory
# (run this from an empty restore location, not on top of live data)
borg extract /backup::"$(borg list --last 1 --short /backup)"
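
A backup that silently stopped running is as bad as no backup, so it helps to check freshness before any risky operation. The sketch below assumes you can obtain the newest archive's timestamp in epoch seconds from your backup tool; backup_is_fresh is a hypothetical helper, and the optional third argument exists only to make the check testable.

```shell
#!/usr/bin/env bash
# backup_is_fresh (hypothetical helper): succeed only if the newest backup
# is no older than the allowed age.
#   $1 = timestamp of the newest backup, in seconds since the epoch
#   $2 = maximum acceptable age in hours
#   $3 = current time in epoch seconds (optional; defaults to now)
backup_is_fresh() {
    local last="$1" max_hours="$2" now="${3:-$(date +%s)}"
    local age=$(( now - last ))
    # A negative age means a clock problem; treat it as not fresh.
    [ "$age" -ge 0 ] && [ "$age" -le $(( max_hours * 3600 )) ]
}
```

A wrapper could refuse to run a destructive command when backup_is_fresh fails, or send an alert from a scheduled job so a stalled backup is noticed before it is needed.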

6. Architectural Safeguards

6.1 Immutable Infrastructure Patterns

Convert volatile systems to immutable artifacts:

Docker:

FROM alpine
RUN apk add --no-cache critical_service
VOLUME /config
CMD ["critical_service"]

Systemd:

# /etc/systemd/system/critical.service
[Service]
ProtectSystem=strict
ReadWritePaths=/var/lib/critical
InaccessiblePaths=/home

6.2 Declarative Configuration Management

Ansible playbook snippet for safe permissions:

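- name: Harden file permissions
  hosts: all
  tasks:
    - name: Set recursive ownership and permissions
      ansible.builtin.file:
        path: /etc/critical_service   # hypothetical application config directory
        state: directory
        owner: root
        group: root
        # Capital X keeps directories traversable; a recursive 0644 would not
        mode: 'u=rwX,g=rX,o=rX'
        recurse: true
      check_mode: true   # dry run: report changes without applying them
      diff: true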

6.3 Infrastructure as Code Verification

Terratest validation pipeline:

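package test

import (
    "testing"

    "github.com/gruntwork-io/terratest/modules/terraform"
    "github.com/stretchr/testify/assert"
)

func TestStorageDeletion(t *testing.T) {
    terraformOptions := &terraform.Options{
        // Directory containing the Terraform configuration under test
        TerraformDir: "../examples/storage",
    }

    // Clean up afterwards; this fails if the provider enforces deletion protection
    defer terraform.Destroy(t, terraformOptions)
    terraform.InitAndApply(t, terraformOptions)

    // Verify deletion protection
    output := terraform.Output(t, terraformOptions, "deletion_protection")
    assert.Equal(t, "true", output)
}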

7. Psychological Safeguards for Operators

7.1 Command Line Mindfulness

Implement pre-execution pauses:

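# Add a 2-second delay (time to press Ctrl-C) before destructive commands
dangerous_commands=("rm" "dd" "mkfs" "fdisk")
for cmd in "${dangerous_commands[@]}"; do
    # The "function name" form avoids alias expansion of an aliased name such as rm
    eval "function $cmd { echo \"Running $cmd in 2s...\" >&2; sleep 2; command $cmd \"\$@\"; }"
done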

7.2 Shell Prompt Engineering

Make critical states visually obvious:

# ~/.bashrc
if [[ $EUID -eq 0 ]]; then
    PS1="\[\e[1;41m\][DANGER ROOT]\[\e[0m\] \w # "
else
    PS1="\[\e[1;33m\]\u@\h\[\e[0m\] \w \$ "
fi

8. Enterprise Patterns for Homelabs

8.1 Change Approval Workflows

Implement lightweight review processes using Git:

# Infrastructure change workflow
git checkout -b storage-change
# Make changes, then stage them
git add -A
git commit -m "Modify storage config"
git push origin storage-change
# Dry-run the change, then require pull request approval before applying
ansible-playbook --check site.yml

8.2 Canary Deployments

Stage changes before full rollout:

# Phase 1: Single node
ansible-playbook -l canary_node site.yml

# Monitor for 24 hours
journalctl -f -u critical_service

# Phase 2: Full rollout
ansible-playbook site.yml

9. When Disaster Strikes: Incident Response

9.1 Immediate Containment Protocol

  1. Freeze State:
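    # Stop further writes so deleted blocks are not overwritten (adjust the
    # mount point; this fails while files are still open for writing)
    sudo mount -o remount,ro /home
    # On remote systems, keep your SSH session open: stopping networking locks you out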
    
  2. Capture Forensic Evidence:
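    # Memory capture: /dev/mem is restricted on most modern kernels
    # (CONFIG_STRICT_DEVMEM), so use a dedicated tool such as LiME if RAM is needed

    # Disk image (write to a different physical disk than the one being imaged)
    sudo dd if=/dev/sda of=/evidence/sda.img bs=4M conv=noerror,sync status=progress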
    
  3. Damage Assessment:
    
    # Compare against known good state
    diff -qr /current /backup/last_known_good
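
For a deletion incident, the most useful part of that comparison is the list of paths that exist in the backup but not on the live system. A minimal sketch, assuming both trees are mounted locally; list_missing is a hypothetical helper and it does not handle filenames containing newlines.

```shell
#!/usr/bin/env bash
# list_missing (hypothetical helper): print paths that exist in the
# known-good copy but are absent from the live tree.
#   $1 = live directory, $2 = known-good backup directory
list_missing() {
    local live="$1" good="$2"
    # Walk the backup and test each relative path against the live tree.
    ( cd "$good" && find . -mindepth 1 -print ) | sort |
    while IFS= read -r rel; do
        # -L catches dangling symlinks, which -e reports as missing.
        [ -e "$live/$rel" ] || [ -L "$live/$rel" ] || printf '%s\n' "${rel#./}"
    done
}
```

The output can be redirected to a file and attached to the incident report, or used as the list of paths to restore.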
    

9.2 Post-Mortem Template

Incident Report: Accidental Deletion 2023-10-15

Timeline

  • 14:32: Command executed via SSH session
  • 14:33: Nagios alerts on service downtime
  • 14:37: Restore initiated from offsite backup

Root Causes

  1. Lack of rm alias hardening
  2. No confirmation prompt for recursive deletes
  3. Recent backup hadn’t completed due to disk full error

Corrective Actions

  • Implement safe-rm system-wide
  • Add backup monitoring to alert channel
  • Conduct monthly recovery drills

10. Conclusion

The “I forgot to put the…” moment remains an ever-present risk in infrastructure management, but through deliberate system design and operational practices, we can transform these near-disasters into valuable resilience tests. By implementing the technical safeguards, architectural patterns, and psychological practices outlined in this guide:

  1. Critical Systems Gain Immunity through immutable infrastructure patterns
  2. Human Errors Become Contained with filesystem safeguards and aliases
  3. Recovery Processes Become Routine via automated backup verification
  4. Operators Develop Resilience through mindfulness practices

The true measure of DevOps maturity isn’t preventing every mistake - that’s impossible - but rather creating systems where a single mistyped command can’t cascade into catastrophic failure. As you rebuild your NAS or reconfigure your homelab, let these principles guide you toward infrastructure that survives both hardware failures and human fallibility.

Further Learning Resources

  1. GNU Coreutils Manual
  2. ZFS Administration Guide
  3. BorgBackup Documentation
  4. Linux Audit Framework
  5. The Twelve-Factor App Methodology

Remember: The most powerful command in your terminal isn’t rm or dd - it’s sudo shutdown -h now when you recognize you’re about to make a catastrophic mistake. Sometimes walking away is the most professional recovery strategy of all.

This post is licensed under CC BY 4.0 by the author.