I Forgot To Put The: A DevOps Cautionary Tale About Infrastructure Management
1. Introduction
That sinking feeling when your terminal cursor blinks innocently after you’ve just executed:
```bash
rm -rf ./*
```
…only to realize a half-second too late that you forgot the critical directory path in your command. This visceral experience, retold in countless Reddit threads and sysadmin war stories, is one of the most universal yet preventable disasters in infrastructure management.
In self-hosted environments and homelabs, where administrators often operate without enterprise-grade safeguards, the consequences of a mistyped command can be catastrophic. One errant space character or misplaced wildcard can erase years of carefully curated configurations, media libraries, or application data. As DevOps professionals managing critical infrastructure, we must engineer resilience against these human errors.
This comprehensive guide examines:
- The technical anatomy of destructive commands like `rm -rf`
- System hardening techniques to prevent accidental deletions
- Recovery strategies when disaster strikes
- Architectural approaches to make infrastructure self-healing
- Enterprise-grade safeguards you can implement in homelabs
We’ll transform that gut-wrenching “I forgot to put the…” moment from a career lowlight into a valuable learning opportunity through practical system administration techniques used by professional DevOps teams.
2. Understanding the Problem Space
2.1 The Anatomy of Disaster
At its core, the infamous `rm -rf` incident combines three critical elements:
- Recursive Deletion (`-r`): Operates on directories and their contents recursively
- Force Flag (`-f`): Suppresses confirmation prompts and ignores nonexistent files
- Path Specification: The make-or-break component where a single character can determine infrastructure fate
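When combined with shell globbing patterns (`*`, `.*`, `/*`), these parameters become increasingly dangerous:

| Command | Effect | Risk Level |
|---|---|---|
| `rm -rf /home/user/*` | Deletes every visible (non-hidden) file and subdirectory in `/home/user` | High |
| `rm -rf /home/user/.*` | Deletes all hidden files and directories (dotfiles) in `/home/user` | High |
| `rm -rf /home/user /*` | The stray space makes `/*` a separate argument: deletes `/home/user`, then everything under `/` | Critical |

GNU `rm` refuses a literal `/` by default (`--preserve-root`), but that check does not cover `/*`. The shell expands `/*` into `/bin`, `/etc`, `/home` and every other top-level directory before `rm` ever sees it.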
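The stray-space row is easier to see than to describe. The minimal sketch below uses a hypothetical `show_args` helper that prints each argument the shell would hand to `rm`, one per line, without running anything:

```bash
#!/usr/bin/env bash
# show_args: hypothetical helper that prints every argument it receives,
# one per line in brackets. Nothing is deleted; it only reveals how the shell
# split words and expanded globs before the command would have run.
show_args() {
  printf '[%s]\n' "$@"
}

# With the stray space, /* becomes its own arguments: /bin, /boot, /etc, ...
show_args rm -rf /home/user /*
```

Each bracketed line is a separate argument. A `[/home/user]` line followed by `[/bin]`, `[/boot]` and so on shows that `rm` would receive every top-level directory as a deletion target.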
2.2 Why Homelabs Are Vulnerable
Self-hosted environments present unique risk factors:
- Lack of Enterprise Safeguards: No enterprise backup systems, change controls, or approval workflows
- Mixed Criticality Workloads: Personal media shares coexist with critical services like home automation
- Experimental Mindset: Frequent system modifications increase “command fatigue”
- Resource Constraints: Limited hardware for implementing robust redundancy
2.3 Historical Context
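The `rm -rf` disaster has deep roots in Unix history:
- 1971: The first `rm` command appears in Unix Version 1
- 1979: Version 7 Unix’s `rm` already supports the recursive (`-r`), force (`-f`) and interactive (`-i`) flags
- 2015: A Steam for Linux bug runs `rm -rf "$STEAMROOT/"*` after `STEAMROOT` ends up empty, deleting users’ files
- 2017: A GitLab.com engineer runs `rm -rf` on the primary database’s data directory instead of a replica’s, losing about six hours of production data
- 2018: Kubernetes CVE-2018-1002105, a critical API server privilege escalation, shows how a single flaw in infrastructure tooling can expose a whole cluster to destructive actions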
3. Prerequisites for Safe Operations
3.1 System Requirements
Implement these foundational safeguards before any dangerous operations:
- User Account Restrictions:
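```bash
# Create restricted admin user
sudo useradd -m -s /bin/bash admin
sudo usermod -aG sudo admin
# Run visudo and add the line in the trailing comment. sudo's own manual warns
# that "!" negation is easy to bypass (e.g. copying rm under another name),
# so treat this as a speed bump, not a security boundary.
sudo visudo   # Add: admin ALL=(ALL) NOPASSWD: ALL, !/usr/bin/rm
```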
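- Filesystem Selection: Use copy-on-write filesystems with snapshots:
  - ZFS: `zfs create -o copies=2 tank/home` stores redundant block copies, which guards against corruption but not deletion; take the actual snapshots with `zfs snapshot tank/home@daily`
  - Btrfs: `btrfs subvolume snapshot -r /home /home/.snapshots/daily` (`-r` makes the snapshot read-only)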
- Mandatory Access Control:
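```
# /etc/apparmor.d/usr.bin.rm: load with "sudo apparmor_parser -r <file>".
# Try complain mode (aa-complain) first: package maintainer scripts call rm too.
#include <tunables/global>

/usr/bin/rm {
  #include <abstractions/base>
  /** r,                 # rm must read/stat paths to traverse them
  audit /home/** rw,     # deletions allowed (and logged) only under /home;
                         # in enforce mode anything not allowed is denied
  deny /**/.* w,         # deny rules win, so dotfiles stay protected even in /home
  deny /* w,
  deny /bin/** w,
  deny /sbin/** w,
  deny /usr/** w,
}
```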
3.2 Pre-Execution Checklist
Before executing any destructive command:
- Verify the current directory with `pwd && ls`
- Preview the command with `echo rm -rf *` first, which prints exactly what the glob expands to
- Confirm backups are current (`borg list /backup`)
- Open a second terminal session as a fail-safe
- Set the immutable flag on critical files (`sudo chattr +i critical_file`); until `chattr -i` clears it, not even root can delete them
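The checklist catches typos at an interactive prompt, but scripts need the same discipline. The classic “forgot to put the path” bug is `rm -rf "$dir"/*` with an empty `$dir`, which silently becomes `rm -rf /*`. The minimal sketch below guards against it with the `${var:?}` expansion, the fix ShellCheck suggests in its SC2115 warning. `clean_dir` is a hypothetical helper, not part of any tool:

```bash
#!/usr/bin/env bash
# clean_dir: hypothetical helper that empties a directory's visible contents.
# Guards against the classic bug where an empty $dir turns "$dir"/* into /*.
clean_dir() {
  local dir=$1
  # ${dir:?} below does not catch "/" itself, so refuse it explicitly
  if [[ -n $dir && $(realpath -m -- "$dir") == / ]]; then
    echo "clean_dir: refusing to empty /" >&2
    return 1
  fi
  # If dir is unset or empty, ${dir:?...} aborts before rm ever runs.
  # In a script this stops the whole script, which is what you want here.
  rm -rf -- "${dir:?clean_dir: no directory given}"/*
}
```

At an interactive prompt the failed expansion aborts only the current command. In a non-interactive script the shell exits, so later steps never run against the wrong target.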
4. System Hardening Against Accidental Deletion
4.1 Aliasing Dangerous Commands
Modify shell configuration to prevent footguns:
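```bash
# ~/.bashrc or ~/.zshrc
# -I prompts once before removing more than three files or removing recursively.
# Caveat: a later -f (as in "rm -rf") cancels -I, and "\rm" bypasses the alias.
alias rm='rm -I --preserve-root'
alias chmod='chmod --preserve-root'
alias chown='chown --preserve-root'
alias mv='mv -i'
alias cp='cp -i'
alias ln='ln -i'
```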
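Aliases only apply in interactive shells and depend on the flags the operator types. A wrapper that inspects the resolved targets instead can catch the stray-space case from section 2.1 whatever flags are used. The sketch below is a hypothetical `guarded_rm` function in the spirit of the `safe-rm` package named in the post-mortem template. It is not that package's code, and `PROTECTED_PATHS` is an assumed configuration variable:

```bash
#!/usr/bin/env bash
# guarded_rm: hypothetical safe-rm-style wrapper (not the safe-rm package).
# Refuses "/", every top-level directory (what a stray "rm -rf /*" expands
# to), and anything listed in PROTECTED_PATHS; otherwise calls the real rm.
PROTECTED_PATHS=("$HOME" /etc /var/lib)   # assumed defaults; adjust per host

guarded_rm() {
  local arg resolved p end_of_opts=0
  for arg in "$@"; do
    # Skip option flags (-r, -f, ...) until a literal "--" ends option parsing
    if (( ! end_of_opts )); then
      if [[ $arg == -- ]]; then end_of_opts=1; continue; fi
      if [[ $arg == -* ]]; then continue; fi
    fi
    # Canonicalise the target without requiring it to exist (GNU realpath -m)
    resolved=$(realpath -m -- "$arg")
    if [[ $resolved == / || $(dirname -- "$resolved") == / ]]; then
      echo "guarded_rm: refusing to remove top-level path $resolved" >&2
      return 1
    fi
    for p in "${PROTECTED_PATHS[@]}"; do
      if [[ $resolved == "$(realpath -m -- "$p")" ]]; then
        echo "guarded_rm: $resolved is protected" >&2
        return 1
      fi
    done
  done
  command rm "$@"
}
```

The check runs before any deletion. A single refused target therefore aborts the whole command rather than deleting the safe targets and skipping the rest. That is deliberate, because a mixed argument list like `/home/user /*` is exactly the mistake being caught.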
4.2 Implementing a Trash System
Replace `rm` with trash-cli for recoverable deletions:
```bash
# Install on Debian/Ubuntu
sudo apt install trash-cli
# Replace rm in interactive shells (use at your own risk!)
echo 'alias rm="trash-put"' >> ~/.bashrc
# List trashed files with their original paths and deletion dates
trash-list
# Restore interactively: trash-restore lists candidates and prompts for a number
trash-restore
```
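trash-cli implements the FreeDesktop.org trash specification. The underlying idea is simply “move, don’t unlink”. On machines where trash-cli is not installed, a hypothetical `to_trash` function can sketch that idea. `TRASH_DIR` and the folder layout are assumptions, and the result is not visible to `trash-list` or `trash-restore`:

```bash
#!/usr/bin/env bash
# to_trash: hypothetical stand-in for trash-put. Each call moves its arguments
# into a fresh timestamped folder under TRASH_DIR instead of unlinking them.
# TRASH_DIR is an assumed location; this does NOT follow the FreeDesktop spec.
TRASH_DIR=${TRASH_DIR:-$HOME/.local/share/to_trash}

to_trash() {
  local dest
  (( $# > 0 )) || { echo "usage: to_trash FILE..." >&2; return 2; }
  mkdir -p -- "$TRASH_DIR" || return 1
  # mktemp -d guarantees a unique folder even for several calls per second
  dest=$(mktemp -d "$TRASH_DIR/$(date +%Y%m%d-%H%M%S).XXXXXX") || return 1
  mv -- "$@" "$dest"/
}
```

Restoring is an ordinary `mv` out of the timestamped folder. Emptying the trash becomes a deliberate `rm -rf` on a folder you have already inspected.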
4.3 Kernel-Level Protections
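The Linux audit subsystem will not block a deletion. What auditd does is record which user and process removed what, which speeds up both recovery and the post-mortem:
```bash
# Watch writes and attribute changes (deletions included) under critical directories
sudo auditctl -w /home -p wa -k home_dir_changes
sudo auditctl -w /etc -p wa -k etc_changes
# auditctl rules vanish on reboot; persist them in /etc/audit/rules.d/*.rules
# Generate reports
sudo ausearch -k home_dir_changes -ts today
```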
5. Recovery Strategies When Prevention Fails
5.1 Filesystem-Specific Recovery
Ext4:
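```bash
# 1. Immediately unmount the filesystem so freed blocks are not overwritten
#    (if it cannot be unmounted, remount it read-only: sudo mount -o remount,ro <mountpoint>)
sudo umount /dev/sda1
# 2. Use extundelete; /recovery must be on a different disk than /dev/sda1
sudo extundelete /dev/sda1 --restore-all --output-dir /recovery
```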
ZFS:
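```bash
# List available snapshots
zfs list -t snapshot
# Non-destructive: copy files back out of the read-only snapshot directory
# (assumes the default mountpoint /tank/home; lost_file is a placeholder)
cp -a /tank/home/.zfs/snapshot/autosnap_2023-10-15_04:00:00_daily/lost_file /tank/home/
# Roll back to the last known good state. This only works on the newest
# snapshot unless you add -r, which destroys every later snapshot.
zfs rollback tank/home@autosnap_2023-10-15_04:00:00_daily
```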
5.2 Forensic Recovery Tools
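- TestDisk: Recovers lost partition tables and boot sectors
```bash
sudo testdisk /dev/sdb
```
- PhotoRec: File carving for 300+ file types
```bash
sudo photorec /dev/sdb
```
- Scalpel: Customizable file carver driven by signature entries in its config file
```
# /etc/scalpel/scalpel.conf entry: extension, case-sensitive, max size (bytes), header, footer
jpg     y       20000000        \xff\xd8\xff\xe0\x00\x10        \xff\xd9
# Run (write output to another disk):
# sudo scalpel -c /etc/scalpel/scalpel.conf -o /recovery/scalpel /dev/sdb
```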
5.3 Backup Restoration Workflows
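Implement the 3-2-1 backup rule (three copies of your data, on two different media, with one copy offsite) and verify every backup:
```bash
# BorgBackup example: archive $HOME into the repository at /backup
borg create --stats --progress /backup::'{hostname}-{now}' ~
# Verify repository and archive consistency (add --verify-data for a full read-back)
borg check /backup
# Restore the most recent archive; borg extract writes into the current directory
cd /restore   # placeholder: an empty target directory
borg extract /backup::"$(borg list --last 1 --short /backup)"
```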
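The post-mortem template in section 9.2 lists a backup that silently failed on a full disk. A verified backup is only useful if someone notices when verification stops happening. The sketch below makes two assumptions: the job touches a stamp file only after `borg create` and `borg check` both succeed, and a check then alerts when that stamp goes stale. `run_backup`, `backup_is_fresh` and the stamp path are hypothetical names:

```bash
#!/usr/bin/env bash
# Hypothetical helpers: the stamp file is touched only when both borg steps
# succeed, so a stale stamp means the last backup failed or never ran.
run_backup() {
  local repo=$1 stamp=$2
  borg create --stats "$repo::{hostname}-{now}" "$HOME" &&
    borg check "$repo" &&
    touch -- "$stamp"
}

# Returns 0 if the stamp is younger than max_age_hours, 1 otherwise
backup_is_fresh() {
  local stamp=$1 max_age_hours=$2 age
  [[ -f $stamp ]] || return 1                          # never succeeded
  age=$(( $(date +%s) - $(stat -c %Y -- "$stamp") ))   # GNU stat: mtime in epoch seconds
  (( age < max_age_hours * 3600 ))
}
```

A cron entry such as `backup_is_fresh /backup/.last_success 26 || logger -p user.err "borg backup is stale"` gives a daily job a two-hour grace period. Its output can then be routed to whatever alert channel you already watch.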
6. Architectural Safeguards
6.1 Immutable Infrastructure Patterns
Convert volatile systems to immutable artifacts:
Docker:
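```dockerfile
FROM alpine
# critical_service is a placeholder package name
RUN apk add --no-cache critical_service
# Keep mutable state in a volume so the image itself stays disposable
VOLUME /config
CMD ["critical_service"]
# Run with a read-only root filesystem: docker run --read-only -v config:/config <image>
```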
Systemd:
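```ini
# /etc/systemd/system/critical.service (excerpt)
[Service]
# Mount the entire filesystem read-only for this service (except /dev, /proc, /sys)...
ProtectSystem=strict
# ...then re-allow writes to its own state directory only
ReadWritePaths=/var/lib/critical
# Make users' home directories invisible to the service
InaccessiblePaths=/home
```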
6.2 Declarative Configuration Management
Ansible playbook snippet for safe permissions:
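```yaml
- name: Harden file permissions
  hosts: all
  become: true
  tasks:
    - name: Set recursive ownership and permissions
      ansible.builtin.file:
        # Placeholder path: never recurse over all of /etc, because files such as
        # /etc/shadow, /etc/sudoers and SSH host keys need stricter modes
        path: /etc/critical_service
        state: directory   # recurse only applies with state: directory
        owner: root
        group: root
        mode: 'u=rwX,g=rX,o=rX'   # capital X: search bit on directories (and already-executable files)
        recurse: yes
      check_mode: yes   # dry run; remove once the diff looks right
      diff: yes
```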
6.3 Infrastructure as Code Verification
Terratest validation pipeline:
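```go
package test

import (
	"testing"

	"github.com/gruntwork-io/terratest/modules/terraform"
	"github.com/stretchr/testify/assert"
)

func TestStorageDeletion(t *testing.T) {
	terraformOptions := &terraform.Options{
		// Module under test; it must expose a "deletion_protection" output
		TerraformDir: "../examples/storage",
	}
	// Tear down the test infrastructure even if an assertion fails
	defer terraform.Destroy(t, terraformOptions)
	terraform.InitAndApply(t, terraformOptions)

	// Verify deletion protection
	output := terraform.Output(t, terraformOptions, "deletion_protection")
	assert.Equal(t, "true", output)
}
```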
7. Psychological Safeguards for Operators
7.1 Command Line Mindfulness
Implement pre-execution pauses:
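```bash
# Add 2-second delay (Ctrl-C aborts) to destructive commands
dangerous_commands=("rm" "dd" "mkfs" "fdisk")
for cmd in "${dangerous_commands[@]}"; do
  # The "function" keyword avoids a syntax error if an alias such as rm='rm -I' exists
  eval "function $cmd { echo \"$cmd: running in 2s, Ctrl-C to abort\" >&2; sleep 2; command $cmd \"\$@\"; }"
done
```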
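A delay helps only if the operator is looking at the right machine. A stronger pause is to make the operator type the hostname before a destructive command runs, a simplified take on what the molly-guard package does for shutdown and reboot over SSH. `confirm_host` and `careful` are hypothetical helpers for illustration:

```bash
#!/usr/bin/env bash
# confirm_host: hypothetical helper; succeeds only if the operator types this
# machine's hostname ($HOSTNAME is set by bash itself).
confirm_host() {
  local answer
  read -r -p "Type this machine's hostname to continue: " answer || return 1
  [[ $answer == "$HOSTNAME" ]]
}

# careful: run the given command only after confirm_host succeeds
careful() {
  if confirm_host; then
    "$@"
  else
    echo "careful: hostname mismatch, not running: $*" >&2
    return 1
  fi
}
```

Usage looks like `careful zfs rollback tank/home@autosnap_2023-10-15_04:00:00_daily`. Typing the name forces you to notice which host the SSH session is actually attached to.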
7.2 Shell Prompt Engineering
Make critical states visually obvious:
```bash
# ~/.bashrc
if [[ $EUID -eq 0 ]]; then
  PS1="\[\e[1;41m\][DANGER ROOT]\[\e[0m\] \w # "
else
  PS1="\[\e[1;33m\]\u@\h\[\e[0m\] \w \$ "
fi
```
8. Enterprise Patterns for Homelabs
8.1 Change Approval Workflows
Implement lightweight review processes using Git:
```bash
# Infrastructure change workflow
git checkout -b storage-change
# Make changes, then review each hunk before staging it
git add -p
git commit -m "Modify storage config"
git push origin storage-change
# Require pull request approval before applying
ansible-playbook --check --diff site.yml
```
8.2 Canary Deployments
Stage changes before full rollout:
```bash
# Phase 1: Single node
ansible-playbook -l canary_node site.yml
# Monitor for 24 hours
journalctl -f -u critical_service
# Phase 2: Full rollout
ansible-playbook site.yml
```
9. When Disaster Strikes: Incident Response
9.1 Immediate Containment Protocol
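- Freeze State:
```bash
# Stop services that write to the affected filesystem (placeholder name)
sudo systemctl stop critical_service
# Remount read-only so freed blocks are not overwritten by new writes
sudo mount -o remount,ro /home
# On remote systems keep the network up: stopping it would cut your own SSH session
```
- Capture Forensic Evidence:
```bash
# Memory capture: most modern kernels restrict /dev/mem (CONFIG_STRICT_DEVMEM),
# so this usually fails past the first megabyte; prefer a dedicated acquisition tool
sudo dd if=/dev/mem of=/evidence/mem.dump bs=1M count=1024
# Disk image: /evidence must be on a different disk than the one being imaged
sudo dd if=/dev/sda of=/evidence/sda.img conv=noerror,sync status=progress
```
- Damage Assessment:
```bash
# Compare against known good state
diff -qr /current /backup/last_known_good
```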
9.2 Post-Mortem Template
```
Incident Report: Accidental Deletion 2023-10-15

Timeline
- 14:32: Command executed via SSH session
- 14:33: Nagios alerts on service downtime
- 14:37: Restore initiated from offsite backup

Root Causes
- Lack of rm alias hardening
- No confirmation prompt for recursive deletes
- Recent backup hadn’t completed due to a disk-full error

Corrective Actions
- Implement safe-rm system-wide
- Add backup monitoring to the alert channel
- Conduct monthly recovery drills
```
10. Conclusion
The “I forgot to put the…” moment remains an ever-present risk in infrastructure management, but through deliberate system design and operational practices, we can transform these near-disasters into valuable resilience tests. By implementing the technical safeguards, architectural patterns, and psychological practices outlined in this guide:
- Critical systems gain protection through immutable infrastructure patterns
- Human errors stay contained thanks to filesystem safeguards and aliases
- Recovery processes become routine via automated backup verification
- Operators develop resilience through mindfulness practices
The true measure of DevOps maturity isn’t preventing every mistake (that’s impossible) but creating systems where a single mistyped command can’t cascade into catastrophic failure. As you rebuild your NAS or reconfigure your homelab, let these principles guide you toward infrastructure that survives both hardware failures and human fallibility.
Remember: the most powerful command in your terminal isn’t `rm` or `dd`. It’s `sudo shutdown -h now` when you recognize you’re about to make a catastrophic mistake. Sometimes walking away is the most professional recovery strategy of all.