
AWS Is Down: Who's Laughing Right Now?

1. Introduction

When AWS US-EAST-1 stumbles, half the internet collapses. Docker builds fail, DynamoDB APIs vanish, and engineers scramble to explain why “IT IS ALWAYS DNS.” Yet in homelabs worldwide, $15/month self-hosted services hum along untouched. This is the reality of modern infrastructure centralization – and the growing rebellion against it.

For DevOps engineers and sysadmins, today’s outages are tomorrow’s resume-generating events. This guide dissects why decentralized infrastructure matters, how to build outage-resistant systems, and why your side-project VPS might outlive AWS Region-wide failures. We’ll explore:

  • The fragility of hyperscale dependency (as demonstrated by the June 2024 DynamoDB DNS outage)
  • Battle-tested self-hosting patterns for critical services
  • Multi-cloud mitigation strategies that don’t require enterprise budgets
  • DNS resilience techniques beyond “just use Route53”

By the end, you’ll transform from cloud consumer to infrastructure contrarian – the engineer laughing when status pages turn red.

2. Understanding Decentralized Infrastructure

What Is Self-Hosted Resilience?

Self-hosting critical services means maintaining operational control when:

  • Cloud provider DNS fails (AWS us-east-1, June 2024)
  • Container registries become unavailable (Docker Hub during AWS outages)
  • Region-specific APIs go dark

Historical Context: The 2016 Dyn DNS attack, the 2021 Fastly outage, and the 2024 AWS DNS failure all show that single-point dependencies risk internet-wide disruption.

Key Advantages of Decentralization

| Centralized Cloud | Self-Hosted Alternative |
|---------------------|---------------------------|
| Single DNS provider | Unbound + DNS-over-HTTPS |
| Managed DynamoDB | SQLite/PostgreSQL on NVMe |
| S3 blob storage | MinIO with WAL-G backups |
| ECR container registry | Harbor with Redis cache |

Real-World Example: The Redditor's $15/month Immich instance survived the AWS outage because (a quick audit sketch follows this list):

  1. No dependency on AWS DNS resolvers
  2. Local Docker image cache
  3. Stateless service design
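
To check the same three properties on your own box, here is a rough audit sketch (the container and image names are illustrative, not the Redditor's actual setup):

# 1. DNS: the host should point at a local resolver, not a cloud-provider endpoint
cat /etc/resolv.conf

# 2. Image cache: the Immich images should already be present locally
docker images --format '{{.Repository}}:{{.Tag}}' | grep -i immich

# 3. Statelessness: only named volumes should carry state
docker inspect --format '{{range .Mounts}}{{.Type}} {{.Destination}}{{"\n"}}{{end}}' immich-server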

When Decentralization Becomes Liability

Counterintuitively, self-hosting increases availability only when you:

  • Implement automated patching (unattended-upgrades; see the sketch below)
  • Run monitoring equivalent to CloudWatch (Prometheus + Grafana)
  • Maintain tested backups (Borgmatic + Rclone)
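
For the first bullet, a minimal sketch on Debian/Ubuntu (package names as shipped in the distro repos):

# Install and enable unattended security updates
sudo apt install -y unattended-upgrades
sudo dpkg-reconfigure -plow unattended-upgrades   # writes /etc/apt/apt.conf.d/20auto-upgrades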

3. Prerequisites

Hardware Requirements

| Service | Minimum Specs | Recommended Production |
|----------------------|------------------------|------------------------------|
| Container Host | 2 vCPU, 4GB RAM | Dedicated NVMe, 32GB ECC RAM |
| DNS Resolver | 1 vCPU, 512MB RAM | Anycast-enabled cluster |
| Object Storage | 1TB HDD | Ceph cluster with erasure coding |

Critical Dependencies:

# Base OS (Debian 12 example; assumes Docker's apt repository is already configured)
sudo apt install -y docker-ce=5:24.0.7-1~debian.12~bookworm \
  containerd.io=1.6.31-1 \
  docker-buildx-plugin=0.11.2-1~debian.12~bookworm

# Verify no AWS dependencies in the registry's DNS delegation
dig +trace docker.com @8.8.8.8 | grep 'awsdns'

Security Pre-Checks

  1. Network Isolation:

     ufw default deny incoming
     ufw allow from 192.168.1.0/24 to any port 443

  2. DNS Control (a quick sanity check follows):

     # /etc/unbound/unbound.conf
     forward-zone:
       name: "."
       forward-addr: 9.9.9.9@853#dns.quad9.net
       forward-ssl-upstream: yes
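
A quick sanity check for the resolver config above, assuming Unbound listens on 127.0.0.1:

unbound-checkconf /etc/unbound/unbound.conf
sudo systemctl restart unbound
dig @127.0.0.1 deb.debian.org +short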

4. Installation & Setup

Stateless Service Template (Immich Example)

# Pull images while upstream registries are healthy
docker pull ghcr.io/immich-app/immich-server:release

# Verify local cache
docker images | grep immich-server

# Persistent volumes only for critical data
docker volume create immich_pgdata
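
Optionally, keep an offline copy of the images so a registry outage can never block a redeploy. A sketch (the /srv/images path is arbitrary):

# Export the image to a tarball while the registry is reachable
docker save ghcr.io/immich-app/immich-server:release -o /srv/images/immich-server.tar
# Later, restore on any host without touching a registry
docker load -i /srv/images/immich-server.tar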

docker-compose.yml Resilience Tweaks:

services:
  immich-server:
    image: ghcr.io/immich-app/immich-server:release
    networks:
      - internal_isolated
    dns:
      - 192.168.1.53 # Your local resolver
    deploy:
      resources:
        limits:
          memory: 4G

networks:
  internal_isolated:
    internal: true # No accidental internet exposure

Validation Steps:

# Confirm no external DNS leaks
docker exec $CONTAINER_ID cat /etc/resolv.conf

# Test service isolation
docker run --rm --network container:$CONTAINER_ID nicolaka/netshoot \
  curl -sI https://aws.amazon.com | head -n1
# Should fail (timeout or resolution error) if properly isolated

5. Configuration & Optimization

DNS Armor Plating

Stubby Config (DNS-over-TLS):

# /etc/stubby/stubby.yml
resolution_type: GETDNS_RESOLUTION_STUB
dns_transport_list:
  - GETDNS_TRANSPORT_TLS
tls_authentication: GETDNS_AUTHENTICATION_REQUIRED
tls_query_padding_blocksize: 128
round_robin_upstreams: 1 # Rotate across upstreams so one failure doesn't block resolution
upstream_recursive_servers:
  - address_data: 9.9.9.9
    tls_auth_name: "dns.quad9.net"
  - address_data: 1.1.1.1
    tls_auth_name: "cloudflare-dns.com"
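
To put Stubby in the resolution path, a minimal sketch assuming the stock listen address of 127.0.0.1 on port 53 (adjust if you changed listen_addresses):

sudo systemctl enable --now stubby
dig @127.0.0.1 example.com +short   # should be answered via the TLS upstreams above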

Caching Unbound Setup:

# /etc/unbound/unbound.conf
server:
    prefetch: yes
    prefetch-key: yes
    cache-min-ttl: 3600 # Survive upstream outages
    serve-expired: yes
    serve-expired-ttl: 86400
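
A hedged way to confirm the cache will actually absorb an upstream outage: resolve a name twice and check Unbound's hit counters (assumes unbound-control has been set up):

dig @127.0.0.1 registry-1.docker.io +short
dig @127.0.0.1 registry-1.docker.io +short   # second query should be served from cache
unbound-control stats_noreset | grep -E 'num.cachehits|num.prefetch'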

Container Registry Mirror

# Harbor ships as an offline installer (a docker compose bundle), not a single image
curl -LO https://github.com/goharbor/harbor/releases/download/v2.10.0/harbor-offline-installer-v2.10.0.tgz
tar xzf harbor-offline-installer-v2.10.0.tgz
cd harbor && cp harbor.yml.tmpl harbor.yml   # set hostname and TLS cert paths here
sudo ./install.sh

# Docker client config
echo '{"registry-mirrors": ["https://harbor.example.com"]}' | \
  sudo tee /etc/docker/daemon.json
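
Apply the mirror and confirm the daemon picked it up (the hostname matches the placeholder above):

sudo systemctl restart docker
docker info | grep -A1 'Registry Mirrors'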

6. Usage & Operations

Outage-Proof Daily Operations

Backup Strategy:

# PostgreSQL with WAL-G to MinIO (S3 alternative)
wal-g backup-push /var/lib/postgresql/data \
  --config /etc/wal-g/config.json

# Verify without AWS S3 API
MINIO_ALIAS=localbackup
mc ls $MINIO_ALIAS/postgres-backups
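
The mc commands assume the alias has been registered once; a sketch with a placeholder endpoint and credentials:

mc alias set localbackup https://minio.lan:9000 "$MINIO_ACCESS_KEY" "$MINIO_SECRET_KEY"
mc ls localbackup/postgres-backups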

Automated Image Updates:

# Watchtower; with the registry mirror above, image pulls avoid a direct Docker Hub dependency
docker run -d --name watchtower \
  -v /var/run/docker.sock:/var/run/docker.sock \
  -e WATCHTOWER_POLL_INTERVAL=86400 \
  -e WATCHTOWER_NO_PULL=false \
  --restart unless-stopped \
  containrrr/watchtower \
  --label-enable --include-stopped
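
With --label-enable, Watchtower only touches containers that opt in via its label; for example:

# Opt a container into automated updates
docker run -d --name immich-server \
  --label com.centurylinklabs.watchtower.enable=true \
  ghcr.io/immich-app/immich-server:release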

7. Troubleshooting

“When DNS Is Down” Diagnostic Toolkit

Bypass Cloud Resolvers:

# Direct root server query
dig +norecurse @h.root-servers.net dynamodb.us-east-1.amazonaws.com

# Verify local cache hit
unbound-control dump_cache | grep dynamodb

Container Fallback Testing:

# Force offline mode for bridged containers (their traffic traverses DOCKER-USER, not OUTPUT)
sudo iptables -I DOCKER-USER -p tcp --dport 443 -j DROP

# Validate degraded functionality
docker exec $CONTAINER_ID curl https://aws.amazon.com -m 5
# Expected: "Connection timed out"

# Check service health endpoint
docker exec $CONTAINER_ID wget -qO- localhost:8080/health

# Restore connectivity afterwards
sudo iptables -D DOCKER-USER -p tcp --dport 443 -j DROP

8. Conclusion

The June 2024 AWS outage wasn’t an anomaly – it was a stress test. Engineers who designed systems expecting cloud failure maintained availability through:

  1. Decentralized DNS: Local resolvers with aggressive caching
  2. On-Premises Redundancy: Critical service mirrors (container registries, object storage)
  3. State Management: Knowing when SQLite outperforms DynamoDB

The cloud’s greatest irony? Its most resilient users treat it as expendable. Build accordingly.

This post is licensed under CC BY 4.0 by the author.