My Attempt At Replacing Cloud Services

1. Introduction

The growing tension between convenience and digital sovereignty has reached a tipping point. As major cloud providers intensify data collection practices under the guise of “free services,” technical professionals face a critical choice: continue feeding the surveillance machine or reclaim control through self-hosted infrastructure. This isn’t just another homelab experiment – it’s a technical deep dive into architecting a production-grade alternative to commercial cloud ecosystems.

For DevOps engineers and system administrators, replacing cloud services presents unique challenges: maintaining comparable availability, implementing enterprise-grade security, and achieving automation parity – all while avoiding the 24/7 operational overhead that makes cloud platforms appealing. This guide documents my multi-year journey building a privacy-focused infrastructure stack that handles email, file storage, media streaming, and productivity tools without corporate intermediaries.

You’ll learn how to:

  • Architect services with failure domains and redundancy using Proxmox VE clustering
  • Implement zero-trust networking with WireGuard and Tailscale
  • Containerize legacy applications using Docker/Podman without cloud dependencies
  • Automate TLS certificates with ACME challenges in isolated networks
  • Achieve 99.9% uptime using distributed storage (Ceph) and load balancing
  • Enforce GDPR-grade data controls without compliance theater

2. Understanding the Self-Hosted Paradigm Shift

What We’re Replacing
Commercial cloud ecosystems provide vertically integrated services:

[User Devices] → [Cloud Provider] → [Gmail/Drive/Photos/Calendar]

This creates critical vulnerabilities:

  1. Single point of control (provider terms of service)
  2. Data exfiltration via interconnected “free” services
  3. Limited configuration control (e.g., no custom retention policies)

The Self-Hosted Alternative

[User Devices] → [Reverse Proxy] → [Nextcloud (Files)]
                                 → [ProtonMail Bridge (Email)]
                                 → [Jellyfin (Media)]
                                 → [Vaultwarden (Password Manager)]

Each component runs on dedicated infrastructure with explicit data boundaries.
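As a sketch of what "explicit data boundaries" can mean in practice, each service can get its own Docker network, with back-end networks cut off from the internet entirely (the names here are illustrative, not the exact networks used later):

```bash
# Internal network: containers attached to it cannot reach the outside world
docker network create --internal nextcloud-backend
# Front-end network: the only network shared with the reverse proxy
docker network create proxy-frontend
```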

Technical Tradeoffs

| Factor | Commercial Cloud | Self-Hosted |
|---|---|---|
| Uptime SLA | 99.9-99.99% | Depends on architecture |
| Storage cost (TB/mo) | $23 (GCP) - $40 (AWS) | $5 (HDD) - $15 (SSD) |
| Security defaults | Automatic updates | Manual patch management |
| Data control | Limited (TOS-bound) | Full cryptographic control |

Key Technologies

  • Proxmox VE: KVM/LXC virtualization platform with built-in clustering and Ceph integration
  • Ceph: Distributed storage system with Erasure Coding
  • Ansible: Infrastructure-as-Code for configuration management (see the ad-hoc example after this list)
  • Traefik: Cloud-native edge router with Let’s Encrypt integration
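Ansible carries most of the day-to-day automation in this stack; a minimal ad-hoc run against the cluster could look like the following (the inventory file and group name are assumptions):

```bash
# Apply pending package upgrades across every node in the hypothetical group
ansible proxmox_nodes -i inventory.ini --become \
  -m apt -a "upgrade=dist update_cache=yes"
```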

When Self-Hosting Fails
Avoid self-hosting workloads that require:

  • Global anycast networks (use Cloudflare DNS)
  • Petabyte-scale object storage (consider Backblaze B2)
  • AI/ML training clusters (limited GPU alternatives)

3. Prerequisites

Hardware Minimums

  • 3 Nodes (High Availability Cluster):
    • CPU: Xeon E3-1230v6 (4c/8t)
    • RAM: 32GB ECC DDR4
    • Storage: 2x NVMe (OS), 4x 8TB HDD (Ceph OSDs)
    • Network: 2x 10GbE (Storage/Public)

Pre-Installation Checklist

  1. Network Architecture:
    • VLAN segmentation (Management, Storage, Public)
    • BGP peering for anycast services (FRRouting)
    • Physical firewall (OPNsense/pfSense)
  2. Security Foundation:
    • Hardware Security Module (YubiHSM 2)
    • Offline certificate authority (Step CA)
    • Encrypted DNS (Unbound + DNS-over-TLS)
  3. Software Requirements:
    • Proxmox VE 7.4+ (no-subscription repo)
    • Ceph Quincy 17.2.6
    • Docker 20.10.23 with containerd
    • Ubuntu 22.04.3 LTS (Kernel 5.15 HWE)
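A quick sanity check that the installed versions line up with the requirements above:

```bash
pveversion                    # expect pve-manager 7.4+
ceph --version                # expect 17.2.6 (quincy)
docker --version              # expect 20.10.23
lsb_release -ds && uname -r   # on the Ubuntu hosts: 22.04.3 LTS with a 5.15 HWE kernel
```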

Critical Configuration Files
/etc/apt/sources.list.d/proxmox.list:

deb http://download.proxmox.com/debian/pve bullseye pve-no-subscription

/etc/ceph/ceph.conf:

[global]
osd_pool_default_size = 3
osd_pool_default_min_size = 2
mon_allow_pool_delete = true
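With those defaults in place, any new pool inherits three replicas with a minimum of two for I/O; for example (the pool name and PG count are illustrative):

```bash
# Create a replicated pool for Nextcloud data and tag it for RBD use
ceph osd pool create nextcloud 128
ceph osd pool application enable nextcloud rbd
```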

4. Installation & High-Availability Configuration

Proxmox VE Cluster Initialization
First node:

proxmox-boot-tool format /dev/nvme0n1p2 --force
proxmox-boot-tool init /dev/nvme0n1p2
pvecm create HA-CLUSTER --link0 10.10.10.1

Subsequent nodes:

pvecm add 10.10.10.1 --link0 10.10.10.2 --force
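Before deploying anything, confirm that all three nodes joined and the cluster is quorate:

```bash
pvecm status   # look for "Quorate: Yes" and the expected vote count
pvecm nodes    # all three nodes should be listed as members
```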

Ceph Deployment
Create OSDs with encryption:

ceph-volume lvm create --data /dev/sdb --dmcrypt
ceph-volume lvm create --data /dev/sdc --dmcrypt

Configure the CRUSH map for rack awareness by placing host buckets under racks:

ceph osd crush add-bucket rack1 rack
ceph osd crush move rack1 root=default
ceph osd crush move node1 rack=rack1
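Once every host sits in a rack bucket (and there are at least as many racks as replicas), a placement rule can spread copies across racks; the rule and pool names below are illustrative:

```bash
# Replicated rule with rack as the failure domain, then applied to a pool
ceph osd crush rule create-replicated rack-aware default rack
ceph osd pool set nextcloud crush_rule rack-aware
```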

Docker with Overlay2 & ZFS

cat > /etc/docker/daemon.json <<EOF
{
  "storage-driver": "overlay2",
  "storage-opts": [
    "overlay2.override_kernel_check=true",
    "overlay2.size=100G"
  ],
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "100m",
    "max-file": "3"
  }
}
EOF
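After writing daemon.json, Docker needs a restart; the active drivers can then be confirmed:

```bash
systemctl restart docker
docker info | grep -E 'Storage Driver|Logging Driver'   # expect overlay2 and json-file
```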

Service Deployment Example: Nextcloud
docker-compose.yml:

version: '3.7'

services:
  nextcloud:
    image: nextcloud:25.0.7-apache
    container_name: nextcloud
    networks:
      - frontend
      - backend
    volumes:
      - nextcloud:/var/www/html
      - /mnt/ceph/nextcloud:/var/www/html/data
    environment:
      - MYSQL_HOST=db
      - REDIS_HOST=redis
      - OVERWRITEPROTOCOL=https
    deploy:
      resources:
        limits:
          memory: 4G

  db:
    image: mariadb:10.11
    container_name: nextcloud-db
    networks:
      - backend
    volumes:
      - db:/var/lib/mysql
    environment:
      - MYSQL_ROOT_PASSWORD_FILE=/run/secrets/db_root_password
    secrets:
      - db_root_password

secrets:
  db_root_password:
    file: ./db_root_password.txt

networks:
  frontend:
    driver: bridge
    ipam:
      config:
        - subnet: 172.22.0.0/24
  backend:
    driver: bridge
    ipam:
      config:
        - subnet: 172.23.0.0/24

volumes:
  nextcloud:
    driver: ceph
    driver_opts:
      name: ceph
      pool: nextcloud
      volume: nextcloud-vol
      monitors: 10.10.10.1:6789,10.10.10.2:6789,10.10.10.3:6789
      secret: $CEPHX_SECRET
  db:
    driver: zfs
    driver_opts:
      size: 100G
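A minimal bring-up and smoke test for the stack above, assuming the Compose v2 plugin (the legacy docker-compose binary behaves the same):

```bash
docker compose up -d
docker compose ps                                  # all containers should be "running"
docker exec -u www-data nextcloud php occ status   # Nextcloud's built-in health check
```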

Verification Steps
Check Ceph health:

ceph -s
  cluster:
    id:     a7f64266-6b9a-4b88-8b4d-362b0f1a2c7e
    health: HEALTH_OK

Test Docker volume:

docker run --rm -v nextcloud:/mnt alpine ls -l /mnt

5. Enterprise-Grade Configuration

Security Hardening

  1. Kernel Parameters:
    /etc/sysctl.d/99-hardening.conf:
    
    net.ipv4.tcp_syncookies = 1
    net.ipv4.conf.all.rp_filter = 1
    kernel.kptr_restrict = 2
    
  2. AppArmor Profiles:
    nextcloud-profile:
    #include <tunables/global>
       
    profile nextcloud flags=(attach_disconnected) {
      #include <abstractions/apache2-common>
      /var/www/html/** r,
      /mnt/ceph/nextcloud/** rw,
      deny /var/www/html/data/*.php rwx,
    }
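    Loading the profile and attaching it to the container are separate steps; the file path below is an assumption about where the profile is stored:

```bash
# Load (or reload) the profile into the kernel; the path is assumed
apparmor_parser -r /etc/apparmor.d/nextcloud-profile
# Attach it to the Nextcloud container, e.g. in docker-compose.yml:
#   security_opt:
#     - apparmor=nextcloud
```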
    

Performance Optimization
Ceph CRUSH Tunables:

ceph osd crush tunables optimal
ceph osd set-require-min-compat-client jewel

ZFS ARC Size Adjustment:

echo $((32 * 1024 * 1024 * 1024)) > /sys/module/zfs/parameters/zfs_arc_max
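The echo above only lasts until reboot; to make the 32 GiB cap persistent, a modprobe option works (value in bytes):

```bash
# 32 GiB = 34359738368 bytes
echo "options zfs zfs_arc_max=34359738368" > /etc/modprobe.d/zfs.conf
update-initramfs -u   # rebuild the initramfs so the option applies at boot
```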

Automated Certificate Management
Traefik dynamic configuration (dynamic.yml):

tls:
  certificates:
    - certFile: /etc/step/certs/site.crt
      keyFile: /etc/step/certs/site.key
  stores:
    default:
      defaultCertificate:
        certFile: /etc/step/certs/site.crt
        keyFile: /etc/step/certs/site.key

http:
  routers:
    nextcloud:
      rule: "Host(`cloud.example.com`)"
      service: nextcloud
      tls:
        certResolver: step
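A quick way to confirm Traefik is serving the expected certificate for the router above (hostname taken from the example rule):

```bash
# Inspect the certificate presented on port 443
echo | openssl s_client -connect cloud.example.com:443 -servername cloud.example.com 2>/dev/null \
  | openssl x509 -noout -issuer -subject -dates
```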

6. Operational Workflows

Daily Maintenance
Prune Docker resources:

docker system prune -af --volumes --filter "until=720h"

ZFS snapshot rotation:

zfs snap rpool/data@$(date +%Y%m%d)
zfs destroy -r rpool/data@$(date -d "30 days ago" +%Y%m%d)
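These commands are only useful if they run unattended; one way to schedule them (the wrapper script names and times are placeholders):

```bash
# Hypothetical wrapper scripts containing the prune and snapshot commands above
cat > /etc/cron.d/daily-maintenance <<'EOF'
30 3 * * * root /usr/local/sbin/docker-prune.sh
45 3 * * * root /usr/local/sbin/zfs-snapshot-rotate.sh
EOF
```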

Backup Strategy
BorgBackup to remote storage:

borg create --stats --progress \
  ssh://backup@nas01:22/mnt/backup/nextcloud::nextcloud-{now} \
  /mnt/ceph/nextcloud \
  --exclude '*.tmp'
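Backups also need a retention policy; a companion prune for the same repository might look like this (the keep counts are illustrative):

```bash
# Thin out old archives while keeping recent history
borg prune --stats \
  --keep-daily 7 --keep-weekly 4 --keep-monthly 6 \
  ssh://backup@nas01:22/mnt/backup/nextcloud
```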

Monitoring Stack
Prometheus scrape config:

scrape_configs:
  - job_name: 'proxmox'
    static_configs:
      - targets: ['10.10.10.1:9221', '10.10.10.2:9221']
  - job_name: 'ceph'
    metrics_path: /metrics
    static_configs:
      - targets: ['10.10.10.1:9283']
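Before reloading Prometheus, the configuration can be validated; the file path is an assumption, and the reload endpoint only works if Prometheus runs with --web.enable-lifecycle:

```bash
promtool check config /etc/prometheus/prometheus.yml
curl -X POST http://localhost:9090/-/reload   # requires --web.enable-lifecycle
```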

7. Troubleshooting Handbook

Common Failure Scenarios

| Symptom | Diagnostic Command | Resolution |
|---|---|---|
| Ceph OSD down | ceph osd tree -f json-pretty | systemctl restart ceph-osd@$ID |
| Container network failure | nsenter -t $PID -n ping 8.8.8.8 | Check iptables/firewalld rules |
| ZFS pool degraded | zpool status -v | zpool replace pool bad-disk new-disk |

Log Investigation
journalctl with time constraints:

journalctl --since "2023-07-15 09:00:00" --until "2023-07-15 12:00:00" -u ceph-mon@node1

Docker container inspection:

```bash
docker inspect $CONTAINER_ID | jq
```

8. Conclusion

After 18 months of operation, this self-hosted infrastructure handles:

  • 12TB of family photos/videos (Jellyfin)
  • 300GB of documents (Nextcloud)
  • 50,000+ emails (ProtonMail Bridge)

Critical lessons learned:

  1. Redundancy Is Non-Negotiable: Three-node minimum for any production service
  2. Automate or Perish: Unattended-upgrades + Ansible = survival
  3. Monitor Everything: A single failed OSD can cascade into pool failure

The stack currently achieves 99.82% uptime – not quite enterprise SLA, but sufficient for personal use. For those considering similar migrations: start with non-critical services, implement monitoring before migration, and always maintain offline backups.

The journey to digital sovereignty requires technical rigor but delivers unparalleled control. As surveillance capitalism intensifies, the ability to maintain private infrastructure becomes not just a technical challenge, but an ethical imperative.
