My Attempt At Replacing Cloud Services

1. Introduction

The growing tension between convenience and digital sovereignty has reached a tipping point. As major cloud providers intensify data collection practices under the guise of “free services,” technical professionals face a critical choice: continue feeding the surveillance machine or reclaim control through self-hosted infrastructure. This isn’t just another homelab experiment – it’s a technical deep dive into architecting a production-grade alternative to commercial cloud ecosystems.

For DevOps engineers and system administrators, replacing cloud services presents unique challenges: maintaining comparable availability, implementing enterprise-grade security, and achieving automation parity – all while avoiding the 24/7 operational overhead that makes cloud platforms appealing. This guide documents my multi-year journey building a privacy-focused infrastructure stack that handles email, file storage, media streaming, and productivity tools without corporate intermediaries.

You’ll learn how to:

  • Architect services with failure domains and redundancy using Proxmox VE clustering
  • Implement zero-trust networking with WireGuard and Tailscale
  • Containerize legacy applications using Docker/Podman without cloud dependencies
  • Automate TLS certificates with ACME challenges in isolated networks
  • Achieve 99.9% uptime using distributed storage (Ceph) and load balancing
  • Enforce GDPR-grade data controls without compliance theater

2. Understanding the Self-Hosted Paradigm Shift

What We’re Replacing
Commercial cloud ecosystems provide vertically integrated services:

[User Devices] → [Cloud Provider] → [Gmail/Drive/Photos/Calendar]

This creates critical vulnerabilities:

  1. Single point of control (provider terms of service)
  2. Data exfiltration via interconnected “free” services
  3. Limited configuration control (e.g., no custom retention policies)

The Self-Hosted Alternative

[User Devices] → [Reverse Proxy] → [NextCloud (Files)]  
                          → [ProtonMail Bridge (Email)]  
                          → [Jellyfin (Media)]  
                          → [Vaultwarden (Password Manager)]

Each component runs on dedicated infrastructure with explicit data boundaries.

Technical Tradeoffs

Factor                | Commercial Cloud        | Self-Hosted
Uptime SLA            | 99.9-99.99%             | Depends on architecture
Storage Cost (TB/mo)  | $23 (GCP) - $40 (AWS)   | $5 (HDD) - $15 (SSD)
Security Defaults     | Automatic updates       | Manual patch management
Data Control          | Limited (TOS-bound)     | Full cryptographic control

Key Technologies

  • Proxmox VE: Debian-based KVM/LXC virtualization platform with built-in clustering and Ceph integration
  • Ceph: Distributed storage system with Erasure Coding
  • Ansible: Infrastructure-as-Code for configuration management
  • Traefik: Cloud-native edge router with Let’s Encrypt integration

When Self-Hosting Fails
Avoid core business systems requiring:

  • Global anycast networks (use Cloudflare DNS)
  • Petabyte-scale object storage (consider Backblaze B2)
  • AI/ML training clusters (limited GPU alternatives)

3. Prerequisites

Hardware Minimums

  • 3 Nodes (High Availability Cluster):
    • CPU: Xeon E3-1230v6 (4c/8t)
    • RAM: 32GB ECC DDR4
    • Storage: 2x NVMe (OS), 4x 8TB HDD (Ceph OSDs)
    • Network: 2x 10GbE (Storage/Public)

Pre-Installation Checklist

  1. Network Architecture:
    • VLAN segmentation (Management, Storage, Public)
    • BGP peering for anycast services (FRRouting)
    • Physical firewall (OPNsense/pfSense)
  2. Security Foundation:
    • Hardware Security Module (YubiHSM 2)
    • Offline certificate authority (Step CA)
    • Encrypted DNS (Unbound + DNS-over-TLS)
  3. Software Requirements:
    • Proxmox VE 7.4+ (no-subscription repo)
    • Ceph Quincy 17.2.6
    • Docker 20.10.23 with containerd
    • Ubuntu 22.04.3 LTS (Kernel 5.15 HWE)
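
A quick sanity check that the installed versions match the list above (a minimal sketch; assumes the packages are already in place on the relevant hosts):

pveversion          # Proxmox VE release
ceph --version      # Ceph release, should report 17.2.x (Quincy)
docker --version    # Docker Engine version
uname -r            # running kernel (expect 5.15 on Ubuntu 22.04)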

Critical Configuration Files
/etc/apt/sources.list.d/proxmox.list:

deb http://download.proxmox.com/debian/pve bullseye pve-no-subscription

/etc/ceph/ceph.conf:

[global]
osd_pool_default_size = 3
osd_pool_default_min_size = 2
mon_allow_pool_delete = true
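
With osd_pool_default_size = 3 and min_size = 2, every object is stored on all three nodes and pools stay writable with a single node down. Once pools exist, the effective values can be double-checked:

ceph osd pool ls detail | grep size    # shows "replicated size 3 min_size 2" per pool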

4. Installation & High-Availability Configuration

Proxmox VE Cluster Initialization
First node:

proxmox-boot-tool format /dev/nvme0n1p2 --force
proxmox-boot-tool init /dev/nvme0n1p2
pvecm create HA-CLUSTER --link0 10.10.10.1

Subsequent nodes:

pvecm add 10.10.10.1 --force --link0 10.10.10.2
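
Once every node has joined, confirm the cluster is quorate before touching storage:

pvecm status    # Votequorum section should show "Quorate: Yes"
pvecm nodes     # all three nodes listed as members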

Ceph Deployment
Create OSDs with encryption:

ceph-volume lvm create --data /dev/sdb --dmcrypt
ceph-volume lvm create --data /dev/sdc --dmcrypt

Configure CRUSH map for rack-awareness:

ceph osd crush add-bucket rack1 rack
ceph osd crush move rack1 root=default
ceph osd crush move node1 rack=rack1    # move each host bucket under its rack
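
Rack buckets only affect placement if the pool's CRUSH rule uses rack as its failure domain. A sketch of creating and assigning such a rule (the rule name is arbitrary; "nextcloud" matches the pool used later):

ceph osd crush rule create-replicated replicated_rack default rack
ceph osd pool set nextcloud crush_rule replicated_rack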

Docker with Overlay2 & ZFS

cat > /etc/docker/daemon.json <<EOF
{
  "storage-driver": "overlay2",
  "storage-opts": [
    "overlay2.override_kernel_check=true",
    "overlay2.size=100G"
  ],
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "100m",
    "max-file": "3"
  }
}
EOF
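
Docker must be restarted for daemon.json changes to take effect; note that the overlay2.size option only works on an xfs backing filesystem mounted with pquota. A quick check that the driver took:

systemctl restart docker
docker info --format '{{ .Driver }}'    # should print "overlay2"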

Service Deployment Example: Nextcloud
docker-compose.yml:

version: '3.7'

services:
  nextcloud:
    image: nextcloud:25.0.7-apache
    container_name: nextcloud
    networks:
      - frontend
      - backend
    volumes:
      - nextcloud:/var/www/html
      - /mnt/ceph/nextcloud:/var/www/html/data
    environment:
      - MYSQL_HOST=db
      - REDIS_HOST=redis
      - OVERWRITEPROTOCOL=https
    deploy:
      resources:
        limits:
          memory: 4G

  db:
    image: mariadb:10.11
    container_name: nextcloud-db
    networks:
      - backend
    volumes:
      - db:/var/lib/mysql
    environment:
      - MYSQL_ROOT_PASSWORD_FILE=/run/secrets/db_root_password
      # MYSQL_DATABASE, MYSQL_USER and MYSQL_PASSWORD (or their _FILE variants)
      # are also needed here and on the nextcloud service for first-run setup
    secrets:
      - db_root_password

  redis:
    image: redis:7-alpine
    container_name: nextcloud-redis
    networks:
      - backend

secrets:
  db_root_password:
    file: ./db_root_password.txt

networks:
  frontend:
    driver: bridge
    ipam:
      config:
        - subnet: 172.22.0.0/24
  backend:
    driver: bridge
    ipam:
      config:
        - subnet: 172.23.0.0/24

volumes:
  nextcloud:
    driver: ceph
    driver_opts:
      name: ceph
      pool: nextcloud
      volume: nextcloud-vol
      monitors: 10.10.10.1:6789,10.10.10.2:6789,10.10.10.3:6789
      secret: $CEPHX_SECRET
  db:
    driver: zfs
    driver_opts:
      size: 100G
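
With the file in place, bring the stack up and watch the first-run logs (docker-compose v1 syntax shown; the newer "docker compose" plugin behaves the same):

docker-compose up -d
docker-compose ps                  # all services should be "Up"
docker-compose logs -f nextcloud   # follow the installer output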

Verification Steps
Check Ceph health:

ceph -s
  cluster:
    id:     a7f64266-6b9a-4b88-8b4d-362b0f1a2c7e
    health: HEALTH_OK

Test Docker volume:

docker run --rm -v nextcloud:/mnt alpine ls -l /mnt
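
Once the installer has finished, Nextcloud's own view of its health can be queried through occ inside the container:

docker exec -u www-data nextcloud php occ status    # expect "installed: true"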

5. Enterprise-Grade Configuration

Security Hardening

  1. Kernel Parameters:
    /etc/sysctl.d/99-hardening.conf:
    
    net.ipv4.tcp_syncookies = 1
    net.ipv4.conf.all.rp_filter = 1
    kernel.kptr_restrict = 2
    
  2. AppArmor Profiles:
    nextcloud-profile:
    #include <tunables/global>
       
    profile nextcloud flags=(attach_disconnected) {
      #include <abstractions/apache2-common>
      /var/www/html/** r,
      /mnt/ceph/nextcloud/** rw,
      deny /var/www/html/data/*.php rwx,
    }
    

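A sketch of loading the profile and attaching it to the Nextcloud container (assumes the profile is saved as /etc/apparmor.d/nextcloud-profile; the name passed to Docker must match the profile name declared in the file):

apparmor_parser -r /etc/apparmor.d/nextcloud-profile    # load / reload the profile
aa-status | grep nextcloud                              # confirm it is registered
# attach it in docker-compose.yml with:
#   security_opt:
#     - apparmor=nextcloud
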
Performance Optimization
Ceph CRUSH Tunables:

ceph osd crush tunables optimal
ceph osd set-require-min-compat-client jewel

ZFS ARC Size Adjustment:

# cap the ARC at 8 GiB (roughly a quarter of each node's 32 GB RAM) so VMs and Ceph keep headroom
echo $((8 * 1024 * 1024 * 1024)) > /sys/module/zfs/parameters/zfs_arc_max
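
The sysfs write does not survive a reboot; to persist the limit, set the module option and rebuild the initramfs (same 8 GiB example value):

cat > /etc/modprobe.d/zfs.conf <<EOF
options zfs zfs_arc_max=8589934592
EOF
update-initramfs -u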

Automated Certificate Management
Traefik dynamic configuration (dynamic.yml):

tls:
  certificates:
    - certFile: /etc/step/certs/site.crt
      keyFile: /etc/step/certs/site.key
  stores:
    default:
      defaultCertificate:
        certFile: /etc/step/certs/site.crt
        keyFile: /etc/step/certs/site.key

http:
  routers:
    nextcloud:
      rule: "Host(`cloud.example.com`)"
      service: nextcloud
      tls:
        certResolver: step
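
The certResolver: step reference assumes an ACME resolver named "step" in Traefik's static configuration, pointed at the internal step-ca ACME endpoint; Traefik also has to trust the step-ca root certificate (e.g. via LEGO_CA_CERTIFICATES). A sketch of the relevant fragment, with placeholder hostname and e-mail:

# merge this fragment into the existing static configuration rather than overwriting it
cat >> /etc/traefik/traefik.yml <<'EOF'
certificatesResolvers:
  step:
    acme:
      caServer: https://ca.internal.example.com/acme/acme/directory   # placeholder step-ca URL
      email: admin@example.com                                        # placeholder
      storage: /etc/traefik/acme.json
      tlsChallenge: {}
EOF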

6. Operational Workflows

Daily Maintenance
Prune Docker resources:

# note: --volumes also deletes any named volume not referenced by at least one container
docker system prune -af --volumes --filter "until=720h"

ZFS snapshot rotation:

zfs snap rpool/data@$(date +%Y%m%d)
zfs destroy -r rpool/data@$(date -d "30 days ago" +%Y%m%d)

Backup Strategy
BorgBackup to remote storage:

borg create --stats --progress \
  ssh://backup@nas01:22/mnt/backup/nextcloud::nextcloud-{now} \
  /mnt/ceph/nextcloud \
  --exclude '*.tmp'
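
Archives accumulate indefinitely without pruning; a retention sketch (the keep counts are examples, adjust to taste):

borg prune --stats --list \
  --keep-daily 7 --keep-weekly 4 --keep-monthly 6 \
  ssh://backup@nas01:22/mnt/backup/nextcloud
borg compact ssh://backup@nas01:22/mnt/backup/nextcloud    # reclaim space, borg >= 1.2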

Monitoring Stack
Prometheus scrape config:

scrape_configs:
  - job_name: 'proxmox'
    static_configs:
      - targets: ['10.10.10.1:9221', '10.10.10.2:9221']
  - job_name: 'ceph'
    metrics_path: /metrics
    static_configs:
      - targets: ['10.10.10.1:9283']
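
The ceph job assumes the mgr Prometheus module is exporting on port 9283, and the proxmox job assumes a pve-exporter instance on each node. Enabling the Ceph side takes one command:

ceph mgr module enable prometheus
ceph mgr services    # should list the prometheus endpoint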

7. Troubleshooting Handbook

Common Failure Scenarios

Symptom                   | Diagnostic Command               | Resolution
Ceph OSD down             | ceph osd tree -f json-pretty     | systemctl restart ceph-osd@$ID
Container network failure | nsenter -t $PID -n ping 8.8.8.8  | Check iptables/Firewalld rules
ZFS pool degraded         | zpool status -v                  | zpool replace pool bad-disk new-disk

Log Investigation
journalctl with time constraints:

journalctl --since "2023-07-15 09:00:00" --until "2023-07-15 12:00:00" -u ceph-mon@node1

Docker container inspection:

docker inspect $CONTAINER_ID --format '{{ json . }}' | jq

8. Conclusion

After 18 months of operation, this self-hosted infrastructure handles:

  • 12TB of family photos/videos (Jellyfin)
  • 300GB of documents (Nextcloud)
  • 50,000+ emails (ProtonMail Bridge)

Critical lessons learned:

  1. Redundancy Is Non-Negotiable: Three-node minimum for any production service
  2. Automate or Perish: Unattended-upgrades + Ansible = survival
  3. Monitor Everything: A single failed OSD can cascade into pool failure

The stack currently achieves 99.82% uptime – not quite enterprise SLA, but sufficient for personal use. For those considering similar migrations: start with non-critical services, implement monitoring before migration, and always maintain offline backups.

The journey to digital sovereignty requires technical rigor but delivers unparalleled control. As surveillance capitalism intensifies, the ability to maintain private infrastructure becomes not just a technical challenge, but an ethical imperative.

This post is licensed under CC BY 4.0 by the author.