My Attempt At Replacing Cloud Services

1. Introduction

The growing tension between convenience and digital sovereignty has reached a tipping point. As major cloud providers intensify data collection practices under the guise of “free services,” technical professionals face a critical choice: continue feeding the surveillance machine or reclaim control through self-hosted infrastructure. This isn’t just another homelab experiment – it’s a technical deep dive into architecting a production-grade alternative to commercial cloud ecosystems.

For DevOps engineers and system administrators, replacing cloud services presents unique challenges: maintaining comparable availability, implementing enterprise-grade security, and achieving automation parity – all while avoiding the 24/7 operational overhead that makes cloud platforms appealing. This guide documents my multi-year journey building a privacy-focused infrastructure stack that handles email, file storage, media streaming, and productivity tools without corporate intermediaries.

You’ll learn how to:

  • Architect services with failure domains and redundancy using Proxmox VE clustering
  • Implement zero-trust networking with WireGuard and Tailscale
  • Containerize legacy applications using Docker/Podman without cloud dependencies
  • Automate TLS certificates with ACME challenges in isolated networks
  • Achieve 99.9% uptime using distributed storage (Ceph) and load balancing
  • Enforce GDPR-grade data controls without compliance theater

2. Understanding the Self-Hosted Paradigm Shift

What We’re Replacing
Commercial cloud ecosystems provide vertically integrated services:

[User Devices] → [Cloud Provider] → [Gmail/Drive/Photos/Calendar]

This creates critical vulnerabilities:

  1. Single point of control (provider terms of service)
  2. Data exfiltration via interconnected “free” services
  3. Limited configuration control (e.g., no custom retention policies)

The Self-Hosted Alternative

[User Devices] → [Reverse Proxy] → [Nextcloud (Files)]
                                 → [ProtonMail Bridge (Email)]
                                 → [Jellyfin (Media)]
                                 → [Vaultwarden (Password Manager)]

Each component runs on dedicated infrastructure with explicit data boundaries.
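As a sketch of what "explicit data boundaries" can mean in practice, each service can get its own Docker network, with back-end networks cut off from the internet entirely (the names here are illustrative, not the exact networks used later):

```bash
# Internal network: containers attached to it cannot reach the outside world
docker network create --internal nextcloud-backend
# Front-end network: the only network shared with the reverse proxy
docker network create proxy-frontend
```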

Technical Tradeoffs

| Factor | Commercial Cloud | Self-Hosted |
|---|---|---|
| Uptime SLA | 99.9-99.99% | Depends on architecture |
| Storage cost (TB/mo) | $23 (GCP) - $40 (AWS) | $5 (HDD) - $15 (SSD) |
| Security defaults | Automatic updates | Manual patch management |
| Data control | Limited (TOS-bound) | Full cryptographic control |

Key Technologies

  • Proxmox VE: KVM/LXC virtualization platform with built-in clustering and Ceph integration
  • Ceph: Distributed storage system with Erasure Coding
  • Ansible: Infrastructure-as-Code for configuration management (see the ad-hoc example after this list)
  • Traefik: Cloud-native edge router with Let’s Encrypt integration
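Ansible carries most of the day-to-day automation in this stack; a minimal ad-hoc run against the cluster could look like the following (the inventory file and group name are assumptions):

```bash
# Apply pending package upgrades across every node in the hypothetical group
ansible proxmox_nodes -i inventory.ini --become \
  -m apt -a "upgrade=dist update_cache=yes"
```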

When Self-Hosting Fails
Avoid self-hosting workloads that require:

  • Global anycast networks (use Cloudflare DNS)
  • Petabyte-scale object storage (consider Backblaze B2)
  • AI/ML training clusters (limited GPU alternatives)

3. Prerequisites

Hardware Minimums

  • 3 Nodes (High Availability Cluster):
    • CPU: Xeon E3-1230v6 (4c/8t)
    • RAM: 32GB ECC DDR4
    • Storage: 2x NVMe (OS), 4x 8TB HDD (Ceph OSDs)
    • Network: 2x 10GbE (Storage/Public)

Pre-Installation Checklist

  1. Network Architecture:
    • VLAN segmentation (Management, Storage, Public)
    • BGP peering for anycast services (FRRouting)
    • Physical firewall (OPNsense/pfSense)
  2. Security Foundation:
    • Hardware Security Module (YubiHSM 2)
    • Offline certificate authority (Step CA)
    • Encrypted DNS (Unbound + DNS-over-TLS)
  3. Software Requirements:
    • Proxmox VE 7.4+ (no-subscription repo)
    • Ceph Quincy 17.2.6
    • Docker 20.10.23 with containerd
    • Ubuntu 22.04.3 LTS (Kernel 5.15 HWE)
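A quick sanity check that the installed versions line up with the requirements above:

```bash
pveversion                    # expect pve-manager 7.4+
ceph --version                # expect 17.2.6 (quincy)
docker --version              # expect 20.10.23
lsb_release -ds && uname -r   # on the Ubuntu hosts: 22.04.3 LTS with a 5.15 HWE kernel
```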

Critical Configuration Files
/etc/apt/sources.list.d/proxmox.list:

deb http://download.proxmox.com/debian/pve bullseye pve-no-subscription

/etc/ceph/ceph.conf:

[global]
osd_pool_default_size = 3
osd_pool_default_min_size = 2
mon_allow_pool_delete = true
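With those defaults in place, any new pool inherits three replicas with a minimum of two for I/O; for example (the pool name and PG count are illustrative):

```bash
# Create a replicated pool for Nextcloud data and tag it for RBD use
ceph osd pool create nextcloud 128
ceph osd pool application enable nextcloud rbd
```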

4. Installation & High-Availability Configuration

Proxmox VE Cluster Initialization
First node:

proxmox-boot-tool format /dev/nvme0n1p2 --force
proxmox-boot-tool init /dev/nvme0n1p2
pvecm create HA-CLUSTER --link0 10.10.10.1

Subsequent nodes:

pvecm add 10.10.10.1 --link0 10.10.10.2 --force
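Before deploying anything, confirm that all three nodes joined and the cluster is quorate:

```bash
pvecm status   # look for "Quorate: Yes" and the expected vote count
pvecm nodes    # all three nodes should be listed as members
```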

Ceph Deployment
Create OSDs with encryption:

ceph-volume lvm create --data /dev/sdb --dmcrypt
ceph-volume lvm create --data /dev/sdc --dmcrypt

Configure the CRUSH map for rack awareness by placing host buckets under racks:

ceph osd crush add-bucket rack1 rack
ceph osd crush move rack1 root=default
ceph osd crush move node1 rack=rack1
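Once every host sits in a rack bucket (and there are at least as many racks as replicas), a placement rule can spread copies across racks; the rule and pool names below are illustrative:

```bash
# Replicated rule with rack as the failure domain, then applied to a pool
ceph osd crush rule create-replicated rack-aware default rack
ceph osd pool set nextcloud crush_rule rack-aware
```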

Docker with Overlay2 & ZFS

cat > /etc/docker/daemon.json <<EOF
{
  "storage-driver": "overlay2",
  "storage-opts": [
    "overlay2.override_kernel_check=true",
    "overlay2.size=100G"
  ],
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "100m",
    "max-file": "3"
  }
}
EOF
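After writing daemon.json, Docker needs a restart; the active drivers can then be confirmed:

```bash
systemctl restart docker
docker info | grep -E 'Storage Driver|Logging Driver'   # expect overlay2 and json-file
```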

Service Deployment Example: Nextcloud
docker-compose.yml:

version: '3.7'

services:
  nextcloud:
    image: nextcloud:25.0.7-apache
    container_name: nextcloud
    networks:
      - frontend
      - backend
    volumes:
      - nextcloud:/var/www/html
      - /mnt/ceph/nextcloud:/var/www/html/data
    environment:
      - MYSQL_HOST=db
      - REDIS_HOST=redis
      - OVERWRITEPROTOCOL=https
    deploy:
      resources:
        limits:
          memory: 4G

  db:
    image: mariadb:10.11
    container_name: nextcloud-db
    networks:
      - backend
    volumes:
      - db:/var/lib/mysql
    environment:
      - MYSQL_ROOT_PASSWORD_FILE=/run/secrets/db_root_password
    secrets:
      - db_root_password

secrets:
  db_root_password:
    file: ./db_root_password.txt

networks:
  frontend:
    driver: bridge
    ipam:
      config:
        - subnet: 172.22.0.0/24
  backend:
    driver: bridge
    ipam:
      config:
        - subnet: 172.23.0.0/24

volumes:
  nextcloud:
    driver: ceph
    driver_opts:
      name: ceph
      pool: nextcloud
      volume: nextcloud-vol
      monitors: 10.10.10.1:6789,10.10.10.2:6789,10.10.10.3:6789
      secret: $CEPHX_SECRET
  db:
    driver: zfs
    driver_opts:
      size: 100G
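A minimal bring-up and smoke test for the stack above, assuming the Compose v2 plugin (the legacy docker-compose binary behaves the same):

```bash
docker compose up -d
docker compose ps                                  # all containers should be "running"
docker exec -u www-data nextcloud php occ status   # Nextcloud's built-in health check
```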

Verification Steps
Check Ceph health:

ceph -s
  cluster:
    id:     a7f64266-6b9a-4b88-8b4d-362b0f1a2c7e
    health: HEALTH_OK

Test Docker volume:

docker run --rm -v nextcloud:/mnt alpine ls -l /mnt

5. Enterprise-Grade Configuration

Security Hardening

  1. Kernel Parameters:
    /etc/sysctl.d/99-hardening.conf:
    
    net.ipv4.tcp_syncookies = 1
    net.ipv4.conf.all.rp_filter = 1
    kernel.kptr_restrict = 2
    
  2. AppArmor Profiles:
    nextcloud-profile:
    #include <tunables/global>
       
    profile nextcloud flags=(attach_disconnected) {
      #include <abstractions/apache2-common>
      /var/www/html/** r,
      /mnt/ceph/nextcloud/** rw,
      deny /var/www/html/data/*.php rwx,
    }
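    Loading the profile and attaching it to the container are separate steps; the file path below is an assumption about where the profile is stored:

```bash
# Load (or reload) the profile into the kernel; the path is assumed
apparmor_parser -r /etc/apparmor.d/nextcloud-profile
# Attach it to the Nextcloud container, e.g. in docker-compose.yml:
#   security_opt:
#     - apparmor=nextcloud
```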
    

Performance Optimization
Ceph CRUSH Tunables:

ceph osd crush tunables optimal
ceph osd set-require-min-compat-client jewel

ZFS ARC Size Adjustment:

echo $((32 * 1024 * 1024 * 1024)) > /sys/module/zfs/parameters/zfs_arc_max
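The echo above only lasts until reboot; to make the 32 GiB cap persistent, a modprobe option works (value in bytes):

```bash
# 32 GiB = 34359738368 bytes
echo "options zfs zfs_arc_max=34359738368" > /etc/modprobe.d/zfs.conf
update-initramfs -u   # rebuild the initramfs so the option applies at boot
```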

Automated Certificate Management
Traefik dynamic configuration (dynamic.yml):

tls:
  certificates:
    - certFile: /etc/step/certs/site.crt
      keyFile: /etc/step/certs/site.key
  stores:
    default:
      defaultCertificate:
        certFile: /etc/step/certs/site.crt
        keyFile: /etc/step/certs/site.key

http:
  routers:
    nextcloud:
      rule: "Host(`cloud.example.com`)"
      service: nextcloud
      tls:
        certResolver: step
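A quick way to confirm Traefik is serving the expected certificate for the router above (hostname taken from the example rule):

```bash
# Inspect the certificate presented on port 443
echo | openssl s_client -connect cloud.example.com:443 -servername cloud.example.com 2>/dev/null \
  | openssl x509 -noout -issuer -subject -dates
```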

6. Operational Workflows

Daily Maintenance
Prune Docker resources:

docker system prune -af --volumes --filter "until=720h"

ZFS snapshot rotation:

zfs snap rpool/data@$(date +%Y%m%d)
zfs destroy -r rpool/data@$(date -d "30 days ago" +%Y%m%d)
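These commands are only useful if they run unattended; one way to schedule them (the wrapper script names and times are placeholders):

```bash
# Hypothetical wrapper scripts containing the prune and snapshot commands above
cat > /etc/cron.d/daily-maintenance <<'EOF'
30 3 * * * root /usr/local/sbin/docker-prune.sh
45 3 * * * root /usr/local/sbin/zfs-snapshot-rotate.sh
EOF
```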

Backup Strategy
BorgBackup to remote storage:

borg create --stats --progress \
  ssh://backup@nas01:22/mnt/backup/nextcloud::nextcloud-{now} \
  /mnt/ceph/nextcloud \
  --exclude '*.tmp'
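Backups also need a retention policy; a companion prune for the same repository might look like this (the keep counts are illustrative):

```bash
# Thin out old archives while keeping recent history
borg prune --stats \
  --keep-daily 7 --keep-weekly 4 --keep-monthly 6 \
  ssh://backup@nas01:22/mnt/backup/nextcloud
```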

Monitoring Stack
Prometheus scrape config:

scrape_configs:
  - job_name: 'proxmox'
    static_configs:
      - targets: ['10.10.10.1:9221', '10.10.10.2:9221']
  - job_name: 'ceph'
    metrics_path: /metrics
    static_configs:
      - targets: ['10.10.10.1:9283']
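Before reloading Prometheus, the configuration can be validated; the file path is an assumption, and the reload endpoint only works if Prometheus runs with --web.enable-lifecycle:

```bash
promtool check config /etc/prometheus/prometheus.yml
curl -X POST http://localhost:9090/-/reload   # requires --web.enable-lifecycle
```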

7. Troubleshooting Handbook

Common Failure Scenarios

| Symptom | Diagnostic Command | Resolution |
|---|---|---|
| Ceph OSD down | ceph osd tree -f json-pretty | systemctl restart ceph-osd@$ID |
| Container network failure | nsenter -t $PID -n ping 8.8.8.8 | Check iptables/firewalld rules |
| ZFS pool degraded | zpool status -v | zpool replace pool bad-disk new-disk |

Log Investigation
journalctl with time constraints:

journalctl --since "2023-07-15 09:00:00" --until "2023-07-15 12:00:00" -u ceph-mon@node1

Docker container inspection:

```bash
docker inspect $CONTAINER_ID | jq
```

8. Conclusion

After 18 months of operation, this self-hosted infrastructure handles:

  • 12TB of family photos/videos (Jellyfin)
  • 300GB of documents (Nextcloud)
  • 50,000+ emails (ProtonMail Bridge)

Critical lessons learned:

  1. Redundancy Is Non-Negotiable: Three-node minimum for any production service
  2. Automate or Perish: Unattended-upgrades + Ansible = survival
  3. Monitor Everything: A single failed OSD can cascade into pool failure

The stack currently achieves 99.82% uptime – not quite enterprise SLA, but sufficient for personal use. For those considering similar migrations: start with non-critical services, implement monitoring before migration, and always maintain offline backups.

The journey to digital sovereignty requires technical rigor but delivers unparalleled control. As surveillance capitalism intensifies, the ability to maintain private infrastructure becomes not just a technical challenge, but an ethical imperative.
