My Attempt At Replacing Cloud Services
1. Introduction
The growing tension between convenience and digital sovereignty has reached a tipping point. As major cloud providers intensify data collection practices under the guise of “free services,” technical professionals face a critical choice: continue feeding the surveillance machine or reclaim control through self-hosted infrastructure. This isn’t just another homelab experiment – it’s a technical deep dive into architecting a production-grade alternative to commercial cloud ecosystems.
For DevOps engineers and system administrators, replacing cloud services presents unique challenges: maintaining comparable availability, implementing enterprise-grade security, and achieving automation parity, all without the kind of 24/7 operational burden that drives teams to managed platforms in the first place. This guide documents my multi-year journey building a privacy-focused infrastructure stack that handles email, file storage, media streaming, and productivity tools without corporate intermediaries.
You’ll learn how to:
- Architect services with failure domains and redundancy using Proxmox VE clustering
- Implement zero-trust networking with WireGuard and Tailscale
- Containerize legacy applications using Docker/Podman without cloud dependencies
- Automate TLS certificates with ACME challenges in isolated networks
- Target 99.9% uptime using distributed storage (Ceph) and load balancing
- Enforce GDPR-grade data controls without compliance theater
2. Understanding the Self-Hosted Paradigm Shift
What We’re Replacing
Commercial cloud ecosystems provide vertically integrated services:
```
[User Devices] → [Cloud Provider] → [Gmail/Drive/Photos/Calendar]
```
This creates critical vulnerabilities:
- Single point of control (provider terms of service)
- Data exfiltration via interconnected “free” services
- Limited configuration control (e.g., no custom retention policies)
The Self-Hosted Alternative
```
[User Devices] → [Reverse Proxy] → [NextCloud (Files)]
                                 → [ProtonMail Bridge (Email)]
                                 → [Jellyfin (Media)]
                                 → [Vaultwarden (Password Manager)]
```
Each component runs on dedicated infrastructure with explicit data boundaries.
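One practical way to enforce those boundaries is to give every service its own internal network and let only the reverse proxy bridge them. A minimal sketch, assuming Docker as the runtime; the network and container names here are placeholders, not the exact layout of this stack:

```bash
# Each backend network is internal (no outbound routing); only the reverse proxy
# joins the shared frontend network.
docker network create --internal nextcloud-backend
docker network create --internal vaultwarden-backend
docker network create proxy-frontend

# Attach the reverse proxy (assumed here to be a container named "traefik")
# to each backend it must reach.
docker network connect nextcloud-backend traefik
```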
Technical Tradeoffs
| Factor | Commercial Cloud | Self-Hosted |
|---|---|---|
| Uptime SLA | 99.9-99.99% | Depends on architecture |
| Storage Cost (TB/mo) | $23 (GCP) - $40 (AWS) | $5 (HDD) - $15 (SSD) |
| Security Defaults | Automatic updates | Manual patch management |
| Data Control | Limited (TOS-bound) | Full cryptographic control |
Key Technologies
- Proxmox VE: KVM/LXC virtualization platform with built-in clustering and HA
- Ceph: Distributed storage system with Erasure Coding
- Ansible: Infrastructure-as-Code for configuration management (see the sketch after this list)
- Traefik: Cloud-native edge router with Let's Encrypt integration
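To give a taste of the Ansible piece, here is a minimal sketch of how the three nodes can be targeted from a control machine. The inventory layout and the site.yml playbook name are assumptions for illustration, not the exact repository used in this build:

```bash
# Hypothetical inventory listing the three cluster nodes
cat > inventory.ini <<'EOF'
[pve]
10.10.10.1
10.10.10.2
10.10.10.3
EOF

# Dry-run the (hypothetical) site.yml playbook against the cluster before applying it
ansible-playbook -i inventory.ini site.yml --limit pve --check --diff
```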
When Self-Hosting Fails
Avoid core business systems requiring:
- Global anycast networks (use Cloudflare DNS)
- Petabyte-scale object storage (consider Backblaze B2)
- AI/ML training clusters (limited GPU alternatives)
3. Prerequisites
Hardware Minimums
- 3 Nodes (High Availability Cluster):
- CPU: Xeon E3-1230v6 (4c/8t)
- RAM: 32GB ECC DDR4
- Storage: 2x NVMe (OS), 4x 8TB HDD (Ceph OSDs)
- Network: 2x 10GbE (Storage/Public)
Pre-Installation Checklist
- Network Architecture:
- VLAN segmentation (Management, Storage, Public)
- BGP peering for anycast services (FRRouting)
- Physical firewall (OPNsense/pfSense)
- Security Foundation:
- Hardware Security Module (YubiHSM 2)
- Offline certificate authority (Step CA)
- Encrypted DNS (Unbound + DNS-over-TLS; see the sketch after this checklist)
- Software Requirements:
- Proxmox VE 7.4+ (no-subscription repo)
- Ceph Quincy 17.2.6
- Docker 20.10.23 with containerd
- Ubuntu 22.04.3 LTS (Kernel 5.15 HWE)
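For the encrypted DNS item above, a minimal Unbound forwarding sketch; the Quad9 upstreams are an example choice, not a requirement of this stack:

```bash
# Forward all queries over TLS to a validating upstream resolver (example: Quad9)
cat > /etc/unbound/unbound.conf.d/dns-over-tls.conf <<'EOF'
server:
    tls-cert-bundle: /etc/ssl/certs/ca-certificates.crt
forward-zone:
    name: "."
    forward-tls-upstream: yes
    forward-addr: 9.9.9.9@853#dns.quad9.net
    forward-addr: 149.112.112.112@853#dns.quad9.net
EOF
systemctl restart unbound
```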
Critical Configuration Files
/etc/apt/sources.list.d/proxmox.list:
```
deb http://download.proxmox.com/debian/pve bullseye pve-no-subscription
```
/etc/ceph/ceph.conf:
```ini
[global]
osd_pool_default_size = 3
osd_pool_default_min_size = 2
mon_allow_pool_delete = true
```
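With those defaults in place, application pools can be created up front. A sketch for the Nextcloud pool referenced by the volume definitions later on; the PG count of 128 is an assumption that should be sized to the actual number of OSDs:

```bash
# Create a replicated pool for Nextcloud data and tag it for RBD use
ceph osd pool create nextcloud 128 128 replicated
ceph osd pool application enable nextcloud rbd
rbd pool init nextcloud
```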
4. Installation & High-Availability Configuration
Proxmox VE Cluster Initialization
First node:
```bash
proxmox-boot-tool format /dev/nvme0n1p2 --force
proxmox-boot-tool init /dev/nvme0n1p2
pvecm create HA-CLUSTER --link0 10.10.10.1
```
Subsequent nodes:
```bash
pvecm add 10.10.10.1 --link0 10.10.10.2 --force
```
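Before moving on, it is worth confirming that all three nodes see each other and that corosync has quorum. A quick check, not part of the original procedure:

```bash
# Expect "Quorate: Yes" and all three nodes listed
pvecm status
corosync-cfgtool -s   # per-link connectivity of the corosync rings
```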
Ceph Deployment
Create OSDs with encryption:
```bash
ceph-volume lvm create --data /dev/sdb --dmcrypt
ceph-volume lvm create --data /dev/sdc --dmcrypt
```
Configure CRUSH map for rack-awareness:
```bash
ceph osd crush add-bucket rack1 rack
ceph osd crush move rack1 root=default
# Move each host bucket (its OSDs follow) under the rack, e.g. for node1:
ceph osd crush move node1 rack=rack1
```
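Rack buckets only pay off once a CRUSH rule actually uses rack as the failure domain. A hedged sketch, assuming the replicated Nextcloud pool created earlier; the rule name is arbitrary:

```bash
# Replicated rule that spreads copies across racks instead of hosts
ceph osd crush rule create-replicated rack-spread default rack
# Point the Nextcloud pool at the new rule
ceph osd pool set nextcloud crush_rule rack-spread
```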
Docker with Overlay2 & ZFS
```bash
# Note: overlay2.size only takes effect when /var/lib/docker sits on xfs mounted with pquota
cat > /etc/docker/daemon.json <<EOF
{
  "storage-driver": "overlay2",
  "storage-opts": [
    "overlay2.override_kernel_check=true",
    "overlay2.size=100G"
  ],
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "100m",
    "max-file": "3"
  }
}
EOF
```
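After writing daemon.json, Docker needs a restart, and the active drivers can be read back to confirm the settings took effect:

```bash
systemctl restart docker
# Should print "overlay2 json-file"
docker info --format '{{.Driver}} {{.LoggingDriver}}'
```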
Service Deployment Example: Nextcloud
docker-compose.yml:
```yaml
version: '3.7'

services:
  nextcloud:
    image: nextcloud:25.0.7-apache
    container_name: nextcloud
    networks:
      - frontend
      - backend
    volumes:
      - nextcloud:/var/www/html
      - /mnt/ceph/nextcloud:/var/www/html/data
    environment:
      - MYSQL_HOST=db
      - REDIS_HOST=redis
      - OVERWRITEPROTOCOL=https
    deploy:
      resources:
        limits:
          memory: 4G

  db:
    image: mariadb:10.11
    container_name: nextcloud-db
    networks:
      - backend
    volumes:
      - db:/var/lib/mysql
    environment:
      - MYSQL_ROOT_PASSWORD_FILE=/run/secrets/db_root_password
    secrets:
      - db_root_password

secrets:
  db_root_password:
    file: ./db_root_password.txt

networks:
  frontend:
    driver: bridge
    ipam:
      config:
        - subnet: 172.22.0.0/24
  backend:
    driver: bridge
    ipam:
      config:
        - subnet: 172.23.0.0/24

volumes:
  nextcloud:
    driver: ceph
    driver_opts:
      name: ceph
      pool: nextcloud
      volume: nextcloud-vol
      monitors: 10.10.10.1:6789,10.10.10.2:6789,10.10.10.3:6789
      secret: $CEPHX_SECRET
  db:
    driver: zfs
    driver_opts:
      size: 100G
```
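Bringing the stack up is then a standard Compose workflow; the commands below assume the Compose v2 plugin and that db_root_password.txt already exists next to docker-compose.yml:

```bash
# Validate the file, then start and inspect the stack
docker compose config --quiet
docker compose up -d
docker compose ps
```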
Verification Steps
Check Ceph health:
```
ceph -s
  cluster:
    id:     a7f64266-6b9a-4b88-8b4d-362b0f1a2c7e
    health: HEALTH_OK
```
Test Docker volume:
```bash
docker run --rm -v nextcloud:/mnt alpine ls -l /mnt
```
5. Enterprise-Grade Configuration
Security Hardening
- Kernel Parameters: /etc/sysctl.d/99-hardening.conf (loaded with the sketch after this list):

  ```
  net.ipv4.tcp_syncookies = 1
  net.ipv4.conf.all.rp_filter = 1
  kernel.kptr_restrict = 2
  ```
- AppArmor Profiles: nextcloud-profile:

  ```
  #include <tunables/global>

  profile nextcloud flags=(attach_disconnected) {
    #include <abstractions/apache2-common>

    /var/www/html/** r,
    /mnt/ceph/nextcloud/** rw,
    deny /var/www/html/data/*.php rwx,
  }
  ```
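Both hardening pieces need to be activated explicitly. A short sketch, assuming the AppArmor profile was saved as /etc/apparmor.d/nextcloud-profile (that path is an assumption):

```bash
# Load all sysctl drop-ins, including 99-hardening.conf
sysctl --system

# Load (or reload) the AppArmor profile and confirm it is enforced
apparmor_parser -r /etc/apparmor.d/nextcloud-profile
aa-status | grep nextcloud

# For a containerized Nextcloud the profile is attached at run time, e.g.:
#   docker run --security-opt apparmor=nextcloud ...
```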
Performance Optimization
Ceph CRUSH Tunables:
```bash
ceph osd crush tunables optimal
ceph osd set-require-min-compat-client jewel
```
ZFS ARC Size Adjustment:
```bash
# Cap the ARC well below total RAM (these nodes have 32 GB) so guests and Ceph keep headroom
echo $((16 * 1024 * 1024 * 1024)) > /sys/module/zfs/parameters/zfs_arc_max
```
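The echo above only lasts until reboot. To make the cap persistent, the module option can be set in a modprobe config; the 16 GiB value matches the runtime setting above:

```bash
# 16 GiB = 17179869184 bytes
echo "options zfs zfs_arc_max=17179869184" > /etc/modprobe.d/zfs.conf
# Regenerate the initramfs so the option is applied at boot
update-initramfs -u
```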
Automated Certificate Management
Traefik dynamic configuration (dynamic.yml):
```yaml
tls:
  certificates:
    - certFile: /etc/step/certs/site.crt
      keyFile: /etc/step/certs/site.key
  stores:
    default:
      defaultCertificate:
        certFile: /etc/step/certs/site.crt
        keyFile: /etc/step/certs/site.key

http:
  routers:
    nextcloud:
      rule: "Host(`cloud.example.com`)"
      service: nextcloud
      tls:
        certResolver: step
```
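The certResolver named step must also be declared in Traefik's static configuration. A hedged sketch of what that could look like against an internal step-ca ACME endpoint; the CA URL, email, and storage path are assumptions:

```bash
cat > /etc/traefik/traefik.yml <<'EOF'
# Static configuration: ACME resolver backed by an internal step-ca instance
certificatesResolvers:
  step:
    acme:
      caServer: https://ca.internal.example.com/acme/acme/directory
      email: admin@example.com
      storage: /etc/traefik/acme.json
      tlsChallenge: {}
EOF
```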
6. Operational Workflows
Daily Maintenance
Prune Docker resources:
```bash
docker system prune -af --volumes --filter "until=720h"
```
ZFS snapshot rotation:
```bash
zfs snap rpool/data@$(date +%Y%m%d)
zfs destroy -r rpool/data@$(date -d "30 days ago" +%Y%m%d)
```
Backup Strategy
BorgBackup to remote storage:
```bash
borg create --stats --progress \
    ssh://backup@nas01:22/mnt/backup/nextcloud::nextcloud-{now} \
    /mnt/ceph/nextcloud \
    --exclude '*.tmp'
```
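Creation without pruning eventually fills the repository; a retention sketch that pairs with the job above (the keep counts are an example policy, not the one used here):

```bash
borg prune --stats \
    ssh://backup@nas01:22/mnt/backup/nextcloud \
    --keep-daily 7 --keep-weekly 4 --keep-monthly 6
```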
Monitoring Stack
Prometheus scrape config:
```yaml
scrape_configs:
  - job_name: 'proxmox'
    static_configs:
      - targets: ['10.10.10.1:9221', '10.10.10.2:9221']
  - job_name: 'ceph'
    metrics_path: /metrics
    static_configs:
      - targets: ['10.10.10.1:9283']
```
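Scraping is only half the job; an example alerting rule on the ceph-mgr health metric (the file path and thresholds are assumptions):

```bash
cat > /etc/prometheus/rules/ceph.yml <<'EOF'
# Remember to reference this file under rule_files: in prometheus.yml
groups:
  - name: ceph
    rules:
      - alert: CephHealthDegraded
        # ceph_health_status: 0 = HEALTH_OK, 1 = HEALTH_WARN, 2 = HEALTH_ERR
        expr: ceph_health_status > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Ceph cluster has left HEALTH_OK"
EOF
```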
7. Troubleshooting Handbook
Common Failure Scenarios
| Symptom | Diagnostic Command | Resolution |
|---|---|---|
| Ceph OSD down | ceph osd tree -f json-pretty | systemctl restart ceph-osd@$ID |
| Container network failure | nsenter -t $PID -n ping 8.8.8.8 | Check iptables/firewalld rules |
| ZFS pool degraded | zpool status -v | zpool replace pool bad-disk new-disk |
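For the container-network row above, the target PID comes from the container's inspect output; a quick sketch using the Nextcloud container from section 4:

```bash
PID=$(docker inspect --format '{{.State.Pid}}' nextcloud)
nsenter -t "$PID" -n ping -c 3 8.8.8.8
```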
Log Investigation
journalctl with time constraints:
```bash
journalctl --since "2023-07-15 09:00:00" --until "2023-07-15 12:00:00" -u ceph-mon@node1
```
Docker container inspection:
```bash
docker inspect $CONTAINER_ID | jq
```
8. Conclusion
After 18 months of operation, this self-hosted infrastructure handles:
- 12TB of family photos/videos (Jellyfin)
- 300GB of documents (Nextcloud)
- 50,000+ emails (ProtonMail Bridge)
Critical lessons learned:
- Redundancy Is Non-Negotiable: Three-node minimum for any production service
- Automate or Perish: Unattended-upgrades + Ansible = survival
- Monitor Everything: A single failed OSD can cascade into pool failure
The stack currently achieves 99.82% uptime – not quite enterprise SLA, but sufficient for personal use. For those considering similar migrations: start with non-critical services, implement monitoring before migration, and always maintain offline backups.
Further Resources:
- Proxmox Cluster Manager Documentation
- Ceph Erasure Coding Profiles
- BorgBackup Practical Examples
- Traefik ACME Configuration
The journey to digital sovereignty requires technical rigor but delivers unparalleled control. As surveillance capitalism intensifies, the ability to maintain private infrastructure becomes not just a technical challenge, but an ethical imperative.