# Immich Needs Our Help: Building the Future of Self-Hosted Photo Management
## Introduction
In the evolving landscape of self-hosted photo management, Immich has emerged as a leading open-source alternative to proprietary solutions like Google Photos. However, there’s a critical challenge threatening its progress: the need for a comprehensive public EXIF dataset to train its metadata parsing capabilities. As DevOps engineers and system administrators responsible for homelab environments and production deployments, our technical expertise is needed to solve this infrastructure-level challenge.
The problem is both technical and architectural: Accurate EXIF metadata parsing requires diverse training data from various camera models, smartphone manufacturers, and shooting conditions. Without this dataset, Immich’s ability to properly organize photos based on embedded metadata remains limited - a crucial feature for any serious photo management solution in self-hosted environments.
This technical deep dive will explore:
- The infrastructure implications of Immich’s EXIF dataset initiative
- Secure data contribution methods for DevOps professionals
- Metadata handling best practices for homelab environments
- The long-term architectural benefits to Immich’s ecosystem
For those managing self-hosted photo solutions, participating in this dataset creation isn’t just about community support - it’s an investment in the future robustness of your own Immich deployments. The quality of metadata processing directly impacts search accuracy, organization capabilities, and ultimately the user experience of your photo management infrastructure.
## Understanding the EXIF Dataset Challenge
### What Is Immich?
Immich is an open-source, self-hosted photo management solution designed as a Google Photos alternative. Its architecture consists of several microservices:
- Web UI: SvelteKit-based frontend
- Server: NestJS backend written in TypeScript
- Machine Learning: Python service running ONNX models for image recognition
- Database: PostgreSQL for metadata storage
- Storage: Filesystem-based asset storage
Key technical capabilities include:
- Automated album creation through ML
- Facial recognition
- Metadata-based organization
- Cross-device synchronization
### The Metadata Parsing Problem
EXIF (Exchangeable Image File Format) metadata contains crucial information about digital photos:
- Camera make/model
- Geographic coordinates
- Exposure settings
- Creation timestamps
- Copyright information
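To see what these fields look like in practice, here is a minimal sketch that dumps them with ExifTool's JSON output (ExifTool is covered in the prerequisites below; `photo.jpg` is a placeholder file name):

```python
# Minimal sketch: dump the EXIF fields listed above via ExifTool's JSON output.
# Assumes exiftool is on PATH; photo.jpg is a placeholder sample file.
import json
import subprocess

FIELDS = ["Make", "Model", "GPSLatitude", "GPSLongitude", "ExposureTime",
          "FNumber", "ISO", "DateTimeOriginal", "Copyright"]

# -json emits one object per file; -n keeps numeric values machine-readable
raw = subprocess.run(
    ["exiftool", "-json", "-n", *[f"-{f}" for f in FIELDS], "photo.jpg"],
    capture_output=True, text=True, check=True,
).stdout

metadata = json.loads(raw)[0]
for field in FIELDS:
    print(f"{field}: {metadata.get(field, '<missing>')}")
```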
The challenge arises from:
- Vendor Implementation Differences: Each manufacturer implements EXIF standards differently
- File Format Variations: RAW formats (.CR2, .NEF, .ARW) contain proprietary metadata structures
- Smartphone Complexity: Modern devices embed additional metadata in non-standard locations
Without a comprehensive training dataset, Immich’s parsing engine cannot properly handle these variations, leading to:
- Incorrect date/time assignments
- Missing geolocation data
- Improper camera model identification
- Loss of critical photographic metadata
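To make the date/time failure mode concrete, here is a hedged sketch of the fallback logic a parser needs when different vendors populate different timestamp tags (the tag names are standard EXIF/ExifTool names, but the priority order is illustrative, not Immich's actual implementation):

```python
# Sketch: choose a capture timestamp when vendors populate different tags.
# Tag priority below is illustrative only.
from datetime import datetime
from typing import Optional

CANDIDATE_TAGS = ["SubSecDateTimeOriginal", "DateTimeOriginal", "CreateDate", "ModifyDate"]

def best_timestamp(exif: dict) -> Optional[datetime]:
    for tag in CANDIDATE_TAGS:
        value = exif.get(tag)
        if not value:
            continue
        # EXIF dates use "YYYY:MM:DD HH:MM:SS", sometimes with fractional seconds
        for fmt in ("%Y:%m:%d %H:%M:%S.%f", "%Y:%m:%d %H:%M:%S"):
            try:
                return datetime.strptime(value, fmt)
            except ValueError:
                continue
    return None  # caller falls back to filesystem mtime or file-name parsing

# Example: best_timestamp({"DateTimeOriginal": "2023:07:14 09:21:05"})
# -> datetime(2023, 7, 14, 9, 21, 5)
```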
### The Dataset Infrastructure Requirements
Building an effective EXIF dataset requires:
- Data Diversity:
  - Minimum 10,000 images from different sources
  - Representation from 100+ device manufacturers
  - Multiple file formats (JPEG, HEIC, RAW variants)
- Processing Infrastructure:
  - Secure upload pipeline with checksum verification
  - EXIF extraction automation
  - Anonymization pipeline for sensitive data
  - Distributed storage for the resulting dataset
- Compliance Considerations:
  - GDPR-compliant data handling
  - CCPA privacy requirements
  - Proper license structuring (CC BY 4.0 recommended)
### DevOps Implications
For infrastructure engineers, this initiative presents several technical challenges:
- Data Pipeline Architecture (the `anonymize()` step is sketched after this list):

```python
# Simplified EXIF processing pipeline
def process_upload(image):
    # Step 1: Verify file integrity
    verify_checksum(image)

    # Step 2: Extract metadata
    exif_data = extract_exif(image)

    # Step 3: Anonymize sensitive fields
    sanitized_data = anonymize(exif_data)

    # Step 4: Store in dataset format
    store_in_dataset(sanitized_data)

    # Step 5: Publish to public repository
    publish_dataset_update()
```
- Storage Requirements:
  - Initial dataset: ~500GB (50,000 images @ 10MB/image)
  - Growth projection: 2TB/year with ongoing contributions
- Processing Workload:
  - Batch processing needs: 100 vCPU cluster for initial processing
  - Ongoing maintenance: Serverless architecture for new submissions
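The `anonymize()` step in the pipeline sketch above is left abstract; one minimal interpretation, assuming the extracted EXIF payload is a flat tag-to-value dictionary (the tag selection is illustrative, not an official policy), could be:

```python
# Sketch of an anonymize() step: drop tags that identify a person or a location.
# Assumes exif_data is a flat dict of tag name -> value; the tag list is illustrative.
SENSITIVE_TAGS = {"SerialNumber", "OwnerName", "Artist", "Copyright", "UserComment"}
SENSITIVE_PREFIXES = ("GPS",)

def anonymize(exif_data: dict) -> dict:
    sanitized = {}
    for tag, value in exif_data.items():
        if tag in SENSITIVE_TAGS or tag.startswith(SENSITIVE_PREFIXES):
            continue  # discard personally identifying tags entirely
        sanitized[tag] = value
    return sanitized

# anonymize({"Make": "Canon", "GPSLatitude": 52.5, "SerialNumber": "123"})
# -> {"Make": "Canon"}
```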
## Prerequisites for Technical Contribution
### Data Preparation Requirements
Before contributing photos to the EXIF dataset, DevOps professionals should implement the following sanitization pipeline:
- Isolation Environment:

```bash
# Create secure processing environment
docker run -it --rm --name immich-exif-processing \
  -v $PWD/input:/input \
  -v $PWD/output:/output \
  ubuntu:22.04 bash
```
- Required Tools:
  - ExifTool 12.60+
  - jq 1.6+
  - ImageMagick 7.1+
- Sanitization Process:

```bash
# Sample sanitization script
for file in /input/*.{jpg,jpeg,heic}; do
  # Remove all metadata, then restore only the EXIF block
  exiftool -all= -tagsfromfile @ -exif:all "$file"

  # Strip GPS coordinates
  exiftool -gps:all= "$file"

  # Remove any remaining private tags
  exiftool -xmp:all= -iptc:all= "$file"

  # Verify sanitization (this should print nothing)
  exiftool "$file" | grep -i -E 'gps|location|creator|copyright'

  # Discard exiftool's backup copies - they still contain the original metadata
  rm -f "${file}_original"

  # Move the sanitized file to the output directory, named by content hash
  mv "$file" /output/$(sha256sum "$file" | cut -d' ' -f1).${file##*.}
done
```
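The inline `grep` check above only prints offending tags; for larger batches, an audit script that fails the run outright may be preferable. A sketch, assuming exiftool is installed and `/output` holds the sanitized files:

```python
# Post-sanitization audit (sketch): exit non-zero if any sensitive tag survived.
# Assumes exiftool is on PATH and /output contains the sanitized images.
import json
import subprocess
import sys

SENSITIVE = ("gps", "location", "creator", "copyright", "serialnumber")

raw = subprocess.run(
    ["exiftool", "-json", "-r", "/output"],
    capture_output=True, text=True, check=True,
).stdout
entries = json.loads(raw) if raw.strip() else []

dirty = []
for entry in entries:
    hits = [tag for tag in entry if any(word in tag.lower() for word in SENSITIVE)]
    if hits:
        dirty.append((entry.get("SourceFile"), hits))

if dirty:
    for path, hits in dirty:
        print(f"{path}: {', '.join(hits)}", file=sys.stderr)
    sys.exit(1)

print("All files passed the sanitization audit")
```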
### Security Considerations
When contributing to public datasets:
- Network Isolation:
  - Process images on a non-routable network segment
  - Use temporary cloud instances for processing
- Storage Encryption:
  - Encrypt at rest with LUKS or similar
  - Use temporary storage with secure erasure
- Content Verification (see the sketch after this list):
  - Implement SHA-256 checksum verification
  - Use GPG-signed manifests for contributed data
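For the content-verification item above, here is a sketch of generating a SHA-256 manifest and clear-signing it (the manifest layout, paths, and GPG usage are assumptions for illustration, not the project's official contribution format):

```python
# Sketch: build a SHA-256 manifest for a contribution and clear-sign it with GPG.
# Manifest layout and paths are illustrative assumptions.
import hashlib
import json
import pathlib
import subprocess

def build_signed_manifest(dataset_dir: str, manifest_path: str = "manifest.json") -> None:
    entries = []
    for path in sorted(pathlib.Path(dataset_dir).rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            entries.append({"path": str(path), "checksum": digest})

    pathlib.Path(manifest_path).write_text(json.dumps({"files": entries}, indent=2))

    # Clear-sign so recipients can verify integrity and origin (writes manifest.json.asc)
    subprocess.run(["gpg", "--clearsign", manifest_path], check=True)

# build_signed_manifest("/dataset/contribution")
```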
### Legal Compliance Checklist
- Confirm images contain no personally identifiable information
- Verify no copyrighted material included
- Ensure proper CC BY 4.0 licensing applied
- Document data provenance and processing steps
## Infrastructure Setup for Dataset Contribution
### Secure Upload Environment
For enterprise-scale contributions, set up a dedicated submission environment:
- Network Architecture:
```
+---------------------+
|    Reverse Proxy    |
|  (TLS Termination)  |
+----------+----------+
           |
+----------v----------+
|   Upload Gateway    |
|   (Rate Limiting)   |
+----------+----------+
           |
+----------v----------+
|    Sanitization     |
|    Microservice     |
+----------+----------+
           |
+----------v----------+
|      Checksum       |
|    Verification     |
+----------+----------+
           |
+----------v----------+
|   Dataset Storage   |
|     (Immutable)     |
+---------------------+
```

- Dockerized Processing:

```yaml
# docker-compose.yml for upload processing
version: '3.8'
services:
  uploader:
    image: nginx:1.23-alpine
    volumes:
      - ./uploads:/tmp/uploads
    ports:
      - "8080:80"

  processor:
    image: immich/exif-processor:latest
    environment:
      - SANITIZATION_LEVEL=strict
      - OUTPUT_FORMAT=parquet
    volumes:
      - ./uploads:/input
      - ./dataset:/output

  verifier:
    image: immich/checksum-validator:latest
    volumes:
      - ./dataset:/input
    depends_on:
      - processor
```
### Contribution Workflow
For DevOps teams managing large photo archives:
1. **Enterprise Contribution Process**:
```mermaid
graph TB
A[Identify Source Archives] --> B[Initial Metadata Scan]
B --> C{Contains Sensitive Data?}
C -->|Yes| D[Sanitization Pipeline]
C -->|No| E[Checksum Generation]
D --> E
E --> F[Dataset Packaging]
F --> G[Secure Transfer]
G --> H[Immich Dataset Repository]
```

2. **Automated Contribution Script**:

```bash
#!/bin/bash
# Automated EXIF dataset contribution script

SOURCE_DIR="/archive/photos"
OUTPUT_DIR="/dataset/contribution"
TEMP_DIR=$(mktemp -d)

# Process RAW files
find "$SOURCE_DIR" -type f \( -iname '*.cr2' -o -iname '*.nef' \) -print0 |
while IFS= read -r -d $'\0' file; do
    # Convert to DNG for a standard format
    docker run --rm -v "$(dirname "$file")":/input -v "$TEMP_DIR":/output \
        adobe/dng_converter -d /output /input/"$(basename "$file")"

    # Extract metadata from the converted file
    base=$(basename "${file%.*}")
    exiftool -json "$TEMP_DIR/$base.dng" > "$OUTPUT_DIR/${file##*/}.json"
done

# Generate manifest
jq -n '{ files: [ inputs ] }' "$OUTPUT_DIR"/*.json > manifest.json

# Create checksum
sha256sum "$OUTPUT_DIR"/* | gpg --clearsign > manifest.sha256

# Upload package (bundle the metadata files so curl posts a single archive)
tar czf contribution.tar.gz -C "$OUTPUT_DIR" .
curl -X POST -H "Authorization: Bearer $API_TOKEN" \
    -F "manifest=@manifest.json" \
    -F "data=@contribution.tar.gz" \
    https://datasets.immich.app/upload
```
## Configuration and Optimization
### Dataset Processing Optimization
For large-scale contributions, optimize processing with:
1. **Parallel EXIF Extraction**:
```bash
# GNU Parallel processing example
find /photo_archive -type f -name '*.jpg' | parallel -j 8 \
'exiftool -j {} > /output/{/.}.json'
```

2. **Distributed Processing**:

```yaml
# Kubernetes Job configuration for EXIF processing
apiVersion: batch/v1
kind: Job
metadata:
  name: exif-processor
spec:
  parallelism: 20
  completions: 1000
  template:
    spec:
      containers:
        - name: processor
          image: exiftool:latest
          command: ["/process.sh"]
          volumeMounts:
            - name: photos
              mountPath: /input
            - name: output
              mountPath: /output
      restartPolicy: Never
      volumes:
        - name: photos
          persistentVolumeClaim:
            claimName: photo-archive
        - name: output
          persistentVolumeClaim:
            claimName: dataset-output
```
### Security Hardening
For dataset contributions containing potentially sensitive metadata:
- Metadata Firewall Rules:

```bash
# iptables rules for processing network
iptables -A OUTPUT -p tcp --dport 443 -d datasets.immich.app -j ACCEPT
iptables -A OUTPUT -p tcp --dport 80 -j DROP
iptables -A OUTPUT -p tcp --dport 443 -j DROP
```

- Mandatory Access Controls:

```
# AppArmor profile for EXIF processing
#include <tunables/global>

profile immich-exif flags=(attach_disconnected) {
  # Allow read-only access to input
  /input/** r,

  # Allow writing to output
  /output/** rw,

  # Block network access
  deny network,
}
```
## Operational Best Practices
### Dataset Maintenance Procedures
For teams managing ongoing contributions:
1. **Validation Workflow**:
```python
# Dataset validation script
import json
import hashlib

def validate_contribution(manifest_path):
    with open(manifest_path) as f:
        manifest = json.load(f)

    for entry in manifest['files']:
        # Verify file integrity
        with open(entry['path'], 'rb') as f:
            sha = hashlib.sha256(f.read()).hexdigest()
        if sha != entry['checksum']:
            raise ValueError(f"Checksum mismatch for {entry['path']}")

        # Validate metadata completeness
        required_fields = ['Make', 'Model', 'DateTimeOriginal']
        for field in required_fields:
            if field not in entry['exif']:
                print(f"Warning: Missing {field} in {entry['path']}")
```
2. **Automated Quality Checks**:

```bash
# Quality control checks
exiftool -csv -r /dataset | awk -F, '
BEGIN {
    OFS=","
    print "File,Make,Model,DateTimeOriginal"
}
NR>1 {
    if ($3 == "" || $4 == "" || $6 == "") {
        print $1, "MISSING DATA" > "/dev/stderr"
        exit 1
    }
}'
```
### Backup Strategy
For dataset curators:
- 3-2-1 Backup Methodology:
  - 3 copies: Primary + two backups
  - 2 media types: Object storage + tape
  - 1 offsite: Geographical separation
- Immutable Backups (MinIO bucket policy for dataset storage):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Deny",
      "Action": ["s3:DeleteObject"],
      "Principal": "*",
      "Resource": ["arn:aws:s3:::immich-dataset/*"],
      "Condition": {
        "DateLessThan": {"aws:CurrentTime": "2026-01-01T00:00:00Z"}
      }
    }
  ]
}
```
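Applying that policy to a MinIO bucket can go through any S3-compatible client; a sketch using boto3, where the endpoint URL, credentials, and bucket name are placeholders:

```python
# Sketch: apply the deny-delete policy above to a MinIO bucket via the S3 API.
# Endpoint, credentials, and bucket name are placeholders.
import json
import boto3

policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Deny",
        "Action": ["s3:DeleteObject"],
        "Principal": "*",
        "Resource": ["arn:aws:s3:::immich-dataset/*"],
        "Condition": {"DateLessThan": {"aws:CurrentTime": "2026-01-01T00:00:00Z"}},
    }],
}

s3 = boto3.client(
    "s3",
    endpoint_url="https://minio.example.internal:9000",
    aws_access_key_id="DATASET_ADMIN",
    aws_secret_access_key="CHANGE_ME",
)
s3.put_bucket_policy(Bucket="immich-dataset", Policy=json.dumps(policy))
```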
## Troubleshooting Common Issues
### Dataset Contribution Challenges
- Metadata Persistence After Sanitization:

```bash
# Verify complete removal of GPS data - any output here means GPS tags survived
exiftool -gps:all -filename -directory -r /processed_files | grep -i "GPS"
```
- Invalid File Formats:

```bash
# Batch conversion of HEIC files to JPEG with ImageMagick
# (requires a build with HEIC support; EXIF profiles are carried over)
find . -type f -name '*.heic' -exec sh -c \
  'magick "$1" "${1%.heic}.jpg"' _ {} \;
```
- Corrupted Metadata:

```bash
# Repair broken EXIF structures
exiftool -all= -tagsfromfile @ -all:all -unsafe bad_image.jpg
```
### Performance Tuning
For large dataset processing:
- I/O Optimization:

```bash
# Mount options for processing storage
mount -o noatime,nodiratime,data=writeback /dev/sdb /dataset
```
- Memory Caching:

```bash
# Use vmtouch to cache metadata files
vmtouch -t -m 16G /dataset/metadata/
```
## Conclusion
The Immich EXIF dataset initiative represents a critical infrastructure project for the future of self-hosted photo management. As DevOps professionals, we have both the technical expertise and the operational discipline to contribute high-quality, secure metadata that will benefit the entire Immich ecosystem.