Immich Needs Our Help: Building the Future of Self-Hosted Photo Management

Introduction

In the evolving landscape of self-hosted photo management, Immich has emerged as a leading open-source alternative to proprietary solutions like Google Photos. However, there’s a critical challenge threatening its progress: the need for a comprehensive public EXIF dataset to train its metadata parsing capabilities. As DevOps engineers and system administrators responsible for homelab environments and production deployments, our technical expertise is needed to solve this infrastructure-level challenge.

The problem is both technical and architectural: accurate EXIF metadata parsing requires diverse training data covering many camera models, smartphone manufacturers, and shooting conditions. Without this dataset, Immich's ability to organize photos by their embedded metadata remains limited, and that capability is crucial for any serious photo management solution in self-hosted environments.

This technical deep dive will explore:

  1. The infrastructure implications of Immich’s EXIF dataset initiative
  2. Secure data contribution methods for DevOps professionals
  3. Metadata handling best practices for homelab environments
  4. The long-term architectural benefits to Immich’s ecosystem

For those managing self-hosted photo solutions, participating in this dataset creation isn’t just about community support - it’s an investment in the future robustness of your own Immich deployments. The quality of metadata processing directly impacts search accuracy, organization capabilities, and ultimately the user experience of your photo management infrastructure.

Understanding the EXIF Dataset Challenge

What Is Immich?

Immich is an open-source, self-hosted photo management solution designed as a Google Photos alternative. Its architecture consists of several microservices:

  • Web UI: React-based frontend
  • Server: NestJS backend with TypeScript
  • Machine Learning: TensorFlow-based image recognition
  • Database: PostgreSQL for metadata storage
  • Storage: Filesystem or S3-compatible object storage

Key technical capabilities include:

  • Automated album creation through ML
  • Facial recognition
  • Metadata-based organization
  • Cross-device synchronization

The Metadata Parsing Problem

EXIF (Exchangeable Image File Format) metadata contains crucial information about digital photos:

  • Camera make/model
  • Geographic coordinates
  • Exposure settings
  • Creation timestamps
  • Copyright information

The challenge arises from:

  1. Vendor Implementation Differences: Each manufacturer implements EXIF standards differently
  2. File Format Variations: RAW formats (.CR2, .NEF, .ARW) contain proprietary metadata structures
  3. Smartphone Complexity: Modern devices embed additional metadata in non-standard locations

Without a comprehensive training dataset, Immich’s parsing engine cannot properly handle these variations, leading to:

  • Incorrect date/time assignments
  • Missing geolocation data
  • Improper camera model identification
  • Loss of critical photographic metadata
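
To see this variation firsthand, a few lines of Python are enough to dump whatever standard tags a single image exposes. A minimal sketch, assuming Pillow is installed (it reads only the top-level IFD0 tags; Exif sub-IFD entries such as DateTimeOriginal and vendor MakerNotes need extra handling, which is precisely the gap a shared dataset helps close):

    # Dump the readable top-level EXIF tags of one image (MakerNotes and
    # vendor-specific sub-IFDs are not decoded here).
    import json
    import sys

    from PIL import ExifTags, Image

    def dump_exif(path: str) -> dict:
        """Map numeric EXIF tag IDs to their standard names, where known."""
        exif = Image.open(path).getexif()
        return {ExifTags.TAGS.get(tag_id, f"Unknown-{tag_id}"): value
                for tag_id, value in exif.items()}

    if __name__ == "__main__":
        print(json.dumps(dump_exif(sys.argv[1]), indent=2, default=str))

Run it against photos from two different phone vendors and the returned tag sets often differ even for conceptually identical information, which is why broad, real-world samples matter more than any single vendor's documentation.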

The Dataset Infrastructure Requirements

Building an effective EXIF dataset requires:

  1. Data Diversity:
    • Minimum 10,000 images from different sources
    • Representation from 100+ device manufacturers
    • Multiple file formats (JPEG, HEIC, RAW variants)
  2. Processing Infrastructure:
    • Secure upload pipeline with checksum verification
    • EXIF extraction automation
    • Anonymization pipeline for sensitive data
    • Distributed storage for the resulting dataset
  3. Compliance Considerations:
    • GDPR-compliant data handling
    • CCPA privacy requirements
    • Proper license structuring (CC BY 4.0 recommended)

DevOps Implications

For infrastructure engineers, this initiative presents several technical challenges:

  1. Data Pipeline Architecture (the anonymize() and store_in_dataset() helpers stubbed here are sketched after this list):
     # Simplified EXIF processing pipeline
     def process_upload(image):
         # Step 1: Verify file integrity
         verify_checksum(image)

         # Step 2: Extract metadata
         exif_data = extract_exif(image)

         # Step 3: Anonymize sensitive fields
         sanitized_data = anonymize(exif_data)

         # Step 4: Store in dataset format
         store_in_dataset(sanitized_data)

         # Step 5: Publish to public repository
         publish_dataset_update()
    
  2. Storage Requirements:
    • Initial dataset: ~500GB (50,000 images @ 10MB/image)
    • Growth projection: 2TB/year with ongoing contributions
  3. Processing Workload:
    • Batch processing needs: 100 vCPU cluster for initial processing
    • Ongoing maintenance: Serverless architecture for new submissions
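
The anonymize() and store_in_dataset() helpers above are only stubs. A minimal sketch of what they might look like, assuming ExifTool-style field names and pyarrow for the Parquet output format referenced later in this post (this is an illustration, not Immich's actual implementation):

    # Hypothetical anonymization and storage helpers for the pipeline sketch
    import pyarrow as pa
    import pyarrow.parquet as pq

    SENSITIVE_PREFIXES = ("GPS",)  # GPSLatitude, GPSLongitude, GPSPosition, ...
    SENSITIVE_FIELDS = {
        "Artist", "Copyright", "OwnerName", "CameraOwnerName",
        "SerialNumber", "LensSerialNumber", "BodySerialNumber",
    }

    def anonymize(exif: dict) -> dict:
        """Drop any field that could identify a person or a place."""
        return {
            key: value
            for key, value in exif.items()
            if key not in SENSITIVE_FIELDS
            and not key.startswith(SENSITIVE_PREFIXES)
        }

    def store_in_dataset(records: list[dict], out_path: str) -> None:
        """Write a batch of sanitized EXIF records to a Parquet file.

        Values are stringified so heterogeneous EXIF types fit one schema.
        """
        rows = [{k: str(v) for k, v in anonymize(r).items()} for r in records]
        pq.write_table(pa.Table.from_pylist(rows), out_path)

A production pipeline would likely prefer a reviewed allowlist of fields over a blocklist like this, since unknown vendor tags can also carry serial numbers or location data.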

Prerequisites for Technical Contribution

Data Preparation Requirements

Before contributing photos to the EXIF dataset, DevOps professionals should implement the following sanitization pipeline:

  1. Isolation Environment:
    # Create secure processing environment
    docker run -it --rm --name immich-exif-processing \
      -v $PWD/input:/input \
      -v $PWD/output:/output \
      ubuntu:22.04 bash
    
  2. Required Tools:
    • ExifTool 12.60+
    • jq 1.6+
    • ImageMagick 7.1+
  3. Sanitization Process:
# Sample sanitization script
for file in /input/*.{jpg,jpeg,heic}; do
  # Rebuild the file keeping EXIF only (drops XMP, IPTC, and maker blocks);
  # -overwrite_original avoids leaving unsanitized *_original backups behind
  exiftool -all= -tagsfromfile @ -exif:all -overwrite_original "$file"

  # Strip GPS coordinates
  exiftool -gps:all= -overwrite_original "$file"

  # Remove any remaining private tag groups
  exiftool -xmp:all= -iptc:all= -overwrite_original "$file"

  # Verify sanitization: any hit here means the file is NOT clean
  if exiftool "$file" | grep -qiE 'gps|location|creator|copyright'; then
    echo "WARNING: sensitive metadata still present in $file" >&2
    continue
  fi

  # Move the sanitized file to the output directory under a content-based name
  mv "$file" /output/$(sha256sum "$file" | cut -d' ' -f1).${file##*.}
done

Security Considerations

When contributing to public datasets:

  1. Network Isolation:
    • Process images on non-routable network segment
    • Use temporary cloud instances for processing
  2. Storage Encryption:
    • Encrypt at rest with LUKS or similar
    • Use temporary storage with secure erasure
  3. Content Verification (a manifest-generation sketch follows this list):
    • Implement SHA-256 checksum verification
    • Use GPG-signed manifests for contributed data
  4. Pre-Submission Checklist:
    • Confirm images contain no personally identifiable information
    • Verify no copyrighted material is included
    • Ensure proper CC BY 4.0 licensing is applied
    • Document data provenance and processing steps
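
For the checksum and manifest step, a small Python sketch; the manifest layout is illustrative rather than a format the Immich project mandates, though the path and checksum keys mirror the ones checked by the validation script later in this post, and the resulting file can be signed out of band with gpg --clearsign:

    # Build a SHA-256 manifest for a contribution directory
    import hashlib
    import json
    from pathlib import Path

    def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
        """Stream the file so large RAW images do not exhaust memory."""
        digest = hashlib.sha256()
        with path.open("rb") as fh:
            for chunk in iter(lambda: fh.read(chunk_size), b""):
                digest.update(chunk)
        return digest.hexdigest()

    def build_manifest(contribution_dir: str) -> dict:
        files = sorted(p for p in Path(contribution_dir).rglob("*") if p.is_file())
        return {"files": [{"path": str(p), "checksum": sha256_of(p)} for p in files]}

    if __name__ == "__main__":
        manifest = build_manifest("/dataset/contribution")
        Path("manifest.json").write_text(json.dumps(manifest, indent=2))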

Infrastructure Setup for Dataset Contribution

Secure Upload Environment

For enterprise-scale contributions, set up a dedicated submission environment:

  1. Network Architecture:
                           +---------------------+
                           |   Reverse Proxy     |
                           | (TLS Termination)   |
                           +----------+----------+
                                      |
                           +----------v----------+
                           |   Upload Gateway    |
                           | (Rate Limiting)     |
                           +----------+----------+
                                      |
                           +----------v----------+
                           |   Sanitization      |
                           |   Microservice      |
                           +----------+----------+
                                      |
                           +----------v----------+
                           |   Checksum          |
                           |   Verification      |
                           +----------+----------+
                                      |
                           +----------v----------+
                           |   Dataset Storage   |
                           | (Immutable)         |
                           +---------------------+
    
  2. Dockerized Processing:
     # docker-compose.yml for upload processing
     version: "3.8"

     services:
       uploader:
         image: nginx:1.23-alpine
         volumes:
           - ./uploads:/tmp/uploads
         ports:
           - "8080:80"

       processor:
         image: immich/exif-processor:latest
         environment:
           - SANITIZATION_LEVEL=strict
           - OUTPUT_FORMAT=parquet
         volumes:
           - ./uploads:/input
           - ./dataset:/output

       verifier:
         image: immich/checksum-validator:latest
         volumes:
           - ./dataset:/input
         depends_on:
           - processor

Contribution Workflow

For DevOps teams managing large photo archives:

  1. Enterprise Contribution Process (Mermaid flowchart):
     graph TB
         A[Identify Source Archives] --> B[Initial Metadata Scan]
         B --> C{Contains Sensitive Data?}
         C -->|Yes| D[Sanitization Pipeline]
         C -->|No| E[Checksum Generation]
         D --> E
         E --> F[Dataset Packaging]
         F --> G[Secure Transfer]
         G --> H[Immich Dataset Repository]

  2. Automated Contribution Script:
     #!/bin/bash
     # Automated EXIF dataset contribution script

     SOURCE_DIR="/archive/photos"
     OUTPUT_DIR="/dataset/contribution"
     TEMP_DIR=$(mktemp -d)

     # Process RAW files
     find "$SOURCE_DIR" -type f \( -iname '*.cr2' -o -iname '*.nef' \) -print0 |
     while IFS= read -r -d $'\0' file; do
       # Convert to DNG as a standard container format
       docker run --rm -v "$(dirname "$file")":/input -v "$TEMP_DIR":/output \
         adobe/dng_converter -d /output /input/"$(basename "$file")"

       # Extract metadata from the converted copy
       dng="$TEMP_DIR/$(basename "${file%.*}").dng"
       exiftool -json "$dng" > "$OUTPUT_DIR/${file##*/}.json"
     done

     # Generate manifest
     jq -n '{ files: [ inputs ] }' "$OUTPUT_DIR"/*.json > manifest.json

     # Create checksum
     sha256sum "$OUTPUT_DIR"/* | gpg --clearsign > manifest.sha256

     # Upload package
     curl -X POST -H "Authorization: Bearer $API_TOKEN" \
       -F "manifest=@manifest.json" \
       -F "data=@$OUTPUT_DIR/*" \
       https://datasets.immich.app/upload

Configuration and Optimization

Dataset Processing Optimization

For large-scale contributions, optimize processing with:

  1. Parallel EXIF Extraction:
     # GNU Parallel processing example
     find /photo_archive -type f -name '*.jpg' | parallel -j 8 \
       'exiftool -j {} > /output/{/.}.json'

  2. Distributed Processing:
     # Kubernetes Job configuration for EXIF processing
     apiVersion: batch/v1
     kind: Job
     metadata:
       name: exif-processor
     spec:
       parallelism: 20
       completions: 1000
       template:
         spec:
           containers:
             - name: processor
               image: exiftool:latest
               command: ["/process.sh"]
               volumeMounts:
                 - name: photos
                   mountPath: /input
                 - name: output
                   mountPath: /output
           restartPolicy: Never
           volumes:
             - name: photos
               persistentVolumeClaim:
                 claimName: photo-archive
             - name: output
               persistentVolumeClaim:
                 claimName: dataset-output

Security Hardening

For dataset contributions containing potentially sensitive metadata:

  1. Metadata Firewall Rules:
    # iptables rules for processing network
    iptables -A OUTPUT -p tcp --dport 443 -d datasets.immich.app -j ACCEPT
    iptables -A OUTPUT -p tcp --dport 80 -j DROP
    iptables -A OUTPUT -p tcp --dport 443 -j DROP
    
  2. Mandatory Access Controls:
     # AppArmor profile for EXIF processing
     #include <tunables/global>

     profile immich-exif flags=(attach_disconnected) {
       # Allow read-only access to input
       /input/** r,

       # Allow writing to output
       /output/** rw,

       # Block network access
       deny network,
     }

Operational Best Practices

Dataset Maintenance Procedures

For teams managing ongoing contributions:

  1. Validation Workflow:
# Dataset validation script
import json
import hashlib

def validate_contribution(manifest_path):
    with open(manifest_path) as f:
        manifest = json.load(f)
    
    for entry in manifest['files']:
        # Verify file integrity
        with open(entry['path'], 'rb') as f:
            sha = hashlib.sha256(f.read()).hexdigest()
            if sha != entry['checksum']:
                raise ValueError(f"Checksum mismatch for {entry['path']}")
        
        # Validate metadata completeness
        required_fields = ['Make', 'Model', 'DateTimeOriginal']
        for field in required_fields:
            if field not in entry['exif']:
                print(f"Warning: Missing {field} in {entry['path']}")
  2. Automated Quality Checks:
     # Quality control check: exit non-zero if a required tag is missing
     exiftool -csv -Make -Model -DateTimeOriginal -r /dataset | awk -F, '
     NR == 1 {
       for (i = 1; i <= NF; i++) col[$i] = i
       next
     }
     {
       if ($col["Make"] == "" || $col["Model"] == "" || $col["DateTimeOriginal"] == "") {
         print $col["SourceFile"], "MISSING DATA" > "/dev/stderr"
         exit 1
       }
     }'

Backup Strategy

For dataset curators:

  1. 3-2-1 Backup Methodology:
    • 3 copies: Primary + two backups
    • 2 media types: Object storage + tape
    • 1 offsite: Geographical separation
  2. Immutable Backups:
     # MinIO bucket policy for dataset storage
     {
       "Version": "2012-10-17",
       "Statement": [
         {
           "Effect": "Deny",
           "Action": ["s3:DeleteObject"],
           "Principal": "*",
           "Resource": ["arn:aws:s3:::immich-dataset/*"],
           "Condition": {
             "DateLessThan": {"aws:CurrentTime": "2026-01-01T00:00:00Z"}
           }
         }
       ]
     }
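
Where the object store supports it, S3 Object Lock (available in MinIO when the bucket is created with locking enabled) is a stronger complement to the deny policy above, because retention is enforced per object rather than by a policy that an administrator could later edit. A hedged boto3 sketch; the endpoint, bucket, and archive names are placeholders:

    # Upload a dataset archive with a compliance-mode retention date
    # (the bucket must have been created with object lock enabled).
    from datetime import datetime, timezone

    import boto3

    s3 = boto3.client(
        "s3",
        endpoint_url="https://minio.example.internal:9000",  # placeholder endpoint
    )

    with open("dataset-2024-06.tar.zst", "rb") as fh:  # placeholder archive name
        s3.put_object(
            Bucket="immich-dataset",
            Key="backups/dataset-2024-06.tar.zst",
            Body=fh,
            ChecksumAlgorithm="SHA256",  # an integrity checksum is required with object-lock retention
            ObjectLockMode="COMPLIANCE",
            ObjectLockRetainUntilDate=datetime(2026, 1, 1, tzinfo=timezone.utc),
        )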
    

Troubleshooting Common Issues

Dataset Contribution Challenges

  1. Metadata Persistence After Sanitization:
     # Verify complete removal of GPS data: any output here means GPS tags remain
     exiftool -gps:all -r /processed_files | grep -i "gps"
    
  2. Invalid File Formats:
     # Batch-convert HEIC to JPEG (ImageMagick with HEIC support) and carry
     # the original EXIF over to the converted file
     for f in *.heic; do
       mogrify -format jpg "$f"
       exiftool -tagsFromFile "$f" -all:all "${f%.heic}.jpg"
     done
    
  3. Corrupted Metadata:
    # Repair broken EXIF structures
    exiftool -all= -tagsfromfile @ -all:all -unsafe bad_image.jpg
    

Performance Tuning

For large dataset processing:

  1. I/O Optimization:
    # Mount options for processing storage
    mount -o noatime,nodiratime,data=writeback /dev/sdb /dataset
    
  2. Memory Caching:
    # Use vmtouch to cache metadata files
    vmtouch -t -m 16G /dataset/metadata/
    

Conclusion

The Immich EXIF dataset initiative represents a critical infrastructure project for the future of self-hosted photo management. As DevOps professionals, we have both the technical expertise and the operational discipline to contribute high-quality, secure metadata that will benefit the entire Immich ecosystem. Contributing sanitized metadata today is an investment in the accuracy and robustness of our own deployments tomorrow.

This post is licensed under CC BY 4.0 by the author.