Operations

Production operations guide covering backups, monitoring, TLS, garbage collection, and maintenance.

Backups

How Backups Work

Ebla's backup system captures the server's metadata — the database that records users, devices, libraries, and commit history. File content blocks are handled separately, as described below.

Block Storage

Blocks are not included in backups. For disaster recovery, ensure your block storage backend (S3, GCS, filesystem) has its own backup/replication strategy.

Automatic Backups

Enable scheduled backups in server.toml:

[backup]
enabled = true           # Enable automatic backups
interval = "24h"         # Backup frequency
retention = 7            # Keep N most recent backups
path = "/var/lib/ebla/backups"
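The retention policy above (keep the N most recent backups) can be sketched as a simple prune step. This is an illustration of the policy, not Ebla's actual implementation; the helper name is hypothetical:

```python
# Sketch of the retention policy: keep the N most recent backups,
# delete the rest. Backup IDs embed a sortable timestamp
# (e.g. "backup_20260115_103000"), so lexical order is chronological.
def prune_backups(backup_ids, retention=7):
    """Return (kept, deleted) lists given all backup IDs."""
    ordered = sorted(backup_ids, reverse=True)  # newest first
    return ordered[:retention], ordered[retention:]

kept, deleted = prune_backups(
    [f"backup_2026011{d}_103000" for d in range(9)], retention=7
)
print(len(kept), len(deleted))  # 7 2
```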

Manual Backup

# Create backup via API
curl -X POST http://server:6333/api/v1/admin/backup/create \
  -H "Authorization: Bearer $ADMIN_TOKEN"

# Response
{
  "id": "backup_20260115_103000",
  "status": "completed",
  "size_bytes": 15728640,
  "duration_ms": 2500,
  "created_at": "2026-01-15T10:30:00Z"
}

List Backups

curl http://server:6333/api/v1/admin/backup/list \
  -H "Authorization: Bearer $ADMIN_TOKEN"

# Response
{
  "backups": [
    {
      "id": "backup_20260115_103000",
      "size_bytes": 15728640,
      "created_at": "2026-01-15T10:30:00Z",
      "status": "completed"
    },
    {
      "id": "backup_20260114_103000",
      "size_bytes": 14680064,
      "created_at": "2026-01-14T10:30:00Z",
      "status": "completed"
    }
  ]
}

Restore

# Dry run first (verify backup integrity)
curl -X POST "http://server:6333/api/v1/admin/backup/{id}/restore?dry_run=true" \
  -H "Authorization: Bearer $ADMIN_TOKEN"

# Actual restore (destructive!)
curl -X POST "http://server:6333/api/v1/admin/backup/{id}/restore" \
  -H "Authorization: Bearer $ADMIN_TOKEN"

Restore Warnings
  • Restore is destructive: it replaces all current data
  • Stop all clients before restoring
  • Ensure block storage is intact and accessible

Admin UI

Backups can also be managed from the Admin UI at /admin/backups.

Monitoring

Health Endpoint

curl http://server:6333/health

{
  "status": "healthy",
  "timestamp": "2026-01-15T10:30:00Z",
  "version": "0.56.0",
  "checks": {
    "database": "ok",
    "storage": "ok",
    "gc": "enabled"
  },
  "total_users": 25,
  "active_users": 12,
  "total_devices": 48,
  "online_devices": 18,
  "total_libraries": 35,
  "total_commits": 12456,
  "recent_commits": 156
}
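A monitoring script can poll /health and flag degradation. Here is a minimal sketch: the field names come from the response above, while the function itself and the accepted check states are illustrative assumptions:

```python
import json

def evaluate_health(payload):
    """Return a list of alert strings for a /health JSON response.

    Assumes check values like "ok" and "enabled" (as in the sample
    response) mean healthy; anything else is flagged.
    """
    h = json.loads(payload)
    alerts = []
    if h.get("status") != "healthy":
        alerts.append(f"server status: {h.get('status')}")
    for check, state in h.get("checks", {}).items():
        if state not in ("ok", "enabled"):
            alerts.append(f"check {check}: {state}")
    return alerts

sample = '{"status": "healthy", "checks": {"database": "ok", "storage": "error"}}'
print(evaluate_health(sample))  # ['check storage: error']
```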

Library Health

curl -H "Authorization: Bearer $TOKEN" \
  http://server:6333/api/v1/libraries/{id}/health

{
  "library_id": "lib_abc123",
  "library_name": "Documents",
  "status": "healthy",
  "head_commit_id": "commit_xyz",
  "total_commits": 156,
  "recent_commits": 12,
  "unresolved_conflicts": 0,
  "syncing_devices": 3,
  "up_to_date_devices": 5,
  "behind_devices": 0
}

Device Health

curl -H "Authorization: Bearer $TOKEN" \
  http://server:6333/api/v1/devices/{id}/health

{
  "device_id": "dev_123",
  "device_name": "MacBook Pro",
  "status": "healthy",
  "is_online": true,
  "last_seen": "2026-01-15T10:25:00Z",
  "synced_libraries": 3,
  "p2p_enabled": true,
  "recent_errors": 0
}

Server Diagnostics

ebla-server doctor --config /etc/ebla/server.toml

Ebla Server Doctor
==================
Config: /etc/ebla/server.toml

✓ Config: configuration file valid
✓ Database: database connection successful (PostgreSQL 16.1)
✓ Storage: filesystem storage accessible and writable
✓ Garbage Collection: GC enabled (interval: 24h, min_age: 24h)
✓ Backup: backup enabled (interval: 24h, retention: 7)
✓ Knowledge Layer: OpenAI configured (model: text-embedding-3-small)
✓ TLS: TLS enabled (cert expires: 2026-12-31)

Overall Status: ok
Timestamp: 2026-01-15T10:30:00Z

Prometheus Metrics

Expose Prometheus metrics:

[metrics]
enabled = true
path = "/metrics"    # Endpoint path
port = 9090          # Separate port (optional)

Metrics are served in the standard Prometheus text exposition format at the configured path.
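With the separate metrics port enabled, a Prometheus scrape job might look like the following. The job name and target hostname are examples, not required values:

```yaml
# prometheus.yml (excerpt)
scrape_configs:
  - job_name: "ebla"
    metrics_path: /metrics
    static_configs:
      - targets: ["server:9090"]   # the [metrics] port from server.toml
```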

TLS Configuration

Option 1: User-Provided Certificates

[server]
tls_cert = "/etc/ebla/cert.pem"
tls_key = "/etc/ebla/key.pem"

Certificates and keys must be PEM-encoded, matching the .pem paths above, and the key must correspond to the certificate.

Option 2: Reverse Proxy with Caddy

Let Caddy handle TLS with automatic Let's Encrypt:

# docker-compose.yml
services:
  caddy:
    image: caddy:2-alpine
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./Caddyfile:/etc/caddy/Caddyfile
      - caddy_data:/data
    depends_on:
      - ebla-server

  ebla-server:
    image: ghcr.io/ebla-io/ebla-server:latest
    # No port exposure - Caddy proxies

volumes:
  caddy_data:

# Caddyfile
files.example.com {
    reverse_proxy ebla-server:6333
}

Option 3: Nginx Reverse Proxy

# /etc/nginx/sites-available/ebla
server {
    listen 443 ssl http2;
    server_name files.example.com;

    ssl_certificate /etc/letsencrypt/live/files.example.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/files.example.com/privkey.pem;

    location / {
        proxy_pass http://127.0.0.1:6333;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;

        # WebSocket support
        proxy_read_timeout 86400;
    }

    # Large file uploads
    client_max_body_size 100M;
}

Garbage Collection

How GC Works

Garbage collection removes orphaned blocks:

  1. Scan all commits for referenced blocks
  2. Identify blocks not referenced by any commit
  3. Wait for minimum age (prevent race with uploads)
  4. Delete orphaned blocks
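The steps above amount to a mark-and-sweep with an age guard. A minimal sketch, using hypothetical data structures (Ebla's internals may differ):

```python
import time

def find_orphans(commit_blocks, all_blocks, min_age_secs, now=None):
    """Mark-and-sweep: blocks unreferenced by any commit AND older
    than min_age_secs are deletion candidates.

    commit_blocks: dict of commit_id -> set of block IDs it references
    all_blocks:    dict of block_id -> upload timestamp (epoch seconds)
    """
    now = now if now is not None else time.time()
    referenced = set().union(*commit_blocks.values()) if commit_blocks else set()
    return {
        block_id
        for block_id, uploaded_at in all_blocks.items()
        if block_id not in referenced and now - uploaded_at >= min_age_secs
    }

orphans = find_orphans(
    {"c1": {"b1", "b2"}},
    {"b1": 0, "b2": 0, "b3": 0, "b4": 90_000},  # b4 was uploaded recently
    min_age_secs=86_400,                         # 24h, as in the config below
    now=100_000,
)
print(sorted(orphans))  # ['b3'] -- b4 is unreferenced but too young
```

The age guard is what prevents deleting a block that a client has uploaded but not yet referenced in a finalized commit.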

Configuration

[gc]
enabled = true       # Enable automatic GC
interval = "24h"     # Run every 24 hours
min_age = "24h"      # Only delete blocks older than 24h

Manual GC

# Dry run (report what would be deleted)
curl -X POST "http://server:6333/api/v1/admin/gc/run?dry_run=true" \
  -H "Authorization: Bearer $ADMIN_TOKEN"

{
  "dry_run": true,
  "orphan_blocks": 156,
  "orphan_size_bytes": 524288000,
  "would_delete": 156
}

# Actual GC run
curl -X POST http://server:6333/api/v1/admin/gc/run \
  -H "Authorization: Bearer $ADMIN_TOKEN"

{
  "dry_run": false,
  "deleted_blocks": 156,
  "freed_bytes": 524288000,
  "duration_ms": 12500
}

GC Status

curl http://server:6333/api/v1/admin/gc/status \
  -H "Authorization: Bearer $ADMIN_TOKEN"

{
  "enabled": true,
  "interval": "24h",
  "min_age": "24h",
  "last_run": "2026-01-15T02:00:00Z",
  "last_run_duration_ms": 15000,
  "last_run_deleted": 89,
  "last_run_freed_bytes": 356515840,
  "next_run": "2026-01-16T02:00:00Z"
}

Admin UI

GC controls are also available in the Admin UI at /admin/maintenance.

Tiered Storage

Overview

Tiered storage optimizes cost and performance by moving blocks between storage tiers:

Tier     Purpose                            Example Backend
Hot      Write buffer, frequently accessed  NVMe filesystem
Warm     Primary durable storage            SSD, S3 Standard
Cold     Infrequent access                  HDD, S3 IA
Archive  Long-term retention                S3 Glacier

Watermarks

Each tier defines soft, hard, and critical watermarks (percent of capacity) that control when blocks flush to the next tier.
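The three watermarks in the tier configuration map naturally to escalating actions. The sketch below uses the hot-tier thresholds from the example config; the action names are illustrative, drawn from common tiering designs rather than confirmed Ebla behavior:

```python
def watermark_action(used_pct, soft=70, hard=80, critical=90):
    """Map tier fullness (percent) to a flush action.

    Thresholds default to the hot-tier example config;
    the action names are illustrative.
    """
    if used_pct >= critical:
        return "reject_writes"       # tier effectively full
    if used_pct >= hard:
        return "flush_aggressively"  # drain until back under soft
    if used_pct >= soft:
        return "flush_background"    # opportunistic flush to next tier
    return "none"

print([watermark_action(p) for p in (50, 75, 85, 95)])
# ['none', 'flush_background', 'flush_aggressively', 'reject_writes']
```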

Configuration

[tiered_storage]
enabled = true

[tiered_storage.commit]
finalize_requires_durable = true
staging_ttl = "1h"

[[tiered_storage.tiers]]
id = "hot"
order = 0
backend = "filesystem"
capacity = "50GB"
durable = false
flush_to = ["warm"]
soft_watermark = 70
hard_watermark = 80
critical_watermark = 90

[tiered_storage.tiers.filesystem]
path = "/mnt/nvme/ebla/hot"

[[tiered_storage.tiers]]
id = "warm"
order = 1
backend = "s3"
durable = true
flush_to = ["cold"]

[tiered_storage.tiers.s3]
bucket = "ebla-warm"
region = "us-east-1"

Two-Phase Commits

With tiered storage, commits use two phases:

  1. Staging: Commit created, blocks in hot tier
  2. Finalization: Blocks reach durable tier, commit becomes HEAD

Clients receive WebSocket notification when commits finalize.
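The two phases can be modeled as a small state machine: a commit stays staged until every block it references reaches a durable tier. This is a conceptual sketch; the class and method names are hypothetical:

```python
class Commit:
    """Two-phase commit sketch: 'staging' until all referenced
    blocks are durable, then 'finalized' (eligible to become HEAD)."""

    def __init__(self, block_ids):
        self.block_ids = set(block_ids)
        self.durable = set()
        self.state = "staging"

    def mark_durable(self, block_id):
        """Record that a block reached a durable tier; return new state."""
        self.durable.add(block_id)
        if self.block_ids <= self.durable:
            self.state = "finalized"
        return self.state

c = Commit({"b1", "b2"})
c.mark_durable("b1")
print(c.state)               # staging
print(c.mark_durable("b2"))  # finalized
```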

Admin Dashboard

Tier status and controls are available in the Admin UI at /admin/tiers.

Scaling

Large Libraries

For libraries with 100K+ files, enable materialized views:

[browse]
materialization_enabled = true
materialization_strategy = "incremental"
file_threshold = 10000        # Auto-enable above this count
refresh_interval = "30s"
worker_count = 2

Horizontal Scaling

For high availability:

  1. Load Balancer: HAProxy, nginx, or cloud LB
  2. Stateless Servers: Multiple Ebla server instances
  3. PostgreSQL: Primary + read replicas or managed service
  4. Object Storage: S3/GCS (inherently distributed)
  5. WebSocket Affinity: Sticky sessions or Redis pub/sub

# HAProxy example
frontend ebla
    bind *:443 ssl crt /etc/ssl/ebla.pem
    default_backend ebla_servers

backend ebla_servers
    balance roundrobin
    option httpchk GET /health
    http-check expect status 200

    # Sticky sessions for WebSocket
    stick-table type ip size 100k expire 30m
    stick on src

    server ebla1 10.0.1.10:6333 check
    server ebla2 10.0.1.11:6333 check
    server ebla3 10.0.1.12:6333 check

Logging

Configuration

[logging]
level = "info"        # debug, info, warn, error
format = "json"       # json or text
output = "stdout"     # stdout or file path

Log Levels

Level  Use
debug  Verbose, all operations (development only)
info   Normal operations, requests, events
warn   Recoverable issues, degraded performance
error  Failures requiring attention

Structured Logging

JSON format for log aggregation:

{
  "level": "info",
  "ts": "2026-01-15T10:30:00.000Z",
  "msg": "request completed",
  "method": "POST",
  "path": "/api/v1/blocks/upload",
  "status": 201,
  "duration_ms": 45,
  "user_id": "usr_abc123",
  "device_id": "dev_xyz"
}
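One benefit of JSON logs is easy post-processing. For example, extracting slow requests from a log stream — field names come from the sample entry above, while the function and threshold are illustrative:

```python
import json

def slow_requests(lines, threshold_ms=1000):
    """Yield (path, duration_ms) for request logs over the threshold."""
    for line in lines:
        entry = json.loads(line)
        if (entry.get("msg") == "request completed"
                and entry.get("duration_ms", 0) > threshold_ms):
            yield entry["path"], entry["duration_ms"]

logs = [
    '{"msg": "request completed", "path": "/api/v1/blocks/upload", "duration_ms": 45}',
    '{"msg": "request completed", "path": "/api/v1/admin/gc/run", "duration_ms": 12500}',
]
print(list(slow_requests(logs)))  # [('/api/v1/admin/gc/run', 12500)]
```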

Security Checklist

Before Production

  • Enable TLS, either directly or behind a reverse proxy
  • Protect admin endpoints with a strong admin token
  • Enable automatic backups and verify a restore with dry_run=true
  • Run ebla-server doctor to validate the configuration

Ongoing

  • Monitor the /health endpoint and Prometheus metrics
  • Review backup status and test restores periodically
  • Watch TLS certificate expiry (reported in the doctor output)
  • Review warn and error log entries