Operations

Production operations guide covering backups, monitoring, TLS, garbage collection, and maintenance.

Backups

How Backups Work

Ebla's backup system captures the server's metadata — the database that records users, devices, libraries, and commit history. File content blocks are handled separately, as described below.

Block Storage

Blocks are not included in backups. For disaster recovery, ensure your block storage backend (S3, GCS, filesystem) has its own backup/replication strategy.

Automatic Backups

Enable scheduled backups in server.toml:

[backup]
enabled = true           # Enable automatic backups
interval = "24h"         # Backup frequency
retention = 7            # Keep N most recent backups
path = "/var/lib/ebla/backups"
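The retention policy above (keep the N most recent backups) can be sketched as a simple prune step. This is an illustration of the policy, not Ebla's actual implementation; the helper name is hypothetical:

```python
# Sketch of the retention policy: keep the N most recent backups,
# delete the rest. Backup IDs embed a sortable timestamp
# (e.g. "backup_20260115_103000"), so lexical order is chronological.
def prune_backups(backup_ids, retention=7):
    """Return (kept, deleted) lists given all backup IDs."""
    ordered = sorted(backup_ids, reverse=True)  # newest first
    return ordered[:retention], ordered[retention:]

kept, deleted = prune_backups(
    [f"backup_2026011{d}_103000" for d in range(9)], retention=7
)
print(len(kept), len(deleted))  # 7 2
```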

Manual Backup

# Create backup via API
curl -X POST http://server:6333/api/v1/admin/backup/create \
  -H "Authorization: Bearer $ADMIN_TOKEN"

# Response
{
  "id": "backup_20260115_103000",
  "status": "completed",
  "size_bytes": 15728640,
  "duration_ms": 2500,
  "created_at": "2026-01-15T10:30:00Z"
}

List Backups

curl http://server:6333/api/v1/admin/backup/list \
  -H "Authorization: Bearer $ADMIN_TOKEN"

# Response
{
  "backups": [
    {
      "id": "backup_20260115_103000",
      "size_bytes": 15728640,
      "created_at": "2026-01-15T10:30:00Z",
      "status": "completed"
    },
    {
      "id": "backup_20260114_103000",
      "size_bytes": 14680064,
      "created_at": "2026-01-14T10:30:00Z",
      "status": "completed"
    }
  ]
}

Restore

# Dry run first (verify backup integrity)
curl -X POST "http://server:6333/api/v1/admin/backup/{id}/restore?dry_run=true" \
  -H "Authorization: Bearer $ADMIN_TOKEN"

# Actual restore (destructive!)
curl -X POST "http://server:6333/api/v1/admin/backup/{id}/restore" \
  -H "Authorization: Bearer $ADMIN_TOKEN"

Restore Warnings
  • Restore is destructive: it replaces all current data
  • Stop all clients before restoring
  • Ensure block storage is intact and accessible

Admin UI

Backups can also be managed from the Admin UI at /admin/backups.

Monitoring

Health Endpoint

curl http://server:6333/health

{
  "status": "healthy",
  "timestamp": "2026-01-15T10:30:00Z",
  "version": "0.56.0",
  "checks": {
    "database": "ok",
    "storage": "ok",
    "gc": "enabled"
  },
  "total_users": 25,
  "active_users": 12,
  "total_devices": 48,
  "online_devices": 18,
  "total_libraries": 35,
  "total_commits": 12456,
  "recent_commits": 156
}
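A monitoring script can poll /health and flag degradation. Here is a minimal sketch: the field names come from the response above, while the function itself and the accepted check states are illustrative assumptions:

```python
import json

def evaluate_health(payload):
    """Return a list of alert strings for a /health JSON response.

    Assumes check values like "ok" and "enabled" (as in the sample
    response) mean healthy; anything else is flagged.
    """
    h = json.loads(payload)
    alerts = []
    if h.get("status") != "healthy":
        alerts.append(f"server status: {h.get('status')}")
    for check, state in h.get("checks", {}).items():
        if state not in ("ok", "enabled"):
            alerts.append(f"check {check}: {state}")
    return alerts

sample = '{"status": "healthy", "checks": {"database": "ok", "storage": "error"}}'
print(evaluate_health(sample))  # ['check storage: error']
```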

Library Health

curl -H "Authorization: Bearer $TOKEN" \
  http://server:6333/api/v1/libraries/{id}/health

{
  "library_id": "lib_abc123",
  "library_name": "Documents",
  "status": "healthy",
  "head_commit_id": "commit_xyz",
  "total_commits": 156,
  "recent_commits": 12,
  "unresolved_conflicts": 0,
  "syncing_devices": 3,
  "up_to_date_devices": 5,
  "behind_devices": 0
}

Device Health

curl -H "Authorization: Bearer $TOKEN" \
  http://server:6333/api/v1/devices/{id}/health

{
  "device_id": "dev_123",
  "device_name": "MacBook Pro",
  "status": "healthy",
  "is_online": true,
  "last_seen": "2026-01-15T10:25:00Z",
  "synced_libraries": 3,
  "p2p_enabled": true,
  "recent_errors": 0
}

Server Diagnostics

ebla-server doctor --config /etc/ebla/server.toml

Ebla Server Doctor
==================
Config: /etc/ebla/server.toml

✓ Config: configuration file valid
✓ Database: database connection successful (PostgreSQL 16.1)
✓ Storage: filesystem storage accessible and writable
✓ Garbage Collection: GC enabled (interval: 24h, min_age: 24h)
✓ Backup: backup enabled (interval: 24h, retention: 7)
✓ Knowledge Layer: OpenAI configured (model: text-embedding-3-small)
✓ TLS: TLS enabled (cert expires: 2026-12-31)

Overall Status: ok
Timestamp: 2026-01-15T10:30:00Z

Prometheus Metrics

Expose Prometheus metrics:

[metrics]
enabled = true
path = "/metrics"    # Endpoint path
port = 9090          # Separate port (optional)

Metrics are served in the standard Prometheus text exposition format at the configured path.
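With the separate metrics port enabled, a Prometheus scrape job might look like the following. The job name and target hostname are examples, not required values:

```yaml
# prometheus.yml (excerpt)
scrape_configs:
  - job_name: "ebla"
    metrics_path: /metrics
    static_configs:
      - targets: ["server:9090"]   # the [metrics] port from server.toml
```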

TLS Configuration

Option 1: User-Provided Certificates

[server]
tls_cert = "/etc/ebla/cert.pem"
tls_key = "/etc/ebla/key.pem"

Certificates and keys must be PEM-encoded, matching the .pem paths above, and the key must correspond to the certificate.

Option 2: Reverse Proxy with Caddy

Let Caddy handle TLS with automatic Let's Encrypt:

# docker-compose.yml
services:
  caddy:
    image: caddy:2-alpine
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./Caddyfile:/etc/caddy/Caddyfile
      - caddy_data:/data
    depends_on:
      - ebla-server

  ebla-server:
    image: ghcr.io/ebla-io/ebla-server:latest
    # No port exposure - Caddy proxies

volumes:
  caddy_data:

# Caddyfile
files.example.com {
    reverse_proxy ebla-server:6333
}

Option 3: Nginx Reverse Proxy

# /etc/nginx/sites-available/ebla
server {
    listen 443 ssl http2;
    server_name files.example.com;

    ssl_certificate /etc/letsencrypt/live/files.example.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/files.example.com/privkey.pem;

    location / {
        proxy_pass http://127.0.0.1:6333;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;

        # WebSocket support
        proxy_read_timeout 86400;
    }

    # Large file uploads
    client_max_body_size 100M;
}

Garbage Collection

How GC Works

Garbage collection removes orphaned blocks:

  1. Scan all commits for referenced blocks
  2. Identify blocks not referenced by any commit
  3. Wait for minimum age (prevent race with uploads)
  4. Delete orphaned blocks
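The steps above amount to a mark-and-sweep with an age guard. A minimal sketch, using hypothetical data structures (Ebla's internals may differ):

```python
import time

def find_orphans(commit_blocks, all_blocks, min_age_secs, now=None):
    """Mark-and-sweep: blocks unreferenced by any commit AND older
    than min_age_secs are deletion candidates.

    commit_blocks: dict of commit_id -> set of block IDs it references
    all_blocks:    dict of block_id -> upload timestamp (epoch seconds)
    """
    now = now if now is not None else time.time()
    referenced = set().union(*commit_blocks.values()) if commit_blocks else set()
    return {
        block_id
        for block_id, uploaded_at in all_blocks.items()
        if block_id not in referenced and now - uploaded_at >= min_age_secs
    }

orphans = find_orphans(
    {"c1": {"b1", "b2"}},
    {"b1": 0, "b2": 0, "b3": 0, "b4": 90_000},  # b4 was uploaded recently
    min_age_secs=86_400,                         # 24h, as in the config below
    now=100_000,
)
print(sorted(orphans))  # ['b3'] -- b4 is unreferenced but too young
```

The age guard is what prevents deleting a block that a client has uploaded but not yet referenced in a finalized commit.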

Configuration

[gc]
enabled = true       # Enable automatic GC
interval = "24h"     # Run every 24 hours
min_age = "24h"      # Only delete blocks older than 24h

Manual GC

# Dry run (report what would be deleted)
curl -X POST "http://server:6333/api/v1/admin/gc/run?dry_run=true" \
  -H "Authorization: Bearer $ADMIN_TOKEN"

{
  "dry_run": true,
  "orphan_blocks": 156,
  "orphan_size_bytes": 524288000,
  "would_delete": 156
}

# Actual GC run
curl -X POST http://server:6333/api/v1/admin/gc/run \
  -H "Authorization: Bearer $ADMIN_TOKEN"

{
  "dry_run": false,
  "deleted_blocks": 156,
  "freed_bytes": 524288000,
  "duration_ms": 12500
}

GC Status

curl http://server:6333/api/v1/admin/gc/status \
  -H "Authorization: Bearer $ADMIN_TOKEN"

{
  "enabled": true,
  "interval": "24h",
  "min_age": "24h",
  "last_run": "2026-01-15T02:00:00Z",
  "last_run_duration_ms": 15000,
  "last_run_deleted": 89,
  "last_run_freed_bytes": 356515840,
  "next_run": "2026-01-16T02:00:00Z"
}

Admin UI

GC controls are also available in the Admin UI at /admin/maintenance.

Tiered Storage

Overview

Tiered storage optimizes cost and performance by moving blocks between storage tiers:

Tier     Purpose                            Example Backend
Hot      Write buffer, frequently accessed  NVMe filesystem
Warm     Primary durable storage            SSD, S3 Standard
Cold     Infrequent access                  HDD, S3 IA
Archive  Long-term retention                S3 Glacier

Watermarks

Each tier defines soft, hard, and critical watermarks (percent of capacity) that control when blocks flush to the next tier.
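The three watermarks in the tier configuration map naturally to escalating actions. The sketch below uses the hot-tier thresholds from the example config; the action names are illustrative, drawn from common tiering designs rather than confirmed Ebla behavior:

```python
def watermark_action(used_pct, soft=70, hard=80, critical=90):
    """Map tier fullness (percent) to a flush action.

    Thresholds default to the hot-tier example config;
    the action names are illustrative.
    """
    if used_pct >= critical:
        return "reject_writes"       # tier effectively full
    if used_pct >= hard:
        return "flush_aggressively"  # drain until back under soft
    if used_pct >= soft:
        return "flush_background"    # opportunistic flush to next tier
    return "none"

print([watermark_action(p) for p in (50, 75, 85, 95)])
# ['none', 'flush_background', 'flush_aggressively', 'reject_writes']
```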

Configuration

[tiered_storage]
enabled = true

[tiered_storage.commit]
finalize_requires_durable = true
staging_ttl = "1h"

[[tiered_storage.tiers]]
id = "hot"
order = 0
backend = "filesystem"
capacity = "50GB"
durable = false
flush_to = ["warm"]
soft_watermark = 70
hard_watermark = 80
critical_watermark = 90

[tiered_storage.tiers.filesystem]
path = "/mnt/nvme/ebla/hot"

[[tiered_storage.tiers]]
id = "warm"
order = 1
backend = "s3"
durable = true
flush_to = ["cold"]

[tiered_storage.tiers.s3]
bucket = "ebla-warm"
region = "us-east-1"

Two-Phase Commits

With tiered storage, commits use two phases:

  1. Staging: Commit created, blocks in hot tier
  2. Finalization: Blocks reach durable tier, commit becomes HEAD

Clients receive WebSocket notification when commits finalize.
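The two phases can be modeled as a small state machine: a commit stays staged until every block it references reaches a durable tier. This is a conceptual sketch; the class and method names are hypothetical:

```python
class Commit:
    """Two-phase commit sketch: 'staging' until all referenced
    blocks are durable, then 'finalized' (eligible to become HEAD)."""

    def __init__(self, block_ids):
        self.block_ids = set(block_ids)
        self.durable = set()
        self.state = "staging"

    def mark_durable(self, block_id):
        """Record that a block reached a durable tier; return new state."""
        self.durable.add(block_id)
        if self.block_ids <= self.durable:
            self.state = "finalized"
        return self.state

c = Commit({"b1", "b2"})
c.mark_durable("b1")
print(c.state)               # staging
print(c.mark_durable("b2"))  # finalized
```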

Admin Dashboard

Tier status and controls are available in the Admin UI at /admin/tiers.

Scaling

Large Libraries

For libraries with 100K+ files, enable materialized views:

[browse]
materialization_enabled = true
materialization_strategy = "incremental"
file_threshold = 10000        # Auto-enable above this count
refresh_interval = "30s"
worker_count = 2

Horizontal Scaling

For high availability:

  1. Load Balancer: HAProxy, nginx, or cloud LB
  2. Stateless Servers: Multiple Ebla server instances
  3. PostgreSQL: Primary + read replicas or managed service
  4. Object Storage: S3/GCS (inherently distributed)
  5. WebSocket Affinity: Sticky sessions or Redis pub/sub

# HAProxy example
frontend ebla
    bind *:443 ssl crt /etc/ssl/ebla.pem
    default_backend ebla_servers

backend ebla_servers
    balance roundrobin
    option httpchk GET /health
    http-check expect status 200

    # Sticky sessions for WebSocket
    stick-table type ip size 100k expire 30m
    stick on src

    server ebla1 10.0.1.10:6333 check
    server ebla2 10.0.1.11:6333 check
    server ebla3 10.0.1.12:6333 check

Logging

Configuration

[logging]
level = "info"        # debug, info, warn, error
format = "json"       # json or text
output = "stdout"     # stdout or file path

Log Levels

Level  Use
debug  Verbose, all operations (development only)
info   Normal operations, requests, events
warn   Recoverable issues, degraded performance
error  Failures requiring attention

Structured Logging

JSON format for log aggregation:

{
  "level": "info",
  "ts": "2026-01-15T10:30:00.000Z",
  "msg": "request completed",
  "method": "POST",
  "path": "/api/v1/blocks/upload",
  "status": 201,
  "duration_ms": 45,
  "user_id": "usr_abc123",
  "device_id": "dev_xyz"
}
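One benefit of JSON logs is easy post-processing. For example, extracting slow requests from a log stream — field names come from the sample entry above, while the function and threshold are illustrative:

```python
import json

def slow_requests(lines, threshold_ms=1000):
    """Yield (path, duration_ms) for request logs over the threshold."""
    for line in lines:
        entry = json.loads(line)
        if (entry.get("msg") == "request completed"
                and entry.get("duration_ms", 0) > threshold_ms):
            yield entry["path"], entry["duration_ms"]

logs = [
    '{"msg": "request completed", "path": "/api/v1/blocks/upload", "duration_ms": 45}',
    '{"msg": "request completed", "path": "/api/v1/admin/gc/run", "duration_ms": 12500}',
]
print(list(slow_requests(logs)))  # [('/api/v1/admin/gc/run', 12500)]
```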

Security Checklist

Before Production

  • Enable TLS, either directly or behind a reverse proxy
  • Protect admin endpoints with a strong admin token
  • Enable automatic backups and verify a restore with dry_run=true
  • Run ebla-server doctor to validate the configuration

Ongoing

  • Monitor the /health endpoint and Prometheus metrics
  • Review backup status and test restores periodically
  • Watch TLS certificate expiry (reported in the doctor output)
  • Review warn and error log entries