Operations
Production operations guide covering backups, monitoring, TLS, garbage collection, and maintenance.
Backups
How Backups Work
Ebla's backup system captures:
- PostgreSQL Metadata: Users, libraries, commits, teams, permissions
- Block Manifest: List of all referenced blocks (blocks themselves are not copied)
- Configuration: Server settings at backup time
Blocks are not included in backups. For disaster recovery, ensure your block storage backend (S3, GCS, filesystem) has its own backup/replication strategy.
Automatic Backups
Enable scheduled backups in server.toml:
[backup]
enabled = true # Enable automatic backups
interval = "24h" # Backup frequency
retention = 7 # Keep N most recent backups
path = "/var/lib/ebla/backups"
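The retention setting keeps only the N most recent backups; older ones are pruned after each run. A minimal Python sketch of that pruning logic (the function name and the assumption that backup IDs sort lexically by timestamp are illustrative, not Ebla's actual implementation):

```python
# Illustrative retention pruning: keep the N newest backups, drop the rest.
# Assumes backup IDs embed a sortable timestamp (backup_YYYYMMDD_HHMMSS).

def prune_backups(backup_ids, retention):
    """Return (kept, deleted) lists given a retention count."""
    ordered = sorted(backup_ids, reverse=True)  # newest first; IDs sort lexically
    return ordered[:retention], ordered[retention:]

kept, deleted = prune_backups(
    ["backup_20260113_103000", "backup_20260115_103000", "backup_20260114_103000"],
    retention=2,
)
print(kept)     # two newest backups
print(deleted)  # oldest backup
```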
Manual Backup
# Create backup via API
curl -X POST http://server:6333/api/v1/admin/backup/create \
-H "Authorization: Bearer $ADMIN_TOKEN"
# Response
{
"id": "backup_20260115_103000",
"status": "completed",
"size_bytes": 15728640,
"duration_ms": 2500,
"created_at": "2026-01-15T10:30:00Z"
}
List Backups
curl http://server:6333/api/v1/admin/backup/list \
-H "Authorization: Bearer $ADMIN_TOKEN"
# Response
{
"backups": [
{
"id": "backup_20260115_103000",
"size_bytes": 15728640,
"created_at": "2026-01-15T10:30:00Z",
"status": "completed"
},
{
"id": "backup_20260114_103000",
"size_bytes": 14680064,
"created_at": "2026-01-14T10:30:00Z",
"status": "completed"
}
]
}
Restore
# Dry run first (verify backup integrity)
curl -X POST "http://server:6333/api/v1/admin/backup/{id}/restore?dry_run=true" \
-H "Authorization: Bearer $ADMIN_TOKEN"
# Actual restore (destructive!)
curl -X POST "http://server:6333/api/v1/admin/backup/{id}/restore" \
-H "Authorization: Bearer $ADMIN_TOKEN"
- Restore is destructive: it replaces all current metadata
- Stop all clients before restoring
- Ensure block storage is intact and accessible
Admin UI
Manage backups at /admin/backups:
- View backup history with size and status
- Trigger manual backup
- Download backup files
- Initiate restore with dry-run option
Monitoring
Health Endpoint
curl http://server:6333/health
{
"status": "healthy",
"timestamp": "2026-01-15T10:30:00Z",
"version": "0.56.0",
"checks": {
"database": "ok",
"storage": "ok",
"gc": "enabled"
},
"total_users": 25,
"active_users": 12,
"total_devices": 48,
"online_devices": 18,
"total_libraries": 35,
"total_commits": 12456,
"recent_commits": 156
}
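A monitoring probe can poll /health and alert on the fields shown above. A minimal sketch (the function name and alert conditions are illustrative; the field names come from the response example):

```python
# Evaluate a /health response payload and return alert strings.
# Alert conditions here are illustrative, not part of Ebla's API.

def evaluate_health(payload):
    """Return a list of alerts for a parsed /health response."""
    alerts = []
    if payload.get("status") != "healthy":
        alerts.append(f"server status: {payload.get('status')}")
    # Each named check reports "ok" (or "enabled" for optional subsystems).
    for check, state in payload.get("checks", {}).items():
        if state not in ("ok", "enabled"):
            alerts.append(f"check {check}: {state}")
    return alerts

sample = {"status": "healthy", "checks": {"database": "ok", "storage": "error"}}
print(evaluate_health(sample))  # ['check storage: error']
```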
Library Health
curl -H "Authorization: Bearer $TOKEN" \
http://server:6333/api/v1/libraries/{id}/health
{
"library_id": "lib_abc123",
"library_name": "Documents",
"status": "healthy",
"head_commit_id": "commit_xyz",
"total_commits": 156,
"recent_commits": 12,
"unresolved_conflicts": 0,
"syncing_devices": 3,
"up_to_date_devices": 5,
"behind_devices": 0
}
Device Health
curl -H "Authorization: Bearer $TOKEN" \
http://server:6333/api/v1/devices/{id}/health
{
"device_id": "dev_123",
"device_name": "MacBook Pro",
"status": "healthy",
"is_online": true,
"last_seen": "2026-01-15T10:25:00Z",
"synced_libraries": 3,
"p2p_enabled": true,
"recent_errors": 0
}
Server Diagnostics
ebla-server doctor --config /etc/ebla/server.toml
Ebla Server Doctor
==================
Config: /etc/ebla/server.toml
✓ Config: configuration file valid
✓ Database: database connection successful (PostgreSQL 16.1)
✓ Storage: filesystem storage accessible and writable
✓ Garbage Collection: GC enabled (interval: 24h, min_age: 24h)
✓ Backup: backup enabled (interval: 24h, retention: 7)
✓ Knowledge Layer: OpenAI configured (model: text-embedding-3-small)
✓ TLS: TLS enabled (cert expires: 2026-12-31)
Overall Status: ok
Timestamp: 2026-01-15T10:30:00Z
Prometheus Metrics
Expose Prometheus metrics:
[metrics]
enabled = true
path = "/metrics" # Endpoint path
port = 9090 # Separate port (optional)
Available metrics:
- ebla_http_requests_total - HTTP request count by method/path/status
- ebla_http_request_duration_seconds - Request latency histogram
- ebla_active_connections - Current WebSocket connections
- ebla_blocks_uploaded_total - Total blocks uploaded
- ebla_blocks_downloaded_total - Total blocks downloaded
- ebla_commits_created_total - Total commits created
- ebla_storage_bytes - Total storage used by tier
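These metrics can drive Prometheus alerting rules. A hedged example alert on the request counter (the threshold, group name, and the assumption that HTTP status is exposed as a `status` label are illustrative):

```yaml
# Example Prometheus alerting rule; thresholds and label names are assumptions.
groups:
  - name: ebla
    rules:
      - alert: EblaHighErrorRate
        expr: |
          sum(rate(ebla_http_requests_total{status=~"5.."}[5m]))
            / sum(rate(ebla_http_requests_total[5m])) > 0.05
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Ebla 5xx rate above 5% for 10 minutes"
```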
TLS Configuration
Option 1: User-Provided Certificates
[server]
tls_cert = "/etc/ebla/cert.pem"
tls_key = "/etc/ebla/key.pem"
Certificate requirements:
- PEM format
- Full chain (including intermediates)
- RSA or ECDSA key
Option 2: Reverse Proxy with Caddy
Let Caddy terminate TLS with automatic Let's Encrypt certificates:
# docker-compose.yml
services:
caddy:
image: caddy:2-alpine
ports:
- "80:80"
- "443:443"
volumes:
- ./Caddyfile:/etc/caddy/Caddyfile
- caddy_data:/data
depends_on:
- ebla-server
ebla-server:
image: ghcr.io/ebla-io/ebla-server:latest
# No port exposure - Caddy proxies
volumes:
caddy_data:
# Caddyfile
files.example.com {
reverse_proxy ebla-server:6333
}
Option 3: Nginx Reverse Proxy
# /etc/nginx/sites-available/ebla
server {
listen 443 ssl http2;
server_name files.example.com;
ssl_certificate /etc/letsencrypt/live/files.example.com/fullchain.pem;
ssl_certificate_key /etc/letsencrypt/live/files.example.com/privkey.pem;
location / {
proxy_pass http://127.0.0.1:6333;
proxy_http_version 1.1;
proxy_set_header Upgrade $http_upgrade;
proxy_set_header Connection "upgrade";
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
# WebSocket support
proxy_read_timeout 86400;
}
# Large file uploads
client_max_body_size 100M;
}
Garbage Collection
How GC Works
Garbage collection removes orphaned blocks:
- Scan all commits for referenced blocks
- Identify blocks not referenced by any commit
- Wait for minimum age (prevent race with uploads)
- Delete orphaned blocks
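The steps above amount to a mark-and-sweep over block references. A minimal Python sketch (data structures and names are illustrative, not Ebla's internals):

```python
# Illustrative mark-and-sweep over block references, mirroring the steps above.

def find_orphans(commits, stored_blocks, block_ages, min_age_secs):
    """commits: {commit_id: [block_ids]}; stored_blocks: set of block IDs;
    block_ages: {block_id: age in seconds}. Returns deletable orphan blocks."""
    referenced = {b for blocks in commits.values() for b in blocks}  # 1. scan commits
    orphans = stored_blocks - referenced                             # 2. identify orphans
    # 3. honor min_age so blocks mid-upload are never deleted
    return {b for b in orphans if block_ages.get(b, 0) >= min_age_secs}

commits = {"c1": ["b1", "b2"], "c2": ["b2", "b3"]}
stored = {"b1", "b2", "b3", "b4", "b5"}
ages = {"b4": 90000, "b5": 600}  # seconds since upload
print(find_orphans(commits, stored, ages, min_age_secs=86400))  # {'b4'}
```

Here b4 and b5 are unreferenced, but only b4 exceeds the 24h minimum age and is deleted.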
Configuration
[gc]
enabled = true # Enable automatic GC
interval = "24h" # Run every 24 hours
min_age = "24h" # Only delete blocks older than 24h
Manual GC
# Dry run (report what would be deleted)
curl -X POST "http://server:6333/api/v1/admin/gc/run?dry_run=true" \
-H "Authorization: Bearer $ADMIN_TOKEN"
{
"dry_run": true,
"orphan_blocks": 156,
"orphan_size_bytes": 524288000,
"would_delete": 156
}
# Actual GC run
curl -X POST http://server:6333/api/v1/admin/gc/run \
-H "Authorization: Bearer $ADMIN_TOKEN"
{
"dry_run": false,
"deleted_blocks": 156,
"freed_bytes": 524288000,
"duration_ms": 12500
}
GC Status
curl http://server:6333/api/v1/admin/gc/status \
-H "Authorization: Bearer $ADMIN_TOKEN"
{
"enabled": true,
"interval": "24h",
"min_age": "24h",
"last_run": "2026-01-15T02:00:00Z",
"last_run_duration_ms": 15000,
"last_run_deleted": 89,
"last_run_freed_bytes": 356515840,
"next_run": "2026-01-16T02:00:00Z"
}
Admin UI
Access GC controls at /admin/maintenance:
- View GC status and history
- Trigger manual GC (dry-run or actual)
- View orphan block report
Tiered Storage
Overview
Tiered storage optimizes cost and performance by moving blocks between storage tiers:
| Tier | Purpose | Example Backend |
|---|---|---|
| Hot | Write buffer, frequently accessed | NVMe filesystem |
| Warm | Primary durable storage | SSD, S3 Standard |
| Cold | Infrequent access | HDD, S3 IA |
| Archive | Long-term retention | S3 Glacier |
Watermarks
Tiers use watermarks to control data flow:
- Soft (70%): Start flushing to next tier
- Hard (80%): Increase flush priority
- Critical (90%): Bypass tier, write directly to next
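The three thresholds map tier utilization to flush behavior. A minimal sketch of that decision, using the default percentages above (the function and return labels are illustrative):

```python
# Illustrative watermark logic for one tier, following the thresholds above.

def tier_action(used_pct, soft=70, hard=80, critical=90):
    """Map tier utilization (%) to the flush behavior described above."""
    if used_pct >= critical:
        return "bypass"          # skip this tier; write directly to the next
    if used_pct >= hard:
        return "flush-priority"  # increase flush priority
    if used_pct >= soft:
        return "flush"           # start flushing to the next tier
    return "normal"

print([tier_action(p) for p in (50, 75, 85, 95)])
# ['normal', 'flush', 'flush-priority', 'bypass']
```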
Configuration
[tiered_storage]
enabled = true
[tiered_storage.commit]
finalize_requires_durable = true
staging_ttl = "1h"
[[tiered_storage.tiers]]
id = "hot"
order = 0
backend = "filesystem"
capacity = "50GB"
durable = false
flush_to = ["warm"]
soft_watermark = 70
hard_watermark = 80
critical_watermark = 90
[tiered_storage.tiers.filesystem]
path = "/mnt/nvme/ebla/hot"
[[tiered_storage.tiers]]
id = "warm"
order = 1
backend = "s3"
durable = true
flush_to = ["cold"]
[tiered_storage.tiers.s3]
bucket = "ebla-warm"
region = "us-east-1"
Two-Phase Commits
With tiered storage, commits use two phases:
- Staging: Commit created, blocks in hot tier
- Finalization: Blocks reach durable tier, commit becomes HEAD
Clients receive WebSocket notification when commits finalize.
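The two phases can be sketched as a small state function, gated on block durability. This is a sketch under assumptions: the state names and the idea that staging_ttl expires an unfinalized commit are illustrative, not Ebla's documented semantics.

```python
# Illustrative two-phase commit lifecycle: staging -> finalized,
# gated on blocks reaching a durable tier. Names are assumptions.

def commit_state(blocks_durable, staging_age_secs, staging_ttl_secs=3600):
    """Return the commit's phase given block durability and staging age."""
    if blocks_durable:
        return "finalized"  # commit becomes HEAD; clients notified via WebSocket
    if staging_age_secs > staging_ttl_secs:
        return "expired"    # assumed: staged past staging_ttl without durability
    return "staging"

print(commit_state(blocks_durable=False, staging_age_secs=120))  # staging
print(commit_state(blocks_durable=True, staging_age_secs=120))   # finalized
```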
Admin Dashboard
Access at /admin/tiers:
- Tier cards with usage and watermark indicators
- System mode banner (Normal, Mandatory Flush, Bypass, Throttling)
- Flush queue with job counts
- Staging commits pending durability
- Manual flush/pause controls
Scaling
Large Libraries
For libraries with 100K+ files, enable materialized views:
[browse]
materialization_enabled = true
materialization_strategy = "incremental"
file_threshold = 10000 # Auto-enable above this count
refresh_interval = "30s"
worker_count = 2
Horizontal Scaling
For high availability:
- Load Balancer: HAProxy, nginx, or cloud LB
- Stateless Servers: Multiple Ebla server instances
- PostgreSQL: Primary + read replicas or managed service
- Object Storage: S3/GCS (inherently distributed)
- WebSocket Affinity: Sticky sessions or Redis pub/sub
# HAProxy example
frontend ebla
bind *:443 ssl crt /etc/ssl/ebla.pem
default_backend ebla_servers
backend ebla_servers
balance roundrobin
option httpchk GET /health
http-check expect status 200
# Sticky sessions for WebSocket
stick-table type ip size 100k expire 30m
stick on src
server ebla1 10.0.1.10:6333 check
server ebla2 10.0.1.11:6333 check
server ebla3 10.0.1.12:6333 check
Logging
Configuration
[logging]
level = "info" # debug, info, warn, error
format = "json" # json or text
output = "stdout" # stdout or file path
Log Levels
| Level | Use |
|---|---|
| debug | Verbose, all operations (development only) |
| info | Normal operations, requests, events |
| warn | Recoverable issues, degraded performance |
| error | Failures requiring attention |
Structured Logging
JSON format for log aggregation:
{
"level": "info",
"ts": "2026-01-15T10:30:00.000Z",
"msg": "request completed",
"method": "POST",
"path": "/api/v1/blocks/upload",
"status": 201,
"duration_ms": 45,
"user_id": "usr_abc123",
"device_id": "dev_xyz"
}
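JSON output makes ad-hoc analysis straightforward. A small example that filters log lines for slow requests using the fields shown above (the 100 ms threshold is illustrative):

```python
# Filter structured JSON log lines for slow requests.
# Field names match the example record above; the threshold is illustrative.
import json

lines = [
    '{"level":"info","msg":"request completed","path":"/api/v1/blocks/upload","duration_ms":45}',
    '{"level":"info","msg":"request completed","path":"/api/v1/commits","duration_ms":250}',
]
slow = [rec["path"] for rec in map(json.loads, lines) if rec.get("duration_ms", 0) > 100]
print(slow)  # ['/api/v1/commits']
```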
Security Checklist
Before Production
- [ ] Change default JWT secret (random 64+ characters)
- [ ] Enable TLS (user certs or reverse proxy)
- [ ] Set allow_signup = false if not needed
- [ ] Remove bootstrap token after initial setup
- [ ] Configure firewall (only expose 443)
- [ ] Enable automatic backups
- [ ] Set up monitoring alerts
Ongoing
- [ ] Review audit logs regularly
- [ ] Rotate TLS certificates before expiry
- [ ] Update Ebla to latest version
- [ ] Monitor storage usage and GC
- [ ] Test backup restoration periodically