Model Lifecycle

When RELAY_MODEL_LIFECYCLE_ENABLED=true, Relay manages local model servers automatically.

How It Works

  1. Lazy loading — a model starts on the first request, not at boot
  2. Auto-shutdown — models unload after RELAY_MODEL_IDLE_SHUTDOWN_MS of inactivity (default 1 hour)
  3. Eager switching — when a client requests a different model, the old one is killed before the new one starts (keeps VRAM free)
  4. Session-aware context — a change in the session-id header value triggers a model restart to clear the KV cache
  5. Orphan cleanup — kills stale llama-server processes from previous Relay instances on startup
  6. Circuit breaker — stops retrying models that fail repeatedly
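
As a rough illustration, the two settings named above could be exported in the shell that launches Relay. Only the variable names and the one-hour default come from this page; computing the value with shell arithmetic is just a readability convenience, and how you actually pass environment variables to Relay depends on your setup.

```shell
# Enable lifecycle management (variable name taken from this page).
export RELAY_MODEL_LIFECYCLE_ENABLED=true

# Idle timeout in milliseconds: 60 min * 60 s * 1000 ms = one hour,
# which matches the documented default.
export RELAY_MODEL_IDLE_SHUTDOWN_MS=$((60 * 60 * 1000))

echo "idle shutdown after ${RELAY_MODEL_IDLE_SHUTDOWN_MS} ms"
```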

Port Allocation

Each model gets a dedicated port starting from RELAY_MODEL_PORT_BASE (default 8081). Relay routes requests to the correct port automatically. You can pin a model to a specific port by adding "port": 8085 to its model map entry.
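
The allocation idea can be sketched in a few lines of shell. This is a hypothetical illustration, not Relay's code: it assumes ports are handed out sequentially from the base and that pinned ports are simply skipped. The function names `is_pinned` and `allocate_ports` and the model names are made up for the example.

```shell
# Illustrative only: sequential allocation is an assumption, not Relay's confirmed policy.
PORT_BASE="${RELAY_MODEL_PORT_BASE:-8081}"

# is_pinned PINNED_LIST PORT: succeeds if PORT appears in the space-separated list.
is_pinned() {
  case " $1 " in
    *" $2 "*) return 0 ;;
    *) return 1 ;;
  esac
}

# allocate_ports PINNED_LIST MODEL...: prints "model port" for each unpinned model.
allocate_ports() {
  pinned=$1
  shift
  port=$PORT_BASE
  for model in "$@"; do
    # Skip any port that a pinned model already owns.
    while is_pinned "$pinned" "$port"; do
      port=$((port + 1))
    done
    printf '%s %s\n' "$model" "$port"
    port=$((port + 1))
  done
}

# One model is pinned to 8082, so the unpinned models skip that port.
allocate_ports "8082" model-a model-b model-c
```

With the default base of 8081, the three unpinned models in this sketch receive 8081, 8083, and 8084.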

Session-Aware Context Clearing

Send a session-id header with requests:

```bash
curl -H "session-id: my-project" http://127.0.0.1:1234/v1/chat/completions ...
```

When the session ID changes (e.g. you switch from project-a to project-b), Relay restarts the model. This clears the KV cache so conversation state doesn't leak between sessions. If you don't send a session ID, the model keeps its context across requests.
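
The restart rule described above can be sketched as a small predicate. The function name `needs_restart` is hypothetical, and one behavior is an assumption this page does not confirm: the sketch treats the first session ID ever seen as not requiring a restart.

```shell
# needs_restart PREVIOUS_ID INCOMING_ID
# Succeeds (exit status 0) when the model should be restarted.
needs_restart() {
  previous=$1
  incoming=$2
  # No session ID on the request: the model keeps its context.
  [ -n "$incoming" ] || return 1
  # Assumption: the first session ID seen does not force a restart.
  [ -n "$previous" ] || return 1
  # Restart only when the ID actually changed.
  [ "$previous" != "$incoming" ]
}

if needs_restart "project-a" "project-b"; then
  echo "restart model to clear KV cache"
fi
```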

Headers checked (first match wins): session-id, session_id, x-session-affinity, x-client-request-id.
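
A minimal sketch of that lookup, assuming "first match wins" means the header names are tried in the order listed and that names are compared case-insensitively (as HTTP header names normally are). The function name `resolve_session_id` and the line-per-header input format are illustrative choices, not part of Relay.

```shell
# resolve_session_id reads "Name: value" header lines on stdin and prints
# the value of the highest-priority session header that is present.
resolve_session_id() {
  headers=$(cat)
  for name in session-id session_id x-session-affinity x-client-request-id; do
    value=$(printf '%s\n' "$headers" | awk -v want="$name" '
      {
        i = index($0, ":")
        if (i > 0 && tolower(substr($0, 1, i - 1)) == want) {
          v = substr($0, i + 1)
          sub(/^[ \t]+/, "", v)   # drop whitespace after the colon
          print v
          exit
        }
      }')
    if [ -n "$value" ]; then
      printf '%s\n' "$value"
      return 0
    fi
  done
  return 1   # no session header found
}

# session-id outranks x-client-request-id even though it appears later.
printf 'x-client-request-id: req-9\nSession-Id: project-a\n' | resolve_session_id
```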

Request Serialization

When RELAY_SERIALIZE_REQUESTS=true, Relay processes one request at a time in first-come, first-served (FCFS) order. Additional requests wait in a queue. This prevents thrashing when multiple agents hit the same model simultaneously.
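
To make the ordering concrete, here is a toy sketch of first-come, first-served handling: requests are read in arrival order and each one runs to completion before the next starts. `handle_request` and `serve_fcfs` are hypothetical stand-ins and say nothing about how Relay implements its queue internally.

```shell
# Hypothetical handler: stands in for forwarding one request to the model server.
handle_request() {
  printf 'start %s\n' "$1"
  printf 'done %s\n' "$1"
}

# serve_fcfs reads one request ID per line from stdin (arrival order) and
# finishes each request before starting the next.
serve_fcfs() {
  while IFS= read -r request_id; do
    handle_request "$request_id"
  done
}

printf 'agent-a\nagent-b\n' | serve_fcfs
```

Because each request completes before the next begins, the "start" and "done" lines of different agents never interleave.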

Start Scripts

Model start scripts are generated by the setup wizard (relay setup) or the provision command (relay provision). Each script:

  • Is a standalone shell script in start-scripts/
  • Accepts LLAMA_PORT env var for dynamic port allocation
  • Includes optimal flags from hardware detection (GPU layers, KV cache type, MoE offloading, --jinja)
  • Includes mmproj files for vision models and draft models for speculative decoding

Example generated script:

```bash
#!/bin/bash
# relay model: qwen3.6-35b-a3b-ud-q4-k-xl
# context: 262144  arch: qwen35moe
set -e
exec "/home/achu/llama.cpp/build-vulkan/bin/llama-server" \
  --model "/home/achu/models/unsloth/Qwen3.6-35B-A3B-GGUF/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf" \
  --host 127.0.0.1 \
  --port ${LLAMA_PORT:-8081} \
  --ctx-size 262144 \
  -ngl 999 \
  --parallel 1 \
  --flash-attn on \
  --cache-type-k q4_0 \
  --cache-type-v q4_0 \
  --jinja \
  --n-cpu-moe 25
```
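
The `${LLAMA_PORT:-8081}` expansion in the script is standard shell parameter expansion: it uses the value of LLAMA_PORT when that variable is set and non-empty, and falls back to 8081 otherwise. The short sketch below isolates that behavior; the helper name `effective_port` is made up for the example.

```shell
# Prints the port that the generated script would pass to --port.
effective_port() {
  echo "${LLAMA_PORT:-8081}"
}

unset LLAMA_PORT
effective_port    # variable unset: falls back to the default

LLAMA_PORT=8085
effective_port    # variable set: the override is used

LLAMA_PORT=
effective_port    # variable empty: ":-" also falls back to the default
```

In practice this is what lets a caller choose the port by setting LLAMA_PORT in the environment before running the script, while a manual run with no variable set still gets a usable port.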

Docker

Relay runs in Docker with network_mode: host and pid: host. This gives it direct access to localhost ports (for model servers) and the ability to spawn/kill model processes on the host.
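
For readers who start containers from the command line rather than a compose file, the two settings correspond to the `--network host` and `--pid host` options of `docker run`. The sketch below only prints the command so it can be reviewed first. The image name `relay:latest` is a placeholder assumption, and any volumes or environment variables your setup needs are not shown.

```shell
# Hypothetical image name; substitute the image you actually build or pull.
RELAY_IMAGE="${RELAY_IMAGE:-relay:latest}"

# relay_docker_command prints (does not run) a docker invocation that mirrors
# network_mode: host and pid: host.
relay_docker_command() {
  echo docker run --network host --pid host "$RELAY_IMAGE"
}

relay_docker_command
```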