Skip to content

API Compatibility

Relay targets practical compatibility for local model servers, not full vendor parity.

Endpoints

Core

MethodPathDescription
GET/HTML status dashboard (model table, queue, lifecycle)
GET/healthJSON health check ({"ok":true})

OpenAI-compatible

MethodPathDescription
GET/v1/modelsList available models with context sizes
GET/v1/models/:modelGet single model metadata
POST/v1/chat/completionsChat completions (streaming + non-streaming)
POST/v1/completionsLegacy completions shim
POST/v1/responsesOpenAI Responses API
GET/v1/responses/:idGet stored response
DELETE/v1/responses/:idDelete stored response
POST/v1/embeddingsEmbeddings normalization
POST/v1/rerankRerank normalization
POST/rerankRerank (alt path)

Anthropic-compatible

MethodPathDescription
POST/v1/messagesAnthropic Messages API
POST/v1/messages/count_tokensAnthropic token counting

Observability

MethodPathDescription
GET/relay/statusFull lifecycle + queue JSON
GET/relay/metricsRequest counts, latencies, error rates
GET/relay/jobsJob queue state
GET/relay/lifecyclePer-model lifecycle details
GET/relay/statsRequest history
GET/relay/requestsRecent request details
GET/relay/requests/:idSingle request detail
GET/relay/capabilitiesCapability registry
POST/relay/capabilities/refreshRefresh capabilities

/v1/responses Streaming SSE Lifecycle

The streaming responses endpoint emits the full OpenAI Responses SSE event sequence:

response.created → response.in_progress
  → response.output_item.added     (message or function_call)
    → response.content_part.added   (message only)
    → response.output_text.delta    (× N, message only)
    → response.function_call_arguments.delta (× N, tool calls)
    → response.content_part.done    (message only)
    → response.function_call_arguments.done (tool calls)
  → response.output_item.done
→ response.completed   (or response.failed on error)

Conversation Continuation

/v1/responses supports previous_response_id. Relay stores the full chat message history for each response. When a follow-up request references a previous response, Relay reconstructs the conversation context:

json
{
  "model": "qwen3.6-35b-a3b",
  "input": "what about Tokyo?",
  "previous_response_id": "resp_abc123..."
}

Tool Calls

Both streaming and non-streaming responses support function tool calls. Non-function tools (web search, file search, code interpreter) are silently stripped since llama.cpp backends don't support them.

Reasoning/Thinking Models

Models that emit reasoning_content (e.g. DeepSeek, Qwen thinking variants) have their reasoning buffered and surfaced as output text when no regular content is generated.

Behavior Notes

  • Unknown/hosted-only fields are governed by RELAY_UNKNOWN_FIELD_POLICY
  • Streaming output is normalized to protocol-appropriate SSE
  • Tool calls are normalized across OpenAI and Anthropic shapes
  • Error responses use provider-native error shapes (OpenAI or Anthropic)
  • Field policies apply per-endpoint: pass_through, strip (with warning), or reject

Non-Goals

  • Hosted assistants/threads/runs orchestration
  • Realtime APIs
  • Image/audio/file generation APIs
  • Full vendor control-plane semantics
  • Batch endpoints