API Reference
SUB&SUB exposes a multi-provider relay at https://api.subnsub.com/v1. OpenAI clients hit /v1/chat/completions; Anthropic clients hit /v1/messages. The same sk-cf-... key routes both — pick the model in the request body and the relay picks the upstream.
Quick start
Three things you need:
- Base URL: https://api.subnsub.com/v1 (OpenAI clients) or https://api.subnsub.com (Anthropic clients; the SDK appends /v1/messages itself)
- API key: sk-cf-... issued from the console
- Model: one of the 21 verified models, e.g. gpt-5.4-mini or claude-haiku-4.5
Authentication
Every request must carry an Authorization: Bearer sk-cf-... header. Keys are issued from the console and stored as SHA-256 hashes — once you leave the creation screen, the plaintext is gone forever, so save it immediately.
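As a minimal sketch of the header described above, the snippet below prepares (without sending) a chat completion request using only Python's standard library. The function name `build_chat_request` and the placeholder key `sk-cf-xxx` are illustrative and not part of the relay's API.

```python
import json
import urllib.request

BASE_URL = "https://api.subnsub.com/v1"  # OpenAI-style base URL from Quick start


def build_chat_request(api_key: str, model: str, user_text: str) -> urllib.request.Request:
    """Prepare (but do not send) a POST to /v1/chat/completions."""
    body = {"model": model, "messages": [{"role": "user", "content": user_text}]}
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",  # every request carries this header
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_chat_request("sk-cf-xxx", "gpt-5.4-mini", "Hello")
print(request.get_method(), request.full_url)
# Sending requires a real key and network access, for example:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```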
Endpoints
POST /v1/chat/completions
Send a chat completion request. Request shape matches the OpenAI Chat Completions API — the OpenAI SDKs work unmodified.
| Parameter | Type | Description |
|---|---|---|
| model | string | One of the verified model IDs. |
| messages | array | Conversation history. Each item: {role, content} with role ∈ system / user / assistant. |
| stream | boolean | If true, response is sent as SSE chunks. See Streaming. |
| stream_options | object | Optional. The relay always forces {include_usage: true} upstream so the final chunk carries the token-usage block — overriding it has no effect. |
| max_tokens | integer | Cap completion length. Defaults to the model's maximum. |
| temperature | number | 0 – 2. Higher = more random. |
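The parameter table can be mirrored by a small client-side check before a request is sent. This is only a sketch: `build_chat_payload` is a hypothetical helper, and the `max_tokens` lower bound is an assumption; the relay's own validation may differ.

```python
from typing import Optional

ALLOWED_ROLES = {"system", "user", "assistant"}


def build_chat_payload(
    model: str,
    messages: list,
    *,
    stream: bool = False,
    max_tokens: Optional[int] = None,
    temperature: Optional[float] = None,
) -> dict:
    """Assemble a /v1/chat/completions body, checking the documented constraints."""
    for message in messages:
        if message.get("role") not in ALLOWED_ROLES:
            raise ValueError(f"unsupported role: {message.get('role')!r}")
    if temperature is not None and not 0 <= temperature <= 2:
        raise ValueError("temperature must be between 0 and 2")
    if max_tokens is not None and max_tokens < 1:
        raise ValueError("max_tokens must be a positive integer")  # assumed bound

    payload: dict = {"model": model, "messages": list(messages)}
    if stream:
        payload["stream"] = True  # response arrives as SSE chunks
    if max_tokens is not None:
        payload["max_tokens"] = max_tokens  # when omitted, the model's maximum applies
    if temperature is not None:
        payload["temperature"] = temperature
    return payload
```

Optional fields are left out of the body unless set, so the relay's defaults apply.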
POST /v1/messages
Anthropic-native endpoint for the claude-* models — the Anthropic SDK (anthropic-sdk-python, @anthropic-ai/sdk, claude-code) works unmodified against this path. Point your base URL at https://api.subnsub.com and authenticate via the x-api-key header (the Authorization-Bearer form works too, if your client prefers it).
| Parameter | Type | Description |
|---|---|---|
| model | string | A claude-* model ID (see Available models). Passing an OpenAI model here returns 400 invalid_request_error. |
| max_tokens | integer | Required by Anthropic — caps the assistant reply length. |
| messages | array | Conversation history, Anthropic shape: {role, content} with role ∈ user / assistant. |
| stream | boolean | If true, returns the standard Anthropic SSE event sequence: message_start, content_block_delta, message_delta, message_stop. |
| thinking | object | Forwarded verbatim. Pair with a -thinking model variant to enable extended thinking. |
| cache_control | object | Prompt-caching is supported. Cache-write tokens bill at 1.25× and cache-read tokens at 0.10× the tier's input rate. |
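For callers using raw HTTP rather than an Anthropic SDK, the request might be assembled as in the sketch below. `build_messages_request` is a hypothetical helper. The `anthropic-version` header is what Anthropic's own API expects; whether this relay requires it is an assumption.

```python
import json

ANTHROPIC_BASE_URL = "https://api.subnsub.com"  # SDKs append /v1/messages themselves


def build_messages_request(api_key: str, model: str, user_text: str, max_tokens: int) -> dict:
    """Return the URL, headers and JSON body for a POST to /v1/messages."""
    if not model.startswith("claude-"):
        # The relay answers 400 invalid_request_error for non-Claude models here.
        raise ValueError("the /v1/messages endpoint expects a claude-* model")
    body = {
        "model": model,
        "max_tokens": max_tokens,  # required on this endpoint
        "messages": [{"role": "user", "content": user_text}],
    }
    headers = {
        "x-api-key": api_key,
        "content-type": "application/json",
        # Assumption: required by Anthropic's own API; not confirmed for this relay.
        "anthropic-version": "2023-06-01",
    }
    return {
        "url": f"{ANTHROPIC_BASE_URL}/v1/messages",
        "headers": headers,
        "body": json.dumps(body),
    }
```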
GET /v1/models
Lists the models you can actually call. The relay merges the OpenAI and Anthropic catalogues from sub2api and filters them down to the 21 models we have tested end to end against the current account pool. Phantom IDs that would return 503 at routing time, or an upstream 400 on the first token, are hidden. If every configured upstream is unreachable, the endpoint returns 502 models_unreachable rather than a misleading empty list.
# sample response (truncated)
{
"object": "list",
"data": [
{ "id": "gpt-5.4-mini", "type": "model", ... },
{ "id": "gpt-5.4", "type": "model", ... },
{ "id": "claude-sonnet-4.5", "type": "model", ... },
{ "id": "claude-haiku-4.5", "type": "model", ... },
...
]
}
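A client can pull the IDs out of a response shaped like the sample above. The helper names below are hypothetical. Splitting families by the `claude-` prefix is an assumption based on the model IDs listed in this reference.

```python
import json
from typing import Dict, List


def list_model_ids(response_body: str) -> List[str]:
    """Extract model IDs from a /v1/models response body."""
    payload = json.loads(response_body)
    if "error" in payload:
        # /v1/models is an OpenAI path, so failures use the OpenAI envelope.
        error = payload["error"]
        raise RuntimeError(f"{error.get('code')}: {error.get('message')}")
    return [entry["id"] for entry in payload.get("data", [])]


def group_by_family(model_ids: List[str]) -> Dict[str, List[str]]:
    """Split IDs into the two upstream families by their prefix."""
    groups: Dict[str, List[str]] = {"openai": [], "anthropic": []}
    for model_id in model_ids:
        family = "anthropic" if model_id.startswith("claude-") else "openai"
        groups[family].append(model_id)
    return groups
```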
Available models
Two upstream families. The 7 OpenAI models route to shared ChatGPT-tier accounts; the 14 Claude models route through a Kiro reverse proxy onto AWS CodeWhisperer. Per-token rates depend on the tier (see Pricing) — the same key works for both.
OpenAI
| Model ID | Family | Tier | Notes |
|---|---|---|---|
| gpt-5.4-mini | GPT-5.4 | Mini | Fast & cheap. Recommended default for chat & coding. |
| gpt-5.3-codex | Codex | Mini | Coding-tuned 5.3. Same price as mini. |
| gpt-5.2 | GPT-5.2 | Standard | Stable 5.2. |
| gpt-5.2-chat-latest | GPT-5.2 | Standard | Auto-tracks latest 5.2 chat tune (currently maps upstream to gpt-5.2). |
| gpt-5.4 | GPT-5.4 | Standard | Full-size GPT-5.4 — slower, stronger reasoning. |
| gpt-5.4-2026-03-05 | GPT-5.4 | Standard | Date-stamped snapshot of gpt-5.4. |
| gpt-5.5 | GPT-5.5 | Premium | Newer flagship. |
Anthropic
| Model ID | Family | Tier | Notes |
|---|---|---|---|
| claude-haiku-4.5 | Haiku 4.5 | Mini | Smallest Claude — same per-token rate as gpt-5.4-mini. |
| claude-haiku-4.5-thinking | Haiku 4.5 | Mini | Extended-thinking variant of haiku-4.5. Pair with the thinking request field. |
| claude-sonnet-4.5 | Sonnet 4.5 | Standard | Mid-tier Claude — same per-token rate as gpt-5.4. |
| claude-sonnet-4.5-thinking | Sonnet 4.5 | Standard | Extended-thinking variant of sonnet-4.5. |
| claude-sonnet-4.6 | Sonnet 4.6 | Standard | Newer Sonnet tune — Standard tier, same rate as sonnet-4.5. |
| claude-sonnet-4.6-thinking | Sonnet 4.6 | Standard | Extended-thinking variant of sonnet-4.6. |
| claude-opus-4.5 | Opus 4.5 | Ultra | Frontier Claude. Billed at Anthropic's list price — no margin (see Pricing). |
| claude-opus-4.5-thinking | Opus 4.5 | Ultra | Extended-thinking variant of opus-4.5 (adaptive thinking). |
| claude-opus-4.6 | Opus 4.6 | Ultra | Newer Opus tune. |
| claude-opus-4.6-thinking | Opus 4.6 | Ultra | Extended-thinking variant of opus-4.6. |
| claude-opus-4.7 | Opus 4.7 | Ultra | Previous Opus snapshot. |
| claude-opus-4.7-thinking | Opus 4.7 | Ultra | Extended-thinking variant of opus-4.7. |
| claude-opus-4.8 | Opus 4.8 | Ultra | Latest Opus snapshot. |
| claude-opus-4.8-thinking | Opus 4.8 | Ultra | Extended-thinking variant of opus-4.8. |
Not available: Pro models (gpt-5.2-pro, gpt-5.2-pro-2025-12-11), gpt-4o, gpt-4o-mini, gpt-5, gpt-5-mini, image / audio / realtime variants, claude-haiku-4-6, and dated upstream IDs (e.g. claude-sonnet-4-5-20250929). Calling them returns 400 model_not_available. Pro models are off the menu because the underlying social-tier accounts in the pool would exhaust their tiny quotas in a handful of requests.
Reasoning effort
Every OpenAI model above is a reasoning model — the backend can spend more or fewer "thinking" tokens before emitting visible output. Set reasoning_effort on the OpenAI /v1/chat/completions request body to control the budget. For Claude, use the Anthropic-native thinking request field (or pick a -thinking model variant) — see the /v1/messages section. The OpenAI models accept the same five effort values:
| Value | Behavior |
|---|---|
| none | No thinking — straight to the answer. Cheapest and fastest. |
| low | A short reasoning pass. |
| medium | Default if you don't pass the field. Balanced. |
| high | Deeper reasoning. Recommended for non-trivial coding / multi-step problems. |
| xhigh | Maximum effort. Slowest and most expensive; reserve for hard analysis where you genuinely need it. |
# Two equivalent forms — pick whichever your SDK supports
{
"model": "gpt-5.4-mini",
"reasoning_effort": "high",
"messages": [ ... ]
}
{
"model": "gpt-5.5",
"reasoning": { "effort": "xhigh" },
"messages": [ ... ]
}
OpenAI's own documentation also lists 'minimal', but our pool's models reject it: "'minimal' is not supported with this model". Stick to the five values above.
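A client-side guard for the five accepted values could look like the sketch below. `with_reasoning_effort` is a hypothetical helper; the `nested` flag switches between the two request forms shown above.

```python
EFFORT_VALUES = ("none", "low", "medium", "high", "xhigh")


def with_reasoning_effort(payload: dict, effort: str, nested: bool = False) -> dict:
    """Return a copy of the request body with the reasoning effort set."""
    if effort not in EFFORT_VALUES:
        # Catches values such as 'minimal' before the pool's models reject them.
        raise ValueError(f"reasoning effort must be one of {EFFORT_VALUES}, got {effort!r}")
    updated = dict(payload)  # shallow copy so the caller's dict is not mutated
    if nested:
        updated["reasoning"] = {"effort": effort}
    else:
        updated["reasoning_effort"] = effort
    return updated
```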
Streaming
Set "stream": true to receive Server-Sent Events. The final chunk carries a usage block (we force stream_options.include_usage upstream so token counts are always emitted), then a literal data: [DONE] closes the stream.
# Streaming format (line by line)
data: {"id":"resp_...","choices":[{"delta":{"content":"Hi"}}]}
data: {"id":"resp_...","choices":[{"delta":{"content":"!"}}]}
data: {"id":"resp_...","choices":[],"usage":{"prompt_tokens":18,"completion_tokens":11,"total_tokens":29}}
data: [DONE]
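The stream format above can be consumed with a small line parser. This is a sketch under the assumption that each event arrives as one `data:` line, as in the sample; `collect_stream` is a hypothetical name.

```python
import json
from typing import Iterable, Optional, Tuple


def collect_stream(lines: Iterable[str]) -> Tuple[str, Optional[dict]]:
    """Join the content deltas and return them with the final usage block."""
    parts = []
    usage = None
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and anything that is not a data line
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # literal terminator, not JSON
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            piece = choice.get("delta", {}).get("content")
            if piece:
                parts.append(piece)
        if chunk.get("usage"):
            usage = chunk["usage"]  # carried by the final chunk before [DONE]
    return "".join(parts), usage
```

Any iterable of text lines works as input, for example a decoded HTTP response body read line by line.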
Web search
Append :online to any model ID supported by the endpoint and the relay will run a web search before forwarding to the model, prepending the results to the conversation so the answer is grounded in fresh data. The suffix works on /v1/chat/completions and /v1/messages (the latter still requires a claude-* base); no search-specific request fields are required.
# Same call as before — just :online on the model
curl https://api.subnsub.com/v1/chat/completions \
-H "Authorization: Bearer sk-cf-xxx" \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-5.4-mini:online",
"messages": [
{"role": "user", "content": "What did Anthropic ship this week?"}
]
}'
How it works: the relay strips :online, takes the most recent user message as the query (capped at 400 characters), calls Tavily for up to 3 results with extracted page text when available, plus an optional Tavily-generated summary, then prepends them to that same user turn as a clearly-delimited <search_results> block before sending the request upstream. The search call has an 8-second timeout. Results are deliberately injected into the user role — never the system prompt — so untrusted snippets can't be elevated to system-priority instructions.
The <search_results> block looks like this. It's preceded by a one-line instruction telling the model to treat the block as untrusted external data and cite numbered items inline:
<search_results query="What did Anthropic ship this week?" retrieved="2026-05-21">
Summary: <short LLM-generated synthesis of the result set>
[1] Anthropic launches Opus 4.8
URL: https://www.anthropic.com/news/opus-4-8
<extracted page text, or short snippet if extraction failed — up to ~2000 chars>
[2] ...
</search_results>
| Behavior | Detail |
|---|---|
| Cost | No surcharge today — you pay the model's normal per-token rate; the relay absorbs the search call. The injected <search_results> block does count as input tokens, so expect a higher prompt-token bill than the same question without :online. |
| Failure mode | Soft. If Tavily times out or errors, the request continues to the model without search context (you still get an answer, just ungrounded). The only hard failure is 503 search_unavailable when search isn't configured on the relay at all. |
| count_tokens | /v1/messages/count_tokens strips the suffix but never calls Tavily — the count reflects your original prompt, not the augmented one. |
| Multi-turn | Only the last user turn is queried & augmented; earlier turns are untouched. To search again, send a new user message with :online still on the model. |
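The augmentation steps described above can be approximated as follows. This is an illustrative sketch, not the relay's actual code. The `search` callable stands in for the Tavily call and is assumed to return a ready-made `<search_results>` block, or `None` on failure. Message content is assumed to be a plain string.

```python
import copy
from typing import Callable, Optional, Tuple

ONLINE_SUFFIX = ":online"
MAX_QUERY_CHARS = 400


def split_online(model: str) -> Tuple[str, bool]:
    """Strip the :online suffix and report whether it was present."""
    if model.endswith(ONLINE_SUFFIX):
        return model[: -len(ONLINE_SUFFIX)], True
    return model, False


def last_user_index(messages: list) -> Optional[int]:
    """Index of the most recent user turn, or None if there is none."""
    for index in range(len(messages) - 1, -1, -1):
        if messages[index].get("role") == "user":
            return index
    return None


def augment(payload: dict, search: Callable[[str], Optional[str]]) -> dict:
    """Build the upstream request: strip the suffix, search, prepend the results."""
    model, wants_search = split_online(payload["model"])
    upstream = copy.deepcopy(payload)  # leave the caller's request untouched
    upstream["model"] = model
    index = last_user_index(upstream["messages"])
    if not wants_search or index is None:
        return upstream
    original = upstream["messages"][index]["content"]
    block = search(original[:MAX_QUERY_CHARS])  # query is capped at 400 characters
    if block:  # soft failure: with no block the request goes on ungrounded
        upstream["messages"][index]["content"] = f"{block}\n\n{original}"
    return upstream
```

Only the last user turn is modified and its role stays `user`, which matches the rule that results are never injected into the system prompt.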
When to use :online
The relay does a single Tavily call per request and injects the results — it is not an agentic search loop. The model does not decide to re-search based on what it sees, the way Perplexity Sonar or the ChatGPT browse tool do. Plan around that limitation:
| Good fit | Bad fit |
|---|---|
| Time-sensitive facts (news, prices, version numbers, release dates) | Private or pasted code that isn't on the public web — adds prompt noise without grounding |
| Locating an official doc or announcement | Math, reasoning, translation, creative writing — nothing to ground |
| Anything you would otherwise verify by Googling | Stable knowledge already in training data ("what is a binary tree") |
Phrase the last user message as a standalone search query. The search runs against the literal text of your most recent user turn (capped at 400 chars), so conversational follow-ups like "and what about the latest version?" become useless queries with no context. In a multi-turn chat, restate the topic when you add :online — e.g. "latest version of the Anthropic Python SDK" rather than "the latest one".
For questions that need multi-step synthesis (compare-and-contrast, deep research), break them into multiple turns and add :online to each. The model will read each turn's fresh results; you steer the next query manually. Note that the injected <search_results> block is sent upstream only — it isn't echoed back to your client and isn't preserved into the next request, so if a later turn depends on details from earlier sources, ask the model to summarise them in its visible reply. One-shot research mode is not supported.
Tip: pair :online with a higher reasoning effort (e.g. reasoning_effort: "high") so the model actually weighs the returned sources rather than leaning on the first result. The injected instruction asks the model to cite numbered sources as [1], [2] inline, so the output will usually carry such citations, though the model isn't strictly bound to that format.
Errors
The envelope depends on which endpoint you called — the relay returns errors in the protocol that matches the caller's SDK, and upstream errors are passed through verbatim.
OpenAI paths (/v1/chat/completions, /v1/responses, /v1/models) — OpenAI envelope:
{ "error": { "message": "...", "type": "...", "code": "..." } }
Anthropic paths (/v1/messages, /v1/messages/count_tokens) — Anthropic envelope:
{ "type": "error", "error": { "type": "...", "message": "..." } }
The Anthropic envelope uses a different shape — no code field, and the discriminator type: "error" is at the top level (with the inner error.type giving the category, e.g. authentication_error, invalid_request_error, permission_error, api_error). Anthropic SDKs already parse this shape; vanilla OpenAI SDK error handlers won't, so call /v1/messages with an Anthropic SDK (or do raw HTTP).
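Code that calls both endpoints over raw HTTP can normalise the two envelopes into one shape. The sketch below relies only on the fields shown above; `RelayError` and `parse_error` are hypothetical names.

```python
import json
from typing import NamedTuple, Optional


class RelayError(NamedTuple):
    protocol: str         # "openai" or "anthropic"
    category: str         # error.type in either envelope
    code: Optional[str]   # only the OpenAI envelope carries a code
    message: str


def parse_error(body: str) -> Optional[RelayError]:
    """Return a normalised error, or None if the body is not an error envelope."""
    payload = json.loads(body)
    error = payload.get("error") if isinstance(payload, dict) else None
    if not isinstance(error, dict):
        return None
    # The Anthropic envelope has the discriminator type: "error" at the top level.
    protocol = "anthropic" if payload.get("type") == "error" else "openai"
    code = error.get("code") if protocol == "openai" else None
    return RelayError(protocol, error.get("type", ""), code, error.get("message", ""))
```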
Status codes are the canonical HTTP ones across both protocols:
| Status | OpenAI code / Anthropic error.type | Meaning |
|---|---|---|
| 401 | invalid_api_key / authentication_error | Missing or unknown sk-cf-... key. |
| 402 | insufficient_balance / permission_error | Account balance is negative. Top up in the console billing tab. |
| 403 | key_revoked / permission_error | The key was revoked. |
| 400 | model_not_available / invalid_request_error | The model you sent isn't in the verified catalogue, or is wrong for the endpoint (e.g. an OpenAI model on /v1/messages) — check Available models. |
| 503 | — | No upstream account currently serves the request — usually a pool-wide rate-limit window, not a config issue. |
| 503 | search_unavailable / api_error | You used :online but web search isn't configured on this relay. See Web search. |
| 502 | upstream_unreachable / api_error | Relay couldn't reach the backend. Retry after a short backoff. |
Pricing & billing
Pay-as-you-go, billed per token in microdollars (1 micro = $0.000001 = 1/10,000 of a cent) so sub-cent requests are tracked accurately. Rates are per 1M tokens, by tier — see the model table for which tier each model maps to.
| Tier | Models | Input / 1M | Output / 1M |
|---|---|---|---|
| Mini | gpt-5.4-mini, gpt-5.3-codex, claude-haiku-4.5, claude-haiku-4.5-thinking | $0.20 | $1.60 |
| Standard | gpt-5.2, gpt-5.2-chat-latest, gpt-5.4, gpt-5.4-2026-03-05, claude-sonnet-4.5, claude-sonnet-4.5-thinking, claude-sonnet-4.6, claude-sonnet-4.6-thinking | $0.75 | $6.00 |
| Premium | gpt-5.5 | $1.10 | $8.80 |
| Ultra | claude-opus-4.5, claude-opus-4.5-thinking, claude-opus-4.6, claude-opus-4.6-thinking, claude-opus-4.7, claude-opus-4.7-thinking, claude-opus-4.8, claude-opus-4.8-thinking | $5.00 | $25.00 |
Ultra-tier rates match Anthropic's published Opus list price — straight pass-through, no margin. The other tiers run below their upstream rates thanks to the pooled subscription backing.
Reasoning tokens (when you set reasoning_effort on OpenAI, or use a Claude -thinking variant) count as output tokens at the model's tier rate — there's no separate surcharge for high effort, but a deep-thinking request can easily emit 10–50× more output tokens than a no-effort one, so the dollar bill scales with it.
Anthropic prompt-caching bills as a separate line item: cache writes at 1.25× and cache reads at 0.10× the tier's input rate. So a haiku-4.5 cache hit costs 0.20 × 0.10 = $0.02 per 1M tokens, and a sonnet-4.5 cache hit costs 0.75 × 0.10 = $0.075 per 1M tokens. Cache columns are recorded on each settle row so the console can show the breakdown.
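The arithmetic above can be reproduced with a short estimator. It uses the tier table as published here. How the relay rounds fractional microdollars, and whether cached tokens are reported separately from regular input tokens, are assumptions.

```python
from decimal import Decimal

# (input, output) in dollars per 1M tokens, which equals microdollars per token.
TIER_RATES = {
    "mini": (Decimal("0.20"), Decimal("1.60")),
    "standard": (Decimal("0.75"), Decimal("6.00")),
    "premium": (Decimal("1.10"), Decimal("8.80")),
    "ultra": (Decimal("5.00"), Decimal("25.00")),
}
CACHE_WRITE_MULTIPLIER = Decimal("1.25")
CACHE_READ_MULTIPLIER = Decimal("0.10")


def cost_microdollars(
    tier: str,
    input_tokens: int,
    output_tokens: int,
    cache_write_tokens: int = 0,
    cache_read_tokens: int = 0,
) -> Decimal:
    """Estimate a request's cost in microdollars (unrounded)."""
    input_rate, output_rate = TIER_RATES[tier]
    return (
        input_tokens * input_rate
        + output_tokens * output_rate  # reasoning tokens are billed as output
        + cache_write_tokens * input_rate * CACHE_WRITE_MULTIPLIER
        + cache_read_tokens * input_rate * CACHE_READ_MULTIPLIER
    )
```

For example, 18 input and 11 output tokens on a Mini-tier model come to 18 × 0.20 + 11 × 1.60 = 21.2 microdollars.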
Balance is deducted in real time as each request returns — for streaming requests, settlement runs after the [DONE] chunk lands. View your live balance and per-request settlements at /console#billing.
Rate limits
No per-key rate limits today. Upstream account-pool concurrency limits and OpenAI server-side throttling still apply; if you hit them, the relay returns 429 with a retry-after header. Per-key RPM / TPM limits are planned for after the MVP.
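A client can honour the retry-after header and back off on 502 with a helper like the one below. This is a sketch: the exponential schedule and the 30-second cap are assumptions, only a numeric retry-after value is handled, and `retry_delay` is a hypothetical name.

```python
from typing import Mapping, Optional


def retry_delay(status: int, headers: Mapping[str, str], attempt: int) -> Optional[float]:
    """Seconds to wait before retrying, or None when a retry is unlikely to help."""
    if status == 429:
        # Header names are case-insensitive, so compare them in lower case.
        value = next((v for k, v in headers.items() if k.lower() == "retry-after"), None)
        if value is not None and value.strip().isdigit():
            return float(value.strip())
    if status in (429, 502):
        # Assumed backoff: 1s, 2s, 4s, ... capped at 30s (attempt counts from 0).
        return min(2.0 ** attempt, 30.0)
    return None  # e.g. 400 / 401 / 402 / 403 need a fix, not a retry
```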