API Reference

SUB&SUB exposes a multi-provider relay at https://api.subnsub.com/v1. OpenAI clients hit /v1/chat/completions; Anthropic clients hit /v1/messages. The same sk-cf-... key routes both — pick the model in the request body and the relay picks the upstream.

Service availability

Existing accounts only API access is currently limited to accounts created before June 8, 2026 (Beijing time). New registrations can use the shared account and SUB&SUB Tools, but cannot enter the API console, create an API key, add API credit, or call the relay. This section will be updated when API onboarding reopens.

Quick start

For an API-enabled existing account, you need three things:

Base URL: https://api.subnsub.com/v1 (OpenAI clients) or https://api.subnsub.com (Anthropic clients — the SDK appends /v1/messages itself)
API key: sk-cf-... issued from the console
Model: one of the 16 supported models — e.g. gpt-5.4-mini or claude-sonnet-5
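
As a rough sketch of how these three values fit together, the standard-library Python below builds (but does not send) a request for either protocol. The helper name build_request and its protocol argument are illustrative assumptions, not part of any SUB&SUB SDK; the paths and headers are the ones documented on this page.

```python
import json
import urllib.request

BASE = "https://api.subnsub.com"  # relay host from this page


def build_request(api_key, model, prompt, protocol="openai"):
    """Build a POST request for either protocol (hypothetical helper)."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    if protocol == "openai":
        url = f"{BASE}/v1/chat/completions"
        headers = {"Authorization": f"Bearer {api_key}"}
    elif protocol == "anthropic":
        url = f"{BASE}/v1/messages"
        body["max_tokens"] = 256  # max_tokens is required on /v1/messages
        headers = {"x-api-key": api_key, "anthropic-version": "2023-06-01"}
    else:
        raise ValueError(f"unknown protocol: {protocol}")
    headers["Content-Type"] = "application/json"
    return urllib.request.Request(
        url, data=json.dumps(body).encode(), headers=headers, method="POST"
    )
```

Passing the result to urllib.request.urlopen would send it. The official OpenAI and Anthropic SDKs do the same work once their base URL is set as described above.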

Authentication

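Every request must carry your key. OpenAI-protocol clients send an Authorization: Bearer sk-cf-... header; Anthropic-protocol clients may send the same key as x-api-key instead (see POST /v1/messages). Keys are issued from the console and stored only as SHA-256 hashes, so the plaintext cannot be recovered once you leave the creation screen. Save it immediately.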

Tip Generate one key per integration (chatbot, IDE plugin, batch job). Revoking a leaked key in the console takes effect within seconds.

Endpoints

The stable public surface is described below and in the machine-readable OpenAPI 3.1 document. Fields not listed here may be forwarded upstream, but are not automatically part of SUB&SUB's compatibility contract.

POST /v1/chat/completions

Send a chat completion request. Request shape matches the OpenAI Chat Completions API — the OpenAI SDKs work unmodified.

Parameter	Type	Description
model	string	One of the verified model IDs.
messages	array	Conversation history. Each item: `{role, content}` with `role` ∈ `system / user / assistant`.
stream	boolean	If `true`, response is sent as SSE chunks. See Streaming.
stream_options	object	Optional. The relay always forces `{include_usage: true}` upstream so the final chunk carries the token-usage block — overriding it has no effect.
max_tokens	integer	Cap completion length. Defaults to the model's maximum.
temperature	number	0 – 2. Higher = more random.
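
A minimal sketch of assembling the optional parameters above. The function name chat_body is hypothetical, and the local temperature check simply mirrors the documented 0 to 2 range; the relay or upstream may validate differently.

```python
def chat_body(model, messages, *, stream=False, max_tokens=None, temperature=None):
    """Assemble a /v1/chat/completions body, omitting unset optional fields."""
    body = {"model": model, "messages": messages}
    if stream:
        body["stream"] = True
    if max_tokens is not None:
        body["max_tokens"] = int(max_tokens)
    if temperature is not None:
        # Documented range is 0 to 2; fail early instead of waiting for a 400.
        if not 0 <= temperature <= 2:
            raise ValueError("temperature must be between 0 and 2")
        body["temperature"] = temperature
    return body
```

Leaving max_tokens unset keeps the documented default (the model's maximum), so the field is only added when a cap is actually wanted.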

POST /v1/responses

OpenAI Responses API — the newer OpenAI request shape (client.responses.create(...)). Works with every catalogue model: gpt-* natively, claude-* through the same compatibility bridge as chat/completions. Usage is metered identically — input/output tokens at the model's tier rate.

Parameter	Type	Description
model	string	Any catalogue model ID.
input	string \| array	The prompt — a plain string or the structured item list the Responses API defines.
max_output_tokens	integer	Caps response length (reasoning + visible output combined).
reasoning	object	`{"effort": "..."}` — same five values as reasoning_effort.
stream	boolean	If `true`, streams the standard Responses SSE sequence: `response.created`, `response.output_text.delta`, …, `response.completed`.
background	boolean	Not supported. `background: true` returns `400 unsupported_background_mode` — the relay only serves synchronous runs.

Heads-up The :online web-search suffix has no effect on this endpoint — the suffix is stripped but no search context is injected (queries are extracted from messages, which Responses requests don't carry). Use /v1/chat/completions or /v1/messages for web search.

Runnable Responses example:

curl https://api.subnsub.com/v1/responses \
  -H "Authorization: Bearer sk-cf-xxxxxxxxxxxx" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-5.4-mini",
    "input": "Explain exponential backoff in two sentences."
  }'

POST /v1/messages

Anthropic-native endpoint for the claude-* models — the Anthropic SDK (anthropic-sdk-python, @anthropic-ai/sdk, claude-code) works unmodified against this path. Point your base URL at https://api.subnsub.com and authenticate via the x-api-key header (the Authorization-Bearer form works too, if your client prefers it).

Parameter	Type	Description
model	string	A `claude-*` model ID (see Available models). Passing an OpenAI model here returns `400 invalid_request_error`.
max_tokens	integer	Required by Anthropic — caps the assistant reply length.
messages	array	Conversation history, Anthropic shape: `{role, content}` with `role` ∈ `user / assistant`.
stream	boolean	If `true`, returns the standard Anthropic SSE event sequence: `message_start`, `content_block_delta`, `message_delta`, `message_stop`.
thinking	object	Forwarded verbatim to Anthropic. Use `{"type":"adaptive"}` where supported; Fable 5 always uses adaptive thinking even when this field is omitted. There are no synthetic `-thinking` model IDs.
cache_control	object	Prompt-caching is supported. Cache-write tokens bill at 1.25× and cache-read tokens at 0.10× the tier's input rate.

Heads-up Claude requests are served directly by official Anthropic accounts. Use the exact official model IDs listed below.

Runnable Anthropic Messages example:

curl https://api.subnsub.com/v1/messages \
  -H "x-api-key: sk-cf-xxxxxxxxxxxx" \
  -H "anthropic-version: 2023-06-01" \
  -H "content-type: application/json" \
  -d '{
    "model": "claude-sonnet-5",
    "max_tokens": 256,
    "messages": [{"role": "user", "content": "Hello"}]
  }'

POST /v1/messages/count_tokens

Count an Anthropic-format prompt before sending it. Send the same x-api-key and anthropic-version headers, and the same model, system, messages, and tools body fields, that you would send to /v1/messages. This endpoint is not billed. A :online suffix is stripped, but search results are not fetched or counted.

curl https://api.subnsub.com/v1/messages/count_tokens \
  -H "x-api-key: sk-cf-xxxxxxxxxxxx" \
  -H "anthropic-version: 2023-06-01" \
  -H "content-type: application/json" \
  -d '{
    "model": "claude-sonnet-5",
    "messages": [{"role": "user", "content": "Count this prompt."}]
  }'

GET /v1/models

List the models you can actually use. The relay health-checks both upstream families and returns the 16 verified public IDs — the same whitelist the POST endpoints enforce, so discovery never advertises a model that would 400. If the upstream catalogue is unreachable the endpoint returns 502 models_unreachable rather than a misleading empty list.

# sample response (truncated)
{
  "object": "list",
  "data": [
    { "id": "gpt-5.4-mini",    "type": "model", ... },
    { "id": "gpt-5.4",         "type": "model", ... },
    { "id": "claude-sonnet-5", "type": "model", ... },
    { "id": "claude-fable-5",  "type": "model", ... },
    ...
  ]
}
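
One way a client might use this list is to group the IDs by family before choosing an endpoint. The sketch below assumes only the object/data/id fields shown in the sample response; split_by_family is a hypothetical helper name.

```python
import json


def split_by_family(models_response):
    """Group model IDs from a /v1/models-style payload by ID prefix."""
    families = {"openai": [], "anthropic": [], "other": []}
    for item in models_response.get("data", []):
        model_id = item["id"]
        if model_id.startswith("gpt-"):
            families["openai"].append(model_id)
        elif model_id.startswith("claude-"):
            families["anthropic"].append(model_id)
        else:
            families["other"].append(model_id)
    return families


# Hand-written sample in the shape shown above, not a captured response.
sample = json.loads(
    '{"object": "list", "data": ['
    '{"id": "gpt-5.4-mini", "type": "model"},'
    '{"id": "claude-sonnet-5", "type": "model"}]}'
)
print(split_by_family(sample))
```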

Compatibility contract

OpenAI-compatible does not mean that every field offered by every upstream model is guaranteed on every route. Use these three support levels:

Status	Detail
Documented & stable	Text generation on the four endpoints above; synchronous and streaming responses; documented reasoning controls; Anthropic prompt caching; `:online` on Chat Completions and Messages; authentication, metering, and the documented error envelopes.
Pass-through, model-dependent	Tool/function calling, strict tools, structured output / JSON Schema, sampling controls, stop sequences, multipart content (including images or documents), and model context/output limits. The edge forwards these fields without local validation, but upstream support and exact response shape can differ by model and protocol. Test your exact model and payload before production; no cross-provider normalisation is promised.
Not offered	Background Responses runs; `:online` on Responses; OpenAI image-generation, audio, Realtime, Batch, Files, Embeddings, and Moderation APIs; synthetic Claude `-thinking` aliases; and the `minimal` OpenAI reasoning effort.

Tip Treat openapi.json plus this page as the supported contract. A field accepted by one upstream today can still be withdrawn there without becoming a permanent SUB&SUB guarantee.

Available models

Two upstream families. The 7 OpenAI models route to shared ChatGPT-tier accounts; the 9 Claude models are served by official Anthropic accounts. Per-token rates depend on the tier (see Pricing) — the same key works for both.

OpenAI

Model ID	Family	Tier	Notes
gpt-5.4-mini	GPT-5.4	Mini	Fast & cheap. Recommended default for chat & coding.
gpt-5.4	GPT-5.4	Standard	Full-size GPT-5.4 — slower, stronger reasoning.
gpt-5.4-2026-03-05	GPT-5.4	Standard	Date-stamped snapshot of `gpt-5.4`.
gpt-5.5	GPT-5.5	Premium	Newer flagship.
gpt-5.6-luna	GPT-5.6	Luna	Lightweight GPT-5.6 — between Mini and Standard.
gpt-5.6-terra	GPT-5.6	Standard	Mid-size GPT-5.6 — same rate as `gpt-5.4`.
gpt-5.6-sol	GPT-5.6	Premium	Top GPT-5.6 — same rate as `gpt-5.5`.

Anthropic

Model ID	Family	Tier	Notes
claude-fable-5	Fable 5	Fable	Anthropic's most capable widely released model; adaptive thinking is always on.
claude-haiku-4-5-20251001	Haiku 4.5	Mini	Smallest Claude — same per-token rate as gpt-5.4-mini.
claude-sonnet-4-5-20250929	Sonnet 4.5	Standard	Mid-tier Claude — same per-token rate as gpt-5.4.
claude-sonnet-4-6	Sonnet 4.6	Standard	Newer Sonnet tune — Standard tier, same rate as sonnet-4.5.
claude-sonnet-5	Sonnet 5	Sonnet 5 Intro	Latest Sonnet; introductory pricing applies through August 31, 2026.
claude-opus-4-5-20251101	Opus 4.5	Ultra	Frontier Claude. Billed at Anthropic's list price — no margin (see Pricing).
claude-opus-4-6	Opus 4.6	Ultra	Newer Opus tune.
claude-opus-4-7	Opus 4.7	Ultra	Previous Opus snapshot.
claude-opus-4-8	Opus 4.8	Ultra	Latest Opus snapshot.

Heads-up The Claude catalogue uses official Anthropic model IDs. Prompt caching is supported: cache writes bill at 1.25× and reads at 0.10× the tier's input rate (see Pricing).

Fable refusals Fable 5 safety classifiers can return HTTP 200 with stop_reason: "refusal" and an empty content array. Clients should branch on stop_reason, not HTTP status alone, and retry the request with claude-opus-4-8. Through the OpenAI-protocol endpoints the same result is exposed as finish_reason: "content_filter" (chat/completions) or status: "incomplete" with incomplete_details.reason: "content_filter" (responses). Prompt-stage refusals are not deducted from your balance; a mid-stream refusal after partial output is billed normally.
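
A small sketch of the branching described above, working on already-decoded JSON bodies. The protocol labels ("messages", "chat", "responses") and the function name are choices made for this example only; the field names are the ones listed in the note.

```python
FALLBACK_MODEL = "claude-opus-4-8"  # retry target suggested in the note above


def is_refusal(response, protocol):
    """Return True when an HTTP 200 body is actually a safety refusal."""
    if protocol == "messages":
        return response.get("stop_reason") == "refusal"
    if protocol == "chat":
        choices = response.get("choices") or []
        return any(c.get("finish_reason") == "content_filter" for c in choices)
    if protocol == "responses":
        details = response.get("incomplete_details") or {}
        return (
            response.get("status") == "incomplete"
            and details.get("reason") == "content_filter"
        )
    raise ValueError(f"unknown protocol: {protocol}")
```

A caller would check is_refusal before reading any content, then resend the same request with the model set to FALLBACK_MODEL.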

Not available Retired OpenAI IDs (gpt-5.2* and gpt-5.3-codex*), the bare gpt-5.6 alias (use the named variants above), OpenAI Pro/image/audio/realtime variants, dot-notation IDs (for example claude-sonnet-4.5), and synthetic -thinking model IDs are not available. Use the exact IDs above and Anthropic's native thinking field.
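
To catch unavailable IDs before a request leaves the client, a local check along these lines may help. The set is copied from the two tables above and will go stale if the catalogue changes, so GET /v1/models remains the authoritative source. check_model is a hypothetical helper.

```python
SUPPORTED_MODELS = frozenset({
    "gpt-5.4-mini", "gpt-5.4", "gpt-5.4-2026-03-05", "gpt-5.5",
    "gpt-5.6-luna", "gpt-5.6-terra", "gpt-5.6-sol",
    "claude-fable-5", "claude-haiku-4-5-20251001", "claude-sonnet-4-5-20250929",
    "claude-sonnet-4-6", "claude-sonnet-5", "claude-opus-4-5-20251101",
    "claude-opus-4-6", "claude-opus-4-7", "claude-opus-4-8",
})


def check_model(model, endpoint):
    """Return the base model ID, or raise ValueError if it looks unavailable."""
    base, _, suffix = model.partition(":")
    if suffix not in ("", "online"):
        raise ValueError(f"unknown suffix: {suffix!r}")
    if base not in SUPPORTED_MODELS:
        raise ValueError(f"{base!r} is not in the catalogue")
    if endpoint.startswith("/v1/messages") and not base.startswith("claude-"):
        raise ValueError("/v1/messages requires a claude-* model")
    return base
```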

Reasoning effort

Every OpenAI model above is a reasoning model — the backend can spend more or fewer "thinking" tokens before emitting visible output. Set reasoning_effort on the OpenAI /v1/chat/completions request body (or reasoning: {"effort": ...} on /v1/responses) to control the budget. For Claude, use the Anthropic-native thinking and output_config.effort fields — see the /v1/messages section. The OpenAI models accept the same five effort values:

Value	Behavior
none	No thinking — straight to the answer. Cheapest and fastest.
low	A short reasoning pass.
medium	Default if you don't pass the field. Balanced.
high	Deeper reasoning. Recommended for non-trivial coding / multi-step problems.
xhigh	Maximum effort. Slowest and most expensive; reserve for hard analysis where you genuinely need it.

# Two forms of the same control, one per endpoint
# POST /v1/chat/completions
{
  "model": "gpt-5.4-mini",
  "reasoning_effort": "high",
  "messages": [ ... ]
}

# POST /v1/responses
{
  "model": "gpt-5.5",
  "reasoning": { "effort": "xhigh" },
  "input": "..."
}

Cost Thinking tokens count as output tokens for billing — higher effort = more output tokens = a bigger bill on the same prompt. The per-token rate doesn't change.

Heads-up The OpenAI protocol also defines 'minimal', but the models on this relay reject it: "'minimal' is not supported with this model". Stick to the five values above.
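
A sketch of setting the effort field per endpoint while rejecting values outside the five listed above (including 'minimal'). with_effort is a hypothetical helper name; the field names are the documented ones.

```python
EFFORTS = ("none", "low", "medium", "high", "xhigh")


def with_effort(body, effort, endpoint):
    """Return a copy of body carrying the effort field for the given endpoint."""
    if effort not in EFFORTS:
        raise ValueError(f"unsupported effort {effort!r}; use one of {EFFORTS}")
    out = dict(body)
    if endpoint == "/v1/chat/completions":
        out["reasoning_effort"] = effort
    elif endpoint == "/v1/responses":
        out["reasoning"] = {"effort": effort}
    else:
        raise ValueError("these effort fields apply to the OpenAI-protocol endpoints only")
    return out
```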

Streaming

Set "stream": true to receive Server-Sent Events. The final chunk carries a usage block (we force stream_options.include_usage upstream so token counts are always emitted), then a literal data: [DONE] closes the stream.

# Streaming format (line by line)
data: {"id":"resp_...","choices":[{"delta":{"content":"Hi"}}]}

data: {"id":"resp_...","choices":[{"delta":{"content":"!"}}]}

data: {"id":"resp_...","choices":[],"usage":{"prompt_tokens":18,"completion_tokens":11,"total_tokens":29}}

data: [DONE]
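
For clients that read the stream without an SDK, the format above can be folded into text plus a usage block roughly as follows. This is a minimal sketch that assumes one complete data: line per event, as in the sample; collect_stream is a hypothetical name.

```python
import json


def collect_stream(lines):
    """Fold chat-completions SSE lines into (text, usage)."""
    parts, usage = [], None
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # blank separators and any non-data lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # literal terminator closes the stream
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            piece = (choice.get("delta") or {}).get("content")
            if piece:
                parts.append(piece)
        if chunk.get("usage"):
            usage = chunk["usage"]  # arrives on the final chunk
    return "".join(parts), usage


sample = [
    'data: {"id":"resp_1","choices":[{"delta":{"content":"Hi"}}]}',
    "",
    'data: {"id":"resp_1","choices":[{"delta":{"content":"!"}}]}',
    "",
    'data: {"id":"resp_1","choices":[],"usage":{"prompt_tokens":18,"completion_tokens":11,"total_tokens":29}}',
    "",
    "data: [DONE]",
]
text, usage = collect_stream(sample)
print(text, usage["total_tokens"])  # prints: Hi! 29
```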

Runnable Python streaming example:

from openai import OpenAI

client = OpenAI(
    api_key="sk-cf-xxxxxxxxxxxx",
    base_url="https://api.subnsub.com/v1",
)

stream = client.chat.completions.create(
    model="gpt-5.4-mini",
    messages=[{"role": "user", "content": "Hello"}],
    stream=True,
)
for chunk in stream:
    text = chunk.choices[0].delta.content if chunk.choices else None
    if text:
        print(text, end="", flush=True)

Web search

Append :online to any model ID supported by the endpoint and the relay will run a web search before forwarding to the model, prepending the results to the conversation so the answer is grounded in fresh data. The suffix works on /v1/chat/completions and /v1/messages (the latter still requires a claude-* base); no search-specific request fields are required.

# Same call as before — just :online on the model
curl https://api.subnsub.com/v1/chat/completions \
  -H "Authorization: Bearer sk-cf-xxx" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-5.4-mini:online",
    "messages": [
      {"role": "user", "content": "What did Anthropic ship this week?"}
    ]
  }'

How it works: the relay strips :online and takes the most recent user message as the query (capped at 400 characters). It calls Tavily for up to 3 results, with extracted page text when available and an optional Tavily-generated summary, then prepends them to that same user turn as a clearly delimited <search_results> block before sending the request upstream. The search call has an 8-second timeout. Results are deliberately injected into the user role, never the system prompt, so untrusted snippets can't be elevated to system-priority instructions.
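
The query selection and injection steps can be mimicked client-side to preview what a request would search for. This is an illustrative sketch of the described behaviour, not the relay's code: it handles only plain-string message content, and both function names are hypothetical.

```python
import copy

MAX_QUERY_CHARS = 400  # cap stated above


def search_query(messages):
    """Return the most recent user message text, truncated to the cap."""
    for message in reversed(messages):
        if message.get("role") == "user" and isinstance(message.get("content"), str):
            return message["content"][:MAX_QUERY_CHARS]
    return None


def inject_results(messages, block):
    """Prepend a results block to the last user turn; other turns are untouched."""
    out = copy.deepcopy(messages)
    for message in reversed(out):
        if message.get("role") == "user" and isinstance(message.get("content"), str):
            message["content"] = f"{block}\n\n{message['content']}"
            break
    return out
```

Running search_query on a conversation before sending it is a quick way to spot follow-ups that would make poor standalone queries.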

The <search_results> block looks like this. It's preceded by a one-line instruction telling the model to treat the block as untrusted external data and cite numbered items inline:

<search_results query="What did Anthropic ship this week?" retrieved="2026-05-21">
Summary: <short LLM-generated synthesis of the result set>

[1] Anthropic launches Opus 4.8
URL: https://www.anthropic.com/news/opus-4-8
<extracted page text, or short snippet if extraction failed — up to ~2000 chars>

[2] ...
</search_results>

Behavior	Detail
Cost	No surcharge today — you pay the model's normal per-token rate; the relay absorbs the search call. The injected `<search_results>` block does count as input tokens, so expect a higher prompt-token bill than the same question without `:online`.
Failure mode	Soft. If Tavily times out or errors, the request continues to the model without search context (you still get an answer, just ungrounded). The only hard failure is `503 search_unavailable` when search isn't configured on the relay at all.
count_tokens	`/v1/messages/count_tokens` strips the suffix but never calls Tavily — the count reflects your original prompt, not the augmented one.
Multi-turn	Only the last user turn is queried & augmented; earlier turns are untouched. To search again, send a new user message with `:online` still on the model.

When to use :online

The relay does a single Tavily call per request and injects the results — it is not an agentic search loop. The model does not decide to re-search based on what it sees, the way Perplexity Sonar or the ChatGPT browse tool do. Plan around that limitation:

Good fit	Bad fit
Time-sensitive facts (news, prices, version numbers, release dates)	Private or pasted code that isn't on the public web — adds prompt noise without grounding
Locating an official doc or announcement	Math, reasoning, translation, creative writing — nothing to ground
Anything you would otherwise verify by Googling	Stable knowledge already in training data ("what is a binary tree")

Phrase the last user message as a standalone search query. The search runs against the literal text of your most recent user turn (capped at 400 chars), so conversational follow-ups like "and what about the latest version?" become useless queries with no context. In a multi-turn chat, restate the topic when you add :online — e.g. "latest version of the Anthropic Python SDK" rather than "the latest one".

For questions that need multi-step synthesis (compare-and-contrast, deep research), break them into multiple turns and add :online to each. The model will read each turn's fresh results; you steer the next query manually. Note that the injected <search_results> block is sent upstream only — it isn't echoed back to your client and isn't preserved into the next request, so if a later turn depends on details from earlier sources, ask the model to summarise them in its visible reply. One-shot research mode is not supported.

Tip Combine with high reasoning effort (reasoning_effort: "high") so the model actually weighs the returned sources rather than leaning on the first result. The injected instruction asks the model to cite numbered sources as [1], [2] inline, so the output will usually carry such citations — though the model isn't strictly bound to that format.
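
Because the citation format is requested rather than guaranteed, a client may want to check which numbered sources an answer actually cites. A minimal sketch, assuming citations appear as bracketed integers such as [1]; cited_sources is a hypothetical helper.

```python
import re


def cited_sources(answer):
    """Return the distinct [n] citation numbers in an answer, in order of first use."""
    seen = []
    for match in re.finditer(r"\[(\d+)\]", answer):
        number = int(match.group(1))
        if number not in seen:
            seen.append(number)
    return seen


print(cited_sources("Opus 4.8 shipped [1]. See also [3] and [1]."))  # prints: [1, 3]
```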

Errors

The envelope depends on which endpoint you called — the relay returns errors in the protocol that matches the caller's SDK, and upstream errors are passed through verbatim.

OpenAI paths (/v1/chat/completions, /v1/responses, /v1/models) — OpenAI envelope:

{ "error": { "message": "...", "type": "...", "code": "..." } }

Anthropic paths (/v1/messages, /v1/messages/count_tokens) — Anthropic envelope:

{ "type": "error", "error": { "type": "...", "message": "..." } }

The Anthropic envelope uses a different shape — no code field, and the discriminator type: "error" is at the top level (with the inner error.type giving the category, e.g. authentication_error, invalid_request_error, permission_error, api_error). Anthropic SDKs already parse this shape; vanilla OpenAI SDK error handlers won't, so call /v1/messages with an Anthropic SDK (or do raw HTTP).
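
For raw-HTTP clients that talk to both protocol families, the two envelopes can be normalised into one shape. The sketch below keys off the top-level type: "error" discriminator described above; the output dictionary layout and the name parse_error are assumptions made for this example.

```python
import json


def parse_error(body):
    """Normalise either error envelope to protocol/type/code/message."""
    data = json.loads(body)
    err = data.get("error") or {}
    if data.get("type") == "error":  # Anthropic: discriminator at the top level
        return {
            "protocol": "anthropic",
            "type": err.get("type"),
            "code": None,  # the Anthropic envelope has no code field
            "message": err.get("message"),
        }
    return {
        "protocol": "openai",
        "type": err.get("type"),
        "code": err.get("code"),
        "message": err.get("message"),
    }
```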

Status codes are the canonical HTTP ones across both protocols:

Status	OpenAI `code` / Anthropic `error.type`	Meaning
401	invalid_api_key / authentication_error	Missing or unknown `sk-cf-...` key.
402	insufficient_balance / permission_error	Account balance is negative. Top up in the console billing tab.
403	key_revoked / permission_error	The key was revoked.
403	account_closed / permission_error	The account isn't enabled for API access — sign-ups after the 2026-06-08 service cutoff don't include API service.
400	model_not_available / invalid_request_error	The `model` you sent isn't in the verified catalogue, or is wrong for the endpoint (e.g. an OpenAI model on `/v1/messages`) — check Available models.
400	unsupported_background_mode / —	`background: true` on /v1/responses — the relay only serves synchronous runs. OpenAI envelope only.
429	rate_limit_exceeded / rate_limit_error	Shared upstream capacity is temporarily throttled. Honour `retry-after` when present, then retry with exponential backoff and jitter.
503	—	No upstream account currently serves the request — usually a transient pool-wide rate-limit window. Retry after a short backoff.
503	search_unavailable / api_error	You used `:online` but web search isn't configured on this relay. See Web search.
502	upstream_unreachable / api_error	Relay couldn't reach the backend. Retry after a short backoff.
500	server_error / api_error	The relay failed before or after contacting the upstream. Retry only if the operation is safe to repeat; otherwise inspect your usage history first.

Retries & reliability

Use bounded retries. The relay is backed by shared upstream capacity, and generation requests are not idempotent.

Retry: 429, 502, 503, and clearly transient 500 responses. Honour retry-after; otherwise use exponential backoff with jitter (for example 1 s, 2 s, 4 s; at most three attempts).
Do not retry unchanged: 400, 401, 402, or 403. Fix the payload, key, balance, or access state first.
Duplicate risk: every successful generation attempt is a separate billable request. SUB&SUB does not currently deduplicate generation POSTs by an idempotency key, so keep an application-level operation ID and avoid retrying after a complete response.
Streaming: an interrupted SSE stream cannot be resumed. Reconnecting starts a new generation and may incur a second charge.
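
The retry rules above might be encoded as follows. This sketch makes several assumptions: "three attempts" is read as three retries after the first call, retry-after is treated as a number of seconds, and the jitter (up to 50% added) is an arbitrary choice. Both function names are hypothetical.

```python
import random

RETRYABLE = {429, 500, 502, 503}  # retry 500 only when repeating is safe
MAX_RETRIES = 3


def should_retry(status, retries_done):
    """True if the status is retryable and the retry budget is not exhausted."""
    return status in RETRYABLE and retries_done < MAX_RETRIES


def delay_seconds(retry_number, retry_after=None, rng=random.random):
    """Seconds to wait before the given 1-based retry.

    Prefers the server's retry-after value; otherwise uses 1 s, 2 s, 4 s
    with up to 50% random jitter added.
    """
    if retry_after is not None:
        return float(retry_after)
    base = 2 ** (retry_number - 1)
    return base * (1 + 0.5 * rng())
```

Since generation POSTs are not deduplicated, a loop built on these helpers should stop as soon as a complete response has been received.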

Pricing & billing

Pay-as-you-go, billed per token in microdollars (1 micro = $0.000001 = 1/10,000 of a cent) so sub-cent requests are tracked accurately. Rates are per 1M tokens, by tier — see the model table for which tier each model maps to.

Tier	Models	Input / 1M	Output / 1M
Mini	gpt-5.4-mini, claude-haiku-4-5-20251001	$0.20	$1.60
Luna	gpt-5.6-luna	$0.30	$2.40
Standard	gpt-5.4, gpt-5.4-2026-03-05, gpt-5.6-terra, claude-sonnet-4-5-20250929, claude-sonnet-4-6	$0.75	$6.00
Premium	gpt-5.5, gpt-5.6-sol	$1.10	$8.80
Sonnet 5 Intro	claude-sonnet-5	$2.00	$10.00
Ultra	claude-opus-4-5-20251101, claude-opus-4-6, claude-opus-4-7, claude-opus-4-8	$5.00	$25.00
Fable	claude-fable-5	$10.00	$50.00

Fable and Ultra rates match Anthropic's published list prices. Sonnet 5 uses Anthropic's introductory $2/$10 rate through August 31, 2026; its published standard price after that date is $3/$15. The other tiers run below upstream rates thanks to the pooled subscription backing.

Reasoning tokens (from reasoning_effort on OpenAI models, or Anthropic's native thinking field on Claude) count as output tokens at the model's tier rate. There is no separate surcharge for high effort, but a deep-thinking request can easily emit 10–50× more output tokens than the same request with effort none, so the cost scales accordingly.

Anthropic prompt-caching bills as a separate line item: cache writes at 1.25× and cache reads at 0.10× the tier's input rate. So a haiku-4.5 cache hit costs 0.20 × 0.10 = $0.02 per 1M tokens, and a sonnet-4.5 cache hit costs 0.75 × 0.10 = $0.075 per 1M tokens. Cache tokens are itemised on every request's billing record — the console shows the breakdown.
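
The arithmetic above can be reproduced with a small estimator. It relies on the fact that a rate of $X per 1M tokens equals X microdollars per token. The rates are copied from the tier table; any rounding the relay applies, and whether cached tokens are reported separately from input tokens, are assumptions here. cost_microdollars is a hypothetical helper.

```python
from decimal import Decimal

# (input, output) in dollars per 1M tokens, i.e. microdollars per token.
RATES = {
    "Mini": (Decimal("0.20"), Decimal("1.60")),
    "Luna": (Decimal("0.30"), Decimal("2.40")),
    "Standard": (Decimal("0.75"), Decimal("6.00")),
    "Premium": (Decimal("1.10"), Decimal("8.80")),
    "Sonnet 5 Intro": (Decimal("2.00"), Decimal("10.00")),
    "Ultra": (Decimal("5.00"), Decimal("25.00")),
    "Fable": (Decimal("10.00"), Decimal("50.00")),
}
CACHE_WRITE = Decimal("1.25")  # multiplier on the input rate
CACHE_READ = Decimal("0.10")   # multiplier on the input rate


def cost_microdollars(tier, input_tokens=0, output_tokens=0,
                      cache_write_tokens=0, cache_read_tokens=0):
    """Estimate a request's cost in microdollars (unrounded)."""
    rate_in, rate_out = RATES[tier]
    return (
        input_tokens * rate_in
        + output_tokens * rate_out
        + cache_write_tokens * rate_in * CACHE_WRITE
        + cache_read_tokens * rate_in * CACHE_READ
    )


# 1M cache-read tokens on the Mini tier: 20,000 microdollars, i.e. $0.02.
print(cost_microdollars("Mini", cache_read_tokens=1_000_000) / 1_000_000)
```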

Balance is deducted in real time as each request returns — for streaming requests, settlement runs after the [DONE] chunk lands. View your live balance and per-request settlements at /console#billing.

Top-up The console supports Stripe Checkout (card, Link, Alipay, WeChat Pay). Credits never expire.

Rate limits

No per-key rate limits today. Shared upstream capacity and provider-side throttling still apply; if you hit those, the relay returns 429 with a retry-after header. Per-key RPM / TPM limits are planned.

Status & support

Signed-in API-enabled accounts can see live provider health under Console → Service Status and operational announcements under System Notice.
For account, billing, privacy, or security help, email [email protected].
When reporting an API failure, include the UTC timestamp, endpoint, model, HTTP status, and the visible API-key prefix. Never send the full API key or prompt content unless support explicitly asks for a redacted reproduction.
The service is best-effort and has no SLA. See the English-only Terms of Service for availability and refund rules, and the Privacy Policy for data handling.

Documentation last reviewed: July 14, 2026.