SUB&SUB · Docs

API Reference

SUB&SUB exposes a multi-provider relay at https://api.subnsub.com/v1. OpenAI clients hit /v1/chat/completions; Anthropic clients hit /v1/messages. The same sk-cf-... key routes both — pick the model in the request body and the relay picks the upstream.

Quick start

Three things you need:

  1. Base URL: https://api.subnsub.com/v1 (OpenAI clients) or https://api.subnsub.com (Anthropic clients — the SDK appends /v1/messages itself)
  2. API key: sk-cf-... issued from the console
  3. Model: one of the 21 verified models — e.g. gpt-5.4-mini or claude-haiku-4.5
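
The three pieces above combine into a single HTTP request. The following is a minimal sketch using only the Python standard library; the helper name build_chat_request is hypothetical, sk-cf-xxx is a placeholder key, and nothing is sent until the request is passed to urllib.request.urlopen.

```python
import json
import urllib.request

BASE_URL = "https://api.subnsub.com/v1"  # OpenAI-style base URL from the list above


def build_chat_request(api_key: str, model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a POST to /chat/completions. Hypothetical helper."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_chat_request("sk-cf-xxx", "gpt-5.4-mini", "Hello")
print(req.full_url)
# To actually send it: urllib.request.urlopen(req)
```

An OpenAI SDK configured with the same base URL and key should produce an equivalent request.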

Authentication

Every request must carry an Authorization: Bearer sk-cf-... header. Keys are issued from the console and stored as SHA-256 hashes — once you leave the creation screen, the plaintext is gone forever, so save it immediately.

Tip: Generate one key per integration (chatbot, IDE plugin, batch job). Revoking a leaked key in the console takes effect within seconds.
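
Storing only a hash is the reason the plaintext cannot be shown again later. The sketch below illustrates the idea under assumptions: it uses an unsalted SHA-256 hex digest and hypothetical helper names, and the relay's actual storage format is not documented here.

```python
import hashlib
import hmac


def hash_key(plaintext_key: str) -> str:
    """Return the SHA-256 hex digest that would be stored instead of the key."""
    return hashlib.sha256(plaintext_key.encode("utf-8")).hexdigest()


def key_matches(presented_key: str, stored_hash: str) -> bool:
    """Compare a presented key against a stored hash in constant time."""
    return hmac.compare_digest(hash_key(presented_key), stored_hash)


stored = hash_key("sk-cf-example")             # what a server could keep at creation time
print(key_matches("sk-cf-example", stored))    # True
print(key_matches("sk-cf-other", stored))      # False
```

Because SHA-256 is one-way, the stored digest can verify a key but cannot be turned back into it.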

Endpoints

POST /v1/chat/completions

Send a chat completion request. Request shape matches the OpenAI Chat Completions API — the OpenAI SDKs work unmodified.

| Parameter | Type | Description |
| --- | --- | --- |
| model | string | One of the verified model IDs. |
| messages | array | Conversation history. Each item: {role, content} with role system / user / assistant. |
| stream | boolean | If true, response is sent as SSE chunks. See Streaming. |
| stream_options | object | Optional. The relay always forces {include_usage: true} upstream so the final chunk carries the token-usage block; overriding it has no effect. |
| max_tokens | integer | Cap completion length. Defaults to the model's maximum. |
| temperature | number | 0 to 2. Higher = more random. |

POST /v1/messages

Anthropic-native endpoint for the claude-* models — the Anthropic SDK (anthropic-sdk-python, @anthropic-ai/sdk, claude-code) works unmodified against this path. Point your base URL at https://api.subnsub.com and authenticate via the x-api-key header (the Authorization-Bearer form works too, if your client prefers it).

| Parameter | Type | Description |
| --- | --- | --- |
| model | string | A claude-* model ID (see Available models). Passing an OpenAI model here returns 400 invalid_request_error. |
| max_tokens | integer | Required by Anthropic. Caps the assistant reply length. |
| messages | array | Conversation history, Anthropic shape: {role, content} with role user / assistant. |
| stream | boolean | If true, returns the standard Anthropic SSE event sequence: message_start, content_block_delta, message_delta, message_stop. |
| thinking | object | Forwarded verbatim. Pair with a -thinking model variant to enable extended thinking. |
| cache_control | object | Prompt caching is supported. Cache-write tokens bill at 1.25× and cache-read tokens at 0.10× the tier's input rate. |

Heads-up: Each Claude request carries a ~4100-token Kiro system prompt upstream, billed as ordinary input tokens at the model's tier rate. Short prompts that would otherwise be cheap still incur this floor.
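
For clients that use raw HTTP instead of an Anthropic SDK, the request can be assembled roughly as follows. This is a sketch under assumptions: build_messages_request is a hypothetical helper, the key is a placeholder, and the anthropic-version header is what Anthropic's own API expects (whether the relay requires it is not stated here).

```python
import json
import urllib.request

ANTHROPIC_BASE_URL = "https://api.subnsub.com"  # the /v1/messages path is appended below


def build_messages_request(api_key: str, model: str, prompt: str,
                           max_tokens: int = 256) -> urllib.request.Request:
    """Build (but do not send) a POST to /v1/messages. Hypothetical helper."""
    if not model.startswith("claude-"):
        # The relay documents a 400 for non-Claude models; fail early client-side.
        raise ValueError("/v1/messages only accepts claude-* model IDs")
    body = {
        "model": model,
        "max_tokens": max_tokens,  # required on this endpoint
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{ANTHROPIC_BASE_URL}/v1/messages",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "x-api-key": api_key,
            "anthropic-version": "2023-06-01",  # assumption: mirrors Anthropic's API
            "content-type": "application/json",
        },
        method="POST",
    )


req = build_messages_request("sk-cf-xxx", "claude-haiku-4.5", "Hello")
print(req.full_url)
```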

GET /v1/models

List the models you can actually call. The relay merges the OpenAI and Anthropic catalogues from sub2api and filters down to the 21 we have end-to-end-tested against the current account pool — phantom IDs that 503 at routing time, or upstream-400 on first token, are hidden. If every configured upstream is unreachable the endpoint returns 502 models_unreachable rather than a misleading empty list.

# sample response (truncated)
{
  "object": "list",
  "data": [
    { "id": "gpt-5.4-mini",      "type": "model", ... },
    { "id": "gpt-5.4",           "type": "model", ... },
    { "id": "claude-sonnet-4.5", "type": "model", ... },
    { "id": "claude-haiku-4.5",  "type": "model", ... },
    ...
  ]
}
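
A client usually needs only the id field of each entry. The sketch below pulls the IDs out of a body shaped like the sample above; the payload is a hand-written stand-in rather than real relay output, and the helper names are hypothetical.

```python
import json

# Stand-in body with the same shape as the sample response (extra fields omitted).
sample_body = """
{
  "object": "list",
  "data": [
    {"id": "gpt-5.4-mini", "type": "model"},
    {"id": "claude-haiku-4.5", "type": "model"}
  ]
}
"""


def model_ids(body: str) -> list[str]:
    """Return every model ID listed in a /v1/models response body."""
    payload = json.loads(body)
    return [entry["id"] for entry in payload.get("data", [])]


def claude_only(ids: list[str]) -> list[str]:
    """Keep the IDs that are valid on /v1/messages (claude-* models)."""
    return [model for model in ids if model.startswith("claude-")]


ids = model_ids(sample_body)
print(ids)               # ['gpt-5.4-mini', 'claude-haiku-4.5']
print(claude_only(ids))  # ['claude-haiku-4.5']
```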

Available models

Two upstream families. The 7 OpenAI models route to shared ChatGPT-tier accounts; the 14 Claude models route through a Kiro reverse proxy onto AWS CodeWhisperer. Per-token rates depend on the tier (see Pricing) — the same key works for both.

OpenAI

| Model ID | Family | Tier | Notes |
| --- | --- | --- | --- |
| gpt-5.4-mini | GPT-5.4 | Mini | Fast & cheap. Recommended default for chat & coding. |
| gpt-5.3-codex | Codex | Mini | Coding-tuned 5.3. Same price as mini. |
| gpt-5.2 | GPT-5.2 | Standard | Stable 5.2. |
| gpt-5.2-chat-latest | GPT-5.2 | Standard | Auto-tracks latest 5.2 chat tune (currently maps upstream to gpt-5.2). |
| gpt-5.4 | GPT-5.4 | Standard | Full-size GPT-5.4: slower, stronger reasoning. |
| gpt-5.4-2026-03-05 | GPT-5.4 | Standard | Date-stamped snapshot of gpt-5.4. |
| gpt-5.5 | GPT-5.5 | Premium | Newer flagship. |

Anthropic

| Model ID | Family | Tier | Notes |
| --- | --- | --- | --- |
| claude-haiku-4.5 | Haiku 4.5 | Mini | Smallest Claude; same per-token rate as gpt-5.4-mini. |
| claude-haiku-4.5-thinking | Haiku 4.5 | Mini | Extended-thinking variant of haiku-4.5. Pair with the thinking request field. |
| claude-sonnet-4.5 | Sonnet 4.5 | Standard | Mid-tier Claude; same per-token rate as gpt-5.4. |
| claude-sonnet-4.5-thinking | Sonnet 4.5 | Standard | Extended-thinking variant of sonnet-4.5. |
| claude-sonnet-4.6 | Sonnet 4.6 | Standard | Newer Sonnet tune. Standard tier, same rate as sonnet-4.5. |
| claude-sonnet-4.6-thinking | Sonnet 4.6 | Standard | Extended-thinking variant of sonnet-4.6. |
| claude-opus-4.5 | Opus 4.5 | Ultra | Frontier Claude. Billed at Anthropic's list price, no margin (see Pricing). |
| claude-opus-4.5-thinking | Opus 4.5 | Ultra | Extended-thinking variant of opus-4.5 (adaptive thinking). |
| claude-opus-4.6 | Opus 4.6 | Ultra | Newer Opus tune. |
| claude-opus-4.6-thinking | Opus 4.6 | Ultra | Extended-thinking variant of opus-4.6. |
| claude-opus-4.7 | Opus 4.7 | Ultra | Previous Opus snapshot. |
| claude-opus-4.7-thinking | Opus 4.7 | Ultra | Extended-thinking variant of opus-4.7. |
| claude-opus-4.8 | Opus 4.8 | Ultra | Latest Opus snapshot. |
| claude-opus-4.8-thinking | Opus 4.8 | Ultra | Extended-thinking variant of opus-4.8. |

Heads-up: Each Claude request carries a ~4100-token Kiro system prompt as input, which is included in your token bill. Anthropic prompt caching is supported: cache writes bill at 1.25× and reads at 0.10× the tier's input rate (see Pricing).

Not available: Pro variants (gpt-5.2-pro, gpt-5.2-pro-2025-12-11), gpt-4o, gpt-4o-mini, gpt-5, gpt-5-mini, image / audio / realtime variants, claude-haiku-4-6, and dated upstream IDs (e.g. claude-sonnet-4-5-20250929). Calling them returns 400 model_not_available. Pro models are off the menu because the underlying social-tier accounts in the pool would exhaust their tiny quotas in a handful of requests.

Reasoning effort

Every OpenAI model above is a reasoning model — the backend can spend more or fewer "thinking" tokens before emitting visible output. Set reasoning_effort on the OpenAI /v1/chat/completions request body to control the budget. For Claude, use the Anthropic-native thinking request field (or pick a -thinking model variant) — see the /v1/messages section. The OpenAI models accept the same five effort values:

| Value | Behavior |
| --- | --- |
| none | No thinking; straight to the answer. Cheapest and fastest. |
| low | A short reasoning pass. |
| medium | Default if you don't pass the field. Balanced. |
| high | Deeper reasoning. Recommended for non-trivial coding / multi-step problems. |
| xhigh | Maximum effort. Slowest and most expensive; reserve for hard analysis where you genuinely need it. |

# Two equivalent forms — pick whichever your SDK supports
{
  "model": "gpt-5.4-mini",
  "reasoning_effort": "high",
  "messages": [ ... ]
}

{
  "model": "gpt-5.5",
  "reasoning": { "effort": "xhigh" },
  "messages": [ ... ]
}

Cost: Thinking tokens count as output tokens for billing, so higher effort = more output tokens = a bigger bill on the same prompt. The per-token rate doesn't change.

Heads-up: The OpenAI protocol also defines 'minimal', but our pool's models reject it: "'minimal' is not supported with this model". Stick to the five values above.
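
Since 'minimal' is rejected upstream, it can be cheaper to validate the value client-side before sending. The following is a small sketch; with_reasoning_effort is a hypothetical helper, and it uses the top-level reasoning_effort form shown above.

```python
# The five values accepted by the OpenAI models, per the table above.
ALLOWED_EFFORTS = ("none", "low", "medium", "high", "xhigh")


def with_reasoning_effort(body: dict, effort: str) -> dict:
    """Return a copy of a chat-completions body with reasoning_effort set."""
    if effort not in ALLOWED_EFFORTS:
        raise ValueError(
            f"unsupported reasoning_effort {effort!r}; use one of {ALLOWED_EFFORTS}"
        )
    updated = dict(body)  # shallow copy so the caller's dict is left untouched
    updated["reasoning_effort"] = effort
    return updated


base = {"model": "gpt-5.4-mini", "messages": [{"role": "user", "content": "Hi"}]}
request_body = with_reasoning_effort(base, "high")
print(request_body["reasoning_effort"])  # high
```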

Streaming

Set "stream": true to receive Server-Sent Events. The final chunk carries a usage block (we force stream_options.include_usage upstream so token counts are always emitted), then a literal data: [DONE] closes the stream.

# Streaming format (line by line)
data: {"id":"resp_...","choices":[{"delta":{"content":"Hi"}}]}

data: {"id":"resp_...","choices":[{"delta":{"content":"!"}}]}

data: {"id":"resp_...","choices":[],"usage":{"prompt_tokens":18,"completion_tokens":11,"total_tokens":29}}

data: [DONE]
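
A client that reads the stream by hand has to do three things: concatenate the delta.content pieces, pick up the usage block from the final chunk, and stop at [DONE]. The sketch below does that over a list of already-received lines; collect_stream is a hypothetical helper and the sample lines mirror the format shown above.

```python
import json
from typing import Iterable, Optional, Tuple


def collect_stream(lines: Iterable[str]) -> Tuple[str, Optional[dict]]:
    """Return (full_text, usage) from OpenAI-style SSE lines."""
    text_parts = []
    usage = None
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # blank separators and non-data fields carry no payload
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # literal end-of-stream marker
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            piece = choice.get("delta", {}).get("content")
            if piece:
                text_parts.append(piece)
        if chunk.get("usage"):
            usage = chunk["usage"]  # arrives on the final chunk, with empty choices
    return "".join(text_parts), usage


sample = [
    'data: {"id":"resp_1","choices":[{"delta":{"content":"Hi"}}]}',
    "",
    'data: {"id":"resp_1","choices":[{"delta":{"content":"!"}}]}',
    "",
    'data: {"id":"resp_1","choices":[],"usage":{"prompt_tokens":18,"completion_tokens":11,"total_tokens":29}}',
    "",
    "data: [DONE]",
]
text, usage = collect_stream(sample)
print(text)                   # Hi!
print(usage["total_tokens"])  # 29
```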

Web search

Append :online to any model ID supported by the endpoint and the relay will run a web search before forwarding to the model, prepending the results to the conversation so the answer is grounded in fresh data. The suffix works on /v1/chat/completions and /v1/messages (the latter still requires a claude-* base); no search-specific request fields are required.

# Same call as before — just :online on the model
curl https://api.subnsub.com/v1/chat/completions \
  -H "Authorization: Bearer sk-cf-xxx" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-5.4-mini:online",
    "messages": [
      {"role": "user", "content": "What did Anthropic ship this week?"}
    ]
  }'

How it works: the relay strips :online, takes the most recent user message as the query (capped at 400 characters), calls Tavily for up to 3 results with extracted page text when available, plus an optional Tavily-generated summary, then prepends them to that same user turn as a clearly-delimited <search_results> block before sending the request upstream. The search call has an 8-second timeout. Results are deliberately injected into the user role — never the system prompt — so untrusted snippets can't be elevated to system-priority instructions.
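
The first two steps (stripping the suffix and choosing the query) can be sketched as below. This is an illustration of the described behaviour, not the relay's code: the function names are hypothetical, and it assumes message content is a plain string.

```python
from typing import Optional

ONLINE_SUFFIX = ":online"
MAX_QUERY_CHARS = 400  # query cap stated above


def split_online(model: str) -> tuple[str, bool]:
    """Return (base_model, wants_search) for a possibly suffixed model ID."""
    if model.endswith(ONLINE_SUFFIX):
        return model[: -len(ONLINE_SUFFIX)], True
    return model, False


def last_user_query(messages: list[dict]) -> Optional[str]:
    """Use the most recent user turn, truncated to the cap, as the search query."""
    for message in reversed(messages):
        if message.get("role") == "user":
            content = message.get("content")
            # Assumption: only plain-string content is handled in this sketch.
            return content[:MAX_QUERY_CHARS] if isinstance(content, str) else None
    return None


model, online = split_online("gpt-5.4-mini:online")
query = last_user_query([
    {"role": "user", "content": "Tell me about SDKs"},
    {"role": "assistant", "content": "Sure."},
    {"role": "user", "content": "latest version of the Anthropic Python SDK"},
])
print(model, online)  # gpt-5.4-mini True
print(query)          # latest version of the Anthropic Python SDK
```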

The <search_results> block looks like this. It's preceded by a one-line instruction telling the model to treat the block as untrusted external data and cite numbered items inline:

<search_results query="What did Anthropic ship this week?" retrieved="2026-05-21">
Summary: <short LLM-generated synthesis of the result set>

[1] Anthropic launches Opus 4.8
URL: https://www.anthropic.com/news/opus-4-8
<extracted page text, or short snippet if extraction failed — up to ~2000 chars>

[2] ...
</search_results>
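
To make the layout concrete, the sketch below renders a block of this shape and prepends it to the last user turn. The helper names and the result fields (title, url, text) are hypothetical, the exact whitespace the relay emits is an assumption, and the one-line instruction that precedes the block is omitted.

```python
from xml.sax.saxutils import quoteattr


def render_search_block(query, retrieved, results, summary=None):
    """Render up to 3 results in the <search_results> layout shown above."""
    lines = [f"<search_results query={quoteattr(query)} retrieved={quoteattr(retrieved)}>"]
    if summary:
        lines += [f"Summary: {summary}", ""]
    for index, item in enumerate(results[:3], start=1):
        lines += [
            f"[{index}] {item['title']}",
            f"URL: {item['url']}",
            item["text"][:2000],  # page text or snippet, roughly capped
            "",
        ]
    if lines[-1] == "":
        lines.pop()  # no blank line directly before the closing tag
    lines.append("</search_results>")
    return "\n".join(lines)


def prepend_to_last_user_turn(messages, block):
    """Return a copy of messages with the block injected into the last user turn."""
    updated = [dict(message) for message in messages]
    for message in reversed(updated):
        if message.get("role") == "user":
            message["content"] = f"{block}\n\n{message['content']}"
            break
    return updated


block = render_search_block(
    "What did Anthropic ship this week?",
    "2026-05-21",
    [{"title": "Example title", "url": "https://example.com/post", "text": "Example text"}],
    summary="Example summary",
)
original = [{"role": "user", "content": "What did Anthropic ship this week?"}]
augmented = prepend_to_last_user_turn(original, block)
print(augmented[0]["content"])
```

The role of the augmented message stays user, which matches the point above about never promoting untrusted snippets into the system prompt.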

| Behavior | Detail |
| --- | --- |
| Cost | No surcharge today: you pay the model's normal per-token rate and the relay absorbs the search call. The injected <search_results> block does count as input tokens, so expect a higher prompt-token bill than the same question without :online. |
| Failure mode | Soft. If Tavily times out or errors, the request continues to the model without search context (you still get an answer, just ungrounded). The only hard failure is 503 search_unavailable when search isn't configured on the relay at all. |
| count_tokens | /v1/messages/count_tokens strips the suffix but never calls Tavily; the count reflects your original prompt, not the augmented one. |
| Multi-turn | Only the last user turn is queried & augmented; earlier turns are untouched. To search again, send a new user message with :online still on the model. |

When to use :online

The relay does a single Tavily call per request and injects the results — it is not an agentic search loop. The model does not decide to re-search based on what it sees, the way Perplexity Sonar or the ChatGPT browse tool do. Plan around that limitation:

| Good fit | Bad fit |
| --- | --- |
| Time-sensitive facts (news, prices, version numbers, release dates) | Private or pasted code that isn't on the public web (adds prompt noise without grounding) |
| Locating an official doc or announcement | Math, reasoning, translation, creative writing (nothing to ground) |
| Anything you would otherwise verify by Googling | Stable knowledge already in training data ("what is a binary tree") |

Phrase the last user message as a standalone search query. The search runs against the literal text of your most recent user turn (capped at 400 chars), so conversational follow-ups like "and what about the latest version?" become useless queries with no context. In a multi-turn chat, restate the topic when you add :online — e.g. "latest version of the Anthropic Python SDK" rather than "the latest one".

For questions that need multi-step synthesis (compare-and-contrast, deep research), break them into multiple turns and add :online to each. The model will read each turn's fresh results; you steer the next query manually. Note that the injected <search_results> block is sent upstream only — it isn't echoed back to your client and isn't preserved into the next request, so if a later turn depends on details from earlier sources, ask the model to summarise them in its visible reply. One-shot research mode is not supported.

Tip: Combine with high reasoning effort (reasoning_effort: "high") so the model actually weighs the returned sources rather than leaning on the first result. The injected instruction asks the model to cite numbered sources as [1], [2] inline, so the output will usually carry such citations, though the model isn't strictly bound to that format.

Errors

The envelope depends on which endpoint you called — the relay returns errors in the protocol that matches the caller's SDK, and upstream errors are passed through verbatim.

OpenAI paths (/v1/chat/completions, /v1/responses, /v1/models) — OpenAI envelope:

{ "error": { "message": "...", "type": "...", "code": "..." } }

Anthropic paths (/v1/messages, /v1/messages/count_tokens) — Anthropic envelope:

{ "type": "error", "error": { "type": "...", "message": "..." } }

The Anthropic envelope uses a different shape — no code field, and the discriminator type: "error" is at the top level (with the inner error.type giving the category, e.g. authentication_error, invalid_request_error, permission_error, api_error). Anthropic SDKs already parse this shape; vanilla OpenAI SDK error handlers won't, so call /v1/messages with an Anthropic SDK (or do raw HTTP).
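
Code that talks to both endpoints over raw HTTP can normalise the two envelopes into one shape. The following is a minimal sketch; parse_error and the keys of the returned dict are hypothetical choices, not part of the relay's API.

```python
import json


def parse_error(body: str) -> dict:
    """Normalise an OpenAI-style or Anthropic-style error body."""
    payload = json.loads(body)
    error = payload.get("error", {})
    if payload.get("type") == "error":
        # Anthropic envelope: top-level discriminator, no code field.
        return {
            "protocol": "anthropic",
            "kind": error.get("type"),
            "code": None,
            "message": error.get("message"),
        }
    # OpenAI envelope: everything lives under "error".
    return {
        "protocol": "openai",
        "kind": error.get("type"),
        "code": error.get("code"),
        "message": error.get("message"),
    }


openai_err = parse_error(
    '{"error": {"message": "bad key", "type": "auth", "code": "invalid_api_key"}}'
)
anthropic_err = parse_error(
    '{"type": "error", "error": {"type": "authentication_error", "message": "bad key"}}'
)
print(openai_err["code"])     # invalid_api_key
print(anthropic_err["kind"])  # authentication_error
```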

Status codes are the canonical HTTP ones across both protocols:

| Status | OpenAI code / Anthropic error.type | Meaning |
| --- | --- | --- |
| 401 | invalid_api_key / authentication_error | Missing or unknown sk-cf-... key. |
| 402 | insufficient_balance / permission_error | Account balance is negative. Top up in the console billing tab. |
| 403 | key_revoked / permission_error | The key was revoked. |
| 400 | model_not_available / invalid_request_error | The model you sent isn't in the verified catalogue, or is wrong for the endpoint (e.g. an OpenAI model on /v1/messages). Check Available models. |
| 503 |  | No upstream account currently serves the request; usually a pool-wide rate-limit window, not a config issue. |
| 503 | search_unavailable / api_error | You used :online but web search isn't configured on this relay. See Web search. |
| 502 | upstream_unreachable / api_error | Relay couldn't reach the backend. Retry after a short backoff. |

Pricing & billing

Pay-as-you-go, billed per token in microdollars (1 micro = $0.000001 = 1/10,000 of a cent) so sub-cent requests are tracked accurately. Rates are per 1M tokens, by tier — see the model table for which tier each model maps to.

| Tier | Models | Input / 1M | Output / 1M |
| --- | --- | --- | --- |
| Mini | gpt-5.4-mini, gpt-5.3-codex, claude-haiku-4.5, claude-haiku-4.5-thinking | $0.20 | $1.60 |
| Standard | gpt-5.2, gpt-5.2-chat-latest, gpt-5.4, gpt-5.4-2026-03-05, claude-sonnet-4.5, claude-sonnet-4.5-thinking, claude-sonnet-4.6, claude-sonnet-4.6-thinking | $0.75 | $6.00 |
| Premium | gpt-5.5 | $1.10 | $8.80 |
| Ultra | claude-opus-4.5, claude-opus-4.5-thinking, claude-opus-4.6, claude-opus-4.6-thinking, claude-opus-4.7, claude-opus-4.7-thinking, claude-opus-4.8, claude-opus-4.8-thinking | $5.00 | $25.00 |

Ultra-tier rates match Anthropic's published Opus list price — straight pass-through, no margin. The other tiers run below their upstream rates thanks to the pooled subscription backing.
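
A rate of $X per 1M tokens is the same as X microdollars per token, which makes request costs easy to estimate. The sketch below applies the tier rates from the table; cost_microdollars is a hypothetical helper, and it returns an exact fraction because the relay's rounding rule is not documented here.

```python
from fractions import Fraction

# USD per 1M tokens (input, output), copied from the tier table above.
TIER_RATES = {
    "Mini": ("0.20", "1.60"),
    "Standard": ("0.75", "6.00"),
    "Premium": ("1.10", "8.80"),
    "Ultra": ("5.00", "25.00"),
}


def cost_microdollars(tier: str, input_tokens: int, output_tokens: int) -> Fraction:
    """Exact cost in microdollars, before any rounding the relay may apply."""
    input_rate, output_rate = (Fraction(rate) for rate in TIER_RATES[tier])
    # $X per 1M tokens == X microdollars per token.
    return input_rate * input_tokens + output_rate * output_tokens


# The usage block from the Streaming sample: 18 prompt + 11 completion tokens.
print(float(cost_microdollars("Mini", 18, 11)))  # 21.2
# Input-side floor from the ~4100-token system prompt on a Mini-tier Claude model.
print(cost_microdollars("Mini", 4100, 0))        # 820
```

At these rates the ~4100-token floor works out to about 820 microdollars ($0.00082) per Mini-tier Claude request and 20,500 microdollars ($0.0205) per Ultra-tier request.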

Reasoning tokens (when you set reasoning_effort on OpenAI, or use a Claude -thinking variant) count as output tokens at the model's tier rate — there's no separate surcharge for high effort, but a deep-thinking request can easily emit 10–50× more output tokens than a no-effort one, so the dollar bill scales with it.

Anthropic prompt-caching bills as a separate line item: cache writes at 1.25× and cache reads at 0.10× the tier's input rate. So a haiku-4.5 cache hit costs 0.20 × 0.10 = $0.02 per 1M tokens, and a sonnet-4.5 cache hit costs 0.75 × 0.10 = $0.075 per 1M tokens. Cache columns are recorded on each settle row so the console can show the breakdown.
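
The multipliers can be checked with a few lines of arithmetic. The sketch below reproduces the two cache-hit figures above; the function names are hypothetical, and exact fractions are used so no floating-point error creeps in.

```python
from fractions import Fraction

CACHE_WRITE_MULTIPLIER = Fraction("1.25")
CACHE_READ_MULTIPLIER = Fraction("0.10")


def cache_rates(input_rate_usd_per_million: str) -> tuple[Fraction, Fraction]:
    """Return (write_rate, read_rate) in USD per 1M tokens for a tier input rate."""
    base = Fraction(input_rate_usd_per_million)
    return base * CACHE_WRITE_MULTIPLIER, base * CACHE_READ_MULTIPLIER


def cache_read_cost_microdollars(input_rate_usd_per_million: str, tokens: int) -> Fraction:
    """Cost of reading `tokens` cached tokens ($X per 1M == X microdollars per token)."""
    _, read_rate = cache_rates(input_rate_usd_per_million)
    return read_rate * tokens


haiku_write, haiku_read = cache_rates("0.20")    # Mini tier input rate
sonnet_write, sonnet_read = cache_rates("0.75")  # Standard tier input rate
print(float(haiku_write), float(haiku_read))     # 0.25 0.02
print(float(sonnet_write), float(sonnet_read))   # 0.9375 0.075
```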

Balance is deducted in real time as each request returns — for streaming requests, settlement runs after the [DONE] chunk lands. View your live balance and per-request settlements at /console#billing.

Top-up: The console supports Stripe Checkout with card, Link, Alipay, and WeChat Pay (whatever is enabled on the Stripe account). Credits never expire.

Rate limits

No per-key rate limits today. The upstream account pool's concurrency limits and OpenAI's server-side throttling still apply; if you hit those, the relay returns 429 with a retry-after header. Per-key RPM / TPM limits will land post-MVP.
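
A client can honour the retry-after header on 429 and fall back to exponential backoff otherwise. The following is a sketch under assumptions: retry_delay is a hypothetical helper, it treats retry-after as a number of seconds, and it also retries the 502 that the Errors table says to retry after a short backoff.

```python
from typing import Mapping, Optional

RETRYABLE_STATUSES = {429, 502}


def retry_delay(status: int, headers: Mapping[str, str], attempt: int,
                base: float = 1.0, cap: float = 30.0) -> Optional[float]:
    """Seconds to wait before retrying, or None if the status is not retryable."""
    if status not in RETRYABLE_STATUSES:
        return None
    lowered = {name.lower(): value for name, value in headers.items()}
    hinted = lowered.get("retry-after")
    if status == 429 and hinted is not None:
        try:
            return max(0.0, float(hinted))  # assumption: value is in seconds
        except ValueError:
            pass  # not numeric (e.g. an HTTP date); fall back to backoff
    return min(cap, base * (2 ** attempt))  # 1s, 2s, 4s, ... capped


print(retry_delay(429, {"Retry-After": "7"}, attempt=0))  # 7.0
print(retry_delay(502, {}, attempt=2))                    # 4.0
print(retry_delay(401, {}, attempt=0))                    # None
```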