Sending every request to a frontier model is the simplest setup you can ship and the most expensive one you can run. A request that summarizes a paragraph or sorts a ticket into one of five buckets costs the same per token as a multi-step reasoning task, because the model doesn't know the easy ones were easy. When easy requests dominate your traffic, you pay the frontier rate on work that never needed it.
The fix is a thin router in front of your calls. It classifies each request's difficulty, picks a model tier, and sends everything through the AI Gateway endpoint, so swapping models is a string edit and all metrics still land in a single dashboard. By the end, you will have a routeTier() router, a model-agnostic generate() call, a difficulty classifier, escalation that triggers only when a cheap answer fails a verifiable check, per-tier budget caps, and a way to watch spend so you can tune the thresholds.
Before you begin, ensure you have:
- Node.js 20+
- A Next.js 16 project
- A Vercel account
Cost-aware routing has three moving parts, and only one of them is the logic you write.
- The AI SDK comes first. You pass it a model identifier in creator/model format, like openai/gpt-5.4-mini, and it automatically routes the call through AI Gateway. There is no per-provider SDK to install and no base URL to juggle, because the string itself is the routing instruction.
- The classifier is the part you own. It maps each incoming request to a tier, either a fast, cheap model or a frontier model, using signals you choose. It can be a few lines of heuristics or a small model call, and it returns a tier plus a confidence score.
- AI Gateway sits underneath both. It centralizes authentication, provider routing, spend budgets, and observability behind one endpoint, with automatic retries and failover when a provider is slow or down. Because the Gateway handles all of that, your router has exactly one job: to decide which model string to pass.
AI Gateway authenticates requests using Vercel OIDC tokens, which Vercel generates and links to your project automatically. Use the Vercel CLI to link the project and pull environment variables by running vercel link followed by vercel env pull.
These commands add the VERCEL_OIDC_TOKEN to .env.local, which the SDK automatically detects and uses for authentication.
You can use any model by specifying its identifier in the model option. To test the setup, try a streamText() call with a short prompt and confirm that text streams back.
A TIERS config object is the single place model choices live, which is what keeps the rest of the code model-agnostic. Every other function in the router reads from this object instead of hard-coding a model string, so changing a model is a change in one file.
Each tier gets a primary model and a fallback from a different provider, which is what the router falls back to when a primary call fails.
Pick models by cost per token and latency for the job the tier does. The fast tier wants something quick and inexpensive that handles summarization, classification, and extraction well. The frontier tier wants the strongest reasoning you are willing to pay for. You can see the current per-model prices in the AI Gateway model catalog, which is the source of truth as prices move.
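As a rough sketch, the config described above could look like the following. Every model identifier except openai/gpt-5.4-mini is a placeholder chosen for illustration, and the providerOf() and assertCrossProviderFallbacks() helpers are hypothetical names. Check every identifier against the model catalog before using it.

```typescript
// tiers.ts: the single place model choices live.
// Model IDs are placeholders in creator/model format; verify each one
// against the AI Gateway model catalog before relying on it.
export type Tier = "fast" | "frontier";

export interface TierConfig {
  primary: string; // first choice for this tier
  fallback: string; // second choice, from a different provider
}

export const TIERS: Record<Tier, TierConfig> = {
  fast: {
    primary: "openai/gpt-5.4-mini",
    fallback: "anthropic/claude-haiku-4.5", // placeholder
  },
  frontier: {
    primary: "anthropic/claude-sonnet-4.5", // placeholder
    fallback: "openai/gpt-5.4", // placeholder
  },
};

// Returns the creator segment of a creator/model identifier.
export function providerOf(modelId: string): string {
  const [creator = ""] = modelId.split("/");
  return creator;
}

// Throws if any tier's fallback sits on the same provider as its primary,
// since that fallback would not survive a provider-wide outage.
export function assertCrossProviderFallbacks(
  tiers: Record<Tier, TierConfig>,
): void {
  for (const [name, cfg] of Object.entries(tiers)) {
    if (providerOf(cfg.primary) === providerOf(cfg.fallback)) {
      throw new Error(`Tier "${name}" falls back to the same provider`);
    }
  }
}
```

Running assertCrossProviderFallbacks(TIERS) once at startup, or in a unit test, catches a config edit that accidentally puts both legs of a tier on one provider.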
The classifier decides which tier a request goes to, and the cheapest classifier is the one that never makes an extra model call. There are two ways to build it, and they sit at different points on the latency-versus-accuracy curve. Start with the heuristic, because it costs nothing, and reach for a model call only where the heuristic is wrong often enough to matter.
A heuristic classifier reads cheap signals off the request itself and returns a tier. Input length, the presence of fenced code, and intent keywords get you a long way, and because it is plain code, it adds no latency and no token cost.
Treat the thresholds, the confidence numbers, and the keyword list as starting placeholders, not defaults that work out of the box. A fenced-code check still misses inline snippets, a keyword list misses every phrasing you didn't anticipate, and a length cutoff is a proxy at best. Tune all of them against your own traffic and the misroute rates you observe, because the right values depend on what your requests actually look like.
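A heuristic along these lines is one possible starting point. The cutoffs, confidence values, and keyword list below are all invented placeholders, and classifyHeuristic() is a hypothetical name.

```typescript
// classify-heuristic.ts: a zero-cost classifier built from cheap request signals.
export type Tier = "fast" | "frontier";

export interface Classification {
  tier: Tier;
  confidence: number; // 0 to 1, higher means the heuristic is more certain
}

// Placeholder values: tune all of these against your own traffic.
const LONG_INPUT_CHARS = 4000;
const SHORT_INPUT_CHARS = 600;
const HARD_KEYWORDS = ["prove", "debug", "refactor", "architecture", "step by step"];
const CODE_FENCE = "`".repeat(3); // three backticks, the fenced-code marker

export function classifyHeuristic(input: string): Classification {
  const text = input.toLowerCase();
  const hasFencedCode = input.includes(CODE_FENCE);
  const hasHardKeyword = HARD_KEYWORDS.some((keyword) => text.includes(keyword));

  // Strong signals: fenced code or an intent keyword points at frontier.
  if (hasFencedCode || hasHardKeyword) {
    return { tier: "frontier", confidence: 0.85 };
  }
  // Very long inputs lean frontier, but length alone is a weak proxy.
  if (input.length >= LONG_INPUT_CHARS) {
    return { tier: "frontier", confidence: 0.7 };
  }
  // Short inputs with no hard signal are the clear fast-tier case.
  if (input.length <= SHORT_INPUT_CHARS) {
    return { tier: "fast", confidence: 0.9 };
  }
  // Middling length and no clear signal: guess fast, but flag the uncertainty.
  return { tier: "fast", confidence: 0.5 };
}
```

The substring match is deliberately naive: "prove" also matches "improve", which is exactly the kind of misroute to look for when you tune the list.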
The confidence score matters as much as the tier. A request the heuristic is sure about routes directly. A request near the boundary, with middling length and no clear signal, comes back with low confidence, and that is the request you hand to the second variant, the model-call classifier.
When the heuristic returns low confidence, a small and fast model can read the request and classify it directly. This costs one extra call, so reserve it for the ambiguous cases the heuristic flagged rather than running it on every request.
The heuristic path adds zero latency and zero cost. The model-call path adds one round trip, so it only pays for itself when it stops you from misrouting an expensive request to the wrong tier. Run the heuristic first, then escalate to the model classifier only on low confidence, so the extra call fires on ambiguous requests rather than on every one.
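The heuristic-first ordering can be sketched as below. Both classifiers are passed in as functions so the sketch stays independent of any SDK. The classify() name, the 0.7 floor, and the choice to keep the heuristic's guess when the classifier call fails are assumptions, not part of the original design.

```typescript
// classify.ts: run the free heuristic first, pay for a model call only when unsure.
export type Tier = "fast" | "frontier";

export interface Classification {
  tier: Tier;
  confidence: number; // 0 to 1
}

export interface Routed {
  tier: Tier;
  source: "heuristic" | "model"; // which classifier made the final call
}

// Placeholder threshold; tune it against observed misroute rates.
const CONFIDENCE_FLOOR = 0.7;

export async function classify(
  input: string,
  heuristic: (input: string) => Classification,
  modelClassifier: (input: string) => Promise<Tier>,
  floor: number = CONFIDENCE_FLOOR,
): Promise<Routed> {
  const guess = heuristic(input);

  // Confident heuristic result: route directly, no extra call.
  if (guess.confidence >= floor) {
    return { tier: guess.tier, source: "heuristic" };
  }

  // Ambiguous request: spend one small-model call to classify it.
  try {
    return { tier: await modelClassifier(input), source: "model" };
  } catch {
    // Assumption: if the classifier call fails, keep the heuristic's guess
    // instead of failing the whole request.
    return { tier: guess.tier, source: "heuristic" };
  }
}
```

Recording the source field alongside each request makes it easy to see how often the extra call actually fires.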
routeTier() turns a classified request into a tier, and one generation call does the rest. It reads from TIERS, so the routing logic never names a model directly, which is what keeps swaps to a one-line config edit. When the primary call fails, the same TIERS entry supplies a fallback from a different provider, so a provider-specific failure does not take the request down with it.
The generation call does not know or care which tier it resolved to. It receives a string and returns text. That is what keeps the loop model-agnostic, since classification chooses a tier, the TIERS entry resolves the tier to a string, and the call site stays the same when you swap what that string points to.
The try/catch here is your own fallback to a second provider. The Gateway also does its own provider routing with automatic retries and failover, so a transient failure at one provider is often handled before your code ever sees an error. The two layers stack, with the Gateway absorbing most provider hiccups and your fallback catching the rest.
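One way to write that routing and fallback path is sketched below. The model call is injected as callModel so the example does not depend on a specific SDK signature. In a real project you would likely wrap the AI SDK's text generation call there. The frontier and fallback model identifiers are placeholders, and the exact shape of routeTier() is a guess.

```typescript
// generate.ts: resolve a tier to a model string, call it, fall back on error.
export type Tier = "fast" | "frontier";

export interface TierConfig {
  primary: string;
  fallback: string;
}

export interface Classification {
  tier: Tier;
  confidence: number;
}

// Injected model call: takes a creator/model string and a prompt, returns text.
export type CallModel = (args: { model: string; prompt: string }) => Promise<{ text: string }>;

// Placeholder model IDs; check them against the model catalog.
export const TIERS: Record<Tier, TierConfig> = {
  fast: { primary: "openai/gpt-5.4-mini", fallback: "anthropic/claude-haiku-4.5" },
  frontier: { primary: "anthropic/claude-sonnet-4.5", fallback: "openai/gpt-5.4" },
};

// Turns a classified request into a tier. Extra rules (plans, flags) would go here.
export function routeTier(classification: Classification): Tier {
  return classification.tier;
}

export async function generate(
  prompt: string,
  classification: Classification,
  callModel: CallModel,
): Promise<{ text: string; model: string }> {
  const { primary, fallback } = TIERS[routeTier(classification)];
  try {
    const { text } = await callModel({ model: primary, prompt });
    return { text, model: primary };
  } catch {
    // Primary failed after any Gateway-level retries: try the other provider.
    const { text } = await callModel({ model: fallback, prompt });
    return { text, model: fallback };
  }
}
```

If the fallback also fails, the error propagates to the caller, which keeps the failure visible instead of returning an empty answer.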
Another cost-saving approach is to start on the fast tier and escalate to frontier only when a quality gate fails. This is the difference between paying for capability you actually used and paying for capability you guessed you might need.
Gate on something verifiable: a structured-output schema that fails to validate, an explicit refusal or "I'm not sure" string in the answer, a required field that comes back empty, a result that doesn't parse. Each of these is evidence that the cheap answer fell short, independent of what the model thinks of its own work.
You can then implement an escalation matrix, as shown below:
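A minimal sketch of start-cheap-then-escalate, assuming the task expects a JSON object with one required string field. The passesGate() and generateWithEscalation() names, the refusal patterns, and the frontier model identifier are illustrative assumptions.

```typescript
// escalate.ts: answer on the fast tier, escalate only when a verifiable gate fails.
export type CallModel = (args: { model: string; prompt: string }) => Promise<{ text: string }>;

const FAST_MODEL = "openai/gpt-5.4-mini";
const FRONTIER_MODEL = "anthropic/claude-sonnet-4.5"; // placeholder

// Placeholder refusal patterns; extend them from answers you actually observe.
const REFUSAL = /i'm not sure|i cannot|i can't/i;

// Gate on evidence, not on the model's opinion of its own answer:
// the output must parse as a JSON object and carry a non-empty required field.
export function passesGate(answer: string, requiredField: string): boolean {
  if (REFUSAL.test(answer)) return false;
  try {
    const parsed: unknown = JSON.parse(answer);
    if (typeof parsed !== "object" || parsed === null) return false;
    const value = (parsed as Record<string, unknown>)[requiredField];
    return typeof value === "string" && value.trim().length > 0;
  } catch {
    return false; // the result does not parse
  }
}

export async function generateWithEscalation(
  prompt: string,
  requiredField: string,
  callModel: CallModel,
): Promise<{ text: string; model: string; escalated: boolean }> {
  const cheap = await callModel({ model: FAST_MODEL, prompt });
  if (passesGate(cheap.text, requiredField)) {
    return { text: cheap.text, model: FAST_MODEL, escalated: false };
  }
  // The cheap answer failed the check, so pay for the frontier tier once.
  const strong = await callModel({ model: FRONTIER_MODEL, prompt });
  return { text: strong.text, model: FRONTIER_MODEL, escalated: true };
}
```

The frontier answer is returned without a second gate here. Whether to re-check it and surface an error is a product decision this sketch leaves open.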
Budgets give each tier a hard spend ceiling, so a misrouted flood of frontier calls hits a cap and gets rejected instead of producing a surprise invoice. AI Gateway budgets are set per API key, so the way to cap a tier is to give each tier its own key. Your TIERS config already separates tiers, so this falls out naturally. Create one Gateway API key for the fast tier and one for the frontier tier, read each from its own environment variable, and set a budget on each key in the dashboard.
The Gateway key is fixed at the provider-instance level, so routing each tier through its own budgeted key means building a separate gateway instance per tier. Each instance reads its own environment variable.
Adopting per-tier keys is a wiring change across the whole generate() path, not a drop-in. The block below illustrates the conversion on generate(), routing both its primary and fallback legs through the resolved tier's instance. Miss the fallback leg and a failover escapes the tier's budget, which is the exact thing this section exists to prevent. The same gateway(tier) wrapping has to be applied to every other model call site you want counted against a tier's budget, including the small-model classifier in classify-model.ts and the escalation calls in escalate.ts, since any call left on the bare ambient key escapes the per-tier cap.
This per-tier version is canonical once you adopt budgets. Convert every call site that should be capped, because the budget only binds on calls that route through gateway(...). In production, do not run ambient-key and per-tier calls for the same tier side by side, because the ambient-key calls bypass the cap.
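The gateway(tier) wiring might be built as in the sketch below. The factory is injected so the example does not assert a particular SDK signature. The assumption is that you pass in the AI SDK gateway provider's factory function and that it accepts an apiKey option. The environment variable names and makeTierGateways() are hypothetical.

```typescript
// gateway-per-tier.ts: one Gateway instance per tier, each bound to its own budgeted key.
export type Tier = "fast" | "frontier";

// A Gateway instance maps a creator/model string to a model handle of type M.
export type GatewayInstance<M> = (modelId: string) => M;

// Assumed factory shape: builds an instance fixed to one API key.
export type CreateGateway<M> = (options: { apiKey: string }) => GatewayInstance<M>;

// Hypothetical variable names; use whatever names you set in your project.
const KEY_ENV_VARS: Record<Tier, string> = {
  fast: "AI_GATEWAY_KEY_FAST",
  frontier: "AI_GATEWAY_KEY_FRONTIER",
};

export function makeTierGateways<M>(
  createGateway: CreateGateway<M>,
  env: Record<string, string | undefined>,
): (tier: Tier) => GatewayInstance<M> {
  const instances = new Map<Tier, GatewayInstance<M>>();
  for (const tier of Object.keys(KEY_ENV_VARS) as Tier[]) {
    const name = KEY_ENV_VARS[tier];
    const apiKey = env[name];
    // Fail at startup rather than silently falling back to an uncapped key.
    if (!apiKey) throw new Error(`Missing environment variable ${name}`);
    instances.set(tier, createGateway({ apiKey }));
  }
  return (tier: Tier): GatewayInstance<M> => {
    const instance = instances.get(tier);
    if (!instance) throw new Error(`No gateway instance for tier ${tier}`);
    return instance;
  };
}
```

With const gateway = makeTierGateways(factory, process.env), a call site resolves its model as gateway(tier)(modelId) for both the primary and the fallback leg, so a failover stays inside the same tier's budget.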
The budget is a soft cap checked at the start of each request. The request that crosses the limit still completes, so you never get a half-finished response. After that, the Gateway returns HTTP 402 and rejects further requests on that key until the budget resets or you raise it. Handle the rejection explicitly rather than letting it surface as an unhandled error.
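Explicit handling of the rejection could look like the sketch below. It assumes the thrown error exposes a numeric statusCode property. Verify that against the errors your SDK version actually throws. The isBudgetExceeded() and withBudgetGuard() names are hypothetical.

```typescript
// budget-guard.ts: turn a 402 budget rejection into a deliberate response.

// Assumption: the error object carries the HTTP status as `statusCode`.
export function isBudgetExceeded(error: unknown): boolean {
  if (typeof error !== "object" || error === null) return false;
  return (error as { statusCode?: unknown }).statusCode === 402;
}

// Runs a model call and substitutes a caller-chosen result when the tier's
// budget is exhausted. Every other error is rethrown untouched.
export async function withBudgetGuard<T>(
  run: () => Promise<T>,
  onBudgetExceeded: () => T | Promise<T>,
): Promise<T> {
  try {
    return await run();
  } catch (error) {
    if (isBudgetExceeded(error)) {
      return onBudgetExceeded();
    }
    throw error;
  }
}
```

What onBudgetExceeded returns is up to you: a friendly "try again later" message, a queued job, or an alert plus a rethrow are all reasonable choices.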
Per-key budgets still apply when you bring your own provider keys. With Bring Your Own Key (BYOK), the Gateway routes through your own provider credentials at zero markup on token prices, and it captures cost, token, and request data the same way it does for Gateway-billed traffic. BYOK calls show up in the same reporting as everything else, so switching a tier to your own credentials doesn't create a blind spot.
You cannot tune routing you cannot see, and the AI Gateway observability view shows which model served each request. That is the data you tune against, because the classifier thresholds you set are guesses until you check them against real traffic.
The observability data tells you four things worth acting on:
- Cost per model, so you know where the spend concentrates
- Latency distribution per model, so you can see when the classifier call is dominating
- Which model served each request, so you can audit individual routing decisions
- The split between tiers, so you know what fraction of traffic took the fast path
The tier split is the number to watch. If almost everything resolves to frontier, your classifier is too cautious and you are leaving savings on the table. If quality complaints are rising, the fast tier is answering requests it should escalate, and your gate is too loose.
You can also attribute cost by injecting user and tags metadata into requests, which lets you slice spend by customer, feature, or environment when you need to know who the cost belongs to. Reading these numbers and adjusting the thresholds is how routing gets better over time.
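To make the tier split concrete, here is a small sketch that computes it from routing records. The RoutingRecord shape is hypothetical: it stands for whatever per-request log your router writes, or an export of the observability data.

```typescript
// tier-split.ts: summarize how traffic divided between tiers.
export type Tier = "fast" | "frontier";

export interface RoutingRecord {
  tier: Tier; // tier that served the final answer
  escalated: boolean; // true if the request started on fast and moved up
}

export interface TierSplit {
  fast: number; // fraction of requests served by the fast tier
  frontier: number; // fraction served by the frontier tier
  escalationRate: number; // fraction that failed the gate and escalated
}

export function tierSplit(records: readonly RoutingRecord[]): TierSplit {
  const total = records.length;
  if (total === 0) {
    return { fast: 0, frontier: 0, escalationRate: 0 };
  }
  const fast = records.filter((record) => record.tier === "fast").length;
  const escalated = records.filter((record) => record.escalated).length;
  return {
    fast: fast / total,
    frontier: (total - fast) / total,
    escalationRate: escalated / total,
  };
}
```

A frontier fraction near 1 suggests the classifier is too cautious, while a high escalation rate means the fast tier is being handed work it cannot finish.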
You can cut costs further and extend this setup with three techniques:
- Caching repeated prompts: If your traffic includes the same prompts over and over, serving a cached response avoids the model call entirely, which is cheaper than any tier. Prompt caching lives at the upstream provider, not in the AI Gateway itself, and the Gateway passes it through by default. See the AI Gateway documentation for which providers support caching and how each one enables it.
- A/B testing: When you want to move a class of requests from frontier to fast, you can split traffic and compare cost and quality across the two paths instead of flipping the whole config at once, using the Flags SDK to control the split. This is different from using evals before deployment because it operates on production inputs.
- Advanced failover handling: The generate() function above already falls back to a second provider on error, and the Gateway adds its own automatic retries and failover underneath. You can extend the same idea to ordering preferences and per-tier retry policy as your reliability needs grow, so a single provider's outage stays survivable without changing how the router decides tiers.
Most routing issues trace back to a too-aggressive classifier, a too-low cap, or a typo in a model string.
The classifier threshold is too aggressive, or the confidence gate is set too high, so requests the fast tier could handle are being routed up. Loosen the heuristic signals or lower the confidence bar and check the tier split again.
The cap on that key is set too low for your real traffic, or your token estimate is off and you are hitting the ceiling earlier than expected. Raise the cap on the key or recheck how you are estimating tokens per request.
The creator/model string has a typo or names a model that is not on the Gateway. Check the string against the model catalog character for character.
The small-model classifier call is dominating your request time because it is running too often. Confirm the heuristic runs first and that the model classifier only fires on low confidence, or drop to the heuristic-only path.
- AI SDK documentation
- AI Gateway: Model fallbacks
- AI Gateway: Automatic Caching
- AI Gateway: Observability
It depends on which classifier path a request takes. The heuristic classifier is plain code that reads signals off the request, so it adds no measurable latency and no token cost. The small-model classifier adds one model call, which is real latency, so it should run only on the ambiguous requests the heuristic flagged rather than on every request. Routed well, the bulk of traffic takes the zero-latency path.
Bring Your Own Key routes through your own provider credentials at zero markup on token prices, and the Gateway continues to capture cost, token, and request data for those calls, so BYOK traffic shows up in the same reporting as everything else.
A budget cap stops spend by rejecting requests once you hit a ceiling, which protects you from a surprise invoice but does nothing for the requests that come in under the cap. Routing reduces spend on every request by sending easy ones to a cheaper model while preserving quality on the hard ones. The two are complementary, and you should use both, routing to lower the bill and a budget to guard the ceiling.
Yes. routeTier() is just code that returns a tier, so it can branch on anything you can read at request time, including the user's plan, account tier, or feature flags. You might route paid plans to the frontier tier by default and free plans to the fast tier. The cost attribution data is keyed by user and tags, so you can see what each segment costs once you route this way.
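A plan-aware routeTier() could be sketched like this. The Plan values and the RouteRequest shape are invented for illustration. Letting a classifier result override the paid-plan default is one possible policy, not the only one.

```typescript
// route-by-plan.ts: branch the routing decision on the caller's plan.
export type Tier = "fast" | "frontier";
export type Plan = "free" | "pro" | "enterprise"; // hypothetical plan names

export interface RouteRequest {
  plan: Plan;
  classifiedTier?: Tier; // optional result from the difficulty classifier
}

const PAID_PLANS: ReadonlySet<Plan> = new Set<Plan>(["pro", "enterprise"]);

export function routeTier(request: RouteRequest): Tier {
  // Free plans always take the fast tier, whatever the classifier says.
  if (!PAID_PLANS.has(request.plan)) {
    return "fast";
  }
  // Paid plans default to frontier, but an explicit classification can
  // still send an easy request down the cheaper path.
  return request.classifiedTier ?? "frontier";
}
```

Because the function only returns a tier, the generate() call site and the TIERS config stay unchanged when you add or remove plan rules.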