Bigger Isn't Better: Building a Model Router That Knows Your Agent
15th Jun
As AI teams wrestle with rapidly growing bills, cost management has gone from a back-office concern to a first-order engineering problem. For fast-scaling companies, inference spend can quickly get out of hand, and the obvious lever is to stop sending every request to the most expensive model available.
The common approach is model routing: directing requests to different models based on some criteria. Most implementations today are static. Teams map sub-agents or workflow components to perceived complexity (vibe-routing), then assume that more expensive models are better at everything - so the cheaper models get handed the "simple" tasks by default.
There are two core issues with this approach:
The first is one every agent team has already faced: static routing relies on clearly separable, consistent steps. Workflows are easy to route statically, but in agents with dynamic control flow, the complexity of the same unit task can vary with the input. And with long trajectories, it's hard to know even which sub-components are consuming resources.
The second is that the static approach rests on a tidy mental model: capability is monotonic with price. Spend more, get more, everywhere. The reality is messier. Different models, independent of price, are better suited to different kinds of tasks. A cheaper model may beat a flagship at structured extraction, or terse function-calling, or a particular language, while losing badly at long-horizon reasoning. Capability isn't a single ladder - it's a surface with peaks and valleys that don't line up neatly with cost. We call this the jagged frontier.
There must be a better way. How do you build a router for agents that live on a jagged frontier?
Everything downstream depends on having a reliable signal for "which model did better on this task." That signal comes from an LLM judge. The hard part isn't standing one up (that's just a prompt); it's trusting it. Before the judge labels anything that drives routing decisions, validate it against human preferences on a held-out set, measure agreement rates, and check it for the usual failure modes: position bias, length bias, and self-preference. A judge you haven't calibrated is just a confident source of mislabeled training data. Keep alternating between testing and tuning the judge until you're confident in its ability. This may feel misdirected to those who have so far improved their systems mostly by gut, but you can't improve when you don't know (or can't measure) what good looks like.
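As a rough illustration of the calibration checks described above, the sketch below computes a judge-to-human agreement rate and a simple position-bias signal on a held-out set of pairwise comparisons. The `PairwiseLabel` record, its field names, and the toy data are assumptions made for this example, not part of any particular tool; length bias and self-preference would need their own checks.

```python
from dataclasses import dataclass


@dataclass
class PairwiseLabel:
    human_winner: str          # "A" or "B", chosen by a human annotator
    judge_winner: str          # judge verdict with responses shown in order (A, B)
    judge_winner_swapped: str  # judge verdict with order (B, A), mapped back to A/B


def agreement_rate(labels):
    """Fraction of held-out comparisons where the judge matches the human."""
    if not labels:
        raise ValueError("need at least one label")
    return sum(l.human_winner == l.judge_winner for l in labels) / len(labels)


def position_flip_rate(labels):
    """Fraction of comparisons where swapping the order changes the verdict.

    A high value suggests the judge is sensitive to position, not content.
    """
    if not labels:
        raise ValueError("need at least one label")
    return sum(l.judge_winner != l.judge_winner_swapped for l in labels) / len(labels)


# Toy held-out set (hypothetical data, for illustration only).
held_out = [
    PairwiseLabel("A", "A", "A"),
    PairwiseLabel("B", "B", "A"),  # verdict flips when the order is swapped
    PairwiseLabel("A", "B", "B"),  # judge disagrees with the human
    PairwiseLabel("B", "B", "B"),
]

print("agreement:", agreement_rate(held_out))
print("position flip rate:", position_flip_rate(held_out))
```

What counts as "good enough" agreement is a judgment call for your task; the point is to have the number before the judge's labels feed anything downstream.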
You can get started by creating your first judge here on Emissary!
You can only route among the models you can actually reach: those that clear your API access, latency budget, and compliance constraints. Now that you have a ruler you trust, you can rapidly evaluate models for your task. So probe that set of models, not the theoretical universe of models, to get a concrete picture of the frontier.
Construct a simple matrix of tasks against models, drawn from traffic that looks like your production distribution, and score every cell with your judge. This is where the jaggedness becomes visible: you'll see models that win categories you'd never have assigned them under a static scheme. The matrix is your map of the accessible frontier.
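A minimal sketch of that matrix, assuming judge scores in the range 0 to 1 and hypothetical model and category names. It averages the scores in each (task category, model) cell and reports the best-scoring model per category.

```python
from collections import defaultdict
from statistics import mean

# (task_category, model, judge_score) rows. All values are made up.
judged = [
    ("extraction", "small-model", 0.92),
    ("extraction", "flagship-model", 0.88),
    ("extraction", "small-model", 0.90),
    ("extraction", "flagship-model", 0.86),
    ("long_reasoning", "small-model", 0.41),
    ("long_reasoning", "flagship-model", 0.83),
]


def build_matrix(rows):
    """Average judge score for each (category, model) cell."""
    cells = defaultdict(list)
    for category, model, score in rows:
        cells[(category, model)].append(score)
    matrix = defaultdict(dict)
    for (category, model), scores in cells.items():
        matrix[category][model] = mean(scores)
    return {category: dict(row) for category, row in matrix.items()}


def best_model_per_category(matrix):
    """Model with the highest average judge score in each category."""
    return {category: max(row, key=row.get) for category, row in matrix.items()}


matrix = build_matrix(judged)
winners = best_model_per_category(matrix)
for category, row in matrix.items():
    print(category, row, "->", winners[category])
```

In this toy data the cheaper model wins extraction and loses long-horizon reasoning, which is the kind of jaggedness a static price-based scheme would miss.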
A trustworthy judge and a mapped accessible frontier are the foundation for building an effective model router. The probing data from step 2 is the dataset for training a context-sensitive router. An effective router adds negligible cost and latency overhead and minimizes misrouted requests.
At Emissary, we train model routers by framing the problem as a discriminative task: predicting the likelihood that a specific model can complete a given step in the trajectory of an agent or workflow. This also keeps the router extensible: when a new model is released or enters your accessible frontier, adding it is a matter of attaching a new head and training it independently, using the data generation pipeline from step 2.
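To make the "one head per model" idea concrete, here is a deliberately tiny sketch, not Emissary's implementation. It assumes each trajectory step has already been turned into a numeric feature vector (here a single hypothetical feature, `is_long_horizon`) and uses a plain logistic-regression head per model. The class names, features, and training data are all illustrative.

```python
import math


class RouterHead:
    """One binary head: estimates P(model completes this step | features)."""

    def __init__(self, n_features):
        self.w = [0.0] * n_features
        self.b = 0.0

    def predict(self, x):
        z = self.b + sum(wi * xi for wi, xi in zip(self.w, x))
        return 1.0 / (1.0 + math.exp(-z))

    def fit(self, xs, ys, lr=0.5, epochs=500):
        # Plain stochastic gradient descent on the logistic loss.
        for _ in range(epochs):
            for x, y in zip(xs, ys):
                err = self.predict(x) - y
                self.w = [wi - lr * err * xi for wi, xi in zip(self.w, x)]
                self.b -= lr * err


class Router:
    """Independent heads over shared features, one per reachable model."""

    def __init__(self, n_features):
        self.n_features = n_features
        self.heads = {}

    def add_model(self, name, xs, ys):
        # Training a new head never touches the existing ones.
        head = RouterHead(self.n_features)
        head.fit(xs, ys)
        self.heads[name] = head

    def predict(self, x):
        return {name: head.predict(x) for name, head in self.heads.items()}


# Features: [is_long_horizon]. Labels: 1 if the judge scored the step a success.
xs = [[0.0], [1.0], [0.0], [1.0]]
router = Router(n_features=1)
router.add_model("small-model", xs, [1, 0, 1, 0])  # fails on long-horizon steps
before = router.predict([0.0])["small-model"]

router.add_model("new-model", xs, [1, 1, 1, 1])    # added later, trained alone
after = router.predict([0.0])["small-model"]
print(router.predict([1.0]))
```

Because each head is trained on its own labels, adding `new-model` leaves the `small-model` head's predictions unchanged, which is what makes the addition incremental rather than a full retrain.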
The two key metrics we measure are:
An effective router does NOT require a trade-off between these two metrics. Instead, it exploits the jagged frontier, deflecting a sizable fraction (25%+) of requests while delivering a positive outcome delta.
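One plausible way to compute these, assuming "deflection" means the share of requests routed away from the default model and "outcome delta" means the change in mean judge score relative to sending everything to the default. Both definitions, the function names, and the toy numbers are assumptions made for this sketch.

```python
from statistics import mean


def deflection_rate(routes, default_model):
    """Share of requests the router sent somewhere other than the default."""
    if not routes:
        raise ValueError("need at least one routed request")
    return sum(r != default_model for r in routes) / len(routes)


def outcome_delta(routed_scores, baseline_scores):
    """Mean judge score with routing minus mean judge score without it."""
    return mean(routed_scores) - mean(baseline_scores)


# Hypothetical evaluation run over four requests.
routes = ["flagship-model", "small-model", "flagship-model", "small-model"]
routed_scores = [0.9, 0.8, 0.7, 0.8]     # judge scores under the router
baseline_scores = [0.9, 0.7, 0.7, 0.7]   # judge scores with the default only

rate = deflection_rate(routes, default_model="flagship-model")
delta = outcome_delta(routed_scores, baseline_scores)
print(f"deflection rate: {rate:.0%}, outcome delta: {delta:+.3f}")
```

Both numbers should come from the same held-out traffic, scored by the same calibrated judge, so that the comparison is like for like.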
Ready to build your first model router? Start here:
With an effective model router in hand, it’s time to start capturing value. There are two key considerations in serving model routers:
Minimizing Overhead: A model router is fundamentally a tax on the agent stack, so minimizing its cost and latency overhead is critical to ensure a net positive impact. Cost is a function of the router's size, which determines the GPU required to serve it. Latency is the sum of the router's inference time, which grows with model size, and network latency. The optimal approach here is to use the smallest model possible, and serve it as close to your backend as possible.
For reference, Emissary’s model routers operate at $0.15/M tokens (<5% of Claude Sonnet) and ~50-100ms of roundtrip latency, depending on the location of your backend server.
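A back-of-the-envelope check of whether a router pays for itself can be sketched as follows. Only the $0.15/M router price comes from the figure above; the default and cheap model prices, the prompt size, and the deflection rate are hypothetical. The sketch counts input tokens only, assumes the router reads the full prompt on every request, and ignores output tokens and caching.

```python
def net_saving_per_request(prompt_tokens, default_price_per_m, cheap_price_per_m,
                           router_price_per_m, deflection_rate):
    """Expected saving per request in dollars, after paying the router tax."""
    default_cost = prompt_tokens * default_price_per_m / 1e6
    cheap_cost = prompt_tokens * cheap_price_per_m / 1e6
    router_cost = prompt_tokens * router_price_per_m / 1e6  # paid on every request
    return deflection_rate * (default_cost - cheap_cost) - router_cost


def break_even_deflection(default_price_per_m, cheap_price_per_m, router_price_per_m):
    """Deflection rate at which savings exactly cover the router's cost."""
    return router_price_per_m / (default_price_per_m - cheap_price_per_m)


saving = net_saving_per_request(
    prompt_tokens=10_000,
    default_price_per_m=3.00,   # hypothetical default model input price
    cheap_price_per_m=0.50,     # hypothetical cheaper model input price
    router_price_per_m=0.15,    # router price quoted above
    deflection_rate=0.25,
)
threshold = break_even_deflection(3.00, 0.50, 0.15)
print(f"net saving per request: ${saving:.5f}")
print(f"break-even deflection rate: {threshold:.0%}")
```

Under these assumed prices the router breaks even once about 6% of requests are deflected, so a 25% deflection rate leaves a comfortable margin. With a smaller price gap between models, the break-even point rises accordingly.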
Routing Policy: An effective model router is decoupled from the routing policy. This leaves room to adapt to rapid changes in downstream vendor pricing and to differences across tenants (who might have different defaults or accessible models). We suggest two approaches:
We generally recommend adopting the first for low-stakes use cases, and the second for higher stakes use cases.
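To show what decoupling the policy from the router can look like, the sketch below treats the router's output as a plain dictionary of success probabilities and applies a separate, hypothetical threshold policy on top. The function name, the price table, and the thresholds are assumptions for illustration, and this is not necessarily either of the approaches referred to above.

```python
def choose_model(success_prob, price_per_m, default_model, min_confidence):
    """Pick the cheapest model that is cheaper than the default and whose
    predicted success probability clears the threshold; else keep the default."""
    candidates = [
        model
        for model, prob in success_prob.items()
        if model != default_model
        and prob >= min_confidence
        and price_per_m[model] < price_per_m[default_model]
    ]
    if not candidates:
        return default_model
    return min(candidates, key=price_per_m.get)


# Router output for one request (hypothetical probabilities).
success_prob = {"flagship-model": 0.95, "mid-model": 0.97, "small-model": 0.90}
# Price table lives in the policy layer, so it can change without retraining.
price_per_m = {"flagship-model": 3.00, "mid-model": 1.00, "small-model": 0.50}

# The same router output under three different tenant thresholds.
relaxed = choose_model(success_prob, price_per_m, "flagship-model", 0.85)
strict = choose_model(success_prob, price_per_m, "flagship-model", 0.95)
locked = choose_model(success_prob, price_per_m, "flagship-model", 0.99)
print(relaxed, strict, locked)
```

Because prices and thresholds are inputs to the policy rather than baked into the router, a vendor price change or a tenant with a stricter risk tolerance only requires changing these arguments.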
The frontier moves every few weeks. A new model ships and reshapes the surface, and a router frozen at training time slowly drifts out of date. It is important to architect for this from the start: make adding a model an incremental operation (add a head to the router and expand its output space) rather than a full retrain. Ideally, pair that with periodic re-probing, so that model drift is captured and accounted for.
Everything above prices a request by its nominal per-token cost. Prompt caching breaks that assumption, and a router that ignores it will confidently make decisions that lose money.
You overestimate the cache discount when you route away. Agent workloads carry a large, stable prefix (system prompt, tool definitions, accumulated conversation) that is reused across many calls. With caching, reading that prefix back costs a fraction of the base input price (21% on Claude after write premium, 50% on OpenAI). So the "expensive" default model may not be expensive on a request whose prefix is already warm in its cache. Keep in mind, though, that TTLs are ~5 minutes, so even slightly stale conversations are billed at full price, and that the bulk of cost is output tokens, not input tokens, so caches may be less valuable than perceived.
Cache-aware routing fixes this by routing on effective cost, not sticker price. Two adjustments:
This biases the router toward stickiness when it is valuable, which composes cleanly with "deviate when certain". Now you deviate only when the cheaper model is both likely to succeed and genuinely cheaper after the cache math, not merely cheaper on paper.
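The cache math can be sketched as an effective input cost, under stated assumptions: the 0.21 read multiplier and the roughly 5-minute TTL are the figures quoted earlier, while the prices, token counts, and function names are hypothetical. The sketch ignores output tokens and assumes the cheaper model has no warm cache for this conversation.

```python
def is_cache_warm(seconds_since_last_call, ttl_seconds=300):
    """Assumes a ~5 minute TTL: older prefixes are billed at full price."""
    return seconds_since_last_call < ttl_seconds


def effective_input_cost(prefix_tokens, new_tokens, price_per_m,
                         cached_read_multiplier, cache_warm):
    """Input cost in dollars, discounting the shared prefix if it is cached."""
    prefix_rate = price_per_m * (cached_read_multiplier if cache_warm else 1.0)
    return (prefix_tokens * prefix_rate + new_tokens * price_per_m) / 1e6


prefix_tokens, new_tokens = 20_000, 500

# Default model: pricier on paper, but its prefix was used 60 seconds ago.
default_cost = effective_input_cost(
    prefix_tokens, new_tokens, price_per_m=3.00,
    cached_read_multiplier=0.21, cache_warm=is_cache_warm(60),
)
# Cheaper model: lower sticker price, but routing to it starts with a cold cache.
cheap_cost = effective_input_cost(
    prefix_tokens, new_tokens, price_per_m=1.00,
    cached_read_multiplier=0.21, cache_warm=False,
)
print(f"default (warm): ${default_cost:.4f}, cheaper (cold): ${cheap_cost:.4f}")
```

With these assumed numbers the default model's effective input cost ($0.0141) is below the nominally cheaper model's ($0.0205), so deviating would lose money on this request even though the sticker price says otherwise. Once the gap since the last call exceeds the TTL, the comparison reverts to sticker prices.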
Model routing is about to become a critical component of every AI toolkit. Static routing optimizes against a model of the world that doesn't hold: that price predicts capability universally. Routing for a jagged frontier replaces that assumption with measurement — a trusted judge, a probed frontier, a learned router, and an architecture that expects the frontier to keep moving. The payoff is the one the bills demanded in the first place: lower cost and better task-fit, instead of trading one for the other.
© 2026 Emissary. All rights reserved.