Pushing the Pareto: SOTA LLM Judges
27th May
An LLM-as-Judge is a language model used to evaluate the output of an AI system against a rubric. The judge consumes some combination of an input, a candidate output, and an evaluation criterion, and emits a verdict: a binary label, a preference between two candidates, a scalar score, or a natural-language critique.
In a world of open-ended outputs and infinite ways to arrive at them, LLM-as-Judge has become the backbone of evaluation - used in offline benchmarks, online monitoring, RLHF pipelines, and safety guardrails.
The standard approach involves prompting a frontier model with the input and parsing the verdict from the output. It's a quick and dirty way to keep AI in check: it applies universally and takes minutes to set up, but it suffers from compounding uncertainty from stacked LLM calls, latency tails in double-digit seconds, and cost that scales linearly with every call. Teams are left torn between continuing manual evals and actually trusting their judges at scale.
The root cause is a structure mismatch. Judging tasks are classification and regression (discriminative) problems with small, closed label spaces. The standard implementation applies a generative model to this discriminative task - sampling from a 100,000-token vocabulary and collapsing back to two (or a few) choices. This wasted computation and unnecessary noise make generative judging expensive, slow, and unreliable.
A second, deeper problem is the objective mismatch. Prompting the same frontier LLM that generated the output with "Is this helpful?" asks it to answer from the same distribution, not from the knowledge of specific labelers whose judgments you're aiming to emulate. These distributions may overlap but aren't identical (hence the need for judgment), and no prompt, few-shot example, or reasoning chain can arrive at labeler-specific signal the model was never exposed to.
The ceiling is set by calibration, not capability, which is why performance across models clusters within the same window. And the initial structural mismatch prevents post-training from being meaningful because of the intermediate noise.
At Emissary, we're committed to bringing the best of ML to AI. LLMs are quick, ML is reliable.
So we replace the LLM's language modeling head with a discriminative head. The LM backbone supplies zero-shot generalization; the head supplies closed outputs that map to the judgment task. Inference is a single forward pass: no decoding, no format parsing, no prompt sensitivity, and a direct loss signal that is easy to train against. The result is easily calibrated, fast, and cheap.
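To make the single-forward-pass idea concrete, here is a minimal sketch of a binary head applied to one pooled backbone vector. The 4-dimensional `hidden` vector, the `weights`, and the function name `judge_forward` are all hypothetical stand-ins; the source does not describe how the backbone output is pooled or how the head is shaped.

```python
import math

def judge_forward(hidden_state, head_weights, head_bias=0.0):
    """Map one pooled backbone vector to P(positive label) in a single pass."""
    logit = sum(h * w for h, w in zip(hidden_state, head_weights)) + head_bias
    return 1.0 / (1.0 + math.exp(-logit))

# Hypothetical values standing in for the backbone output and a trained head.
hidden = [0.5, -1.0, 0.25, 2.0]
weights = [0.8, 0.1, -0.4, 0.6]

p_helpful = judge_forward(hidden, weights)
verdict = "HELPFUL" if p_helpful >= 0.5 else "NOT_HELPFUL"
print(verdict, round(p_helpful, 3))
```

The output is a probability over a closed label set, so there is nothing to decode or parse, and the threshold can be tuned directly for calibration.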
But this gives rise to a transition problem - an uncanny valley between 0 and 100 samples, where the frontier LLM is better than the Decision-LM (DLM), the backbone-plus-head model described above. So we created Semantic Initialization - a way to seed Decision-LMs with the knowledge already in the underlying LLM, making them as accurate as the LLM at zero-shot while keeping the latency and cost gains.
Semantic Initialization enables the custom heads of language models to extract informational state from base LLMs through logit distribution analysis. By examining the logit behavior of models across a distribution of prompts, we can set the head weights in a manner that enables the models to match the performance of their base counterparts, with no labelled data.
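One simple way to see why a head can inherit a base model's zero-shot behavior without labels is sketched below. This is an illustrative assumption, not Emissary's actual procedure, which the text describes only as logit analysis across a distribution of prompts. The sketch assumes each label is a single vocabulary token and sets the head weights to the difference between the two label rows of the LM head; the toy `lm_head` matrix and `hidden` vector are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy LM head: one weight row per vocabulary token (hypothetical values).
lm_head = {
    "HELPFUL": [0.9, -0.2, 0.4],
    "NOT_HELPFUL": [-0.5, 0.7, 0.1],
    "the": [0.1, 0.1, 0.1],
}

def init_binary_head(lm_head, pos_token, neg_token):
    """Head weights = difference of the two label rows of the LM head."""
    return [p - n for p, n in zip(lm_head[pos_token], lm_head[neg_token])]

hidden = [1.0, 0.5, -0.3]  # hypothetical final hidden state for one prompt
head = init_binary_head(lm_head, "HELPFUL", "NOT_HELPFUL")
p_head = 1.0 / (1.0 + math.exp(-dot(head, hidden)))

# Reference: the LM's own probability, restricted to the two label tokens.
z_pos = dot(lm_head["HELPFUL"], hidden)
z_neg = dot(lm_head["NOT_HELPFUL"], hidden)
p_lm = softmax([z_pos, z_neg])[0]

print(round(p_head, 6), round(p_lm, 6))
```

The two probabilities agree because a two-way softmax over logits `z_pos` and `z_neg` is the sigmoid of `z_pos - z_neg`. Real tokenizers often split labels into several tokens, which this sketch does not handle.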
Unlike ML, AI workloads are incremental, so we created three more graduated learning modes - few-sample (5-50) head-only training, warm-start LoRA (~100 labels, regression-based initialization), and full LoRA (1,000+ labels) - that AI engineers can seamlessly transition across as they generate more labelled feedback. Going from decent to good to great, all in one place.
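As a rough illustration of the head-only mode, the sketch below fits a logistic-regression head on frozen features with plain gradient descent. The six 2-dimensional feature vectors, the learning rate, and the epoch count are hypothetical; the source does not specify the optimizer or the feature pooling.

```python
import math

def train_head(features, labels, epochs=200, lr=0.5):
    """Fit a logistic-regression head on frozen backbone features."""
    dim = len(features[0])
    n = len(features)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            logit = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-logit))
            err = p - y  # gradient of binary cross-entropy w.r.t. the logit
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else 0

# Hypothetical frozen features for six labelled examples (1 = helpful).
features = [[2.0, 0.1], [1.5, -0.2], [1.8, 0.3],
            [-1.7, 0.2], [-2.1, -0.1], [-1.4, 0.0]]
labels = [1, 1, 1, 0, 0, 0]

w, b = train_head(features, labels)
preds = [predict(w, b, x) for x in features]
print(preds)
```

Only the head parameters move, so the backbone cannot be degraded by a handful of noisy labels.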
The four modes are not arbitrary. They map to the four data regimes any team building an LLM-as-Judge will be in:
| Data Regime | Mode | Why |
|---|---|---|
| No labels yet | Zero-shot | Drop-in, matches frontier accuracy |
| A handful of labels | Low-sample (head-only) | Stable, fast, no backbone risk |
| ~100 labels | Warm-start LoRA | Closed-form init unlocks small-N |
| 1,000+ labels | Full LoRA fine-tune | Breaks the prompting ceiling |
The progression is smooth. A team can start at zero-shot Emissary on day one, collect 100 labels over a week, correct mistakes and move to warm-start, then graduate to full fine-tune as their label set grows — without changing inference infrastructure or output interface.
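The regime table can be read as a simple dispatch on label count. The sketch below is one hypothetical encoding: the function name and the exact cutoffs are assumptions, since the text gives only approximate ranges (5-50, ~100, 1,000+) and leaves the gaps between them undefined.

```python
def select_mode(num_labels: int) -> str:
    """Pick a training mode from the number of labels available.

    Cutoffs are assumptions; the source gives only approximate ranges.
    """
    if num_labels < 0:
        raise ValueError("num_labels must be non-negative")
    if num_labels < 5:
        return "zero-shot"
    if num_labels < 100:
        return "head-only"
    if num_labels < 1000:
        return "warm-start LoRA"
    return "full LoRA"

for n in (0, 20, 100, 1000):
    print(n, select_mode(n))
```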
We evaluate on two standard LLM-as-Judge benchmarks, each cast as a balanced binary classification task.
MT-Bench: score:1/10 split, mapping the original 10-point scale to a binary helpful / not-helpful label. We sample 1,000 examples for training (500 positive / 500 negative), 100 for the low-sample regime (50/50), and 1,000 for testing (500/500). All splits are class-balanced.
UltraFeedback: score:1/5 split, mapping the original 5-point scale to binary. Sampling and class balance match MT-Bench: 1,000 train / 100 low-sample / 1,000 test, all 50/50.
This paper covers binary judgment: helpful / not-helpful, pass / fail, safe / unsafe, prefer-A / prefer-B. Binary judgment covers a large share of LLM-as-Judge use cases in practice, and it is the cleanest setting in which to demonstrate that classification beats generation for closed-set decisions.
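The binarize-then-balance step used for the benchmark splits can be sketched as follows. The cutoff `THRESHOLD = 6` is hypothetical, since the source does not say where the 10-point scale is cut, and the toy `raw` records stand in for real benchmark rows.

```python
import random

def binarize(score, threshold):
    """Map a numeric rating to 1 (helpful) or 0 (not helpful)."""
    return 1 if score >= threshold else 0

def balanced_sample(examples, per_class, seed=0):
    """Draw the same number of positives and negatives, then shuffle."""
    rng = random.Random(seed)
    positives = [ex for ex in examples if ex["label"] == 1]
    negatives = [ex for ex in examples if ex["label"] == 0]
    if len(positives) < per_class or len(negatives) < per_class:
        raise ValueError("not enough examples in one class")
    sample = rng.sample(positives, per_class) + rng.sample(negatives, per_class)
    rng.shuffle(sample)
    return sample

# Toy data: 200 records with scores cycling through 1..10.
raw = [{"id": i, "score": (i % 10) + 1} for i in range(200)]
THRESHOLD = 6  # hypothetical cutoff, not stated in the source
labelled = [{**ex, "label": binarize(ex["score"], THRESHOLD)} for ex in raw]

low_sample = balanced_sample(labelled, per_class=50)
print(len(low_sample), sum(ex["label"] for ex in low_sample))
```

A real pipeline would also keep the train, low-sample, and test draws disjoint, which this sketch does not enforce.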
We are concurrently developing an Emissary variant for scoring judges - scalar and ordinal scores rather than binary labels. Results from that work will be published separately.
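We use Qwen3-8B as our base model. We compare Emissary's DLM against two baselines: frontier API judges (GPT-5.5, Sonnet 4.6, Opus 4.7, and Gemini 3 Flash) and Qwen3-8B used as a generative judge, each prompted at 0, 1, and 2 shots with thinking ON and OFF.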
Emissary's zero-shot configuration (Qwen3-8B as a classifier) sits inside the same accuracy cluster as frontier APIs - at a fraction of the latency and cost. The plateau is the prompting ceiling, set by the interface, not the model. More parameters, reasoning compute, or a different provider doesn't move it.
| Method | MT-Bench | UltraFeedback | P50 (ms) | Cost / 1K |
|---|---|---|---|---|
| Emissary zero-shot | 0.906 | 0.834 | 65 | $0.06 |
| Sonnet 4.6 0-shot | 0.910 | 0.835 | 1,148 | $2.43 |
| Opus 4.7 0-shot | 0.907 | 0.838 | 1,433 | $5.82 |
| Gemini 3 Flash 0-shot | 0.901 | 0.812 | 1,127 | $0.38 |
| GPT-5.5 0-shot | 0.864 | 0.789 | 902 | $3.67 |
Post-training with 1,000 labels vs. best frontier
| Method | MT-Bench | UltraFeedback |
|---|---|---|
| Emissary post-train cold-1000 | 0.979 | 0.955 |
| Emissary post-train warm-1000 | 0.972 | 0.949 |
| Best frontier (Opus 4.7 2-shot, think ON) | 0.945 | 0.838 |
When labeled data is available, Emissary DLMs move well past the prompting cluster. The UltraFeedback gap (+11.7 points over the best frontier) is the more telling number - that rubric's labeler-specific signal is only recoverable through training.
| Method | MT-Bench | UltraFeedback |
|---|---|---|
| Emissary post-train cold-100 | 0.857 | 0.833 |
| Emissary post-train warm-100 | 0.930 | 0.889 |
DLM warm-start at 100 examples already exceeds every frontier API on UltraFeedback (0.889 vs. 0.838 for the best frontier result), collapsing the labeled-data requirement from "thousands" to "low hundreds" - achievable in a single labeling session. At 1,000 samples (see section above), warm-starting stops paying off: cold-start training scores slightly higher on both benchmarks (0.979 vs. 0.972 on MT-Bench, 0.955 vs. 0.949 on UltraFeedback).
Thinking mode produced inconsistent, often negligible accuracy changes while multiplying latency 2-10x and cost 1.3-4x. GPT-5.5 thinking-ON scored lower than thinking-OFF across all shot counts. The pattern holds for both frontier APIs and Qwen3-8B generation.
MT-Bench - thinking ON vs. OFF (representative rows)
| Model | Thinking | Accuracy | P50 (ms) | P99 (ms) | Cost / 1K |
|---|---|---|---|---|---|
| GPT-5.5 1-shot | OFF | 0.872 | 906 | 1,850 | $7.81 |
| GPT-5.5 1-shot | ON | 0.862 | 2,355 | 12,301 | $12.01 |
| Sonnet 4.6 1-shot | OFF | 0.936 | 1,231 | 4,520 | $5.20 |
| Sonnet 4.6 1-shot | ON | 0.928 | 1,219 | 18,235 | $5.87 |
| Opus 4.7 2-shot | OFF | 0.933 | 1,551 | 5,336 | $16.65 |
| Opus 4.7 2-shot | ON | 0.945 | 1,740 | 6,951 | $16.87 |
| Qwen3-8B gen 1-shot | OFF | 0.943 | 284 | 380 | $0.24 |
| Qwen3-8B gen 1-shot | ON | 0.948 | 3,891 | 25,555 | $4.46 |
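The overhead of thinking mode can be read directly off the representative rows. The sketch below copies those rows and computes, per model, the accuracy change and the ON/OFF ratios for P50, P99, and cost; the helper name `thinking_overhead` is hypothetical.

```python
# (accuracy, P50 ms, P99 ms, cost per 1K) copied from the table above.
rows = {
    "GPT-5.5 1-shot": {"OFF": (0.872, 906, 1850, 7.81), "ON": (0.862, 2355, 12301, 12.01)},
    "Sonnet 4.6 1-shot": {"OFF": (0.936, 1231, 4520, 5.20), "ON": (0.928, 1219, 18235, 5.87)},
    "Opus 4.7 2-shot": {"OFF": (0.933, 1551, 5336, 16.65), "ON": (0.945, 1740, 6951, 16.87)},
    "Qwen3-8B gen 1-shot": {"OFF": (0.943, 284, 380, 0.24), "ON": (0.948, 3891, 25555, 4.46)},
}

def thinking_overhead(off, on):
    """Accuracy delta and ON/OFF multipliers for latency and cost."""
    return {
        "accuracy_delta": round(on[0] - off[0], 3),
        "p50_ratio": round(on[1] / off[1], 2),
        "p99_ratio": round(on[2] / off[2], 2),
        "cost_ratio": round(on[3] / off[3], 2),
    }

overhead = {name: thinking_overhead(r["OFF"], r["ON"]) for name, r in rows.items()}
for name, stats in overhead.items():
    print(name, stats)
```

On these four rows the accuracy change stays within about one point in either direction, while the P50 ratio ranges from just under 1x (Sonnet 4.6) to over 13x (Qwen3-8B generation) and the tail latency grows in every case.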
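Emissary post-train cold-1000 Pareto-dominates every frontier row: higher accuracy, roughly 14-27x lower P50 latency, and 40-280x lower cost. The zero-shot and warm-100 configurations land inside the frontier accuracy range (0.864-0.945) at the same latency and cost. The Pareto frontier on this problem is occupied by purpose-built classifiers, not frontier LLMs.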
| Configuration | Accuracy | P50 (ms) | Cost / 1K |
|---|---|---|---|
| Emissary post-train cold-1000 | 0.979 | 65 | $0.06 |
| Emissary post-train warm-100 | 0.930 | 65 | $0.06 |
| Emissary zero-shot | 0.906 | 65 | $0.06 |
| Opus 4.7 2-shot, think ON | 0.945 | 1,740 | $16.87 |
| Sonnet 4.6 1-shot, think OFF | 0.936 | 1,231 | $5.20 |
| Sonnet 4.6 0-shot, think OFF | 0.910 | 1,148 | $2.43 |
| GPT-5.5 0-shot, think OFF | 0.864 | 902 | $3.67 |
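Pareto dominance can be checked mechanically from the table. In the sketch below, one configuration dominates another when it is no worse on accuracy, P50 latency, and cost, and strictly better on at least one; the `Config` type and function names are hypothetical, and the numbers are copied from the rows above.

```python
from typing import NamedTuple

class Config(NamedTuple):
    name: str
    accuracy: float
    p50_ms: float
    cost_per_1k: float

def dominates(a: Config, b: Config) -> bool:
    """True if a is no worse than b on every axis and better on at least one."""
    no_worse = (a.accuracy >= b.accuracy and a.p50_ms <= b.p50_ms
                and a.cost_per_1k <= b.cost_per_1k)
    better = (a.accuracy > b.accuracy or a.p50_ms < b.p50_ms
              or a.cost_per_1k < b.cost_per_1k)
    return no_worse and better

def pareto_frontier(configs):
    """Keep only configurations that no other configuration dominates."""
    return [c for c in configs if not any(dominates(o, c) for o in configs)]

configs = [
    Config("Emissary post-train cold-1000", 0.979, 65, 0.06),
    Config("Emissary post-train warm-100", 0.930, 65, 0.06),
    Config("Emissary zero-shot", 0.906, 65, 0.06),
    Config("Opus 4.7 2-shot, think ON", 0.945, 1740, 16.87),
    Config("Sonnet 4.6 1-shot, think OFF", 0.936, 1231, 5.20),
    Config("Sonnet 4.6 0-shot, think OFF", 0.910, 1148, 2.43),
    Config("GPT-5.5 0-shot, think OFF", 0.864, 902, 3.67),
]

frontier = pareto_frontier(configs)
print([c.name for c in frontier])
```

Under this strict definition the cold-1000 configuration dominates every other row, while zero-shot against Sonnet 4.6 0-shot is a trade-off (0.906 vs. 0.910 accuracy) rather than strict dominance.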
The diagnosis predicted three things, and all three appear in the data: a flat accuracy cluster across frontier models, a ceiling break once the judge is trained on actual labels, and thinking modes that fail to help.
Pareto dominance follows from using the right tool for the job. Accuracy improves by training on the exact decision boundary. Latency drops because a single forward pass replaces autoregressive decoding. Cost falls because inference runs on owned hardware rather than per-token billing. None of these gains require novel research - they require taking the structure of the problem seriously.
The standard advice - "use the strongest frontier model you can afford" - is wrong for closed-set judgment tasks. The binding constraint is calibration to labelers, not model capability, and calibration requires effective training, not scaling. Every additional dollar spent on a larger frontier model, a longer reasoning chain, or a more elaborate prompt is a dollar spent moving along a ceiling rather than through it. The teams that recognize this early will ship faster, evaluate more, and trust their pipelines more than the teams that don't.
The implications go beyond cost savings. A judge that runs in 65ms instead of 1,700ms is no longer a batch-time artifact - it becomes a real-time component you can put in the inference path, in safety guardrails, in router logic, in online RLHF loops. A judge that costs $0.06 per thousand calls instead of $16.87 makes 100x more evaluation economically viable, which changes what teams can measure and how often. Quality stops being something you sample and starts being something you observe continuously.
If you're running LLM judges in production today - for evals, monitoring, guardrails, or RLHF - you're almost certainly on the wrong side of the Pareto frontier. We'd love to help you move.
Reach out to us to book a technical deep-dive. The ceiling is real, but it isn't yours to live under.
MT-Bench - frontier APIs (all shot counts, thinking ON and OFF)
| Model | Shots | Thinking | Accuracy | P50 (ms) | P99 (ms) | Cost / 1K |
|---|---|---|---|---|---|---|
| GPT-5.5 | 0 | OFF | 0.864 | 902 | 2,765 | $3.67 |
| GPT-5.5 | 0 | ON | 0.856 | 2,344 | 12,604 | $7.80 |
| GPT-5.5 | 1 | OFF | 0.872 | 906 | 1,850 | $7.81 |
| GPT-5.5 | 1 | ON | 0.862 | 2,355 | 12,301 | $12.01 |
| GPT-5.5 | 2 | OFF | 0.870 | 900 | 2,533 | $10.60 |
| GPT-5.5 | 2 | ON | 0.861 | 2,421 | 15,116 | $15.04 |
| Sonnet 4.6 | 0 | OFF | 0.910 | 1,148 | 4,244 | $2.43 |
| Sonnet 4.6 | 0 | ON | 0.906 | 1,223 | 16,426 | $3.20 |
| Sonnet 4.6 | 1 | OFF | 0.936 | 1,231 | 4,520 | $5.20 |
| Sonnet 4.6 | 1 | ON | 0.928 | 1,219 | 18,235 | $5.87 |
| Sonnet 4.6 | 2 | OFF | 0.927 | 1,281 | 3,745 | $7.32 |
| Sonnet 4.6 | 2 | ON | 0.926 | 1,263 | 18,128 | $8.03 |
| Opus 4.7 | 0 | OFF | 0.907 | 1,433 | 5,358 | $5.82 |
| Opus 4.7 | 0 | ON | 0.912 | 1,611 | 8,928 | $6.18 |
| Opus 4.7 | 1 | OFF | 0.938 | 1,519 | 5,253 | $11.68 |
| Opus 4.7 | 1 | ON | 0.936 | 1,668 | 7,032 | $11.90 |
| Opus 4.7 | 2 | OFF | 0.933 | 1,551 | 5,336 | $16.65 |
| Opus 4.7 | 2 | ON | 0.945 | 1,740 | 6,951 | $16.87 |
| Gemini 3 Flash | 0 | OFF | 0.901 | 1,127 | 3,894 | $0.38 |
| Gemini 3 Flash | 0 | ON | 0.902 | 2,623 | 10,141 | $1.54 |
| Gemini 3 Flash | 1 | OFF | 0.887 | 1,126 | 4,306 | $0.82 |
| Gemini 3 Flash | 1 | ON | 0.908 | 2,672 | 12,636 | $1.99 |
| Gemini 3 Flash | 2 | OFF | 0.898 | 1,179 | 4,868 | $1.03 |
| Gemini 3 Flash | 2 | ON | 0.914 | 2,403 | 13,157 | $2.09 |
MT-Bench - Qwen3-8B generation
| Shots | Thinking | Accuracy | P50 (ms) | P99 (ms) | Cost / 1K |
|---|---|---|---|---|---|
| 0 | OFF | 0.937 | 284.27 | 384.34 | $0.24 |
| 0 | ON | 0.9378 | 3,577.03 | 20,394.03 | $4.07 |
| 1 | OFF | 0.943 | 283.95 | 380.34 | $0.24 |
| 1 | ON | 0.9478 | 3,891.20 | 25,555.16 | $4.46 |
| 2 | OFF | 0.940 | 285.89 | 385.07 | $0.24 |
| 2 | ON | 0.9447 | 3,984.79 | 29,678.37 | $4.66 |
MT-Bench - Emissary
| Mode | Train labels | Accuracy | P50 (ms) | P99 (ms) | Cost / 1K |
|---|---|---|---|---|---|
| Zero-shot (Semantic Init) | 0 | 0.906 | 65.41 | 124.44 | $0.06 |
| 5-shot | 0 | 0.767 | - | - | - |
| Cold LoRA (5 ep) | 100 | 0.857 | 53.67 | 136.87 | $0.05 |
| Warm-start LoRA (5 ep) | 100 | 0.930 | - | - | - |
| Cold LoRA (4 ep) | 1,000 | 0.979 | - | - | - |
| Warm-start LoRA (5 ep) | 1,000 | 0.972 | - | - | - |
UltraFeedback - frontier APIs (0-shot)
| Model | Accuracy |
|---|---|
| GPT-5.5 | 0.789 |
| Sonnet 4.6 | 0.835 |
| Opus 4.7 | 0.838 |
| Gemini 3 Flash | 0.8118 |
UltraFeedback - Qwen3-8B generation
| Method | Accuracy |
|---|---|
| Qwen3-8B generation (no-think) | 0.884 |
| Qwen3-8B generation (think) | 0.8855 |
UltraFeedback - Emissary
| Mode | Train labels | Accuracy |
|---|---|---|
| Zero-shot (Semantic Init) | 0 | 0.834 |
| 5-shot | 0 | 0.8607 |
| Cold LoRA (5 ep) | 100 | 0.833 |
| Warm-start LoRA (5 ep) | 100 | 0.889 |
| Cold LoRA (4 ep) | 1,000 | 0.955 |
| Warm-start LoRA (5 ep) | 1,000 | 0.949 |
System:
You are an impartial MT-Bench-style judge. Decide whether the candidate answer is helpful for the latest user request.

Use MT-Bench-style response-quality factors: helpfulness, relevance, accuracy, depth, appropriate creativity, and level of detail.

Return HELPFUL if the candidate answer substantially satisfies the latest user request, stays on task, is factually and logically correct, and provides enough useful detail for the task. Minor wording or style issues are acceptable only when they do not affect usefulness.

Return NOT_HELPFUL if the candidate answer fails a key requirement, is incorrect, irrelevant, evasive, refuses without good reason, contradicts the provided context, or is too incomplete or shallow to be useful.

For math, coding, reasoning, and factual tasks, prioritize correctness and whether the final answer solves the requested problem. If previous conversation context is provided, use it only to understand the latest user request; judge the candidate answer itself.

Output exactly one token: HELPFUL or NOT_HELPFUL. No punctuation, no explanation.
User:
Task / conversation context:
Previous conversation:
User: ...
Assistant: ...

Latest user request:
...

Candidate answer:
<candidate answer>

Final verdict (HELPFUL or NOT_HELPFUL):
Few-shot examples are appended in the standard Example N: ... Final verdict: ... format before the test item.
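A minimal sketch of how the few-shot examples might be assembled ahead of the test item is shown below. The exact layout is an assumption, since the source elides it with "..."; the function names and the example record are hypothetical, and the previous-conversation fields are folded into a single `context` string for brevity.

```python
def format_item(context, candidate, verdict=None):
    """Render one item; labelled examples carry their verdict at the end."""
    body = (
        f"Task / conversation context:\n{context}\n\n"
        f"Candidate answer:\n{candidate}\n\n"
        "Final verdict (HELPFUL or NOT_HELPFUL):"
    )
    return body if verdict is None else f"{body} {verdict}"

def build_user_prompt(examples, context, candidate):
    """Place numbered few-shot examples before the unlabelled test item."""
    blocks = [
        f"Example {i}:\n{format_item(ex['context'], ex['candidate'], ex['verdict'])}"
        for i, ex in enumerate(examples, start=1)
    ]
    blocks.append(format_item(context, candidate))
    return "\n\n".join(blocks)

# Hypothetical one-shot example and test item.
examples = [{
    "context": "Latest user request:\nName a prime number.",
    "candidate": "7",
    "verdict": "HELPFUL",
}]
prompt = build_user_prompt(
    examples,
    context="Latest user request:\nWhat is the capital of France?",
    candidate="Berlin",
)
print(prompt)
```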
System:
You are evaluating whether an AI assistant's response is helpful.

A response is HELPFUL only when it meets ALL of these criteria:
1. Clarity and Relevance: Does the response directly address the task and remain on-topic?
2. Useful and Comprehensive Information: Does the response provide relevant background, reasoning, or detailed explanation that improves understanding?
3. Not Lengthy, No Repetition: Is the response concise without unnecessary repetition while still being comprehensive?

If the response clearly satisfies all three, answer HELPFUL.
If it clearly fails on any one - off-topic, vague/incomplete, or bloated/repetitive - answer NOT_HELPFUL.

Output exactly one word: HELPFUL or NOT_HELPFUL. No punctuation, no explanation.
User:
Evaluate the following instruction-response pair against the helpfulness criteria.

Instruction:
{question}

Response:
{answer}

Verdict (HELPFUL or NOT_HELPFUL):
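For the generative baselines, the user template has to be filled and the model's reply mapped back to a label. The sketch below shows one way to do both, assuming the `{question}` and `{answer}` placeholders are filled with Python's `str.format`; the `parse_verdict` helper and its tolerance for trailing punctuation are hypothetical.

```python
USER_TEMPLATE = (
    "Evaluate the following instruction-response pair against the "
    "helpfulness criteria.\n\n"
    "Instruction:\n{question}\n\n"
    "Response:\n{answer}\n\n"
    "Verdict (HELPFUL or NOT_HELPFUL):"
)

def parse_verdict(raw):
    """Return True/False for a clean verdict, or None if it cannot be parsed."""
    token = raw.strip().rstrip(".!").upper()
    if token == "HELPFUL":
        return True
    if token == "NOT_HELPFUL":
        return False
    return None  # anything else counts as a format failure

user_msg = USER_TEMPLATE.format(question="What is 2 + 2?", answer="4")
print(user_msg)
print(parse_verdict(" helpful.\n"), parse_verdict("NOT_HELPFUL"),
      parse_verdict("It is helpful"))
```

Exact matching matters here: a substring check for "HELPFUL" would also match "NOT_HELPFUL" and silently flip negative verdicts.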
© 2026 Emissary. All rights reserved.