Pushing the Pareto: SOTA LLM Judges

27th May

The LLM-as-Judge

An LLM-as-Judge is a language model used to evaluate the output of an AI system against a rubric. The judge consumes some combination of an input, a candidate output, and an evaluation criterion, and emits a verdict: a binary label, a preference between two candidates, a scalar score, or a natural-language critique.

Because outputs are open-ended and there are countless valid ways to arrive at them, the LLM-as-Judge has become the backbone of evaluation, used in offline benchmarks, online monitoring, RLHF pipelines, and safety guardrails.

The standard approach is to prompt a frontier model with the input and parse the verdict from its output. It is a quick-and-dirty way to keep AI in check: it applies universally and takes minutes to set up, but it suffers from compounding uncertainty across stacked LLM calls, tail latencies in the double-digit seconds, and cost that scales linearly with call volume. Teams are left torn between continuing manual evals and actually trusting their judges at scale.

Why It's Broken

The root cause is a structural mismatch. Judging tasks are classification and regression (discriminative) problems with small, closed label spaces. The standard implementation applies a generative model to this discriminative task, sampling from a vocabulary of roughly 100,000 tokens and collapsing the result back to two (or a few) choices. The wasted computation and unnecessary noise make generative judging expensive, slow, and unreliable.
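To make the mismatch concrete, here is a minimal sketch of what "collapsing back to two choices" amounts to. It assumes a toy six-token vocabulary and that each verdict is a single token; the names `collapse_to_binary` and `off_label_mass` are illustrative and not taken from any library.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def collapse_to_binary(vocab_logits, pos_id, neg_id):
    """Return P(positive) after discarding every non-verdict token.

    pos_id / neg_id are the indices of the two verdict tokens
    (assumed here to be single tokens).
    """
    p_pos, _ = softmax([vocab_logits[pos_id], vocab_logits[neg_id]])
    return p_pos

def off_label_mass(vocab_logits, pos_id, neg_id):
    """Probability mass the generative model puts on tokens that are not verdicts."""
    probs = softmax(vocab_logits)
    return 1.0 - probs[pos_id] - probs[neg_id]

# Toy vocabulary: indices 0 and 1 stand in for the two verdict tokens.
toy_logits = [2.0, 1.0, 0.5, 0.5, 0.0, -1.0]
p_helpful = collapse_to_binary(toy_logits, pos_id=0, neg_id=1)
wasted = off_label_mass(toy_logits, pos_id=0, neg_id=1)
print(f"P(helpful) = {p_helpful:.3f}, off-label mass = {wasted:.3f}")
```

Only the difference between the two verdict logits affects the decision; everything else the model computes over the vocabulary is either discarded or becomes a source of parsing failures.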

A second, deeper problem is the objective mismatch. Prompting the same frontier LLM that generated the output with "Is this helpful?" asks it to answer from the same distribution, not from the knowledge of specific labelers whose judgments you're aiming to emulate. These distributions may overlap but aren't identical (hence the need for judgement), and no prompt, few-shot example, or reasoning chain can arrive at labeler-specific signal the model was never exposed to.

The ceiling is set by calibration, not capability, which is why performance across models clusters within the same window. And the initial structural mismatch prevents post-training from being meaningful because of the intermediate noise.

The Solution: Decision Language Models

At Emissary, we're committed to bringing the best of ML to AI. LLMs are quick, ML is reliable.

So we replace the LLM's language-modeling head with a discriminative head. The LM backbone supplies zero-shot generalization; the head supplies a closed output space that maps directly onto the judgment task. Inference is a single forward pass with no decoding, no format parsing, and no prompt sensitivity, and the head provides a direct loss signal that is easy to train against. The result is easy to calibrate, fast, and cheap.

But this gives rise to a transition problem: an uncanny valley between 0 and 100 labeled samples, where a prompted frontier LLM is better than the Decision-LM. So we created Semantic Initialization, a way to seed Decision-LMs with the knowledge already present in the underlying LLM, so that they are as capable at zero-shot while keeping the latency and cost gains.
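As a rough illustration of the idea, the sketch below shows a binary head applied to a final hidden state. It is a toy under stated assumptions, not Emissary's implementation: `DecisionHead`, the three-dimensional hidden state, and the weights are all hypothetical, and in practice the hidden state would come from the backbone's forward pass.

```python
import math
from dataclasses import dataclass
from typing import List

@dataclass
class DecisionHead:
    """Binary head: one weight vector and a bias over the final hidden state."""
    weights: List[float]
    bias: float = 0.0

    def logit(self, hidden: List[float]) -> float:
        return sum(w * h for w, h in zip(self.weights, hidden)) + self.bias

    def predict_proba(self, hidden: List[float]) -> float:
        # Sigmoid turns the single logit into P(HELPFUL).
        return 1.0 / (1.0 + math.exp(-self.logit(hidden)))

    def predict(self, hidden: List[float], threshold: float = 0.5) -> str:
        return "HELPFUL" if self.predict_proba(hidden) >= threshold else "NOT_HELPFUL"

head = DecisionHead(weights=[0.5, -1.0, 2.0], bias=0.1)
hidden_state = [1.0, 0.5, 0.25]  # stand-in for the backbone's last-token hidden state
verdict = head.predict(hidden_state)
print(verdict, round(head.predict_proba(hidden_state), 3))
```

The output is a probability over a closed label set, so there is nothing to decode or parse, and the threshold can be tuned directly against labeled data.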

Semantic Initialization enables the custom heads of language models to extract informational state from base LLMs through logit distribution analysis. By examining the logit behavior of models across a distribution of prompts, we can set the head weights in a manner that enables the models to match the performance of their base counterparts, with no labelled data.
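The post does not spell out the procedure, so the following is only one plausible reading, shown as a hedged sketch: seed a binary head with the difference between the language-modeling head's rows for the two verdict tokens, so that the head's logit equals the base model's logit gap between those tokens. The matrix `lm_head`, the token indices, and `semantic_init` are hypothetical, and the actual method (which analyzes logits across a distribution of prompts) may differ substantially.

```python
def lm_logits(lm_head, hidden):
    """Vocabulary logits: one dot product per row of the LM head."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in lm_head]

def semantic_init(lm_head, pos_id, neg_id):
    """Head weights = (row of positive verdict token) - (row of negative verdict token)."""
    return [p - n for p, n in zip(lm_head[pos_id], lm_head[neg_id])]

# Toy LM head with a 3-token vocabulary and hidden size 3 (all values made up).
lm_head = [
    [0.2, -0.1, 0.4],   # row 0: assumed "HELPFUL" token
    [-0.3, 0.5, 0.1],   # row 1: assumed "NOT_HELPFUL" token
    [0.0, 0.2, -0.2],   # row 2: an unrelated token
]
hidden = [1.0, 2.0, -1.0]

head_weights = semantic_init(lm_head, pos_id=0, neg_id=1)
head_logit = sum(w * h for w, h in zip(head_weights, hidden))
logits = lm_logits(lm_head, hidden)
print(head_logit, logits[0] - logits[1])
```

Under this assumption the head reproduces the base model's preference between the two verdict tokens without any labeled data, which is the property the paragraph above describes.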

Unlike classical ML workloads, AI workloads accumulate labels incrementally, so we created three further, graduated learning modes: few-sample head-only training (5-50 labels), warm-start LoRA (~100 labels, regression-based initialization), and full LoRA (1,000+ labels). AI engineers can move between them as they generate more labelled feedback, going from decent to good to great, all in one place.
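Head-only training can be pictured as logistic regression on frozen backbone features. The sketch below assumes exactly that and uses made-up two-dimensional features; `train_head`, its hyperparameters, and the toy data are illustrative, and the `init_weights` argument is where a seeded head would plug in.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_head(features, labels, init_weights, init_bias=0.0, lr=0.5, epochs=200):
    """Fit a logistic head on frozen features with full-batch gradient descent."""
    weights, bias = list(init_weights), init_bias
    n = len(features)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(features, labels):
            # Gradient of the log loss with respect to the logit is (p - y).
            err = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            grad_w = [g + err * xi for g, xi in zip(grad_w, x)]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

def predict(weights, bias, x):
    return 1 if sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) >= 0.5 else 0

# Toy frozen features (stand-ins for backbone hidden states) and binary labels.
features = [[1.0, 0.2], [0.9, 0.1], [0.1, 0.8], [0.2, 1.0]]
labels = [1, 1, 0, 0]
weights, bias = train_head(features, labels, init_weights=[0.0, 0.0])
train_preds = [predict(weights, bias, x) for x in features]
print(train_preds)
```

Because the backbone is never updated, a handful of labels cannot degrade its general behavior; only the small head moves.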

Why These Four Modes

The four modes are not arbitrary. They map to the four data regimes any team building an LLM-as-Judge will be in:

| Data Regime | Mode | Why |
| --- | --- | --- |
| No labels yet | Zero-shot | Drop-in, matches frontier accuracy |
| A handful of labels | Low-sample (head-only) | Stable, fast, no backbone risk |
| ~100 labels | Warm-start LoRA | Closed-form init unlocks small-N |
| >=1000 labels | Full LoRA fine-tune | Breaks the prompting ceiling |

The progression is smooth. A team can start at zero-shot Emissary on day one, collect 100 labels over a week, correct mistakes and move to warm-start, then graduate to full fine-tune as their label set grows — without changing inference infrastructure or output interface.
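The mapping from label count to mode can be written as a small dispatch function. This is a sketch only: the table gives anchor points (0, a handful, ~100, >=1000), and the exact cut-offs between them used below are assumptions, as is the name `choose_mode`.

```python
def choose_mode(n_labels: int) -> str:
    """Pick a training mode from the number of labeled examples available.

    Boundaries between the anchor points in the table are assumed, not sourced.
    """
    if n_labels < 0:
        raise ValueError("n_labels must be non-negative")
    if n_labels == 0:
        return "zero-shot"
    if n_labels < 100:
        return "low-sample (head-only)"
    if n_labels < 1000:
        return "warm-start LoRA"
    return "full LoRA fine-tune"

for n in (0, 20, 100, 1000):
    print(n, "->", choose_mode(n))
```

Since every mode shares the same inference path and output interface, switching modes changes only how the weights were produced.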

Experimental Setup

Datasets

We evaluate on two standard LLM-as-Judge benchmarks, each cast as a balanced binary classification task.

  • MT-Bench. We use the score:1/10 split, mapping the original 10-point scale to a binary helpful / not-helpful label. We sample 1,000 examples for training (500 positive / 500 negative), 100 for the low-sample regime (50/50), and 1,000 for testing (500/500). All splits are class-balanced.
  • UltraFeedback 64K. We use the score:1/5 split, mapping the original 5-point scale to binary. Sampling and class balance match MT-Bench: 1,000 train / 100 low-sample / 1,000 test, all 50/50.
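A class-balanced split of this kind can be sketched as follows. The binarization threshold is an assumption (the exact score-to-label mapping is not given above), and `binarize`, `balanced_sample`, and the toy scores are illustrative.

```python
import random

def binarize(score, threshold):
    """Map a numeric score to 1 (helpful) or 0 (not helpful). Threshold is assumed."""
    return 1 if score >= threshold else 0

def balanced_sample(examples, n_total, seed=0):
    """Draw n_total examples with an exact 50/50 label split.

    examples: list of (item, label) pairs with label in {0, 1}.
    """
    rng = random.Random(seed)
    per_class = n_total // 2
    picked = []
    for label in (0, 1):
        pool = [ex for ex in examples if ex[1] == label]
        if len(pool) < per_class:
            raise ValueError(f"not enough examples with label {label}")
        picked.extend(rng.sample(pool, per_class))
    rng.shuffle(picked)
    return picked

# Toy data: 30 responses with scores 1..10, binarized at an assumed threshold of 6.
scores = [s for s in range(1, 11) for _ in range(3)]
examples = [(f"response-{i}", binarize(s, threshold=6)) for i, s in enumerate(scores)]
low_sample_split = balanced_sample(examples, n_total=10)
print(sum(label for _, label in low_sample_split), "positives out of", len(low_sample_split))
```

In a real pipeline the train, low-sample, and test splits would also need to be disjoint, which this sketch does not enforce.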

This paper covers binary judgment: helpful / not-helpful, pass / fail, safe / unsafe, prefer-A / prefer-B. Binary judgment covers a large share of LLM-as-Judge use cases in practice, and it is the cleanest setting in which to demonstrate that classification beats generation for closed-set decisions.

We are concurrently developing an Emissary variant for scoring judges - scalar and ordinal scores rather than binary labels. Results from that work will be published separately.

Baselines

We use Qwen3-8B as our base model. We compare Emissary's Decision-LM (DLM) against two baselines:

  1. Base-model prompting. Zero-shot and k-shot prompting using the same base model that we apply our techniques to. This isolates the effect of the classification-first approach from the effect of model choice.
  2. Frontier LLM APIs. GPT-5.5, Sonnet 4.6, Opus 4.7, and Gemini 3 Flash. Each is evaluated at 0-shot, 1-shot, and 2-shot, with thinking modes both enabled and disabled where applicable. The prompts used are reproduced in Appendix B.

Metrics

  • Accuracy. Agreement with held-out human labels. Because the task is calibrated to a specific labeler population, accuracy here is the most direct measure of human alignment for the judge.
  • Latency. End-to-end wall-clock time per judgment, reported as P50 and P99 in milliseconds. For both API baselines and Emissary, this includes network round-trip.
  • Cost. US dollars per 1,000 judgments at published API pricing as of the experiment date. Emissary's per-call inference cost is dominated by amortized infrastructure rather than per-call billing.
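The three metrics are simple to compute from per-judgment logs. The sketch below uses the nearest-rank percentile definition, which is an assumption since the percentile method is not stated above; the function names and toy numbers are illustrative.

```python
import math

def accuracy(preds, labels):
    """Fraction of judgments that agree with the human labels."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def percentile(values, q):
    """Nearest-rank percentile for q in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]

def cost_per_1k(total_cost_usd, n_judgments):
    """Normalize total spend to US dollars per 1,000 judgments."""
    return 1000 * total_cost_usd / n_judgments

# Toy per-judgment latencies in milliseconds, with one slow outlier.
latencies_ms = [60, 62, 65, 66, 70, 71, 75, 80, 90, 400]
print("accuracy:", accuracy([1, 0, 1, 1], [1, 0, 0, 1]))
print("P50:", percentile(latencies_ms, 50), "P99:", percentile(latencies_ms, 99))
print("cost / 1K:", cost_per_1k(0.03, 500))
```

Reporting P99 alongside P50 matters for judges in the request path, because a single slow outlier like the one above is invisible in the median.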

Results

Zero-Shot Parity: Frontier Capability Is Not the Bottleneck

Emissary's zero-shot configuration (Qwen3-8B as a classifier) sits inside the same accuracy cluster as frontier APIs - at a fraction of the latency and cost. The plateau is the prompting ceiling, set by the interface, not the model. More parameters, reasoning compute, or a different provider doesn't move it.

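| Method | MT-Bench accuracy | UltraFeedback accuracy | P50 (ms) | Cost / 1K |
| --- | --- | --- | --- | --- |
| Emissary zero-shot | 0.906 | 0.834 | 65 | $0.06 |
| Sonnet 4.6 0-shot | 0.910 | 0.835 | 1,148 | $2.43 |
| Opus 4.7 0-shot | 0.907 | 0.838 | 1,433 | $5.82 |
| Gemini 3 Flash 0-shot | 0.901 | 0.812 | 1,127 | $0.38 |
| GPT-5.5 0-shot | 0.864 | 0.789 | 902 | $3.67 |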

Post-Training Breaks the Ceiling

| Method | MT-Bench accuracy | UltraFeedback accuracy |
| --- | --- | --- |
| Emissary post-train cold-1000 | 0.979 | 0.955 |
| Emissary post-train warm-1000 | 0.972 | 0.949 |
| Best frontier (Opus 4.7 2-shot, think ON) | 0.945 | 0.838 |

When labeled data is available, Emissary DLMs move well past the prompting cluster. The UltraFeedback gap (+11.7 points over the best frontier) is the more telling number - that rubric's labeler-specific signal is only recoverable through training.

Warm-Start Makes 100 Labels Sufficient, with Diminishing Value as Labels Grow

| Method | MT-Bench accuracy | UltraFeedback accuracy |
| --- | --- | --- |
| Emissary post-train cold-100 | 0.857 | 0.833 |
| Emissary post-train warm-100 | 0.930 | 0.889 |

DLM warm-start at 100 examples already exceeds every frontier API on UltraFeedback, collapsing the labeled-data requirement from "thousands" to "low hundreds" - achievable in a single labeling session. At 1,000 labels (see the previous section), the advantage disappears: cold-start scores slightly higher than warm-start on both benchmarks (0.979 vs. 0.972 on MT-Bench, 0.955 vs. 0.949 on UltraFeedback).

Reasoning Compute Does Not Reliably Help

Thinking mode produced inconsistent, often negligible accuracy changes while multiplying latency 2-10x and cost 1.3-4x. GPT-5.5 thinking-ON scored lower than thinking-OFF across all shot counts. The pattern holds for both frontier APIs and Qwen3-8B generation.

MT-Bench - thinking ON vs. OFF (representative rows)

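| Model | Thinking | Accuracy | P50 (ms) | P99 (ms) | Cost / 1K |
| --- | --- | --- | --- | --- | --- |
| GPT-5.5 1-shot | OFF | 0.872 | 906 | 1,850 | $7.81 |
| GPT-5.5 1-shot | ON | 0.862 | 2,355 | 12,301 | $12.01 |
| Sonnet 4.6 1-shot | OFF | 0.936 | 1,231 | 4,520 | $5.20 |
| Sonnet 4.6 1-shot | ON | 0.928 | 1,219 | 18,235 | $5.87 |
| Opus 4.7 2-shot | OFF | 0.933 | 1,551 | 5,336 | $16.65 |
| Opus 4.7 2-shot | ON | 0.945 | 1,740 | 6,951 | $16.87 |
| Qwen3-8B gen 1-shot | OFF | 0.943 | 284 | 380 | $0.24 |
| Qwen3-8B gen 1-shot | ON | 0.948 | 3,891 | 25,555 | $4.46 |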

The Pareto Picture

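Emissary's post-trained cold-1000 configuration Pareto-dominates every frontier row below: higher accuracy, roughly 14-27x lower P50 latency, and roughly 40-280x lower cost. The zero-shot and warm-100 configurations share that latency and cost while landing inside the frontier accuracy range (0.906 and 0.930 against 0.864-0.945). The Pareto frontier on this problem is occupied by purpose-built classifiers, not frontier LLMs.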

| Configuration | Accuracy | P50 (ms) | Cost / 1K |
| --- | --- | --- | --- |
| Emissary post-train cold-1000 | 0.979 | 65 | $0.06 |
| Emissary post-train warm-100 | 0.930 | 65 | $0.06 |
| Emissary zero-shot | 0.906 | 65 | $0.06 |
| Opus 4.7 2-shot, think ON | 0.945 | 1,740 | $16.87 |
| Sonnet 4.6 1-shot, think OFF | 0.936 | 1,231 | $5.20 |
| Sonnet 4.6 0-shot, think OFF | 0.910 | 1,148 | $2.43 |
| GPT-5.5 0-shot, think OFF | 0.864 | 902 | $3.67 |
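Pareto dominance over the rows of this table can be checked mechanically. The sketch below copies the table's numbers and treats accuracy as higher-is-better and latency and cost as lower-is-better; the `Config` type and function names are illustrative.

```python
from typing import NamedTuple

class Config(NamedTuple):
    name: str
    accuracy: float
    p50_ms: float
    cost_per_1k: float

def dominates(a: Config, b: Config) -> bool:
    """True if a is no worse than b on every axis and strictly better on at least one."""
    no_worse = (a.accuracy >= b.accuracy and a.p50_ms <= b.p50_ms
                and a.cost_per_1k <= b.cost_per_1k)
    better = (a.accuracy > b.accuracy or a.p50_ms < b.p50_ms
              or a.cost_per_1k < b.cost_per_1k)
    return no_worse and better

def pareto_front(configs):
    """Configurations that no other configuration dominates."""
    return [c for c in configs if not any(dominates(o, c) for o in configs)]

ROWS = [
    Config("Emissary post-train cold-1000", 0.979, 65, 0.06),
    Config("Emissary post-train warm-100", 0.930, 65, 0.06),
    Config("Emissary zero-shot", 0.906, 65, 0.06),
    Config("Opus 4.7 2-shot, think ON", 0.945, 1740, 16.87),
    Config("Sonnet 4.6 1-shot, think OFF", 0.936, 1231, 5.20),
    Config("Sonnet 4.6 0-shot, think OFF", 0.910, 1148, 2.43),
    Config("GPT-5.5 0-shot, think OFF", 0.864, 902, 3.67),
]
front = pareto_front(ROWS)
print([c.name for c in front])
```

On these three axes the cold-1000 row dominates every other row, including the other Emissary rows. The table does not capture labeling effort, which is the axis on which zero-shot and warm-100 earn their place.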

Conclusion

The diagnosis predicted three things: a flat accuracy cluster across frontier models, a ceiling break when training on actual labels, and thinking modes that fail to help reliably. All three appear in the data.

Pareto dominance follows from using the right tool for the job. Accuracy improves by training on the exact decision boundary. Latency drops because a single forward pass replaces autoregressive decoding. Cost falls because inference runs on owned hardware rather than per-token billing. None of these gains require novel research - they require taking the structure of the problem seriously.

The standard advice - "use the strongest frontier model you can afford" - is wrong for closed-set judgment tasks. The binding constraint is calibration to labelers, not model capability, and calibration requires effective training, not scaling. Every additional dollar spent on a larger frontier model, a longer reasoning chain, or a more elaborate prompt is a dollar spent moving along a ceiling rather than through it. The teams that recognize this early will ship faster, evaluate more, and trust their pipelines more than the teams that don't.

The implications go beyond cost savings. A judge that runs in 65ms instead of 1,700ms is no longer a batch-time artifact - it becomes a real-time component you can put in the inference path, in safety guardrails, in router logic, in online RLHF loops. A judge that costs $0.06 per thousand calls instead of $16.87 makes 100x more evaluation economically viable, which changes what teams can measure and how often. Quality stops being something you sample and starts being something you observe continuously.

Get Started with Emissary

If you're running LLM judges in production today - for evals, monitoring, guardrails, or RLHF - you're almost certainly on the wrong side of the Pareto frontier. We'd love to help you move.

  • Try Emissary on your own data. Bring a metric, criteria, and optionally a few samples; we'll show you the zero-shot, warm-start, and full fine-tune numbers side-by-side against whatever frontier baseline you're using now. Or you can try it out yourself: withemissary.com/demo
  • Talk to us about scoring judges. If your use case needs scalar or ordinal scores rather than binary labels, our forthcoming variant is in active development and we're onboarding design partners.

Reach out to us to book a technical deep-dive. The ceiling is real, but it isn't yours to live under.

Appendix A: Full Results

A.1 MT-Bench - Frontier APIs

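| Model | Shots | Thinking | Accuracy | P50 (ms) | P99 (ms) | Cost / 1K |
| --- | --- | --- | --- | --- | --- | --- |
| GPT-5.5 | 0 | OFF | 0.864 | 902 | 2,765 | $3.67 |
| GPT-5.5 | 0 | ON | 0.856 | 2,344 | 12,604 | $7.80 |
| GPT-5.5 | 1 | OFF | 0.872 | 906 | 1,850 | $7.81 |
| GPT-5.5 | 1 | ON | 0.862 | 2,355 | 12,301 | $12.01 |
| GPT-5.5 | 2 | OFF | 0.870 | 900 | 2,533 | $10.60 |
| GPT-5.5 | 2 | ON | 0.861 | 2,421 | 15,116 | $15.04 |
| Sonnet 4.6 | 0 | OFF | 0.910 | 1,148 | 4,244 | $2.43 |
| Sonnet 4.6 | 0 | ON | 0.906 | 1,223 | 16,426 | $3.20 |
| Sonnet 4.6 | 1 | OFF | 0.936 | 1,231 | 4,520 | $5.20 |
| Sonnet 4.6 | 1 | ON | 0.928 | 1,219 | 18,235 | $5.87 |
| Sonnet 4.6 | 2 | OFF | 0.927 | 1,281 | 3,745 | $7.32 |
| Sonnet 4.6 | 2 | ON | 0.926 | 1,263 | 18,128 | $8.03 |
| Opus 4.7 | 0 | OFF | 0.907 | 1,433 | 5,358 | $5.82 |
| Opus 4.7 | 0 | ON | 0.912 | 1,611 | 8,928 | $6.18 |
| Opus 4.7 | 1 | OFF | 0.938 | 1,519 | 5,253 | $11.68 |
| Opus 4.7 | 1 | ON | 0.936 | 1,668 | 7,032 | $11.90 |
| Opus 4.7 | 2 | OFF | 0.933 | 1,551 | 5,336 | $16.65 |
| Opus 4.7 | 2 | ON | 0.945 | 1,740 | 6,951 | $16.87 |
| Gemini 3 Flash | 0 | OFF | 0.901 | 1,127 | 3,894 | $0.38 |
| Gemini 3 Flash | 0 | ON | 0.902 | 2,623 | 10,141 | $1.54 |
| Gemini 3 Flash | 1 | OFF | 0.887 | 1,126 | 4,306 | $0.82 |
| Gemini 3 Flash | 1 | ON | 0.908 | 2,672 | 12,636 | $1.99 |
| Gemini 3 Flash | 2 | OFF | 0.898 | 1,179 | 4,868 | $1.03 |
| Gemini 3 Flash | 2 | ON | 0.914 | 2,403 | 13,157 | $2.09 |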

A.2 MT-Bench - Qwen3-8B Generation Baseline

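| Shots | Thinking | Accuracy | P50 (ms) | P99 (ms) | Cost / 1K |
| --- | --- | --- | --- | --- | --- |
| 0 | OFF | 0.937 | 284.27 | 384.34 | $0.24 |
| 0 | ON | 0.9378 | 3,577.03 | 20,394.03 | $4.07 |
| 1 | OFF | 0.943 | 283.95 | 380.34 | $0.24 |
| 1 | ON | 0.9478 | 3,891.20 | 25,555.16 | $4.46 |
| 2 | OFF | 0.940 | 285.89 | 385.07 | $0.24 |
| 2 | ON | 0.9447 | 3,984.79 | 29,678.37 | $4.66 |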

A.3 MT-Bench - Emissary Decision-LM (Qwen3-8B backbone)

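| Mode | Train labels | Accuracy | P50 (ms) | P99 (ms) | Cost / 1K |
| --- | --- | --- | --- | --- | --- |
| Zero-shot (Semantic Init) | 0 | 0.906 | 65.41 | 124.44 | $0.06 |
| 5-shot | 0 | 0.767 | - | - | - |
| Cold LoRA (5 ep) | 100 | 0.857 | 53.67 | 136.87 | $0.05 |
| Warm-start LoRA (5 ep) | 100 | 0.930 | - | - | - |
| Cold LoRA (4 ep) | 1,000 | 0.979 | - | - | - |
| Warm-start LoRA (5 ep) | 1,000 | 0.972 | - | - | - |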

A.4 UltraFeedback - Frontier APIs

| Model | Accuracy |
| --- | --- |
| GPT-5.5 | 0.789 |
| Sonnet 4.6 | 0.835 |
| Opus 4.7 | 0.838 |
| Gemini 3 Flash | 0.8118 |

A.5 UltraFeedback - Qwen3-8B Generation Baseline

| Method | Accuracy |
| --- | --- |
| Qwen3-8B generation (no-think) | 0.884 |
| Qwen3-8B generation (think) | 0.8855 |

A.6 UltraFeedback - Emissary Decision-LM (Qwen3-8B backbone)

| Mode | Train labels | Accuracy |
| --- | --- | --- |
| Zero-shot (Semantic Init) | 0 | 0.834 |
| 5-shot | 0 | 0.8607 |
| Cold LoRA (5 ep) | 100 | 0.833 |
| Warm-start LoRA (5 ep) | 100 | 0.889 |
| Cold LoRA (4 ep) | 1,000 | 0.955 |
| Warm-start LoRA (5 ep) | 1,000 | 0.949 |

Appendix B: Prompts Used for Frontier API Baselines

B.1 MT-Bench Prompt

System:

You are an impartial MT-Bench-style judge. Decide whether the candidate answer is helpful for the latest user request.

Use MT-Bench-style response-quality factors: helpfulness, relevance, accuracy, depth, appropriate creativity, and level of detail.

Return HELPFUL if the candidate answer substantially satisfies the latest user request, stays on task, is factually and logically correct, and provides enough useful detail for the task. Minor wording or style issues are acceptable only when they do not affect usefulness.

Return NOT_HELPFUL if the candidate answer fails a key requirement, is incorrect, irrelevant, evasive, refuses without good reason, contradicts the provided context, or is too incomplete or shallow to be useful.

For math, coding, reasoning, and factual tasks, prioritize correctness and whether the final answer solves the requested problem. If previous conversation context is provided, use it only to understand the latest user request; judge the candidate answer itself.

Output exactly one token: HELPFUL or NOT_HELPFUL. No punctuation, no explanation.

User:

Task / conversation context:
Previous conversation:
User: ...
Assistant: ...

Latest user request:
...

Candidate answer:
<candidate answer>

Final verdict (HELPFUL or NOT_HELPFUL):

Few-shot examples are appended in the standard Example N: ... Final verdict: ... format before the test item.

B.2 UltraFeedback Prompt

System:

You are evaluating whether an AI assistant's response is helpful.

A response is HELPFUL only when it meets ALL of these criteria:
1. Clarity and Relevance: Does the response directly address the task and remain on-topic?
2. Useful and Comprehensive Information: Does the response provide relevant background, reasoning, or detailed explanation that improves understanding?
3. Not Lengthy, No Repetition: Is the response concise without unnecessary repetition while still being comprehensive?

If the response clearly satisfies all three, answer HELPFUL.
If it clearly fails on any one - off-topic, vague/incomplete, or bloated/repetitive - answer NOT_HELPFUL.

Output exactly one word: HELPFUL or NOT_HELPFUL. No punctuation, no explanation.

User:

Evaluate the following instruction-response pair against the helpfulness criteria.

Instruction:
{question}

Response:
{answer}

Verdict (HELPFUL or NOT_HELPFUL):
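Both prompts ask for a single verdict word, so the baseline needs a parser on the way out. The function below is a hypothetical sketch of one strict parser, not the one used in these experiments; it matches the whole normalized string because a substring check would wrongly read NOT_HELPFUL as HELPFUL.

```python
from typing import Optional

def parse_verdict(raw: str) -> Optional[bool]:
    """Map raw model text to True (HELPFUL), False (NOT_HELPFUL), or None if unparseable."""
    # Trim whitespace and stray punctuation, then normalize case and separators.
    token = raw.strip().strip(".!\"'`").upper().replace(" ", "_").replace("-", "_")
    if token == "NOT_HELPFUL":
        return False
    if token == "HELPFUL":
        return True
    return None  # Anything else is a format failure the caller must handle.

for sample in ["HELPFUL", " not_helpful.\n", "Not Helpful", "I think it is HELPFUL"]:
    print(repr(sample), "->", parse_verdict(sample))
```

The `None` branch is the failure mode a classification head avoids entirely: every forward pass yields a label, so there are no unparseable outputs to retry or discard.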

© 2026 Emissary. All rights reserved.