Judge with hidden states: use small models' internal vectors instead of prompting large LLMs

Overview

Decision Snapshot: Ready For Pilot

The idea is practical and validated on several benchmarks; binary filtering is ready for production trials but multiclass scoring and cross-domain transfer need more validation.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 60%

Authors

Zhuochun Li, Yong Zhang, Ming Li, Yuelyu Ji, Yiming Zeng, Ning Cheng, Yun Zhu, Yanmeng Wang, Shaojun Wang, Jing Xiao, Daqing He

Links

Abstract / PDF / Code / Data

Why It Matters For Business

You can replace expensive LLM-judge pipelines with cheap probes on small open models to filter and evaluate data at far lower cost while keeping most practical value for downstream fine-tuning.

Who Should Care

ML Engineer, Data Scientist, Product Manager

Summary TLDR

The paper argues that judging model outputs needs less semantic capacity than generating them. Instead of prompting big LLMs to score answers, the authors probe internal hidden states of small LMs and train light classifiers to predict aspect-level scores from a strong LLM judge. Their INSPECTOR pipeline beats prompt-based small-model evaluation by >20% F1 on reasoning benchmarks and gives reliable binary filters (80–90% F1). Probing works best with mean-pooled PCA features and simple linear classifiers and helps filter training data for fine-tuning with quality comparable to using a large LLM filter.

Problem Statement

Prompting large LLMs to evaluate outputs is costly, opaque, and brittle. Small open models give poor prompt-based evaluations, but may still encode evaluative cues in hidden states. The paper asks whether those latent representations can be probed to produce cheap, reliable evaluations.

Main Contribution

Formalize the Semantic Capacity Asymmetry Hypothesis: evaluation needs less semantic capacity than generation and can be read from intermediate representations.

Introduce Representation-as-a-Judge and INSPECTOR: a pipeline that probes small-LM hidden states and trains lightweight classifiers to match a strong LLM judge.

Key Findings

Probing small-model hidden states improves evaluation F1 over prompt-based inference by a large margin.

Numbers: Average F1 increased by >20% on most tasks

Practical Use: If you currently prompt small models to rate outputs, switch to hidden-state probing to get substantially better scores for the same models.

Evidence Ref: Fig.3; Table 10

Binary (high vs low quality) probes are highly reliable.

Numbers: Binary F1 typically 80–92% across models/datasets

Practical Use: Use probing classifiers as a cheap, dependable coarse filter for dataset curation before expensive annotation or fine-tuning.

Evidence Ref: Table 10 (binary-class rows)

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Improvement over prompt-based inference	>20% average F1 increase	prompt-based small-LM inference	>20% F1	GSM8K, MATH, GPQA (weighted avg)	Fig.3; Section 4.2	Fig.3; Table 10
Binary classification F1 (probing)	≈80–92%	prompt-based or tuned small models	substantially higher	various benchmarks (see Table 10)	Table 10 binary rows	Table 10

What To Try In 7 Days

Run a quick probe: extract mean-pooled hidden states from your small LM on 100 example (prompt, response) pairs, apply PCA(50), and train a logistic regression to match an LLM judge's scores.

Use the probe as a binary filter (threshold ≥4) to pick high-quality responses and run a small supervised fine-tune on the filtered data.

Measure costs: compare inference time and API spend versus your current prompt-based LLM-as-judge to quantify savings.
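The first two steps above can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' INSPECTOR code: the helper names `mean_pool` and `fit_probe` are hypothetical, random arrays stand in for real hidden states and judge scores, and a 1–5 judge scale is assumed so that "score ≥ 4" defines the high-quality class.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline


def mean_pool(hidden, mask):
    """Average token vectors over non-padding positions.

    hidden: (batch, seq_len, dim) hidden states from one layer.
    mask:   (batch, seq_len), 1 for real tokens and 0 for padding.
    """
    mask = mask[..., None].astype(hidden.dtype)
    return (hidden * mask).sum(axis=1) / np.clip(mask.sum(axis=1), 1.0, None)


def fit_probe(features, judge_scores, threshold=4, n_components=50, seed=0):
    """Fit PCA + logistic regression to predict 'judge score >= threshold'."""
    labels = (np.asarray(judge_scores) >= threshold).astype(int)
    x_tr, x_te, y_tr, y_te = train_test_split(
        features, labels, test_size=0.2, stratify=labels, random_state=seed
    )
    # PCA cannot keep more components than training samples or feature dims.
    k = min(n_components, *x_tr.shape)
    probe = make_pipeline(
        PCA(n_components=k, random_state=seed),
        LogisticRegression(max_iter=1000),
    )
    probe.fit(x_tr, y_tr)
    return probe, f1_score(y_te, probe.predict(x_te))


# Synthetic stand-in data (assumption): 100 pairs, 12 tokens, 64-dim states.
rng = np.random.default_rng(0)
n, seq_len, dim = 100, 12, 64
scores = rng.integers(1, 6, size=n)            # hypothetical judge scores, 1..5
hidden = rng.normal(size=(n, seq_len, dim))
hidden[:, :, 0] += scores[:, None] - 3         # plant a quality signal in dim 0
mask = np.ones((n, seq_len), dtype=int)
mask[:, -3:] = 0                               # pretend the last 3 tokens are padding

features = mean_pool(hidden, mask)
probe, f1 = fit_probe(features, scores)
keep = probe.predict(features) == 1            # binary filter: keep predicted high quality
print(f"held-out binary F1: {f1:.2f}, kept {keep.sum()} of {n}")
```

To try this on real data, replace `hidden`, `mask`, and `scores` with one layer's hidden states from your small LM, the tokenizer's attention mask, and the LLM judge's scores. With only 100 pairs the held-out split is tiny, so treat the F1 as a smoke test rather than an estimate.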

Optimization Features

Token Efficiency

decoding-free evaluation reduces token cost

Inference Optimization

avoid autoregressive decoding for evaluation

cache hidden states for repeated probing
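One way to combine decoding-free extraction with caching is sketched below. It assumes the Hugging Face `transformers` and `torch` packages for the extraction step; the function names `hf_featurizer` and `cached_features`, the on-disk `.npy` layout, and the `tag` argument are hypothetical choices for illustration, not part of the paper's released code.

```python
import hashlib
from pathlib import Path

import numpy as np


def hf_featurizer(model_name, layer=-1):
    """Build a function mapping texts to mean-pooled hidden states.

    Each batch costs one forward pass; generate() is never called.
    """
    import torch
    from transformers import AutoModel, AutoTokenizer

    tok = AutoTokenizer.from_pretrained(model_name)
    if tok.pad_token is None:          # some models ship without a pad token
        tok.pad_token = tok.eos_token
    model = AutoModel.from_pretrained(model_name).eval()

    @torch.no_grad()
    def featurize(texts):
        enc = tok(list(texts), return_tensors="pt", padding=True, truncation=True)
        out = model(**enc, output_hidden_states=True)
        hidden = out.hidden_states[layer]                       # (batch, seq, dim)
        mask = enc["attention_mask"].unsqueeze(-1).to(hidden.dtype)
        pooled = (hidden * mask).sum(1) / mask.sum(1).clamp(min=1)
        return pooled.float().cpu().numpy()

    return featurize


def cached_features(texts, featurize, cache_dir, tag=""):
    """Return one feature vector per text, computing only uncached ones.

    `tag` should identify the model and layer so caches do not collide.
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    paths = [
        cache_dir / (hashlib.sha256(f"{tag}\n{t}".encode("utf-8")).hexdigest() + ".npy")
        for t in texts
    ]
    missing = [i for i, p in enumerate(paths) if not p.exists()]
    if missing:
        fresh = featurize([texts[i] for i in missing])
        for i, vec in zip(missing, fresh):
            np.save(paths[i], vec)
    return np.stack([np.load(p) for p in paths])
```

A call such as `cached_features(texts, hf_featurizer(name), "cache/", tag=name)` would then pay for the forward pass once per text, and later probes for other evaluation aspects can reuse the stored vectors.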

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/zhuochunli/Representation-as-a-judge

Data URLs

GSM8K (Hugging Face); MATH (official repo); GPQA (public splits)

Risks & Boundaries

Limitations

Evaluation aspects and prompt templates are hand-chosen and may not cover all tasks or be optimal.

Experiments focus on mathematical/scientific reasoning; results may differ for commonsense, code, or dialog tasks.

When Not To Use

When you need reliable fine-grained (1–5) scores across very different domains without per-domain training.

When you require human-level nuanced justifications rather than coarse quality filtering.

Failure Modes

Probe overfits small balanced datasets and fails on real-world long-tailed distributions.

Probes replicate biases or blind spots of the LLM judge used as 'gold' labels.

Core Entities

Models

DeepSeek-V3 (M_large, judge); Llama-3-8B-Instruct (M_med generator); Qwen3-1.7B; Qwen3-0.6B; Llama-3.2-1B-Instruct; Llama-3.1-8B-Instruct; Llama-2-7B-Chat; RoBERTa

Metrics

Weighted average F1; Binary F1 (high vs low quality); Multiclass F1 (score 1–5)
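For readers unfamiliar with the first metric, the sketch below computes a support-weighted F1 in plain Python. It assumes the standard definition (per-class F1 averaged with weights proportional to each class's count in the reference labels); this summary does not state the paper's exact averaging, and the function name `weighted_f1` is hypothetical.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Mean of per-class F1 scores, weighted by class frequency in y_true."""
    support = Counter(y_true)
    total = 0.0
    for cls, n_cls in support.items():
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = n_cls - tp
        # F1 = 2*TP / (2*TP + FP + FN); the denominator is >= n_cls > 0 here.
        total += n_cls * (2 * tp / (2 * tp + fp + fn))
    return total / len(y_true)


# Three "high" (1) and one "low" (0) reference labels, one mistake.
print(round(weighted_f1([1, 1, 1, 0], [1, 1, 0, 0]), 4))  # 0.7667
```

In the example, class 1 scores F1 = 0.8 and class 0 scores F1 = 2/3, so the weighted value is (3 × 0.8 + 1 × 2/3) / 4. Because the weights follow class frequency, rare score levels have little influence on this metric, which is worth keeping in mind for long-tailed 1–5 score distributions.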

Datasets

GSM8K; MATH; GPQA; AlpacaEval 2.0

Benchmarks

GSM8K; MATH; GPQA; AlpacaEval 2.0

Context Entities

Models

GPT-style large LLMs (general reference); Sentinel / prior probing works (context)
