Judge with hidden states: use small models' internal vectors instead of prompting large LLMs

January 30, 2026 · 7 min

Overview

Decision Snapshot: Ready For Pilot

The idea is practical and validated on several benchmarks; binary filtering is ready for production trials but multiclass scoring and cross-domain transfer need more validation.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 60%

Authors

Zhuochun Li, Yong Zhang, Ming Li, Yuelyu Ji, Yiming Zeng, Ning Cheng, Yun Zhu, Yanmeng Wang, Shaojun Wang, Jing Xiao, Daqing He

Links

Abstract / PDF / Code / Data

Why It Matters For Business

You can replace expensive LLM-judge pipelines with cheap probes on small open models to filter and evaluate data at far lower cost while keeping most practical value for downstream fine-tuning.

Who Should Care

Summary TLDR

The paper argues that judging model outputs needs less semantic capacity than generating them. Instead of prompting big LLMs to score answers, the authors probe internal hidden states of small LMs and train light classifiers to predict aspect-level scores from a strong LLM judge. Their INSPECTOR pipeline beats prompt-based small-model evaluation by >20% F1 on reasoning benchmarks and gives reliable binary filters (80–90% F1). Probing works best with mean-pooled PCA features and simple linear classifiers and helps filter training data for fine-tuning with quality comparable to using a large LLM filter.
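The mean-pooling step mentioned above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `mean_pool`, the array shapes, and the use of an attention mask to skip padding are assumptions, and the toy arrays stand in for real activations from a small LM.

```python
import numpy as np

def mean_pool(hidden_states: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token vectors per sequence, ignoring padded positions.

    hidden_states: (batch, seq_len, dim) activations from one layer.
    attention_mask: (batch, seq_len), 1 for real tokens and 0 for padding.
    """
    mask = attention_mask[:, :, None].astype(hidden_states.dtype)
    summed = (hidden_states * mask).sum(axis=1)
    # Clip the token count so an all-padding row does not divide by zero.
    counts = np.clip(mask.sum(axis=1), 1.0, None)
    return summed / counts

# Toy input (hypothetical): 2 sequences, 3 token slots, 4-dimensional states.
states = np.arange(24, dtype=np.float64).reshape(2, 3, 4)
mask = np.array([[1, 1, 0], [1, 1, 1]])  # first sequence has one padded slot
pooled = mean_pool(states, mask)  # one fixed-size vector per sequence
```

Each (prompt, response) pair is reduced to a single vector, which is what the PCA step and the classifier would then consume.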

Problem Statement

Prompting large LLMs to evaluate outputs is costly, opaque, and brittle. Small open models give poor prompt-based evaluations, but may still encode evaluative cues in hidden states. The paper asks whether those latent representations can be probed to produce cheap, reliable evaluations.

Main Contribution

Formalize the Semantic Capacity Asymmetry Hypothesis: evaluation needs less semantic capacity than generation and can be read from intermediate representations.

Introduce Representation-as-a-Judge and INSPECTOR: a pipeline that probes small-LM hidden states and trains lightweight classifiers to match a strong LLM judge.

Key Findings

Probing small-model hidden states improves evaluation F1 over prompt-based inference by a large margin.

Numbers: Average F1 increased by >20% on most tasks

Practical Use: If you currently prompt small models to rate outputs, switch to hidden-state probing to get substantially better scores for the same models.

Evidence Ref: Fig. 3; Table 10

Binary (high vs low quality) probes are highly reliable.

Numbers: Binary F1 typically 80–92% across models/datasets

Practical Use: Use probing classifiers as a cheap, dependable coarse filter for dataset curation before expensive annotation or fine-tuning.

Evidence Ref: Table 10 (binary-class rows)
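To make the binary setting concrete, the sketch below collapses 1–5 scores into high/low labels and computes binary F1 against the judge. The ≥4 cutoff follows the threshold suggested elsewhere on this page; the helper names and the example scores are made up for illustration and are not taken from the paper.

```python
def binarize(scores, threshold=4):
    """Map 1-5 scores to 1 (high quality) or 0 (low quality)."""
    return [1 if s >= threshold else 0 for s in scores]

def binary_f1(y_true, y_pred):
    """F1 for the positive (high-quality) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical scores: the LLM judge acts as reference, the probe as prediction.
judge_scores = [5, 4, 2, 3, 4, 1]
probe_scores = [4, 5, 1, 4, 3, 2]
f1 = binary_f1(binarize(judge_scores), binarize(probe_scores))
```

Because the judge's labels serve as ground truth here, a high F1 only means the probe agrees with the judge, not that either is correct.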

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Improvement over prompt-based inference | >20% average F1 increase | prompt-based small-LM inference | >20% F1 | GSM8K, MATH, GPQA (weighted avg) | Fig. 3; Section 4.2 | Fig. 3; Table 10
Binary classification F1 (probing) | ≈80–92% | prompt-based or tuned small models | substantially higher | various benchmarks (see Table 10) | Table 10 binary rows | Table 10

What To Try In 7 Days

Run a quick probe: extract mean-pooled hidden states from your small LM on 100 example (prompt, response) pairs, reduce them to 50 dimensions with PCA, and train a logistic regression to match an LLM judge's labels.

Use the probe as a binary filter (keep responses scored ≥4 on the 1–5 scale) to pick high-quality responses, then run a small supervised fine-tune on the filtered data.

Measure costs: compare inference time and API spend versus your current prompt-based LLM-as-judge to quantify savings.
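The probe from the first step can be prototyped in a few lines. The sketch below is an assumption-laden stand-in, not the INSPECTOR implementation: it uses synthetic vectors in place of real mean-pooled hidden states, random 0/1 labels in place of binarized judge scores, and NumPy-only PCA and logistic regression so it runs without extra dependencies. In practice a library such as scikit-learn would be the more usual choice. All function names and hyperparameters are illustrative.

```python
import numpy as np

def fit_pca(X, n_components):
    """Return (mean, components) from an SVD of the centered data."""
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]

def fit_logreg(Z, y, lr=0.1, steps=500, l2=1e-3):
    """Binary logistic regression trained by full-batch gradient descent."""
    w, b = np.zeros(Z.shape[1]), 0.0
    for _ in range(steps):
        logits = np.clip(Z @ w + b, -30.0, 30.0)
        p = 1.0 / (1.0 + np.exp(-logits))
        grad = p - y
        w -= lr * (Z.T @ grad / len(y) + l2 * w)
        b -= lr * grad.mean()
    return w, b

def predict(X, mean, components, w, b):
    """Project into PCA space and apply the linear decision rule."""
    Z = (X - mean) @ components.T
    return (Z @ w + b > 0).astype(int)

# Synthetic stand-in for 100 mean-pooled hidden-state vectors (dim is made up).
rng = np.random.default_rng(0)
n, dim = 100, 64
direction = rng.normal(size=dim)
direction /= np.linalg.norm(direction)
labels = rng.integers(0, 2, size=n)  # stand-in for judge score >= 4
features = rng.normal(size=(n, dim)) + 6.0 * np.outer(labels - 0.5, direction)

# 80/20 split: fit PCA and the probe on train only, then score held-out pairs.
X_train, X_test = features[:80], features[80:]
y_train, y_test = labels[:80], labels[80:]
mean, components = fit_pca(X_train, n_components=50)
w, b = fit_logreg((X_train - mean) @ components.T, y_train)
accuracy = float((predict(X_test, mean, components, w, b) == y_test).mean())
```

Fitting PCA on the training split only keeps the held-out score honest. With real hidden states, the synthetic `features` array would be replaced by pooled activations from the small LM and `labels` by the judge's binarized scores.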

Optimization Features

Token Efficiency
decoding-free evaluation reduces token cost
Inference Optimization
avoid autoregressive decoding for evaluation
cache hidden states for repeated probing
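One way to read "cache hidden states for repeated probing" is to key extracted features by the (prompt, response) text so several probes can reuse a single forward pass. The sketch below is a hypothetical in-memory version: `HiddenStateCache` and `fake_extract` are illustrative names, and `fake_extract` is a placeholder for a real forward pass through the small LM.

```python
import hashlib

class HiddenStateCache:
    """Memoize extracted features per (prompt, response) pair."""

    def __init__(self, extract_fn):
        self._extract_fn = extract_fn
        self._store = {}
        self.misses = 0  # number of times the extractor actually ran

    @staticmethod
    def _key(prompt, response):
        # A NUL separator keeps ("ab", "c") distinct from ("a", "bc").
        payload = prompt + "\x00" + response
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get(self, prompt, response):
        key = self._key(prompt, response)
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._extract_fn(prompt, response)
        return self._store[key]

def fake_extract(prompt, response):
    # Placeholder for a forward pass; returns a tiny made-up feature vector.
    return [float(len(prompt)), float(len(response))]

cache = HiddenStateCache(fake_extract)
first = cache.get("2+2=?", "4")
second = cache.get("2+2=?", "4")  # served from the cache, no second extraction
```

For larger datasets the same idea would more likely be backed by on-disk arrays than a Python dict, but the keying logic stays the same.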

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Data URLs

GSM8K (Hugging Face)
MATH (official repo)
GPQA (public splits)

Risks & Boundaries

Limitations

Evaluation aspects and prompt templates are hand-chosen and may not cover all tasks or be optimal.

Experiments focus on mathematical/scientific reasoning; results may differ for commonsense, code, or dialog tasks.

When Not To Use

When you need reliable fine-grained (1–5) scores across very different domains without per-domain training.

When you require human-level nuanced justifications rather than coarse quality filtering.

Failure Modes

Probe overfits small balanced datasets and fails on real-world long-tailed distributions.

Probes replicate biases or blind spots of the LLM judge used as 'gold' labels.

Core Entities

Models

DeepSeek-V3 (M_large, judge)
Llama-3-8B-Instruct (M_med, generator)
Qwen3-1.7B
Qwen3-0.6B
Llama-3.2-1B-Instruct
Llama-3.1-8B-Instruct
Llama-2-7B-Chat
RoBERTa

Metrics

Weighted average F1
Binary F1 (high vs low quality)
Multiclass F1 (score 1–5)

Datasets

GSM8K
MATH
GPQA
AlpacaEval 2.0

Benchmarks

GSM8K
MATH
GPQA
AlpacaEval 2.0

Context Entities

Models

GPT-style large LLMs (general reference)
Sentinel / prior probing works (context)