Judge with hidden states: use small models' internal vectors instead of prompting large LLMs

January 30, 2026 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

0

Authors

Zhuochun Li, Yong Zhang, Ming Li, Yuelyu Ji, Yiming Zeng, Ning Cheng, Yun Zhu, Yanmeng Wang, Shaojun Wang, Jing Xiao, Daqing He

Why It Matters For Business

You can replace expensive LLM-judge pipelines with cheap probes on small open models to filter and evaluate data at far lower cost while keeping most practical value for downstream fine-tuning.

Summary TLDR

The paper argues that judging model outputs needs less semantic capacity than generating them. Instead of prompting big LLMs to score answers, the authors probe the internal hidden states of small LMs and train lightweight classifiers to predict the aspect-level scores assigned by a strong LLM judge. Their INSPECTOR pipeline beats prompt-based small-model evaluation by >20% F1 on reasoning benchmarks and gives reliable binary filters (80–90% F1). Probing works best with mean-pooled, PCA-reduced features and simple linear classifiers. It also helps filter training data for fine-tuning, with quality comparable to using a large LLM filter.

Problem Statement

Prompting large LLMs to evaluate outputs is costly, opaque, and brittle. Small open models give poor prompt-based evaluations, but may still encode evaluative cues in hidden states. The paper asks whether those latent representations can be probed to produce cheap, reliable evaluations.

Main Contribution

Formalize the Semantic Capacity Asymmetry Hypothesis: evaluation needs less semantic capacity than generation and can be read from intermediate representations.

Introduce Representation-as-a-Judge and INSPECTOR: a pipeline that probes small-LM hidden states and trains lightweight classifiers to match a strong LLM judge.

Show empirically that probing outperforms prompt-based small-LM evaluation and yields practical binary filters that aid supervised fine-tuning.
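As a rough illustration of what probing hidden states involves at the feature level, the sketch below mean-pools per-token vectors over non-padding positions. It assumes the hidden states of one layer have already been extracted into a NumPy array. The function name `mean_pool` and the toy shapes are illustrative choices, not taken from the paper.

```python
import numpy as np

def mean_pool(hidden: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Average token vectors over real (non-padding) positions.

    hidden: (batch, seq_len, dim) hidden states from one layer.
    mask:   (batch, seq_len) with 1 for real tokens and 0 for padding.
    """
    mask = mask.astype(hidden.dtype)[..., None]    # (batch, seq_len, 1)
    summed = (hidden * mask).sum(axis=1)           # (batch, dim)
    counts = np.clip(mask.sum(axis=1), 1.0, None)  # guard against empty rows
    return summed / counts

# Toy batch: 2 sequences, 3 positions, 2 dims; the second has one pad token.
hidden = np.array([[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
                   [[2.0, 0.0], [4.0, 2.0], [9.0, 9.0]]])
mask = np.array([[1, 1, 1],
                 [1, 1, 0]])
pooled = mean_pool(hidden, mask)  # one fixed-size vector per sequence
```

Masking matters here: without it, padding positions would drag the average toward whatever values the model emits for pad tokens.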

Key Findings

Probing small-model hidden states improves evaluation F1 over prompt-based inference by a large margin.

Numbers: Average F1 increased by >20% on most tasks

Binary (high vs low quality) probes are highly reliable.

Numbers: Binary F1 typically 80–92% across models/datasets

Multiclass (1–5) score prediction is substantially harder than binary.

Numbers: Multiclass F1 roughly 50–60% for best probes

Out-of-distribution transfer is weak for fine-grained scores but better for binary judgments.

Numbers: OOD multiclass F1 ≈10–25%; OOD binary F1 ≈35–62%

Probing-based filtering yields SFT gains comparable to using a strong LLM filter.

Numbers: Student SFT curves show comparable performance to DeepSeek-V3 filtering

Results

Improvement over prompt-based inference

Value: >20% average F1 increase

Baseline: prompt-based small-LM inference

Binary classification F1 (probing)

Value: ≈80–92%

Baseline: prompt-based or tuned small models

Multiclass (1–5) F1 (probing)

Value: ≈50–60% for top configurations

Baseline: prompt-based inference (low)

OOD transfer

Value: multiclass F1 ≈10–25%; binary F1 ≈35–62%

Baseline: in-distribution probing

SFT

Value: probing-filter SFT comparable to DeepSeek-V3-filtered SFT

Baseline: DeepSeek-V3 filtering

Who Should Care

What To Try In 7 Days

Run a quick probe: extract mean-pooled hidden states from your small LM on 100 example (prompt, response) pairs, reduce them to 50 dimensions with PCA, and train a logistic regression to match an LLM judge's labels.

Use the probe as a binary filter (score ≥ 4 on the 1–5 scale counts as high quality) to pick high-quality responses, then run a small supervised fine-tune on the filtered data.

Measure costs: compare inference time and API spend versus your current prompt-based LLM-as-judge to quantify savings.
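The probing and filtering steps above can be sketched with scikit-learn. This is a minimal sketch under loud assumptions: the features are synthetic stand-ins for mean-pooled hidden states with a planted "quality" direction, the judge scores are random integers, and the 256-dimensional feature size is arbitrary. None of it reproduces the paper's setup, and the resulting F1 says nothing about real performance.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)

# Synthetic stand-in for pooled hidden states: 100 pairs, 256 dims (assumed).
n, dim = 100, 256
judge_scores = rng.integers(1, 6, size=n)   # hypothetical 1-5 judge scores
labels = (judge_scores >= 4).astype(int)    # binary target: high vs low quality
direction = rng.normal(size=dim)            # planted signal, for illustration only
features = rng.normal(size=(n, dim)) + np.outer(labels * 2 - 1, direction)

X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.25, random_state=0, stratify=labels)

# PCA down to 50 components, then a linear probe on top.
probe = make_pipeline(PCA(n_components=50, random_state=0),
                      LogisticRegression(max_iter=1000))
probe.fit(X_train, y_train)

binary_f1 = f1_score(y_test, probe.predict(X_test))
keep_mask = probe.predict(features) == 1    # rows the filter would keep for SFT
```

With real data, `features` would come from your small LM and `judge_scores` from the LLM judge. Note that PCA requires `n_components` to be no larger than the number of training rows, so 50 components leaves little headroom at 100 examples.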

Optimization Features

Token Efficiency

  • decoding-free evaluation reduces token cost

Inference Optimization

  • avoid autoregressive decoding for evaluation
  • cache hidden states for repeated probing
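One way to cache hidden states for repeated probing is to memoize pooled features per (prompt, response) pair, so each pair costs a single forward pass no matter how many probes are trained on it. The sketch below is hypothetical: `FeatureCache` and `fake_extract` are illustrative names, and the extractor is a stand-in so the example runs without a model.

```python
import hashlib
from typing import Callable, Dict, List

Vector = List[float]

class FeatureCache:
    """Memoize pooled features per (prompt, response) pair."""

    def __init__(self, extract: Callable[[str, str], Vector]):
        self._extract = extract            # hypothetical: one small-LM forward pass
        self._store: Dict[str, Vector] = {}
        self.misses = 0                    # number of actual extractor calls

    @staticmethod
    def _key(prompt: str, response: str) -> str:
        # A separator byte makes accidental key collisions unlikely.
        joined = prompt + "\x1f" + response
        return hashlib.sha256(joined.encode("utf-8")).hexdigest()

    def get(self, prompt: str, response: str) -> Vector:
        key = self._key(prompt, response)
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._extract(prompt, response)
        return self._store[key]

# Stand-in extractor so the sketch runs without a model.
def fake_extract(prompt: str, response: str) -> Vector:
    return [float(len(prompt)), float(len(response))]

cache = FeatureCache(fake_extract)
first = cache.get("2+2?", "4")
second = cache.get("2+2?", "4")   # served from the cache, no second extraction
```

For larger runs the in-memory dict could be swapped for an on-disk store keyed the same way.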

Reproducibility

Data Urls

  • GSM8K (Huggingface)
  • MATH (official repo)
  • GPQA (public splits)

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Evaluation aspects and prompt templates are hand-chosen and may not cover all tasks or be optimal.
  • Experiments focus on mathematical/scientific reasoning; results may differ for commonsense, code, or dialog tasks.
  • Annotations use a single strong judge (DeepSeek-V3), which may bias probe training and evaluations.

When Not To Use

  • When you need reliable fine-grained (1–5) scores across very different domains without per-domain training.
  • When you require human-level nuanced justifications rather than coarse quality filtering.
  • If your rating LLM is unavailable or you cannot produce balanced probing labels for training.
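If judge labels are available but skewed, one simple way to get a balanced probing set is to downsample every class to the size of the rarest one. The sketch below assumes that approach; it is not described in the source, and `balance_by_label` is a hypothetical helper.

```python
import random
from collections import Counter, defaultdict
from typing import Hashable, List, Sequence, Tuple

def balance_by_label(examples: Sequence[str],
                     labels: Sequence[Hashable],
                     seed: int = 0) -> List[Tuple[str, Hashable]]:
    """Downsample each class to the size of the smallest class."""
    by_label = defaultdict(list)
    for example, label in zip(examples, labels):
        by_label[label].append(example)
    per_class = min(len(items) for items in by_label.values())
    rng = random.Random(seed)              # seeded for reproducible subsets
    balanced = []
    for label in sorted(by_label):
        for example in rng.sample(by_label[label], per_class):
            balanced.append((example, label))
    rng.shuffle(balanced)                  # avoid class-ordered training data
    return balanced

# Toy data: 7 high-quality pairs (label 1) and 3 low-quality pairs (label 0).
examples = [f"pair-{i}" for i in range(10)]
labels = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
balanced = balance_by_label(examples, labels)
counts = Counter(label for _, label in balanced)
```

Downsampling discards data, so a probe trained this way should still be checked on the original, unbalanced distribution.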

Failure Modes

  • Probe overfits small balanced datasets and fails on real-world long-tailed distributions.
  • Probes replicate biases or blind spots of the LLM judge used as 'gold' labels.
  • Multiclass predictions degrade sharply under domain shift, producing misleading fine-grained scores.

Core Entities

Models

  • DeepSeek-V3 (M_large, judge)
  • Llama-3-8B-Instruct (M_med, generator)
  • Qwen3-1.7B
  • Qwen3-0.6B
  • Llama-3.2-1B-Instruct
  • Llama-3.1-8B-Instruct
  • Llama-2-7B-Chat
  • RoBERTa

Metrics

  • Weighted average F1
  • Binary F1 (high vs low quality)
  • Multiclass F1 (score 1–5)
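To make the metrics above concrete, here is a small pure-Python sketch of support-weighted F1, applied once to raw 1–5 scores and once after binarizing at ≥ 4. The support-weighted averaging follows the common convention (the same idea as scikit-learn's `average="weighted"`); whether the paper computes its binary F1 exactly this way is an assumption, and the scores are made up.

```python
from collections import Counter
from typing import List, Sequence

def f1_for_class(y_true: Sequence[int], y_pred: Sequence[int], cls: int) -> float:
    """F1 for one class, treating it as the positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == cls and p == cls)
    fp = sum(1 for t, p in pairs if t != cls and p == cls)
    fn = sum(1 for t, p in pairs if t == cls and p != cls)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def weighted_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Per-class F1 averaged with weights proportional to class support."""
    support = Counter(y_true)
    total = len(y_true)
    return sum(f1_for_class(y_true, y_pred, c) * n / total
               for c, n in support.items())

def binarize(scores: Sequence[int], threshold: int = 4) -> List[int]:
    """Map 1-5 scores to 1 (high quality) or 0 (low quality)."""
    return [1 if s >= threshold else 0 for s in scores]

# Made-up scores: the probe is often off by one but on the right side of 4.
judge = [5, 4, 2, 1, 3, 4]
probe = [4, 4, 2, 2, 3, 5]
multiclass = weighted_f1(judge, probe)                  # 4/9, about 0.44
binary = weighted_f1(binarize(judge), binarize(probe))  # 1.0
```

The toy example shows why the binary metric can look much better than the multiclass one: off-by-one errors that stay on the same side of the threshold cost nothing after binarization.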

Datasets

  • GSM8K
  • MATH
  • GPQA
  • AlpacaEval 2.0

Benchmarks

  • GSM8K
  • MATH
  • GPQA
  • AlpacaEval 2.0

Context Entities

Models

  • GPT-style large LLMs (general reference)
  • Sentinel / prior probing works (context)