Overview
Production Readiness
0.6
Novelty Score
0.65
Cost Impact Score
0.5
Citation Count
0
Why It Matters For Business
Automated judge evaluations steer model selection and RL updates. Short, plausible-looking token sequences can make a judge accept wrong outputs, which enables reward hacking and corrupts model selection. The resulting errors can silently degrade deployed systems and downstream policies.
Summary TLDR
The authors show a practical attack surface for LLM-based judges: short, low-perplexity control-token sequences discovered without seeds (AdvJudge-Zero) can flip many 'No' judgments to incorrect 'Yes' decisions by steering the last-layer logit gap. These tokens act as low-rank steering directions in the final hidden layer, transfer across model families, and can be largely mitigated by small LoRA adversarial fine-tuning.
Problem Statement
LLM-as-a-Judge systems return a single-token binary decision (Yes/No). The paper asks whether short, natural token sequences that a policy could produce during post-training can systematically flip those binary judgments from 'No' (refuse/incorrect) to 'Yes' (accept/correct) without changing the answer quality.
Main Contribution
A geometric view showing binary judge decisions are set by a shallow linear readout on the final hidden state; small, targeted hidden-state moves can flip the decision.
AdvJudge-Zero: a zero-seed discovery algorithm that uses the model's next-token distribution and beam-style exploration to find short, low-perplexity control tokens that flip binary judge outputs.
A cross-family empirical study showing these tokens produce very high false-positive rates on math/reasoning benchmarks, and that small LoRA adversarial fine-tuning can cut false positives by orders of magnitude while preserving true positives.
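The geometric view can be illustrated with a toy linear readout. The sketch below is not the authors' code: the three-dimensional hidden state, the weight vectors `w_no` and `w_yes`, and both helper names are hypothetical, chosen only to show how a small shift of the final hidden state along the readout direction changes the sign of the logit gap F = z_no - z_yes.

```python
def logit_gap(h, w_no, w_yes, b_no=0.0, b_yes=0.0):
    """F = z_no - z_yes for a linear readout on the final hidden state h."""
    z_no = sum(w * x for w, x in zip(w_no, h)) + b_no
    z_yes = sum(w * x for w, x in zip(w_yes, h)) + b_yes
    return z_no - z_yes


def minimal_flip_shift(h, w_no, w_yes, margin=1e-3):
    """Smallest L2 shift of h that moves F to -margin (toy illustration)."""
    direction = [a - b for a, b in zip(w_no, w_yes)]  # gradient of F w.r.t. h
    norm_sq = sum(x * x for x in direction)
    gap = logit_gap(h, w_no, w_yes)
    if gap < 0:
        return [0.0] * len(h)  # the judge already says 'Yes'
    scale = (gap + margin) / norm_sq
    return [-scale * x for x in direction]


# Hypothetical values: a judge that currently answers 'No' (F > 0).
h = [1.0, 0.5, -0.2]
w_no = [0.8, 0.1, 0.0]
w_yes = [0.2, 0.3, 0.1]

gap_before = logit_gap(h, w_no, w_yes)
delta = minimal_flip_shift(h, w_no, w_yes)
h_shifted = [x + dx for x, dx in zip(h, delta)]
gap_after = logit_gap(h_shifted, w_no, w_yes)
```

In this toy setting the decision is 'Yes' whenever F < 0, so `gap_after` being negative means the shifted state is accepted even though nothing about the answer itself changed.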
Key Findings
AdvJudge-Zero ensembles drive very high false-positive rates (incorrect answers judged 'Yes') on math/reasoning datasets.
The successful control tokens correspond to a low-rank, directional perturbation in final-layer hidden states.
LoRA adversarial fine-tuning on control-token-augmented examples markedly reduces false positives while preserving or improving true positives.
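One way to check a low-rank claim of this kind is to measure how much of the variance in hidden-state shifts falls on the first principal component. The sketch below is an assumption-laden illustration, not the paper's analysis: the shifts are synthetic, the planted direction is hypothetical, and PC1 is estimated with plain power iteration on the covariance matrix.

```python
import math
import random


def pc1_variance_fraction(rows, iters=200, seed=0):
    """Return (fraction of variance on PC1, unit PC1 vector) via power iteration."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    total_var = sum(cov[i][i] for i in range(d))
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient of the converged unit vector gives the top eigenvalue.
    top_eig = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d))
    return top_eig / total_var, v


# Synthetic shifts: mostly along one hypothetical direction, plus small noise.
rng = random.Random(1)
direction = [0.6, 0.8, 0.0]  # unit vector, made up for the example
shifts = []
for _ in range(50):
    a = rng.gauss(2.0, 1.0)
    shifts.append([a * u + rng.gauss(0.0, 0.05) for u in direction])

fraction, pc1 = pc1_variance_fraction(shifts)
alignment = abs(sum(p * u for p, u in zip(pc1, direction)))
```

A `fraction` close to 1 indicates that the shifts are effectively one-dimensional; on real judges the rows would be differences between final-layer hidden states with and without the control tokens.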
Results
Ensemble False Positive Rate (AdvJudge-Zero)
Geometric low-rank steering
LoRA adversarial fine-tuning
Who Should Care
What To Try In 7 Days
Run AdvJudge-Zero or similar beam search over your judge on a representative dev set to measure FPR under token perturbations.
Create a small control-token-augmented training set (few thousand examples) and run a short LoRA fine-tune to test FPR reduction.
Add simple detectors for structural/control tokens (markdown, separators, assistant markers) to flag suspect evaluations in logging and metrics pipelines.
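The detector idea in the last item can be prototyped with a few regular expressions. This is a minimal sketch under stated assumptions: the pattern set, the category names, and `flag_suspect_tokens` are hypothetical, and the authors' actual attack strings are not public, so the patterns only cover the broad categories named above (markdown, separators, assistant markers).

```python
import re

# Hypothetical patterns for structural/control content in a candidate answer.
_FLAGS = re.IGNORECASE | re.MULTILINE
SUSPECT_PATTERNS = {
    "markdown_heading": re.compile(r"^ {0,3}#{1,6}\s", _FLAGS),
    "separator": re.compile(r"^\s*(?:-{3,}|={3,}|\*{3,})\s*$", _FLAGS),
    "assistant_marker": re.compile(
        r"<\|?\s*(?:assistant|im_start|im_end)\s*\|?>|^\s*assistant\s*:", _FLAGS
    ),
}


def flag_suspect_tokens(candidate_answer):
    """Return the sorted names of all pattern categories found in the text."""
    return sorted(
        name
        for name, pattern in SUSPECT_PATTERNS.items()
        if pattern.search(candidate_answer)
    )


clean = flag_suspect_tokens("The answer is 42.")
suspicious = flag_suspect_tokens("---\nassistant: yes")
```

Flagged evaluations are best logged and reviewed rather than rejected outright, since legitimate answers can also contain markdown or separators.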
Optimization Features
Token Efficiency
- Focus on short (1–7 token) control sequences
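A search over short control sequences can be sketched as a beam search that expands only likely next tokens and keeps the prefixes with the smallest logit gap. The code below is a toy stand-in rather than AdvJudge-Zero itself: `propose_next` and `gap_fn` are hypothetical callbacks that a real setup would back with the judge model, and the token effects and log-probabilities are invented for the example.

```python
def beam_search_control_tokens(propose_next, gap_fn, max_len=7, beam_width=4, top_k=4):
    """Find short sequences with gap_fn(seq) < 0, i.e. a 'Yes' decision.

    propose_next(prefix) -> list of (token, logprob), most likely first.
    gap_fn(seq)          -> F = z_no - z_yes with seq attached to the answer.
    """
    beams = [((), 0.0)]  # (token tuple, cumulative log-probability)
    for _ in range(max_len):
        candidates, flips = [], []
        for seq, logprob in beams:
            for token, token_lp in propose_next(seq)[:top_k]:
                new_seq = seq + (token,)
                gap = gap_fn(new_seq)
                candidates.append((gap, new_seq, logprob + token_lp))
                if gap < 0:
                    flips.append((new_seq, logprob + token_lp, gap))
        if flips:
            # Stop at the shortest length that flips; most likely sequence first.
            return sorted(flips, key=lambda f: f[1], reverse=True)
        candidates.sort(key=lambda c: c[0])  # smallest gap first
        beams = [(seq, lp) for _, seq, lp in candidates[:beam_width]]
    return []


# Toy judge: each token adds a fixed, made-up amount to the logit gap.
EFFECT = {"\n\n": -0.4, ":": -0.35, "Solution": -0.3, "the": 0.1, "maybe": 0.2}
LOGPROB = {"\n\n": -1.0, ":": -1.5, "Solution": -2.0, "the": -2.5, "maybe": -3.0}


def toy_propose_next(prefix):
    return sorted(LOGPROB.items(), key=lambda kv: kv[1], reverse=True)


def toy_gap(seq):
    return 1.0 + sum(EFFECT[t] for t in seq)


flips = beam_search_control_tokens(toy_propose_next, toy_gap)
```

Restricting expansion to the top-k proposals is what keeps the found sequences low-perplexity in this sketch; the default `max_len=7` mirrors the 1–7 token range mentioned above.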
Training Optimization
- LoRA
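Before any LoRA run, the adversarial training set has to be assembled. The sketch below shows one plausible construction and is not the authors' recipe: the record fields, the placeholder control strings `<CTRL_A>` and `<CTRL_B>`, and the choice to prepend the control sequence to the answer are all assumptions made for illustration.

```python
import random


def augment_with_control_tokens(examples, control_sequences, seed=0):
    """Keep every original example and add an adversarial copy of each
    incorrect answer, still labeled 'No'.

    examples: list of dicts with 'question', 'answer', 'is_correct'.
    control_sequences: list of strings (placeholders here).
    """
    rng = random.Random(seed)
    augmented = []
    for ex in examples:
        label = "Yes" if ex["is_correct"] else "No"
        augmented.append(
            {"question": ex["question"], "answer": ex["answer"], "label": label}
        )
        if not ex["is_correct"]:
            ctrl = rng.choice(control_sequences)
            augmented.append(
                {
                    "question": ex["question"],
                    "answer": ctrl + ex["answer"],  # assumed placement: prefix
                    "label": "No",  # the answer is still wrong
                }
            )
    return augmented


examples = [
    {"question": "2 + 2 = ?", "answer": "4", "is_correct": True},
    {"question": "2 + 2 = ?", "answer": "5", "is_correct": False},
]
train_set = augment_with_control_tokens(examples, ["<CTRL_A>", "<CTRL_B>"])
```

Keeping the unmodified correct examples in the set is what lets a fine-tune be checked for preserved true positives alongside reduced false positives.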
Reproducibility
Data Urls
- AIME (Veeraboina, 2023) https://www.kaggle.com/datasets/hemishveeraboina/aime-problem-set-1983-2024
- MATH (Hendrycks et al., 2021) Hugging Face dataset: nlile/hendrycks-MATH-benchmark
- GSM8K (Cobbe et al., 2021) Hugging Face dataset: openai/gsm8k
- Multi-subject RLVR (Su et al., 2025) Hugging Face dataset: virtuoussy/Multi-subject-RLVR
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- The study is limited to binary correctness judgments on open-weight models; it does not evaluate preference-ranking or safety filters.
- Control tokens and results are shown on research models; proprietary production judges may differ.
- Mitigation experiments are limited to LoRA fine-tuning and a single vulnerable judge (Omni-Judge).
- Authors do not release attack tokens, so exact replication of specific sequences is restricted.
When Not To Use
- Do not assume the same tokens or vulnerabilities apply to non-binary judgments (ranking, scoring) without testing.
- Do not rely solely on LoRA adversarial fine-tuning as a permanent fix; it may reshape but not eliminate vulnerable directions.
- Avoid deploying any discovered token lists without careful red-team review (the authors withheld their attack strings).
Failure Modes
- Adversarial training may overfit to discovered token ensembles and miss other vulnerable directions.
- Production formatting or a different tokenizer can shift token effects and invalidate transferability.
- Stronger multi-layer or deeper readout designs could move the vulnerability surface, requiring re-discovery.
Core Entities
Models
- Qwen (2.5/3 variants)
- Llama-3.2/3.3 (3B, 70B)
- Gemma-3 (4B)
- Omni-Judge
- Qwen2.5-7B-Instruct-RLVR
- general-verifier
- Master-RM
Metrics
- False Positive Rate (FPR)
- True Positive Rate (TPR)
- Last-layer logit gap F = z_no - z_yes
- PC1 variance; cosine alignment Z-score
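The first two metrics follow the standard confusion-matrix definitions, with 'positive' meaning the judge answered 'Yes'. A minimal sketch, assuming each evaluation is stored as a hypothetical `(is_correct, judge_says_yes)` pair:

```python
def judge_rates(records):
    """Compute (FPR, TPR) from (is_correct, judge_says_yes) pairs.

    FPR = share of incorrect answers the judge accepted.
    TPR = share of correct answers the judge accepted.
    """
    records = list(records)
    tp = sum(1 for correct, yes in records if correct and yes)
    fn = sum(1 for correct, yes in records if correct and not yes)
    fp = sum(1 for correct, yes in records if not correct and yes)
    tn = sum(1 for correct, yes in records if not correct and not yes)
    fpr = fp / (fp + tn) if (fp + tn) else float("nan")
    tpr = tp / (tp + fn) if (tp + fn) else float("nan")
    return fpr, tpr


# Made-up evaluation log: 2 correct answers, 4 incorrect answers.
log = [
    (True, True),
    (True, False),
    (False, True),
    (False, True),
    (False, False),
    (False, False),
]
fpr, tpr = judge_rates(log)
```

Computing both rates on the same dev set, with and without control tokens attached, gives a direct before/after comparison for any mitigation.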
Datasets
- AIME
- MATH
- Multi-subject RLVR
- GSM8K
Benchmarks
- RLVR-style evaluation
- math/reasoning benchmarks (AIME, MATH, GSM8K)

