Short, natural-looking token sequences can flip LLM judges to say 'Yes' on wrong answers; discovery and a small LoRA defense

December 19, 2025 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.65

Cost Impact Score

0.5

Citation Count

0

Authors

Tung-Ling Li, Yuhao Wu, Hongliang Liu

Why It Matters For Business

Automated judge evaluations steer model selection and RL updates. Short, plausible tokens can make judges accept wrong outputs, enabling reward-hacking or broken selection. This can silently degrade deployed systems and downstream policies.

Summary TLDR

The authors show a practical attack surface for LLM-based judges. AdvJudge-Zero, a zero-seed search, discovers short, low-perplexity control-token sequences that flip many 'No' judgments to incorrect 'Yes' decisions by steering the last-layer logit gap. The hidden-state shifts these tokens induce are low-rank steering directions in the final layer, the tokens transfer across model families, and small LoRA adversarial fine-tuning largely mitigates them.

Problem Statement

LLM-as-a-Judge systems return a single-token binary decision (Yes/No). The paper asks whether short, natural token sequences that a policy could produce during post-training can systematically flip those binary judgments from 'No' (refuse/incorrect) to 'Yes' (accept/correct) without changing the answer quality.

Main Contribution

A geometric view showing binary judge decisions are set by a shallow linear readout on the final hidden state; small, targeted hidden-state moves can flip the decision.

AdvJudge-Zero: a zero-seed discovery algorithm that uses the model's next-token distribution and beam-style exploration to find short, low-perplexity control tokens that flip binary judge outputs.

A cross-family empirical study showing these tokens produce very high false-positive rates on math/reasoning benchmarks, and that small LoRA adversarial fine-tuning can cut false positives by orders of magnitude while preserving true positives.
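
The zero-seed, beam-style discovery idea described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' AdvJudge-Zero implementation: `next_token_probs` and `logit_gap` are hypothetical callables standing in for a real judge model, and the toy vocabulary and per-token effects are invented for the example.

```python
import math
from typing import Callable, Dict, List, Tuple

Seq = Tuple[str, ...]

def beam_search_control_tokens(
    next_token_probs: Callable[[Seq], Dict[str, float]],
    logit_gap: Callable[[Seq], float],
    beam_width: int = 3,
    max_len: int = 3,
    top_k: int = 4,
) -> List[Tuple[Seq, float, float]]:
    """Return (sequence, log_prob, gap) for sequences with gap F = z_no - z_yes < 0."""
    beams: List[Tuple[Seq, float]] = [((), 0.0)]  # start from the empty sequence (no seed)
    flips: List[Tuple[Seq, float, float]] = []
    for _ in range(max_len):
        candidates = []
        for seq, logp in beams:
            probs = next_token_probs(seq)
            # Expand only the most likely tokens so sequences stay low-perplexity.
            top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
            for tok, p in top:
                new_seq = seq + (tok,)
                candidates.append((new_seq, logp + math.log(p), logit_gap(new_seq)))
        # Prefer candidates that push the gap toward 'Yes' (smallest F first).
        candidates.sort(key=lambda c: c[2])
        flips.extend(c for c in candidates if c[2] < 0)
        beams = [(seq, logp) for seq, logp, _ in candidates[:beam_width]]
    return flips

# Toy stand-ins (assumptions): a fixed next-token distribution and additive token effects.
TOY_EFFECT = {"Solution": -0.8, ":": -0.6, "\n": -0.4, "the": 0.1}

def toy_next_token_probs(seq: Seq) -> Dict[str, float]:
    return {"Solution": 0.4, ":": 0.3, "\n": 0.2, "the": 0.1}

def toy_logit_gap(seq: Seq) -> float:
    # The judge starts at F = 1.5 ('No'); each appended token shifts the gap.
    return 1.5 + sum(TOY_EFFECT[t] for t in seq)

flips = beam_search_control_tokens(toy_next_token_probs, toy_logit_gap)
for seq, logp, gap in flips[:3]:
    print(seq, round(logp, 3), round(gap, 3))
```

With a real judge, `logit_gap` would run the judge on the question, the wrong answer, and the candidate tokens, then read the 'No' and 'Yes' logits at the decision position.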

Key Findings

AdvJudge-Zero ensembles drive very high false-positive rates (incorrect answers judged 'Yes') on math/reasoning datasets.

Numbers: FPRs of AIME 98.64%, MATH 99.91%, Multi-subject RLVR 94.75% (Section 4.2, Table 2)

The successful control tokens correspond to a low-rank, directional perturbation in final-layer hidden states.

Numbers: PC1 explains 28–35% of variance; cosine alignments give Z = -7.47 (Qwen) and -4.80 (Llama) (Table 1)

LoRA adversarial fine-tuning on control-token-augmented examples markedly reduces false positives while preserving or improving true positives.

Numbers: Omni-Judge FPR drops AIME 96.46→1.80%, MATH 99.41→5.62%, GSM8K 99.79→6.38%, RLVR 49.47→0.96% (Table 4)
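
To make the low-rank finding above concrete, here is a minimal sketch of how a PC1 variance share could be computed for a set of hidden-state perturbations. It uses power iteration on the covariance matrix in plain Python. The data is synthetic and 8-dimensional (an assumption for illustration), so the printed ratios are not comparable to the paper's numbers.

```python
import random

def pc1_variance_ratio(rows, iters=500, seed=0):
    """Share of total variance captured by the first principal component."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration: repeatedly apply cov and renormalise to approach the top eigenvector.
    rng = random.Random(seed)
    v = [rng.gauss(0, 1) for _ in range(d)]
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    cov_v = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
    top_eig = sum(v[i] * cov_v[i] for i in range(d))  # Rayleigh quotient of unit vector v
    total_var = sum(cov[i][i] for i in range(d))      # trace = sum of all eigenvalues
    return top_eig / total_var

# Synthetic perturbations (assumption): one shared direction plus small noise,
# compared against isotropic noise with no preferred direction.
rng = random.Random(1)
d = 8
direction = [1.0 / d ** 0.5] * d
steered = []
for _ in range(200):
    a = rng.gauss(0, 1)
    steered.append([a * u + rng.gauss(0, 0.1) for u in direction])
isotropic = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(200)]

steered_ratio = pc1_variance_ratio(steered)
isotropic_ratio = pc1_variance_ratio(isotropic)
print(steered_ratio, isotropic_ratio)
```

A high PC1 share relative to an isotropic null suggests that the perturbations concentrate along one direction, which is the sense of "low-rank steering" used here.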

Results

Ensemble False Positive Rate (AdvJudge-Zero)

Value: AIME 98.64%; MATH 99.91%; Multi-subject RLVR 94.75%

Baseline: Master-RM 61.13% / 71.86% / 54.46%

Geometric low-rank steering

Value: PC1 explains 28–35% of perturbation variance

Baseline: random isotropic null ~0.03%

LoRA adversarial fine-tuning

Value: Omni-Judge FPRs after fine-tuning: AIME 1.80%; MATH 5.62%; GSM8K 6.38%; RLVR 0.96%

Baseline: base FPRs: AIME 96.46%; MATH 99.41%; GSM8K 99.79%; RLVR 49.47%
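
For reference, the FPR and TPR figures reported above follow the usual definitions: FPR is the share of incorrect answers the judge accepts, and TPR is the share of correct answers it accepts. A minimal sketch, with a hypothetical record format of `(is_correct, judge_says_yes)` pairs and made-up data:

```python
def judge_rates(records):
    """Return (FPR, TPR) from (is_correct, judge_says_yes) pairs."""
    records = list(records)
    negatives = [yes for ok, yes in records if not ok]  # judge verdicts on wrong answers
    positives = [yes for ok, yes in records if ok]      # judge verdicts on correct answers
    fpr = sum(negatives) / len(negatives) if negatives else float("nan")
    tpr = sum(positives) / len(positives) if positives else float("nan")
    return fpr, tpr

# Made-up example: the same 4 wrong and 4 correct answers, judged with and
# without control tokens appended to the response.
clean = [(False, False)] * 4 + [(True, True)] * 4
attacked = [(False, True)] * 3 + [(False, False)] + [(True, True)] * 4

clean_fpr, clean_tpr = judge_rates(clean)
attacked_fpr, attacked_tpr = judge_rates(attacked)
print(clean_fpr, attacked_fpr, attacked_tpr)
```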

What To Try In 7 Days

Run AdvJudge-Zero or similar beam search over your judge on a representative dev set to measure FPR under token perturbations.

Create a small control-token-augmented training set (few thousand examples) and run a short LoRA fine-tune to test FPR reduction.

Add simple detectors for structural/control tokens (markdown, separators, assistant markers) to flag suspect evaluations in logging and metrics pipelines.
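
The detector idea in the last step could start as simple as a few regular expressions run over each judged response. The pattern set below is an assumption for illustration: the authors withheld their attack strings, so these are generic structural markers, not the discovered tokens.

```python
import re

# Hypothetical pattern set: generic structural markers, not the paper's tokens.
PATTERNS = {
    "markdown_heading": re.compile(r"^\s{0,3}#{1,6}\s", re.MULTILINE),
    "separator": re.compile(r"^\s*([-=*_])\1{2,}\s*$", re.MULTILINE),
    "role_marker": re.compile(r"^\s*(assistant|system|user)\s*:",
                              re.IGNORECASE | re.MULTILINE),
    "special_token": re.compile(r"<\|[^|>\n]+\|>"),
}

def flag_structural_tokens(text: str) -> list:
    """Return the names of all patterns that occur in the text."""
    return [name for name, pattern in PATTERNS.items() if pattern.search(text)]

plain_flags = flag_structural_tokens("The answer is 42.")
suspect_flags = flag_structural_tokens("x = 3\n---\nassistant: correct")
print(plain_flags, suspect_flags)
```

Flags like these are best logged next to the judge verdict rather than used to reject responses outright, since legitimate answers can also contain markdown or separators.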

Optimization Features

Token Efficiency

  • Focus on short (1–7 token) control sequences

Training Optimization

  • LoRA
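
The LoRA defense is trained on control-token-augmented examples. A minimal sketch of assembling such a set is shown below. The record fields, the placeholder control strings, and the choice to append tokens to the end of the response are all assumptions for illustration. The key point is that the label stays unchanged, so a wrong answer with control tokens is still labeled 'No'.

```python
import random

def augment_with_control_tokens(examples, control_sequences, seed=0):
    """Return the originals plus one augmented copy per example, labels unchanged."""
    rng = random.Random(seed)
    augmented = list(examples)
    for ex in examples:
        ctrl = rng.choice(control_sequences)
        # Copy the example and append a control sequence; the label is not touched.
        augmented.append({**ex, "response": ex["response"] + ctrl})
    return augmented

# Placeholder data (hypothetical): real control sequences would come from your own search.
examples = [
    {"question": "2 + 2 = ?", "response": "5", "label": "No"},
    {"question": "2 + 2 = ?", "response": "4", "label": "Yes"},
]
control_sequences = ["<CTRL_A>", "<CTRL_B>"]

train_set = augment_with_control_tokens(examples, control_sequences)
print(len(train_set))
```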

Reproducibility

Data Urls

  • AIME (Veeraboina, 2023) https://www.kaggle.com/datasets/hemishveeraboina/aime-problem-set-1983-2024
  • MATH (Hendrycks et al., 2021) nlile/hendrycks-MATH-benchmark
  • GSM8K (Cobbe et al., 2021) openai/gsm8k
  • Multi-subject RLVR (Su et al., 2025) virtuoussy/Multi-subject-RLVR

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Focus limited to binary correctness judgments on open-weight models; does not evaluate preference-ranking or safety filters.
  • Control tokens and results are shown on research models; proprietary production judges may differ.
  • Mitigation experiments are limited to LoRA fine-tuning and a single vulnerable judge (Omni-Judge).
  • Authors do not release attack tokens, so exact replication of specific sequences is restricted.

When Not To Use

  • Do not assume the same tokens or vulnerabilities apply to non-binary judgments (ranking, scoring) without testing.
  • Do not rely solely on LoRA adversarial fine-tuning as a permanent fix; it may reshape but not eliminate vulnerable directions.
  • Avoid reusing or circulating discovered token lists without careful red-team review (the authors withheld their attack strings).

Failure Modes

  • Adversarial training may overfit to discovered token ensembles and miss other vulnerable directions.
  • Production formatting or different tokenizers can shift token effects and invalidate transferability.
  • Stronger multi-layer or deeper readout designs could move the vulnerability surface, requiring re-discovery.

Core Entities

Models

  • Qwen (2.5/3 variants)
  • Llama-3.2/3.3 (3B, 70B)
  • Gemma-3 (4B)
  • Omni-Judge
  • Qwen2.5-7B-Instruct-RLVR
  • general-verifier
  • Master-RM

Metrics

  • False Positive Rate (FPR)
  • True Positive Rate (TPR)
  • Last-layer logit gap F = z_no - z_yes
  • PC1 variance; cosine alignment Z-score
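
The logit-gap metric listed above can be illustrated with a toy linear readout. In this sketch the judge says 'Yes' when F = z_no - z_yes is negative. The vectors are invented, and biases and the final layer norm are ignored (assumptions). The second function shows why a small move along the direction `w_no - w_yes` is enough to flip the decision.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def logit_gap(h, w_yes, w_no):
    """F = z_no - z_yes for a linear readout on hidden state h."""
    return dot(w_no, h) - dot(w_yes, h)

def minimal_flip(h, w_yes, w_no, margin=0.1):
    """Smallest L2 shift of h that moves the gap to -margin."""
    u = [n - y for n, y in zip(w_no, w_yes)]  # gap direction: F = u . h
    scale = (dot(u, h) + margin) / dot(u, u)
    return [x - scale * ui for x, ui in zip(h, u)]

# Toy values (assumptions), not taken from any real model.
h = [1.0, 2.0, 0.5]
w_yes = [0.2, 0.1, 0.0]
w_no = [0.5, 0.4, 0.0]

gap_before = logit_gap(h, w_yes, w_no)   # positive: judge says 'No'
h_flipped = minimal_flip(h, w_yes, w_no)
gap_after = logit_gap(h_flipped, w_yes, w_no)  # negative: judge says 'Yes'
print(gap_before, gap_after)
```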

Datasets

  • AIME
  • MATH
  • Multi-subject RLVR
  • GSM8K

Benchmarks

  • RLVR-style evaluation
  • math/reasoning benchmarks (AIME, MATH, GSM8K)