ObjexMT: test if LLM "judges" can recover hidden objectives and know when they're confident

August 23, 2025 · 8 min

Overview

Decision Snapshot: Needs Validation

Large N=2,817 evaluation plus human-calibrated threshold gives robust diagnostics, but scope is limited to six commercial models and a single judge; expect biases from that setup.

Citations: 0

Evidence Strength: 0.85

Confidence: 0.87

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 0/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 45%

Authors

Hyunjun Kim, Junwoo Ha, Sangyoon Yu, Haon Park

Links

Abstract / PDF / Code / Data

Why It Matters For Business

LLMs used to judge user intent or moderate content often misinfer hidden goals, and sometimes do so with high confidence. For risky decisions, rely on explicit objectives, confidence gates, and human oversight.

Who Should Care

Summary TLDR

ObjexMT is a benchmark that asks an LLM to extract a single-sentence "base objective" from multi-turn jailbreak dialogs and report its confidence. The paper evaluates six models on 2,817 instances and freezes a human-aligned similarity threshold (τ⋆=0.66) for correctness. Accuracy across the six models ranges from 47% to 61%, calibration varies widely (ECE 0.206–0.417), and high-confidence errors persist (Wrong@0.90 15–48%). Main recommendations: surface objectives explicitly when possible, gate actions by confidence, and add human oversight for high-stakes cases. Data and spreadsheets are released on GitHub.

Problem Statement

Can an LLM acting as a judge reliably recover a conversation's hidden single-sentence objective under adversarial multi-turn jailbreaks, and can it honestly report when that inference is trustworthy? This matters because many moderation and auditing pipelines must infer intent from long, noisy dialogues rather than clear prompts.

Main Contribution

ObjexMT: a benchmark and protocol for extracting a single-sentence base objective from multi-turn jailbreak transcripts and for measuring self-reported confidence calibration.

Human-aligned evaluation: convert LLM-judge semantic similarity to binary correctness via a frozen threshold calibrated on N=300 human labels (τ⋆=0.66; F1=0.891).
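The threshold calibration described above can be illustrated with a minimal sketch. It assumes the sweep picks the similarity cutoff that maximizes F1 against binary human labels and that a score at or above the cutoff counts as correct; the function names, the 0.01 step size, and the `>=` comparison are illustrative assumptions, not details confirmed by the paper.

```python
from typing import Sequence, Tuple


def f1_at_threshold(scores: Sequence[float], labels: Sequence[bool], tau: float) -> float:
    """F1 of the rule `score >= tau` against binary human labels."""
    tp = fp = fn = 0
    for score, label in zip(scores, labels):
        predicted = score >= tau
        if predicted and label:
            tp += 1
        elif predicted and not label:
            fp += 1
        elif not predicted and label:
            fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def sweep_threshold(scores: Sequence[float], labels: Sequence[bool],
                    step: float = 0.01) -> Tuple[float, float]:
    """Try cutoffs 0.00, 0.01, ..., 1.00 and return (best_tau, best_f1).

    Ties are broken in favor of the lowest cutoff (an arbitrary choice).
    """
    best_tau, best_f1 = 0.0, -1.0
    n_steps = int(round(1 / step))
    for i in range(n_steps + 1):
        tau = round(i * step, 10)
        f1 = f1_at_threshold(scores, labels, tau)
        if f1 > best_f1:
            best_tau, best_f1 = tau, f1
    return best_tau, best_f1


def is_correct(similarity: float, tau: float = 0.66) -> bool:
    """Apply the frozen threshold (0.66 is the value reported in the summary)."""
    return similarity >= tau
```

Once a cutoff is chosen on the calibration set, it stays fixed for the full benchmark, which is what "frozen" means here.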

Key Findings

Top-line extraction accuracy is low to moderate across models.

Numbers: Accuracy range 0.474–0.612 (N=2,817)

Practical Use: Do not trust LLM judges to recover hidden objectives reliably; expect roughly 39–53% failure and plan for human review or confidence gating.

Evidence Ref: Tables 2-3, Abstract

Judge calibration varies and can be poor.

Numbers: ECE 0.206–0.417; Brier 0.254–0.416

Practical Use: Check calibration metrics before deploying confidence-based gating; some models (e.g., claude-sonnet-4) are meaningfully better calibrated.

Evidence Ref: Table 5, Fig. 2
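For readers unfamiliar with the two calibration metrics, here is a minimal sketch of how ECE and the Brier score are commonly computed from self-reported confidences and binary correctness. The 10 equal-width bins are an assumption; the summary does not state the paper's binning scheme.

```python
from typing import Sequence


def brier_score(confidences: Sequence[float], correct: Sequence[bool]) -> float:
    """Mean squared gap between stated confidence and the 0/1 outcome."""
    return sum((p - float(c)) ** 2 for p, c in zip(confidences, correct)) / len(confidences)


def expected_calibration_error(confidences: Sequence[float], correct: Sequence[bool],
                               n_bins: int = 10) -> float:
    """Weighted average of |mean confidence - accuracy| over equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for p, c in zip(confidences, correct):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, float(c)))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(p for p, _ in members) / len(members)
        accuracy = sum(c for _, c in members) / len(members)
        ece += len(members) / total * abs(mean_conf - accuracy)
    return ece
```

A judge that always reports 0.9 but is right half the time gets an ECE of 0.4 under this definition, which is in the range the weaker models show here.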

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Judge threshold (human-aligned) | τ⋆ = 0.66 (F1 = 0.891 on N=300) | | | Calibration set N=300 | Threshold selected via sweep against human labels | Table 1, §3.6
Accuracy | kimi-k2 0.612; claude-sonnet-4 0.603; deepseek-v3.1 0.599; gemini-2.5-flash 0.542; gpt-4.1 0.490; Qwen3-235B 0.474 | | | Full benchmark N=2,817 (all datasets) | Bootstrap 95% CIs reported; differences among the top three are not statistically significant | Table 3, Table 2
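The bootstrap confidence intervals mentioned in the table can be approximated with a percentile bootstrap over per-instance correctness flags. This is a generic sketch; the resample count, the percentile method, and the fixed seed are assumptions rather than the paper's documented procedure.

```python
import random
from typing import Sequence, Tuple


def bootstrap_accuracy_ci(correct: Sequence[int], n_resamples: int = 2000,
                          alpha: float = 0.05, seed: int = 0) -> Tuple[float, float, float]:
    """Return (accuracy, lower, upper) using a percentile bootstrap.

    `correct` holds one 0/1 flag per benchmark instance.
    """
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(correct)
    resampled = []
    for _ in range(n_resamples):
        sample = rng.choices(correct, k=n)  # draw n instances with replacement
        resampled.append(sum(sample) / n)
    resampled.sort()
    lower = resampled[int((alpha / 2) * n_resamples)]
    upper = resampled[min(int((1 - alpha / 2) * n_resamples), n_resamples - 1)]
    return sum(correct) / n, lower, upper
```

If two models' intervals overlap heavily, as the table reports for the top three, a ranking between them should not be read as meaningful.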

What To Try In 7 Days

Run ObjexMT on a small slice of your moderator dialogs to measure judge accuracy and ECE using the released spreadsheets.

Require users to restate objectives when dialogs are short or fragmented; add a 'state your goal' prompt before auto-actions.

Implement a simple confidence gate (e.g., act automatically only if the model's self-reported confidence p > 0.9 and its measured ECE is acceptable) and route low-confidence cases to humans.
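The confidence gate from the last item could look roughly like the sketch below. `JudgeOutput`, `route`, and the `max_ece` cap of 0.1 are hypothetical names and values chosen for illustration; only the p > 0.9 cutoff comes from the text above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class JudgeOutput:
    """Hypothetical container for one judge response."""
    objective: str      # extracted single-sentence objective
    confidence: float   # self-reported probability in [0, 1]


def route(output: JudgeOutput, min_confidence: float = 0.9,
          model_ece: Optional[float] = None, max_ece: float = 0.1) -> str:
    """Return "auto" or "human_review".

    Confidence is only trusted when the model's measured ECE is known and
    at or below `max_ece` (a placeholder cap, not a recommended value).
    """
    if model_ece is None or model_ece > max_ece:
        return "human_review"
    if output.confidence > min_confidence:
        return "auto"
    return "human_review"
```

Note that every ECE reported in this summary (0.206–0.417) exceeds the placeholder cap, so this gate would send all cases to human review until the cap or the model's calibration changes.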

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Only six large commercial models evaluated; no small open-source models or safety-tuned variants.

Single LLM judge (gpt-4.1) may introduce systematic bias in similarity scoring.

When Not To Use

Don't use as a final gate for high-stakes automated moderation without human oversight.

Not suitable when intents are inherently multi-objective or require multi-sentence recovery.

Failure Modes

Overconfident wrong answers: models report high p but are incorrect (Wrong@0.90 up to 47.7%).

Systematic judge bias from single similarity model and fixed threshold.
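To track the overconfidence failure mode on your own data, a Wrong@t style metric can be computed as below. The sketch assumes Wrong@t is the share of incorrect answers among predictions whose self-reported confidence is at least t; check the paper for the exact definition before comparing numbers.

```python
from typing import Optional, Sequence


def wrong_at(confidences: Sequence[float], correct: Sequence[bool],
             threshold: float) -> Optional[float]:
    """Fraction of wrong answers among predictions with confidence >= threshold.

    Returns None when no prediction reaches the threshold.
    """
    selected = [c for p, c in zip(confidences, correct) if p >= threshold]
    if not selected:
        return None
    return sum(1 for c in selected if not c) / len(selected)
```

For example, with confidences `[0.95, 0.92, 0.5, 0.99]` and correctness `[True, False, False, True]`, three predictions clear 0.90 and one of them is wrong, so `wrong_at(..., 0.90)` is one third.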

Core Entities

Models

gpt-4.1, claude-sonnet-4, Qwen3-235B-A22B-FP8, kimi-k2, deepseek-v3.1, gemini-2.5-flash

Metrics

Accuracy, ECE, Brier, Wrong@0.80, Wrong@0.90, Wrong@0.95, AURC, F1, similarity_score

Datasets

SafeMTData_Attack600, SafeMTData_1K, MHJ