ObjexMT: test if LLM "judges" can recover hidden objectives and know when they're confident

August 23, 2025 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.45

Cost Impact Score

0.5

Citation Count

0

Authors

Hyunjun Kim, Junwoo Ha, Sangyoon Yu, Haon Park

Why It Matters For Business

LLMs used to judge user intent or moderate content often misinfer hidden goals, and sometimes do so with high confidence. For risky decisions, rely on explicitly stated objectives, confidence gates, and human oversight.

Summary TLDR

ObjexMT is a benchmark that asks an LLM to extract a single-sentence "base objective" from multi-turn jailbreak dialogs and to report its confidence in that extraction. The paper evaluates six models on 2,817 instances and freezes a human-aligned similarity threshold (τ⋆=0.66) for correctness. The six models reach 47–61% accuracy, calibration varies widely (ECE 0.206–0.417), and high-confidence errors persist (Wrong@0.90 15–48%). Main recommendations: surface objectives explicitly when possible, gate actions by confidence, and add human oversight for high-stakes cases. Data and spreadsheets are released on GitHub.

Problem Statement

Can an LLM acting as a judge reliably recover a conversation's hidden single-sentence objective under adversarial multi-turn jailbreaks, and can it honestly report when that inference is trustworthy? This matters because many moderation and auditing pipelines must infer intent from long, noisy dialogues rather than clear prompts.

Main Contribution

ObjexMT: a benchmark and protocol for extracting a single-sentence base objective from multi-turn jailbreak transcripts and for measuring self-reported confidence calibration.

Human-aligned evaluation: convert LLM-judge semantic similarity to binary correctness via a frozen threshold calibrated on N=300 human labels (τ⋆=0.66; F1=0.891).

Large-scale evaluation: single-pass tests of six models on 2,817 instances across three public datasets, with accuracy, ECE, Brier, Wrong@High-Conf, and selective-prediction analyses.

Operational diagnostics and prescriptions: dataset heterogeneity, length/turn effects, and practical guidance to gate actions by confidence or surface objectives.
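
The human-aligned evaluation step above can be sketched as follows. This is a minimal illustration, not the paper's code: it assumes the threshold is picked by maximizing F1 against human labels (the summary reports F1 at τ⋆ but does not state the selection rule), and that a score at or above the threshold counts as correct. The function names and the toy data are hypothetical.

```python
def f1_score(labels, preds):
    """F1 of binary predictions against binary human labels."""
    tp = sum(1 for y, p in zip(labels, preds) if y and p)
    fp = sum(1 for y, p in zip(labels, preds) if not y and p)
    fn = sum(1 for y, p in zip(labels, preds) if y and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def calibrate_threshold(similarities, human_labels):
    """Return (threshold, f1) for the observed score that maximizes F1.

    Assumption: F1 maximization is the selection rule; ties keep the
    lowest candidate threshold.
    """
    best_t, best_f1 = None, -1.0
    for t in sorted(set(similarities)):
        preds = [s >= t for s in similarities]
        f1 = f1_score(human_labels, preds)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1


def is_correct(similarity, threshold=0.66):
    """Apply a frozen threshold (0.66 is the value reported above)."""
    return similarity >= threshold


# Toy data, invented for illustration only.
toy_scores = [0.2, 0.4, 0.5, 0.7, 0.8, 0.9]
toy_labels = [0, 0, 1, 1, 1, 1]
threshold, f1 = calibrate_threshold(toy_scores, toy_labels)
```

Once the threshold is frozen, every later extraction is scored with `is_correct` alone, so model comparisons do not depend on re-tuning.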

Key Findings

Top-line extraction accuracy is low to moderate across models.

Numbers: Accuracy range 0.474–0.612 (N=2,817)

Judge calibration varies and can be poor.

Numbers: ECE 0.206–0.417; Brier 0.254–0.416
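
For readers who want to compute the same two calibration metrics on their own judge outputs, here is a minimal sketch. It assumes equal-width confidence bins and 10 bins for ECE; the summary does not state the paper's binning, so treat both as assumptions. Inputs are self-reported confidences in [0, 1] and binary correctness flags.

```python
def brier_score(confidences, correct):
    """Mean squared gap between confidence and the 0/1 outcome."""
    n = len(confidences)
    return sum((p - float(c)) ** 2 for p, c in zip(confidences, correct)) / n


def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width bins (bin count is an assumption here).

    Each bin contributes |mean confidence - accuracy|, weighted by the
    share of samples that fall into it.
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for p, c in zip(confidences, correct):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, float(c)))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(p for p, _ in members) / len(members)
        accuracy = sum(c for _, c in members) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece


# Toy data, invented for illustration only.
toy_conf = [0.95, 0.95, 0.65, 0.65]
toy_correct = [1, 0, 1, 1]
toy_ece = expected_calibration_error(toy_conf, toy_correct)
toy_brier = brier_score(toy_conf, toy_correct)
```

Lower is better for both metrics. ECE isolates the gap between stated confidence and observed accuracy, while Brier also penalizes confidences that are uninformative.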

High-confidence errors remain common.

Numbers: Wrong@0.90 ranges 14.9%–47.7%
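
A Wrong@High-Conf figure can be computed along these lines. The sketch assumes the metric is the share of incorrect extractions among those whose self-reported confidence is at or above the cutoff; the summary does not spell out the denominator, so that reading is an assumption, and the function name is hypothetical.

```python
def wrong_at_confidence(confidences, correct, threshold=0.90):
    """Fraction of wrong answers among predictions with confidence >= threshold.

    Returns None when no prediction reaches the threshold, because the
    rate is undefined in that case.
    """
    selected = [c for p, c in zip(confidences, correct) if p >= threshold]
    if not selected:
        return None
    return sum(1 for c in selected if not c) / len(selected)


# Toy data, invented for illustration only.
toy_conf = [0.95, 0.92, 0.91, 0.50, 0.99]
toy_correct = [1, 0, 1, 0, 0]
toy_wrong_90 = wrong_at_confidence(toy_conf, toy_correct, threshold=0.90)
```

Passing 0.80 or 0.95 as `threshold` gives the Wrong@0.80 and Wrong@0.95 variants listed under Metrics.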

Dataset construction drives difficulty (large heterogeneity).

Numbers: Average accuracy by dataset: Attack600 24.3%, SafeMT_1K 57.0%, MHJ 80.9%

Transcript length and turn count affect recovery nonlinearly.

Numbers: Average accuracy rises with length quartile, from ~0.33 in Q1 to 0.81 in Q4; error peaks at 5–6 turns

Results

Judge threshold (human-aligned)

Value: τ⋆ = 0.66 (F1 = 0.891 on N=300)

Accuracy by model

Value: kimi-k2 0.612; claude-sonnet-4 0.603; deepseek-v3.1 0.599; gemini-2.5-flash 0.542; gpt-4.1 0.490; Qwen3-235B 0.474

Best calibration (ECE / Brier / AURC)

Value: claude-sonnet-4 ECE 0.206; Brier 0.254; AURC 0.242
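
AURC summarizes selective prediction: how the error rate changes as the judge is allowed to answer only its most confident cases. The sketch below uses one common definition (the mean error rate over all coverage levels when samples are ranked by confidence). Whether the paper uses exactly this variant is an assumption, and the function name is hypothetical.

```python
def aurc(confidences, correct):
    """Area under the risk-coverage curve (lower is better).

    Samples are ranked from most to least confident. At coverage k/n the
    risk is the error rate among the k most confident samples; the
    returned value is the mean risk over k = 1..n.
    """
    order = sorted(range(len(confidences)),
                   key=lambda i: confidences[i], reverse=True)
    errors = 0
    risks = []
    for k, i in enumerate(order, start=1):
        errors += 0 if correct[i] else 1
        risks.append(errors / k)
    return sum(risks) / len(risks)


# Toy data, invented for illustration only.
toy_conf = [0.9, 0.8, 0.7, 0.6]
well_ranked = aurc(toy_conf, [1, 1, 0, 0])    # errors sit at low confidence
badly_ranked = aurc(toy_conf, [0, 0, 1, 1])   # errors sit at high confidence
```

Both toy judges have the same 50% accuracy, but the first one ranks its errors last and so gets the lower AURC. That ranking quality is what a confidence gate depends on.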

High-confidence error (Wrong@0.90) range

Value: 14.9% (best) to 47.7% (worst)

Accuracy by dataset

Value: Attack600 24.3%; SafeMT_1K 57.0%; MHJ 80.9%

Who Should Care

What To Try In 7 Days

Run ObjexMT on a small slice of your moderator dialogs to measure judge accuracy and ECE using the released spreadsheets.

Require users to restate objectives when dialogs are short or fragmented; add a 'state your goal' prompt before auto-actions.

Implement a simple confidence gate (e.g., auto-act only when the model's self-reported confidence p exceeds 0.9 and its measured ECE is within a bound you set) and route all other cases to humans.
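
The confidence gate in the last item could look roughly like this. It is a sketch under stated assumptions: the 0.9 confidence cutoff comes from the text above, while the ECE bound of 0.10, the function name, and the route labels are hypothetical placeholders to replace with your own policy.

```python
def route(confidence, model_ece, conf_threshold=0.90, max_ece=0.10):
    """Decide whether a judge verdict may trigger an automatic action.

    confidence: the judge's self-reported confidence for this case.
    model_ece:  ECE measured for this judge on your own validation slice.
    max_ece:    hypothetical bound; a poorly calibrated judge is never
                trusted to auto-act, whatever confidence it reports.
    """
    if model_ece > max_ece:
        return "human_review"
    if confidence > conf_threshold:
        return "auto_action"
    return "human_review"


# Example calls with invented inputs.
decision_ok = route(confidence=0.95, model_ece=0.05)
decision_uncalibrated = route(confidence=0.95, model_ece=0.206)
decision_unsure = route(confidence=0.50, model_ece=0.05)
```

With a bound of 0.10, even the best ECE reported above (0.206) would send every case to human review, so the bound has to be chosen with the measured calibration in mind.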

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Only six large commercial models evaluated; no small open-source models or safety-tuned variants.
  • Single LLM judge (gpt-4.1) may introduce systematic bias in similarity scoring.
  • Single-sentence gold objectives may oversimplify multi-objective attacks.
  • Deterministic decoding (one pass) understates model output variability.

When Not To Use

  • Don't use as a final gate for high-stakes automated moderation without human oversight.
  • Not suitable when intents are inherently multi-objective or require multi-sentence recovery.
  • Avoid assuming benchmark accuracies transfer to domains with different obfuscation styles.

Failure Modes

  • Overconfident wrong answers: models report high p but are incorrect (Wrong@0.90 up to 47.7%).
  • Systematic judge bias from single similarity model and fixed threshold.
  • Dataset mismatch: automated attack datasets produce far more errors than human-authored dialogs.
  • Short or mid-length fragmented dialogs (especially 5–6 turns) yield peak error despite high confidence.

Core Entities

Models

  • gpt-4.1
  • claude-sonnet-4
  • Qwen3-235B-A22B-FP8
  • kimi-k2
  • deepseek-v3.1
  • gemini-2.5-flash

Metrics

  • Accuracy
  • ECE
  • Brier
  • Wrong@0.80
  • Wrong@0.90
  • Wrong@0.95
  • AURC
  • F1
  • similarity_score

Datasets

  • SafeMTData_Attack600
  • SafeMTData_1K
  • MHJ