Overview
Production Readiness
0.6
Novelty Score
0.45
Cost Impact Score
0.5
Citation Count
0
Why It Matters For Business
LLM judges that infer user intent or moderate content often misread hidden goals, and sometimes do so with high confidence. State objectives explicitly where possible, gate automated actions on confidence, and keep human oversight for risky decisions.
Summary TLDR
ObjexMT is a benchmark that asks an LLM to extract a single-sentence "base objective" from multi-turn jailbreak dialogs and report its confidence. The paper evaluates six models on 2,817 instances and freezes a human-aligned similarity threshold (τ⋆=0.66) for correctness. Top models reach 47–61% accuracy, calibration varies widely (ECE 0.206–0.417), and high-confidence errors persist (Wrong@0.90 15–48%). Main recommendations: surface objectives explicitly when possible, gate actions by confidence, and add human oversight for high-stakes cases. Data and spreadsheets are released on GitHub.
Problem Statement
Can an LLM acting as a judge reliably recover a conversation's hidden single-sentence objective under adversarial multi-turn jailbreaks, and can it honestly report when that inference is trustworthy? This matters because many moderation and auditing pipelines must infer intent from long, noisy dialogues rather than clear prompts.
Main Contribution
ObjexMT: a benchmark and protocol for extracting a single-sentence base objective from multi-turn jailbreak transcripts and for measuring self-reported confidence calibration.
Human-aligned evaluation: convert LLM-judge semantic similarity to binary correctness via a frozen threshold calibrated on N=300 human labels (τ⋆=0.66; F1=0.891).
Large-scale evaluation: single-pass tests of six models on 2,817 instances across three public datasets, with accuracy, ECE, Brier, Wrong@High-Conf, and selective-prediction analyses.
Operational diagnostics and prescriptions: dataset heterogeneity, length/turn effects, and practical guidance to gate actions by confidence or surface objectives.
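The evaluation pipeline described above (thresholded similarity, then calibration metrics) can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names, the inclusive `>=` comparison, and the choice of 10 equal-width bins for ECE are assumptions; only the threshold value 0.66 comes from the summary.

```python
from typing import Sequence

TAU_STAR = 0.66  # frozen human-aligned threshold reported in the summary


def is_correct(similarity: float, tau: float = TAU_STAR) -> bool:
    """Map a judge similarity score in [0, 1] to binary correctness.

    Assumption: the comparison is inclusive (score >= tau counts as correct).
    """
    return similarity >= tau


def brier_score(confidences: Sequence[float], correct: Sequence[bool]) -> float:
    """Mean squared gap between self-reported confidence and the 0/1 outcome."""
    if not confidences or len(confidences) != len(correct):
        raise ValueError("need two non-empty sequences of equal length")
    total = sum((p - float(c)) ** 2 for p, c in zip(confidences, correct))
    return total / len(confidences)


def expected_calibration_error(
    confidences: Sequence[float], correct: Sequence[bool], n_bins: int = 10
) -> float:
    """ECE with equal-width confidence bins (the bin count is an assumption)."""
    if not confidences or len(confidences) != len(correct):
        raise ValueError("need two non-empty sequences of equal length")
    bins = [[] for _ in range(n_bins)]
    for p, c in zip(confidences, correct):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, float(c)))
    n = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(p for p, _ in members) / len(members)
        accuracy = sum(c for _, c in members) / len(members)
        # Weight each bin's |accuracy - confidence| gap by its share of samples.
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece
```

Feeding the judge's similarity scores through `is_correct` yields the binary labels that `brier_score` and `expected_calibration_error` compare against each model's self-reported confidence.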
Key Findings
Top-line extraction accuracy is low to moderate across models.
Judge calibration varies and can be poor.
High-confidence errors remain common.
Dataset construction drives difficulty (large heterogeneity).
Transcript length and turn count affect recovery nonlinearly.
Results
judge threshold (human-aligned)
Accuracy
best calibration (ECE / Brier / AURC)
high-confidence error (Wrong@0.90) range
Who Should Care
What To Try In 7 Days
Run ObjexMT on a small slice of your moderation dialogs and use the released spreadsheets to measure judge accuracy and ECE.
Require users to restate objectives when dialogs are short or fragmented; add a 'state your goal' prompt before auto-actions.
Implement a simple confidence gate (e.g., allow automated actions only when the model's self-reported confidence p exceeds 0.9 and the judge's measured ECE is within a tolerance you set) and route all other cases to humans.
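A confidence gate of the kind suggested above might look like the following sketch. The `Decision` type, the `gate` function, and the `max_ece=0.10` tolerance are hypothetical choices made for illustration; the summary only suggests p > 0.9 as an example cutoff and does not say what ECE counts as acceptable.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    action: str  # "auto" or "human_review"
    reason: str


def gate(
    confidence: float,
    measured_ece: float,
    min_conf: float = 0.9,   # example cutoff from the text
    max_ece: float = 0.10,   # hypothetical tolerance; tune on your own data
) -> Decision:
    """Allow an automated action only if the judge is both confident and calibrated."""
    if measured_ece > max_ece:
        # A poorly calibrated judge's confidence is not trustworthy, however high.
        return Decision("human_review", "judge ECE above tolerance")
    if confidence > min_conf:
        return Decision("auto", "confidence above gate and ECE within tolerance")
    return Decision("human_review", "confidence below gate")
```

With these defaults, `gate(0.95, 0.206)` still routes to a human, because 0.206 (the lowest ECE reported in the summary) exceeds the hypothetical tolerance. Checking calibration before confidence keeps a high p from an uncalibrated judge from triggering automation on its own.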
Reproducibility
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Only six large commercial models evaluated; no small open-source models or safety-tuned variants.
- Single LLM judge (gpt-4.1) may introduce systematic bias in similarity scoring.
- Single-sentence gold objectives may oversimplify multi-objective attacks.
- Deterministic decoding (one pass) understates model output variability.
When Not To Use
- Don't use as a final gate for high-stakes automated moderation without human oversight.
- Not suitable when intents are inherently multi-objective or require multi-sentence recovery.
- Avoid assuming benchmark accuracies transfer to domains with different obfuscation styles.
Failure Modes
- Overconfident wrong answers: models report high confidence p yet are incorrect (Wrong@0.90 up to 47.7%).
- Systematic judge bias from a single similarity model and a fixed threshold.
- Dataset mismatch: automated attack datasets produce far more errors than human-authored dialogs.
- Short or mid-length fragmented dialogs (especially 5–6 turns) yield peak error despite high confidence.
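The overconfidence failure mode can be measured directly on your own logs. The sketch below assumes that Wrong@t is the error rate among predictions whose self-reported confidence is at least t, and that AURC is the mean selective risk when predictions are added in order of decreasing confidence. These are common definitions, but the paper's exact formulas are not given in this summary, and the function names are illustrative.

```python
from typing import Optional, Sequence


def wrong_at(
    confidences: Sequence[float], correct: Sequence[bool], threshold: float
) -> Optional[float]:
    """Error rate among predictions with confidence >= threshold.

    Returns None when no prediction reaches the threshold.
    """
    selected = [c for p, c in zip(confidences, correct) if p >= threshold]
    if not selected:
        return None
    return sum(1 for c in selected if not c) / len(selected)


def aurc(confidences: Sequence[float], correct: Sequence[bool]) -> float:
    """Area under the risk-coverage curve, approximated as the mean selective risk."""
    if not confidences or len(confidences) != len(correct):
        raise ValueError("need two non-empty sequences of equal length")
    # Most confident predictions are covered first.
    ranked = sorted(zip(confidences, correct), key=lambda pc: pc[0], reverse=True)
    errors = 0
    risks = []
    for k, (_, is_right) in enumerate(ranked, start=1):
        errors += 0 if is_right else 1
        risks.append(errors / k)  # risk when only the top-k predictions are kept
    return sum(risks) / len(risks)
```

A high `wrong_at(..., 0.90)` value means a confidence gate alone will still let many wrong inferences through, which is why the guidance pairs gating with human oversight.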
Core Entities
Models
- gpt-4.1
- claude-sonnet-4
- Qwen3-235B-A22B-FP8
- kimi-k2
- deepseek-v3.1
- gemini-2.5-flash
Metrics
- Accuracy
- ECE
- Brier
- Wrong@0.80
- Wrong@0.90
- Wrong@0.95
- AURC
- F1
- similarity_score
Datasets
- SafeMTData_Attack600
- SafeMTData_1K
- MHJ

