Overview
The evaluation is large (N=2,817) and uses a human-calibrated correctness threshold, which makes the diagnostics robust. Scope is limited to six commercial models and a single LLM judge, so expect biases from that setup.
Citations: 0
Evidence Strength: 0.85
Confidence: 0.87
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 0/5
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 50%
Production readiness: 60%
Novelty: 45%
Why It Matters For Business
LLMs used to judge user intent or moderate content often misinfer hidden goals, sometimes with high confidence. Rely on explicit objectives, confidence gates, and human oversight for risky decisions.
Summary TLDR
ObjexMT is a benchmark that asks an LLM to extract a single-sentence "base objective" from multi-turn jailbreak dialogs and to report its confidence. The paper evaluates six models on 2,817 instances and freezes a human-aligned similarity threshold (τ⋆=0.66) to decide correctness. Accuracy across the six models ranges from 47.4% to 61.2%, calibration varies widely (ECE 0.206–0.417), and high-confidence errors persist (Wrong@0.90 15–48%). Main recommendations: surface objectives explicitly when possible, gate actions by confidence, and add human oversight for high-stakes cases. Data and spreadsheets are released on GitHub.
Problem Statement
Can an LLM acting as a judge reliably recover a conversation's hidden single-sentence objective under adversarial multi-turn jailbreaks, and can it honestly report when that inference is trustworthy? This matters because many moderation and auditing pipelines must infer intent from long, noisy dialogues rather than clear prompts.
Main Contribution
ObjexMT: a benchmark and protocol for extracting a single-sentence base objective from multi-turn jailbreak transcripts and for measuring self-reported confidence calibration.
Human-aligned evaluation: convert LLM-judge semantic similarity to binary correctness via a frozen threshold calibrated on N=300 human labels (τ⋆=0.66; F1=0.891).
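The threshold-freezing step above can be illustrated with a minimal sketch. It assumes the sweep picks the similarity cutoff that maximizes F1 against binary human labels; the function names, the 0.01 step size, and the toy data are hypothetical and not taken from the paper.

```python
def f1_at_threshold(scores, labels, tau):
    """F1 when a similarity score >= tau is treated as 'correct'."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= tau and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= tau and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < tau and y == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def sweep_threshold(scores, labels, step=0.01):
    """Return (best_tau, best_f1) over a grid on [0, 1]; ties go to the lowest tau."""
    n_steps = int(round(1 / step))
    candidates = [round(i * step, 10) for i in range(n_steps + 1)]
    best_tau = max(candidates, key=lambda t: f1_at_threshold(scores, labels, t))
    return best_tau, f1_at_threshold(scores, labels, best_tau)


# Toy calibration set (illustrative only): judge similarity scores and human labels.
scores = [0.20, 0.40, 0.55, 0.70, 0.80, 0.90]
labels = [0, 0, 0, 1, 1, 1]
tau, f1 = sweep_threshold(scores, labels)
print(tau, f1)
```

Once chosen, the threshold is held fixed for all later scoring, so every model is judged against the same cutoff.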
Key Findings
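Extraction accuracy is low to moderate: 0.474 to 0.612 across the six models on the full benchmark (N=2,817).
Judge calibration varies widely and can be poor: ECE ranges from 0.206 to 0.417, and Wrong@0.90 from 15% to 48%.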
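For readers who want to reproduce a calibration figure on their own data, here is a minimal sketch of expected calibration error (ECE). It assumes the common equal-width binning formulation with 10 bins; the paper's exact binning is not stated in this summary, so treat the bin count and function name as assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted mean |accuracy - mean confidence| over equal-width confidence bins."""
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one prediction")
    bins = [[] for _ in range(n_bins)]
    for p, c in zip(confidences, correct):
        # Clamp so that p == 1.0 falls into the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, c))
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(p for p, _ in b) / len(b)
        accuracy = sum(c for _, c in b) / len(b)
        ece += (len(b) / n) * abs(accuracy - avg_conf)
    return ece


# Toy example: four answers at 0.95 confidence, only half of them correct.
ece = expected_calibration_error([0.95, 0.95, 0.95, 0.95], [1, 1, 0, 0])
print(ece)
```

A perfectly calibrated model would score 0; larger values mean stated confidence and observed accuracy diverge.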
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| judge threshold (human-aligned) | τ⋆ = 0.66 (F1 = 0.891 on N=300) | — | — | Calibration set N=300 | Threshold selected via sweep against human labels | Table 1, §3.6 |
| Accuracy | kimi-k2 0.612; claude-sonnet-4 0.603; deepseek-v3.1 0.599; gemini-2.5-flash 0.542; gpt-4.1 0.490; Qwen3-235B 0.474 | — | — | Full benchmark N=2,817 (all datasets) | Bootstrap 95% CIs reported; top-3 not mutually significant | Table 3, Table 2 |
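The table mentions bootstrap 95% CIs for accuracy. A minimal sketch of one way to obtain such an interval is shown below, assuming a percentile bootstrap over per-instance correctness flags; the resample count, seed, and function name are hypothetical, and the paper's exact procedure may differ.

```python
import random


def bootstrap_accuracy_ci(correct, n_resamples=2000, alpha=0.05, seed=0):
    """Return (accuracy, lower, upper) using a percentile bootstrap."""
    rng = random.Random(seed)
    n = len(correct)
    accs = []
    for _ in range(n_resamples):
        # Resample instances with replacement and recompute accuracy.
        sample = rng.choices(correct, k=n)
        accs.append(sum(sample) / n)
    accs.sort()
    lower = accs[int((alpha / 2) * n_resamples)]
    upper = accs[min(int((1 - alpha / 2) * n_resamples), n_resamples - 1)]
    return sum(correct) / n, lower, upper


# Toy data (illustrative only): 60 correct and 40 incorrect extractions.
flags = [1] * 60 + [0] * 40
acc, lower, upper = bootstrap_accuracy_ci(flags)
print(acc, lower, upper)
```

Overlapping intervals between two models are a hint, not a test, that their difference may not be significant; a paired bootstrap on per-instance differences is the more direct check.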
What To Try In 7 Days
Run ObjexMT on a small slice of your moderator dialogs to measure judge accuracy and ECE using the released spreadsheets.
Require users to restate objectives when dialogs are short or fragmented; add a 'state your goal' prompt before auto-actions.
Implement a simple confidence gate (e.g., act automatically only when the model's self-reported confidence p > 0.9 and its measured ECE is acceptable) and route all other cases to humans.
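The confidence gate in the last item could look like the following minimal sketch. The `JudgeOutput` type, the `route` function, the route labels, and the `max_ece=0.1` cutoff are all hypothetical; only the p > 0.9 example comes from the text above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class JudgeOutput:
    objective: str      # extracted single-sentence objective
    confidence: float   # model's self-reported confidence in [0, 1]


def route(output: JudgeOutput,
          model_ece: Optional[float],
          min_confidence: float = 0.9,
          max_ece: float = 0.1) -> str:
    """Return 'auto_action' only for confident outputs from a calibrated model."""
    # Unknown or poor calibration: confidence cannot be trusted, so escalate.
    if model_ece is None or model_ece > max_ece:
        return "human_review"
    if output.confidence > min_confidence:
        return "auto_action"
    return "human_review"


decision = route(JudgeOutput("obtain restricted instructions", 0.95), model_ece=0.05)
print(decision)
```

The ECE check comes first on purpose: a high self-reported confidence from a poorly calibrated model is not evidence of correctness, so such outputs go to a human regardless of p.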
Risks & Boundaries
Limitations
Only six large commercial models evaluated; no small open-source models or safety-tuned variants.
Single LLM judge (gpt-4.1) may introduce systematic bias in similarity scoring.
When Not To Use
Don't use as a final gate for high-stakes automated moderation without human oversight.
Not suitable when intents are inherently multi-objective or require multi-sentence recovery.
Failure Modes
Overconfident wrong answers: models report high p but are incorrect (Wrong@0.90 up to 47.7%).
Systematic judge bias from single similarity model and fixed threshold.
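To check the overconfidence failure mode on your own logs, a metric in the spirit of Wrong@0.90 can be sketched as below. This assumes Wrong@0.90 means the share of incorrect answers among predictions with self-reported confidence of at least 0.90; that reading and the function name are assumptions, not the paper's stated definition.

```python
def wrong_at(confidences, correct, threshold=0.90):
    """Fraction of wrong answers among predictions with confidence >= threshold.

    Returns None when no prediction reaches the threshold.
    """
    high = [c for p, c in zip(confidences, correct) if p >= threshold]
    if not high:
        return None
    return 1 - sum(high) / len(high)


# Toy data (illustrative only): four high-confidence answers, two of them wrong.
rate = wrong_at([0.95, 0.92, 0.91, 0.99, 0.50], [1, 0, 1, 0, 0])
print(rate)
```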

