ObjexMT: test if LLM "judges" can recover hidden objectives and know when they're confident

Overview

Decision Snapshot: Needs Validation

Large N=2,817 evaluation plus human-calibrated threshold gives robust diagnostics, but scope is limited to six commercial models and a single judge; expect biases from that setup.

Citations: 0

Evidence Strength: 0.85

Confidence: 0.87

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 0/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 45%

Authors

Hyunjun Kim, Junwoo Ha, Sangyoon Yu, Haon Park

Links

Abstract / PDF / Code / Data

Why It Matters For Business

If you use LLMs to judge user intent or moderate content, they often misinfer hidden goals and sometimes do so with high confidence, so rely on explicit objectives, confidence gates, and human oversight for risky decisions.

Who Should Care

Product Manager, ML Engineer, Engineering Lead, CTO

Summary TLDR

ObjexMT is a benchmark that asks an LLM to extract a single-sentence "base objective" from multi-turn jailbreak dialogs and report its confidence. The paper evaluates six models on 2,817 instances and freezes a human-aligned similarity threshold (τ⋆=0.66) for correctness. Accuracy across the six models ranges from 47% to 61%, calibration varies widely (ECE 0.206–0.417), and high-confidence errors persist (Wrong@0.90 15–48%). Main recommendations: surface objectives explicitly when possible, gate actions by confidence, and add human oversight for high-stakes cases. Data and spreadsheets are released on GitHub.

Problem Statement

Can an LLM acting as a judge reliably recover a conversation's hidden single-sentence objective under adversarial multi-turn jailbreaks, and can it honestly report when that inference is trustworthy? This matters because many moderation and auditing pipelines must infer intent from long, noisy dialogues rather than clear prompts.

Main Contribution

ObjexMT: a benchmark and protocol for extracting a single-sentence base objective from multi-turn jailbreak transcripts and for measuring self-reported confidence calibration.

Human-aligned evaluation: convert LLM-judge semantic similarity to binary correctness via a frozen threshold calibrated on N=300 human labels (τ⋆=0.66; F1=0.891).
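The threshold calibration described above can be sketched as a simple grid sweep that picks the similarity cutoff with the best F1 against human labels. This is a minimal illustration, not the paper's code: the function names, the 0.01 grid step, the "score >= tau" decision rule, and the toy data are all assumptions.

```python
from typing import List, Tuple

def f1_at_threshold(scores: List[float], labels: List[int], tau: float) -> float:
    """F1 of the rule 'score >= tau means match' against binary human labels."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= tau and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= tau and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < tau and y == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def sweep_threshold(scores: List[float], labels: List[int],
                    step: float = 0.01) -> Tuple[float, float]:
    """Return (best_tau, best_f1) over a grid on [0, 1]; ties keep the lowest tau."""
    best_tau, best_f1 = 0.0, -1.0
    n_steps = int(round(1 / step))
    for i in range(n_steps + 1):
        tau = round(i * step, 10)
        f1 = f1_at_threshold(scores, labels, tau)
        if f1 > best_f1:
            best_tau, best_f1 = tau, f1
    return best_tau, best_f1

# Toy data (hypothetical, not taken from the paper's N=300 calibration set).
toy_scores = [0.10, 0.40, 0.55, 0.70, 0.80, 0.95]
toy_labels = [0, 0, 0, 1, 1, 1]
tau_star, f1_star = sweep_threshold(toy_scores, toy_labels)
```

Once selected, the threshold is frozen and reused on the full benchmark, so later results cannot tune it after the fact.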

Key Findings

Top-line extraction accuracy is low to moderate across models.

Numbers: Accuracy range 0.474–0.612 (N=2,817)

Practical Use: Do not rely on LLM judges to recover hidden objectives; expect roughly 39–53% of extractions to fail, and plan for human review or confidence gating.

Evidence Ref: Tables 2-3, Abstract

Judge calibration varies and can be poor.

Numbers: ECE 0.206–0.417; Brier 0.254–0.416

Practical Use: Use calibration metrics before deploying confidence-based gating; some models (e.g., claude-sonnet-4) are meaningfully better calibrated.

Evidence Ref: Table 5, Fig.2
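For readers who want to compute the two calibration metrics on their own judge outputs, the following is a minimal sketch. It assumes each prediction carries a self-reported confidence in [0, 1] and a binary correctness flag, and it uses 10 equal-width bins for ECE; the bin count and function names are assumptions, and the paper's exact binning may differ.

```python
from typing import List

def brier_score(conf: List[float], correct: List[int]) -> float:
    """Mean squared gap between stated confidence and the 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(conf, correct)) / len(conf)

def expected_calibration_error(conf: List[float], correct: List[int],
                               n_bins: int = 10) -> float:
    """Weighted average of |accuracy - mean confidence| over equal-width bins."""
    n = len(conf)
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(conf, correct):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(p for p, _ in b) / len(b)
        acc = sum(y for _, y in b) / len(b)
        ece += (len(b) / n) * abs(acc - avg_conf)
    return ece

# Toy data (hypothetical): two predictions at 0.9 and two at 0.6, half correct.
toy_conf = [0.9, 0.9, 0.6, 0.6]
toy_correct = [1, 0, 1, 0]
toy_brier = brier_score(toy_conf, toy_correct)
toy_ece = expected_calibration_error(toy_conf, toy_correct)
```

Lower is better for both metrics. ECE measures only the gap between stated confidence and observed accuracy, while the Brier score also penalizes uninformative confidence.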

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
judge threshold (human-aligned)	τ⋆ = 0.66 (F1 = 0.891 on N=300)	—	—	Calibration set N=300	Threshold selected via sweep against human labels	Table 1, §3.6
Accuracy	kimi-k2 0.612; claude-sonnet-4 0.603; deepseek-v3.1 0.599; gemini-2.5-flash 0.542; gpt-4.1 0.490; Qwen3-235B 0.474	—	—	Full benchmark N=2,817 (all datasets)	Bootstrap 95% CIs reported; top-3 not mutually significant	Table 3, Table 2
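The table mentions bootstrap 95% CIs for accuracy. A percentile bootstrap over per-instance correctness flags is one common way to obtain such intervals; the sketch below assumes that approach, and the resample count, seed, and function name are illustrative choices rather than the paper's settings.

```python
import random
from typing import List, Tuple

def bootstrap_accuracy_ci(correct: List[int], n_resamples: int = 2000,
                          alpha: float = 0.05, seed: int = 0) -> Tuple[float, float, float]:
    """Return (accuracy, lower, upper) using a percentile bootstrap."""
    rng = random.Random(seed)
    n = len(correct)
    stats = []
    for _ in range(n_resamples):
        sample = rng.choices(correct, k=n)  # resample instances with replacement
        stats.append(sum(sample) / n)
    stats.sort()
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[min(int((1 - alpha / 2) * n_resamples), n_resamples - 1)]
    return sum(correct) / n, lower, upper

# Toy data (hypothetical): 60 correct extractions out of 100.
toy_flags = [1] * 60 + [0] * 40
acc, ci_lo, ci_hi = bootstrap_accuracy_ci(toy_flags)
```

Overlapping intervals between two models suggest, but do not prove, that their accuracy difference is not statistically significant; a paired bootstrap on the difference is a stricter check.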

What To Try In 7 Days

Run ObjexMT on a small slice of your moderator dialogs to measure judge accuracy and ECE using the released spreadsheets.

Require users to restate objectives when dialogs are short or fragmented; add a 'state your goal' prompt before auto-actions.

Implement a simple confidence gate (e.g., accept actions only if model p>0.9 and ECE acceptable) and route low-confidence cases to humans.
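The confidence gate in the last item could look like the sketch below. The function name, the return labels, and the ECE ceiling of 0.10 are hypothetical; only the p>0.9 cutoff comes from the text above. Note that every model in the paper has an ECE above 0.2, so this ceiling would send all of their outputs to human review.

```python
def route_decision(confidence: float, model_ece: float,
                   p_min: float = 0.9, ece_max: float = 0.10) -> str:
    """Auto-accept only when the judge is confident AND its calibration is acceptable.

    confidence: the judge's self-reported probability for this extraction.
    model_ece:  ECE measured for this judge on a held-out slice of your own data.
    ece_max:    hypothetical ceiling; choose it from your own risk tolerance.
    """
    if model_ece <= ece_max and confidence > p_min:
        return "auto"
    return "human_review"

# Usage with hypothetical values.
decision_ok = route_decision(confidence=0.95, model_ece=0.05)
decision_miscalibrated = route_decision(confidence=0.95, model_ece=0.30)
decision_unsure = route_decision(confidence=0.85, model_ece=0.05)
```

Checking ECE alongside the per-item confidence matters because a poorly calibrated judge can report p>0.9 on a large share of wrong answers.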

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/hyunjun1121/ObjexMT_dataset

Data URLs

https://github.com/hyunjun1121/ObjexMT_dataset

Risks & Boundaries

Limitations

Only six large commercial models evaluated; no small open-source models or safety-tuned variants.

Single LLM judge (gpt-4.1) may introduce systematic bias in similarity scoring.

When Not To Use

Don't use as a final gate for high-stakes automated moderation without human oversight.

Not suitable when intents are inherently multi-objective or require multi-sentence recovery.

Failure Modes

Overconfident wrong answers: models report high p but are incorrect (Wrong@0.90 up to 47.7%).

Systematic judge bias from single similarity model and fixed threshold.
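The overconfidence failure mode can be measured on your own data with a Wrong@τ style metric. The sketch below assumes Wrong@τ means the share of predictions with confidence at or above τ that are incorrect; that reading and the function name are assumptions, so check the paper's definition before comparing numbers.

```python
from typing import List, Optional

def wrong_at(conf: List[float], correct: List[int], tau: float) -> Optional[float]:
    """Fraction of predictions with confidence >= tau that are wrong.

    Returns None when no prediction reaches the confidence level.
    """
    selected = [y for p, y in zip(conf, correct) if p >= tau]
    if not selected:
        return None
    return sum(1 for y in selected if y == 0) / len(selected)

# Toy data (hypothetical): four high-confidence predictions, two of them wrong.
toy_conf = [0.95, 0.92, 0.91, 0.99, 0.50]
toy_correct = [1, 0, 1, 0, 0]
toy_wrong_90 = wrong_at(toy_conf, toy_correct, 0.90)
toy_wrong_none = wrong_at(toy_conf, toy_correct, 0.999)
```

A perfectly calibrated judge would show Wrong@0.90 at or below 10%, so values in the 15–48% range reported here indicate substantial overconfidence.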

Core Entities

Models

gpt-4.1, claude-sonnet-4, Qwen3-235B-A22B-FP8, kimi-k2, deepseek-v3.1, gemini-2.5-flash

Metrics

Accuracy, ECE, Brier, Wrong@0.80, Wrong@0.90, Wrong@0.95, AURC, F1, similarity_score

Datasets

SafeMTData_Attack600, SafeMTData_1K, MHJ
