Case-aware LLM-as-a-judge scoring: eight enterprise metrics, severity-weighting, and JSON outputs for multi-turn RAG

Overview

Decision Snapshot: Needs Validation

The framework is practical and reproducible on similar enterprise case sets; experiments show statistically significant long-query separation and strong judge-human agreement, but rubric weights and dataset representativeness need local calibration.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 3/5

Findings with evidence refs: 5/5

Results with explicit delta: 2/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 30%

Production readiness: 70%

Novelty: 45%

Authors

Mukul Chhabra, Luigi Medrano, Arush Verma

Links

Abstract / PDF

Why It Matters For Business

Enterprise RAG failures often stem from workflow or identifier mistakes that standard metrics miss; this framework gives targeted diagnostics for release gating, regression tests, and monitoring to reduce costly production incidents.

Who Should Care

CTO, Product Manager, ML Engineer, Engineering Lead, Data Scientist

Summary TLDR

The paper introduces a practical, case-aware LLM-as-a-judge framework to evaluate multi-turn enterprise Retrieval-Augmented Generation (RAG) systems. It defines eight operational metrics (e.g., Hallucination, Retrieval Correctness, Identifier Integrity), a severity-aware scoring scheme, deterministic JSON outputs for auditability, and a single-LLM-per-turn batch pipeline. On an anonymized enterprise dataset, the framework revealed workflow and identifier failures that generic faithfulness/relevance metrics miss. In experiments, GPT-oss outperformed Llama on long diagnostic queries (weighted aggregate 0.8099 vs 0.7136, p=0.0011). The implementation is model-agnostic, designed for regression/g

Problem Statement

Standard RAG evaluation focuses on single-turn faithfulness or relevance and conflates retrieval, grounding, and resolution. Enterprise support workflows are multi-turn, require strict handling of structured identifiers (error codes, versions), and must follow prescribed troubleshooting order. Existing metrics miss operational failure modes like case misidentification, workflow misalignment, and partial resolution.

Main Contribution

Formalize enterprise multi-turn RAG evaluation needs and recurring operational failure modes.

Define eight case-aware metrics separating retrieval, grounding, answer utility, precision, and workflow alignment.
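A minimal sketch of how a per-turn judge verdict could be folded into a weighted aggregate and emitted as deterministic JSON. The eight metric names come from the summary above; the weights, the 0 to 1 score scale, the record fields (`case_id`, `turn`, `s_final`), and both function names are hypothetical placeholders, not the paper's rubric. The weights simply stand in for severity and would need local calibration.

```python
import json

# Metric names as listed in the summary. The weights are hypothetical
# placeholders standing in for severity; they are not the paper's values.
WEIGHTS = {
    "hallucination": 0.20,
    "retrieval_correctness": 0.15,
    "context_sufficiency": 0.10,
    "answer_helpfulness": 0.15,
    "answer_type_fit": 0.05,
    "identifier_integrity": 0.15,
    "case_issue_identification": 0.10,
    "resolution_alignment": 0.10,
}


def weighted_aggregate(scores, weights=WEIGHTS):
    """Return the weight-normalized mean of per-metric scores in [0, 1]."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing metric scores: {sorted(missing)}")
    total_weight = sum(weights.values())
    return sum(weights[m] * scores[m] for m in weights) / total_weight


def to_audit_record(case_id, turn, scores):
    """Serialize one judged turn as JSON with sorted keys, so the same
    input always produces the same string."""
    record = {
        "case_id": case_id,
        "turn": turn,
        "scores": {m: scores[m] for m in WEIGHTS},
        "s_final": round(weighted_aggregate(scores), 4),
    }
    return json.dumps(record, sort_keys=True)
```

Sorting keys and rounding the aggregate keep the output byte-stable across runs, which makes stored verdicts easy to diff during audits.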

Key Findings

Case-aware metrics reveal workflow and identifier failures that proxy metrics miss.

Practical Use: Use multi-dimensional, case-aware scoring instead of only faithfulness/relevance to find deploy-blocking issues (e.g., workflow violations, identifier corruption).

Evidence Ref: Sections 8.5, 8.6; Table 4

On long diagnostic queries GPT-oss achieved higher weighted aggregate than Llama.

Numbers: GPT-oss 0.8099 vs Llama 0.7136; Δ = 0.0963; p = 0.0011

Practical Use: Prefer the model with higher case-aware S_final for context-heavy enterprise tasks; test models on long multi-turn cases before rollout.

Evidence Ref: Table 2 (Long Queries) and Section 8.1

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Weighted Aggregate (Long queries)	GPT-oss 0.8099; Llama 0.7136	—	0.0963	Long queries (n=63 conversations)	Table 2; Section 8.1	Table 2
Weighted Aggregate (Short queries)	GPT-oss 0.7353; Llama 0.7202	—	0.0151	Short queries (n=70 conversations)	Table 2; Section 8.1	Table 2
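The Delta column can be checked directly from the two reported aggregates per split. This is only a worked arithmetic check on the numbers in the table; the dictionary layout and the `delta` helper are illustrative names, not part of the paper's code.

```python
# Weighted aggregates as reported in the Results table.
rows = {
    "long": {"gpt_oss": 0.8099, "llama": 0.7136},
    "short": {"gpt_oss": 0.7353, "llama": 0.7202},
}


def delta(row):
    """GPT-oss minus Llama, rounded to the table's four decimals."""
    return round(row["gpt_oss"] - row["llama"], 4)


deltas = {split: delta(row) for split, row in rows.items()}
print(deltas)
```

The long-query gap (0.0963) is roughly six times the short-query gap (0.0151), which is consistent with the finding that the separation shows up mainly on long diagnostic queries.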

What To Try In 7 Days

Run the JSON judge on 100 representative multi-turn cases to get per-metric baselines.

Set a conservative S_final release threshold and run pre-deployment checks on candidate models.

Add identifier integrity checks to your evaluation pipeline to catch corrupted commands/IDs early.
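Two of the steps above can be sketched with plain Python: a deterministic identifier check and a release gate on S_final. Everything specific here is an assumption for illustration: the regular expression (error codes shaped like `ERR-1042`, versions shaped like `v2.10.3`), the function names, and the default threshold of 0.75. The paper scores Identifier Integrity with an LLM judge; this regex check is a cheap complementary pre-filter, not a reimplementation of that metric.

```python
import re

# Hypothetical identifier shapes: uppercase error codes such as "ERR-1042"
# and dotted versions such as "v2.10.3". Adjust to your own ID formats.
IDENTIFIER_RE = re.compile(r"\b(?:[A-Z]{2,}-\d+|v?\d+\.\d+(?:\.\d+)*)\b")


def extract_identifiers(text):
    """Return the set of identifier-like tokens found in the text."""
    return set(IDENTIFIER_RE.findall(text))


def unsupported_identifiers(answer, context):
    """Identifiers the answer mentions that never occur in the retrieved
    context: candidates for corrupted or invented IDs."""
    return extract_identifiers(answer) - extract_identifiers(context)


def passes_release_gate(s_final_scores, threshold=0.75):
    """True if the mean S_final over the evaluated cases meets the
    threshold. The 0.75 default is a placeholder, not a recommendation."""
    if not s_final_scores:
        raise ValueError("no scores to evaluate")
    return sum(s_final_scores) / len(s_final_scores) >= threshold


context = "Error ERR-1042 occurs on firmware v2.10.3."
answer = "For ERR-1024, upgrade to v2.10.3."
print(unsupported_identifiers(answer, context))  # {'ERR-1024'}
```

In the example, the answer transposes two digits of the error code, so `ERR-1024` is flagged while the version string, which does appear in the context, is not.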

Reproducibility

Code Available: Yes

Data Available: No

Open Source Status: Partial

License: Unknown

Risks & Boundaries

Limitations

Requires representative multi-turn case logs and good case metadata (subject/description).

Rubric weights and severity bands must be tuned for each organization and risk tolerance.

When Not To Use

Small single-turn QA benchmarks where multi-turn workflow is irrelevant.

Open-domain evaluation where external knowledge beyond retrieved context is required.

Failure Modes

Case misidentification (wrong issue despite relevant text)

Workflow misalignment (violating required sequencing)
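Workflow misalignment can be illustrated with a simple ordering check: the required troubleshooting steps must appear in the response in the prescribed order, with other steps allowed in between. This is a deterministic stand-in for illustration only; the paper assesses alignment with an LLM judge, and the step labels and the function name below are hypothetical.

```python
def follows_required_order(observed_steps, required_steps):
    """True if required_steps occur within observed_steps in the same
    relative order (a subsequence check). Extra steps are allowed."""
    remaining = iter(observed_steps)
    # `step in remaining` advances the iterator past the match, so each
    # required step must be found after the previous one.
    return all(step in remaining for step in required_steps)


required = ["collect_logs", "restart_service"]
print(follows_required_order(
    ["collect_logs", "check_version", "restart_service"], required))  # True
print(follows_required_order(
    ["restart_service", "collect_logs"], required))  # False
```

The second call fails because the restart is suggested before logs are collected, which is the kind of sequencing violation this failure mode describes.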

Core Entities

Models

Llama-3.3-70B-Instruct, gpt-oss-120b, GPT-4 (judge via Azure OpenAI)

Metrics

Hallucination, Retrieval Correctness, Context Sufficiency, Answer Helpfulness, Answer Type Fit, Identifier Integrity, Case Issue Identification, Resolution Alignment, Weighted Aggregate (S_final)

Datasets

Anonymized enterprise support cases (short: 237 cases, long: 232 cases)

