Case-aware LLM-as-a-judge scoring: eight enterprise metrics, severity-weighting, and JSON outputs for multi-turn RAG

February 23, 2026, 7 min read

Overview

Decision Snapshot: Needs Validation

The framework is practical and reproducible on similar enterprise case sets; experiments show statistically significant long-query separation and strong judge-human agreement, but rubric weights and dataset representativeness need local calibration.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 3/5

Findings with evidence refs: 5/5

Results with explicit delta: 2/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 30%

Production readiness: 70%

Novelty: 45%

Authors

Mukul Chhabra, Luigi Medrano, Arush Verma

Why It Matters For Business

Enterprise RAG failures often stem from workflow or identifier mistakes that standard metrics miss; this framework gives targeted diagnostics for release gating, regression tests, and monitoring to reduce costly production incidents.

Who Should Care

Summary TLDR

The paper introduces a practical, case-aware LLM-as-a-judge framework to evaluate multi-turn enterprise Retrieval-Augmented Generation (RAG) systems. It defines eight operational metrics (e.g., Hallucination, Retrieval Correctness, Identifier Integrity), a severity-aware scoring scheme, deterministic JSON outputs for auditability, and a single-LLM-per-turn batch pipeline. On an anonymized enterprise dataset, the framework revealed workflow and identifier failures that generic faithfulness/relevance metrics miss. In experiments, GPT-oss outperformed Llama on long diagnostic queries (weighted aggregate 0.8099 vs 0.7136, p=0.0011). The implementation is model-agnostic, designed for regression/g

Problem Statement

Standard RAG evaluation focuses on single-turn faithfulness or relevance and conflates retrieval, grounding, and resolution. Enterprise support workflows are multi-turn, require strict handling of structured identifiers (error codes, versions), and must follow prescribed troubleshooting order. Existing metrics miss operational failure modes like case misidentification, workflow misalignment, and partial resolution.

Main Contribution

Formalize enterprise multi-turn RAG evaluation needs and recurring operational failure modes.

Define eight case-aware metrics separating retrieval, grounding, answer utility, precision, and workflow alignment.
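A per-turn judge record covering the eight metrics could be validated as in the minimal sketch below. The snake_case key names, the flat JSON object shape, and the [0, 1] score range are assumptions made for illustration; the paper's actual schema may differ.

```python
import json

# Metric names follow the summary's list; the snake_case keys are an assumption.
METRICS = (
    "hallucination",
    "retrieval_correctness",
    "context_sufficiency",
    "answer_helpfulness",
    "answer_type_fit",
    "identifier_integrity",
    "case_issue_identification",
    "resolution_alignment",
)


def parse_judge_record(raw: str) -> dict[str, float]:
    """Parse one judge response and reject anything that is off-schema."""
    record = json.loads(raw)
    if not isinstance(record, dict):
        raise ValueError("judge output must be a JSON object")
    missing = [m for m in METRICS if m not in record]
    extra = [k for k in record if k not in METRICS]
    if missing or extra:
        raise ValueError(f"schema mismatch: missing={missing}, extra={extra}")
    scores = {}
    for name in METRICS:
        value = record[name]
        # bool is a subclass of int, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError(f"{name} must be numeric")
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} out of range: {value}")
        scores[name] = float(value)
    return scores
```

Rejecting off-schema records instead of silently repairing them keeps the stored judgments auditable, which is the stated purpose of the JSON outputs.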

Key Findings

Case-aware metrics reveal workflow and identifier failures that proxy metrics miss.

Practical Use: Use multi-dimensional, case-aware scoring instead of only faithfulness/relevance to find deploy-blocking issues (e.g., workflow violations, identifier corruption).

Evidence Ref: Sections 8.5, 8.6; Table 4

On long diagnostic queries, GPT-oss achieved a higher weighted aggregate than Llama.

Numbers: GPT-oss 0.8099 vs Llama 0.7136; ∆=0.0963; p=0.0011

Practical Use: Prefer the model with higher case-aware S_final for context-heavy enterprise tasks; test models on long multi-turn cases before rollout.

Evidence Ref: Table 2 (Long Queries) and Section 8.1

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Weighted Aggregate (Long queries) | GPT-oss 0.8099; Llama 0.7136 | - | 0.0963 | Long queries (n=63 conversations) | Table 2; Section 8.1 | Table 2
Weighted Aggregate (Short queries) | GPT-oss 0.7353; Llama 0.7202 | - | 0.0151 | Short queries (n=70 conversations) | Table 2; Section 8.1 | Table 2
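The deltas in the table are plain differences between the two reported aggregates, which the short check below reproduces. It uses `Decimal` so the four-digit figures subtract exactly; the variable names are illustrative only.

```python
from decimal import Decimal

# Weighted aggregates as reported in the Results table: (GPT-oss, Llama).
reported = {
    "long": (Decimal("0.8099"), Decimal("0.7136")),
    "short": (Decimal("0.7353"), Decimal("0.7202")),
}

# Delta = GPT-oss aggregate minus Llama aggregate, per split.
deltas = {split: gpt_oss - llama for split, (gpt_oss, llama) in reported.items()}

for split, delta in deltas.items():
    print(split, delta)
```

The gap on long queries (0.0963) is several times the gap on short queries (0.0151), which matches the summary's claim that the separation shows up on long diagnostic queries.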

What To Try In 7 Days

Run the JSON judge on 100 representative multi-turn cases to get per-metric baselines.

Set a conservative S_final release threshold and run pre-deployment checks on candidate models.

Add identifier integrity checks to your evaluation pipeline to catch corrupted commands/IDs early.
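As a starting point for the identifier integrity step, the sketch below flags identifiers that appear in a generated answer but not in the retrieved context. This is a simple rule-based pre-check, not the paper's LLM-judged metric, and the regular expression (codes like `ERR-1042`, versions like `v2.10.3`) is a hypothetical pattern that would need adapting to your own identifier formats.

```python
import re

# Hypothetical patterns: uppercase codes such as "ERR-1042" and
# dotted versions with three or more parts such as "v2.10.3".
ID_PATTERN = re.compile(r"\b(?:[A-Z]{2,}-\d+|v?\d+\.\d+(?:\.\d+)+)\b")


def unsupported_identifiers(answer: str, context: str) -> set[str]:
    """Return identifiers found in the answer but absent from the context."""
    return set(ID_PATTERN.findall(answer)) - set(ID_PATTERN.findall(context))


context = "Error ERR-1042 occurs on v2.10.3; upgrade to v2.10.4."
answer = "For ERR-1024 on v2.10.3, upgrade to v2.10.4."
print(unsupported_identifiers(answer, context))
```

Here the answer transposes two digits of the error code, so the check reports it as unsupported while the version strings pass.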

Reproducibility

Code Available: Yes
Data Available: No
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Requires representative multi-turn case logs and good case metadata (subject/description).

Rubric weights and severity bands must be tuned for each organization and risk tolerance.
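To make the tuning point concrete, here is one way weights and severity bands could be kept as editable configuration. Everything in it is an assumption: the summary does not give the paper's weights, band cut-offs, or aggregation formula, so the normalized weighted mean, the numbers, and the band labels below are placeholders.

```python
# Hypothetical weights, NOT the paper's values; tune per organization.
WEIGHTS = {
    "hallucination": 2.0,
    "retrieval_correctness": 1.0,
    "context_sufficiency": 1.0,
    "answer_helpfulness": 1.0,
    "answer_type_fit": 0.5,
    "identifier_integrity": 2.0,
    "case_issue_identification": 1.5,
    "resolution_alignment": 1.5,
}

# Hypothetical severity bands as (lower bound, label), highest first.
BANDS = ((0.85, "pass"), (0.70, "review"), (0.0, "block"))


def weighted_aggregate(scores: dict[str, float]) -> float:
    """Normalized weighted mean of per-metric scores in [0, 1]."""
    total = sum(WEIGHTS.values())
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / total


def severity_band(s_final: float) -> str:
    """Map an aggregate score to the first band whose lower bound it meets."""
    for lower, label in BANDS:
        if s_final >= lower:
            return label
    raise ValueError("aggregate score must be non-negative")
```

Keeping the weights and bands in data rather than in code makes local recalibration a configuration change that can be reviewed and versioned alongside the evaluation results.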

When Not To Use

Small single-turn QA benchmarks where multi-turn workflow is irrelevant.

Open-domain evaluation where external knowledge beyond retrieved context is required.

Failure Modes

Case misidentification (wrong issue despite relevant text)

Workflow misalignment (violating required sequencing)

Core Entities

Models

Llama-3.3-70B-Instruct, gpt-oss-120b, GPT-4 (judge via Azure OpenAI)

Metrics

Hallucination, Retrieval Correctness, Context Sufficiency, Answer Helpfulness, Answer Type Fit, Identifier Integrity, Case Issue Identification, Resolution Alignment, Weighted Aggregate (S_final)

Datasets

Anonymized enterprise support cases (short: 237 cases, long: 232 cases)