Case-aware LLM-as-a-judge scoring: eight enterprise metrics, severity-weighting, and JSON outputs for multi-turn RAG

February 23, 2026 · 7 min

Overview

Production Readiness

0.7

Novelty Score

0.45

Cost Impact Score

0.3

Citation Count

0

Authors

Mukul Chhabra, Luigi Medrano, Arush Verma

Links

Abstract / PDF

Why It Matters For Business

Enterprise RAG failures often stem from workflow or identifier mistakes that standard metrics miss; this framework gives targeted diagnostics for release gating, regression tests, and monitoring to reduce costly production incidents.

Summary TLDR

The paper introduces a practical, case-aware LLM-as-a-judge framework to evaluate multi-turn enterprise Retrieval-Augmented Generation (RAG) systems. It defines eight operational metrics (e.g., Hallucination, Retrieval Correctness, Identifier Integrity), a severity-aware scoring scheme, deterministic JSON outputs for auditability, and a single-LLM-per-turn batch pipeline. On an anonymized enterprise dataset, the framework revealed workflow and identifier failures that generic faithfulness/relevance metrics miss. In experiments, GPT-oss outperformed Llama on long diagnostic queries (weighted aggregate 0.8099 vs 0.7136, p=0.0011). The implementation is model-agnostic, designed for regression/g

Problem Statement

Standard RAG evaluation focuses on single-turn faithfulness or relevance and conflates retrieval, grounding, and resolution. Enterprise support workflows are multi-turn, require strict handling of structured identifiers (error codes, versions), and must follow prescribed troubleshooting order. Existing metrics miss operational failure modes like case misidentification, workflow misalignment, and partial resolution.

Main Contribution

Formalize enterprise multi-turn RAG evaluation needs and recurring operational failure modes.

Define eight case-aware metrics separating retrieval, grounding, answer utility, precision, and workflow alignment.

Introduce a severity-aware scoring protocol and weighted aggregation for monitoring and release gating.

Provide a deterministic, JSON-schema-based LLM-as-a-judge pipeline for scalable batch evaluation and auditability.

Empirically show the framework surfaces enterprise-critical tradeoffs missed by generic proxy metrics.
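
A minimal sketch of how one JSON judge record per turn might be validated before aggregation. Only the eight metric names come from the paper; the snake_case keys, the `score`/`severity`/`justification` fields, the 0-to-1 score range, and the severity labels are assumptions made for illustration, not the paper's actual schema.

```python
import json

# Metric names follow the paper; the snake_case keys are an assumption.
METRICS = [
    "hallucination",
    "retrieval_correctness",
    "context_sufficiency",
    "answer_helpfulness",
    "answer_type_fit",
    "identifier_integrity",
    "case_issue_identification",
    "resolution_alignment",
]

# Hypothetical severity labels; the paper's actual bands may differ.
SEVERITIES = {"none", "minor", "major", "critical"}


def parse_judge_output(raw: str) -> dict:
    """Parse one judged turn and reject anything that is off-schema."""
    record = json.loads(raw)
    if not isinstance(record, dict):
        raise ValueError("judge output must be a JSON object")
    missing = [m for m in METRICS if m not in record]
    extra = [k for k in record if k not in METRICS]
    if missing or extra:
        raise ValueError(f"missing={missing} unexpected={extra}")
    for name in METRICS:
        entry = record[name]
        score = entry["score"]
        if isinstance(score, bool) or not isinstance(score, (int, float)):
            raise ValueError(f"{name}: score must be a number")
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"{name}: score {score} outside [0, 1]")
        if entry["severity"] not in SEVERITIES:
            raise ValueError(f"{name}: unknown severity {entry['severity']!r}")
        if not isinstance(entry["justification"], str):
            raise ValueError(f"{name}: justification must be a string")
    return record


# Example: a well-formed record for one turn (values are made up).
example = json.dumps(
    {m: {"score": 1.0, "severity": "none", "justification": "ok"} for m in METRICS}
)
parsed = parse_judge_output(example)
```

Rejecting off-schema records up front keeps malformed judge output out of the aggregate and leaves an auditable trail of which turns failed to parse.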

Key Findings

Case-aware metrics reveal workflow and identifier failures that proxy metrics miss.

On long diagnostic queries, GPT-oss achieved a higher weighted aggregate than Llama.

Numbers: GPT-oss 0.8099 vs Llama 0.7136; Δ=0.0963; p=0.0011

The LLM judge aligns well with human experts on high-risk dimensions.

Numbers: Hallucination 88%, Identifier Integrity 91%, Resolution Alignment 84%

Severity-weighted aggregation reduces score inflation from partial correctness.

Per-turn judge cost is small, but total cost grows linearly with the number of turns.

Numbers: ≈ $0.014 per judged turn (3,000 input tokens, 400 output tokens, GPT pricing example)
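
The linear cost scaling can be sketched as a back-of-the-envelope estimate. The $0.014 per-turn figure is the one quoted above; the workload (100 cases averaging 5 turns each) and the function name are hypothetical, and the sketch assumes one judge call per turn at a flat price.

```python
# Per-turn judge cost quoted in the summary (3,000 input + 400 output tokens).
COST_PER_TURN_USD = 0.014


def judge_cost_usd(num_cases: int, turns_per_case: float) -> float:
    """Total judge cost, assuming one judge call per turn at a flat per-turn price."""
    return num_cases * turns_per_case * COST_PER_TURN_USD


# Hypothetical workload: 100 cases averaging 5 turns each.
estimate = judge_cost_usd(num_cases=100, turns_per_case=5)
print(f"${estimate:.2f}")  # prints $7.00
```

Doubling either the number of cases or the average turns per case doubles the estimate, which is the practical meaning of "linear in the number of turns".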

Results

Weighted Aggregate (Long queries)

Value: GPT-oss 0.8099; Llama 0.7136

Weighted Aggregate (Short queries)

Value: GPT-oss 0.7353; Llama 0.7202

Human-judge agreement on critical dimensions

Value: Hallucination 88%; Identifier Integrity 91%; Resolution Alignment 84%

Per-turn evaluation cost (example)

Value: $0.014 per turn
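
The human-judge agreement figures can be read as a share of matching labels. A minimal sketch of simple percent agreement follows; this summary does not state how agreement was computed, so the choice of formula, the function name, and the label values are all assumptions.

```python
def percent_agreement(judge_labels, human_labels):
    """Share of items on which the judge and the human assign the same label."""
    if len(judge_labels) != len(human_labels):
        raise ValueError("label lists must have the same length")
    if not judge_labels:
        raise ValueError("need at least one labelled item")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)


# Made-up labels for four turns on a single metric.
judge = ["pass", "fail", "pass", "pass"]
human = ["pass", "fail", "fail", "pass"]
agreement = percent_agreement(judge, human)  # 3 of 4 labels match
```

Raw percent agreement does not correct for chance agreement, so a team spot-checking its own judge may prefer a chance-corrected statistic alongside it.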

Who Should Care

What To Try In 7 Days

Run the JSON judge on 100 representative multi-turn cases to get per-metric baselines.

Set a conservative S_final release threshold and run pre-deployment checks on candidate models.

Add identifier integrity checks to your evaluation pipeline to catch corrupted commands/IDs early.
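
A minimal sketch of the identifier integrity check suggested above: extract structured identifiers from the answer and flag any that never appear verbatim in the retrieved context. The regular expressions, function names, and example strings are hypothetical; real deployments would tailor the patterns to their own error-code, command, and version formats.

```python
import re

# Hypothetical identifier patterns, for illustration only.
IDENTIFIER_PATTERNS = [
    re.compile(r"\b[A-Z]{2,}-\d{3,}\b"),  # error codes such as ERR-4012
    re.compile(r"\b\d+\.\d+\.\d+\b"),     # three-part versions such as 2.4.1
]


def extract_identifiers(text: str) -> set[str]:
    """Collect every substring that matches one of the identifier patterns."""
    found = set()
    for pattern in IDENTIFIER_PATTERNS:
        found.update(pattern.findall(text))
    return found


def unsupported_identifiers(answer: str, context: str) -> set[str]:
    """Identifiers in the answer that never appear verbatim in the context."""
    return extract_identifiers(answer) - extract_identifiers(context)


# Made-up example: the answer transposes two digits of the error code.
context = "Error ERR-4012 occurs on firmware 2.4.1 after upgrade."
answer = "For ERR-4021 on firmware 2.4.1, restart the agent."
flagged = unsupported_identifiers(answer, context)  # {'ERR-4021'}
```

A deterministic check like this complements the LLM judge's Identifier Integrity score: it is cheap to run on every turn and catches exact-match corruption, while the judge handles cases the patterns do not cover.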

Reproducibility

Code Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Requires representative multi-turn case logs and good case metadata (subject/description).
  • Rubric weights and severity bands must be tuned for each organization and risk tolerance.
  • Raw enterprise logs are confidential; external replication needs synthetic or de-identified cases.
  • The LLM judge can still show judge sensitivity and disagree with humans on borderline cases; human spot checks remain necessary.

When Not To Use

  • Small single-turn QA benchmarks where multi-turn workflow is irrelevant.
  • Open-domain evaluation where external knowledge beyond retrieved context is required.
  • Settings without structured case metadata or representative retrieval evidence.

Failure Modes

  • Case misidentification (wrong issue despite relevant text)
  • Workflow misalignment (violating required sequencing)
  • Identifier corruption (altered error codes/commands)
  • Judge bias/verbosity causing unstable justifications

Core Entities

Models

  • Llama-3.3-70B-Instruct
  • gpt-oss-120b
  • GPT-4 (judge via Azure OpenAI)

Metrics

  • Hallucination
  • Retrieval Correctness
  • Context Sufficiency
  • Answer Helpfulness
  • Answer Type Fit
  • Identifier Integrity
  • Case Issue Identification
  • Resolution Alignment
  • Weighted Aggregate (S_final)
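
A minimal sketch of how a severity-aware weighted aggregate and a release gate might fit together. The weights, the severity caps, the 0.75 threshold, and the per-turn record layout are all hypothetical; the paper notes that rubric weights and severity bands must be tuned per organization, and its actual formula for S_final is not reproduced here.

```python
# Hypothetical metric weights (they sum to 1.0).
WEIGHTS = {
    "hallucination": 0.20,
    "retrieval_correctness": 0.10,
    "context_sufficiency": 0.10,
    "answer_helpfulness": 0.10,
    "answer_type_fit": 0.05,
    "identifier_integrity": 0.15,
    "case_issue_identification": 0.15,
    "resolution_alignment": 0.15,
}

# Hypothetical caps: a severe failure limits the credit a metric can earn.
SEVERITY_CAP = {"none": 1.0, "minor": 0.75, "major": 0.4, "critical": 0.0}


def s_final(turn: dict) -> float:
    """Weighted mean of per-metric scores after applying the severity cap."""
    total = 0.0
    for metric, weight in WEIGHTS.items():
        entry = turn[metric]
        capped = min(entry["score"], SEVERITY_CAP[entry["severity"]])
        total += weight * capped
    return total / sum(WEIGHTS.values())


def passes_gate(turns: list, threshold: float = 0.75) -> bool:
    """Release gate: mean S_final over judged turns must reach the threshold."""
    mean = sum(s_final(t) for t in turns) / len(turns)
    return mean >= threshold


# Made-up turn: everything is perfect except a critical identifier error
# that the judge still scored 0.8 for partial correctness.
turn = {m: {"score": 1.0, "severity": "none"} for m in WEIGHTS}
turn["identifier_integrity"] = {"score": 0.8, "severity": "critical"}
score = s_final(turn)  # about 0.85; without the cap it would be about 0.97
```

Capping by severity is one way to keep a partially correct but operationally dangerous answer from inflating the aggregate, which is the effect the severity-weighted scheme is described as targeting.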

Datasets

  • Anonymized enterprise support cases (short: 237 cases, long: 232 cases)