Overview
The framework is practical and reproducible on similar enterprise case sets. Experiments show a statistically significant gap between models on long queries (p=0.0011) and strong judge-human agreement, but rubric weights and dataset representativeness need local calibration.
Citations: 0
Evidence Strength: 0.80
Confidence: 0.85
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 3/5
Findings with evidence refs: 5/5
Results with explicit delta: 2/4
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 30%
Production readiness: 70%
Novelty: 45%
Why It Matters For Business
Enterprise RAG failures often stem from workflow or identifier mistakes that standard metrics miss; this framework gives targeted diagnostics for release gating, regression tests, and monitoring to reduce costly production incidents.
Summary TLDR
The paper introduces a practical, case-aware LLM-as-a-judge framework to evaluate multi-turn enterprise Retrieval-Augmented Generation (RAG) systems. It defines eight operational metrics (e.g., Hallucination, Retrieval Correctness, Identifier Integrity), a severity-aware scoring scheme, deterministic JSON outputs for auditability, and a single-LLM-per-turn batch pipeline. On an anonymized enterprise dataset, the framework revealed workflow and identifier failures that generic faithfulness/relevance metrics miss. In experiments, GPT-oss outperformed Llama on long diagnostic queries (weighted aggregate 0.8099 vs 0.7136, p=0.0011). The implementation is model-agnostic, designed for regression/g
Problem Statement
Standard RAG evaluation focuses on single-turn faithfulness or relevance and conflates retrieval, grounding, and resolution. Enterprise support workflows are multi-turn, require strict handling of structured identifiers (error codes, versions), and must follow prescribed troubleshooting order. Existing metrics miss operational failure modes like case misidentification, workflow misalignment, and partial resolution.
Main Contribution
Formalize enterprise multi-turn RAG evaluation needs and recurring operational failure modes.
Define eight case-aware metrics separating retrieval, grounding, answer utility, precision, and workflow alignment.
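A weighted aggregate over per-metric judge scores can be sketched as follows. This is a minimal illustration, not the paper's scoring scheme: only three of the eight metric names appear in this summary, and the weights in `ASSUMED_WEIGHTS` and the function name `weighted_aggregate` are hypothetical placeholders that would need local calibration.

```python
from typing import Dict

# Hypothetical weights for the three metrics named in the summary.
# The paper's actual rubric weights and severity bands are not given here.
ASSUMED_WEIGHTS: Dict[str, float] = {
    "hallucination": 0.4,
    "retrieval_correctness": 0.3,
    "identifier_integrity": 0.3,
}


def weighted_aggregate(scores: Dict[str, float],
                       weights: Dict[str, float] = ASSUMED_WEIGHTS) -> float:
    """Return the weight-normalized mean of per-metric scores in [0, 1]."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing metric scores: {sorted(missing)}")
    for name in weights:
        if not 0.0 <= scores[name] <= 1.0:
            raise ValueError(f"score for {name!r} must lie in [0, 1]")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    # Normalizing by the total keeps the result in [0, 1] even if the
    # weights do not sum to exactly 1.
    return sum(weights[m] * scores[m] for m in weights) / total_weight


if __name__ == "__main__":
    turn_scores = {
        "hallucination": 1.0,
        "retrieval_correctness": 0.5,
        "identifier_integrity": 0.0,
    }
    print(round(weighted_aggregate(turn_scores), 4))
```

Keeping the per-metric scores separate until the final step preserves the diagnostic value of each metric; the aggregate is only a convenience for comparing models or gating releases.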
Key Findings
Case-aware metrics reveal workflow and identifier failures that proxy metrics miss.
On long diagnostic queries, GPT-oss achieved a higher weighted aggregate than Llama (0.8099 vs 0.7136, p=0.0011).
Results
| Metric | Value | Baseline | Delta (GPT-oss minus Llama) | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Weighted Aggregate (Long queries) | GPT-oss 0.8099; Llama 0.7136 | — | +0.0963 | Long queries (n=63 conversations) | Table 2; Section 8.1 | Table 2 |
| Weighted Aggregate (Short queries) | GPT-oss 0.7353; Llama 0.7202 | — | +0.0151 | Short queries (n=70 conversations) | Table 2; Section 8.1 | Table 2 |
What To Try In 7 Days
Run the JSON judge on 100 representative multi-turn cases to get per-metric baselines.
Set a conservative S_final release threshold and run pre-deployment checks on candidate models.
Add identifier integrity checks to your evaluation pipeline to catch corrupted commands/IDs early.
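The identifier integrity check in the last step could start as a simple deterministic pre-filter that runs alongside the LLM judge. The sketch below flags identifiers that appear in an answer but nowhere in the retrieved context. The regular expressions in `ASSUMED_PATTERNS` (error codes like `ERR-1042`, dotted versions like `2.3.1`) and both function names are assumptions for illustration; real identifier formats are organization-specific.

```python
import re
from typing import List, Set

# Hypothetical identifier formats; adjust to your own error codes,
# versions, and command names.
ASSUMED_PATTERNS = [
    re.compile(r"\b[A-Z]{2,6}-\d{2,6}\b"),       # e.g. ERR-1042
    re.compile(r"\bv?\d+\.\d+(?:\.\d+)+\b"),     # e.g. 2.3.1 or v2.3.1
]


def extract_identifiers(text: str) -> Set[str]:
    """Collect every substring matching one of the assumed patterns."""
    found: Set[str] = set()
    for pattern in ASSUMED_PATTERNS:
        found.update(pattern.findall(text))
    return found


def unsupported_identifiers(answer: str, context_chunks: List[str]) -> List[str]:
    """Return identifiers in the answer that no retrieved chunk contains."""
    supported: Set[str] = set()
    for chunk in context_chunks:
        supported |= extract_identifiers(chunk)
    return sorted(extract_identifiers(answer) - supported)


if __name__ == "__main__":
    context = ["Error ERR-1042 occurs on version 2.3.1 after upgrade."]
    answer = "For ERR-1024 on version 2.3.1, restart the agent."
    # The transposed error code is not present in the retrieved context.
    print(unsupported_identifiers(answer, context))
```

An exact-match check like this catches corrupted or transposed identifiers cheaply, but it cannot tell whether a correctly copied identifier is the right one for the case, so it complements the judge rather than replacing it.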
Risks & Boundaries
Limitations
Requires representative multi-turn case logs and good case metadata (subject/description).
Rubric weights and severity bands must be tuned for each organization and risk tolerance.
When Not To Use
Small single-turn QA benchmarks where multi-turn workflow is irrelevant.
Open-domain evaluation where external knowledge beyond retrieved context is required.
Failure Modes
Case misidentification (wrong issue despite relevant text)
Workflow misalignment (violating required sequencing)
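Workflow misalignment can be illustrated with a small ordering check, assuming the troubleshooting steps taken in a conversation have already been extracted as labels. The step names, the `sequencing_violations` function, and the rule used here (a step is a violation if any earlier required step has not yet been observed) are hypothetical simplifications, not the paper's definition.

```python
from typing import List, Tuple


def sequencing_violations(required_order: List[str],
                          observed_steps: List[str]) -> List[Tuple[str, List[str]]]:
    """Return (step, skipped_prerequisites) for each out-of-order step.

    A step violates the workflow when it is performed before one or more
    steps that precede it in required_order. Steps not listed in
    required_order are ignored.
    """
    position = {step: i for i, step in enumerate(required_order)}
    seen = set()
    violations: List[Tuple[str, List[str]]] = []
    for step in observed_steps:
        if step not in position:
            continue
        skipped = [s for s in required_order[:position[step]] if s not in seen]
        if skipped:
            violations.append((step, skipped))
        seen.add(step)
    return violations


if __name__ == "__main__":
    required = ["collect_logs", "check_version", "restart_service"]
    observed = ["check_version", "collect_logs", "restart_service"]
    # check_version was performed before collect_logs.
    print(sequencing_violations(required, observed))
```

Reporting which prerequisites were skipped, rather than a single pass/fail flag, makes the result usable as evidence in a judge's JSON output or a regression report.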

