Overview
Production Readiness
0.7
Novelty Score
0.45
Cost Impact Score
0.3
Citation Count
0
Why It Matters For Business
Enterprise RAG failures often stem from workflow or identifier mistakes that standard metrics miss; this framework gives targeted diagnostics for release gating, regression tests, and monitoring to reduce costly production incidents.
Summary TLDR
The paper introduces a practical, case-aware LLM-as-a-judge framework to evaluate multi-turn enterprise Retrieval-Augmented Generation (RAG) systems. It defines eight operational metrics (e.g., Hallucination, Retrieval Correctness, Identifier Integrity), a severity-aware scoring scheme, deterministic JSON outputs for auditability, and a single-LLM-per-turn batch pipeline. On an anonymized enterprise dataset, the framework revealed workflow and identifier failures that generic faithfulness/relevance metrics miss. In experiments, GPT-oss outperformed Llama on long diagnostic queries (weighted aggregate 0.8099 vs 0.7136, p=0.0011). The implementation is model-agnostic, designed for regression/g
Problem Statement
Standard RAG evaluation focuses on single-turn faithfulness or relevance and conflates retrieval, grounding, and resolution. Enterprise support workflows are multi-turn, require strict handling of structured identifiers (error codes, versions), and must follow prescribed troubleshooting order. Existing metrics miss operational failure modes like case misidentification, workflow misalignment, and partial resolution.
Main Contribution
- Formalize enterprise multi-turn RAG evaluation needs and recurring operational failure modes.
- Define eight case-aware metrics separating retrieval, grounding, answer utility, precision, and workflow alignment.
- Introduce a severity-aware scoring protocol and weighted aggregation for monitoring and release gating.
- Provide a deterministic, JSON-schema-based LLM-as-a-judge pipeline for scalable batch evaluation and auditability.
- Empirically show the framework surfaces enterprise-critical tradeoffs missed by generic proxy metrics.
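As a rough illustration of what schema-checked judge output could look like, the sketch below parses one judge response and rejects anything off-schema. The snake_case key names, the 0 to 1 score range, the `justification` field, and the `parse_verdict` helper are all assumptions made for illustration; the paper's actual JSON schema is not reproduced in this summary.

```python
import json

# The eight metrics named in this summary. The snake_case keys are an assumption.
METRICS = [
    "hallucination", "retrieval_correctness", "context_sufficiency",
    "answer_helpfulness", "answer_type_fit", "identifier_integrity",
    "case_issue_identification", "resolution_alignment",
]


def parse_verdict(raw: str) -> dict:
    """Parse one judge response; raise ValueError if it is off-schema."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("verdict must be a JSON object")

    missing = [m for m in METRICS if m not in data]
    extra = [k for k in data if k not in METRICS]
    if missing or extra:
        raise ValueError(f"missing={missing} unexpected={extra}")

    verdict = {}
    for name in METRICS:
        entry = data[name]
        if not isinstance(entry, dict) or "score" not in entry:
            raise ValueError(f"{name}: expected an object with a 'score' field")
        score = entry["score"]
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(score, bool) or not isinstance(score, (int, float)):
            raise ValueError(f"{name}: score must be a number")
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"{name}: score {score} is outside [0, 1]")
        verdict[name] = {
            "score": float(score),
            "justification": str(entry.get("justification", "")),
        }
    return verdict
```

Rejecting malformed responses up front, rather than silently defaulting a missing metric, keeps every stored verdict comparable across runs, which is what makes the outputs usable for audits.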
Key Findings
- Case-aware metrics reveal workflow and identifier failures that proxy metrics miss.
- On long diagnostic queries, GPT-oss achieved a higher weighted aggregate than Llama (0.8099 vs 0.7136, p=0.0011).
- The LLM judge aligns well with human experts on high-risk dimensions.
- Severity-weighted aggregation reduces score inflation from partial correctness.
- Per-turn judge cost is small but grows linearly with the number of turns.
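To see why weighting can curb score inflation, the sketch below compares a plain mean with a weighted mean over the eight metrics. The weights are hypothetical and chosen only to make high-risk dimensions count more; the paper's actual severity bands and its formula for S_final are not given in this summary.

```python
# Hypothetical weights: high-risk dimensions count more. Not the paper's values.
WEIGHTS = {
    "hallucination": 3.0,
    "identifier_integrity": 3.0,
    "resolution_alignment": 2.0,
    "case_issue_identification": 2.0,
    "retrieval_correctness": 1.0,
    "context_sufficiency": 1.0,
    "answer_helpfulness": 1.0,
    "answer_type_fit": 1.0,
}


def plain_mean(scores: dict) -> float:
    """Unweighted average over all metrics."""
    return sum(scores.values()) / len(scores)


def weighted_aggregate(scores: dict, weights: dict = WEIGHTS) -> float:
    """Weighted average; every weighted metric must be present in scores."""
    total_weight = sum(weights.values())
    return sum(weights[m] * scores[m] for m in weights) / total_weight


# A mostly correct answer that corrupts an identifier and partly hallucinates.
scores = {name: 1.0 for name in WEIGHTS}
scores["identifier_integrity"] = 0.0
scores["hallucination"] = 0.5

unweighted = plain_mean(scores)
weighted = weighted_aggregate(scores)
```

Under these assumed weights the weighted score comes out lower than the plain mean, because the two failing metrics carry more of the total weight. The same mechanism is what keeps a partially correct answer from looking nearly perfect.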
Results
Weighted Aggregate (Long queries)
Weighted Aggregate (Short queries)
Human-judge agreement on critical dimensions
Per-turn evaluation cost (example)
Who Should Care
What To Try In 7 Days
- Run the JSON judge on 100 representative multi-turn cases to get per-metric baselines.
- Set a conservative S_final release threshold and run pre-deployment checks on candidate models.
- Add identifier integrity checks to your evaluation pipeline to catch corrupted commands/IDs early.
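The last two steps can be prototyped without an LLM. The sketch below flags identifiers that appear in an answer but not in the retrieved context, and applies a simple mean-based release gate. The regular expression, the example strings, the threshold, and both helper names are assumptions for illustration; real identifier formats and the paper's own gating rule will differ.

```python
import re

# Assumed identifier shapes: codes like "ERR-4012" and versions like "2.3.1".
ID_PATTERN = re.compile(r"\b(?:[A-Z]{2,}-\d+|v?\d+\.\d+\.\d+)\b")


def unsupported_identifiers(answer: str, context: str) -> set:
    """Return identifiers used in the answer that never occur in the context."""
    in_answer = set(ID_PATTERN.findall(answer))
    in_context = set(ID_PATTERN.findall(context))
    return in_answer - in_context


def release_gate(s_final_scores: list, threshold: float) -> bool:
    """Pass only if the mean S_final over the evaluated cases meets the threshold."""
    if not s_final_scores:
        raise ValueError("no scores to gate on")
    return sum(s_final_scores) / len(s_final_scores) >= threshold


context = "Error ERR-4012 occurs on version 2.3.1; upgrade to 2.4.0."
answer = "To fix ERR-4021, upgrade to 2.4.0."

# The transposed error code is reported; the version string is supported.
corrupted = unsupported_identifiers(answer, context)
```

An exact-match check like this is deliberately strict. It catches transposed digits cheaply, and it can run alongside the judge's own Identifier Integrity score rather than replacing it.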
Reproducibility
Code Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Requires representative multi-turn case logs and good case metadata (subject/description).
- Rubric weights and severity bands must be tuned for each organization and risk tolerance.
- Raw enterprise logs are confidential; external replication needs synthetic or de-identified cases.
- The LLM judge can still show judge sensitivity and disagree with humans on borderline cases; human spot checks remain necessary.
When Not To Use
- Small single-turn QA benchmarks where multi-turn workflow is irrelevant.
- Open-domain evaluation where external knowledge beyond retrieved context is required.
- Settings without structured case metadata or representative retrieval evidence.
Failure Modes
- Case misidentification (wrong issue despite relevant text)
- Workflow misalignment (violating required sequencing)
- Identifier corruption (altered error codes/commands)
- Judge bias/verbosity causing unstable justifications
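Workflow misalignment in particular lends itself to a deterministic cross-check. The sketch below reports required steps that appear out of order in an answer. The step names and the `order_violations` helper are hypothetical, and this is a complementary check under those assumptions, not the paper's Resolution Alignment metric.

```python
def order_violations(required: list, observed: list) -> list:
    """Return adjacent pairs of required steps that the observed sequence inverts.

    Steps missing from `observed` are skipped here; completeness is a
    separate concern from ordering.
    """
    first_position = {}
    for index, step in enumerate(observed):
        first_position.setdefault(step, index)  # keep the first occurrence only

    present = [step for step in required if step in first_position]
    violations = []
    for earlier, later in zip(present, present[1:]):
        if first_position[earlier] > first_position[later]:
            violations.append((earlier, later))
    return violations


# Hypothetical troubleshooting order for one case type.
required = ["collect_logs", "restart_service", "escalate"]
observed = ["restart_service", "collect_logs", "escalate"]

violations = order_violations(required, observed)
```

Here the answer restarts the service before collecting logs, so that pair is reported. Mapping free-text answers onto step names is the hard part and is left out of this sketch.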
Core Entities
Models
- Llama-3.3-70B-Instruct
- gpt-oss-120b
- GPT-4 (judge via Azure OpenAI)
Metrics
- Hallucination
- Retrieval Correctness
- Context Sufficiency
- Answer Helpfulness
- Answer Type Fit
- Identifier Integrity
- Case Issue Identification
- Resolution Alignment
- Weighted Aggregate (S_final)
Datasets
- Anonymized enterprise support cases (short: 237 cases, long: 232 cases)

