RAGBench: 100k explainable RAG examples plus TRACe — practical metrics to audit retriever+generator systems

Overview

Decision Snapshot: Ready For Pilot

The dataset and metrics are practical and well-validated against humans, but labels rely on a GPT-4 annotator and domain coverage is limited to the included verticals.

Citations: 8

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 1/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 70%

Novelty: 55%

Authors

Robert Friel, Masha Belyi, Atindriyo Sanyal

Links

Abstract / PDF / Code / Data

Why It Matters For Business

RAGBench + TRACe gives a unified, explainable way to audit retriever and generator components, reducing costly trial-and-error and surfacing whether errors come from the retriever, the generator, or both.

Who Should Care

CTO, Product Manager, ML Engineer, Data Scientist

Summary TLDR

RAGBench is a 100k-example, multi-domain dataset and evaluation suite for Retrieval-Augmented Generation (RAG). The authors introduce TRACe — four actionable metrics (Utilization, Relevance, Adherence, Completeness) — and release the labeled data and code. Labels are created with a GPT-4 annotator and validated against human judgments. A 400M-parameter DeBERTa model fine-tuned on RAGBench outperforms few-shot LLM judges on several RAG evaluation tasks on the provided test splits. The benchmark targets industry-style docs (manuals, contracts, papers) and aims to make RAG evaluation more granular and reproducible.

Problem Statement

There is no unified, large-scale, cross-domain benchmark or set of explainable metrics for evaluating RAG systems. Existing datasets are small, label sets are inconsistent, and many evaluation pipelines use LLMs to label data, which hinders reproducibility and practical system tuning.

Main Contribution

A large standardized RAG dataset (RAGBench) of ~100k examples from 12 component datasets across five industry-relevant domains.

TRACe — a concise, explainable RAG evaluation framework: Utilization, Relevance, Adherence, Completeness.

Key Findings

RAGBench totals approximately 100k labeled RAG examples.

Numbers: ≈100k total; Train 78k / Val 12k / Test 11k

Practical Use: You can train and evaluate RAG evaluators on a single, large cross-domain dataset instead of many small task-specific sets.

Evidence Ref: Table 1, §3.1

TRACe formalizes four RAG metrics that separate retriever vs generator behavior.

Numbers: four metrics (Utilization, Relevance, Adherence, Completeness)

Practical Use: Use TRACe to get actionable, component-level signals (e.g., low Relevance → retriever tuning; low Utilization → generator/prompt work).

Evidence Ref: §3.2
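
This summary does not give the metric formulas, so the following is only a minimal sketch under assumed definitions: Relevance as the share of retrieved context that is relevant to the query, Utilization as the share the generator actually drew on, Completeness as the share of relevant context that was used, and Adherence as whether every response sentence is supported by the context. The function name, the sentence-key inputs, and the character-length weighting are illustrative assumptions, not the authors' code; check §3.2 of the paper before relying on them.

```python
from dataclasses import dataclass


@dataclass
class TraceScores:
    relevance: float
    utilization: float
    completeness: float
    adherence: bool


def trace_scores(context, relevant, utilized, supported_flags):
    """Compute TRACe-style scores for one example (assumed definitions).

    context: dict mapping a sentence key to its text (all retrieved sentences)
    relevant: set of keys labeled relevant to the query
    utilized: set of keys the response actually drew on
    supported_flags: one bool per response sentence (True = grounded in context)
    """
    def length(keys):
        # Weight each sentence by its character length (an assumption).
        return sum(len(context[k]) for k in keys)

    total = length(context)
    rel_len = length(relevant)
    relevance = rel_len / total if total else 0.0
    utilization = length(utilized) / total if total else 0.0
    # Completeness: how much of the relevant context the response used.
    completeness = length(relevant & utilized) / rel_len if rel_len else 0.0
    # Adherence: every response sentence must be supported.
    adherence = all(supported_flags)
    return TraceScores(relevance, utilization, completeness, adherence)


example = trace_scores(
    context={"0a": "aaaa", "0b": "bbbb", "1a": "cccc", "1b": "dddd"},
    relevant={"0a", "0b"},
    utilized={"0a"},
    supported_flags=[True, True],
)
print(example)
```

In this toy example half of the context is relevant and half of that relevant text is used, so Relevance and Completeness are both 0.5 while Utilization is 0.25.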

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Dataset size	≈100k examples (Train 78k / Val 12k / Test 11k)	—	—	RAGBench	Table 1, §3.1	Table 1
GPT-4 annotator alignment (adherence)	Example-level Acc 0.93; Span-level Acc 0.95	human labels	—	DelucionQA test	Table 2, §3.4	Table 2
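
The alignment figures in the table compare annotator labels with human labels at two granularities. As a rough sketch of how such agreement could be computed (the function names and the choice to pool all spans across examples are assumptions; the paper's exact procedure may differ):

```python
def example_level_accuracy(auto_labels, human_labels):
    """Share of items where the auto label equals the human label."""
    if len(auto_labels) != len(human_labels):
        raise ValueError("label lists must have the same length")
    if not auto_labels:
        raise ValueError("need at least one label")
    matches = sum(a == h for a, h in zip(auto_labels, human_labels))
    return matches / len(auto_labels)


def span_level_accuracy(auto_spans, human_spans):
    """Pool every span across examples, then score agreement (assumed pooling)."""
    flat_auto = [label for spans in auto_spans for label in spans]
    flat_human = [label for spans in human_spans for label in spans]
    return example_level_accuracy(flat_auto, flat_human)


# Per-sentence adherence labels for three hypothetical responses.
auto_spans = [[True, True], [True, False], [False]]
human_spans = [[True, True], [True, True], [False]]

# An example counts as adherent only if all of its sentences are supported.
auto_examples = [all(spans) for spans in auto_spans]
human_examples = [all(spans) for spans in human_spans]

example_acc = example_level_accuracy(auto_examples, human_examples)
span_acc = span_level_accuracy(auto_spans, human_spans)
print(f"example-level: {example_acc:.2f}, span-level: {span_acc:.2f}")
```

The same helpers could be used to check an auto-labeling pipeline against a small human-annotated subset before scaling it.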

What To Try In 7 Days

Run TRACe metrics on a small sample of your RAG queries to separate retriever vs generator issues.

Use the RAGBench Hugging Face dataset to fine-tune a small NLI-style judge (e.g., DeBERTa) and compare to your LLM prompts.

Validate a GPT-4 labeling pipeline on a 200–500 human-annotated subset before scaling auto-labeling for your domain.
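
For the first step, a simple triage pass over scored queries might look like the sketch below. The `diagnose` helper, the 0.3 threshold, and the rule that a non-adherent response points at the generator are illustrative assumptions rather than anything prescribed by the paper; tune them on your own data.

```python
from collections import Counter


def diagnose(relevance, utilization, adherence, low=0.3):
    """Map one example's TRACe scores to a likely culprit (heuristic)."""
    issues = []
    if relevance < low:
        # Little of the retrieved context is relevant: look at the retriever.
        issues.append("retriever")
    if utilization < low or not adherence:
        # Context is ignored or the answer is unsupported: look at the generator.
        issues.append("generator")
    return "+".join(issues) if issues else "ok"


# Hypothetical scores for four queries.
sample = [
    {"relevance": 0.10, "utilization": 0.05, "adherence": True},
    {"relevance": 0.60, "utilization": 0.10, "adherence": True},
    {"relevance": 0.70, "utilization": 0.50, "adherence": False},
    {"relevance": 0.80, "utilization": 0.60, "adherence": True},
]

summary = Counter(diagnose(**row) for row in sample)
print(summary)
```

Counting diagnoses across even a small sample gives a first indication of whether tuning effort belongs on retrieval, on prompting and generation, or on both.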

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/rungalileo/ragbench/tree/main/ragbench

Data URLs

https://huggingface.co/datasets/rungalileo/ragbench

Risks & Boundaries

Limitations

Labels are generated primarily by a GPT-4 annotator; despite high alignment with human judgments, automatic labels can mishandle partially supported sentences.

Domain coverage is limited to five verticals; other domains or languages may behave differently.

When Not To Use

As undisputed ground truth without human validation — auto-labeler mistakes remain.

For non-English RAG systems — benchmark focuses on English sources.

Failure Modes

Partially supported sentences are labeled inconsistently, causing adherence misclassification.

Relevance labels mislead when retrieved docs are semantically related but lack required facts.

Core Entities

Models

gpt-3.5-turbo, gpt-4, gpt-4o, Claude 3 Haiku, DeBERTa-v3-Large (400M), TF-IDF retriever

Metrics

Relevance, Utilization, Adherence, Completeness, AUROC, RMSE

Datasets

RAGBench, PubMedQA, CovidQA-RAG, HotpotQA, MS Marco, HAGRID, ExpertQA, CUAD, DelucionQA, EManual, TechQA, FinQA, TAT-QA

Benchmarks

DelucionQA, RAGTruth, RGB, AttributionBench

