Ragas: reference-free checks for RAG faithfulness, relevance, and context focus

Overview

Decision Snapshot: Ready For Pilot

Ragas gives reliable, reference-free signals (best for faithfulness); use it as an automated triage tool, not a final ground-truth replacement.

Citations: 65

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 3/3

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 50%

Authors

Shahul Es, Jithin James, Luis Espinosa-Anke, Steven Schockaert

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Provides fast, automated checks to catch ungrounded answers and noisy retrieval, reducing time spent on manual labeling and lowering hallucination risk in RAG deployments.

Who Should Care

Product Manager, ML Engineer, Data Scientist, Engineering Lead, Founder

Summary TLDR

Ragas is an open framework that evaluates Retrieval-Augmented Generation (RAG) systems without needing human reference answers. It uses a prompted LLM (gpt-3.5-turbo-16k) and embeddings (text-embedding-ada-002) to score three practical dimensions: faithfulness (is the answer grounded in the retrieved text?), answer relevance (does the answer address the question?), and context relevance (is the retrieved text focused on the question?). On the 50-page WikiEval dataset, Ragas agrees with human judgments in pairwise comparisons at the following rates: faithfulness 0.95, answer relevance 0.78, context relevance 0.70. The code integrates with llama-index and LangChain and is available on GitHub.

Problem Statement

Evaluating RAG systems is hard when you lack ground-truth answers. Teams need quick, automated signals for whether retrieved context is useful, whether the LLM grounded its claims, and whether the generated answer actually addresses the question.

Main Contribution

Ragas: a practical, reference-free framework to score RAG outputs on faithfulness, answer relevance, and context relevance.

Concrete LLM-based scoring recipes: statement extraction + verification for faithfulness, question-generation + embedding similarity for answer relevance, and sentence extraction ratio for context relevance.
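
The three recipes above each end in a simple ratio or average once the LLM and embedding calls have been made. The following is a minimal sketch of that final arithmetic only, assuming the statement verdicts, embedding vectors, and extracted sentences have already been produced upstream. The function names and the handling of empty inputs are illustrative assumptions, not the Ragas library's API.

```python
import math
from typing import Sequence


def faithfulness_score(verdicts: Sequence[bool]) -> float:
    """Share of extracted statements that the judge marked as supported."""
    if not verdicts:
        return 0.0  # assumption: nothing to verify counts as zero
    return sum(verdicts) / len(verdicts)


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def answer_relevance_score(
    question_vec: Sequence[float],
    generated_question_vecs: Sequence[Sequence[float]],
) -> float:
    """Mean similarity between the real question and the questions
    an LLM generated back from the answer."""
    if not generated_question_vecs:
        return 0.0  # assumption: no generated questions counts as zero
    sims = [cosine(question_vec, g) for g in generated_question_vecs]
    return sum(sims) / len(sims)


def context_relevance_score(
    extracted_sentences: Sequence[str],
    context_sentences: Sequence[str],
) -> float:
    """Share of context sentences the judge extracted as needed."""
    if not context_sentences:
        return 0.0  # assumption: empty context counts as zero
    return len(extracted_sentences) / len(context_sentences)


# Toy inputs, invented for illustration only.
faith = faithfulness_score([True, True, False, True])
ans_rel = answer_relevance_score([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
ctx_rel = context_relevance_score(["s1", "s2"], ["s1", "s2", "s3", "s4", "s5"])
```

Keeping the LLM calls outside these functions makes the scoring step deterministic and easy to unit-test, while the prompt-dependent parts can be swapped or audited separately.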

Key Findings

Ragas matches human judgements on faithfulness with very high accuracy.

Numbers: Faithfulness accuracy 0.95 on WikiEval (Table 1)

Practical Use: Use Ragas faithfulness checks to quickly find answers that are unsupported by retrieved text before deployment.

Evidence Ref: Table 1

Ragas outperforms generic LLM scoring baselines on relevance and faithfulness.

Numbers: Ragas vs GPTScore accuracies: Faithfulness 0.95 vs 0.72; Answer Relevance 0.78 vs 0.52 (Table 1)

Practical Use: Prefer structured Ragas prompts over asking ChatGPT to score directly when you need automated quality signals.

Evidence Ref: Table 1

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Faithfulness (Ragas)	0.95 accuracy	GPTScore 0.72 / GPT Ranking 0.54	+0.23 vs GPTScore	WikiEval (pairwise comparisons)	Ragas faithfulness agrees with human annotators 95% of the time	Table 1
Answer Relevance (Ragas)	0.78 accuracy	GPTScore 0.52 / GPT Ranking 0.40	+0.26 vs GPTScore	WikiEval (pairwise comparisons)	Ragas answer relevance agrees with humans 78% of the time	Table 1
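
The accuracies in the table are agreement rates on pairwise comparisons. One plausible way to compute such a rate, sketched here under assumptions, is to count how often the metric scores the human-preferred answer higher than the other one. The tie handling (ties count as disagreement) and the toy scores are assumptions for illustration; the last lines simply recompute the Delta column from the reported values.

```python
from typing import Sequence, Tuple


def pairwise_agreement(pairs: Sequence[Tuple[float, float]]) -> float:
    """Fraction of pairs where the metric ranks the human-preferred
    answer strictly higher.

    Each pair is (score_of_human_preferred, score_of_other).
    Assumption: ties count as disagreement.
    """
    if not pairs:
        raise ValueError("need at least one pair")
    hits = sum(1 for preferred, other in pairs if preferred > other)
    return hits / len(pairs)


# Toy metric scores, invented for illustration only.
toy_pairs = [(0.9, 0.4), (0.8, 0.8), (0.7, 0.2), (0.3, 0.6)]
agreement = pairwise_agreement(toy_pairs)

# Recompute the Delta column from the reported (Ragas, GPTScore) accuracies.
reported = {
    "faithfulness": (0.95, 0.72),
    "answer_relevance": (0.78, 0.52),
}
deltas = {name: round(ragas - base, 2) for name, (ragas, base) in reported.items()}
```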

What To Try In 7 Days

Run Ragas on a subset of your RAG outputs to surface ungrounded answers.

Integrate Ragas with your LangChain or llama-index pipeline for continuous diagnostics.

Use the faithfulness score to prioritize human review of risky answers.
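
The triage step in the list above can be sketched as a simple filter and sort: flag answers whose faithfulness score falls below a cutoff and review the lowest-scoring ones first. This is a minimal sketch, assuming scores in [0, 1] have already been produced by whatever scorer you run. The `ScoredAnswer` type, the `review_queue` helper, and the 0.8 threshold are hypothetical and should be tuned on your own data.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ScoredAnswer:
    question: str
    answer: str
    faithfulness: float  # assumed to lie in [0, 1]


def review_queue(
    items: List[ScoredAnswer],
    threshold: float = 0.8,  # hypothetical cutoff, not from the paper
    limit: Optional[int] = None,
) -> List[ScoredAnswer]:
    """Return answers below the threshold, least faithful first."""
    flagged = [item for item in items if item.faithfulness < threshold]
    flagged.sort(key=lambda item: item.faithfulness)
    return flagged if limit is None else flagged[:limit]


# Toy records, invented for illustration only.
items = [
    ScoredAnswer("q1", "a1", 0.95),
    ScoredAnswer("q2", "a2", 0.40),
    ScoredAnswer("q3", "a3", 0.75),
]
queue = review_queue(items)
```

A `limit` lets a team cap the daily review load while still seeing the riskiest answers first.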

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/explodinggradients/ragas

Data URLs

https://huggingface.co/datasets/explodinggradients/WikiEval

Risks & Boundaries

Limitations

Relies on the judging LLM (gpt-3.5-turbo-16k), so its scores inherit that model's biases and prompt sensitivity.

Context relevance is notably noisier than faithfulness (0.70 vs 0.95 agreement with humans on WikiEval) and needs human checks for critical cases.

When Not To Use

When you have high-quality human reference answers—use supervised metrics instead.

For safety-critical claims without human verification, since automated scores can miss subtle errors.

Failure Modes

Judge bias: the scoring LLM can misclassify supported claims if prompts or examples are poor.

Long or noisy contexts can reduce context-relevance accuracy and mislead faithfulness checks.

Core Entities

Models

gpt-3.5-turbo-16k, text-embedding-ada-002, ChatGPT

Metrics

Faithfulness, Answer Relevance, Context Relevance, Ragas

Datasets

WikiEval

