Ragas: reference-free checks for RAG faithfulness, relevance, and context focus

September 26, 2023 · 6 min

Overview

Production Readiness

0.6

Novelty Score

0.5

Cost Impact Score

0.5

Citation Count

65

Authors

Shahul Es, Jithin James, Luis Espinosa-Anke, Steven Schockaert

Why It Matters For Business

Ragas provides fast, automated checks that catch ungrounded answers and noisy retrieval, which reduces time spent on manual labeling and lowers hallucination risk in RAG deployments.

Summary TLDR

Ragas is an open framework that evaluates Retrieval-Augmented Generation (RAG) systems without needing human reference answers. It uses a prompted LLM (gpt-3.5-turbo-16k) and embeddings (text-embedding-ada-002) to score three practical dimensions: faithfulness (is the answer grounded in the retrieved text), answer relevance (does the answer address the question), and context relevance (is the retrieved text focused on the question). On WikiEval, a dataset built from 50 Wikipedia pages, Ragas agrees with human judgments most closely on faithfulness (accuracy 0.95) and less closely on answer relevance (0.78) and context relevance (0.70). The code integrates with llama-index and LangChain and is available on GitHub.

Problem Statement

Evaluating RAG systems is hard when you lack ground-truth answers. Teams need quick, automated signals for whether retrieved context is useful, whether the LLM grounded its claims, and whether the generated answer actually addresses the question.

Main Contribution

Ragas: a practical, reference-free framework to score RAG outputs on faithfulness, answer relevance, and context relevance.

Concrete LLM-based scoring recipes: statement extraction + verification for faithfulness, question-generation + embedding similarity for answer relevance, and sentence extraction ratio for context relevance.

WikiEval: a new human-annotated dataset (50 Wikipedia pages) for benchmarking these dimensions and validating Ragas.

Open-source integrations with llama-index and LangChain and example prompts (GitHub).
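Once the LLM and embedding calls have returned, the three scoring recipes listed above reduce to simple arithmetic. The sketch below is a minimal illustration of that arithmetic, not the Ragas implementation. It assumes the judge's per-statement verdicts, the embedding vectors, and the sentence counts have already been produced elsewhere, and every function name in it is hypothetical.

```python
import math
from typing import Sequence


def faithfulness_score(verdicts: Sequence[bool]) -> float:
    """Share of statements extracted from the answer that the judge
    marked as supported by the retrieved context."""
    if not verdicts:
        return 0.0
    return sum(verdicts) / len(verdicts)


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def answer_relevance_score(
    question_vec: Sequence[float],
    generated_question_vecs: Sequence[Sequence[float]],
) -> float:
    """Mean similarity between the original question and the questions
    an LLM generated back from the answer."""
    if not generated_question_vecs:
        return 0.0
    sims = [cosine(question_vec, v) for v in generated_question_vecs]
    return sum(sims) / len(sims)


def context_relevance_score(num_extracted: int, num_context_sentences: int) -> float:
    """Ratio of context sentences the LLM extracted as needed for the
    question to all sentences in the retrieved context."""
    if num_context_sentences == 0:
        return 0.0
    return num_extracted / num_context_sentences


if __name__ == "__main__":
    # Toy inputs standing in for real LLM and embedding outputs.
    print(faithfulness_score([True, True, False, True]))
    print(answer_relevance_score([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]))
    print(context_relevance_score(2, 8))
```

Separating the arithmetic from the model calls makes it easy to unit-test the scores and to swap in a different judge or embedding model.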

Key Findings

Ragas matches human judgements on faithfulness with very high accuracy.

Numbers: Faithfulness accuracy 0.95 on WikiEval (Table 1)

Ragas outperforms generic LLM scoring baselines on relevance and faithfulness.

Numbers: Ragas vs GPTScore accuracies: faithfulness 0.95 vs 0.72; answer relevance 0.78 vs 0.52 (Table 1)

Context relevance is the hardest dimension to evaluate automatically.

Numbers: Context relevance accuracy 0.70 vs GPTScore 0.63 (Table 1)

Results

Faithfulness (Ragas)

Value: 0.95 accuracy

Baseline: GPTScore 0.72 / GPT Ranking 0.54

Answer Relevance (Ragas)

Value: 0.78 accuracy

Baseline: GPTScore 0.52 / GPT Ranking 0.40

Context Relevance (Ragas)

Value: 0.70 accuracy

Baseline: GPTScore 0.63 / GPT Ranking 0.52
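This summary does not say how the accuracy figures are computed. One plausible reading, assumed here, is pairwise agreement: for each pair of candidate answers (or contexts), check whether the metric prefers the same candidate as the human annotators. The Python sketch below shows that calculation on made-up numbers. The function name, the data layout, and the choice to count ties as disagreements are all assumptions for illustration.

```python
from typing import Sequence, Tuple


def pairwise_agreement(
    metric_scores: Sequence[Tuple[float, float]],
    human_choices: Sequence[str],
) -> float:
    """Fraction of pairs where the higher-scored candidate ('a' or 'b')
    matches the candidate the human annotators preferred.
    A tie in the metric scores is counted as a disagreement (assumption)."""
    if len(metric_scores) != len(human_choices):
        raise ValueError("metric_scores and human_choices must have equal length")
    if not metric_scores:
        raise ValueError("at least one pair is required")

    hits = 0
    for (score_a, score_b), human in zip(metric_scores, human_choices):
        if score_a == score_b:
            continue  # tie: the metric expresses no preference
        predicted = "a" if score_a > score_b else "b"
        if predicted == human:
            hits += 1
    return hits / len(metric_scores)


if __name__ == "__main__":
    # Invented scores, not data from WikiEval.
    scores = [(0.9, 0.4), (0.2, 0.7), (0.5, 0.5), (0.8, 0.1)]
    humans = ["a", "b", "a", "b"]
    print(pairwise_agreement(scores, humans))
```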

Who Should Care

What To Try In 7 Days

Run Ragas on a subset of your RAG outputs to surface ungrounded answers.

Integrate Ragas with your LangChain or llama-index pipeline for continuous diagnostics.

Use the faithfulness score to prioritize human review of risky answers.
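The last step, using faithfulness to prioritize human review, can be sketched as a small triage routine. This is a hedged illustration in Python that assumes faithfulness scores in the range 0 to 1 have already been computed for each answer. The `ScoredAnswer` type, the `review_queue` function, and the 0.8 threshold are hypothetical choices, not part of Ragas.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class ScoredAnswer:
    question: str
    answer: str
    faithfulness: float  # assumed to lie in [0, 1], higher means better grounded


def review_queue(
    items: Sequence[ScoredAnswer],
    threshold: float = 0.8,  # arbitrary cutoff; tune on your own data
    limit: Optional[int] = None,
) -> List[ScoredAnswer]:
    """Return answers scoring below the threshold, least faithful first."""
    risky = [item for item in items if item.faithfulness < threshold]
    risky.sort(key=lambda item: item.faithfulness)
    return risky if limit is None else risky[:limit]


if __name__ == "__main__":
    # Invented examples for illustration only.
    batch = [
        ScoredAnswer("q1", "well-grounded answer", 0.95),
        ScoredAnswer("q2", "mostly unsupported answer", 0.40),
        ScoredAnswer("q3", "partly supported answer", 0.70),
    ]
    for item in review_queue(batch):
        print(item.question, item.faithfulness)
```

Sorting the queue by ascending score sends reviewers to the least grounded answers first, and the `limit` argument caps the workload per review cycle.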

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Relies on the judging LLM (gpt-3.5-turbo-16k), so judgments inherit its biases and prompt sensitivity.
  • Context relevance is notably noisier than faithfulness and needs human checks for critical cases.
  • Evaluation was performed on WikiEval (50 pages); performance may vary on other domains or longer contexts.

When Not To Use

  • When you have high-quality human reference answers, in which case reference-based (supervised) metrics are the better choice.
  • For safety-critical claims without human verification, since automated scores can miss subtle errors.
  • If you cannot access a capable LLM or embedding API (the framework depends on these).

Failure Modes

  • Judge bias: the scoring LLM can misclassify supported claims if prompts or examples are poor.
  • Long or noisy contexts can reduce context-relevance accuracy and mislead faithfulness checks.
  • Subtle differences in answer relevance are harder to detect, leading to lower agreement with humans.

Core Entities

Models

  • gpt-3.5-turbo-16k
  • text-embedding-ada-002
  • ChatGPT

Metrics

  • Faithfulness
  • Answer Relevance
  • Context Relevance
  • Ragas

Datasets

  • WikiEval