RAG + a 10M‑token Vedanta corpus cuts hallucinations for niche long‑form QA

August 21, 2024 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.5

Citation Count

0

Authors

Priyanka Mandikal

Links

Abstract / PDF

Why It Matters For Business

If you build knowledge products for niche domains, adding retrieval-augmented generation with a keyword-aware retriever can raise factuality and credibility without expensive model retraining.

Summary TLDR

The authors release VedantaNY-10M (10M tokens, 765 hours of Vedanta lectures) and build an in‑context retrieval-augmented chatbot. Compared to a non-RAG LLM, RAG responses are judged far more factual and complete (an 81% preference rate is reported). A keyword-based hybrid retriever (combining sparse keyword signals with dense embeddings) plus a keyword-driven context refiner further improves relevance and faithfulness over standard dense RAG. The work highlights practical failure modes (transcription noise, spoken‑language fragments, retrieval-induced hallucinations) and provides code and the dataset.

Problem Statement

Generic LLMs hallucinate and miss low-frequency niche terms. Specialized knowledge (here: Advaita Vedanta) needs a verified, updatable retrieval layer, along with retrieval strategies that surface rare domain terms and return context of an appropriate length.

Main Contribution

VedantaNY-10M: a new 10M‑token dataset from ~765 hours of public Vedanta lectures (mostly English, ~3% Sanskrit transliterated).

A deployment recipe: in‑context RAG pipeline using retrieved passages + an LLM prompt (no finetuning of retriever/generator).

A keyword-based hybrid retriever that upweights sparse keyword signals to recover low-frequency Sanskrit/domain terms.

A keyword-based context refiner that expands or trims each retrieved passage so that it spans from the first to the last keyword occurrence.

Extensive automatic and human evaluation showing RAG > non-RAG and keyword‑hybrid RAG > standard dense RAG.
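The hybrid retriever described above can be illustrated with a minimal sketch. This is not the paper's implementation: the scoring rule `lam * sparse + (1 - lam) * dense`, the keyword-overlap score, the toy two-dimensional vectors, and the names `keyword_score` and `hybrid_rank` are all assumptions made for illustration. In particular, this summary does not say which term λ weights, so the sketch assumes it weights the sparse keyword score.

```python
import math
import re


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def keyword_score(keywords, passage):
    """Fraction of query keywords found in the passage (case-insensitive)."""
    if not keywords:
        return 0.0
    tokens = set(re.findall(r"\w+", passage.lower()))
    return sum(1 for k in keywords if k.lower() in tokens) / len(keywords)


def hybrid_rank(query_vec, keywords, passages, lam=0.2):
    """Rank (text, vector) passages by an assumed hybrid score.

    score = lam * sparse_keyword_score + (1 - lam) * dense_cosine
    """
    scored = []
    for text, vec in passages:
        dense = cosine(query_vec, vec)
        sparse = keyword_score(keywords, text)
        scored.append((lam * sparse + (1 - lam) * dense, text))
    return sorted(scored, key=lambda s: s[0], reverse=True)


# Toy data: the first passage is closer in embedding space,
# but only the second contains the rare term "turiya".
passages = [
    ("The mind is restless and hard to control.", [0.9, 0.1]),
    ("Turiya is the fourth state, beyond waking, dream and sleep.", [0.7, 0.3]),
]
query_vec = [1.0, 0.0]

dense_only = hybrid_rank(query_vec, ["turiya"], passages, lam=0.0)
hybrid = hybrid_rank(query_vec, ["turiya"], passages, lam=0.2)
print(dense_only[0][1])
print(hybrid[0][1])
```

In this toy case the dense-only ranking puts the off-topic passage first, while the hybrid score promotes the passage that contains the rare term. A production system would likely replace the overlap score with a proper sparse scorer such as BM25 and use real embedding vectors.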

Key Findings

RAG responses were strongly preferred over non-RAG responses by experts.

Numbers: 81% preference rate (reported by domain experts)

Keyword-based hybrid retriever raised human-rated retrieval relevance from 0.59 to 0.82 (normalized).

Numbers: Human relevance 0.59 → 0.82 (standard RAG vs keyword-RAG)

Keyword-RAG improved faithfulness metrics (QAFactEval) on the evaluation set.

Numbers: QAFactEval 1.36 → 1.60 (standard RAG vs keyword-RAG, evaluated on 25 triplets)

Results

Human preference (RAG vs non-RAG)

Value: 81% preference for RAG

Baseline: non-RAG

Human retrieval relevance (normalized)

Value: standard RAG 0.59 / keyword-RAG 0.82

Baseline: standard RAG

QAFactEval (answer vs retrieved evidence)

Value: standard RAG 1.36 / keyword-RAG 1.60

Baseline: standard RAG

GPT2 perplexity (lower is better)

Value: standard RAG 16.6 / keyword-RAG 15.3 (overall)

Baseline: standard RAG
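To put the standard-RAG vs keyword-RAG numbers on a common scale, the short sketch below computes the relative change for each metric. The dictionary keys and the helper name `relative_change` are illustrative. The values are the ones reported above, and the QAFactEval comparison rests on only 25 triplets, so the percentages are rough indications rather than precise effect sizes.

```python
# (standard RAG, keyword-RAG) values as reported in the results above
results = {
    "human_relevance": (0.59, 0.82),
    "qafacteval": (1.36, 1.60),
    "gpt2_perplexity": (16.6, 15.3),  # lower is better
}


def relative_change(baseline, new):
    """Percentage change of `new` relative to `baseline`."""
    return (new - baseline) / baseline * 100.0


for name, (baseline, new) in results.items():
    print(f"{name}: {relative_change(baseline, new):+.1f}%")
# human_relevance: +39.0%
# qafacteval: +17.6%
# gpt2_perplexity: -7.8%
```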

Who Should Care

What To Try In 7 Days

Create a small in-domain corpus (transcripts or docs) and index with dense embeddings.

Add a simple keyword extractor and combine sparse keyword scores with dense similarity (the paper used λ ≈ 0.2).

Implement a keyword-driven context refiner (trim/expand passages between first and last keyword) and run a quick human check.
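The context-refiner step above could start from something like the following sketch. It assumes the transcript is available as a list of words and that each retrieved passage is a half-open word span. The function name `refine_context` and the `margin` and `pad` parameters are hypothetical, and the paper's actual heuristic may differ in its details.

```python
import re


def refine_context(words, start, end, keywords, margin=50, pad=10):
    """Return (new_start, new_end) word indices for the refined passage.

    words:    the full transcript as a list of words
    start/end: retrieved passage as a half-open word span [start, end)
    margin:   how far outside the passage to search for keywords (assumed)
    pad:      extra words kept on each side of the keyword span (assumed)
    """
    kw = {k.lower() for k in keywords}
    lo = max(0, start - margin)
    hi = min(len(words), end + margin)
    # Strip punctuation before matching so "Turiya," still counts as a hit.
    hits = [
        i for i in range(lo, hi)
        if re.sub(r"\W+", "", words[i]).lower() in kw
    ]
    if not hits:
        return start, end  # no keyword nearby: keep the passage unchanged
    return max(0, hits[0] - pad), min(len(words), hits[-1] + 1 + pad)


# Toy transcript: 100 filler words with the keyword at positions 30 and 70.
words = ["w"] * 100
words[30] = "Turiya,"
words[70] = "turiya"

# Retrieved span misses both occurrences, so the refiner expands it.
expanded = refine_context(words, 40, 60, ["turiya"], margin=50, pad=5)
# Retrieved span is much wider than needed, so the refiner trims it.
trimmed = refine_context(words, 0, 100, ["turiya"], margin=0, pad=5)
print(expanded, trimmed)  # (25, 76) (25, 76)
```

This version only matches single-word keywords. Multi-word terms would need phrase matching, and a quick human check of the refined passages remains useful because the heuristic can cut context that contains no keyword but is still needed.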

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Single teacher / single-source dataset limits generality across teachers and traditions.
  • Transcriptions contain errors, especially for Sanskrit terms and punctuation.
  • Dataset is spoken rather than written text; spoken fragments cause retrieval/context noise.
  • Context refiner is heuristic-based and may not generalize; a trained summarizer or refiner may be needed.
  • Evaluation set is small (25 triplets in some automatic comparisons) and lacks gold references.

When Not To Use

  • When original source text must be in native script (Sanskrit Devanagari) rather than transliteration.
  • When latency must be minimal — retrieval and longer context increase token costs and latency.
  • When domain sources are already well covered in model pretraining and parametric memory suffices.

Failure Modes

  • Irrelevant or incorrect retrievals that mislead the generator.
  • Retrieval-induced hallucinations where the model latches on to spurious phrases.
  • Errors from noisy automatic transcripts causing wrong facts or attributions.
  • Fixed-length passage retrievals that cut off needed context or include distracting text.

Core Entities

Models

  • GPT-4-turbo
  • Mixtral-8x7B-Instruct-v0.1
  • text-embedding-ada-002
  • nomic-embed-text-v1
  • Whisper large-v2
  • T5-XXL (RankGen encoder)
  • GPT-2 (used for perplexity)
  • BART-large
  • Electra-large

Metrics

  • QAFactEval
  • RankGen
  • Self-BLEU
  • GPT2-perplexity
  • Word/sentence counts
  • Human scores: relevance, correctness, completeness

Datasets

  • VedantaNY-10M (10M tokens, 765 hours of Vedanta Society NY YouTube transcripts)