RAG + a 10M‑token Vedanta corpus cuts hallucinations for niche long‑form QA

August 21, 2024 · 7 min

Overview

Decision Snapshot: Needs Validation

The paper provides a clear, reproducible pipeline and reports both human and automatic metrics on a 10M-token corpus. The improvements are empirically supported but limited to one domain and a small evaluation set.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.90

Risk Signals: 12

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 60%

Authors

Priyanka Mandikal

Links

Abstract / PDF / Code / Data

Why It Matters For Business

If you build knowledge products for niche domains, add a retrieval layer with keyword-aware hybrid retrieval to raise factuality and credibility without expensive model retraining.

Summary TLDR

The authors release VedantaNY-10M (10M tokens, 765 hours of Vedanta lectures) and build an in‑context retrieval-augmented chatbot. Compared to a non-RAG LLM, RAG responses are judged far more factual and complete (an 81% preference reported). A keyword-based hybrid retriever (combining sparse keyword signals with dense embeddings) plus a keyword-driven context refiner further improves relevance and faithfulness vs standard dense-RAG. The work highlights practical failure modes (transcription noise, spoken‑language fragments, retrieval-induced hallucinations) and provides code and dataset.

Problem Statement

Generic LLMs hallucinate and miss low-frequency niche terms. Specialized knowledge (here: Advaita Vedanta) needs a verified, updatable retrieval layer, plus retrieval strategies that surface rare domain terms and select an appropriate context length.

Main Contribution

VedantaNY-10M: a new 10M‑token dataset from ~765 hours of public Vedanta lectures (mostly English, ~3% Sanskrit transliterated).

A deployment recipe: in‑context RAG pipeline using retrieved passages + an LLM prompt (no finetuning of retriever/generator).
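The in-context recipe above can be sketched as plain prompt assembly: retrieved passages are numbered and placed ahead of the question, and the generator is used as-is. This is a minimal sketch, not the paper's code; the function name `build_rag_prompt`, the `max_passages` parameter, and the instruction wording are all assumptions made for illustration.

```python
from typing import List


def build_rag_prompt(question: str, passages: List[str], max_passages: int = 3) -> str:
    """Assemble an in-context RAG prompt (hypothetical template, not the paper's)."""
    # Keep only the top-ranked passages; the caller is assumed to pass them best-first.
    selected = passages[:max_passages]
    # Number each passage so the answer can refer back to its source.
    context = "\n\n".join(
        f"[{i}] {p.strip()}" for i, p in enumerate(selected, start=1)
    )
    return (
        "Answer the question using only the passages below. "
        "If the passages do not contain the answer, say so.\n\n"
        f"Passages:\n{context}\n\n"
        f"Question: {question.strip()}\nAnswer:"
    )


if __name__ == "__main__":
    demo = build_rag_prompt(
        "What is meant by Atman?",
        ["Atman is discussed as the self.", "A second lecture excerpt."],
    )
    print(demo)
```

The resulting string would be sent to whichever LLM is in use; no retriever or generator weights are updated at any point.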

Key Findings

RAG responses were strongly preferred over non-RAG responses by experts.

Numbers: 81% preference rate (reported by domain experts)

Practical Use: Use a retrieval layer for niche long-form QA to reduce hallucinations and increase expert trust.

Evidence Ref: Intro; Sec. 5.3 human evaluation

Keyword-based hybrid retriever raised human-rated retrieval relevance from 0.59 to 0.82 (normalized).

Numbers: Human relevance 0.59 → 0.82 (standard RAG vs keyword-RAG)

Practical Use: Combine sparse keyword signals with dense embeddings to find low-frequency domain terms and boost retrieval quality.

Evidence Ref: Table 1 (Human evaluation)
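One way to picture the hybrid retriever is a weighted sum of a sparse keyword score and a dense similarity. The sketch below assumes the form `score = λ · sparse + (1 − λ) · dense` with λ = 0.2; the summary reports λ ≈ 0.2 but does not say which term it weights, so that direction is an assumption. The keyword-overlap score stands in for whatever sparse signal the paper uses, and all function names are hypothetical.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two dense vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def keyword_overlap(keywords: List[str], passage: str) -> float:
    """Fraction of query keywords that appear as whole tokens in the passage."""
    if not keywords:
        return 0.0
    tokens = set(passage.lower().split())
    hits = sum(1 for k in keywords if k.lower() in tokens)
    return hits / len(keywords)


def hybrid_score(dense_sim: float, sparse_sim: float, lam: float = 0.2) -> float:
    """Assumed combination: lam weights the sparse term, (1 - lam) the dense term."""
    return lam * sparse_sim + (1.0 - lam) * dense_sim


def rank(query_vec, keywords, passages, passage_vecs, lam: float = 0.2) -> List[int]:
    """Return passage indices sorted from highest to lowest hybrid score."""
    scores = [
        hybrid_score(cosine(query_vec, v), keyword_overlap(keywords, p), lam)
        for p, v in zip(passages, passage_vecs)
    ]
    return sorted(range(len(passages)), key=lambda i: scores[i], reverse=True)
```

With identical dense similarity, a passage containing the rare query term outranks one that does not, which is the behaviour the finding describes for low-frequency domain terms.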

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| Human preference (RAG vs non-RAG) | 81% preference for RAG | non-RAG | n/a | human evaluators across five question categories | Intro; Sec. 5.3 | Intro; Sec. 5.3 |
| Human retrieval relevance (normalized) | standard RAG 0.59 / keyword-RAG 0.82 | standard RAG | +0.23 | 25 triplets, human evaluation (Table 1) | Table 1 human evaluation | Table 1 |

What To Try In 7 Days

Create a small in-domain corpus (transcripts or docs) and index with dense embeddings.

Add a simple keyword extractor and combine sparse keyword scores with dense similarity (λ ≈ 0.2, as used in the paper).

Implement a keyword-driven context refiner (trim/expand passages between first and last keyword) and run a quick human check.
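The context refiner in the last step can be prototyped as a trim around keyword hits. The sketch below keeps the words between the first and last keyword occurrence plus a small margin; the `pad` margin, the exact-token matching, and the fallback of returning the passage unchanged when no keyword is found are assumptions for illustration rather than the paper's exact procedure.

```python
import re
from typing import List


def refine_context(passage: str, keywords: List[str], pad: int = 5) -> str:
    """Trim a passage to the span between the first and last keyword hit.

    `pad` extra words are kept on each side of the span (hypothetical parameter).
    """
    words = passage.split()
    wanted = {k.lower() for k in keywords}
    # Strip punctuation before matching so "Atman," still counts as a hit.
    hits = [
        i for i, w in enumerate(words)
        if re.sub(r"\W+", "", w).lower() in wanted
    ]
    if not hits:
        # Assumed fallback: leave the passage untouched when no keyword is present.
        return passage
    start = max(0, hits[0] - pad)
    end = min(len(words), hits[-1] + pad + 1)
    return " ".join(words[start:end])


if __name__ == "__main__":
    text = "so today we will talk about how atman and brahman are one okay thank you all"
    print(refine_context(text, ["atman", "brahman"], pad=1))
```

For spoken-language transcripts this kind of trim drops filler at the start and end of a retrieved chunk; a quick human check on a handful of questions is still needed to confirm it does not cut away required context.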

Risks & Boundaries

Limitations

Single teacher / single-source dataset limits generality across teachers and traditions.

Transcriptions contain errors, especially for Sanskrit terms and punctuation.

When Not To Use

When original source text must be in native script (Sanskrit Devanagari) rather than transliteration.

When latency must be minimal — retrieval and longer context increase token costs and latency.

Failure Modes

Irrelevant or incorrect retrievals that mislead the generator.

Retrieval-induced hallucinations where the model latches on to spurious phrases.

Core Entities

Models

GPT-4-turbo, Mixtral-8x7B-Instruct-v0.1, text-embedding-ada-002, nomic-embed-text-v1, Whisper large-v2, T5-XXL (RankGen encoder), GPT-2 (perplexity used), BART-large, Electra-large

Metrics

QAFactEval, RankGen, Self-BLEU, GPT2-perplexity, Word/sentence counts, Human scores (relevance, correctness, completeness)

Datasets

VedantaNY-10M (10M tokens, 765 hours of Vedanta Society NY YouTube transcripts)