RAG + a 10M‑token Vedanta corpus cuts hallucinations for niche long‑form QA

Overview

Decision Snapshot: Needs Validation

The paper provides a clear, reproducible pipeline with human and automatic metrics on a 10M-token corpus; the improvements are empirically supported but limited to one domain and a small evaluation set.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.90

Risk Signals: 12

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 60%

Authors

Priyanka Mandikal

Links

Abstract / PDF / Code / Data

Why It Matters For Business

If you build knowledge products for niche domains, add a retrieval layer with keyword-aware retrieval to raise factuality and credibility without expensive model retraining.

Who Should Care

CTO, Product Manager, ML Engineer, Data Scientist

Summary TLDR

The authors release VedantaNY-10M (10M tokens, 765 hours of Vedanta lectures) and build an in‑context retrieval-augmented chatbot. Compared to a non-RAG LLM, RAG responses are judged far more factual and complete (an 81% preference reported). A keyword-based hybrid retriever (combining sparse keyword signals with dense embeddings) plus a keyword-driven context refiner further improves relevance and faithfulness vs standard dense-RAG. The work highlights practical failure modes (transcription noise, spoken‑language fragments, retrieval-induced hallucinations) and provides code and dataset.

Problem Statement

Generic LLMs hallucinate and miss low-frequency niche terms. Specialized knowledge (here: Advaita Vedanta) needs a verified, updatable retrieval layer, plus retrieval strategies that surface rare domain terms and select an appropriate context length.

Main Contribution

VedantaNY-10M: a new 10M‑token dataset from ~765 hours of public Vedanta lectures (mostly English, ~3% Sanskrit transliterated).

A deployment recipe: in‑context RAG pipeline using retrieved passages + an LLM prompt (no finetuning of retriever/generator).
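The in-context recipe above can be illustrated with a minimal sketch of the prompt-assembly step: retrieved passages are pasted into the prompt and nothing is fine-tuned. The function name `build_prompt`, the `max_passages` cap, and the instruction wording are hypothetical choices for illustration, not the paper's actual prompt.

```python
def build_prompt(question, passages, max_passages=5):
    """Assemble an in-context RAG prompt from already-retrieved passages.

    Assumption: the instruction text and numbering scheme are illustrative;
    the paper's real prompt template is not reproduced here.
    """
    # Number each passage so an answer can point back to its source.
    context = "\n\n".join(
        f"[{i}] {p.strip()}"
        for i, p in enumerate(passages[:max_passages], start=1)
    )
    return (
        "Answer the question using only the passages below. "
        "If the passages do not contain the answer, say so.\n\n"
        f"Passages:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


if __name__ == "__main__":
    demo = build_prompt(
        "What is maya?",
        ["Maya is described as that which veils.", "A second retrieved passage."],
    )
    print(demo)
```

The resulting string would be sent to whichever generator is in use; because the retriever and generator stay frozen, updating the corpus only requires re-indexing.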

Key Findings

RAG responses were strongly preferred over non-RAG responses by experts.

Numbers: 81% preference rate (reported by domain experts)

Practical Use: Use a retrieval layer for niche long-form QA to reduce hallucinations and increase expert trust.

Evidence Ref: Intro; Sec. 5.3 human evaluation

Keyword-based hybrid retriever raised human-rated retrieval relevance from 0.59 to 0.82 (normalized).

Numbers: Human relevance 0.59 → 0.82 (standard RAG vs keyword-RAG)

Practical Use: Combine sparse keyword signals with dense embeddings to find low-frequency domain terms and boost retrieval quality.

Evidence Ref: Table 1 (Human evaluation)
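A minimal sketch of how sparse keyword signals might be blended with dense similarity is shown here. Two things are assumptions: that λ weights the sparse term in a convex combination, and that keyword overlap stands in for the sparse score. The paper's exact formula is not reproduced in this summary, and the helper names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def keyword_score(query_keywords, passage):
    """Fraction of query keywords found in the passage (stand-in sparse score)."""
    if not query_keywords:
        return 0.0
    tokens = set(passage.lower().split())
    hits = sum(1 for kw in query_keywords if kw.lower() in tokens)
    return hits / len(query_keywords)


def hybrid_score(dense_sim, sparse_sim, lam=0.2):
    # Assumption: lam weights the sparse term, (1 - lam) the dense term.
    return lam * sparse_sim + (1.0 - lam) * dense_sim


def rank(query_vec, query_keywords, passages, lam=0.2):
    """Rank (text, embedding) pairs by hybrid score, best first."""
    scored = [
        (hybrid_score(cosine(query_vec, emb),
                      keyword_score(query_keywords, text), lam), text)
        for text, emb in passages
    ]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

With two passages that are equally close in embedding space, the one containing the rare query keyword is ranked first, which is the behavior the finding attributes to the hybrid retriever.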

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Human preference (RAG vs non-RAG)	81% preference for RAG	non-RAG	n/a	human evaluators across five question categories	Intro; Sec.5.3	Intro; Sec.5.3
Human retrieval relevance (normalized)	standard RAG 0.59 / keyword-RAG 0.82	standard RAG	+0.23	25 triplets, human evaluation (Table 1)	Table 1 human evaluation	Table 1

What To Try In 7 Days

Create a small in-domain corpus (transcripts or docs) and index with dense embeddings.

Add a simple keyword extractor and combine sparse keyword scores with dense similarity (the paper used λ≈0.2).

Implement a keyword-driven context refiner (trim/expand passages between first and last keyword) and run a quick human check.
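The context-refiner step in the list above can be sketched as follows. This is a simplified illustration that only trims a passage to the span between the first and last keyword hit, with a small word margin on each side; expanding into neighboring passages is omitted. The function name `refine_context` and the `margin` parameter are hypothetical, not taken from the paper's code.

```python
def refine_context(passage, keywords, margin=5):
    """Trim a passage to the span between its first and last keyword hit.

    Assumption: `margin` words of padding on each side is an illustrative
    choice; if no keyword is found, the passage is returned unchanged.
    """
    words = passage.split()
    wanted = {k.lower() for k in keywords}
    # Strip surrounding punctuation so "Brahman," still matches "brahman".
    hits = [
        i for i, w in enumerate(words)
        if w.lower().strip(".,;:!?\"'()") in wanted
    ]
    if not hits:
        return passage
    start = max(0, hits[0] - margin)
    end = min(len(words), hits[-1] + margin + 1)
    return " ".join(words[start:end])


if __name__ == "__main__":
    text = "So, as we were saying, maya veils and projects, while Brahman, the real, remains."
    print(refine_context(text, ["maya", "brahman"], margin=2))
```

For the quick human check, showing raters the original and refined passage side by side is one way to see whether trimming removed needed context.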

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://sites.google.com/view/vedantany-10m
https://github.com/priyankamandikal/vedantany-10m

Data URLs

https://sites.google.com/view/vedantany-10m
https://github.com/priyankamandikal/vedantany-10m

Risks & Boundaries

Limitations

Single teacher / single-source dataset limits generality across teachers and traditions.

Transcriptions contain errors, especially for Sanskrit terms and punctuation.

When Not To Use

When original source text must be in native script (Sanskrit Devanagari) rather than transliteration.

When latency must be minimal: retrieval and longer contexts increase both token costs and latency.

Failure Modes

Irrelevant or incorrect retrievals that mislead the generator.

Retrieval-induced hallucinations where the model latches on to spurious phrases.

Core Entities

Models

GPT-4-turbo, Mixtral-8x7B-Instruct-v0.1, text-embedding-ada-002, nomic-embed-text-v1, Whisper large-v2, T5-XXL (RankGen encoder), GPT-2 (perplexity used), BART-large, Electra-large

Metrics

QAFactEval, RankGen, Self-BLEU, GPT2-perplexity, Word/sentence counts, Human scores (relevance, correctness, completeness)

Datasets

VedantaNY-10M (10M tokens, 765 hours of Vedanta Society NY YouTube transcripts)
