Overview
The paper provides a clear, reproducible pipeline and reports both human and automatic metrics on a 10M-token corpus. The improvements are empirically supported but limited to one domain and a small evaluation set.
Citations: 0
Evidence Strength: 0.80
Confidence: 0.90
Risk Signals: 12
Trust Signals
Findings with numeric evidence: 3/3
Findings with evidence refs: 3/3
Results with explicit delta: 4/4
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 50%
Production readiness: 60%
Novelty: 60%
Why It Matters For Business
If you build knowledge products for niche domains, add retrieval augmentation with a keyword-aware retriever to raise factuality and credibility without expensive model retraining.
Who Should Care
Summary TLDR
The authors release VedantaNY-10M (10M tokens, 765 hours of Vedanta lectures) and build an in‑context retrieval-augmented chatbot. Compared to a non-RAG LLM, RAG responses are judged far more factual and complete (an 81% preference reported). A keyword-based hybrid retriever (combining sparse keyword signals with dense embeddings) plus a keyword-driven context refiner further improves relevance and faithfulness vs standard dense-RAG. The work highlights practical failure modes (transcription noise, spoken‑language fragments, retrieval-induced hallucinations) and provides code and dataset.
Problem Statement
Generic LLMs hallucinate and miss low-frequency niche terms. Specialized knowledge (here: Advaita Vedanta) needs a verified, updatable retrieval layer, plus retrieval strategies that surface rare domain terms and select an appropriate context length.
Main Contribution
VedantaNY-10M: a new 10M‑token dataset from ~765 hours of public Vedanta lectures (mostly English, ~3% Sanskrit transliterated).
A deployment recipe: in‑context RAG pipeline using retrieved passages + an LLM prompt (no finetuning of retriever/generator).
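The in-context recipe above (retrieved passages placed in an LLM prompt, with no finetuning) can be sketched as a prompt-assembly step. This is a minimal illustration, not the authors' prompt: the function name `build_rag_prompt`, the instruction wording, and the `max_chars` budget are all assumptions made for the example.

```python
def build_rag_prompt(question, passages, max_chars=4000):
    """Assemble an in-context RAG prompt from retrieved passages.

    Hypothetical helper: the template and the character budget are
    illustrative assumptions, not taken from the paper.
    """
    context_parts = []
    used = 0
    for i, passage in enumerate(passages, start=1):
        block = f"[{i}] {passage.strip()}"
        # Stop adding passages once the context budget would be exceeded.
        if used + len(block) > max_chars:
            break
        context_parts.append(block)
        used += len(block)
    context = "\n\n".join(context_parts)
    return (
        "Answer the question using only the passages below. "
        "If the passages do not contain the answer, say so.\n\n"
        f"Passages:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


if __name__ == "__main__":
    demo = build_rag_prompt(
        "What is maya?",
        ["Maya is described as appearance.", "Brahman is described as reality."],
    )
    print(demo)
```

The resulting string would be sent to whichever generator is in use; because nothing is finetuned, updating the knowledge base only requires re-indexing the corpus.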
Key Findings
RAG responses were strongly preferred over non-RAG responses by expert evaluators (81% preference for RAG).
Keyword-based hybrid retriever raised human-rated retrieval relevance from 0.59 to 0.82 (normalized).
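A hybrid retriever of this kind can be sketched as a weighted sum of a sparse keyword score and a dense embedding similarity. The exact scoring formula is not given in this summary, so the convex combination below, the keyword-overlap score, and the choice to apply the weight `lam` to the sparse term are assumptions; the default of 0.2 is the λ value this summary attributes to the paper. All function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def keyword_overlap(query_keywords, passage):
    """Fraction of query keywords that appear as whitespace tokens in the passage."""
    if not query_keywords:
        return 0.0
    tokens = set(passage.lower().split())
    hits = sum(1 for k in query_keywords if k.lower() in tokens)
    return hits / len(query_keywords)


def hybrid_score(dense_sim, sparse_sim, lam=0.2):
    """Assumed form: lam weights the sparse score, (1 - lam) the dense score."""
    return lam * sparse_sim + (1 - lam) * dense_sim


def rank_passages(query_vec, query_keywords, corpus, lam=0.2):
    """Rank (text, vector) pairs by hybrid score, best first."""
    scored = [
        (hybrid_score(cosine(query_vec, vec), keyword_overlap(query_keywords, text), lam), text)
        for text, vec in corpus
    ]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


if __name__ == "__main__":
    corpus = [
        ("talk about daily life", [1.0, 0.0]),
        ("maya and brahman explained", [1.0, 0.0]),
    ]
    for score, text in rank_passages([1.0, 0.0], ["maya", "brahman"], corpus):
        print(round(score, 3), text)
```

When two passages are equally close in embedding space, the keyword term breaks the tie in favor of the one that contains the rare domain terms, which is the behavior the finding describes.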
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Human preference (RAG vs non-RAG) | 81% preference for RAG | non-RAG | n/a | human evaluators across five question categories | Intro; Sec.5.3 | Intro; Sec.5.3 |
| Human retrieval relevance (normalized) | standard RAG 0.59 / keyword-RAG 0.82 | standard RAG | +0.23 | 25 triplets, human evaluation (Table 1) | Table 1 human evaluation | Table 1 |
What To Try In 7 Days
Create a small in-domain corpus (transcripts or docs) and index with dense embeddings.
Add a simple keyword extractor and combine sparse keyword scores with dense similarity (the paper used λ ≈ 0.2).
Implement a keyword-driven context refiner (trim/expand passages between first and last keyword) and run a quick human check.
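The context-refiner step in the list above can be prototyped in a few lines. The sketch below keeps the sentences between the first and last keyword hit and expands by an optional margin; working at sentence granularity, the regex sentence splitter, the `margin` parameter, and the fallback of returning the passage unchanged when no keyword matches are all assumptions for illustration, not the paper's implementation.

```python
import re


def refine_context(passage, keywords, margin=1):
    """Trim a passage to the span between the first and last keyword hit.

    Hypothetical helper: `margin` extra sentences are kept on each side.
    If no keyword occurs, the stripped passage is returned unchanged.
    """
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", passage.strip()) if s]
    lowered = [k.lower() for k in keywords]
    hits = [
        i for i, sentence in enumerate(sentences)
        if any(k in sentence.lower() for k in lowered)
    ]
    if not hits:
        return passage.strip()
    start = max(0, hits[0] - margin)
    end = min(len(sentences), hits[-1] + margin + 1)
    return " ".join(sentences[start:end])


if __name__ == "__main__":
    text = (
        "Intro talk. Nothing here. Maya is discussed. More detail. "
        "Brahman is real. Closing words. Goodbye."
    )
    print(refine_context(text, ["maya", "brahman"], margin=0))
```

For the quick human check, comparing refined and unrefined passages side by side on a handful of queries is usually enough to see whether the trimming removes needed context.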
Risks & Boundaries
Limitations
Single teacher / single-source dataset limits generality across teachers and traditions.
Transcriptions contain errors, especially for Sanskrit terms and punctuation.
When Not To Use
When original source text must be in native script (Sanskrit Devanagari) rather than transliteration.
When latency must be minimal, since retrieval and longer contexts increase token cost and latency.
Failure Modes
Irrelevant or incorrect retrievals that mislead the generator.
Retrieval-induced hallucinations where the model latches on to spurious phrases.

