Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.5
Citation Count
0
Why It Matters For Business
If you build knowledge products for niche domains, add a retrieval layer with keyword-aware retrieval to raise factuality and credibility without expensive model retraining.
Summary TLDR
The authors release VedantaNY-10M (10M tokens, 765 hours of Vedanta lectures) and build an in‑context retrieval-augmented chatbot. Compared to a non-RAG LLM, RAG responses are judged far more factual and complete (an 81% preference reported). A keyword-based hybrid retriever (combining sparse keyword signals with dense embeddings) plus a keyword-driven context refiner further improves relevance and faithfulness vs standard dense-RAG. The work highlights practical failure modes (transcription noise, spoken‑language fragments, retrieval-induced hallucinations) and provides code and dataset.
Problem Statement
Generic LLMs hallucinate and miss low-frequency niche terms. For specialized knowledge (here: Advaita Vedanta), we need a verified, updatable retrieval layer and retrieval strategies that surface rare domain terms and select an appropriate context length.
Main Contribution
VedantaNY-10M: a new 10M‑token dataset from ~765 hours of public Vedanta lectures (mostly English, ~3% Sanskrit transliterated).
A deployment recipe: in‑context RAG pipeline using retrieved passages + an LLM prompt (no finetuning of retriever/generator).
A keyword-based hybrid retriever that upweights sparse keyword signals to recover low-frequency Sanskrit/domain terms.
A keyword-based context refiner that expands or trims each retrieved passage so it spans the first through last keyword occurrences.
Extensive automatic and human evaluation showing RAG > non-RAG and keyword‑hybrid RAG > standard dense RAG.
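The hybrid retriever idea can be illustrated with a minimal sketch. It assumes the combined score is a convex mix, λ·sparse + (1−λ)·dense, with λ weighting the sparse keyword term; the summary only states that sparse and dense signals are combined with λ≈0.2, so the exact formula, the keyword-overlap scorer, and the function names below are hypothetical, not the authors' implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def keyword_score(keywords, passage):
    """Fraction of query keywords found in the passage (assumed sparse signal)."""
    if not keywords:
        return 0.0
    tokens = set(passage.lower().split())
    return sum(1 for k in keywords if k.lower() in tokens) / len(keywords)

def hybrid_rank(query_vec, keywords, passages, passage_vecs, lam=0.2):
    """Rank passages by lam * sparse + (1 - lam) * dense (assumed formula)."""
    scored = []
    for text, vec in zip(passages, passage_vecs):
        sparse = keyword_score(keywords, text)
        dense = cosine(query_vec, vec)
        scored.append((lam * sparse + (1 - lam) * dense, text))
    # Highest combined score first.
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

In a real pipeline the vectors would come from an embedding model such as those listed under Core Entities, and the sparse term would more likely be a BM25-style score than raw keyword overlap. The point of the sketch is only that a passage containing a rare domain term can outrank a passage that is slightly closer in embedding space.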
Key Findings
RAG responses were strongly preferred over non-RAG responses by experts.
Keyword-based hybrid retriever raised human-rated retrieval relevance from 0.59 to 0.82 (normalized).
Keyword-RAG improved faithfulness metrics (QAFactEval) on the evaluation set.
Results
Human preference (RAG vs non-RAG)
Human retrieval relevance (normalized)
QAFactEval (answer vs retrieved evidence)
GPT2 perplexity (lower better)
Who Should Care
What To Try In 7 Days
Create a small in-domain corpus (transcripts or docs) and index with dense embeddings.
Add a simple keyword extractor and combine sparse keyword scores with dense similarity (the paper used λ≈0.2).
Implement a keyword-driven context refiner (trim/expand passages between first and last keyword) and run a quick human check.
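For the context refiner step, a rough sketch might look like the following. It assumes the transcript is available as a list of word tokens and that a retrieved passage is a half-open index range into it; the `expand` and `margin` parameters and the punctuation stripping are illustrative choices, not details taken from the paper.

```python
def refine_context(words, start, end, keywords, expand=50, margin=5):
    """Trim or expand the passage words[start:end] around keyword hits.

    Looks for keywords in a window that extends `expand` tokens beyond the
    retrieved passage on each side, then returns the span from the first to
    the last hit, padded by `margin` tokens. Both parameters are hypothetical.
    """
    lo = max(0, start - expand)
    hi = min(len(words), end + expand)
    wanted = {k.lower() for k in keywords}
    hits = [
        i for i in range(lo, hi)
        if words[i].lower().strip(".,;:!?\"'") in wanted
    ]
    if not hits:
        # No keyword nearby: fall back to the passage as retrieved.
        return words[start:end]
    first = max(lo, hits[0] - margin)
    last = min(hi, hits[-1] + margin + 1)
    return words[first:last]
```

When the keywords sit just outside a fixed-length passage, the window lets the refined span grow to include them; when they cluster in a small part of a long passage, the span shrinks to that region. A quick human check, as suggested above, is still needed because a heuristic like this can drop context that carries no keyword but matters for the answer.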
Reproducibility
Code Urls
Data Urls
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Single teacher / single-source dataset limits generality across teachers and traditions.
- Transcriptions contain errors, especially for Sanskrit terms and punctuation.
- Dataset is spoken rather than written text; spoken fragments cause retrieval/context noise.
- The context refiner is heuristic and may not generalize; a trained summarizer or refiner is needed.
- Evaluation set is small (25 triplets in some automatic comparisons) and lacks gold references.
When Not To Use
- When original source text must be in native script (Sanskrit Devanagari) rather than transliteration.
- When latency must be minimal: retrieval and longer contexts increase token costs and latency.
- When domain sources are already well covered in model pretraining and parametric memory suffices.
Failure Modes
- Irrelevant or incorrect retrievals that mislead the generator.
- Retrieval-induced hallucinations where the model latches on to spurious phrases.
- Errors from noisy automatic transcripts causing wrong facts or attributions.
- Fixed-length passage retrievals that cut off needed context or include distracting text.
Core Entities
Models
- GPT-4-turbo
- Mixtral-8x7B-Instruct-v0.1
- text-embedding-ada-002
- nomic-embed-text-v1
- Whisper large-v2
- T5-XXL (RankGen encoder)
- GPT-2 (used for perplexity)
- BART-large
- Electra-large
Metrics
- QAFactEval
- RankGen
- Self-BLEU
- GPT2-perplexity
- Word/sentence counts
- Human scores: relevance, correctness, completeness
Datasets
- VedantaNY-10M (10M tokens, 765 hours of Vedanta Society NY YouTube transcripts)

