Overview
Production Readiness: 0.3
Novelty Score: 0.4
Cost Impact Score: 0.5
Citation Count: 19
Why It Matters For Business
Connecting an LLM to a curated domain knowledge base through retrieval-augmented generation (RAG) yields measurable gains in factual accuracy and is a practical first step before costly generator fine-tuning.
Summary TLDR
The authors build a complete Retrieval-Augmented Generation (RAG) QA system over a curated CMU/LTI knowledge base to reduce hallucinations on domain-specific and time-sensitive queries. They crawl CMU sites, generate 34,781 QA pairs with an LLM annotator, and evaluate five variants: baseline LLM, RAG, embedding fine-tuning, generator fine-tuning, and both fine-tunings combined. RAG improves recall and F1 over the baseline; embedding fine-tuning gives extra gains; generator fine-tuning raises recall but can hurt F1 when the fine-tuning data is small and biased. The authors state that code and models are available on GitHub.
Problem Statement
Off-the-shelf LLMs often hallucinate on domain-specific or time-sensitive questions. The paper asks whether adding a curated, private knowledge base plus a RAG pipeline and targeted fine-tuning improves factual accuracy for CMU/LTI queries.
Main Contribution
A complete RAG QA system built over a crawled CMU / LTI knowledge base, including crawler, storage, retriever, reranker, and generator.
A large automatic annotation process producing 34,781 QA pairs (27,824 train, 6,957 test), using WizardLM as the annotator, with annotation agreement checked at Cohen's Kappa = 0.67.
An ablation study showing that RAG and embedding fine-tuning improve factual metrics, while generator fine-tuning on a small, biased dataset can hurt answer quality.
Key Findings
Adding RAG improves recall and F1 over the baseline LLM.
Fine-tuning the embedding model yields further gains.
Fine-tuning the generator increased recall but lowered F1 and fluency in some cases.
The curated dataset is moderate in size (34,781 QA pairs) and in annotation quality (Cohen's Kappa = 0.67).
Results
Reported metrics: Recall, F1 Score, BLEU
What To Try In 7 Days
Crawl your domain docs, filter noisy pages, and build a small KB.
Add off-the-shelf embeddings and a simple retriever, then compare RAG against the baseline on 100 domain QA samples.
Fine-tune the embedding model on your QA pairs before attempting generator fine-tuning.
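The retrieval step in the second item can be prototyped before any embedding model is chosen. The sketch below is a minimal illustration, not the paper's pipeline: it uses bag-of-words cosine similarity as a stand-in for a real embedder, and the knowledge-base strings, the prompt wording, and the function names (`retrieve`, `build_prompt`) are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(question, chunks, k=2):
    """Return the k chunks most similar to the question.

    Bag-of-words similarity is an assumption made for this sketch;
    a real system would score chunks with an embedding model.
    """
    q = Counter(tokenize(question))
    scored = [(cosine(q, Counter(tokenize(c))), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:k]]


def build_prompt(question, contexts):
    """Assemble a RAG prompt; the wording is a hypothetical template."""
    context_block = "\n".join(f"- {c}" for c in contexts)
    return (
        "Use only the context to answer.\n"
        f"{context_block}\n"
        f"Question: {question}"
    )


# Toy knowledge base, invented for illustration only.
KB = [
    "The example institute offers a master's program in language technologies.",
    "The campus cafeteria opens at 8 am on weekdays.",
    "Admissions for the master's program close in December.",
]

question = "When do admissions for the master's program close?"
top = retrieve(question, KB, k=1)
prompt = build_prompt(question, top)
```

Sending `prompt` to the generator and the bare `question` to the same generator gives the RAG and baseline answers to compare on the 100 samples.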
Optimization Features
Model Optimization
- INT4 quantization for LLaMA-2-7B
- LoRA
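For reference, LoRA freezes a pretrained weight matrix and trains only a low-rank update. In the standard formulation below, W_0 is the frozen weight, B and A are the trainable factors, r is the rank, and α is a scaling constant; the rank and scaling used in the paper are not given in this summary.

```latex
W = W_0 + \frac{\alpha}{r}\, B A,
\qquad B \in \mathbb{R}^{d \times r},\quad
A \in \mathbb{R}^{r \times k},\quad
r \ll \min(d, k)
```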
Training Optimization
- Embedding fine-tune with MultipleNegativesRankingLoss
- LoRA
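The idea behind MultipleNegativesRankingLoss can be sketched in plain Python: for a batch of (question, passage) embedding pairs, each question should score its own passage higher than every other passage in the batch. This is an illustration of the loss only, not the authors' training code; the scale of 20 mirrors the common default in sentence-transformers, and the function names are hypothetical.

```python
import math


def cos_sim(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def mnr_loss(anchors, positives, scale=20.0):
    """In-batch softmax cross-entropy over scaled cosine similarities.

    anchors[i] and positives[i] form a matching pair; every other
    positives[j] in the batch serves as a negative for anchors[i].
    """
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * cos_sim(anchor, p) for p in positives]
        # Numerically stable log-sum-exp.
        peak = max(logits)
        log_sum_exp = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # Cross-entropy with the matching passage (index i) as the target.
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)


# Well-aligned pairs give a loss near zero; mismatched pairs give a large loss.
aligned = mnr_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = mnr_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

Because negatives come from the rest of the batch, the loss needs only positive QA pairs, which suits an automatically generated QA dataset.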
Reproducibility
Code Urls
- GitHub (paper states code/models available; no URL provided in text)
Open Source Status
- partial
Risks & Boundaries
Limitations
- Training data is auto-annotated and moderately noisy (Cohen's Kappa = 0.67).
- Generator fine-tuning used limited compute and a 7B-parameter model; results may not carry over to larger models.
- Dataset is domain-specific and biased toward CMU/LTI content.
- Fine-tuned generator produced verbose or templated outputs in examples.
When Not To Use
- For open-domain or web-wide QA where a representative KB cannot be built.
- When you lack a reasonably large, high-quality domain dataset for generator fine-tuning.
- When strict brevity or fluency matters more than recall, unless outputs are validated further.
Failure Modes
- Overfitting to a small, biased fine-tuning dataset, which reduces generation quality.
- Repetitive or filler tokens injected from dataset formatting ("context:", "answer:").
- Remaining hallucinations when the retriever returns irrelevant chunks.
Core Entities
Models
- meta-llama/Llama-2-7b-chat-hf (LLaMA-2-7B)
- mixedbread-ai/mxbai-embed-large-v1 (embedder)
- BAAI/bge-reranker-large (BgeRerank)
- WizardLM (annotation model)
- GPT4All (annotation candidate)
Metrics
- Recall
- F1 Score
- Cosine Similarity
- BLEU
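A small sketch shows how recall and F1 can move in opposite directions, as reported for the fine-tuned generator. It assumes token-overlap definitions in the style of extractive QA evaluation; the paper's exact metric definitions are not given in this summary, and the example strings are invented.

```python
from collections import Counter


def token_scores(prediction, reference):
    """Return (recall, f1) based on whitespace-token overlap."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    # Multiset intersection counts each shared token at most as often
    # as it appears in both strings.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0, 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    f1 = 2 * precision * recall / (precision + recall)
    return recall, f1


# A verbose answer covers the whole reference (full recall) but is
# penalized on precision, which pulls F1 down.
recall, f1 = token_scores(
    "the deadline is december 15 for all applicants",
    "december 15",
)
```

Here the eight-token answer contains both reference tokens, so recall is 1.0 while precision is 0.25 and F1 is 0.4.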
Datasets
- Curated CMU / LTI crawl (html + pdf + papers)
- Generated QA pairs (34,781 total; 27,824 train, 6,957 test)
Benchmarks
- Local human-evaluated test set (128 random QA samples per run)
Context Entities
Models
- WizardLM (used for automatic annotation)
- GPT4All, LLaMA-2 (evaluated as annotators)
Metrics
- Cohen's Kappa (annotation agreement)
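Cohen's Kappa corrects raw agreement between two annotators for the agreement expected by chance. The sketch below computes it for two label lists; the label values are invented, and how the paper paired annotators or defined labels is not stated in this summary.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items labeled identically.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical labels from two annotators on four QA pairs.
kappa = cohens_kappa(
    ["ok", "ok", "bad", "bad"],
    ["ok", "ok", "bad", "ok"],
)
```

In this toy case observed agreement is 0.75 and chance agreement is 0.5, giving a Kappa of 0.5.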
Datasets
- Semantic Scholar 2023 papers for LTI faculty
- CMU website pages filtered with CMU/LTI keywords

