Using a targeted RAG pipeline and curated CMU dataset to reduce LLM hallucinations on domain queries

March 15, 2024 · 7 min

Overview

Decision Snapshot: Needs Validation

The pipeline shows clear RAG and embedding gains on a specific CMU dataset, but the limited dataset size, annotation noise, and the small number of generator fine-tuning steps reduce production readiness and generalizability.

Citations: 19

Evidence Strength: 0.60

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 3/3

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 30%

Novelty: 40%

Authors

Jiarui Li, Ye Yuan, Zehua Zhang

Why It Matters For Business

Connecting an LLM to a curated domain knowledge base (RAG) gives measurable factual gains and is a practical first step before costly generator finetuning.

Who Should Care

Summary TLDR

The authors build a complete Retrieval-Augmented Generation (RAG) QA system over a curated CMU/LTI knowledge base to reduce hallucinations on domain and time-sensitive queries. They crawl CMU sites, generate 34,781 QA pairs with an LLM annotator, and evaluate five variants: baseline LLM, RAG, RAG with embedding fine-tuning, RAG with generator fine-tuning, and RAG with both. RAG improves recall and F1 over the baseline, and embedding fine-tuning adds further gains. Generator fine-tuning raises recall but lowers F1 relative to plain RAG when the fine-tuning data is small and biased. The authors state that code and models are available on GitHub.

Problem Statement

Off-the-shelf LLMs often hallucinate on domain-specific or time-sensitive questions. The paper asks whether adding a curated, private knowledge base plus a RAG pipeline and targeted fine-tuning improves factual accuracy for CMU/LTI queries.

Main Contribution

A complete RAG QA system built over a crawled CMU / LTI knowledge base, including crawler, storage, retriever, reranker, and generator.

A large automatic annotation process that produces 34,781 QA pairs (27,824 train, 6,957 test) using WizardLM as the annotator, with annotation agreement checked at Cohen's Kappa = 0.67.
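The retrieve, rerank, and generate flow named above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the bag-of-words cosine scorer stands in for the embedding model, the overlap-count reranker stands in for a cross-encoder reranker, and every function name, the prompt wording, and the sample documents are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query, docs, k=3):
    """First stage: rank all documents by bag-of-words cosine (embedder stand-in)."""
    q = Counter(tokenize(query))
    ranked = sorted(docs, key=lambda d: cosine(q, Counter(tokenize(d))), reverse=True)
    return ranked[:k]


def rerank(query, candidates, k=1):
    """Second stage: reorder the shortlist by shared-token count (reranker stand-in)."""
    q = set(tokenize(query))
    return sorted(candidates, key=lambda d: len(q & set(tokenize(d))), reverse=True)[:k]


def build_prompt(query, passages):
    """Assemble the text that would be sent to the generator (hypothetical wording)."""
    context = "\n".join(passages)
    return f"Use only the context to answer.\n\n{context}\n\nQuestion: {query}"


# Illustrative knowledge base, not taken from the paper's crawl.
docs = [
    "The Language Technologies Institute is part of Carnegie Mellon University.",
    "Pittsburgh has many bridges.",
    "Retrieval-augmented generation grounds answers in retrieved passages.",
]
query = "Which university is the Language Technologies Institute part of?"
top = rerank(query, retrieve(query, docs, k=2), k=1)
prompt = build_prompt(query, top)
```

Keeping the retriever and reranker as separate functions mirrors the two-stage design: a cheap scorer narrows the candidate set, and a more expensive scorer only sees the shortlist.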

Key Findings

Adding RAG boosts retrieval and answer quality over the baseline LLM.

Numbers: Recall 0.361 -> 0.409; F1 0.186 -> 0.289

Practical Use: Use a retrieval step (RAG) to raise factual recall and F1 on domain queries; it's a low-risk first step before model finetuning.

Evidence Ref: Table 1

Fine-tuning the embedding model yields further gains.

Numbers: Recall 0.409 -> 0.437; F1 0.289 -> 0.304

Practical Use: Fine-tune embeddings on your domain QA pairs to improve retrieval relevance; this often gives measurable gains without changing the generator.

Evidence Ref: Table 1

Results

Metric: Recall
Value (mean, SD): Baseline 0.361 (SD 0.069); Raw RAG 0.409 (SD 0.081); +Emb 0.437 (SD 0.076); +Core 0.448 (SD 0.106); +Both 0.452 (SD 0.107)
Baseline: 0.361
Delta: Best 0.452 (+0.091 vs baseline)
Split / Dataset: Local human-evaluated test set (128 QA samples per run)
Evidence: Table 1 reports mean and SD from 4 runs
Evidence Ref: Table 1

Metric: F1 Score
Value (mean, SD): Baseline 0.186 (SD 0.032); Raw RAG 0.289 (SD 0.065); +Emb 0.304 (SD 0.063); +Core 0.211 (SD 0.056); +Both 0.219 (SD 0.060)
Baseline: 0.186
Delta: Raw RAG +0.103; +Emb +0.118; +Core +0.025
Split / Dataset: Local human-evaluated test set
Evidence: Table 1 shows F1 and SD over 4 runs
Evidence Ref: Table 1
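The deltas quoted for Table 1 can be rechecked directly from the reported means. The short script below recomputes each variant's change against the baseline; the means and variant labels are copied from the results above, while the dictionary layout and function name are just one convenient arrangement.

```python
# Mean scores as reported in Table 1 (4 runs each).
baseline = {"recall": 0.361, "f1": 0.186}
variants = {
    "Raw RAG": {"recall": 0.409, "f1": 0.289},
    "+Emb": {"recall": 0.437, "f1": 0.304},
    "+Core": {"recall": 0.448, "f1": 0.211},
    "+Both": {"recall": 0.452, "f1": 0.219},
}


def deltas(variants, baseline):
    """Return each variant's absolute change against the baseline, per metric."""
    return {
        name: {metric: round(scores[metric] - baseline[metric], 3) for metric in baseline}
        for name, scores in variants.items()
    }


table = deltas(variants, baseline)
for name, row in table.items():
    print(f"{name:8s} recall {row['recall']:+.3f}  f1 {row['f1']:+.3f}")
```

Note that the reported standard deviations (0.03 to 0.11) are comparable in size to several of these deltas, which is one reason the snapshot flags the results as needing validation.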

What To Try In 7 Days

Crawl your domain docs, filter noisy pages, and build a small KB.

Add off-the-shelf embeddings + a simple retriever and test RAG vs baseline on 100 domain QA samples.

Fine-tune embeddings on your QA pairs before touching generator finetuning.
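For the RAG-versus-baseline comparison in the second step, a small scoring harness is enough to get started. The sketch below assumes token-overlap recall and F1 in the style of common extractive QA benchmarks; the summary does not say exactly how the paper computes its metrics, so treat this definition, the function names, and the toy answers as assumptions.

```python
import re
from collections import Counter


def tokens(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def recall_f1(prediction, reference):
    """Token-overlap recall and F1 for one prediction/reference pair."""
    pred, ref = Counter(tokens(prediction)), Counter(tokens(reference))
    overlap = sum((pred & ref).values())  # multiset intersection
    if overlap == 0:
        return 0.0, 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return recall, 2 * precision * recall / (precision + recall)


def mean_scores(predictions, references):
    """Average recall and F1 over a list of QA samples."""
    pairs = [recall_f1(p, r) for p, r in zip(predictions, references)]
    if not pairs:
        raise ValueError("need at least one sample")
    n = len(pairs)
    return sum(r for r, _ in pairs) / n, sum(f for _, f in pairs) / n


# Hypothetical samples; replace with your ~100 domain QA pairs.
references = ["Carnegie Mellon University", "retrieval augmented generation"]
baseline_answers = ["I think it is a university in Pittsburgh", "generation"]
rag_answers = ["Carnegie Mellon University", "retrieval augmented generation"]

baseline_recall, baseline_f1 = mean_scores(baseline_answers, references)
rag_recall, rag_f1 = mean_scores(rag_answers, references)
```

Running the same harness over several random samples of the test set, as the paper does with 4 runs, gives a mean and spread rather than a single point estimate.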

Optimization Features

Model Optimization
INT4 quantization for LLaMA-2-7B
LoRA
Training Optimization
Embedding fine-tune with MultipleNegativesRankingLoss
LoRA
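MultipleNegativesRankingLoss treats each question's paired passage as the positive and every other passage in the batch as a negative, then applies cross-entropy over scaled cosine similarities. The dependency-free sketch below computes that objective on toy vectors to show the mechanics. It is not the authors' training code: the scale of 20, the function names, and the 2-dimensional vectors are assumptions, and real training would use a library implementation with gradient updates.

```python
import math


def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def in_batch_ranking_loss(queries, passages, scale=20.0):
    """Mean cross-entropy where passages[i] is the positive for queries[i]
    and all other passages in the batch act as negatives."""
    losses = []
    for i, q in enumerate(queries):
        logits = [scale * cosine(q, p) for p in passages]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)


# Toy embeddings: two questions and two candidate passages.
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # each passage close to its own question
swapped = [[0.1, 0.9], [0.9, 0.1]]   # each passage close to the other question

loss_aligned = in_batch_ranking_loss(queries, aligned)
loss_swapped = in_batch_ranking_loss(queries, swapped)
```

The loss is near zero when each question is closest to its own passage and large when the pairing is wrong, which is the signal that pulls domain questions toward their source passages during embedding fine-tuning.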

Reproducibility

Code Available: Yes
Data Available: No
Open Source Status: Partial
License: Unknown

Code URLs

GitHub (paper states code/models available; no URL provided in text)

Risks & Boundaries

Limitations

Training data is auto-annotated and moderately noisy (Cohen's Kappa = 0.67).

Generator finetuning used limited compute and a small 7B model; results may not scale.
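Cohen's Kappa corrects raw agreement between two raters for the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and p_e is chance agreement from each rater's label frequencies. The sketch below computes it for two label lists. The labels and annotator names are made up for illustration, and this summary does not specify which two raters the paper compared.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(labels_a)
    # Observed agreement: fraction of items where both raters gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the raters' marginal frequencies, summed over labels.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical quality judgements on ten generated QA pairs.
annotator_a = ["ok", "ok", "ok", "bad", "bad", "ok", "bad", "ok", "ok", "bad"]
annotator_b = ["ok", "ok", "bad", "bad", "bad", "ok", "ok", "ok", "ok", "bad"]
kappa = cohens_kappa(annotator_a, annotator_b)
```

In this toy case the raters agree on 8 of 10 items, yet kappa is only about 0.58 because chance agreement is already 0.52. The reported 0.67 should be read the same way: solid but imperfect agreement, so some training pairs are likely wrong.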

When Not To Use

For open-domain or web-wide QA where a representative KB cannot be built.

When you lack a reasonably large, high-quality domain dataset for generator finetuning.

Failure Modes

Overfitting to small biased finetune data, reducing generation quality.

Repetitive or filler tokens injected from dataset formatting ("context:", "answer:").
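One low-cost mitigation for the second failure mode is to strip leaked formatting labels from generated text before scoring or display. The sketch below is an assumed post-processing step, not something the paper describes: the label list comes from the examples above, and the function name and sample string are hypothetical.

```python
import re

# Formatting labels observed leaking into generations (from the failure mode above).
FILLER_LABELS = ("context", "answer")


def clean_generation(text, labels=FILLER_LABELS):
    """Remove 'label:' markers (case-insensitive) and collapse leftover whitespace."""
    alternatives = "|".join(re.escape(label) for label in labels)
    pattern = re.compile(r"\b(?:%s)\s*:\s*" % alternatives, re.IGNORECASE)
    cleaned = pattern.sub(" ", text)
    return re.sub(r"\s+", " ", cleaned).strip()


raw = "answer: answer: The program is in Pittsburgh. context: answer:"
cleaned = clean_generation(raw)
```

This treats the symptom only, and it would also remove a legitimate "answer:" that appears in real content. Removing the labels from the fine-tuning data format is the more durable fix.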

Core Entities

Models

meta-llama/Llama-2-7b-chat-hf (LLaMA-2-7B)

mixedbread-ai/mxbai-embed-large-v1 (embedder)

BAAI/bge-reranker-large (BgeRerank)

WizardLM (annotation model)

GPT4All (annotation candidate)

Metrics

Recall

F1 Score

Cosine Similarity

BLEU

Datasets

Curated CMU / LTI crawl (html + pdf + papers)

Generated QA pairs (34,781 total; 27,824 train, 6,957 test)

Benchmarks

Local human-evaluated test set (random 128 QA samples per run)

Context Entities

Models

WizardLM (used for automatic annotation)

GPT4All, LLaMA-2 (evaluated as annotators)

Metrics

Cohen's Kappa (annotation agreement)

Datasets

Semantic Scholar 2023 papers for LTI faculty

CMU website pages filtered with CMU/LTI keywords