Using a targeted RAG pipeline and curated CMU dataset to reduce LLM hallucinations on domain queries

Overview

Decision Snapshot: Needs Validation

The pipeline shows clear gains from RAG and embedding fine-tuning on a specific CMU dataset, but the limited dataset size, annotation noise, and small-scale generator fine-tuning reduce production readiness and generalizability.

Citations: 19

Evidence Strength: 0.60

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 3/3

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 30%

Novelty: 40%

Authors

Jiarui Li, Ye Yuan, Zehua Zhang


Why It Matters For Business

Connecting an LLM to a curated domain knowledge base (RAG) gives measurable factual gains and is a practical first step before costly generator finetuning.

Who Should Care

Product Manager, ML Engineer, Data Scientist, CTO

Summary TLDR

The authors build a complete Retrieval-Augmented Generation (RAG) QA system over a curated CMU/LTI knowledge base to reduce hallucinations on domain-specific and time-sensitive queries. They crawl CMU sites, generate 34,781 QA pairs with an LLM annotator, and evaluate five variants: baseline LLM, RAG, embedding fine-tuning, generator fine-tuning, and both. RAG improves recall and F1 over the baseline; embedding fine-tuning adds further gains; generator fine-tuning raises recall but can hurt F1 when the fine-tuning data is small and biased. The authors state that code and models are available on GitHub.

Problem Statement

Off-the-shelf LLMs often hallucinate on domain-specific or time-sensitive questions. The paper asks whether adding a curated, private knowledge base plus a RAG pipeline and targeted fine-tuning improves factual accuracy for CMU/LTI queries.

Main Contribution

A complete RAG QA system built over a crawled CMU / LTI knowledge base, including crawler, storage, retriever, reranker, and generator.

A large automatic annotation process producing 34,781 QA pairs (27,824 train, 6,957 test) using WizardLM as the annotator, with an agreement check of Cohen's Kappa = 0.67.
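The agreement check mentioned above can be illustrated with a small, dependency-free computation of Cohen's Kappa. This is a sketch only: the source does not say which two raters were compared or what label set they used, so the `annotator` and `human` lists and the "ok"/"bad" labels below are hypothetical.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items where both raters chose the same label.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement expected from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_o - p_e) / (1.0 - p_e)

# Hypothetical judgments on eight generated QA pairs (not data from the paper).
annotator = ["ok", "ok", "bad", "ok", "bad", "ok", "ok", "bad"]
human     = ["ok", "ok", "bad", "bad", "bad", "ok", "ok", "ok"]
kappa = cohens_kappa(annotator, human)
print(round(kappa, 3))
```

Kappa corrects raw agreement for agreement expected by chance, which is why it is lower than the simple fraction of matching labels.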

Key Findings

Adding RAG boosts retrieval and answer quality over the baseline LLM.

Numbers: Recall 0.361 -> 0.409; F1 0.186 -> 0.289

Practical Use: Use a retrieval step (RAG) to raise factual recall and F1 on domain queries; it's a low-risk first step before model fine-tuning.

Evidence Ref: Table 1

Fine-tuning the embedding model yields further gains.

Numbers: Recall 0.409 -> 0.437; F1 0.289 -> 0.304

Practical Use: Fine-tune embeddings on your domain QA pairs to improve retrieval relevance; this often gives measurable gains without changing the generator.

Evidence Ref: Table 1

Results

Split / Dataset: local human-evaluated test set (128 QA samples per run). Values are mean ± SD over 4 runs. Evidence Ref: Table 1.

Variant	Recall (mean ± SD)	Recall delta vs baseline	F1 (mean ± SD)	F1 delta vs baseline
Baseline	0.361 ± 0.069	n/a	0.186 ± 0.032	n/a
Raw RAG	0.409 ± 0.081	+0.048	0.289 ± 0.065	+0.103
+Emb	0.437 ± 0.076	+0.076	0.304 ± 0.063	+0.118
+Core	0.448 ± 0.106	+0.087	0.211 ± 0.056	+0.025
+Both	0.452 ± 0.107	+0.091	0.219 ± 0.060	+0.033
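The source does not state how Recall and F1 are computed for generated answers. A common choice for QA is token-overlap scoring, sketched below under that assumption; the function name `token_recall_f1` and the example strings are hypothetical and not taken from the paper.

```python
from collections import Counter

def token_recall_f1(prediction, reference):
    """Token-overlap recall and F1 between a generated answer and a reference."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0, 0.0
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0, 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    f1 = 2 * precision * recall / (precision + recall)
    return recall, f1

# Hypothetical example: a long answer that contains the whole reference.
recall, f1 = token_recall_f1(
    "the lti is part of the school of computer science at cmu",
    "school of computer science",
)
print(recall, f1)
```

Under this kind of metric, a verbose answer that covers the reference keeps recall high while the extra tokens lower precision and therefore F1, so the two numbers can move in different directions.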

What To Try In 7 Days

Crawl your domain docs, filter noisy pages, and build a small KB.

Add off-the-shelf embeddings + a simple retriever and test RAG vs baseline on 100 domain QA samples.

Fine-tune embeddings on your QA pairs before touching generator finetuning.
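The first two steps above can be prototyped in a few lines. The following is a minimal sketch, not the authors' pipeline: `embed` is a toy bag-of-words stand-in for a real embedding model, the knowledge-base passages are invented, and `build_prompt` shows only one possible prompt layout.

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def retrieve(query, passages, k=2):
    """Return the k passages most similar to the query."""
    q = embed(query)
    ranked = sorted(passages, key=lambda p: cosine(q, embed(p)), reverse=True)
    return ranked[:k]

def build_prompt(query, contexts):
    """Place retrieved passages ahead of the question for the generator."""
    joined = "\n".join(f"- {c}" for c in contexts)
    return f"Answer using only the context below.\n{joined}\nQuestion: {query}"

# Invented knowledge-base passages, for illustration only.
kb = [
    "the admissions deadline for the example program is in december",
    "the campus shuttle runs every thirty minutes on weekdays",
    "faculty office hours are posted on each course page",
]
question = "when is the admissions deadline"
top = retrieve(question, kb, k=1)
prompt = build_prompt(question, top)
print(prompt)
```

To compare RAG against the baseline, send the bare question and the retrieval-augmented `prompt` to the same generator and score both answers on the same QA samples. Swapping `embed` for a real embedding model leaves the rest of the sketch unchanged in structure.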

Optimization Features

Model Optimization

INT4 quantization for LLaMA-2-7B

LoRA

Training Optimization

Embedding fine-tune with MultipleNegativesRankingLoss

LoRA
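MultipleNegativesRankingLoss is an in-batch-negatives objective: each question's paired passage is the positive, and the other passages in the batch act as negatives. The sketch below is a dependency-free illustration of that idea, not the library implementation and not the authors' training code; the function name, the toy 2-D vectors, and the `scale` value of 20 are assumptions for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors (assumed non-zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def in_batch_ranking_loss(question_vecs, passage_vecs, scale=20.0):
    """Mean cross-entropy where passage i is the positive for question i
    and every other passage in the batch serves as a negative."""
    losses = []
    for i, q in enumerate(question_vecs):
        logits = [scale * cosine(q, p) for p in passage_vecs]
        # Numerically stable log-sum-exp over all passages in the batch.
        m = max(logits)
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)

# Toy embeddings: in `aligned`, each question is closest to its own passage.
questions = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]
shuffled = [[0.1, 0.9], [0.9, 0.1]]
loss_aligned = in_batch_ranking_loss(questions, aligned)
loss_shuffled = in_batch_ranking_loss(questions, shuffled)
print(loss_aligned, loss_shuffled)
```

The loss is near zero when each question already ranks its own passage first and large when it ranks another passage higher, which is the signal that pulls matching question and passage embeddings together during fine-tuning.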

Reproducibility

Code Available: Yes

Data Available: No

Open Source Status: Partial

License: Unknown

Code URLs

GitHub (paper states code/models available; no URL provided in text)

Risks & Boundaries

Limitations

Training data is auto-annotated and moderately noisy (Cohen's Kappa = 0.67).

Generator finetuning used limited compute and a small 7B model; results may not scale.

When Not To Use

For open-domain or web-wide QA where a representative KB cannot be built.

When you lack a reasonably large, high-quality domain dataset for generator finetuning.

Failure Modes

Overfitting to small biased finetune data, reducing generation quality.

Repetitive or filler tokens injected from dataset formatting ("context:", "answer:").

Core Entities

Models

meta-llama/Llama-2-7b-chat-hf (LLaMA-2-7B)

mixedbread-ai/mxbai-embed-large-v1 (embedder)

BAAI/bge-reranker-large (BgeRerank)

WizardLM (annotation model)

GPT4All (annotation candidate)

Metrics

Recall

F1 Score

Cosine Similarity

BLEU

Datasets

Curated CMU / LTI crawl (HTML + PDF + papers)

Generated QA pairs (34,781 total; 27,824 train, 6,957 test)

Benchmarks

Local human-evaluated test set (random 128 QA samples per run)
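A repeated-sampling evaluation of this kind can be sketched as follows. The sample size of 128 and the 4 runs come from the source; everything else is an assumption, including the function name, the seed, the synthetic scores, and in particular whether the reported SD is taken across run-level means as done here.

```python
import random
import statistics

def evaluate_runs(test_scores, sample_size=128, num_runs=4, seed=0):
    """Draw a random subset per run and summarize the per-run mean scores."""
    rng = random.Random(seed)
    run_means = []
    for _ in range(num_runs):
        # Sample without replacement within a run; runs are drawn independently.
        subset = rng.sample(test_scores, sample_size)
        run_means.append(statistics.mean(subset))
    # Mean and sample standard deviation across the run-level means.
    return statistics.mean(run_means), statistics.stdev(run_means)

# Synthetic per-question scores standing in for a 6,957-item test split.
score_rng = random.Random(42)
scores = [score_rng.random() for _ in range(6957)]
mean_score, sd_score = evaluate_runs(scores)
print(mean_score, sd_score)
```

Fixing the seed makes the sampled subsets reproducible, which matters when comparing variants on the same questions.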

Context Entities

Models

WizardLM (used for automatic annotation)

GPT4All, LLaMA-2 (evaluated as annotators)

Metrics

Cohen's Kappa (annotation agreement)

Datasets

Semantic Scholar 2023 papers for LTI faculty

CMU website pages filtered with CMU/LTI keywords
