Using a targeted RAG pipeline and curated CMU dataset to reduce LLM hallucinations on domain queries

March 15, 2024 · 7 min

Overview

Production Readiness

0.3

Novelty Score

0.4

Cost Impact Score

0.5

Citation Count

19

Authors

Jiarui Li, Ye Yuan, Zehua Zhang

Links

Abstract / PDF

Why It Matters For Business

Connecting an LLM to a curated domain knowledge base (RAG) gives measurable factual gains and is a practical first step before costly generator finetuning.

Summary TLDR

The authors build a complete Retrieval-Augmented Generation (RAG) QA system over a curated CMU/LTI knowledge base to reduce hallucinations on domain-specific and time-sensitive queries. They crawl CMU sites, generate 34,781 QA pairs with an LLM annotator, and evaluate five variants: baseline LLM, RAG, RAG with embedding fine-tuning, RAG with generator fine-tuning, and RAG with both. RAG improves recall and F1 over the baseline; embedding fine-tuning adds further gains; generator fine-tuning raises recall but can hurt F1 when the fine-tuning data is small and biased. The paper states that code and models are available on GitHub.

Problem Statement

Off-the-shelf LLMs often hallucinate on domain-specific or time-sensitive questions. The paper asks whether adding a curated, private knowledge base plus a RAG pipeline and targeted fine-tuning improves factual accuracy for CMU/LTI queries.

Main Contribution

A complete RAG QA system built over a crawled CMU / LTI knowledge base, including crawler, storage, retriever, reranker, and generator.

A large automatic annotation process producing 34,781 QA pairs (27,824 train, 6,957 test, roughly an 80/20 split) using WizardLM as the annotator, with an agreement check yielding Cohen's Kappa = 0.67.

An ablation evaluation showing RAG and embedding finetuning improve factual metrics, while generator finetuning on a small biased dataset can hurt answer quality.

Key Findings

Adding RAG boosts retrieval and answer quality over the baseline LLM.

Numbers: Recall 0.361 -> 0.409; F1 0.186 -> 0.289

Fine-tuning the embedding model yields further gains.

Numbers: Recall 0.409 -> 0.437; F1 0.289 -> 0.304

Fine-tuning the generator raises recall slightly but lowers F1, and in some cases fluency, relative to RAG alone.

Numbers: Generator fine-tune (+Core): Recall 0.448, F1 0.211; Combined (+Both): Recall 0.452, F1 0.219

The curated dataset size and annotation quality are moderate.

Numbers: 34,781 QA pairs; Cohen's Kappa = 0.67
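
Cohen's Kappa corrects raw agreement between two annotators for the agreement expected by chance. The following is a minimal sketch of the computation, assuming each annotator assigns one categorical label per item. The summary does not say which annotators were compared or which labels were used, so the `ok`/`bad` judgments below are invented purely for illustration.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items on which both annotators agree.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1.0 - p_e)

# Hypothetical judgments (not data from the paper).
annotator_1 = ["ok", "ok", "ok", "bad", "bad", "ok", "bad", "ok", "ok", "bad"]
annotator_2 = ["ok", "ok", "bad", "bad", "bad", "ok", "ok", "ok", "ok", "bad"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.3f}")
```

Here the annotators agree on 8 of 10 items, yet kappa is well below 0.8 because about half of that agreement would be expected by chance. Values in the 0.6 to 0.8 range are conventionally read as substantial but imperfect agreement.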

Results

Recall

Value: Baseline 0.361 (±0.069); Raw RAG 0.409 (±0.081); +Emb 0.437 (±0.076); +Core 0.448 (±0.106); +Both 0.452 (±0.107)

Baseline: 0.361

F1 Score

Value: Baseline 0.186 (±0.032); Raw RAG 0.289 (±0.065); +Emb 0.304 (±0.063); +Core 0.211 (±0.056); +Both 0.219 (±0.060)

Baseline: 0.186

BLEU

Value: Baseline 0.043; Raw RAG 0.102; +Emb 0.108; +Core 0.056; +Both 0.060

Baseline: 0.043
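
The summary reports Recall and F1 without defining them. A common choice for QA evaluation is token overlap between the generated and the reference answer, and the sketch below assumes that definition; the paper's exact scoring may differ. The tokenizer and the example strings are invented for illustration. It shows one plausible way recall can rise while F1 falls: a verbose answer covers every reference token but pays for it in precision.

```python
from collections import Counter

def tokenize(text):
    # Simplistic lowercase whitespace tokenizer (an assumption of this sketch).
    return text.lower().split()

def token_scores(prediction, reference):
    """Return (precision, recall, f1) from token overlap with multiplicity."""
    pred, ref = tokenize(prediction), tokenize(reference)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Invented example answers, not taken from the paper's test set.
reference = "the deadline is december 15"
focused = "deadline is december 15"
verbose = "answer: the application deadline for the program is december 15 each year"

for name, answer in [("focused", focused), ("verbose", verbose)]:
    p, r, f = token_scores(answer, reference)
    print(f"{name}: precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```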

Who Should Care

What To Try In 7 Days

Crawl your domain docs, filter noisy pages, and build a small KB.

Add off-the-shelf embeddings + a simple retriever and test RAG vs baseline on 100 domain QA samples.

Fine-tune embeddings on your QA pairs before touching generator finetuning.
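
The first two steps can be prototyped before committing to any model. The following is a minimal, dependency-free sketch: `embed` is a bag-of-words stand-in for a real embedding model (such as the embedder listed under Core Entities), and the knowledge-base chunks, function names, and prompt layout are all hypothetical.

```python
import math
from collections import Counter

def embed(text):
    # Stand-in "embedding": bag-of-words counts. A real pipeline would call
    # an embedding model here; this keeps the sketch free of dependencies.
    cleaned = text.lower().replace("?", " ").replace(".", " ")
    return Counter(cleaned.split())

def cosine(u, v):
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def retrieve(query, chunks, k=2):
    """Return the k chunks most similar to the query, best first."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]

def build_prompt(query, contexts):
    # Hypothetical prompt layout; adapt it to your generator's chat template.
    joined = "\n".join(f"- {c}" for c in contexts)
    return f"Use only the context to answer.\nContext:\n{joined}\nQuestion: {query}"

# Invented knowledge-base chunks, for illustration only.
kb = [
    "The cafeteria opens at 8 am on weekdays.",
    "Course registration for the spring term opens in November.",
    "The library closes at midnight during exam weeks.",
]
question = "When does course registration open?"
top = retrieve(question, kb, k=1)
prompt = build_prompt(question, top)
print(prompt)
```

To compare RAG against the baseline, send the bare question and the built prompt to the same generator and score both answers on the same QA samples.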

Optimization Features

Model Optimization

  • INT4 quantization for LLaMA-2-7B
  • LoRA
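
To give a sense of why LoRA is cheap, the arithmetic below counts trainable adapter parameters. The hidden size (4096) and layer count (32) are the published LLaMA-2-7B dimensions; the rank and the choice of adapted projections are assumptions of this sketch, since the summary does not state the settings the authors used.

```python
def lora_trainable_params(d_out, d_in, rank):
    # LoRA learns B (d_out x rank) and A (rank x d_in); the base weight stays frozen.
    return rank * (d_out + d_in)

HIDDEN = 4096          # LLaMA-2-7B hidden size
LAYERS = 32            # LLaMA-2-7B transformer layers
RANK = 8               # assumed rank; not stated in the summary
ADAPTED_PER_LAYER = 2  # assumed: query and value projections only

per_matrix = lora_trainable_params(HIDDEN, HIDDEN, RANK)
total = per_matrix * ADAPTED_PER_LAYER * LAYERS
full_matrix = HIDDEN * HIDDEN

print(f"adapter params per projection: {per_matrix:,}")
print(f"frozen params per projection:  {full_matrix:,}")
print(f"total trainable params:        {total:,}")
```

Under these assumptions each adapter trains 1/256 of the parameters in the projection it modifies, which is what makes fine-tuning a 7B model feasible on limited compute.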

Training Optimization

  • Embedding fine-tune with MultipleNegativesRankingLoss
  • LoRA
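
MultipleNegativesRankingLoss trains an embedder from (question, passage) pairs alone: within a batch, each question's paired passage is the positive and every other passage serves as a negative. The sketch below re-implements the idea in plain Python to show the mechanics; it is not the library's code, the 2-D vectors are toy values, and the similarity scale of 20 is an assumed setting.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def multiple_negatives_ranking_loss(queries, passages, scale=20.0):
    """Mean cross-entropy where passages[i] is the positive for queries[i]
    and all other passages in the batch act as negatives."""
    losses = []
    for i, q in enumerate(queries):
        logits = [scale * cosine(q, p) for p in passages]
        m = max(logits)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)

# Toy 2-D "embeddings", invented for illustration.
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # each query is closest to its own passage
swapped = [[0.1, 0.9], [0.9, 0.1]]   # each query is closest to the wrong passage

loss_aligned = multiple_negatives_ranking_loss(queries, aligned)
loss_swapped = multiple_negatives_ranking_loss(queries, swapped)
print(f"aligned: {loss_aligned:.4f}  swapped: {loss_swapped:.4f}")
```

The loss is near zero when each question already ranks its own passage first and large when it ranks another passage higher, so minimizing it pulls matching pairs together. Larger batches supply more negatives per question without extra labeling.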

Reproducibility

Code Urls

  • GitHub (paper states code/models available; no URL provided in text)

Code Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Training data is auto-annotated and moderately noisy (Cohen's Kappa = 0.67).
  • Generator finetuning used limited compute and a small 7B model; results may not generalize to larger models.
  • Dataset is domain-specific and biased toward CMU/LTI content.
  • Fine-tuned generator produced verbose or templated outputs in examples.

When Not To Use

  • For open-domain or web-wide QA where a representative KB cannot be built.
  • When you lack a reasonably large, high-quality domain dataset for generator finetuning.
  • When strict brevity or fluency matters more than recall, unless the outputs are validated further.

Failure Modes

  • Overfitting to small biased finetune data, reducing generation quality.
  • Repetitive or filler tokens injected from dataset formatting ("context:", "answer:").
  • Remaining hallucinations when retriever returns irrelevant chunks.

Core Entities

Models

  • meta-llama/Llama-2-7b-chat-hf (LLaMA-2-7B)
  • mixedbread-ai/mxbai-embed-large-v1 (embedder)
  • BAAI/bge-reranker-large (BgeRerank)
  • WizardLM (annotation model)
  • GPT4All (annotation candidate)

Metrics

  • Recall
  • F1 Score
  • Cosine Similarity
  • BLEU

Datasets

  • Curated CMU / LTI crawl (html + pdf + papers)
  • Generated QA pairs (34,781 total; 27,824 train, 6,957 test)

Benchmarks

  • Local human-evaluated test set (random 128 QA samples per run)

Context Entities

Models

  • WizardLM (used for automatic annotation)
  • GPT4All, LLaMA-2 (evaluated as annotators)

Metrics

  • Cohen's Kappa (annotation agreement)

Datasets

  • Semantic Scholar 2023 papers for LTI faculty
  • CMU website pages filtered with CMU/LTI keywords