Hybrid RAG that cuts hallucinations and improves multi‑step reasoning via chunking, tables, tool math, and KG

August 9, 2024 · 7 min

Overview

Decision Snapshot: Needs Validation

The system uses practical, well-known components and careful prompting to reduce hallucinations; it is effective on CRAG but remains compute-heavy and needs safer execution of LLM-generated code.

Citations: 2

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 3/3

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 40%

Production readiness: 60%

Novelty: 60%

Authors

Ye Yuan, Chengwu Liu, Jingyang Yuan, Gongbo Sun, Siqi Li, Ming Zhang

Why It Matters For Business

A practical pipeline that trades some answer coverage for far fewer incorrect claims, which is often preferable in QA products where wrong answers are costly.

Who Should Care

Summary TLDR

This is a practical Retrieval-Augmented Generation (RAG) pipeline tuned for the CRAG benchmark. The authors add careful web-page chunking, table-to-Markdown extraction, an attribute predictor (static vs dynamic), an LLM-based knowledge extractor (recitation), a Python-based numerical calculator, a KG query module, and a constrained reasoning prompt that refuses risky answers. On the CRAG public test, the full system reduces hallucination from 72.1% to 13.9% and raises the benchmark score from -46.6% (official RAG baseline) to 15.8%, while improving correct answers modestly (25.4% → 29.7%). Code is released.

Problem Statement

RAG systems can reduce LLM hallucinations and add time-sensitive knowledge, but off-the-shelf pipelines still fail on multi-hop, numerical, table-based and time-varying questions. The paper aims to assemble practical modules that boost retrieval quality, numeric correctness, and multi-step reasoning on the CRAG benchmark within contest constraints.

Main Contribution

A hybrid RAG pipeline that blends web retrieval, table extraction, KG queries, and LLM recitation as references.

Robust web-page processing: noise removal, sentence chunking, and dedicated table-to-Markdown extraction.
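The table-to-Markdown step can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes simple, non-nested HTML tables and uses only the Python standard library, and the names `TableExtractor`, `table_to_markdown`, and `html_tables_to_markdown` are hypothetical.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collects the rows of every (non-nested) <table> in an HTML document."""

    def __init__(self):
        super().__init__()
        self.tables = []    # finished tables, each a list of rows
        self._rows = None   # rows of the table currently being parsed
        self._row = None    # cells of the row currently being parsed
        self._cell = None   # text fragments of the cell currently being parsed

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._rows = []
        elif tag == "tr" and self._rows is not None:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            # Collapse internal whitespace so each cell stays on one line.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self._rows.append(self._row)
            self._row = None
            self._cell = None
        elif tag == "table" and self._rows is not None:
            self.tables.append(self._rows)
            self._rows = None


def table_to_markdown(rows):
    """Render rows as a Markdown table, treating the first row as the header."""
    if not rows:
        return ""
    width = max(len(r) for r in rows)
    padded = [r + [""] * (width - len(r)) for r in rows]

    def fmt(cells):
        return "| " + " | ".join(c.replace("|", "\\|") for c in cells) + " |"

    lines = [fmt(padded[0]), "| " + " | ".join(["---"] * width) + " |"]
    lines += [fmt(r) for r in padded[1:]]
    return "\n".join(lines)


def html_tables_to_markdown(html):
    """Return one Markdown string per table found in the HTML."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return [table_to_markdown(t) for t in parser.tables if t]
```

Each returned string can then be passed to the generator as its own reference, separate from the text chunks of the same page.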

Key Findings

Full system flipped the Task 1 public score from strongly negative to positive.

Numbers: Score: Ours 15.8% vs Official RAG -46.6% (Table 1)

Practical Use: Combining the modules can change a failing RAG baseline into a useful system on CRAG-style QA.

Evidence Ref: Table 1

Hallucination rate dropped substantially after system changes.

Numbers: Hallucination: 72.1% → 13.9% (Table 1)

Practical Use: Use attribute prediction, refusal policy, and constrained reasoning to convert many wrong answers into safe 'I don't know' outputs.

Evidence Ref: Table 1
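The refusal policy behind that drop can be sketched as a gate in front of the generator. This is only an illustration under stated assumptions: the keyword list `DYNAMIC_CUES` is a hypothetical stand-in for the attribute predictor, and `generate` is a placeholder for whatever callable produces the answer.

```python
import re

# Hypothetical cue list standing in for a learned static/dynamic predictor.
DYNAMIC_CUES = re.compile(
    r"\b(today|now|current|currently|latest|this (week|month|year)|stock price)\b",
    re.IGNORECASE,
)

REFUSAL = "I don't know"


def predict_attribute(question: str) -> str:
    """Label a question 'dynamic' if it contains a time-sensitive cue, else 'static'."""
    return "dynamic" if DYNAMIC_CUES.search(question) else "static"


def answer_with_refusal(question: str, generate) -> str:
    """Refuse dynamic questions; otherwise return the generator's answer.

    `generate` is any callable mapping a question string to an answer string
    (an assumption of this sketch). Empty answers are also turned into refusals.
    """
    if predict_attribute(question) == "dynamic":
        return REFUSAL
    answer = generate(question).strip()
    return answer if answer else REFUSAL
```

The trade-off is visible in the gate itself: every question routed to `REFUSAL` counts as missing rather than hallucinated, which lowers coverage but removes a class of wrong answers.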

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| Score(%) on public Task 1 | 15.8% | -46.6% (Official RAG) | +62.4 pp | CRAG public test (Task 1) | Table 1 shows Official RAG -46.6% vs Ours 15.8% | Table 1 |
| Hallucination(%) | 13.9% | 72.1% (Official RAG) | -58.2 pp | CRAG public test (Task 1) | Table 1 numbers | Table 1 |
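The deltas above, and the relation between the metrics, can be checked with a few lines of arithmetic. The sketch assumes the score is the correct rate minus the hallucination rate, with missing answers scoring zero; that rule is an assumption here, although it reproduces the reported 15.8% exactly from the summary's figures (29.7% correct, 13.9% hallucination). The function names are hypothetical.

```python
def crag_score(correct_pct, hallucination_pct):
    # Assumed rule: +1 per correct answer, -1 per hallucination, 0 per missing answer.
    return round(correct_pct - hallucination_pct, 1)


def delta_pp(value_pct, baseline_pct):
    # Difference in percentage points between a result and its baseline.
    return round(value_pct - baseline_pct, 1)


ours = crag_score(29.7, 13.9)      # 15.8, matches the reported score
baseline = crag_score(25.4, 72.1)  # -46.7, versus the reported -46.6

print(ours, baseline)
print(delta_pp(15.8, -46.6))  # 62.4
print(delta_pp(13.9, 72.1))   # -58.2
```

The 0.1-point gap on the baseline would be consistent with rounding in the reported percentages, but the source does not say so.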

What To Try In 7 Days

Add robust web cleaning + chunking (trafilatura + sentence segmentation).

Extract tables to Markdown and include them as separate references.

Add a static/dynamic classifier (ICL or SVM) and refuse dynamic questions with 'I don't know'. Revisit the refusal rule once live data is available.
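For the chunking item in the list above, a minimal sketch of sentence-level chunking follows. It assumes the page text has already been cleaned (the source names trafilatura for that step; it is not called here so the example stays self-contained), and it uses a simple regex as a stand-in for a proper sentence segmenter. The names and the `max_chars` budget are hypothetical.

```python
import re

# Naive sentence boundary: whitespace that follows '.', '!' or '?'.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text):
    """Normalize whitespace, then split on the naive sentence boundary."""
    text = " ".join(text.split())
    return [s for s in _SENTENCE_END.split(text) if s]


def chunk_sentences(text, max_chars=200):
    """Greedily pack whole sentences into chunks of at most max_chars.

    A single sentence longer than max_chars becomes its own oversized chunk.
    """
    chunks, current = [], ""
    for sentence in split_sentences(text):
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


text = "RAG helps. It retrieves documents!  Does it always work? No."
print(chunk_sentences(text, max_chars=30))
# ['RAG helps.', 'It retrieves documents!', 'Does it always work? No.']
```

Keeping sentences whole means a retrieved chunk never starts or ends mid-claim, at the cost of uneven chunk lengths.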

Agent Features

Memory
retrieval memory (external corpus + KG)
Tool Use
Python numeric evaluator (exec/eval); function-calling style for KG queries (attempted)
Frameworks
RAG pipeline; LLM prompting and constrained output
Architectures
retriever-generator pipeline; two-tower retrieval (embedding + cosine)
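The two-tower retrieval step reduces to ranking chunk embeddings by cosine similarity against a query embedding. The sketch below shows only that ranking, with toy two-dimensional vectors in place of real embeddings; `cosine` and `top_k` are hypothetical names, and in practice the vectors would come from an embedding model.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k(query_vec, chunk_vecs, k=3):
    """Return (chunk_index, score) pairs for the k most similar chunks."""
    scored = [(cosine(query_vec, vec), idx) for idx, vec in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(idx, score) for score, idx in scored[:k]]


# Toy vectors standing in for query and chunk embeddings.
query = [1.0, 0.0]
chunks = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
print(top_k(query, chunks, k=2))  # chunk 1 ranks first, then chunk 2
```

Because query and chunks are embedded independently, chunk vectors can be computed once per page and reused across questions.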

Optimization Features

Token Efficiency
chunking and table truncation to limit context
Model Optimization
GPTQ quantized Llama3-70B for inference
System Optimization
backup summarization agent for parse failures
Inference Optimization
refuse dynamic queries to save time and avoid hallucination; use efficient two-tower embeddings and cosine similarity

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Data URLs

CRAG public split (Meta CRAG KDD Cup 2024)

Risks & Boundaries

Limitations

Poor handling of time-varying (dynamic) questions by design; many are refused.

Knowledge-graph query module was under-optimized in submitted version.

When Not To Use

In latency- or cost-sensitive environments without heavy GPU resources.

For applications that require live, real-time answers to dynamic questions.

Failure Modes

False refusal: dynamic questions that could be answered safely are needlessly answered with 'I don't know'.

Malicious or unstable LLM-generated calculator code can crash the system if it is not sandboxed.
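One way to narrow this failure mode, when the calculator only needs arithmetic, is to evaluate a parsed expression tree instead of calling exec/eval on model output. The sketch below is a stricter alternative, not the approach described in the source; the names `safe_eval` and `calculate` and the length and exponent limits are hypothetical. Arbitrary generated code would still need real isolation, such as a separate process with timeouts.

```python
import ast
import operator

# Whitelisted arithmetic operators; anything else is rejected.
_BINARY = {
    ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul,
    ast.Div: operator.truediv, ast.Pow: operator.pow, ast.Mod: operator.mod,
}
_UNARY = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def _eval(node):
    """Recursively evaluate a node, allowing only numbers and whitelisted operators."""
    if isinstance(node, ast.Constant) and type(node.value) in (int, float):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _BINARY:
        left, right = _eval(node.left), _eval(node.right)
        if isinstance(node.op, ast.Pow) and abs(right) > 100:
            raise ValueError("exponent too large")
        return _BINARY[type(node.op)](left, right)
    if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY:
        return _UNARY[type(node.op)](_eval(node.operand))
    raise ValueError(f"unsupported expression element: {type(node).__name__}")


def safe_eval(expression: str, max_len: int = 200):
    """Evaluate a pure arithmetic expression without executing arbitrary code."""
    if len(expression) > max_len:
        raise ValueError("expression too long")
    tree = ast.parse(expression, mode="eval")
    return _eval(tree.body)


def calculate(expression: str):
    """Return the numeric result, or None so the caller can fall back to a refusal."""
    try:
        return safe_eval(expression)
    except (ValueError, SyntaxError, ZeroDivisionError, OverflowError):
        return None


print(calculate("2 * (3 + 4) ** 2"))               # 98
print(calculate("__import__('os').system('ls')"))  # None: function calls are rejected
```

Returning `None` on any failure lets the pipeline treat a broken calculation like any other unanswerable question.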

Core Entities

Models

sentence-t5-large, Llama3-70B-Instruct, Llama3-70B-GPTQ, all-MiniLM-L6-v2

Metrics

Correct(%), Missing(%), Hallucination(%), Score(%)

Datasets

CRAG (Meta CRAG KDD Cup 2024 public split)

Benchmarks

CRAG benchmark