Hybrid RAG that cuts hallucinations and improves multi‑step reasoning via chunking, tables, tool math, and KG

Overview

Decision Snapshot: Needs Validation

The system uses practical, well-known components and careful prompting to reduce hallucinations; it is effective on CRAG but remains compute-heavy and needs safer execution of LLM-generated code.

Citations: 2

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 3/3

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 40%

Production readiness: 60%

Novelty: 60%

Authors

Ye Yuan, Chengwu Liu, Jingyang Yuan, Gongbo Sun, Siqi Li, Ming Zhang

Why It Matters For Business

A practical pipeline that trades some answer coverage for far fewer incorrect claims, which is often preferable in QA products where wrong answers are costly.

Who Should Care

CTO, Product Manager, ML Engineer, Data Scientist

Summary TLDR

This is a practical Retrieval-Augmented Generation (RAG) pipeline tuned for the CRAG benchmark. The authors add careful web-page chunking, table-to-Markdown extraction, an attribute predictor (static vs. dynamic), an LLM-based knowledge extractor (recitation), a Python-based numerical calculator, a KG query module, and a constrained reasoning prompt that refuses risky answers. On the CRAG public test, the full system reduces the hallucination rate from 72.1% to 13.9% and moves the benchmark score from -46.6% (official RAG baseline) to 15.8%, while raising the share of correct answers modestly (25.4% → 29.7%). Code is released.

Problem Statement

RAG systems can reduce LLM hallucinations and add time-sensitive knowledge, but off-the-shelf pipelines still fail on multi-hop, numerical, table-based and time-varying questions. The paper aims to assemble practical modules that boost retrieval quality, numeric correctness, and multi-step reasoning on the CRAG benchmark within contest constraints.

Main Contribution

A hybrid RAG pipeline that blends web retrieval, table extraction, KG queries, and LLM recitation as references.

Robust web-page processing: noise removal, sentence chunking, and dedicated table-to-Markdown extraction.
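The sentence-chunking step can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the regex splitter, the function names `split_sentences` and `chunk_sentences`, and the `max_chars` budget are all assumptions, and the input is taken to be page text that has already been cleaned by an extractor such as trafilatura.

```python
import re

def split_sentences(text: str) -> list[str]:
    """Naive splitter (assumption): break after ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def chunk_sentences(text: str, max_chars: int = 200) -> list[str]:
    """Pack whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept as its own chunk.
    """
    chunks, current = [], ""
    for sentence in split_sentences(text):
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would exceed the budget: close the chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

# Toy input standing in for cleaned page text.
page_text = "Noise was removed. Sentences were kept. Tables are handled separately!"
chunks = chunk_sentences(page_text, max_chars=45)
```

Packing whole sentences keeps each chunk readable on its own, which matters when chunks are later ranked and pasted into the prompt as references.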

Key Findings

Full system flipped the Task 1 public score from strongly negative to positive.

Numbers: Score: ours 15.8% vs. Official RAG -46.6% (Table 1)

Practical Use: Combining the modules can change a failing RAG baseline into a useful system on CRAG-style QA.

Evidence Ref: Table 1

Hallucination rate dropped substantially after system changes.

Numbers: Hallucination: 72.1% → 13.9% (Table 1)

Practical Use: Use attribute prediction, refusal policy, and constrained reasoning to convert many wrong answers into safe 'I don't know' outputs.

Evidence Ref: Table 1

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Score(%) on public Task 1	15.8%	-46.6% (Official RAG)	+62.4 pp	CRAG public test (Task 1)	Table 1 shows Official RAG -46.6% vs Our 15.8%	Table 1
Hallucination(%)	13.9%	72.1% (Official RAG)	-58.2 pp	CRAG public test (Task 1)	Table 1 numbers	Table 1
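
The reported figures can be cross-checked with a little arithmetic. The sketch below assumes the usual CRAG scoring rule (+1 per correct answer, 0 per missing answer, -1 per hallucinated answer), so that Score = Correct − Hallucination in percentage points; the function name `crag_score` is made up for this example. The Correct(%) values (25.4% baseline, 29.7% full system) come from the summary above. Under this assumption the baseline works out to -46.7 rather than the reported -46.6, a gap that is consistent with rounding of the published percentages.

```python
def crag_score(correct_pct: float, hallucination_pct: float) -> float:
    """Score in percentage points, assuming +1 correct, 0 missing, -1 hallucination."""
    return round(correct_pct - hallucination_pct, 1)

# Recompute the scores from the reported Correct(%) and Hallucination(%) values.
baseline = crag_score(25.4, 72.1)      # Official RAG
full_system = crag_score(29.7, 13.9)   # full pipeline

# Recompute the deltas listed in the Results table.
score_delta = round(15.8 - (-46.6), 1)
hallucination_delta = round(13.9 - 72.1, 1)
```

Because a hallucination costs as much as a correct answer earns, cutting wrong answers moves the score far more than the modest gain in correct answers does.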

What To Try In 7 Days

Add robust web cleaning + chunking (trafilatura + sentence segmentation).

Extract tables to Markdown and include them as separate references.

Add a static/dynamic classifier (ICL or SVM) and refuse dynamic questions with 'I don't know'. Revisit this policy later, when live data is available.
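
The refusal step in the last item can be prototyped as a thin gate in front of the answerer. The sketch below is only an illustration: `predict_attribute` is a keyword-based stand-in for the ICL or SVM predictor, and `DYNAMIC_CUES`, `answer_or_refuse`, and `fake_rag` are hypothetical names that do not come from the paper.

```python
import re
from typing import Callable

# Hypothetical cue words standing in for a trained static/dynamic predictor.
DYNAMIC_CUES = {"today", "current", "currently", "latest", "now", "tonight"}

def predict_attribute(question: str) -> str:
    """Toy stand-in for the ICL/SVM predictor: returns 'dynamic' or 'static'."""
    words = set(re.findall(r"[a-z]+", question.lower()))
    return "dynamic" if words & DYNAMIC_CUES else "static"

def answer_or_refuse(question: str, answer_fn: Callable[[str], str]) -> str:
    """Refuse dynamic questions; otherwise delegate to the RAG answerer."""
    if predict_attribute(question) == "dynamic":
        return "I don't know"
    return answer_fn(question)

def fake_rag(question: str) -> str:
    # Placeholder for the real retrieval + generation call.
    return f"answer to: {question}"

static_reply = answer_or_refuse("Who directed Inception?", fake_rag)
dynamic_reply = answer_or_refuse("What is the latest Tesla stock price?", fake_rag)
```

Keeping the predictor behind a single function makes it easy to swap the keyword rule for a real classifier, or to relax the refusal once a live data source is connected.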

Agent Features

Memory

retrieval memory (external corpus + KG)

Tool Use

Python numeric evaluator (exec/eval)

function-calling style for KG queries (attempted)

Frameworks

RAG pipeline

LLM prompting and constrained output

Architectures

retriever-generator pipeline

two-tower retrieval (embedding + cosine)
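
In a two-tower setup the query and the chunks are embedded separately and then ranked by cosine similarity. The sketch below shows only the ranking half, using toy 3-dimensional vectors in place of real encoder output; the function names and vectors are illustrative assumptions, and in practice the embeddings would come from an encoder such as those listed under Models.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec: list[float], chunk_vecs: list[list[float]], k: int = 2) -> list[int]:
    """Return the indices of the k chunks most similar to the query."""
    order = sorted(
        range(len(chunk_vecs)),
        key=lambda i: cosine(query_vec, chunk_vecs[i]),
        reverse=True,
    )
    return order[:k]

# Toy vectors standing in for encoder output.
query = [1.0, 0.0, 0.0]
chunk_vectors = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.5, 0.5, 0.0]]
ranked = top_k(query, chunk_vectors, k=2)
```

Since chunk embeddings do not depend on the query, they can be computed once per page and reused, which is what makes this style of retrieval cheap at inference time.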

Optimization Features

Token Efficiency

chunking and table truncation to limit context
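
Table extraction with truncation can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the authors' code: it handles a fragment containing a single simple table, treats the first row as the header, and the names `TableParser`, `table_to_markdown`, and `max_rows` are hypothetical.

```python
from html.parser import HTMLParser

class TableParser(HTMLParser):
    """Collect the cell text of every <tr> in an HTML fragment, row by row."""

    def __init__(self):
        super().__init__()
        self.rows, self._row, self._cell = [], None, None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a cell is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse internal whitespace so each cell stays on one line.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row, self._cell = None, None

def table_to_markdown(html: str, max_rows: int = 3) -> str:
    """Render the header row plus at most max_rows body rows as Markdown."""
    parser = TableParser()
    parser.feed(html)
    if not parser.rows:
        return ""
    header, body = parser.rows[0], parser.rows[1 : 1 + max_rows]
    lines = ["| " + " | ".join(header) + " |", "|" + " --- |" * len(header)]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)

sample = (
    "<table><tr><th>Year</th><th>Revenue</th></tr>"
    "<tr><td>2021</td><td>10</td></tr>"
    "<tr><td>2022</td><td>12</td></tr>"
    "<tr><td>2023</td><td>15</td></tr></table>"
)
markdown = table_to_markdown(sample, max_rows=2)
```

Capping the number of body rows bounds the tokens a large table can consume, at the cost of dropping rows that might hold the answer.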

Model Optimization

GPTQ quantized Llama3-70B for inference

System Optimization

backup summarization agent for parse failures

Inference Optimization

refuse dynamic queries to save time and avoid hallucination

use efficient two-tower embeddings and cosine similarity

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://gitlab.aicrowd.com/shizueyy/crag-new

Data URLs

CRAG public split (Meta CRAG KDD Cup 2024)

Risks & Boundaries

Limitations

Poor handling of time-varying (dynamic) questions by design; many are refused.

The knowledge-graph query module was under-optimized in the submitted version.

When Not To Use

In latency- or cost-sensitive environments without heavy GPU resources.

For applications that require live, real-time answers to dynamic questions.

Failure Modes

False refusal: safe dynamic questions may be needlessly answered with 'I don't know'.

Malicious or unstable code generated by the LLM calculator can cause crashes if it is not sandboxed.
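
One way to narrow the code-execution risk is to avoid exec/eval for plain arithmetic and walk a whitelisted AST instead. The sketch below is a swapped-in technique, not the paper's calculator, and it is not a full sandbox: it accepts only numeric literals and the four basic operators, and the names `safe_eval`, `ALLOWED_BINOPS`, and `ALLOWED_UNARYOPS` are hypothetical.

```python
import ast
import operator

# Whitelisted operators (assumption: basic arithmetic is enough for the calculator).
ALLOWED_BINOPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
ALLOWED_UNARYOPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}

def safe_eval(expression: str) -> float:
    """Evaluate a pure arithmetic expression; raise ValueError on anything else."""

    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and type(node.value) in (int, float):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in ALLOWED_BINOPS:
            return ALLOWED_BINOPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in ALLOWED_UNARYOPS:
            return ALLOWED_UNARYOPS[type(node.op)](walk(node.operand))
        # Names, calls, attributes, and everything else are rejected.
        raise ValueError(f"disallowed expression element: {type(node).__name__}")

    try:
        tree = ast.parse(expression, mode="eval")
    except SyntaxError as exc:
        raise ValueError("not a valid expression") from exc
    return float(walk(tree))

result = safe_eval("(72 - 14) * 2 / 4")

try:
    safe_eval("__import__('os').system('ls')")
    rejected = False
except ValueError:
    rejected = True
```

Division by zero still raises ZeroDivisionError here, so callers would need to catch it. LLM output that needs loops, variables, or library calls falls outside this whitelist and would still require a real sandbox with time and memory limits.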

Core Entities

Models

sentence-t5-large

Llama3-70B-Instruct

Llama3-70B-GPTQ

all-MiniLM-L6-v2

Metrics

Correct(%)

Missing(%)

Hallucination(%)

Score(%)

Datasets

CRAG (Meta CRAG KDD Cup 2024 public split)

Benchmarks

CRAG benchmark

