Hybrid RAG that cuts hallucinations and improves multi‑step reasoning via chunking, tables, tool math, and KG

August 9, 2024 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.4

Citation Count

2

Authors

Ye Yuan, Chengwu Liu, Jingyang Yuan, Gongbo Sun, Siqi Li, Ming Zhang

Links

Abstract / PDF

Why It Matters For Business

The pipeline trades some answer coverage for far fewer incorrect claims, a trade that is often preferable in QA products where wrong answers are costly.

Summary TLDR

This is a practical Retrieval-Augmented Generation (RAG) pipeline tuned for the CRAG benchmark. The authors add careful web-page chunking, table-to-Markdown extraction, an attribute predictor (static vs. dynamic), an LLM-based knowledge extractor (recitation), a Python-based numerical calculator, a KG query module, and a constrained reasoning prompt that refuses risky answers. On the CRAG public test set, the full system reduces hallucination from 72.1% to 13.9% and lifts the benchmark score from -46.6% (official RAG baseline) to 15.8%, while improving correct answers modestly (25.4% → 29.7%). Code is released.

Problem Statement

RAG systems can reduce LLM hallucinations and add time-sensitive knowledge, but off-the-shelf pipelines still fail on multi-hop, numerical, table-based and time-varying questions. The paper aims to assemble practical modules that boost retrieval quality, numeric correctness, and multi-step reasoning on the CRAG benchmark within contest constraints.

Main Contribution

A hybrid RAG pipeline that blends web retrieval, table extraction, KG queries, and LLM recitation as references.

Robust web-page processing: noise removal, sentence chunking, and dedicated table-to-Markdown extraction.

Attribute predictor (few-shot ICL) to label questions as static vs dynamic and force safe refusals for dynamic items.

Numerical calculator: LLM emits Python expressions that are executed externally to avoid numeric hallucination.

Constrained reasoning prompts (zero-shot CoT + format enforcement) that output stepwise reasoning and 'I don't know' or 'Invalid question' when appropriate.

Practical ablation sequence showing incremental gains from each module on CRAG Task 1/2.
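
The numerical calculator idea can be sketched roughly as follows. This is an illustration, not the authors' code: the summary describes executing LLM-emitted Python expressions, whereas this sketch swaps in a restricted AST evaluator that accepts only basic arithmetic, which is one way to reduce the code-execution risk. The name `safe_eval` and the operator whitelist are assumptions made for the example.

```python
import ast
import operator

# Whitelisted arithmetic operators (an assumption for this sketch);
# every other syntax element is rejected.
_BIN_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Mod: operator.mod,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def safe_eval(expression: str) -> float:
    """Evaluate a pure arithmetic expression without calling exec/eval."""

    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if (
            isinstance(node, ast.Constant)
            and isinstance(node.value, (int, float))
            and not isinstance(node.value, bool)
        ):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
            return _BIN_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](_eval(node.operand))
        # Names, calls, attribute access, etc. all end up here.
        raise ValueError(f"unsupported expression element: {type(node).__name__}")

    return _eval(ast.parse(expression, mode="eval"))


# Hypothetical expression that an LLM might emit for a numeric question.
result = safe_eval("(120 - 45) / 3")
```

An expression such as `__import__('os')` parses as a function call and is rejected with `ValueError` instead of being executed.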

Key Findings

Full system flipped the Task 1 public score from strongly negative to positive.

Numbers: Score: ours 15.8% vs. official RAG -46.6% (Table 1)

Hallucination rate dropped substantially after system changes.

Numbers: Hallucination: 72.1% → 13.9% (Table 1)

Correct-answer rate rose modestly with the final pipeline.

Numbers: Correct: 25.4% → 29.7% (Table 1)

Results

Score(%) on public Task 1

Value: 15.8%

Baseline: -46.6% (Official RAG)

Hallucination(%)

Value: 13.9%

Baseline: 72.1% (Official RAG)

Correct(%)

Value: 29.7%

Baseline: 25.4% (Official RAG)
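
The relationship between these three metrics can be checked with a short calculation. The sketch below assumes the usual CRAG scoring rule (+1 for a correct answer, 0 for a missing answer, -1 for a hallucinated answer) and that the three rates sum to 100%; neither assumption is stated in this summary, and the function name `crag_summary` is made up for the example.

```python
def crag_summary(correct: float, hallucination: float) -> dict:
    """Derive the missing rate and score from percentages.

    Assumes correct + missing + hallucination = 100 and
    score = correct - hallucination (both are assumptions).
    """
    missing = 100.0 - correct - hallucination
    score = correct - hallucination
    return {"missing": round(missing, 1), "score": round(score, 1)}


baseline = crag_summary(correct=25.4, hallucination=72.1)
full = crag_summary(correct=29.7, hallucination=13.9)

# full["score"] reproduces the reported 15.8.
# baseline["score"] comes out at -46.7 against the reported -46.6;
# the 0.1 gap is presumably rounding in the published percentages.
print(baseline, full)
```

Under these assumptions the implied missing rate rises from about 2.5% to about 56.4%, which is the coverage the pipeline gives up in exchange for fewer hallucinations.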

Who Should Care

What To Try In 7 Days

Add robust web cleaning + chunking (trafilatura + sentence segmentation).

Extract tables to Markdown and include them as separate references.

Add a static/dynamic classifier (ICL or SVM) and refuse dynamic questions with 'I don't know'; revisit this once live data is available.
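
As a starting point for the table step, here is a minimal sketch of converting an HTML table into a Markdown reference using only the standard library. It is not the authors' extractor: the class and function names are hypothetical, nested tables and `colspan`/`rowspan` are ignored, and the first row is simply treated as the header.

```python
from html.parser import HTMLParser


class _TableParser(HTMLParser):
    """Collects cell text row by row (nested tables are not handled)."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            text = " ".join("".join(self._cell).split())
            self._row.append(text.replace("|", "\\|"))  # keep pipes literal
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row, self._cell = None, None


def table_to_markdown(html: str) -> str:
    """Return a Markdown table, treating the first row as the header."""
    parser = _TableParser()
    parser.feed(html)
    if not parser.rows:
        return ""
    width = max(len(r) for r in parser.rows)
    rows = [r + [""] * (width - len(r)) for r in parser.rows]  # pad short rows
    lines = ["| " + " | ".join(rows[0]) + " |",
             "| " + " | ".join(["---"] * width) + " |"]
    lines += ["| " + " | ".join(r) + " |" for r in rows[1:]]
    return "\n".join(lines)


# Made-up example table.
sample = ("<table><tr><th>Year</th><th>Revenue</th></tr>"
          "<tr><td>2023</td><td>1.2B</td></tr></table>")
markdown = table_to_markdown(sample)
```

Each resulting Markdown string can then be stored as its own reference alongside the text chunks.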

Agent Features

Memory

  • retrieval memory (external corpus + KG)

Tool Use

  • Python numeric evaluator (exec/eval)
  • function-calling style for KG queries (attempted)

Frameworks

  • RAG pipeline
  • LLM prompting and constrained output

Architectures

  • retriever-generator pipeline
  • two-tower retrieval (embedding + cosine)
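
The embedding-plus-cosine retrieval step can be illustrated with a small sketch. The vectors below are hand-written toy values standing in for real sentence embeddings, and `cosine` and `top_k` are hypothetical helper names; in a real setup the query and the chunks would each be encoded by an embedding model first.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec, chunk_vecs, k=2):
    """Return indices of the k chunks most similar to the query."""
    scores = [cosine(query_vec, v) for v in chunk_vecs]
    ranked = sorted(range(len(chunk_vecs)), key=lambda i: -scores[i])
    return ranked[:k]


# Toy 2-d "embeddings" (assumed values, for illustration only).
query = [1.0, 0.0]
chunks = [[0.0, 1.0], [1.0, 0.1], [0.7, 0.7]]
best = top_k(query, chunks, k=2)
```

Because the query and the chunks are embedded independently, chunk vectors can be computed once per page and reused for the similarity ranking.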

Optimization Features

Token Efficiency

  • chunking and table truncation to limit context
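
Sentence-level chunking with a size cap might look like the sketch below. The regex-based sentence splitter, the function name `chunk_sentences`, and the `max_chars` budget are all assumptions for illustration; the summary does not specify the segmentation tool's settings or the chunk size used.

```python
import re


def chunk_sentences(text: str, max_chars: int = 200) -> list:
    """Greedily pack whole sentences into chunks of at most max_chars.

    A single sentence longer than max_chars is kept whole rather than cut.
    """
    # Naive split on ., ! or ? followed by whitespace (a simplification).
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)  # close the chunk before it overflows
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


pieces = chunk_sentences("A b c. D e f. G h i.", max_chars=14)
```

Keeping sentences intact means each chunk can be read on its own, at the cost of slightly uneven chunk lengths.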

Model Optimization

  • GPTQ quantized Llama3-70B for inference

System Optimization

  • backup summarization agent for parse failures

Inference Optimization

  • refuse dynamic queries to save time and avoid hallucination
  • use efficient two-tower embeddings and cosine similarity
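
The refusal of dynamic queries can be sketched as a simple gate placed in front of the generator. Everything here is illustrative: `keyword_predictor` is a crude keyword stand-in for the few-shot ICL or SVM attribute predictor, and `fake_generate` stands in for the actual RAG answer call.

```python
from typing import Callable

REFUSAL = "I don't know"


def answer_with_gate(
    question: str,
    predict_attribute: Callable[[str], str],
    generate: Callable[[str], str],
) -> str:
    """Refuse questions labelled 'dynamic'; otherwise call the generator."""
    if predict_attribute(question) == "dynamic":
        return REFUSAL  # skip retrieval and generation entirely
    return generate(question)


def keyword_predictor(question: str) -> str:
    """Hypothetical stand-in for the attribute predictor."""
    dynamic_cues = ("today", "current", "latest", "right now")
    lowered = question.lower()
    return "dynamic" if any(cue in lowered for cue in dynamic_cues) else "static"


calls = []


def fake_generate(question: str) -> str:
    """Hypothetical stand-in for the RAG answer step; records its calls."""
    calls.append(question)
    return "stub answer"


dynamic_reply = answer_with_gate("What is the latest stock price?",
                                 keyword_predictor, fake_generate)
static_reply = answer_with_gate("Who wrote Hamlet?",
                                keyword_predictor, fake_generate)
```

Since the gate returns before the generator is called, refused questions cost only the classification step.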

Reproducibility

Data Urls

  • CRAG public split (Meta CRAG KDD Cup 2024)

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Poor handling of time-varying (dynamic) questions by design; many are refused.
  • Knowledge-graph query module was under-optimized in submitted version.
  • Table handling is basic; large or noisy tables can still degrade results.
  • Security: executing LLM-generated Python expressions without sandboxing is risky.

When Not To Use

  • In latency- or cost-sensitive environments without heavy GPU resources.
  • For applications that require live, real-time answers to dynamic questions.
  • When safe sandboxed code execution is unavailable.

Failure Modes

  • False refusal: dynamic questions that could be answered safely may still receive 'I don't know'.
  • Malicious or unstable code from the LLM calculator can cause crashes if it is not sandboxed.
  • Over-reliance on LLM-recited knowledge that can hallucinate and mislead reasoning.
  • Parsing or format failures when constrained-output sampling is not enforced.

Core Entities

Models

  • sentence-t5-large
  • Llama3-70B-Instruct
  • Llama3-70B-GPTQ
  • all-MiniLM-L6-v2

Metrics

  • Correct(%)
  • Missing(%)
  • Hallucination(%)
  • Score(%)

Datasets

  • CRAG (Meta CRAG KDD Cup 2024 public split)

Benchmarks

  • CRAG benchmark