Overview
The system uses practical, well-known components and careful prompting to reduce hallucinations; it is effective on CRAG but remains compute-heavy and needs safer execution of LLM-generated code.
Citations: 2
Evidence Strength: 0.70
Confidence: 0.80
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 3/3
Findings with evidence refs: 3/3
Results with explicit delta: 3/3
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 40%
Production readiness: 60%
Novelty: 60%
Why It Matters For Business
A practical pipeline that trades some answer coverage for far fewer incorrect claims, which is often preferable in QA products where wrong answers are costly.
Summary TLDR
This is a practical Retrieval-Augmented Generation (RAG) pipeline tuned for the CRAG benchmark. The authors add careful web-page chunking, table-to-Markdown extraction, an attribute predictor (static vs dynamic), an LLM-based knowledge extractor (recitation), a Python-based numerical calculator, a KG query module, and a constrained reasoning prompt that refuses risky answers. On the CRAG public test their full system reduces hallucination from 72.1% to 13.9% and shifts the benchmark score from -46.6% (official RAG baseline) to 15.8%, while improving correct answers modestly (25.4% → 29.7%). Code is released.
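Of the components listed above, table-to-Markdown extraction is simple to illustrate. The following is a minimal sketch using only Python's standard library, not the authors' implementation; the class name `TableToMarkdown` and the function `html_table_to_markdown` are hypothetical, and nested tables, `rowspan`, and `colspan` are deliberately ignored.

```python
from html.parser import HTMLParser


class TableToMarkdown(HTMLParser):
    """Collect the cell text of the first <table> in an HTML snippet."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # row under construction, None outside <tr>
        self._cell = None   # text fragments of the open cell, None outside a cell
        self._done = False  # set once the first </table> has been seen

    def handle_starttag(self, tag, attrs):
        if self._done:
            return
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_endtag(self, tag):
        if self._done:
            return
        if tag in ("td", "th") and self._cell is not None:
            # Collapse whitespace and escape pipes so the cell stays on one line.
            text = " ".join("".join(self._cell).split())
            self._row.append(text.replace("|", "\\|"))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None
        elif tag == "table":
            self._done = True

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)


def html_table_to_markdown(html):
    """Return the first HTML table as a Markdown table, or "" if none is found."""
    parser = TableToMarkdown()
    parser.feed(html)
    if not parser.rows:
        return ""
    width = max(len(row) for row in parser.rows)
    # Pad short rows so every line has the same number of columns.
    rows = [row + [""] * (width - len(row)) for row in parser.rows]
    lines = ["| " + " | ".join(rows[0]) + " |", "|" + "---|" * width]
    lines += ["| " + " | ".join(row) + " |" for row in rows[1:]]
    return "\n".join(lines)
```

The first row is treated as the header, which is an assumption that will not hold for every web table. The resulting Markdown string can then be passed to the model as a separate reference.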
Problem Statement
RAG systems can reduce LLM hallucinations and add time-sensitive knowledge, but off-the-shelf pipelines still fail on multi-hop, numerical, table-based and time-varying questions. The paper aims to assemble practical modules that boost retrieval quality, numeric correctness, and multi-step reasoning on the CRAG benchmark within contest constraints.
Main Contribution
A hybrid RAG pipeline that blends web retrieval, table extraction, KG queries, and LLM recitation as references.
Robust web-page processing: noise removal, sentence chunking, and dedicated table-to-Markdown extraction.
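The sentence-chunking step mentioned above can be sketched as follows. This is a rough illustration, not the authors' code: it assumes the page text has already been cleaned of boilerplate, it swaps in a naive regular-expression splitter where a real sentence segmenter would normally be used, and the names `split_sentences`, `chunk_sentences`, and `max_chars` are hypothetical.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def chunk_sentences(sentences, max_chars=200):
    """Greedily pack whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept whole as its own chunk.
    """
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow: close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Packing whole sentences rather than cutting at a fixed character offset keeps each retrieved chunk readable on its own, at the cost of uneven chunk lengths.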
Key Findings
The full system moved the Task 1 public score from -46.6% (official RAG baseline) to 15.8%.
Hallucination rate dropped from 72.1% to 13.9% on the CRAG public test.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Score(%) on public Task 1 | 15.8% | -46.6% (Official RAG) | +62.4 pp | CRAG public test (Task 1) | Table 1 shows Official RAG -46.6% vs Our 15.8% | Table 1 |
| Hallucination(%) | 13.9% | 72.1% (Official RAG) | -58.2 pp | CRAG public test (Task 1) | Table 1 numbers | Table 1 |
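The figures in the table can be related to each other with a short calculation. The sketch below assumes the score is the percentage of correct answers minus the percentage of hallucinated answers, with refusals counting as zero; this summary does not state the formula, but the reported numbers are consistent with it. It also assumes that the 25.4% correct-answer figure belongs to the official RAG baseline. The function names are hypothetical.

```python
def crag_style_score(correct_pct, hallucination_pct):
    """Assumed convention: score = correct% - hallucinated%; refusals count as 0."""
    return round(correct_pct - hallucination_pct, 1)


def missing_pct(correct_pct, hallucination_pct):
    """Share of questions that were neither correct nor hallucinated (refused)."""
    return round(100.0 - correct_pct - hallucination_pct, 1)


# Full system: 29.7% correct, 13.9% hallucinated.
full_score = crag_style_score(29.7, 13.9)      # 15.8
full_missing = missing_pct(29.7, 13.9)         # 56.4

# Baseline (assumed pairing): 25.4% correct, 72.1% hallucinated.
base_score = crag_style_score(25.4, 72.1)      # -46.7
base_missing = missing_pct(25.4, 72.1)         # 2.5
```

Under these assumptions the full system's score matches the reported 15.8%, and the baseline comes out at -46.7 against the reported -46.6%, a gap consistent with rounding in the published percentages. The implied refusal share rises from about 2.5% to about 56.4%, which is the coverage the pipeline gives up in exchange for fewer incorrect answers.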
What To Try In 7 Days
Add robust web cleaning + chunking (trafilatura + sentence segmentation).
Extract tables to Markdown and include them as separate references.
Add a static/dynamic classifier (ICL or SVM) and refuse dynamic questions with 'I don't know'; revisit this once live data is available.
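The refusal step in the last item can be prototyped before any classifier is trained. The sketch below is an assumption-laden placeholder: it swaps the ICL or SVM classifier for a plain keyword check, and `DYNAMIC_HINTS`, `keyword_is_dynamic`, and `answer_with_refusal` are hypothetical names. The `is_dynamic` parameter is where a real classifier would be plugged in.

```python
# Hypothetical cue list; a trained classifier would replace this heuristic.
DYNAMIC_HINTS = ("today", "current", "latest", "right now", "this week", "stock price")


def keyword_is_dynamic(question):
    """Toy stand-in for the classifier: flags time-sensitive wording."""
    lowered = question.lower()
    return any(hint in lowered for hint in DYNAMIC_HINTS)


def answer_with_refusal(question, answer_fn, is_dynamic=keyword_is_dynamic):
    """Refuse questions judged dynamic; otherwise delegate to answer_fn."""
    if is_dynamic(question):
        return "I don't know"
    return answer_fn(question)
```

For example, `answer_with_refusal("What is the current price of gold?", my_rag)` returns the refusal without calling `my_rag` (any question-answering callable), while a static question such as "Who wrote Hamlet?" is passed through. A keyword check will misfire in both directions, so it is only useful for wiring up the control flow.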
Risks & Boundaries
Limitations
Poor handling of time-varying (dynamic) questions by design; many are refused.
Knowledge-graph query module was under-optimized in submitted version.
When Not To Use
In latency- or cost-sensitive environments without heavy GPU resources.
For applications that require live, real-time answers to dynamic questions.
Failure Modes
False refusal: dynamic questions that could be answered safely may still receive 'I don't know'.
Unsafe execution: malicious or unstable code generated by the LLM calculator can crash the system if it is not sandboxed.
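One way to limit the damage from generated calculator code is to run it in a separate interpreter with a time limit. The sketch below is an illustration under stated assumptions, not the authors' approach: `run_generated_code` and `timeout_s` are hypothetical names, and a child process with a timeout contains crashes and infinite loops but is not a security sandbox, since the code can still reach the file system and network.

```python
import subprocess
import sys


def run_generated_code(code, timeout_s=5.0):
    """Run untrusted Python in a child interpreter and return (ok, output).

    Process isolation only: this guards against crashes and hangs,
    not against deliberately malicious code.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-I", "-c", code],  # -I: isolated mode
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False, "timeout"
    if proc.returncode != 0:
        return False, proc.stderr.strip()
    return True, proc.stdout.strip()
```

A caller can treat any `ok == False` result as a reason to fall back to 'I don't know' rather than pass an unverified number to the user. Stronger containment, such as containers or restricted system-call profiles, would be needed before exposing this to untrusted input.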

