Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.4
Citation Count
2
Why It Matters For Business
A practical pipeline that trades some answer coverage for far fewer incorrect claims, which is often preferable in QA products where wrong answers are costly.
Summary TLDR
This is a practical Retrieval-Augmented Generation (RAG) pipeline tuned for the CRAG benchmark. The authors add careful web-page chunking, table-to-Markdown extraction, an attribute predictor (static vs dynamic), an LLM-based knowledge extractor (recitation), a Python-based numerical calculator, a KG query module, and a constrained reasoning prompt that refuses risky answers. On the CRAG public test set, the full system reduces hallucination from 72.1% to 13.9% and raises the benchmark score from -46.6% (official RAG baseline) to 15.8%, while modestly improving the correct-answer rate (25.4% → 29.7%). Code is released.
Problem Statement
RAG systems can reduce LLM hallucinations and add time-sensitive knowledge, but off-the-shelf pipelines still fail on multi-hop, numerical, table-based and time-varying questions. The paper aims to assemble practical modules that boost retrieval quality, numeric correctness, and multi-step reasoning on the CRAG benchmark within contest constraints.
Main Contribution
A hybrid RAG pipeline that combines web retrieval, table extraction, KG queries, and LLM recitation as reference sources.
Robust web-page processing: noise removal, sentence chunking, and dedicated table-to-Markdown extraction.
Attribute predictor (few-shot ICL) to label questions as static vs dynamic and force safe refusals for dynamic items.
Numerical calculator: LLM emits Python expressions that are executed externally to avoid numeric hallucination.
Constrained reasoning prompts (zero-shot CoT + format enforcement) that output stepwise reasoning and 'I don't know' or 'Invalid question' when appropriate.
Practical ablation sequence showing incremental gains from each module on CRAG Task 1/2.
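The numerical calculator item describes the LLM emitting a Python expression that is executed outside the model. The sketch below shows one conservative way to do that step. It is not the authors' implementation (the Tool Use list mentions exec/eval); it swaps in an AST whitelist, and the names `safe_eval` and `calculate_or_refuse` are hypothetical.

```python
import ast
import operator

# Whitelisted arithmetic operators; every other node type is rejected.
_BIN_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def safe_eval(expression: str) -> float:
    """Evaluate a purely arithmetic expression without calling eval()."""

    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        # Accept only plain int/float literals (bool is excluded on purpose).
        if isinstance(node, ast.Constant) and type(node.value) in (int, float):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
            return _BIN_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](_eval(node.operand))
        raise ValueError(f"unsupported expression element: {type(node).__name__}")

    return _eval(ast.parse(expression, mode="eval"))


def calculate_or_refuse(expression: str) -> str:
    """Return the result as text, or the refusal string if evaluation fails."""
    try:
        return str(safe_eval(expression))
    except (ValueError, SyntaxError, ZeroDivisionError, OverflowError):
        return "I don't know"
```

Function calls, attribute access, and names are all rejected, so an expression such as `__import__('os').system('ls')` falls through to the refusal rather than running.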
Key Findings
The full system flipped the Task 1 public score from -46.6% (official RAG baseline) to 15.8%.
Hallucination rate dropped from 72.1% to 13.9%.
Correct-answer rate rose modestly, from 25.4% to 29.7%.
Results
Score(%) on public Task 1: -46.6 (official RAG baseline) → 15.8 (full system)
Hallucination(%): 72.1 → 13.9
Correct(%): 25.4 → 29.7
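The three metrics appear to be linked by simple arithmetic. The sketch below assumes a scoring rule of +1 for a correct answer, 0 for a missing answer, and -1 for a hallucinated one, so that the score equals Correct minus Hallucination. That rule is an assumption, not something this summary states. It fits the figures in the Summary: 29.7 - 13.9 = 15.8 for the full system, and 25.4 - 72.1 = -46.7 for the baseline, which matches the reported -46.6 up to rounding. The function names are hypothetical.

```python
def crag_score(correct_pct: float, hallucination_pct: float) -> float:
    """Score under the assumed rule: +1 correct, 0 missing, -1 hallucinated."""
    return round(correct_pct - hallucination_pct, 1)


def missing_pct(correct_pct: float, hallucination_pct: float) -> float:
    """Share of unanswered questions, assuming the three rates sum to 100."""
    return round(100.0 - correct_pct - hallucination_pct, 1)


if __name__ == "__main__":
    print("full system score:", crag_score(29.7, 13.9))
    print("baseline score:", crag_score(25.4, 72.1))
    print("full system missing:", missing_pct(29.7, 13.9))
```

Under the same assumption, most of the improvement comes from converting hallucinated answers into refusals rather than into correct answers, which is the trade-off the summary describes.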
Who Should Care
What To Try In 7 Days
Add robust web cleaning + chunking (trafilatura + sentence segmentation).
Extract tables to Markdown and include them as separate references.
Add a static/dynamic classifier (ICL or SVM) and refuse dynamic questions with 'I don't know'. Revisit this rule once live data is available.
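The static/dynamic gate from the list above could be prototyped roughly as follows, using the ICL variant. This is a minimal sketch under stated assumptions: the few-shot examples, the prompt wording, and the `llm` and `answer_fn` callables are all hypothetical placeholders, not the authors' prompt or code.

```python
from typing import Callable, List, Tuple

# Hypothetical few-shot examples; a real set would come from labelled questions.
FEW_SHOT: List[Tuple[str, str]] = [
    ("Who wrote Pride and Prejudice?", "static"),
    ("What is the current share price of Apple?", "dynamic"),
]


def build_attribute_prompt(question: str) -> str:
    """Assemble an in-context-learning prompt that asks for a single label."""
    lines = ["Label each question as 'static' or 'dynamic'.", ""]
    for example, label in FEW_SHOT:
        lines += [f"Question: {example}", f"Label: {label}", ""]
    lines += [f"Question: {question}", "Label:"]
    return "\n".join(lines)


def answer_with_gate(
    question: str,
    llm: Callable[[str], str],
    answer_fn: Callable[[str], str],
) -> str:
    """Refuse dynamic questions; pass static ones to the normal pipeline."""
    label = llm(build_attribute_prompt(question)).strip().lower()
    if label != "static":
        # Any label other than 'static' (including unparseable output)
        # is treated as dynamic, which is the conservative choice.
        return "I don't know"
    return answer_fn(question)
```

Keeping the classifier behind a plain callable makes it straightforward to replace the prompt-based labeller with an SVM, or to relax the refusal later for question types where live data becomes available.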
Agent Features
Memory
- retrieval memory (external corpus + KG)
Tool Use
- Python numeric evaluator (exec/eval)
- function-calling style for KG queries (attempted)
Frameworks
- RAG pipeline
- LLM prompting and constrained output
Architectures
- retriever-generator pipeline
- two-tower retrieval (embedding + cosine)
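The two-tower retrieval step (embed the query and the chunks separately, then rank by cosine similarity) can be illustrated with a small dependency-free sketch. The vectors here are assumed to come from an embedding model such as those listed under Core Entities; the embedding call itself is omitted, and `cosine` and `top_k` are hypothetical helper names.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(
    query_vec: Sequence[float],
    chunk_vecs: Sequence[Sequence[float]],
    k: int = 3,
) -> List[Tuple[int, float]]:
    """Return (chunk index, similarity) pairs for the k most similar chunks."""
    scored = [(i, cosine(query_vec, vec)) for i, vec in enumerate(chunk_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]
```

Because chunk vectors do not depend on the query, they can be computed once per page and reused, which is the main efficiency argument for this architecture.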
Optimization Features
Token Efficiency
- chunking and table truncation to limit context
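Table truncation combined with Markdown rendering might look like the sketch below. It assumes the table has already been parsed into a header and a list of rows; the HTML parsing step is out of scope here. The `max_rows` limit and the truncation marker are illustrative choices, not values taken from the paper.

```python
from typing import List, Sequence


def table_to_markdown(
    header: Sequence[str],
    rows: Sequence[Sequence[object]],
    max_rows: int = 20,
) -> str:
    """Render a parsed table as Markdown, keeping at most max_rows rows."""

    def fmt(cells: Sequence[object]) -> str:
        # Escape pipes so cell text cannot break the column layout.
        return "| " + " | ".join(str(c).replace("|", "\\|") for c in cells) + " |"

    lines: List[str] = [fmt(header), "| " + " | ".join("---" for _ in header) + " |"]
    lines += [fmt(row) for row in rows[:max_rows]]
    if len(rows) > max_rows:
        # Tell the reader (and the LLM) that the table was cut short.
        lines.append(f"(truncated: {len(rows) - max_rows} rows omitted)")
    return "\n".join(lines)
```

Recording how many rows were dropped lets the downstream prompt treat an answer that depends on missing rows as a reason to refuse rather than guess.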
Model Optimization
- GPTQ quantized Llama3-70B for inference
System Optimization
- backup summarization agent for parse failures
Inference Optimization
- refuse dynamic queries to save time and avoid hallucination
- use efficient two-tower embeddings and cosine similarity
Reproducibility
Data Urls
- CRAG public split (Meta CRAG KDD Cup 2024)
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Poor handling of time-varying (dynamic) questions by design; many are refused.
- Knowledge-graph query module was under-optimized in submitted version.
- Table handling is basic; large or noisy tables can still degrade results.
- Security: executing LLM-generated Python expressions without sandboxing is risky.
When Not To Use
- In latency- or cost-sensitive environments without heavy GPU resources.
- For applications that require live, real-time answers to dynamic questions.
- When safe sandboxed code execution is unavailable.
Failure Modes
- False refusal: dynamic questions that could have been answered safely still receive 'I don't know'.
- Malicious or unstable code from LLM calculator causing crashes if not sandboxed.
- Over-reliance on LLM-recited knowledge that can hallucinate and mislead reasoning.
- Parsing or format failures when constrained-output sampling is not enforced.
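One way to contain the parsing failure mode is to validate the model output and fall back to a refusal when the expected format is absent. The sketch below assumes a hypothetical output convention in which the final answer sits on a line starting with `Answer:`; the actual format used by the authors is not given in this summary.

```python
import re

# Canonical spellings for the two refusal strings named in the summary.
CANONICAL = {
    "i don't know": "I don't know",
    "invalid question": "Invalid question",
}

# Assumed convention: the answer is on a line of the form "Answer: ...".
_ANSWER_RE = re.compile(r"^\s*Answer:\s*(.+?)\s*$", re.IGNORECASE | re.MULTILINE)


def extract_answer(raw_output: str) -> str:
    """Return the last 'Answer:' line, or a refusal if none is found."""
    matches = _ANSWER_RE.findall(raw_output)
    if not matches:
        # Format was not followed: refuse rather than guess at the answer.
        return "I don't know"
    answer = matches[-1]
    # Normalize casing and trailing punctuation of refusal strings.
    return CANONICAL.get(answer.lower().rstrip(". "), answer)
```

Taking the last matching line means intermediate reasoning steps that happen to contain `Answer:` do not override the final answer.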
Core Entities
Models
- sentence-t5-large
- Llama3-70B-Instruct
- Llama3-70B-GPTQ
- all-MiniLM-L6-v2
Metrics
- Correct(%)
- Missing(%)
- Hallucination(%)
- Score(%)
Datasets
- CRAG (Meta CRAG KDD Cup 2024 public split)
Benchmarks
- CRAG benchmark

