Overview
Production Readiness
0.7
Novelty Score
0.5
Cost Impact Score
0.7
Citation Count
0
Why It Matters For Business
You can run a small, fine-tuned judge model to flag unsupported output from RAG systems in English and Chinese. This cuts API costs and enables on-premise monitoring, with accuracy on the paper's benchmark close to that of much larger open models.
Summary TLDR
This paper builds Bi'anBench, a bilingual (English+Chinese) benchmark of 22,992 RAG test cases covering QA, summarization, data-to-text and machine translation. It also trains compact judge models (Bi'an-qwen 7B and 14B) via supervised fine-tuning plus Direct Preference Optimization (DPO) using ensemble-generated training signals. The 14B judge matches or slightly beats much larger open models on this benchmark and narrows the gap to GPT-4o, while smaller 7B models outperform some mid-sized baselines. The authors release datasets, prompts, and training recipes and highlight failure modes like parametric vs. context knowledge conflicts.
Problem Statement
RAG pipelines reduce hallucinations but still produce unsupported or contradictory content. Practitioners lack a multilingual, multi-task benchmark and lightweight judge models optimized for RAG hallucination detection, forcing reliance on expensive closed-source models.
Main Contribution
Bi'anBench — a bilingual (EN+ZH) RAG hallucination detection benchmark with 22,992 labeled test cases across QA, summarization, data-to-text, and machine translation.
A data generation process using two GPT-4o-based pipelines: a hallucination perturbation pipeline and a counterfactual-QA pipeline.
Training recipe for compact judge models (Bi'an-qwen-7B and -14B): ensemble-based sample construction, supervised fine-tuning (SFT) + DPO, using LoRA and DeepSpeed.
Evaluation showing Bi'an-qwen-14B matches or slightly outperforms large open baselines on Bi'anBench and narrows the gap to closed-source GPT-4o.
Analysis of failure modes, notably conflicts between model parametric knowledge and provided context (43.9% of annotated GPT-4o errors on a counterfactual subset).
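The ensemble-based sample construction can be pictured with a minimal sketch. The only rule taken from the summary is that cases where every ensemble member errs are discarded; the `Judgment` record, the `build_training_sample` helper, and the way a correct and an incorrect judgment are paired for DPO are assumptions made for illustration, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Judgment:
    model: str      # hypothetical ensemble member name
    label: str      # "hallucinated" or "supported"
    rationale: str  # free-text explanation produced by that member


def build_training_sample(prompt: str, gold: str,
                          judgments: list[Judgment]) -> Optional[dict]:
    """Turn ensemble judgments for one case into SFT/DPO records.

    Returns None when no ensemble member agrees with the gold label.
    """
    correct = [j for j in judgments if j.label == gold]
    wrong = [j for j in judgments if j.label != gold]
    if not correct:
        return None  # every member erred, so the case is discarded
    sample = {"prompt": prompt, "sft_target": correct[0].rationale, "dpo_pair": None}
    if wrong:
        # Assumed pairing: a preference pair needs one good and one bad
        # response to the same prompt.
        sample["dpo_pair"] = {
            "chosen": correct[0].rationale,
            "rejected": wrong[0].rationale,
        }
    return sample


judgments = [
    Judgment("judge-a", "supported", "The answer restates the context."),
    Judgment("judge-b", "hallucinated", "The date is not in the context."),
]
sample = build_training_sample("Is the answer supported?", "supported", judgments)
```

A case with only wrong judgments returns `None`, which is the mechanism behind the hard-case bias listed under Limitations.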
Key Findings
Bi'anBench is a large bilingual benchmark for RAG hallucination detection.
Compact judge models trained on Bi'an data nearly match much larger open models on the benchmark.
The smaller Bi'an-qwen-7B model beats mid-sized closed models in many cases.
Parametric knowledge conflicts harm hallucination detection decisions.
Training gains come mostly from supervised fine-tuning, with DPO providing smaller incremental gains.
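For reference, the standard DPO objective is shown here as it is generally defined, not as quoted from this paper. It trains the policy $\pi_\theta$ to prefer a chosen judgment $y_w$ over a rejected one $y_l$ for the same prompt $x$, relative to a frozen reference model $\pi_{\mathrm{ref}}$ (typically the SFT checkpoint). The value of $\beta$ used by the authors is not given in this summary.

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
    \left[ \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right) \right]
```

Because the reference model is the SFT checkpoint, DPO only adjusts preferences on top of what SFT has already learned.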
Results
Bi'anBench size: 22,992 test cases
Training samples constructed
Accuracy
Counterfactual QA subset (Bi'anBench_CF)
Parametric knowledge errors (GPT-4o): 43.9% of annotated errors on the counterfactual subset
Who Should Care
What To Try In 7 Days
Run Bi'anBench on your RAG system to baseline hallucination detection performance in English and Chinese.
Fine-tune an open 7B judge model on a subset of your RAG data using the paper's ensemble + SFT recipe and measure accuracy vs your current detector.
Run targeted counterfactual/context-conflict tests to see if your detector mistakes model knowledge for context evidence.
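A baseline comparison needs little more than accuracy split by language and task. The sketch below assumes each evaluated case is a dict with `lang`, `task`, `gold`, and `pred` fields; these field names and the sample records are illustrative and do not reflect the released Bi'anBench schema.

```python
from collections import defaultdict


def accuracy_by_group(records: list[dict], group_key: str) -> dict[str, float]:
    """Accuracy of predicted vs. gold labels, split by a field such as "lang"."""
    hits: dict[str, int] = defaultdict(int)
    totals: dict[str, int] = defaultdict(int)
    for r in records:
        group = r[group_key]
        totals[group] += 1
        hits[group] += int(r["pred"] == r["gold"])
    return {group: hits[group] / totals[group] for group in totals}


# Illustrative records only, not benchmark data.
records = [
    {"lang": "en", "task": "qa", "gold": "supported", "pred": "supported"},
    {"lang": "en", "task": "summarization", "gold": "hallucinated", "pred": "supported"},
    {"lang": "zh", "task": "qa", "gold": "hallucinated", "pred": "hallucinated"},
    {"lang": "zh", "task": "qa", "gold": "supported", "pred": "supported"},
]

print(accuracy_by_group(records, "lang"))  # {'en': 0.5, 'zh': 1.0}
print(accuracy_by_group(records, "task"))  # {'qa': 1.0, 'summarization': 0.0}
```

Running the same function over your current detector's predictions and a fine-tuned judge's predictions gives a like-for-like comparison per language and task.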
Optimization Features
System Optimization
- DeepSpeed for distributed training
Training Optimization
- LoRA
- SFT
- DPO
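For context, LoRA in its standard formulation (not a detail taken from this paper) freezes the pretrained weight $W_0$ and learns a low-rank update. The rank $r$ and scaling factor $\alpha$ used for the Bi'an models are not stated in this summary.

```latex
h = W_0 x + \frac{\alpha}{r}\, B A x,
\qquad B \in \mathbb{R}^{d \times r},\quad
A \in \mathbb{R}^{r \times k},\quad
r \ll \min(d, k)
```

Only $A$ and $B$ receive gradients, so the number of trainable parameters per adapted matrix drops from $d \cdot k$ to $r\,(d + k)$.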
Reproducibility
Code Urls
Data Urls
Open Source Status
- partial
Risks & Boundaries
Limitations
- Training sample loss: samples where all ensemble models err were discarded, biasing training away from hard cases.
- Task coverage excludes creative writing; subjective tasks are not represented.
- Models still lag GPT-4o on numerical computation and some long-context tasks.
- Planned data/model release was not available at time of writing (release pending).
When Not To Use
- Detecting hallucinations in creative or highly subjective writing.
- Tasks needing high-precision numerical computation or deep long-context reasoning where Bi'an models underperform GPT-4o.
- Use cases that require a fully verified open-source release before deployment.
Failure Modes
- Parametric knowledge vs context conflicts causing wrong judgments.
- Loss of difficult training samples due to ensemble construction rules.
- Synthetic perturbations may not cover all real-world hallucination patterns.
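The parametric-vs-context failure mode can be probed with a small hand-built test. The sketch below is a plain string-substitution stand-in, not the paper's GPT-4o-based counterfactual-QA pipeline; `make_conflict_probe` and its arguments are hypothetical names introduced for illustration.

```python
def make_conflict_probe(context: str, true_fact: str, counterfactual: str,
                        answer_template: str) -> list[dict]:
    """Build two test cases from a passage edited to contradict world knowledge."""
    if true_fact not in context:
        raise ValueError("true_fact must appear verbatim in the context")
    edited = context.replace(true_fact, counterfactual)
    return [
        # Follows the edited context: supported, even though false in reality.
        {"context": edited,
         "answer": answer_template.format(counterfactual),
         "gold": "supported"},
        # Follows world knowledge but contradicts the edited context.
        {"context": edited,
         "answer": answer_template.format(true_fact),
         "gold": "hallucinated"},
    ]


probes = make_conflict_probe(
    context="The Eiffel Tower is in Paris.",
    true_fact="Paris",
    counterfactual="Rome",
    answer_template="The Eiffel Tower is located in {}.",
)
```

A judge that labels the second probe as supported is relying on its own knowledge rather than the provided context.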
Core Entities
Models
- Bi'an-qwen-7B
- Bi'an-qwen-14B
- Qwen2.5-7B-Instruct
- Qwen2.5-14B-Instruct
- Qwen2.5-72B-Instruct
- Qwen2-7B-Instruct
- GPT-4o-0806
- GPT-4o-mini
- Llama3.1-8B-Instruct
- Llama3.1-70B-Instruct
- Lynx-8B-v1.1
Metrics
- Accuracy
Datasets
- Bi'anBench
- Bi'anBench_EN
- Bi'anBench_ZH
- Bi'anBench_CF
- HaluEval
- RAGTruth
- HaluBench
- ASQA
- IfQA
- WebNLG
- WMT21
- PDC
- FinanceBench
- DROP
- PubMedQA
- CovidQA
- CRUD
- WebQA1.0
- LawBench
- CSDS
Benchmarks
- Bi'anBench
- HaluEval
- RAGTruth
- HaluBench

