Bi'an: a bilingual RAG hallucination benchmark plus small fine-tuned judge models

Overview

Decision Snapshot: Needs Validation

The dataset and model training are practical and tested against many baselines. However, the numbers come from a single benchmark family, and some claims depend on the authors' synthetic pipelines and limited ablations. Planned releases should improve reproducibility.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.82

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 2/5

Reproducibility

Status: No open assets linked

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 70%

Novelty: 50%

Authors

Zhouyu Jiang, Mengshu Sun, Zhiqiang Zhang, Lei Liang

Links

Abstract / PDF / Code / Data

Why It Matters For Business

You can run a small, tuned judge model to flag unsupported output from RAG systems across English and Chinese, cutting API costs and enabling on-premise monitoring without losing much accuracy on typical tasks.

Who Should Care

CTO, Product Manager, ML Engineer, Engineering Lead, Data Scientist, Founder

Summary TLDR

This paper builds Bi'anBench, a bilingual (English+Chinese) benchmark of 22,992 RAG test cases covering QA, summarization, data-to-text, and machine translation. It also trains compact judge models (Bi'an-qwen 7B and 14B) via supervised fine-tuning (SFT) followed by Direct Preference Optimization (DPO), using ensemble-generated training signals. The 14B judge matches or slightly beats much larger open models on this benchmark and narrows the gap to GPT-4o, while the 7B judge outperforms some mid-sized baselines. The authors plan to release datasets, prompts, and training recipes, and they highlight failure modes such as conflicts between parametric knowledge and context.
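For readers unfamiliar with DPO, the following is a minimal sketch of the standard DPO loss for a single preference pair, written in plain Python. It is not the authors' training code: the function name, the argument names, and the default `beta=0.1` are assumptions chosen for illustration, and the summary above does not state the hyperparameters the paper used.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Inputs are summed sequence log-probabilities under the trainable
    policy and under a frozen reference model. `beta` is an assumed
    value, not one taken from the paper.
    """
    # How much more the policy likes each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    z = beta * (chosen_margin - rejected_margin)
    # Loss is -log(sigmoid(z)) = log(1 + exp(-z)), computed stably.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


# Policy identical to the reference: loss equals log(2).
baseline = dpo_loss(-1.0, -1.0, -1.0, -1.0)
# Policy prefers the chosen judgment more than the reference does: lower loss.
improved = dpo_loss(-1.0, -5.0, -2.0, -2.0)
```

In a judge-model setting, the "chosen" response would be a correct judgment and the "rejected" response an incorrect one for the same input; the loss falls as the policy separates the two more strongly than the reference model does.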

Problem Statement

RAG pipelines reduce hallucinations but still produce unsupported or contradictory content. Practitioners lack a multilingual, multi-task benchmark and lightweight judge models optimized for RAG hallucination detection, forcing reliance on expensive closed-source models.

Main Contribution

Bi'anBench: a bilingual (EN+ZH) RAG hallucination detection benchmark with 22,992 labeled test cases across QA, summarization, data-to-text, and machine translation.

A data generation process using two GPT-4o-based pipelines: a hallucination perturbation pipeline and a counterfactual-QA pipeline.

Key Findings

Bi'anBench is a large bilingual benchmark for RAG hallucination detection.

Numbers: 22,992 total cases (EN 13,301; ZH 7,757; CF 1,934)

Practical Use: Use this dataset to test RAG faithfulness across languages and four tasks before deploying a judge or RAG pipeline.

Evidence Ref: Table 9; Section 2.2
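As a quick sanity check on the reported split sizes, the short sketch below sums the three subsets and computes each one's share of the benchmark. The counts are taken from the numbers above; reading "CF" as the counterfactual subset is an assumption based on the counterfactual-QA pipeline named under Main Contribution.

```python
# Split sizes as reported for Bi'anBench ("CF" assumed to mean counterfactual).
splits = {"EN": 13_301, "ZH": 7_757, "CF": 1_934}

total = sum(splits.values())

# Percentage share of each split, rounded to one decimal place.
shares = {name: round(100 * count / total, 1) for name, count in splits.items()}

for name, count in splits.items():
    print(f"{name}: {count} cases ({shares[name]}%)")
print(f"Total: {total}")
```

The three subsets add up to the stated total of 22,992, so the headline figure and the per-split figures are consistent with each other.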

Compact judge models trained on Bi'an data nearly match much larger open models on the benchmark.

Numbers: Bi'an-qwen-14B avg ≈ 83.4 vs Qwen2.5-72B 83.3 and GPT-4o-0806 84.8 (on evaluated subsets)

Practical Use: Fine-tuned 14B open models can replace larger open models for RAG-judge tasks to cut cost while keeping similar accuracy on these tests.

Evidence Ref: Table 17; Section 3.2
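The gaps implied by these averages can be made explicit with a few lines of arithmetic. The sketch below uses only the three scores quoted above; since the source gives the Bi'an-qwen-14B figure as approximate, the deltas should be read as approximate too.

```python
# Average accuracy figures as quoted above (evaluated subsets only).
scores = {
    "Bi'an-qwen-14B": 83.4,
    "Qwen2.5-72B": 83.3,
    "GPT-4o-0806": 84.8,
}
judge = "Bi'an-qwen-14B"

# Delta = judge score minus baseline score, in accuracy points.
deltas = {
    name: round(scores[judge] - score, 1)
    for name, score in scores.items()
    if name != judge
}

for name, delta in deltas.items():
    print(f"{judge} vs {name}: {delta:+.1f} points")
```

On these figures the 14B judge is about 0.1 points ahead of the 72B open model and about 1.4 points behind GPT-4o-0806.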

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Bi'anBench size	22,992 cases (EN 13,301; ZH 7,757; CF 1,934)	—	—	Bi'anBench	Table 9; Section 2.2	Table 9
Training samples constructed	5,994 SFT samples and 1,713 preference pairs	—	—	training data	Section 2.3; B.1; Table 13	B.1 Table 13

What To Try In 7 Days

Run Bi'anBench on your RAG system to baseline hallucination detection performance in English and Chinese.

Fine-tune an open 7B judge model on a subset of your RAG data using the paper's ensemble + SFT recipe and measure accuracy vs your current detector.

Run targeted counterfactual/context-conflict tests to see if your detector mistakes model knowledge for context evidence.
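A baseline run along the lines of the steps above might start with a per-split accuracy report, so that English, Chinese, and counterfactual cases are not averaged together. The sketch below is illustrative only: the record fields (`split`, `label`, `prediction`) and the label strings are hypothetical and do not reflect the actual Bi'anBench file format.

```python
from collections import defaultdict


def accuracy_by_split(records):
    """Return {split: accuracy} for an iterable of judged test cases.

    Each record is a dict with hypothetical keys:
      'split'      - e.g. "EN", "ZH", or "CF"
      'label'      - gold label for the case
      'prediction' - the judge's predicted label
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        total[record["split"]] += 1
        correct[record["split"]] += int(record["label"] == record["prediction"])
    return {split: correct[split] / total[split] for split in total}


# Toy records standing in for real judge outputs.
records = [
    {"split": "EN", "label": "supported", "prediction": "supported"},
    {"split": "EN", "label": "hallucinated", "prediction": "supported"},
    {"split": "ZH", "label": "hallucinated", "prediction": "hallucinated"},
    {"split": "CF", "label": "supported", "prediction": "hallucinated"},
]
report = accuracy_by_split(records)
```

Reporting the counterfactual split separately makes it easier to spot a detector that scores well overall but relies on the model's own knowledge instead of the retrieved context.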

Optimization Features

System Optimization

DeepSpeed for distributed training

Training Optimization

LoRA, SFT

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/OpenSPG/KAG

Data URLs

https://github.com/OpenSPG/KAG

Risks & Boundaries

Limitations

Training sample loss: samples where all ensemble models err were discarded, biasing training away from hard cases.

Task coverage excludes creative writing; subjective tasks are not represented.
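The first limitation above is easier to see with a concrete sketch of an ensemble filter. Only the discard rule (drop a sample when every ensemble model is wrong) comes from the source; how the authors actually select SFT targets and build preference pairs is not described here, so the pairing logic, function name, and data layout below are assumptions for illustration.

```python
def build_training_signals(gold_label, ensemble_outputs):
    """Turn ensemble judgments for one sample into training signals.

    ensemble_outputs: list of (predicted_label, rationale) tuples, one per
    ensemble model (hypothetical layout).
    Returns None when no model is correct, mirroring the discard rule.
    """
    correct = [out for out in ensemble_outputs if out[0] == gold_label]
    wrong = [out for out in ensemble_outputs if out[0] != gold_label]

    if not correct:
        # Every model erred: the sample is dropped, which is exactly how
        # the hardest cases fall out of the training set.
        return None

    # Assumed choice: use the first correct judgment as the SFT target and
    # pair it against each incorrect judgment for preference training.
    sft_target = correct[0]
    pairs = [{"chosen": sft_target, "rejected": bad} for bad in wrong]
    return {"sft_target": sft_target, "preference_pairs": pairs}


mixed = build_training_signals(
    "hallucinated",
    [("hallucinated", "claim not in context"), ("supported", "looks fine")],
)
all_wrong = build_training_signals(
    "hallucinated",
    [("supported", "looks fine"), ("supported", "matches world knowledge")],
)
```

Under this kind of rule, a sample contributes training signal only if at least one ensemble member already gets it right, so the resulting judge sees few examples of the cases that strong models find hardest.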

When Not To Use

Detecting hallucinations in creative or highly subjective writing.

Tasks needing high-precision numerical computation or deep long-context reasoning where Bi'an models underperform GPT-4o.

Failure Modes

Parametric knowledge vs context conflicts causing wrong judgments.

Loss of difficult training samples due to ensemble construction rules.
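The parametric-versus-context failure mode can be probed with a small harness like the sketch below. Each case supplies a context containing a counterfactual statement, an answer faithful to that context, and an answer that matches common world knowledge but contradicts the context. The case fields and both toy judges are hypothetical stand-ins; a real test would call an actual judge model.

```python
def conflict_error_rate(judge, cases):
    """Fraction of wrong verdicts on counterfactual context-conflict cases.

    judge(context, response) -> True if the response is judged supported.
    A context-faithful answer should be supported; a world-knowledge answer
    that contradicts the context should not.
    """
    errors = 0
    for case in cases:
        checks = ((case["context_answer"], True), (case["world_answer"], False))
        for response, expected in checks:
            if judge(case["context"], response) != expected:
                errors += 1
    return errors / (2 * len(cases))


def parametric_judge(context, response):
    # Toy judge that ignores the context and trusts memorized "facts".
    return response in {"Paris"}


def context_judge(context, response):
    # Toy judge that only checks whether the context contains the response.
    return response in context


cases = [{
    "context": "In this document, the capital of France is Lyon.",
    "context_answer": "Lyon",
    "world_answer": "Paris",
}]
parametric_rate = conflict_error_rate(parametric_judge, cases)
context_rate = conflict_error_rate(context_judge, cases)
```

A judge that leans on its own knowledge fails both checks in each case, while a context-grounded judge passes them, which is the behavior a counterfactual subset is meant to separate.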

Core Entities

Models

Bi'an-qwen-7B, Bi'an-qwen-14B, Qwen2.5-7B-Instruct, Qwen2.5-14B-Instruct, Qwen2.5-72B-Instruct, Qwen2-7B-Instruct, GPT-4o-0806, GPT-4o-mini, Llama3.1-8B-Instruct, Llama3.1-70B-Instruct, Lynx-8B-v1.1

Metrics

Accuracy

Datasets

Bi'anBench, Bi'anBench_EN, Bi'anBench_ZH, Bi'anBench_CF, HaluEval, RAGTruth, HaluBench, ASQA, IfQA, WebNLG, WMT21, PDC, FinanceBench, DROP, PubMedQA, CovidQA, CRUD, WebQA1.0, LawBench, CSDS

Benchmarks

Bi'anBench, HaluEval, RAGTruth, HaluBench
