Bi'an: a bilingual RAG hallucination benchmark plus small fine-tuned judge models

February 26, 2025 · 7 min

Overview

Decision Snapshot: Needs Validation

The dataset and training recipe are practical and were tested against many baselines. However, the numbers come from a single benchmark family, and some claims depend on the authors' synthetic pipelines and limited ablations. Planned releases should improve reproducibility.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.82

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 2/5

Reproducibility

Status: No open assets linked

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 70%

Novelty: 50%

Authors

Zhouyu Jiang, Mengshu Sun, Zhiqiang Zhang, Lei Liang

Links

Abstract / PDF / Code / Data

Why It Matters For Business

You can run a small, tuned judge model to flag unsupported output from RAG systems across English and Chinese, cutting API costs and enabling on-premise monitoring without losing much accuracy on typical tasks.

Who Should Care

Summary TLDR

This paper builds Bi'anBench, a bilingual (English+Chinese) benchmark of 22,992 RAG test cases covering QA, summarization, data-to-text, and machine translation. It also trains compact judge models (Bi'an-qwen 7B and 14B) via supervised fine-tuning (SFT) followed by Direct Preference Optimization (DPO), using ensemble-generated training signals. The 14B judge matches or slightly beats much larger open models on this benchmark and narrows the gap to GPT-4o, while the smaller 7B judge outperforms some mid-sized baselines. The authors plan to release datasets, prompts, and training recipes, and they highlight failure modes such as conflicts between parametric and context knowledge.
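For readers unfamiliar with DPO, the per-pair loss can be sketched in a few lines. This is the standard DPO objective, not the authors' training code; the function name `dpo_loss` and the default `beta=0.1` are assumptions for illustration, and the paper's actual hyperparameters are not stated in this summary.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard per-pair DPO loss (illustrative; beta is an assumed value).

    Each argument is the summed log-probability of a full response under
    either the policy being trained or the frozen reference model.
    """
    # How much more the policy likes each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    z = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(z)) == log(1 + exp(-z)), computed in a numerically stable way.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))

# A policy identical to the reference sits at the neutral point log(2).
neutral = dpo_loss(-1.0, -1.0, -1.0, -1.0)
# A policy that already prefers the chosen judgment gets a lower loss.
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

In a judge-training setting, the "chosen" response would be the preferred judgment for a case and the "rejected" response a worse one; how the pairs were built from the ensemble is described only at a high level here.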

Problem Statement

RAG pipelines reduce hallucinations but still produce unsupported or contradictory content. Practitioners lack a multilingual, multi-task benchmark and lightweight judge models optimized for RAG hallucination detection, forcing reliance on expensive closed-source models.

Main Contribution

Bi'anBench: a bilingual (EN+ZH) RAG hallucination detection benchmark with 22,992 labeled test cases across QA, summarization, data-to-text, and machine translation.

A data generation process using two GPT-4o-based pipelines: a hallucination perturbation pipeline and a counterfactual-QA pipeline.

Key Findings

Bi'anBench is a large bilingual benchmark for RAG hallucination detection.

Numbers: 22,992 total cases (EN 13,301; ZH 7,757; CF 1,934)

Practical Use: Use this dataset to test RAG faithfulness across languages and four tasks before deploying a judge or RAG pipeline.

Evidence Ref: Table 9; Section 2.2

Compact judge models trained on Bi'an data nearly match much larger open models on the benchmark.

Numbers: Bi'an-qwen-14B avg ≈ 83.4 vs Qwen2.5-72B 83.3 and GPT-4o-0806 84.8 (on evaluated subsets)

Practical Use: Fine-tuned 14B open models can replace larger open models for RAG-judge tasks to cut cost while keeping similar accuracy on these tests.

Evidence Ref: Table 17; Section 3.2

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Bi'anBench size | 22,992 cases (EN 13,301; ZH 7,757; CF 1,934) | - | - | Bi'anBench | Table 9; Section 2.2 | Table 9
Training samples constructed | 5,994 SFT samples and 1,713 preference pairs | - | - | training data | Section 2.3; B.1; Table 13 | B.1 Table 13

What To Try In 7 Days

Run Bi'anBench on your RAG system to baseline hallucination detection performance in English and Chinese.

Fine-tune an open 7B judge model on a subset of your RAG data using the paper's ensemble + SFT recipe and measure accuracy vs your current detector.

Run targeted counterfactual/context-conflict tests to see if your detector mistakes model knowledge for context evidence.
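The baselining step in the list above can be sketched as a per-split accuracy harness. Everything here is hypothetical: the record fields (`split`, `context`, `response`, `label`), the label strings, and the toy `overlap_judge` are assumptions for illustration, not the Bi'anBench schema or a Bi'an model. In practice the `judge` callable would wrap your own detector.

```python
from collections import defaultdict

def accuracy_by_split(cases, judge):
    """Return {split: accuracy} for a judge callable(context, response) -> label."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for case in cases:
        pred = judge(case["context"], case["response"])
        total[case["split"]] += 1
        correct[case["split"]] += int(pred == case["label"])
    return {split: correct[split] / total[split] for split in total}

def overlap_judge(context, response):
    """Toy stand-in judge: 'supported' only if every response token is in the context."""
    ctx_tokens = set(context.lower().split())
    ok = all(tok in ctx_tokens for tok in response.lower().split())
    return "supported" if ok else "hallucinated"

# Hypothetical records; the Chinese text is pre-segmented with spaces for the toy judge.
cases = [
    {"split": "EN", "context": "the bridge opened in 1932",
     "response": "opened in 1932", "label": "supported"},
    {"split": "EN", "context": "the bridge opened in 1932",
     "response": "opened in 1950", "label": "hallucinated"},
    {"split": "EN", "context": "the bridge opened in 1932",
     "response": "the bridge was opened in 1932", "label": "supported"},
    {"split": "ZH", "context": "这座 桥 于 1932 年 开放",
     "response": "桥 于 1932 年 开放", "label": "supported"},
    {"split": "ZH", "context": "这座 桥 于 1932 年 开放",
     "response": "桥 于 1950 年 开放", "label": "hallucinated"},
]

scores = accuracy_by_split(cases, overlap_judge)
```

Reporting accuracy per split rather than one pooled number makes it easier to see whether a detector degrades on one language. The third English case shows why a lexical-overlap judge is only a placeholder: it rejects a faithful paraphrase.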

Optimization Features

System Optimization
System Optimization: DeepSpeed for distributed training
Training Optimization: LoRA, SFT

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Training sample loss: samples where all ensemble models err were discarded, biasing training away from hard cases.

Task coverage excludes creative writing; subjective tasks are not represented.
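The first limitation above (dropping samples that every ensemble model gets wrong) can be made concrete with a small sketch. Only the discard rule comes from the summary; the function name `build_training_pool`, the record layout, and the `ensemble_agreement` field are assumptions for illustration and do not reproduce the authors' pipeline.

```python
def build_training_pool(samples, ensemble_preds):
    """Split samples into kept and discarded IDs under an all-wrong discard rule.

    samples: list of {"id": str, "label": str}
    ensemble_preds: {sample_id: [label predicted by each ensemble model]}
    """
    kept, discarded = [], []
    for sample in samples:
        preds = ensemble_preds[sample["id"]]
        n_correct = sum(pred == sample["label"] for pred in preds)
        if n_correct == 0:
            # No ensemble model was right: the sample is dropped, which is
            # exactly how the hardest cases disappear from training.
            discarded.append(sample["id"])
        else:
            kept.append({**sample, "ensemble_agreement": n_correct / len(preds)})
    return kept, discarded

# Hypothetical toy data with a three-model ensemble.
samples = [
    {"id": "a", "label": "hallucinated"},
    {"id": "b", "label": "supported"},
    {"id": "c", "label": "hallucinated"},
]
ensemble_preds = {
    "a": ["hallucinated", "hallucinated", "supported"],
    "b": ["supported", "supported", "supported"],
    "c": ["supported", "supported", "supported"],
}

kept, discarded = build_training_pool(samples, ensemble_preds)
```

Logging the discarded IDs, rather than silently dropping them, gives a ready-made hard-case set for later evaluation.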

When Not To Use

Detecting hallucinations in creative or highly subjective writing.

Tasks needing high-precision numerical computation or deep long-context reasoning where Bi'an models underperform GPT-4o.

Failure Modes

Parametric knowledge vs context conflicts causing wrong judgments.

Loss of difficult training samples due to ensemble construction rules.
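The parametric-versus-context failure mode can be probed with a counterfactual pair: a context that contradicts common knowledge, one answer that follows the context, and one that follows the world. The sketch below is a minimal illustration under assumed label strings (`supported`, `hallucinated`); both judges are toy stand-ins, and `conflict_probe` is a hypothetical helper, not part of Bi'anBench.

```python
def conflict_probe(judge, context, context_answer, world_answer):
    """Return True if the judge sides with the context over world knowledge."""
    follows_context = judge(context, context_answer) == "supported"
    rejects_world = judge(context, world_answer) == "hallucinated"
    return follows_context and rejects_world

def grounded_judge(context, answer):
    """Toy judge that checks only whether the answer appears in the context."""
    return "supported" if answer.lower() in context.lower() else "hallucinated"

WORLD_FACTS = {"paris"}  # stands in for knowledge stored in model weights

def parametric_judge(context, answer):
    """Toy judge that ignores the context and trusts its own 'memory'."""
    return "supported" if answer.lower() in WORLD_FACTS else "hallucinated"

# Counterfactual context: faithful-to-context and true-in-the-world disagree.
context = "According to this document, the Eiffel Tower is in Rome."

grounded_ok = conflict_probe(grounded_judge, context, "Rome", "Paris")
parametric_ok = conflict_probe(parametric_judge, context, "Rome", "Paris")
```

A detector that fails this probe is judging truth rather than faithfulness to the retrieved context, which is the confusion this failure mode describes.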

Core Entities

Models

Bi'an-qwen-7B, Bi'an-qwen-14B, Qwen2.5-7B-Instruct, Qwen2.5-14B-Instruct, Qwen2.5-72B-Instruct, Qwen2-7B-Instruct, GPT-4o-0806, GPT-4o-mini, Llama3.1-8B-Instruct, Llama3.1-70B-Instruct, Lynx-8B-v1.1

Metrics

Accuracy

Datasets

Bi'anBench, Bi'anBench_EN, Bi'anBench_ZH, Bi'anBench_CF, HaluEval, RAGTruth, HaluBench, ASQA, IfQA, WebNLG, WMT21, PDC, FinanceBench, DROP, PubMedQA, CovidQA, CRUD, WebQA1.0, LawBench, CSDS

Benchmarks

Bi'anBench, HaluEval, RAGTruth, HaluBench