Bi'an: a bilingual RAG hallucination benchmark plus small fine-tuned judge models

February 26, 2025 · 7 min

Overview

Production Readiness

0.7

Novelty Score

0.5

Cost Impact Score

0.7

Citation Count

0

Authors

Zhouyu Jiang, Mengshu Sun, Zhiqiang Zhang, Lei Liang

Why It Matters For Business

You can run a small, tuned judge model to flag unsupported output from RAG systems across English and Chinese, cutting API costs and enabling on-premise monitoring without losing much accuracy on typical tasks.

Summary TLDR

This paper builds Bi'anBench, a bilingual (English+Chinese) benchmark of 22,992 RAG test cases covering QA, summarization, data-to-text, and machine translation. It also trains compact judge models (Bi'an-qwen-7B and Bi'an-qwen-14B) via supervised fine-tuning plus Direct Preference Optimization (DPO), using ensemble-generated training signals. The 14B judge matches or slightly beats much larger open models on this benchmark and narrows the gap to GPT-4o, while the 7B judge outperforms GPT-4o-mini on average (80.2 vs 78.9). The authors describe the datasets, prompts, and training recipes, with the data and model release still pending at the time of writing, and highlight failure modes such as conflicts between parametric and context knowledge.

Problem Statement

RAG pipelines reduce hallucinations but still produce unsupported or contradictory content. Practitioners lack a multilingual, multi-task benchmark and lightweight judge models optimized for RAG hallucination detection, forcing reliance on expensive closed-source models.

Main Contribution

Bi'anBench: a bilingual (EN+ZH) RAG hallucination detection benchmark with 22,992 labeled test cases across QA, summarization, data-to-text, and machine translation.

A data generation process using two GPT-4o-based pipelines: a hallucination perturbation pipeline and a counterfactual-QA pipeline.

Training recipe for compact judge models (Bi'an-qwen-7B and -14B): ensemble-based sample construction, supervised fine-tuning (SFT) + DPO, using LoRA and DeepSpeed.

Evaluation showing Bi'an-qwen-14B matches or slightly outperforms large open baselines on Bi'anBench and narrows the gap to closed-source GPT-4o.

Analysis of failure modes, notably conflicts between model parametric knowledge and provided context (43.9% of annotated GPT-4o errors on a counterfactual subset).
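The ensemble-based sample construction in the training recipe can be pictured with a minimal sketch. It assumes each candidate carries a gold label plus judgments from several ensemble models: a correct judgment becomes an SFT target, a correct/incorrect pair becomes a preference pair for DPO, and a sample where every ensemble model errs is dropped (the discard rule noted under Limitations). The data layout and the names Judgment and build_training_sets are hypothetical illustrations, not the paper's code, and the pairing rule is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Judgment:
    model: str      # which ensemble member produced this judgment
    label: bool     # True = "hallucinated", False = "supported"
    rationale: str  # free-text explanation from the judge


def build_training_sets(samples):
    """Split ensemble judgments into SFT samples and preference pairs.

    `samples` is a list of (prompt, gold_label, [Judgment, ...]) tuples.
    Assumption: a correct judgment is a usable SFT target, and a
    (correct, incorrect) pair for the same prompt is a preference pair.
    """
    sft, pairs = [], []
    for prompt, gold, judgments in samples:
        correct = [j for j in judgments if j.label == gold]
        wrong = [j for j in judgments if j.label != gold]
        if not correct:
            # Every ensemble member erred: the sample is discarded.
            continue
        sft.append({"prompt": prompt, "response": correct[0].rationale})
        if wrong:
            pairs.append({
                "prompt": prompt,
                "chosen": correct[0].rationale,
                "rejected": wrong[0].rationale,
            })
    return sft, pairs


demo = [
    ("case-1", True, [Judgment("a", True, "unsupported date"),
                      Judgment("b", False, "looks fine")]),
    ("case-2", False, [Judgment("a", True, "wrong"),
                       Judgment("b", True, "wrong too")]),
    ("case-3", False, [Judgment("a", False, "fully supported")]),
]
sft_samples, preference_pairs = build_training_sets(demo)
```

In this toy run, case-2 is dropped because both ensemble members are wrong, which is exactly how hard cases can fall out of the training set.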

Key Findings

Bi'anBench is a large bilingual benchmark for RAG hallucination detection.

Numbers: 22,992 total cases (EN 13,301; ZH 7,757; CF 1,934)

Compact judge models trained on Bi'an data nearly match much larger open models on the benchmark.

Numbers: Bi'an-qwen-14B avg ≈ 83.4 vs Qwen2.5-72B 83.3 and GPT-4o-0806 84.8 (on evaluated subsets)

The smaller Bi'an-qwen-7B beats the closed GPT-4o-mini on average.

Numbers: Bi'an-qwen-7B avg 80.2 vs GPT-4o-mini 78.9 (on evaluated subsets)

Parametric knowledge conflicts harm hallucination detection decisions.

Numbers: 43.9% of GPT-4o's 57 annotated bad cases on Bi'anBench_CF were due to parametric/context conflicts

Training gains come mostly from supervised fine-tuning, with DPO providing smaller incremental gains.

Numbers: Ablation shows larger SFT stage gains than DPO stage (qualitative; Figure 2)
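For readers less familiar with the DPO stage, the standard per-pair DPO objective can be sketched as below. This is the generic loss from the DPO literature, not code from the paper. The function name dpo_loss and the default beta of 0.1 are illustrative assumptions, and the inputs are summed sequence log-probabilities under the trained policy and a frozen reference model.

```python
import math


def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-pair DPO loss from sequence log-probabilities.

    loss = -log(sigmoid(beta * ((logp_chosen - ref_logp_chosen)
                                - (logp_rejected - ref_logp_rejected))))
    beta=0.1 is an assumed illustrative value, not taken from the paper.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(m)) equals log(1 + exp(-m)); use a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: zero margin.
neutral = dpo_loss(-5.0, -5.0, -5.0, -5.0)
# Policy has shifted probability toward the chosen judgment.
improved = dpo_loss(-4.0, -6.0, -5.0, -5.0)
# Policy has shifted probability toward the rejected judgment.
worsened = dpo_loss(-6.0, -4.0, -5.0, -5.0)
```

The loss falls only when the policy raises the chosen judgment relative to the rejected one beyond what the reference model already does.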

Results

Bi'anBench size

Value: 22,992 cases (EN 13,301; ZH 7,757; CF 1,934)

Training samples constructed

Value: 5,994 SFT samples and 1,713 preference pairs

Average accuracy (on evaluated subsets)

Value: Bi'an-qwen-14B ≈ 83.4

Baseline: Qwen2.5-72B ≈ 83.3; GPT-4o-0806 84.8

Counterfactual QA subset (Bi'anBench_CF)

Value: Bi'an-qwen-7B EN 94.1; ZH 95.1

Baseline: Qwen2.5-7B EN 93.2; Qwen2.5-14B EN 91.8

Parametric knowledge errors (GPT-4o)

Value: 43.9% of annotated bad cases linked to parametric/context conflicts
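A quick arithmetic check of the figures above, as a small sketch. The subset sizes, the 57 annotated cases, and the accuracy values come from this summary. The implied count of conflict cases is an inference from the rounded 43.9% share, not a number reported here.

```python
subset_sizes = {"EN": 13_301, "ZH": 7_757, "CF": 1_934}
total_cases = sum(subset_sizes.values())

annotated_bad_cases = 57
conflict_share = 0.439
# Implied number of parametric/context conflict cases (inferred, not reported).
implied_conflicts = round(annotated_bad_cases * conflict_share)
implied_share_pct = round(100 * implied_conflicts / annotated_bad_cases, 1)

# Remaining gap between Bi'an-qwen-14B and GPT-4o-0806 in accuracy points.
gap_to_gpt4o = round(84.8 - 83.4, 1)
```

The three subsets sum to the stated benchmark total, and the 43.9% share is consistent with 25 of the 57 annotated cases.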

Who Should Care

What To Try In 7 Days

Run Bi'anBench on your RAG system to baseline hallucination detection performance in English and Chinese.

Fine-tune an open 7B judge model on a subset of your RAG data using the paper's ensemble + SFT recipe and measure accuracy vs your current detector.

Run targeted counterfactual/context-conflict tests to see if your detector mistakes model knowledge for context evidence.
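Baselining a detector per language, as suggested above, might look like the following minimal sketch. The case schema (keys subset, context, response, is_hallucinated) and the toy overlap_judge are hypothetical stand-ins, since the benchmark's actual file format is not described here. In practice you would swap in your own detector for the judge argument.

```python
from collections import defaultdict


def accuracy_by_subset(cases, judge):
    """Return {subset: accuracy in percent} for a hallucination judge.

    `cases` is an iterable of dicts with hypothetical keys:
    "subset", "context", "response", "is_hallucinated".
    `judge(context, response)` returns True when it flags a hallucination.
    """
    hits, totals = defaultdict(int), defaultdict(int)
    for case in cases:
        predicted = judge(case["context"], case["response"])
        totals[case["subset"]] += 1
        hits[case["subset"]] += int(predicted == case["is_hallucinated"])
    return {name: 100.0 * hits[name] / totals[name] for name in totals}


def overlap_judge(context, response):
    """Toy detector: flags a response containing any word absent from the context."""
    context_words = set(context.lower().split())
    return any(word not in context_words for word in response.lower().split())


toy_cases = [
    {"subset": "EN", "context": "the bridge opened in 1932",
     "response": "the bridge opened in 1932", "is_hallucinated": False},
    {"subset": "EN", "context": "the bridge opened in 1932",
     "response": "the bridge opened in 1950", "is_hallucinated": True},
    {"subset": "ZH", "context": "大桥 于 1932 年 通车",
     "response": "大桥 于 1932 年 通车", "is_hallucinated": False},
    # A faithful paraphrase (在 instead of 于) that the toy judge wrongly flags.
    {"subset": "ZH", "context": "大桥 于 1932 年 通车",
     "response": "大桥 在 1932 年 通车", "is_hallucinated": False},
]
scores = accuracy_by_subset(toy_cases, overlap_judge)
```

Reporting accuracy per subset rather than one pooled number makes a language-specific weakness visible, as the paraphrase case in the toy ZH data shows.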

Optimization Features

System Optimization

  • DeepSpeed for distributed training

Training Optimization

  • LoRA
  • SFT
  • DPO

Reproducibility

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Training sample loss: samples where all ensemble models err were discarded, biasing training away from hard cases.
  • Task coverage excludes creative writing; subjective tasks are not represented.
  • Models still lag GPT-4o on numerical computation and some long-context tasks.
  • Planned data/model release was not available at time of writing (release pending).

When Not To Use

  • Detecting hallucinations in creative or highly subjective writing.
  • Tasks needing high-precision numerical computation or deep long-context reasoning where Bi'an models underperform GPT-4o.
  • Use cases that require a fully verified open-source release before deployment.

Failure Modes

  • Parametric knowledge vs context conflicts causing wrong judgments.
  • Loss of difficult training samples due to ensemble construction rules.
  • Synthetic perturbations may not cover all real-world hallucination patterns.
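The first failure mode can be probed with a small counterfactual test, sketched here under stated assumptions. Each probe pairs a passage edited to state a counterfactual with two answers: one supported by the passage, and one that is true in the real world but unsupported by the passage. The probe schema, the function conflict_error_rate, and both toy judges are hypothetical illustrations, not the paper's pipeline.

```python
def conflict_error_rate(probes, judge):
    """Percent of probes where the judge sides with world knowledge over the context.

    Each probe is a dict with hypothetical keys:
      "context"        - a passage edited to state a counterfactual
      "context_answer" - answer supported by the passage
      "world_answer"   - answer true in the real world, unsupported by the passage
    `judge(context, response)` returns True when it flags a hallucination.
    """
    errors = 0
    for probe in probes:
        flagged_faithful = judge(probe["context"], probe["context_answer"])
        passed_unsupported = not judge(probe["context"], probe["world_answer"])
        errors += int(flagged_faithful or passed_unsupported)
    return 100.0 * errors / len(probes)


WORLD_FACTS = {"paris", "1969"}  # toy stand-in for a model's parametric knowledge


def knowledge_biased_judge(context, response):
    """Toy judge that trusts its own facts and ignores the context entirely."""
    return response.lower() not in WORLD_FACTS


def context_only_judge(context, response):
    """Toy judge that checks only whether the response appears in the context."""
    return response.lower() not in context.lower()


probes = [
    {"context": "The capital of France is Lyon.",
     "context_answer": "Lyon", "world_answer": "Paris"},
    {"context": "The first crewed Moon landing took place in 1972.",
     "context_answer": "1972", "world_answer": "1969"},
]
biased_rate = conflict_error_rate(probes, knowledge_biased_judge)
grounded_rate = conflict_error_rate(probes, context_only_judge)
```

A high rate on probes like these suggests the detector is judging against what it believes rather than against the retrieved context.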

Core Entities

Models

  • Bi'an-qwen-7B
  • Bi'an-qwen-14B
  • Qwen2.5-7B-Instruct
  • Qwen2.5-14B-Instruct
  • Qwen2.5-72B-Instruct
  • Qwen2-7B-Instruct
  • GPT-4o-0806
  • GPT-4o-mini
  • Llama3.1-8B-Instruct
  • Llama3.1-70B-Instruct
  • Lynx-8B-v1.1

Metrics

  • Accuracy

Datasets

  • Bi'anBench
  • Bi'anBench_EN
  • Bi'anBench_ZH
  • Bi'anBench_CF
  • HaluEval
  • RAGTruth
  • HaluBench
  • ASQA
  • IfQA
  • WebNLG
  • WMT21
  • PDC
  • FinanceBench
  • DROP
  • PubMedQA
  • CovidQA
  • CRUD
  • WebQA1.0
  • LawBench
  • CSDS

Benchmarks

  • Bi'anBench
  • HaluEval
  • RAGTruth
  • HaluBench