CCFQA: parallel speech+text QA in 8 languages to measure cross-lingual and cross-modal factual consistency

Overview

Decision Snapshot: Ready For Pilot

The benchmark is ready for evaluation use; results are reported across many models, but judge bias and language coverage limit final production claims.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 1/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 70%

Authors

Yexing Du, Kaiyuan Liu, Youcheng Pan, Zheng Chu, Bo Yang, Xiaocheng Feng, Ming Liu, Yang Xiang

Links

Abstract / PDF / Code / Data

Why It Matters For Business

If you deploy voice or multilingual assistants, expect more factual errors when the model handles languages other than English or audio input instead of text. Testing with parallel speech-text data exposes failures that text-only English evaluation hides.

Who Should Care

ML Engineer, Product Manager, CTO, Data Scientist

Summary TLDR

CCFQA is a new benchmark of parallel text and human speech question-answer pairs across eight languages (1,800 parallel pairs × 8 languages = 14,400 samples). It measures whether multimodal LLMs give consistent factual answers across languages and between text and speech. Evaluations show sizable drops when models move across languages or from text to audio. The authors also present LLM-SQA, an instruction-tuned model that uses English as a bridge and 5-shot transfer; it matches or beats some commercial audio models on the benchmark. The dataset and evaluation code are published to help stress-test multilingual speech understanding.

Problem Statement

Existing factuality and spoken-QA benchmarks focus on English or single modalities. There is no large, fully parallel speech-text multilingual benchmark to test whether multimodal LLMs keep facts consistent across languages and between text and audio.

Main Contribution

CCFQA: a parallel speech-text factual QA benchmark across 8 languages (1,800 parallel QAs per language; 14,400 samples total).

Systematic evaluation of 5 commercial and open MLLMs showing notable cross-lingual and cross-modal drops in factual accuracy.

Key Findings

Dataset size and scope

Numbers: 1,800 parallel QAs × 8 languages = 14,400 samples

Practical Use: Use CCFQA to run controlled cross-language and cross-modality tests without building parallel speech data yourself.

Evidence Ref: Abstract; Benchmark Statistics (Table 3)

Cross-lingual and cross-modal gaps are real and measurable

Numbers: Example: GPT-4o-mini QA avg F1 63.9 vs XQA 59.7 (drop ~4.2 F1); cross-modal consistency for many models < 70

Practical Use: Expect accuracy to fall when moving from English text to other languages or to speech; validate multilingual/voice features explicitly.

Evidence Ref: Table 6 and Table 7
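
The F1 figures above come from the paper, whose exact scoring script is not reproduced here. As a rough illustration of how an answer-level F1 can be computed, the following is a minimal sketch assuming SQuAD-style token overlap with lowercase whitespace tokenization. Both the function name `token_f1` and the tokenization are assumptions, and whitespace splitting would not suit languages written without spaces.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted and a reference answer.

    Assumption: lowercase whitespace tokenization. This is an
    illustrative stand-in, not the benchmark's official scorer.
    """
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # If either side is empty, score 1.0 only when both are empty.
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears on both sides.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

Averaging such per-sample scores separately for text and speech inputs, and per language, gives the kind of breakdown needed to see cross-lingual and cross-modal gaps.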

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
F1	GPT-4o-mini QA avg F1 63.9	—	—	CCFQA (QA average)	Table 6 (QA avg row)	Table 6
F1	LLM-SQA SQA avg F1 52.0	GPT-4o-mini-Audio SQA avg F1 47.7	+4.3 F1	CCFQA (SQA average)	Table 6 (SQA rows)	Table 6
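
The deltas quoted in this summary are plain differences between two average F1 scores. A small sketch of that arithmetic, using only the numbers reported above (the helper name `f1_delta` is hypothetical):

```python
def f1_delta(value: float, baseline: float) -> float:
    """Difference in F1 points, rounded to one decimal place."""
    return round(value - baseline, 1)


# GPT-4o-mini: cross-lingual QA (XQA) average vs. QA average.
xqa_vs_qa = f1_delta(59.7, 63.9)

# LLM-SQA vs. GPT-4o-mini-Audio on the SQA average.
llm_sqa_vs_audio = f1_delta(52.0, 47.7)

print(f"XQA vs QA: {xqa_vs_qa:+.1f} F1")
print(f"LLM-SQA vs GPT-4o-mini-Audio: {llm_sqa_vs_audio:+.1f} F1")
```

A negative value marks a drop relative to the baseline and a positive value marks a gain.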

What To Try In 7 Days

Run CCFQA on your model to measure cross-lingual and cross-modal gaps.

Check ASR WER per language and re-record or improve ASR for high-WER languages.

Try English-as-bridge: 5-shot transfer for low-resource spoken QA before collecting large datasets.
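
For the per-language ASR check, word error rate is the word-level edit distance between the reference transcript and the ASR output, divided by the number of reference words. The following is a minimal sketch, assuming whitespace-separated words and a hypothetical input format of `(language, reference, hypothesis)` tuples. It is not the paper's evaluation code.

```python
from collections import defaultdict
from typing import Iterable


def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, ref_tok in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_tok in enumerate(hyp, start=1):
            cost = 0 if ref_tok == hyp_tok else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer_by_language(
    samples: Iterable[tuple[str, str, str]],
) -> dict[str, float]:
    """Corpus-level WER per language: total edits / total reference words."""
    edits: dict[str, int] = defaultdict(int)
    ref_words: dict[str, int] = defaultdict(int)
    for language, reference, hypothesis in samples:
        ref = reference.split()
        hyp = hypothesis.split()
        edits[language] += edit_distance(ref, hyp)
        ref_words[language] += len(ref)
    # Skip languages with no reference words to avoid dividing by zero.
    return {lang: edits[lang] / ref_words[lang]
            for lang in edits if ref_words[lang] > 0}
```

For languages written without spaces between words, splitting each string into characters instead of words turns the same routine into a character error rate (CER), which this summary also lists among the metrics.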

Optimization Features

Training Optimization

Curriculum learning (ASR → SRT → SQA)

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/yxduir/ccfqa

Data URLs

https://github.com/yxduir/ccfqa

Risks & Boundaries

Limitations

Only covers text and speech; no vision or other modalities included.

The English-bridge few-shot setup introduces an English-centered bias, which the authors acknowledge.

When Not To Use

If you need vision+speech multimodal stress tests (CCFQA is text+speech only).

If target languages or dialects are not among the eight supported languages.

Failure Modes

High ASR word error rates (WER) can produce spurious failures on speech tests: the model is penalized for a transcription error rather than a factual one.

Translation or back-translation during dataset construction can introduce subtle phrasing changes.

Core Entities

Models

GPT-4o-mini, GPT-4o-mini-Audio, Phi-4-Multimodal, Qwen2-Audio, Qwen2.5-Omni-3B, Qwen2.5-Omni-7B, LLM-SQA (ours), Gemma3-27B (judge), GemmaX2-9B (base for LLM-SQA)

Metrics

F1, Accuracy, Word Error Rate (WER), Character Error Rate (CER)

Datasets

CCFQA (this paper), MKQA, MOOCCubeX, FLEURS

Benchmarks

SimpleQA, TruthfulQA, VoiceBench, SD-QA, SpeechIQ

Context Entities

Models

Gemma3-27B (used as automatic judge), GemmaX2-9B (pretraining base), Whisper-large-v3

Metrics

F1, LLM Acc, WER, CER

Datasets

CCFQA (released), MKQA, MOOCCubeX, FLEURS
