CCFQA: parallel speech+text QA in 8 languages to measure cross-lingual and cross-modal factual consistency

August 10, 2025 (7 min read)

Overview

Decision Snapshot: Ready For Pilot

The benchmark is ready for evaluation use; results are reported across many models, but judge bias and language coverage limit final production claims.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 1/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 70%

Authors

Yexing Du, Kaiyuan Liu, Youcheng Pan, Zheng Chu, Bo Yang, Xiaocheng Feng, Ming Liu, Yang Xiang

Links

Abstract / PDF / Code / Data

Why It Matters For Business

If you deploy voice or multilingual assistants, expect more factual errors when the model handles non-English languages or audio input. Test with parallel speech-text data to surface failures that English text evaluation hides.

Who Should Care

Summary TLDR

CCFQA is a new benchmark of parallel text and human speech question-answer pairs across eight languages (1,800 parallel pairs ×8 = 14,400 samples). It measures whether multimodal LLMs give consistent factual answers across languages and between text and speech. Evaluations show sizable drops when models move across languages or from text to audio. The authors also present LLM-SQA, an instruction-tuned model that uses English as a bridge and 5-shot transfer; it matches or beats some commercial audio models on the benchmark. The dataset and evaluation code are published to help stress-test multilingual speech understanding.

Problem Statement

Existing factuality and spoken-QA benchmarks focus on English or single modalities. There is no large, fully parallel speech-text multilingual benchmark to test whether multimodal LLMs keep facts consistent across languages and between text and audio.

Main Contribution

CCFQA: a parallel speech-text factual QA benchmark across 8 languages (1,800 parallel QAs per language; 14,400 samples total).

Systematic evaluation of 5 commercial and open MLLMs showing notable cross-lingual and cross-modal drops in factual accuracy.

Key Findings

Dataset size and scope

Numbers: 1,800 parallel QAs × 8 languages = 14,400 samples

Practical Use: Use CCFQA to run controlled cross-language and cross-modality tests without building parallel speech data yourself.

Evidence Ref: Abstract; Benchmark Statistics (Table 3)
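Because every question exists in all eight languages and in both text and speech, results can be grouped by question to check whether a model stays correct across conditions. The sketch below is a minimal illustration of that idea, not the paper's consistency metric: the function name `consistency_rate`, the record layout, and the condition labels in the toy data are all assumptions made for this example.

```python
from collections import defaultdict

N_QUESTIONS = 1800  # parallel QA pairs per language (from the summary)
N_LANGUAGES = 8     # languages covered (from the summary)
assert N_QUESTIONS * N_LANGUAGES == 14_400  # total samples reported

def consistency_rate(results):
    """Share of questions answered correctly under every observed condition.

    `results` is an iterable of (question_id, condition, is_correct) tuples.
    This layout is hypothetical; adapt it to your own evaluation output.
    """
    by_question = defaultdict(list)
    for question_id, _condition, is_correct in results:
        by_question[question_id].append(bool(is_correct))
    if not by_question:
        return 0.0
    consistent = sum(all(flags) for flags in by_question.values())
    return consistent / len(by_question)

# Toy records with made-up condition labels.
toy = [
    ("q1", "en-text", True), ("q1", "en-speech", True),
    ("q2", "en-text", True), ("q2", "en-speech", False),
]
print(consistency_rate(toy))  # 0.5
```

A stricter or looser rule (for example, agreement between answers rather than joint correctness) may match your use case better; the grouping step stays the same.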

Cross-lingual and cross-modal gaps are real and measurable

Numbers: Example: GPT-4o-mini QA avg F1 63.9 vs XQA 59.7 (drop ~4.2 F1); cross-modal consistency for many models < 70

Practical Use: Expect accuracy to fall when moving from English text to other languages or to speech; validate multilingual/voice features explicitly.

Evidence Ref: Table 6 and Table 7
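To make the F1 figures above concrete, here is a minimal sketch of a token-overlap F1 in the style commonly used for extractive QA, followed by the QA-to-XQA gap computed from the two reported averages. It is an assumption that the benchmark's F1 works this way; the summary does not give the exact formula, and the whitespace tokenization used here would need replacing for languages written without spaces.

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted and a reference answer.

    Assumed scoring rule for illustration, not confirmed by the source.
    """
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(round(token_f1("Paris France", "Paris"), 3))  # 0.667

# Gap between the two GPT-4o-mini averages reported above.
qa_avg_f1, xqa_avg_f1 = 63.9, 59.7
gap = round(qa_avg_f1 - xqa_avg_f1, 1)
print(gap)  # 4.2
```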

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
F1 | GPT-4o-mini QA avg F1 63.9 | - | - | CCFQA (QA average) | Table 6 (QA avg row) | Table 6
F1 | LLM-SQA SQA avg F1 52.0 | GPT-4o-mini-Audio SQA avg F1 47.7 | +4.3 F1 | CCFQA (SQA average) | Table 6 (SQA rows) | Table 6

What To Try In 7 Days

Run CCFQA on your model to measure cross-lingual and cross-modal gaps.

Check ASR WER per language and re-record or improve ASR for high-WER languages.

Try English-as-bridge: 5-shot transfer for low-resource spoken QA before collecting large datasets.
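The per-language WER check in the list above can be sketched with a plain word-level edit distance. This is a minimal, dependency-free illustration; the function names, the sample layout, and the language labels in the toy data are assumptions made for this example, and whitespace tokenization is only suitable for languages that separate words with spaces (character error rate is the usual substitute otherwise).

```python
def word_errors(reference: str, hypothesis: str) -> tuple[int, int]:
    """Return (edit distance in words, number of reference words)."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # distances against an empty reference
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1], len(ref)

def wer_by_language(samples):
    """samples: iterable of (language, reference_text, asr_output)."""
    errors, words = {}, {}
    for lang, ref, hyp in samples:
        e, n = word_errors(ref, hyp)
        errors[lang] = errors.get(lang, 0) + e
        words[lang] = words.get(lang, 0) + n
    # Pool errors over all words of a language rather than averaging per utterance.
    return {lang: errors[lang] / words[lang] for lang in errors if words[lang]}

# Toy transcripts with illustrative language labels.
samples = [
    ("en", "who wrote hamlet", "who wrote hamlet"),
    ("en", "capital of france", "capital of trance"),
    ("de", "wer schrieb faust", "wer schrieb"),
]
for lang, wer in wer_by_language(samples).items():
    print(lang, round(wer, 3))
```

Languages whose pooled WER stands out are the ones to re-record or route through a better ASR system before attributing speech-side failures to the QA model itself.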

Optimization Features

Training Optimization
Curriculum learning (ASR → SRT → SQA)

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Only covers text and speech; no vision or other modalities included.

The English-bridge few-shot approach introduces an English-centered bias, which the authors acknowledge.

When Not To Use

If you need vision+speech multimodal stress tests (CCFQA is text+speech only).

If target languages or dialects are not among the eight supported languages.

Failure Modes

High ASR error rates (WER) create spurious model failures on speech tests.

Translation or back-translation during dataset construction can introduce subtle phrasing changes.

Core Entities

Models

GPT-4o-mini, GPT-4o-mini-Audio, Phi-4-Multimodal, Qwen2-Audio, Qwen2.5-Omni-3B, Qwen2.5-Omni-7B, LLM-SQA (ours), Gemma3-27B (judge), GemmaX2-9B (base for LLM-SQA)

Metrics

F1, Accuracy, Word Error Rate (WER), Character Error Rate (CER)

Datasets

CCFQA (this paper), MKQA, MOOCCubeX, FLEURS

Benchmarks

SimpleQA, TruthfulQA, VoiceBench, SD-QA, SpeechIQ

Context Entities

Models

Gemma3-27B (used as automatic judge), GemmaX2-9B (pretraining base), Whisper-large-v3

Metrics

F1, LLM Acc, WER, CER

Datasets

CCFQA (released), MKQA, MOOCCubeX, FLEURS