Overview
The benchmark is ready for evaluation use, with results reported across five commercial and open MLLMs, but judge bias and limited language coverage mean it cannot yet support final production claims.
Citations: 0
Evidence Strength: 0.80
Confidence: 0.80
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 1/4
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 50%
Production readiness: 60%
Novelty: 70%
Why It Matters For Business
If you deploy voice or multilingual assistants, factual errors rise when models face other languages or audio inputs; test with parallel speech-text data to find hidden failures.
Summary TLDR
CCFQA is a new benchmark of parallel text and human-speech question-answer pairs across eight languages (1,800 parallel pairs × 8 languages = 14,400 samples). It measures whether multimodal LLMs give consistent factual answers across languages and between text and speech. Evaluations show sizable accuracy drops when models move across languages or from text to audio. The authors also present LLM-SQA, an instruction-tuned model that uses English as a bridge language with 5-shot transfer; on the benchmark it matches or beats some commercial audio models. The dataset and evaluation code are published to help stress-test multilingual speech understanding.
Problem Statement
Existing factuality and spoken-QA benchmarks focus on English or single modalities. There is no large, fully parallel speech-text multilingual benchmark to test whether multimodal LLMs keep facts consistent across languages and between text and audio.
Main Contribution
CCFQA: a parallel speech-text factual QA benchmark across 8 languages (1,800 parallel QAs per language; 14,400 samples total).
Systematic evaluation of 5 commercial and open MLLMs showing notable cross-lingual and cross-modal drops in factual accuracy.
Key Findings
Dataset size and scope
Cross-lingual and cross-modal gaps are real and measurable
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| F1 | GPT-4o-mini QA avg F1 63.9 | — | — | CCFQA (QA average) | Table 6 (QA avg row) | Table 6 |
| F1 | LLM-SQA SQA avg F1 52.0 | GPT-4o-mini-Audio SQA avg F1 47.7 | +4.3 F1 | CCFQA (SQA average) | Table 6 (SQA rows) | Table 6 |
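
The Delta column can be checked directly from the two SQA averages in the table. The snippet below is a minimal sketch of that arithmetic; the helper name `f1_delta` is introduced here for illustration and is not part of the published evaluation code.

```python
# Values copied from the results table (CCFQA, SQA average F1, Table 6).
LLM_SQA_F1 = 52.0
GPT4O_MINI_AUDIO_F1 = 47.7


def f1_delta(value: float, baseline: float) -> float:
    """Absolute F1 difference in points, rounded to one decimal place."""
    return round(value - baseline, 1)


delta = f1_delta(LLM_SQA_F1, GPT4O_MINI_AUDIO_F1)
print(f"{delta:+.1f} F1")  # prints "+4.3 F1"
```

The delta is an absolute difference in F1 points, not a relative improvement.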
What To Try In 7 Days
Run CCFQA on your model to measure cross-lingual and cross-modal gaps.
Check ASR WER per language and re-record or improve ASR for high-WER languages.
Try English-as-bridge: 5-shot transfer for low-resource spoken QA before collecting large datasets.
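
The first step above (measuring cross-lingual and cross-modal gaps) could be sketched roughly as follows. Everything here is an assumption made for illustration: the record fields (`lang`, `modality`, `prediction`, `reference`), the toy data, and the whitespace-token F1, which may differ from the scoring used in the published CCFQA evaluation code.

```python
from collections import defaultdict
from statistics import mean


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted and a reference answer.

    Assumption: whitespace tokenization. Languages written without spaces
    would need a different tokenizer.
    """
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        return float(pred == ref)
    remaining = list(ref)  # reference tokens not yet matched
    overlap = 0
    for tok in pred:
        if tok in remaining:
            remaining.remove(tok)
            overlap += 1
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def mean_f1_by(records, key):
    """Average token F1 grouped by a record field such as 'lang' or 'modality'."""
    buckets = defaultdict(list)
    for rec in records:
        buckets[rec[key]].append(token_f1(rec["prediction"], rec["reference"]))
    return {group: mean(scores) for group, scores in buckets.items()}


# Hypothetical toy outputs: the same question asked in two languages and two modalities.
records = [
    {"lang": "en", "modality": "text", "prediction": "Paris", "reference": "Paris"},
    {"lang": "en", "modality": "speech", "prediction": "Paris", "reference": "Paris"},
    {"lang": "es", "modality": "text", "prediction": "París", "reference": "París"},
    {"lang": "es", "modality": "speech", "prediction": "Madrid", "reference": "París"},
]

by_lang = mean_f1_by(records, "lang")
by_modality = mean_f1_by(records, "modality")

# Cross-modal gap: text score minus speech score.
cross_modal_gap = by_modality["text"] - by_modality["speech"]
# Cross-lingual gap: English score minus the weakest language's score.
cross_lingual_gap = by_lang["en"] - min(by_lang.values())

print(by_lang, by_modality, cross_modal_gap, cross_lingual_gap)
```

Because the benchmark is fully parallel, the same question appears in every language and modality, so a gap reflects the model rather than a difference in question difficulty.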
Risks & Boundaries
Limitations
Only covers text and speech; no vision or other modalities included.
The English-bridge few-shot approach introduces an English-centered bias, as the authors acknowledge.
When Not To Use
If you need vision+speech multimodal stress tests (CCFQA is text+speech only).
If target languages or dialects are not among the eight supported languages.
Failure Modes
High ASR error rates (WER) create spurious model failures on speech tests.
Translation or back-translation during dataset construction can introduce subtle phrasing changes.
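
To separate ASR errors from genuine model failures, it can help to compute WER per language before interpreting speech scores. The following is a minimal sketch using word-level edit distance; the `flag_high_wer` helper, its `threshold` of 0.3, and the example values are arbitrary assumptions, not figures from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


def flag_high_wer(wer_by_lang: dict, threshold: float = 0.3) -> list:
    """Return languages whose WER exceeds the (arbitrary) threshold, sorted."""
    return sorted(lang for lang, wer in wer_by_lang.items() if wer > threshold)


# Hypothetical per-language WER values, for illustration only.
example_wer = {"en": 0.05, "es": 0.12, "xx": 0.41}
flagged = flag_high_wer(example_wer)
print(flagged)
```

Speech failures in a flagged language are then candidates for re-recording or ASR improvement rather than evidence of a factuality gap.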

