Overview
Production Readiness
0.6
Novelty Score
0.7
Cost Impact Score
0.5
Citation Count
0
Why It Matters For Business
If you deploy voice or multilingual assistants, expect more factual errors when the model is queried in other languages or through audio input; test with parallel speech-text data to surface failures that text-only or English-only checks miss.
Summary TLDR
CCFQA is a new benchmark of parallel text and human speech question-answer pairs across eight languages (1,800 parallel pairs × 8 languages = 14,400 samples). It measures whether multimodal LLMs give consistent factual answers across languages and between text and speech. Evaluations show sizable accuracy drops when models move across languages or from text to audio. The authors also present LLM-SQA, an instruction-tuned model that uses English as a bridge with 5-shot transfer; it matches or beats some commercial audio models on the benchmark. The dataset and evaluation code are published to help stress-test multilingual speech understanding.
Problem Statement
Existing factuality and spoken-QA benchmarks focus on English or single modalities. There is no large, fully parallel speech-text multilingual benchmark to test whether multimodal LLMs keep facts consistent across languages and between text and audio.
Main Contribution
CCFQA: a parallel speech-text factual QA benchmark across 8 languages (1,800 parallel QAs per language; 14,400 samples total).
Systematic evaluation of 5 commercial and open MLLMs showing notable cross-lingual and cross-modal drops in factual accuracy.
A simple few-shot English-bridge transfer (5-shot) and an LLM-SQA model that improves spoken QA consistency and competes with GPT-4o-mini-Audio on this benchmark.
Key Findings
Dataset size and scope
Cross-lingual and cross-modal gaps are real and measurable
Few-shot English-bridge transfer works well
Speech data quality is high but language-dependent ASR errors exist
Results
Metrics reported: F1, WER, and cross-modal consistency (ratio).
Who Should Care
What To Try In 7 Days
Run CCFQA on your model to measure cross-lingual and cross-modal gaps.
Check ASR WER per language and re-record or improve ASR for high-WER languages.
Try English-as-bridge: 5-shot transfer for low-resource spoken QA before collecting large datasets.
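The first item above can be sketched as a small aggregation over per-sample judgments. This is a minimal sketch, not the benchmark's own evaluation code: the record fields (`language`, `modality`, `correct`), the placeholder language code `xx`, and the choice of English as the reference language are all assumptions made for illustration.

```python
from collections import defaultdict

def accuracy_by_group(records):
    """Return {(language, modality): accuracy} from per-sample judgments."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for r in records:
        key = (r["language"], r["modality"])
        totals[key] += 1
        correct[key] += int(r["correct"])
    return {key: correct[key] / totals[key] for key in totals}

def cross_modal_gaps(acc):
    """Text accuracy minus speech accuracy per language (positive: speech is worse)."""
    languages = {lang for lang, _ in acc}
    return {
        lang: acc[(lang, "text")] - acc[(lang, "speech")]
        for lang in sorted(languages)
        if (lang, "text") in acc and (lang, "speech") in acc
    }

def cross_lingual_gaps(acc, modality, reference="en"):
    """Reference-language accuracy minus each other language's accuracy."""
    ref = acc[(reference, modality)]
    return {
        lang: ref - a
        for (lang, m), a in sorted(acc.items())
        if m == modality and lang != reference
    }

# Toy data only; "xx" is a placeholder language code, not a CCFQA split name.
toy = [
    ("en", "text", True), ("en", "text", True),
    ("en", "speech", True), ("en", "speech", False),
    ("xx", "text", True), ("xx", "text", False),
    ("xx", "speech", False), ("xx", "speech", False),
]
records = [{"language": l, "modality": m, "correct": c} for l, m, c in toy]

acc = accuracy_by_group(records)
print(cross_modal_gaps(acc))
print(cross_lingual_gaps(acc, "speech"))
```

Reporting the two gaps separately helps distinguish a model that struggles with audio in general from one that struggles with a specific language.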
Optimization Features
Training Optimization
- Curriculum learning (ASR → SRT → SQA)
Reproducibility
Code Urls
Data Urls
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Only covers text and speech; no vision or other modalities included.
- The English-bridge few-shot approach introduces an English-centric bias, as the authors acknowledge.
- Evaluation uses an LLM judge (Gemma3-27B), which can reflect its own language biases.
When Not To Use
- If you need vision+speech multimodal stress tests (CCFQA is text+speech only).
- If target languages or dialects are not among the eight supported languages.
- For safety-critical verification of individual answers without human review, since judge LLMs can err.
Failure Modes
- High ASR error rates (WER) create spurious model failures on speech tests.
- Translation or back-translation during dataset construction can introduce subtle phrasing changes.
- Automated judge disagreements: LLM-based accuracy may mismatch human judgment in some languages.
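One way to check for the judge-disagreement failure mode is to hand-label a small sample and compare it with the LLM judge per language. The sketch below assumes hypothetical fields `judge_correct` and `human_correct` and an arbitrary 0.9 review threshold; none of these come from the paper.

```python
from collections import defaultdict

def judge_agreement_by_language(samples):
    """Fraction of samples per language where judge and human verdicts match."""
    agree = defaultdict(int)
    total = defaultdict(int)
    for s in samples:
        total[s["language"]] += 1
        agree[s["language"]] += int(s["judge_correct"] == s["human_correct"])
    return {lang: agree[lang] / total[lang] for lang in total}

def languages_needing_review(agreement, threshold=0.9):
    """Languages whose judge-human agreement falls below the (assumed) threshold."""
    return sorted(lang for lang, rate in agreement.items() if rate < threshold)

# Toy labels; "xx" is a placeholder language code.
samples = [
    {"language": "en", "judge_correct": True, "human_correct": True},
    {"language": "en", "judge_correct": False, "human_correct": False},
    {"language": "xx", "judge_correct": True, "human_correct": False},
    {"language": "xx", "judge_correct": True, "human_correct": True},
]

agreement = judge_agreement_by_language(samples)
print(agreement)
print(languages_needing_review(agreement))
```

Raw agreement is the simplest statistic; a chance-corrected measure may be preferable when most answers fall in one class.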
Core Entities
Models
- GPT-4o-mini
- GPT-4o-mini-Audio
- Phi-4-Multimodal
- Qwen2-Audio
- Qwen2.5-Omni-3B
- Qwen2.5-Omni-7B
- LLM-SQA (this paper)
- Gemma3-27B (judge)
- GemmaX2-9B (base for LLM-SQA)
Metrics
- F1
- Accuracy
- Word Error Rate (WER)
- Character Error Rate (CER)
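For reference, the string-based metrics listed above can be computed as follows. This is a minimal sketch using the standard definitions (token-overlap F1, and edit distance divided by reference length for WER and CER); the lowercasing and whitespace tokenization are assumptions, and the paper's exact text normalization may differ.

```python
from collections import Counter

def token_f1(prediction, reference):
    """Token-overlap F1 between a predicted answer and a reference answer."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (lists or strings)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,              # deletion
                curr[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),   # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1]

def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over reference word count."""
    ref = reference.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hypothesis.split()) / len(ref)

def cer(reference, hypothesis):
    """Character error rate, ignoring spaces."""
    ref = reference.replace(" ", "")
    if not ref:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref, hypothesis.replace(" ", "")) / len(ref)

print(token_f1("the capital is Paris", "Paris"))
print(wer("who wrote hamlet", "who wrote omelet"))
print(cer("abc", "abd"))
```

Whitespace splitting does not segment languages written without spaces, which is one reason CER is commonly reported alongside WER for such languages.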
Datasets
- CCFQA (this paper)
- MKQA
- MOOCCubeX
- FLEURS
Benchmarks
- SimpleQA
- TruthfulQA
- VoiceBench
- SD-QA
- SpeechIQ
Context Entities
Models
- Gemma3-27B (used as automatic judge)
- GemmaX2-9B (pretraining base)
- Whisper-large-v3
Metrics
- F1
- LLM Acc
- WER
- CER
Datasets
- CCFQA (released)
- MKQA
- MOOCCubeX
- FLEURS

