CCFQA: parallel speech+text QA in 8 languages to measure cross-lingual and cross-modal factual consistency

August 10, 2025 | 7 min

Overview

Production Readiness

0.6

Novelty Score

0.7

Cost Impact Score

0.5

Citation Count

0

Authors

Yexing Du, Kaiyuan Liu, Youcheng Pan, Zheng Chu, Bo Yang, Xiaocheng Feng, Ming Liu, Yang Xiang


Why It Matters For Business

If you deploy voice or multilingual assistants, expect more factual errors when the model handles non-English languages or audio input. Testing with parallel speech-text data exposes these hidden failures.

Summary TLDR

CCFQA is a new benchmark of parallel text and human speech question-answer pairs across eight languages (1,800 parallel pairs × 8 = 14,400 samples). It measures whether multimodal LLMs give consistent factual answers across languages and between text and speech. Evaluations show sizable drops when models move across languages or from text to audio. The authors also present LLM-SQA, an instruction-tuned model that uses English as a bridge and 5-shot transfer; it matches or beats some commercial audio models on the benchmark. The dataset and evaluation code are published to help stress-test multilingual speech understanding.

Problem Statement

Existing factuality and spoken-QA benchmarks focus on English or single modalities. There is no large, fully parallel speech-text multilingual benchmark to test whether multimodal LLMs keep facts consistent across languages and between text and audio.

Main Contribution

CCFQA: a parallel speech-text factual QA benchmark across 8 languages (1,800 parallel QAs per language; 14,400 samples total).

Systematic evaluation of 5 commercial and open MLLMs showing notable cross-lingual and cross-modal drops in factual accuracy.

A simple few-shot English-bridge transfer (5-shot) and an LLM-SQA model that improves spoken QA consistency and competes with GPT-4o-mini-Audio on this benchmark.

Key Findings

Dataset size and scope

Numbers: 1,800 parallel QAs × 8 languages = 14,400 samples

Cross-lingual and cross-modal gaps are real and measurable

Numbers: for example, GPT-4o-mini averages 63.9 F1 on QA vs 59.7 on XQA (a drop of about 4.2 F1); cross-modal consistency is below 70 for many models
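To make the gap figure concrete, here is a minimal sketch of how an averaged F1 score and the QA-to-XQA drop could be computed. It assumes the common token-overlap F1 used in extractive QA with whitespace tokenization; the source does not spell out the benchmark's exact tokenization or normalization, and whitespace splitting would not suit Mandarin or Cantonese. The function names are illustrative, not taken from the released evaluation code.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted and a reference answer (0.0 to 1.0)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often as it
    # appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def average_f1(pairs) -> float:
    """Mean F1 over (prediction, reference) pairs, on a 0 to 100 scale."""
    return 100.0 * sum(token_f1(p, r) for p, r in pairs) / len(pairs)


# Averages reported in the summary for GPT-4o-mini.
qa_avg_f1 = 63.9
xqa_avg_f1 = 59.7
drop = round(qa_avg_f1 - xqa_avg_f1, 1)
print(drop)  # 4.2
```

Running the same averaging per language, rather than over the pooled set, shows which languages account for most of the drop.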

Few-shot English-bridge transfer works well

Numbers: LLM-SQA SQA avg F1 52.0 vs GPT-4o-mini-Audio 47.7; LLM Acc ~40.3 vs 40.4

Speech data quality is high but language-dependent ASR errors exist

Numbers: WER by language: Eng 3.2%, Cmn 6.8%, Fra 13.8%, Rus 18.2%, Yue 16.8%
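For readers who want to reproduce this kind of per-language check on their own recordings, the following is a minimal sketch of corpus-level WER via word-level edit distance, with a character-level switch (CER) for languages written without spaces. It assumes you already have reference transcripts and ASR hypotheses as strings; the sample pairs are invented for illustration and are not CCFQA data.

```python
from typing import Iterable, Sequence, Tuple


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance (insertions, deletions, substitutions) between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, ref_tok in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_tok in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_tok != hyp_tok)
            curr[j] = min(prev[j] + 1, curr[j - 1] + 1, substitution)
        prev = curr
    return prev[-1]


def corpus_error_rate(pairs: Iterable[Tuple[str, str]], char_level: bool = False) -> float:
    """Total edits divided by total reference units, as a percentage.

    char_level=False gives WER; char_level=True gives CER (spaces ignored).
    """
    errors = 0
    total = 0
    for reference, hypothesis in pairs:
        if char_level:
            ref = list(reference.replace(" ", ""))
            hyp = list(hypothesis.replace(" ", ""))
        else:
            ref = reference.split()
            hyp = hypothesis.split()
        errors += edit_distance(ref, hyp)
        total += len(ref)
    if total == 0:
        raise ValueError("references contain no words or characters")
    return 100.0 * errors / total


# Invented example pairs: (reference transcript, ASR hypothesis).
sample_pairs = [
    ("who wrote hamlet", "who wrote omelet"),
    ("what is the capital of france", "what is the capital of france"),
]
print(round(corpus_error_rate(sample_pairs), 1))  # 11.1
```

Grouping the pairs by language before calling `corpus_error_rate` yields a per-language table comparable to the one above.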

Results

F1

Value: GPT-4o-mini QA avg F1 63.9

F1

Value: LLM-SQA SQA avg F1 52.0

Baseline: GPT-4o-mini-Audio SQA avg F1 47.7

WER

Value: English WER 3.2%; Mandarin 6.8%; Russian 18.2%

Cross-modal consistency (ratio)

Value: Qwen2.5-Omni-7B cross-modal consistency 90.3 (F1 ratio)
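The summary reports cross-modal consistency as an F1 ratio without giving the formula. The sketch below assumes one plausible reading: speech F1 divided by text F1 for the same model and questions, scaled to 100. That definition, the function names, and the scores in `hypothetical_scores` are assumptions for illustration, not values from the paper.

```python
def consistency_ratio(speech_f1: float, text_f1: float) -> float:
    """Assumed definition: speech-input F1 as a percentage of text-input F1."""
    if text_f1 <= 0:
        raise ValueError("text_f1 must be positive")
    return 100.0 * speech_f1 / text_f1


def flag_low_consistency(scores: dict, threshold: float = 70.0) -> list:
    """Return languages whose consistency ratio falls below the threshold.

    scores maps a language code to a (text_f1, speech_f1) tuple.
    """
    return sorted(
        lang
        for lang, (text_f1, speech_f1) in scores.items()
        if consistency_ratio(speech_f1, text_f1) < threshold
    )


# Hypothetical per-language scores, not taken from the benchmark.
hypothetical_scores = {"eng": (60.0, 54.0), "rus": (50.0, 30.0)}
print(consistency_ratio(54.0, 60.0))              # 90.0
print(flag_low_consistency(hypothetical_scores))  # ['rus']
```

The default threshold of 70 mirrors the level that, per the findings above, many models fall below.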

What To Try In 7 Days

Run CCFQA on your model to measure cross-lingual and cross-modal gaps.

Check ASR WER per language and re-record or improve ASR for high-WER languages.

Try English-as-bridge: 5-shot transfer for low-resource spoken QA before collecting large datasets.
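As a starting point for the English-as-bridge experiment, here is a minimal sketch of assembling a data mix that keeps every English example and adds only five examples per target language. The summary does not say whether the five shots are used in the prompt or for fine-tuning; this sketch only covers the sampling step, and the record layout (`lang`, `question`, `answer` keys) and function name are hypothetical.

```python
import random
from typing import Dict, List


def build_bridge_mix(
    examples: List[Dict[str, str]],
    bridge_lang: str = "eng",
    shots: int = 5,
    seed: int = 0,
) -> List[Dict[str, str]]:
    """Keep all bridge-language examples plus up to `shots` per other language."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    by_lang: Dict[str, List[Dict[str, str]]] = {}
    for ex in examples:
        by_lang.setdefault(ex["lang"], []).append(ex)

    mix = list(by_lang.get(bridge_lang, []))
    for lang, items in sorted(by_lang.items()):
        if lang == bridge_lang:
            continue
        # Languages with fewer than `shots` examples contribute all they have.
        mix.extend(rng.sample(items, min(shots, len(items))))
    return mix


# Hypothetical pool: 20 English, 10 French, 3 Cantonese examples.
pool = (
    [{"lang": "eng", "question": f"q{i}", "answer": f"a{i}"} for i in range(20)]
    + [{"lang": "fra", "question": f"q{i}", "answer": f"a{i}"} for i in range(10)]
    + [{"lang": "yue", "question": f"q{i}", "answer": f"a{i}"} for i in range(3)]
)
mix = build_bridge_mix(pool)
print(len(mix))  # 28
```

Comparing scores with `shots=0` and `shots=5` on held-out questions gives a cheap read on how much the few target-language examples help before investing in larger data collection.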

Optimization Features

Training Optimization

  • Curriculum learning (ASR → SRT → SQA)

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Only covers text and speech; no vision or other modalities included.
  • The English-bridge few-shot approach can introduce an English-centered bias, as the authors note.
  • Evaluation uses an LLM judge (Gemma3-27B), which can reflect its own language biases.

When Not To Use

  • If you need vision+speech multimodal stress tests (CCFQA is text+speech only).
  • If target languages or dialects are not among the eight supported languages.
  • For safety-critical verification of individual answers without human review, since judge LLMs can err.

Failure Modes

  • High ASR error rates (WER) create spurious model failures on speech tests.
  • Translation or back-translation during dataset construction can introduce subtle phrasing changes.
  • Automated judge disagreements: LLM-based accuracy may mismatch human judgment in some languages.
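One way to gauge the judge-disagreement risk is to hand-label a small sample and compare it with the judge's verdicts per language. The sketch below computes a simple agreement rate; the record format and the sample records are hypothetical, and nothing here reflects how the authors validated their judge.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def agreement_by_language(
    records: Iterable[Tuple[str, bool, bool]],
) -> Dict[str, float]:
    """Fraction of items where the LLM judge and the human label agree, per language.

    Each record is (language, judge_says_correct, human_says_correct).
    """
    totals: Dict[str, int] = defaultdict(int)
    matches: Dict[str, int] = defaultdict(int)
    for lang, judge_ok, human_ok in records:
        totals[lang] += 1
        matches[lang] += int(judge_ok == human_ok)
    return {lang: matches[lang] / totals[lang] for lang in totals}


# Hypothetical spot-check labels.
spot_check = [
    ("eng", True, True),
    ("eng", False, False),
    ("rus", True, False),
    ("rus", True, True),
]
print(agreement_by_language(spot_check))  # {'eng': 1.0, 'rus': 0.5}
```

Languages with noticeably lower agreement are the ones where judge-based accuracy deserves the most human review.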

Core Entities

Models

  • GPT-4o-mini
  • GPT-4o-mini-Audio
  • Phi-4-Multimodal
  • Qwen2-Audio
  • Qwen2.5-Omni-3B
  • Qwen2.5-Omni-7B
  • LLM-SQA (proposed in this paper)
  • Gemma3-27B (judge)
  • GemmaX2-9B (base for LLM-SQA)

Metrics

  • F1
  • Accuracy
  • Word Error Rate (WER)
  • Character Error Rate (CER)

Datasets

  • CCFQA (this paper)
  • MKQA
  • MOOCCubeX
  • FLEURS

Benchmarks

  • SimpleQA
  • TruthfulQA
  • VoiceBench
  • SD-QA
  • SpeechIQ

Context Entities

Models

  • Gemma3-27B (used as automatic judge)
  • GemmaX2-9B (pretraining base)
  • Whisper-large-v3

Metrics

  • F1
  • LLM Acc
  • WER
  • CER

Datasets

  • CCFQA (released)
  • MKQA
  • MOOCCubeX
  • FLEURS