Audio-aware LLMs (Gemini, GPT‑4o-audio) can judge speaking styles with human-like agreement

June 6, 2025 · 6 min

Overview

Production Readiness

0.4

Novelty Score

0.3

Cost Impact Score

0.35

Citation Count

0

Authors

Cheng-Han Chiang, Xiaofei Wang, Chung-Ching Lin, Kevin Lin, Linjie Li, Radu Kopetz, Yao Qian, Zhendong Wang, Zhengyuan Yang, Hung-yi Lee, Lijuan Wang

Why It Matters For Business

Automated ALLM judges can speed up speech-style QA and model comparisons and cut their cost, replacing many routine human labels in style-focused tests.

Summary TLDR

The authors build StyleSet, a pair of tasks that test whether audio-aware LLMs (ALLMs) can automatically judge speaking style in generated speech. They evaluate four spoken language models (SLMs) using two ALLM judges (Gemini-2.5-pro and GPT-4o-audio) and human raters. On these style tasks, Gemini's scores correlate with human scores as well as or better than human raters correlate with each other. The results suggest that ALLMs can serve as automatic evaluators for speaking-style tests. The limitations are task scope (speaking styles only), English-only data, turn-taking dialogues, and pointwise scoring.

Problem Statement

Human evaluation of paralinguistic speech aspects (emotion, prosody, emphasis, non-verbal cues) is slow, costly, and noisy. The paper asks: can audio-aware LLMs act as automatic, reliable judges of speaking style for speech-generation models?

Main Contribution

Show ALLMs can automatically judge speaking styles for two targeted tasks.

Release StyleSet: 20 voice-style instruction-following instances and 20 role-playing contexts for automatic evaluation.

Empirically compare two ALLM judges (Gemini-2.5-pro, GPT‑4o-audio) with human raters and analyze strengths and limits.

Key Findings

The Gemini judge correlates with human raters on voice-style instruction-following at least as well as human raters correlate with each other.

Numbers: Pearson's r = 0.640 (Gemini–human) vs. 0.596 (human–human)

The Gemini judge also matches or exceeds human–human agreement on role-playing style scores.

Numbers: Pearson's r = 0.319 (Gemini–human) vs. 0.253 (human–human)

The best SLM (4o-audio) partially follows style instructions but is not perfect.

Numbers: Human average voice-style score for 4o-audio = 3.65 (Likert 1–5)

Gemini judge outputs are stable across generation temperature settings.

Numbers: Human–Gemini Pearson's r ranges from 0.640 to 0.649 across temperatures
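
The agreement figures above are Pearson correlations between two sets of scores for the same samples. The sketch below shows how such a value can be computed from paired Likert ratings. The score lists are toy values invented for illustration, and the function name `pearson_r` is hypothetical; the paper's exact aggregation protocol (for example, how multiple raters are paired or averaged) is not described in this summary and is not reproduced here.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Covariance and variances, left unnormalised because the 1/n factors cancel.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / math.sqrt(var_x * var_y)

# Toy Likert scores (1-5) for six samples; NOT data from the paper.
human_scores = [5, 3, 4, 2, 1, 4]
judge_scores = [4, 3, 5, 2, 1, 3]

r = pearson_r(human_scores, judge_scores)
print(f"judge-human Pearson's r = {r:.3f}")
```

Running the same function on two human raters' scores gives the human–human baseline that the judge–human value is compared against.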

Results

Voice-style score (human) for 4o-audio

Value: 3.65 (Likert 1–5)

Gemini–human agreement (Pearson's r) for voice-style instruction-following

Value: 0.640

Baseline: Human–human r = 0.596

Human–human agreement (Pearson's r) for role-playing (style)

Value: 0.253

Gemini–human agreement (Pearson's r) for role-playing (style)

Value: 0.319

Baseline: Human–human r = 0.253

Judge realism score: human-recorded vs 4o-generated

Value: Human-recorded speech scores notably higher on realism than SLM-generated speech

Who Should Care

What To Try In 7 Days

Run Gemini-2.5-pro as an automatic style judge on a small sample of your TTS/SLM outputs and compare to an existing human rater set.

Adopt the StyleSet voice-style instruction-following (IF) prompts to validate new speech model releases for basic style controllability.

Ensemble 3–5 Gemini judge responses per sample to get stable, reproducible style scores.
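
The ensembling step can be sketched as follows: query the judge several times per sample, then report the mean score and its spread. This is a minimal sketch under stated assumptions. `ensemble_score` and `fake_judge` are hypothetical names, and `fake_judge` is a canned stand-in that returns fixed scores; in practice it would be replaced by a real call to the ALLM judge that parses a 1–5 Likert score from the response.

```python
import itertools
from statistics import mean, pstdev
from typing import Callable, Dict

def ensemble_score(judge: Callable[[str], int], sample_id: str,
                   n_runs: int = 5) -> Dict[str, object]:
    """Call the judge n_runs times on one sample and aggregate the scores."""
    if n_runs < 1:
        raise ValueError("n_runs must be at least 1")
    scores = [judge(sample_id) for _ in range(n_runs)]
    for s in scores:
        if not 1 <= s <= 5:
            raise ValueError(f"score {s} is outside the 1-5 Likert range")
    # The mean is the ensembled score; the spread shows how much runs disagree.
    return {"mean": mean(scores), "spread": pstdev(scores), "scores": scores}

# Hypothetical stand-in for a real judge call; it cycles through canned scores.
_canned = itertools.cycle([4, 3, 4, 5, 4])

def fake_judge(sample_id: str) -> int:
    return next(_canned)

result = ensemble_score(fake_judge, "sample-001", n_runs=5)
print(result)
```

A large spread on a sample is a cue to route that sample to a human rater rather than trust the averaged score.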

Reproducibility

License

  • MIT (dataset planned to be released)

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Only evaluates speaking-style attributes, not other speech quality aspects.
  • Experiments are English-only; some SLMs may be stronger in Chinese.
  • Role-play dialogues use turn-taking, not full-duplex human conversation.
  • Only pointwise (single-instance) scoring; no pairwise comparisons studied.

When Not To Use

  • For general speech quality or intelligibility metrics outside style.
  • To compare similarly poor models where judges struggle to separate them.
  • For languages other than English without revalidation.

Failure Modes

  • Self-enhancement bias when a model judges itself (noted for 4o-audio judging 4o outputs).
  • Low discriminative power between closely performing, low-quality SLMs.
  • Judge outputs might reflect ALLM priors, not true human perception for unseen attributes.
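
One rough way to screen for the self-enhancement failure mode is to compare how far the judge's scores sit above human scores for its own model versus for other models. The sketch below is an illustrative check, not the analysis used in the paper. The function name `self_enhancement_excess`, the record layout, and all scores are assumptions made up for the example.

```python
from statistics import mean

def self_enhancement_excess(records, judge_model):
    """Return (excess, gaps): per-SLM mean of judge minus human score,
    and how much larger that gap is for the judge's own model."""
    diffs_by_slm = {}
    for rec in records:
        diff = rec["judge_score"] - rec["human_score"]
        diffs_by_slm.setdefault(rec["slm"], []).append(diff)
    gaps = {slm: mean(diffs) for slm, diffs in diffs_by_slm.items()}
    others = [gap for slm, gap in gaps.items() if slm != judge_model]
    if judge_model not in gaps or not others:
        raise ValueError("need scores for the judge's own model and another SLM")
    return gaps[judge_model] - mean(others), gaps

# Toy records, invented for illustration; NOT data from the paper.
records = [
    {"slm": "4o-audio", "judge_score": 5, "human_score": 4},
    {"slm": "4o-audio", "judge_score": 4, "human_score": 3},
    {"slm": "Step-Audio", "judge_score": 3, "human_score": 3},
    {"slm": "Step-Audio", "judge_score": 2, "human_score": 3},
    {"slm": "Qwen-2.5-Omni", "judge_score": 3, "human_score": 3},
    {"slm": "Qwen-2.5-Omni", "judge_score": 4, "human_score": 3},
]

excess, gaps = self_enhancement_excess(records, judge_model="4o-audio")
print(f"excess gap for the judge's own model: {excess:+.2f}")
```

A clearly positive excess suggests the judge is more generous to its own outputs, in which case a judge from a different model family is the safer choice for that comparison.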

Core Entities

Models

  • GPT-4o-audio (4o-audio)
  • GPT-4o-mini-audio (4o-mini-audio)
  • Gemini-2.5-pro
  • Step-Audio
  • Qwen-2.5-Omni

Metrics

  • Likert style score (1–5)
  • Realism binary (0/1)
  • Pearson's r (judge agreement)

Datasets

  • StyleSet (voice style IF + role-playing)
  • IEMOCAP (role-playing contexts source)

Benchmarks

  • StyleSet