Audio-aware LLMs (Gemini, GPT‑4o-audio) can judge speaking styles with human-like agreement

June 6, 2025, 6 min read

Overview

Decision Snapshot: Needs Validation

Empirical results show Gemini matches or exceeds human–human agreement on the two speaking-style tasks tested, but the experiments are limited in scope (style only) and language (English only).

Citations: 0

Evidence Strength: 0.70

Confidence: 0.87

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 3/5

Reproducibility

Status: Partial assets available

Open source: Partial

License: MIT (dataset planned to be released)

At A Glance

Cost impact: 35%

Production readiness: 40%

Novelty: 30%

Authors

Cheng-Han Chiang, Xiaofei Wang, Chung-Ching Lin, Kevin Lin, Linjie Li, Radu Kopetz, Yao Qian, Zhendong Wang, Zhengyuan Yang, Hung-yi Lee, Lijuan Wang

Why It Matters For Business

Automated ALLM judges can speed up speech-style QA and model comparisons and cut their cost by replacing many routine human labels in style-focused tests.

Summary TLDR

The authors build StyleSet, a set of two tasks that test whether audio-aware LLMs (ALLMs) can automatically judge speaking style in generated speech. They evaluate four spoken language models (SLMs) using two ALLM judges (Gemini-2.5-pro and GPT-4o-audio) and human raters. Gemini's scores correlate with human scores as well as or better than human scores correlate with each other on these style tasks. The results show that ALLMs can serve as automatic evaluators for speaking-style tests, but the study is limited in task scope (speaking styles only) and restricted to English-only data, turn-taking dialogues, and pointwise scoring.

Problem Statement

Human evaluation of paralinguistic speech aspects (emotion, prosody, emphasis, non-verbal cues) is slow, costly, and noisy. The paper asks: can audio-aware LLMs act as automatic, reliable judges of speaking style for speech-generation models?

Main Contribution

Show ALLMs can automatically judge speaking styles for two targeted tasks.

Release StyleSet: 20 voice-style instruction-following instances and 20 role-playing contexts for automatic evaluation.

Key Findings

The Gemini judge agrees with human raters on voice-style instruction following slightly better than human raters agree with each other.

Numbers: Pearson's r = 0.640 (Gemini–human) vs. 0.596 (human–human)

Practical Use: Use Gemini to approximate human judgments for style IF tests and reduce human labeling needs for this task.

Evidence Ref: Table 2, Section 4.2

ALLM (Gemini) also matches or exceeds human agreement on role-playing style scores.

Numbers: Pearson's r = 0.319 (Gemini–human) vs. 0.253 (human–human)

Practical Use: Gemini can replace some human evaluations for dialogue-style scoring, especially where human agreement is low.

Evidence Ref: Table 2, Section 4.3
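The agreement figures above compare two correlations: the judge against human raters, and human raters against each other. A minimal sketch of that comparison follows, assuming every rater scores the same items on the 1–5 scale and that agreement is averaged over rater pairs. The helper names (`pearson_r`, `human_human_r`, `judge_human_r`) and the averaging scheme are illustrative assumptions rather than the paper's stated protocol, and the toy scores are made up.

```python
import math
from itertools import combinations
from statistics import mean


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


def human_human_r(human_scores):
    """Mean pairwise Pearson's r across human raters (assumed aggregation)."""
    return mean(pearson_r(a, b) for a, b in combinations(human_scores, 2))


def judge_human_r(judge_scores, human_scores):
    """Mean Pearson's r between the judge and each human rater (assumed)."""
    return mean(pearson_r(judge_scores, h) for h in human_scores)


# Made-up Likert scores (1-5) for five items; not data from the paper.
humans = [
    [1, 2, 3, 4, 5],
    [2, 1, 4, 3, 5],
    [1, 3, 2, 5, 4],
]
judge = [1, 2, 3, 5, 4]

print(f"human-human r: {human_human_r(humans):.3f}")
print(f"judge-human r: {judge_human_r(judge, humans):.3f}")
```

If the judge-human value is at or above the human-human value on your own data, the judge is agreeing with raters about as well as they agree among themselves, which is the pattern the findings above report.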

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| Voice-style score (human) for 4o-audio | 3.65 (Likert 1–5) | | | StyleSet: Voice Style IF | Humans rate 4o-audio highest with avg 3.65 | Table 1, Section 4.2 |
| Gemini–human agreement (Pearson's r) | 0.640 | Human–human r = 0.596 | +0.044 | StyleSet: Voice Style IF (4 models × 20 instances) | Gemini achieves higher correlation with humans than human-human average | Table 2, Section 4.2 |

What To Try In 7 Days

Run Gemini-2.5-pro as an automatic style judge on a small sample of your TTS/SLM outputs and compare to an existing human rater set.

Adopt the StyleSet voice-style IF prompts to validate new speech model releases for basic style controllability.

Ensemble 3–5 Gemini judge responses per sample to get stable, reproducible style scores.
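The ensembling step above can be sketched as follows. This is a minimal illustration, assuming the judge is exposed as a callable that returns an integer Likert score from 1 to 5 and that repeated responses are combined by a plain mean. The function `ensemble_judge`, the stand-in `fake_judge`, and the file name are hypothetical; a real setup would replace `fake_judge` with a call to the judge model.

```python
import itertools
from statistics import mean, pstdev
from typing import Callable, Dict


def ensemble_judge(judge: Callable[[str], int], sample: str, n_samples: int = 5) -> Dict:
    """Query the judge n_samples times on one sample and aggregate the scores."""
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")
    scores = [judge(sample) for _ in range(n_samples)]
    for score in scores:
        if not 1 <= score <= 5:
            raise ValueError(f"score {score} is outside the 1-5 Likert range")
    return {
        "scores": scores,
        "mean": mean(scores),      # the ensembled style score
        "spread": pstdev(scores),  # how much the judge disagrees with itself
    }


# Hypothetical stand-in for a real judge call: cycles through fixed scores
# so the example is deterministic. Replace with your own judge wrapper.
_fixed_scores = itertools.cycle([4, 3, 4, 5, 4])


def fake_judge(audio_path: str) -> int:
    return next(_fixed_scores)


result = ensemble_judge(fake_judge, "sample_001.wav", n_samples=5)
print(result)
```

Tracking the spread alongside the mean is a design choice, not something the paper prescribes: a large spread on a sample suggests the single-response score would have been unreliable there.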

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Partial
License: MIT (dataset planned to be released)

Risks & Boundaries

Limitations

Only evaluates speaking-style attributes, not other speech quality aspects.

Experiments are English-only; some SLMs may be stronger in Chinese.

When Not To Use

For general speech quality or intelligibility metrics outside style.

For comparing similarly weak models, which the judges struggle to separate.

Failure Modes

Self-enhancement bias when a model judges itself (noted for 4o-audio judging 4o outputs).

Low discriminative power between closely performing, low-quality SLMs.
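One way to probe the self-enhancement failure mode on your own data is to compare how far the judge's scores sit above human scores for its own model versus for other models. The sketch below assumes you already have per-model mean scores from both the judge and human raters; the function names and all numbers are made up for illustration and are not taken from the paper.

```python
from statistics import mean


def score_gaps(judge_means, human_means):
    """Per-model gap: the judge's mean score minus the human mean score."""
    return {model: judge_means[model] - human_means[model] for model in judge_means}


def self_enhancement_excess(judge_means, human_means, own_model):
    """How much more the judge over-scores its own model than the others.

    A clearly positive value is a hint (not proof) of self-enhancement bias.
    """
    gaps = score_gaps(judge_means, human_means)
    other_gaps = [gap for model, gap in gaps.items() if model != own_model]
    if not other_gaps:
        raise ValueError("need at least one other model for comparison")
    return gaps[own_model] - mean(other_gaps)


# Made-up mean Likert scores for illustration only; not results from the paper.
judge_means = {"model_a": 4.5, "model_b": 3.0, "model_c": 2.0}
human_means = {"model_a": 4.0, "model_b": 3.0, "model_c": 2.0}

excess = self_enhancement_excess(judge_means, human_means, own_model="model_a")
print(f"excess over-scoring of own model: {excess:.2f}")
```

With only a handful of models the excess is a rough signal, so it is best read alongside a second judge from a different model family.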

Core Entities

Models

GPT-4o-audio (4o-audio)

GPT-4o-mini-audio (4o-mini-audio)

Gemini-2.5-pro

Step-Audio

Qwen-2.5-Omni

Metrics

Likert style score (1–5)

Realism binary (0/1)

Pearson's r (judge agreement)

Datasets

StyleSet (voice style IF + role-playing)

IEMOCAP (role-playing contexts source)

Benchmarks

StyleSet