LISTEN: use LLM-synthesized negative examples to cut audio hallucinations while training only a small audio adapter

May 20, 2025 · 7 min

Overview

Decision Snapshot: Needs Validation

The method is practical: it trains a small adapter, uses self-generated data, and reports clear gains on benchmarks, but tests are limited to specific datasets and audio-only scenarios.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 70%

Novelty: 60%

Authors

Chun-Yi Kuan, Hung-yi Lee

Links

Abstract / PDF / Code

Why It Matters For Business

Reducing false audio detections increases trust in systems (e.g., emergency alerts) while cutting data and compute needs by training only a small adapter instead of full LLMs.

Who Should Care

Summary TLDR

The authors introduce LISTEN, a contrastive-like training recipe that uses the model's own backbone LLM to synthesize positive and negative audio-text pairs. They keep the backbone LLM frozen, train only a lightweight audio adapter, and report large drops in audio-event hallucination on standard benchmarks while matching or improving performance on audio QA and semantic tests. The pipeline needs far less audio data (a reported 316.4 hours in total, about 3% of the ~10,000 hours used for prior large-scale models) and is simple to add to existing ALLMs.

Problem Statement

Audio-aware large language models (ALLMs) often invent sound events that are not present in an audio clip. This "audio hallucination" reduces reliability in real-world tasks (for example, reporting an alarm that never sounded). Prior fixes require either large instruction-following audio datasets or fine-tuning of the whole LLM, both of which are costly.

Main Contribution

LISTEN: a contrastive-like method that uses LLM-synthesized negative samples to teach ALLMs what sounds are absent.

Adapter-only training: keep backbone LLM frozen and train a lightweight audio modality adapter (Qformer + linear projection).
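The adapter-only setup described above can be sketched in PyTorch as follows. This is a minimal illustration and not the authors' implementation: the class name `AudioAdapter`, the helper `freeze`, the toy dimensions, and the use of a single cross-attention layer as a stand-in for a full Qformer are all assumptions. The frozen backbone is represented by a placeholder linear layer.

```python
import torch
import torch.nn as nn


class AudioAdapter(nn.Module):
    """Hypothetical Qformer-style adapter: learnable queries cross-attend to
    audio features, then a linear layer projects into the LLM embedding space."""

    def __init__(self, audio_dim, llm_dim, num_queries, num_heads):
        super().__init__()
        # Learnable query vectors shared across all clips
        self.queries = nn.Parameter(torch.randn(num_queries, audio_dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(audio_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(audio_dim)
        self.proj = nn.Linear(audio_dim, llm_dim)

    def forward(self, audio_feats):
        # audio_feats: (batch, frames, audio_dim), e.g. from a frozen audio encoder
        batch = audio_feats.size(0)
        q = self.queries.unsqueeze(0).expand(batch, -1, -1)
        attended, _ = self.cross_attn(q, audio_feats, audio_feats)
        # Output: (batch, num_queries, llm_dim), usable as soft tokens for the LLM
        return self.proj(self.norm(q + attended))


def freeze(module):
    """Disable gradients so the module is never updated during training."""
    for p in module.parameters():
        p.requires_grad = False
    return module.eval()


# Toy sizes for illustration only (assumed, not taken from the paper)
adapter = AudioAdapter(audio_dim=16, llm_dim=32, num_queries=4, num_heads=2)
backbone = freeze(nn.Linear(32, 32))  # placeholder for the frozen LLM
optimizer = torch.optim.AdamW(adapter.parameters(), lr=1e-4)  # adapter params only
tokens = adapter(torch.randn(2, 10, 16))
```

Passing only `adapter.parameters()` to the optimizer is what keeps the backbone untouched; freezing it as well avoids computing gradients that would never be used.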

Key Findings

Including synthesized negative samples raises audio-hallucination accuracy from 66.0% to 77.5% on the evaluated benchmark.

Numbers: Acc: Qwen-Audio-Chat 66.0% -> Ours (Positive+Negative) 77.5% (Table 3)

Practical Use: Generate and include negative (absent-sound) examples when training ALLMs to cut false positive sound detections in practice.

Evidence Ref: Table 3

Weighted F1 on binary presence queries (which include detecting absent sounds) improves from 63.7% to 77.1% when training with positive+negative samples.

Numbers: F1 (W): Qwen-Audio-Chat 63.7% -> Ours (Positive+Negative) 77.1% (Table 3)

Practical Use: For binary presence queries, add negative samples to improve precision on 'no' predictions and reduce spurious alerts.

Evidence Ref: Table 3
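For readers who want to reproduce this kind of measurement on their own yes/no predictions, the sketch below computes accuracy, per-label F1, and support-weighted F1 in plain Python. The function names and the toy labels are illustrative assumptions; the paper's exact evaluation script may differ.

```python
from collections import Counter


def accuracy(gold, pred):
    """Fraction of predictions that match the gold answers."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def f1_for_label(gold, pred, label):
    """F1 score treating `label` as the positive class."""
    tp = sum(g == label and p == label for g, p in zip(gold, pred))
    fp = sum(g != label and p == label for g, p in zip(gold, pred))
    fn = sum(g == label and p != label for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def weighted_f1(gold, pred):
    """Per-label F1 averaged with weights proportional to gold label counts."""
    support = Counter(gold)
    total = sum(f1_for_label(gold, pred, lab) * n for lab, n in support.items())
    return total / len(gold)


# Toy example (made-up answers, not data from the paper)
gold = ["yes", "yes", "no", "no", "no"]
pred = ["yes", "no", "no", "no", "yes"]
acc = accuracy(gold, pred)
f1_no = f1_for_label(gold, pred, "no")
f1_w = weighted_f1(gold, pred)
```

Reporting F1 for the 'no' label separately, alongside the weighted score, makes it easier to see whether gains come from fewer hallucinated sounds rather than from the 'yes' class alone.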

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| Accuracy | 77.5% (Ours, Positive+Negative) | 66.0% (Qwen-Audio-Chat) | +11.5 pp | Audio hallucination benchmark | Table 3 shows Ours (Positive+Negative) Acc 77.5% vs Qwen 66.0% | Table 3 |
| Weighted F1 (presence binary) | 77.1% (Ours, Positive+Negative) | 63.7% (Qwen-Audio-Chat) | +13.4 pp | Audio hallucination benchmark | Table 3 F1 (W): Ours 77.1% vs Qwen 63.7% | Table 3 |

What To Try In 7 Days

Generate positive and negative text descriptions from existing audio metadata using your LLM.

Train a small audio-to-LLM adapter (Qformer + linear layer) with the synthesized data while keeping the LLM frozen.

Evaluate on a simple presence/absence task and compare precision on 'no' answers with and without negatives.
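The first step above can be prototyped without any model in the loop. The sketch below builds presence ("yes") and absence ("no") examples from clip-level labels and prepares a prompt for each that could be sent to an LLM. Everything here is an assumption for illustration: the prompt wording, the function names `make_prompt` and `build_pairs`, and the random sampling of absent labels are not taken from the paper, and the actual LLM call is deliberately left out.

```python
import random


def make_prompt(present, query, is_present):
    """Build a text prompt asking an LLM to describe a sound's presence or absence."""
    stance = "is present in" if is_present else "is absent from"
    return (
        f"Sound events in the clip: {', '.join(present)}.\n"
        f"Write one sentence stating that '{query}' {stance} the clip."
    )


def build_pairs(clip_labels, label_vocab, negatives_per_clip=2, seed=0):
    """Create 'yes' examples from present labels and 'no' examples from absent ones."""
    rng = random.Random(seed)  # fixed seed so the dataset is reproducible
    pairs = []
    for clip_id, present in clip_labels.items():
        for label in present:
            pairs.append({"clip": clip_id, "label": label, "answer": "yes",
                          "prompt": make_prompt(present, label, True)})
        # Candidate negatives: vocabulary labels not annotated for this clip
        absent = sorted(set(label_vocab) - set(present))
        k = min(negatives_per_clip, len(absent))
        for label in rng.sample(absent, k):
            pairs.append({"clip": clip_id, "label": label, "answer": "no",
                          "prompt": make_prompt(present, label, False)})
    return pairs


# Hypothetical metadata: clip id -> annotated sound events
clips = {"clip_a": ["dog bark"], "clip_b": ["siren", "rain"]}
vocab = ["dog bark", "siren", "rain", "door knock"]
pairs = build_pairs(clips, vocab)
```

Training once with all of `pairs` and once with only the "yes" examples gives the with/without-negatives comparison described in the third step.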

Agent Features

Tool Use
LLM self-generation for dataset creation
Frameworks
LISTEN (contrastive-like training), LoRA
Architectures
Qformer audio adapter, Whisper encoder (frozen), LLaMA-3.1-8B (frozen backbone)

Optimization Features

Model Optimization
adapter-only training to avoid LLM fine-tuning
Training Optimization
self-generated synthetic data reduces required audio hours (316.4 hrs reported)

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Focuses on audio-event hallucinations only; speech hallucination not addressed.

Evaluation restricted to a set of audio benchmarks; out-of-domain audio not tested.

When Not To Use

When you need full end-to-end LLM adaptation for a new text-heavy task.

When your use case requires verified human-written audio descriptions rather than synthetic ones.

Failure Modes

Poorly generated negative samples could teach the model wrong absences and harm performance.

Synthetic data biases from the backbone LLM may transfer to the adapter.
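One way to reduce the first failure mode is to screen candidate negatives before training. The sketch below drops any "absent" label that matches a present label or a term listed as related to it (for example, a synonym or hypernym). The `RELATED` map and the function `filter_negatives` are hypothetical; a real pipeline would need a proper sound-event ontology or a human check, and this is not a step the paper is known to include.

```python
# Hypothetical map from a present label to terms that should not be used as negatives
RELATED = {
    "dog bark": ["dog", "bark", "animal"],
    "siren": ["alarm", "emergency vehicle"],
}


def filter_negatives(present, candidates, related):
    """Keep only candidate 'absent' labels that do not collide with present sounds."""
    present_norm = {p.strip().lower() for p in present}
    blocked = set(present_norm)
    for p in present_norm:
        # Block synonyms and broader terms listed for each present label
        blocked |= {r.strip().lower() for r in related.get(p, ())}
    return [c for c in candidates if c.strip().lower() not in blocked]


kept = filter_negatives(["Dog bark"], ["dog", "siren", "dog bark ", "Rain"], RELATED)
```

The lookup is one-directional: it only blocks terms listed under a present label, so the map has to be populated for every label that can appear in the metadata.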

Core Entities

Models

LLaMA-3.1-8B, Gemini-1.5-Pro, Qwen-Audio-Chat, SALMONN-7B, SALMONN-13B, LTU-AS, Whisper (encoder)

Metrics

Accuracy, F1 (yes), F1 (no), Weighted F1, Precision, Recall

Datasets

AudioSet-20K, AudioCaps, Clotho, MACS, FSD50K, ESC50, UrbanSound8K, VocalSound

Benchmarks

Audio hallucination benchmark, Clotho-AQA (audio QA), Synonym and Hypernym Test