LISTEN: use LLM-synthesized negative examples to cut audio hallucinations while training only a small audio adapter

Overview

Decision Snapshot: Needs Validation

The method is practical: it trains a small adapter, uses self-generated data, and reports clear gains on benchmarks, but tests are limited to specific datasets and audio-only scenarios.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 70%

Novelty: 60%

Authors

Chun-Yi Kuan, Hung-yi Lee

Links

Abstract / PDF / Code

Why It Matters For Business

Reducing false audio detections increases trust in systems (e.g., emergency alerts) while cutting data and compute needs by training only a small adapter instead of full LLMs.

Who Should Care

ML Engineer, Data Scientist, Engineering Lead, CTO, Product Manager

Summary TLDR

The authors introduce LISTEN, a contrastive-like training recipe that uses the model's own LLM to synthesize positive and negative audio-text pairs. They keep the backbone LLM frozen, train only a lightweight audio adapter, and show large drops in audio-event hallucination on standard benchmarks while keeping or improving audio QA and semantic tests. The pipeline needs much less audio data (reported 316.4 hours total vs ~10,000 hours for prior large-scale models) and is simple to add to existing ALLMs.

Problem Statement

Audio-aware large language models (ALLMs) often invent sound events that are not present in an audio clip. This "audio hallucination" reduces reliability in real-world tasks (for example, reporting an alarm that never sounded). Prior fixes either need large instruction-following audio datasets or fine-tune the whole LLM, and both are costly.

Main Contribution

LISTEN: a contrastive-like method that uses LLM-synthesized negative samples to teach ALLMs what sounds are absent.

Adapter-only training: keep backbone LLM frozen and train a lightweight audio modality adapter (Qformer + linear projection).
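The adapter's data flow can be sketched as below. This is a deliberately simplified, pure-Python illustration and not the authors' implementation: it assumes a single cross-attention step in which learnable query vectors attend over frozen audio-encoder frames, followed by a linear projection into the LLM embedding size. The class name `TinyAudioAdapter`, the dimensions, and the random initialisation are all hypothetical.

```python
import math
import random

def matvec(matrix, vec):
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def random_matrix(rows, cols, rng):
    """Small random weights; stands in for trained adapter parameters."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

class TinyAudioAdapter:
    """Hypothetical adapter: learnable queries + cross-attention + linear projection."""

    def __init__(self, num_queries, audio_dim, llm_dim, seed=0):
        rng = random.Random(seed)
        # In adapter-only training these would be the only updated parameters;
        # the audio encoder and the LLM stay frozen.
        self.queries = random_matrix(num_queries, audio_dim, rng)
        self.projection = random_matrix(llm_dim, audio_dim, rng)

    def forward(self, audio_frames):
        """Map any number of audio frames to num_queries LLM-sized vectors."""
        audio_dim = len(audio_frames[0])
        scale = math.sqrt(audio_dim)
        outputs = []
        for query in self.queries:
            # Attention weights of this query over every audio frame.
            scores = [sum(q * f for q, f in zip(query, frame)) / scale
                      for frame in audio_frames]
            weights = softmax(scores)
            # Weighted sum of frames: a pooled audio summary for this query.
            pooled = [sum(w * frame[d] for w, frame in zip(weights, audio_frames))
                      for d in range(audio_dim)]
            # Linear projection into the LLM's embedding size.
            outputs.append(matvec(self.projection, pooled))
        return outputs

# Stand-in for frozen encoder output: 50 frames of 16-dimensional features.
rng = random.Random(1)
frames = [[rng.gauss(0.0, 1.0) for _ in range(16)] for _ in range(50)]
adapter = TinyAudioAdapter(num_queries=4, audio_dim=16, llm_dim=32)
audio_tokens = adapter.forward(frames)
print(len(audio_tokens), len(audio_tokens[0]))  # 4 32
```

The point of the query-based design is that the number of vectors handed to the LLM is fixed by `num_queries`, regardless of clip length. A real Q-Former stacks several such layers with self-attention and feed-forward blocks, which this sketch omits.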

Key Findings

Including synthesized negative samples raises audio-hallucination accuracy from 66.0% to 77.5% on the evaluated benchmark.

Numbers: Acc: Qwen-Audio-Chat 66.0% -> Ours (Positive+Negative) 77.5% (Table 3)

Practical Use: Generate and include negative (absent-sound) examples when training ALLMs to cut false positive sound detections in practice.

Evidence Ref: Table 3

Weighted F1 on the binary presence task (yes/no answers, including detection of absent sounds) improves from 63.7% to 77.1% when training with positive+negative samples.

Numbers: F1 (W): Qwen-Audio-Chat 63.7% -> Ours (Positive+Negative) 77.1% (Table 3)

Practical Use: For binary presence queries, add negative samples to improve precision on 'no' predictions and reduce spurious alerts.

Evidence Ref: Table 3

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	77.5% (Ours, Positive+Negative)	66.0% (Qwen-Audio-Chat)	+11.5 pp	Audio hallucination benchmark	Table 3 shows Ours (Positive+Negative) Acc 77.5% vs Qwen 66.0%	Table 3
Weighted F1 (presence binary)	77.1% (Ours, Positive+Negative)	63.7% (Qwen-Audio-Chat)	+13.4 pp	Audio hallucination benchmark	Table 3 F1 (W): Ours 77.1% vs Qwen 63.7%	Table 3
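To make the metrics in this table concrete, the sketch below computes accuracy, per-class F1, and weighted F1 for yes/no presence answers. It assumes the common definition of weighted F1 (per-class F1 averaged with weights proportional to each class's count in the ground truth); the summary does not spell out the exact formula used, and the toy labels are invented for illustration.

```python
from collections import Counter

def f1_for_class(gold, pred, label):
    """F1 score treating `label` as the positive class."""
    tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
    fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def presence_metrics(gold, pred):
    """Accuracy, F1 (yes), F1 (no), and support-weighted F1."""
    support = Counter(gold)
    f1_yes = f1_for_class(gold, pred, "yes")
    f1_no = f1_for_class(gold, pred, "no")
    weighted = (f1_yes * support["yes"] + f1_no * support["no"]) / len(gold)
    accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
    return {"accuracy": accuracy, "f1_yes": f1_yes,
            "f1_no": f1_no, "weighted_f1": weighted}

# Toy data: "yes" = the queried sound is present, "no" = it is absent.
gold = ["yes", "yes", "no", "no", "no", "yes"]
pred = ["yes", "yes", "yes", "no", "no", "no"]
metrics = presence_metrics(gold, pred)
print({name: round(value, 3) for name, value in metrics.items()})

# Percentage-point deltas for the two reported rows.
print(round(77.5 - 66.0, 1), round(77.1 - 63.7, 1))  # 11.5 13.4
```

A hallucinated sound shows up here as a "yes" prediction on a "no" clip, which lowers precision for "yes" and recall for "no" at the same time.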

What To Try In 7 Days

Generate positive and negative text descriptions from existing audio metadata using your LLM.

Train a small audio-to-LLM adapter (Qformer + linear layer) with the synthesized data while keeping the LLM frozen.

Evaluate on a simple presence/absence task and compare precision on 'no' answers with and without negatives.
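The first step above can be sketched as follows. This is a minimal illustration under stated assumptions: `SOUND_VOCAB`, `build_prompts`, `synthesize_examples`, and the prompt wording are hypothetical, and `echo_llm` is a stub standing in for whatever call you use to query your LLM.

```python
import random

# Hypothetical label vocabulary; in practice this would come from your metadata.
SOUND_VOCAB = ["dog bark", "siren", "rain", "speech", "car horn", "glass breaking"]

def build_prompts(present_labels, vocab, num_negatives, rng):
    """Return (positive_prompt, negative_prompt, absent_labels) for one clip."""
    # Candidate negatives are sounds that the metadata does NOT list.
    absent_pool = [sound for sound in vocab if sound not in present_labels]
    absent = rng.sample(absent_pool, min(num_negatives, len(absent_pool)))
    present_text = ", ".join(present_labels)
    positive_prompt = f"Describe an audio clip that contains: {present_text}."
    negative_prompt = (
        f"An audio clip contains: {present_text}. "
        f"State clearly that these sounds are NOT present: {', '.join(absent)}."
    )
    return positive_prompt, negative_prompt, absent

def synthesize_examples(clip_id, present_labels, generate,
                        vocab=SOUND_VOCAB, num_negatives=2, seed=0):
    """Create one positive and one negative text example for a clip."""
    rng = random.Random(seed)
    pos_prompt, neg_prompt, absent = build_prompts(
        present_labels, vocab, num_negatives, rng)
    return [
        {"clip": clip_id, "kind": "positive", "text": generate(pos_prompt)},
        {"clip": clip_id, "kind": "negative", "text": generate(neg_prompt),
         "absent": absent},
    ]

def echo_llm(prompt):
    """Stub for the LLM call; replace with a real generation function."""
    return "[LLM output for] " + prompt

examples = synthesize_examples("clip_001", ["dog bark", "rain"], echo_llm)
for example in examples:
    print(example["kind"], "->", example["text"])
```

Keeping the sampled `absent` labels alongside each negative example makes it possible to audit later which absences the model was taught.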

Agent Features

Tool Use

LLM self-generation for dataset creation

Frameworks

LISTEN (contrastive-like training), LoRA

Architectures

Qformer audio adapter, Whisper encoder (frozen), LLaMA-3.1-8B (frozen backbone)

Optimization Features

Model Optimization

adapter-only training to avoid LLM fine-tuning

Training Optimization

self-generated synthetic data reduces required audio hours (316.4 hrs reported)

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://kuan2jiu99.github.io/Balsa

Risks & Boundaries

Limitations

Focuses on audio-event hallucinations only; speech hallucination not addressed.

Evaluation restricted to a set of audio benchmarks; out-of-domain audio not tested.

When Not To Use

When you need full end-to-end LLM adaptation for a new text-heavy task.

When your use case requires verified human-written audio descriptions rather than synthetic ones.

Failure Modes

Poorly generated negative samples could teach the model wrong absences and harm performance.

Synthetic data biases from the backbone LLM may transfer to the adapter.
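One simple guard against the first failure mode is to reject candidate negatives that name a sound already implied by the clip's labels. The sketch below is not something the source describes: it assumes a small hand-written `RELATED` map of synonyms and hypernyms (hypothetical; a real system might draw on a sound ontology), and `filter_negatives` is an illustrative helper name.

```python
# Hypothetical synonym/hypernym map: label -> terms that are also true if it is present.
RELATED = {
    "dog bark": {"dog", "animal", "bark"},
    "siren": {"alarm", "emergency vehicle"},
    "rain": {"water", "weather"},
}

def expand(labels, related=RELATED):
    """Labels plus every related term that is therefore also present."""
    expanded = set()
    for label in labels:
        expanded.add(label)
        expanded |= related.get(label, set())
    return expanded

def filter_negatives(present_labels, candidate_absent, related=RELATED):
    """Split candidate 'absent' sounds into safe ones and contradictory ones."""
    implied = expand(present_labels, related)
    kept, rejected = [], []
    for sound in candidate_absent:
        if sound in implied:
            rejected.append(sound)  # would teach a wrong absence
        else:
            kept.append(sound)
    return kept, rejected

kept, rejected = filter_negatives(
    ["dog bark", "rain"], ["animal", "siren", "water", "car horn"])
print(kept)      # ['siren', 'car horn']
print(rejected)  # ['animal', 'water']
```

Exact string matching only catches contradictions that the map anticipates, so a check like this reduces, but does not eliminate, the risk of wrong negatives.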

Core Entities

Models

LLaMA-3.1-8B, Gemini-1.5-Pro, Qwen-Audio-Chat, SALMONN-7B, SALMONN-13B, LTU-AS, Whisper (encoder)

Metrics

Accuracy, F1 (yes), F1 (no), Weighted F1, Precision, Recall

Datasets

AudioSet-20K, AudioCaps, Clotho, MACS, FSD50K, ESC50, UrbanSound8K, VocalSound

Benchmarks

Audio hallucination benchmark, Clotho-AQA (audio QA), Synonym and Hypernym Test
