Overview
The method is practical: it trains only a small adapter, uses self-generated data, and reports gains of 11.5 to 13.4 percentage points on an audio-hallucination benchmark. Tests are limited to specific datasets and audio-only scenarios.
Citations: 0
Evidence Strength: 0.70
Confidence: 0.85
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 4/4
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 70%
Production readiness: 70%
Novelty: 60%
Why It Matters For Business
Fewer false audio detections make systems such as emergency alerting more trustworthy, and training only a small adapter instead of the full LLM cuts data and compute needs.
Summary TLDR
The authors introduce LISTEN, a contrastive-like training recipe that uses the backbone LLM of an audio-aware LLM (ALLM) to synthesize positive and negative audio-text pairs. They keep the backbone LLM frozen, train only a lightweight audio adapter, and show large drops in audio-event hallucination on standard benchmarks while matching or improving performance on audio QA and semantic tests. The pipeline needs far less audio data (a reported 316.4 hours in total vs ~10,000 hours for prior large-scale models) and is simple to add to existing ALLMs.
Problem Statement
Audio-aware large language models (ALLMs) often invent sound events that are not present in an audio clip. This "audio hallucination" reduces reliability in real-world tasks (for example, reporting an alarm that never sounded). Prior fixes require either large instruction-following audio datasets or fine-tuning the whole LLM, both of which are costly.
Main Contribution
LISTEN: a contrastive-like method that uses LLM-synthesized negative samples to teach ALLMs what sounds are absent.
Adapter-only training: keep backbone LLM frozen and train a lightweight audio modality adapter (Qformer + linear projection).
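The negative-sample idea can be pictured as paired presence questions for each clip. The following is a minimal sketch, assuming each clip carries a set of event labels. The function name `build_presence_pairs`, the question template, and the label vocabulary are hypothetical, and fixed templates stand in for the LLM-written text the authors describe.

```python
import random


def build_presence_pairs(clip_labels, vocabulary, num_negatives=2, seed=0):
    """Build (question, answer) pairs for one clip.

    Positive pairs ask about events labeled in the clip (answer "yes").
    Negative pairs ask about vocabulary events that are absent (answer "no").
    Templates are a stand-in for LLM synthesis (an assumption of this sketch).
    """
    rng = random.Random(seed)
    present = sorted(set(clip_labels))
    absent = sorted(set(vocabulary) - set(clip_labels))
    # Never request more negatives than there are absent events.
    sampled_absent = rng.sample(absent, min(num_negatives, len(absent)))

    template = "Is there a sound of {} in the audio?"
    pairs = [(template.format(event), "yes") for event in present]
    pairs += [(template.format(event), "no") for event in sampled_absent]
    return pairs


# Hypothetical vocabulary and clip labels, for illustration only.
vocabulary = ["dog barking", "siren", "rain", "speech", "glass breaking"]
pairs = build_presence_pairs(["rain", "siren"], vocabulary)
for question, answer in pairs:
    print(answer, "|", question)
```

In practice the negatives would likely need to be plausible confusions rather than random absent labels, since poorly chosen negatives are one of the failure modes noted for this approach.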
Key Findings
Including synthesized negative samples raises audio-hallucination accuracy from 66.0% to 77.5% on the evaluated benchmark.
Weighted F1 for 'no' answers (detecting absent sounds) improves from 63.7% to 77.1% when training with positive+negative samples.
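For readers who want to reproduce these two metrics on their own yes/no predictions, here is a minimal sketch. It assumes the usual support-weighted average of per-class F1 over the "yes" and "no" classes; this summary does not state exactly how the paper averages, so treat that as an assumption. The labels below are toy values, not results from the paper.

```python
from collections import Counter


def f1_for_class(y_true, y_pred, label):
    """F1 score treating `label` as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's support."""
    support = Counter(y_true)
    total = len(y_true)
    return sum(
        f1_for_class(y_true, y_pred, label) * count / total
        for label, count in support.items()
    )


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference answers."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy reference answers and predictions (illustrative only).
y_true = ["yes", "yes", "no", "no", "no"]
y_pred = ["yes", "yes", "yes", "no", "no"]
print(f"accuracy    = {accuracy(y_true, y_pred):.3f}")
print(f"weighted F1 = {weighted_f1(y_true, y_pred):.3f}")
```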
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | 77.5% (Ours, Positive+Negative) | 66.0% (Qwen-Audio-Chat) | +11.5 pp | Audio hallucination benchmark | Table 3 shows Ours (Positive+Negative) Acc 77.5% vs Qwen 66.0% | Table 3 |
| Weighted F1 (presence binary) | 77.1% (Ours, Positive+Negative) | 63.7% (Qwen-Audio-Chat) | +13.4 pp | Audio hallucination benchmark | Table 3 F1 (W): Ours 77.1% vs Qwen 63.7% | Table 3 |
What To Try In 7 Days
Generate positive and negative text descriptions from existing audio metadata using your LLM.
Train a small audio-to-LLM adapter (Qformer + linear layer) with the synthesized data while keeping the LLM frozen.
Evaluate on a simple presence/absence task and compare precision on 'no' answers with and without negatives.
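The comparison in the last step can be sketched as below. This is a minimal example, assuming you already have yes/no predictions from two adapters, one trained without negatives and one with them. The helper name `no_precision` and all prediction lists are hypothetical toy values, not results from the paper.

```python
def no_precision(y_true, y_pred):
    """Precision on 'no' answers: of all predicted 'no', how many are truly 'no'.

    Returns None when the model never answers 'no', since precision is undefined.
    """
    truths_for_predicted_no = [t for t, p in zip(y_true, y_pred) if p == "no"]
    if not truths_for_predicted_no:
        return None
    correct = sum(1 for t in truths_for_predicted_no if t == "no")
    return correct / len(truths_for_predicted_no)


# Toy reference answers for six presence/absence questions.
y_true = ["no", "no", "no", "yes", "yes", "yes"]

# Hypothetical predictions from the two training setups.
pred_without_negatives = ["yes", "yes", "no", "yes", "yes", "no"]
pred_with_negatives = ["no", "no", "no", "yes", "yes", "no"]

baseline = no_precision(y_true, pred_without_negatives)
treated = no_precision(y_true, pred_with_negatives)
print(f"'no' precision without negatives: {baseline:.2f}")
print(f"'no' precision with negatives:    {treated:.2f}")
print(f"delta: {(treated - baseline) * 100:+.1f} pp")
```

Reporting recall on "no" answers alongside precision is worth considering as well, because a model that rarely answers "no" can score high precision while still hallucinating often.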
Risks & Boundaries
Limitations
Focuses on audio-event hallucinations only; speech hallucination not addressed.
Evaluation restricted to a set of audio benchmarks; out-of-domain audio not tested.
When Not To Use
When you need full end-to-end LLM adaptation for a new text-heavy task.
When your use case requires verified human-written audio descriptions rather than synthetic ones.
Failure Modes
Poorly generated negative samples could teach the model that sounds are absent when they are actually present, harming performance.
Synthetic data biases from the backbone LLM may transfer to the adapter.

