Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.7
Citation Count
0
Why It Matters For Business
Reducing false audio detections increases trust in systems such as emergency alerting, and training only a small adapter instead of the full LLM cuts data and compute needs.
Summary TLDR
The authors introduce LISTEN, a contrastive-like training recipe that uses the model's own backbone LLM to synthesize positive and negative audio-text pairs. They keep the backbone LLM frozen, train only a lightweight audio adapter, and report large drops in audio-event hallucination on standard benchmarks while matching or improving performance on audio QA and semantic tests. The pipeline needs far less audio data (316.4 hours reported in total, versus ~10,000 hours for prior large-scale models) and is simple to add to existing audio-aware LLMs (ALLMs).
Problem Statement
Audio-aware large language models (ALLMs) often invent sound events that are not present in an audio clip. This "audio hallucination" reduces reliability in real-world tasks (for example, reporting an alarm that never sounded). Prior fixes need either large instruction-following audio datasets or fine-tuning of the whole LLM, both of which are costly.
Main Contribution
LISTEN: a contrastive-like method that uses LLM-synthesized negative samples to teach ALLMs what sounds are absent.
Adapter-only training: keep backbone LLM frozen and train a lightweight audio modality adapter (Qformer + linear projection).
Data efficiency: the self-generated training data covers 316.4 hours of audio (reported), versus ~10,000 hours used by some baselines.
Empirical gains: substantially reduced hallucination and competitive or better performance on audio QA and semantic tests.
Key Findings
Including synthesized negative samples raises audio-hallucination accuracy from 66.0% to 77.5% on the evaluated benchmark.
Weighted F1 for 'no' answers (detecting absent sounds) improves from 63.7% to 77.1% when training with both positive and negative samples.
The method uses only 316.4 hours of audio (308K samples) for training versus ~10,000 hours reported for a large baseline.
The reported gains were achieved by training only the audio adapter, with the backbone LLM kept frozen.
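To put the reported data volume in perspective, the figures above can be turned into a per-sample duration and a fraction of the baseline budget. This is a back-of-the-envelope sketch: "308K" and "~10,000 hours" are rounded in the summary, so the outputs are approximate, and the function names are illustrative only.

```python
# Figures as reported in the summary; both counts are rounded, so treat
# the derived numbers as approximate.
TRAIN_HOURS = 316.4        # total audio used for training
TRAIN_SAMPLES = 308_000    # "308K samples"
BASELINE_HOURS = 10_000    # "~10,000 hours" reported for a large baseline


def mean_clip_seconds(hours: float, samples: int) -> float:
    """Average audio duration per training sample, in seconds."""
    return hours * 3600 / samples


def data_fraction(hours: float, baseline_hours: float) -> float:
    """Training audio as a fraction of the baseline's audio budget."""
    return hours / baseline_hours


if __name__ == "__main__":
    secs = mean_clip_seconds(TRAIN_HOURS, TRAIN_SAMPLES)
    frac = data_fraction(TRAIN_HOURS, BASELINE_HOURS)
    print(f"about {secs:.1f} s of audio per sample")
    print(f"about {frac:.1%} of the baseline's audio hours")
```

Under these rounded inputs, each sample averages a few seconds of audio and the whole training set is a small single-digit percentage of the baseline's hours.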
Results
Accuracy
Weighted F1 (presence binary)
Accuracy
Training data volume
Who Should Care
What To Try In 7 Days
Generate positive and negative text descriptions from existing audio metadata using your LLM.
Train a small audio-to-LLM adapter (Qformer + linear layer) with the synthesized data while keeping the LLM frozen.
Evaluate on a simple presence/absence task and compare precision on 'no' answers with and without negatives.
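The first step above (synthesizing positive and negative descriptions from metadata) could be sketched roughly as follows. This is not the authors' pipeline: the prompt wording, the `synthesize_pairs` function, and the `llm` callable (any function mapping a prompt string to generated text) are assumptions made for illustration. Absent events are drawn at random from labels that are not attached to the clip.

```python
import random
from typing import Callable, Dict, List, Sequence

# Hypothetical prompt templates; the wording used in the paper is not given here.
POSITIVE_PROMPT = "Describe an audio clip that contains these sound events: {events}."
NEGATIVE_PROMPT = "State in one sentence that these sound events are absent from the clip: {events}."


def synthesize_pairs(
    clip_labels: Sequence[str],
    label_vocab: Sequence[str],
    llm: Callable[[str], str],
    n_absent: int = 2,
    seed: int = 0,
) -> List[Dict[str, object]]:
    """Build one positive and (if possible) one negative text sample for a clip."""
    rng = random.Random(seed)
    present = sorted(set(clip_labels))
    # Candidate absent events: vocabulary entries not labeled on this clip.
    absent_pool = sorted(set(label_vocab) - set(present))
    absent = rng.sample(absent_pool, min(n_absent, len(absent_pool)))

    samples: List[Dict[str, object]] = [{
        "kind": "positive",
        "events": present,
        "text": llm(POSITIVE_PROMPT.format(events=", ".join(present))),
    }]
    if absent:
        samples.append({
            "kind": "negative",
            "events": absent,
            "text": llm(NEGATIVE_PROMPT.format(events=", ".join(absent))),
        })
    return samples


if __name__ == "__main__":
    # Stand-in for a real LLM call, so the sketch runs offline.
    def echo_llm(prompt: str) -> str:
        return f"[LLM output for: {prompt}]"

    vocab = ["dog bark", "siren", "rain", "piano", "engine"]
    for sample in synthesize_pairs(["dog bark", "siren"], vocab, echo_llm):
        print(sample["kind"], sample["events"])
```

Passing the LLM in as a plain callable keeps the sketch independent of any particular inference library; in practice the frozen backbone would be wrapped behind that callable.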
Agent Features
Tool Use
- LLM self-generation for dataset creation
Frameworks
- LISTEN (contrastive-like training)
- LoRA
Architectures
- Qformer audio adapter
- Whisper encoder (frozen)
- LLaMA-3.1-8B (frozen backbone)
Optimization Features
Model Optimization
- adapter-only training to avoid LLM fine-tuning
Training Optimization
- self-generated synthetic data reduces required audio hours (316.4 hrs reported)
Reproducibility
Code Urls
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Focuses on audio-event hallucinations only; speech hallucination is not addressed.
- Evaluation is restricted to a set of audio benchmarks; out-of-domain audio is not tested.
- Relies on synthetic labels produced by the backbone LLM, which may inherit model biases.
When Not To Use
- When you need full end-to-end LLM adaptation for a new text-heavy task.
- When your use case requires verified human-written audio descriptions rather than synthetic ones.
- For multimodal tasks that require visual context; this work is audio-only.
Failure Modes
- Poorly generated negative samples could teach the model wrong absences and harm performance.
- Synthetic data biases from the backbone LLM may transfer to the adapter.
- Adapter alignment may fail on audio types not covered by the small training set.
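One way to guard against the first failure mode (negatives that name a sound which is actually present) is to screen candidate absent events before they are used. The sketch below is an assumption-laden illustration, not part of the published method: `filter_negatives` and the `related` map (label to synonyms or hypernyms) are hypothetical, and the map would have to come from your own taxonomy.

```python
from typing import Dict, Iterable, List, Optional, Sequence


def _norm(text: str) -> str:
    """Normalize a label for case- and whitespace-insensitive comparison."""
    return text.strip().lower()


def filter_negatives(
    candidates: Iterable[str],
    present_labels: Sequence[str],
    related: Optional[Dict[str, Sequence[str]]] = None,
) -> List[str]:
    """Drop negative candidates that may actually be audible in the clip.

    `related` maps a normalized label to its synonyms or hypernyms
    (a hypothetical, user-supplied taxonomy). A candidate is rejected if it
    matches a present label or any term related to one; duplicates are dropped.
    """
    related = related or {}
    blocked = set()
    for label in present_labels:
        key = _norm(label)
        blocked.add(key)
        blocked.update(_norm(term) for term in related.get(key, ()))

    kept: List[str] = []
    seen = set()
    for cand in candidates:
        key = _norm(cand)
        if key in blocked or key in seen:
            continue
        seen.add(key)
        kept.append(cand)
    return kept


if __name__ == "__main__":
    taxonomy = {"dog bark": ["dog", "animal"]}
    print(filter_negatives(["Dog", "rain", "animal", "Rain", "piano"],
                           ["Dog bark"], taxonomy))
```

A string-matching filter like this only catches overlaps that the taxonomy knows about; it does not verify absence against the audio itself.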
Core Entities
Models
- LLaMA-3.1-8B
- Gemini-1.5-Pro
- Qwen-Audio-Chat
- SALMONN-7B
- SALMONN-13B
- LTU-AS
- Whisper (encoder)
Metrics
- Accuracy
- F1 (yes)
- F1 (no)
- Weighted F1
- Precision
- Recall
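For a yes/no presence task, the metrics listed above can be computed from paired gold and predicted answers as in the sketch below. It uses the standard definitions (per-class precision, recall, and F1, with weighted F1 as the support-weighted mean of per-class F1); whether the paper computes weighted F1 in exactly this way is an assumption, and `binary_report` is an illustrative name.

```python
from collections import Counter
from typing import Dict, Sequence


def binary_report(
    y_true: Sequence[str],
    y_pred: Sequence[str],
    labels: Sequence[str] = ("yes", "no"),
) -> Dict[str, object]:
    """Accuracy, per-class precision/recall/F1, and support-weighted F1."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    support = Counter(y_true)
    report: Dict[str, object] = {}
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report[label] = {"precision": precision, "recall": recall,
                         "f1": f1, "support": support[label]}

    total = len(y_true)
    report["accuracy"] = sum(t == p for t, p in zip(y_true, y_pred)) / total
    # Weighted F1: each class's F1 weighted by its share of gold answers.
    report["weighted_f1"] = sum(
        report[label]["f1"] * support[label] for label in labels
    ) / total
    return report


if __name__ == "__main__":
    gold = ["yes", "yes", "no", "no", "no"]
    pred = ["yes", "no", "no", "no", "yes"]
    result = binary_report(gold, pred)
    print(result["accuracy"], result["no"]["f1"], result["weighted_f1"])
```

Running this once on a model trained with negatives and once on a model trained without them gives a direct comparison of F1 (no) and precision on 'no' answers.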
Datasets
- AudioSet-20K
- AudioCaps
- Clotho
- MACS
- FSD50K
- ESC50
- UrbanSound8K
- VocalSound
Benchmarks
- Audio hallucination benchmark
- Clotho-AQA (audio QA)
- Synonym and Hypernym Test

