LISTEN: use LLM-synthesized negative examples to cut audio hallucinations while training only a small audio adapter

May 20, 2025 · 7 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

0

Authors

Chun-Yi Kuan, Hung-yi Lee

Links

Abstract / PDF

Why It Matters For Business

Reducing false audio detections increases trust in systems (e.g., emergency alerts) while cutting data and compute needs by training only a small adapter instead of full LLMs.

Summary TLDR

The authors introduce LISTEN, a contrastive-like training recipe that uses the backbone LLM itself to synthesize positive and negative audio-text pairs. They keep the backbone LLM frozen, train only a lightweight audio adapter, and report large drops in audio-event hallucination on standard benchmarks while matching or improving results on audio QA and semantic tests. The pipeline needs much less audio data (reported 316.4 hours total vs ~10,000 hours for prior large-scale models) and is simple to add to existing audio-aware LLMs (ALLMs).

Problem Statement

Audio-aware large language models (ALLMs) often invent sound events that are not present in an audio clip. This "audio hallucination" reduces reliability in real-world tasks (for example, reporting an alarm that never sounded). Prior fixes either need large instruction-following audio datasets or fine-tune the whole LLM, both of which are costly.

Main Contribution

LISTEN: a contrastive-like method that uses LLM-synthesized negative samples to teach ALLMs what sounds are absent.

Adapter-only training: keep backbone LLM frozen and train a lightweight audio modality adapter (Qformer + linear projection).

Data efficiency: use self-generated data requiring 316.4 hours (reported) versus ~10,000 hours used by some baselines.

Empirical gains: substantially reduced hallucination and competitive or better performance on audio QA and semantic tests.
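The idea of LLM-synthesized positive and negative samples can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the prompt wording, the `build_pairs` and `make_prompt` helpers, and the `stub_llm` stand-in for the backbone LLM are all assumptions made for the example. Positives come from tags attached to a clip; negatives are sampled from labels in the vocabulary that the clip does not carry.

```python
import random
from typing import Callable, Dict, List, Optional, Sequence


def stub_llm(prompt: str) -> str:
    """Hypothetical stand-in for a call to the backbone LLM."""
    return "Yes, it is audible." if "is present" in prompt else "No, it is not audible."


def make_prompt(tags: Sequence[str], target: str, present: bool) -> str:
    """Build an (assumed) instruction that tells the LLM the ground truth."""
    truth = "is present" if present else "is absent"
    return (
        f"Audio tags: {', '.join(tags)}.\n"
        f"The sound '{target}' {truth} in this clip.\n"
        f"Answer in one sentence: is there a sound of {target}?"
    )


def build_pairs(
    tags: Sequence[str],
    vocabulary: Sequence[str],
    n_negatives: int = 2,
    generate: Callable[[str], str] = stub_llm,
    rng: Optional[random.Random] = None,
) -> List[Dict[str, str]]:
    """Return positive samples for every tag plus sampled negative samples."""
    rng = rng or random.Random()
    present = set(tags)
    # Candidate negatives: labels the clip's metadata does not mention.
    absent = sorted(label for label in vocabulary if label not in present)
    negatives = rng.sample(absent, min(n_negatives, len(absent)))

    targets = [(t, True) for t in tags] + [(t, False) for t in negatives]
    pairs = []
    for target, is_present in targets:
        pairs.append({
            "target": target,
            "kind": "positive" if is_present else "negative",
            "text": generate(make_prompt(tags, target, is_present)),
        })
    return pairs


clip_tags = ["dog bark", "siren"]
label_vocab = ["dog bark", "siren", "rain", "speech", "engine"]
pairs = build_pairs(clip_tags, label_vocab, n_negatives=2, rng=random.Random(0))
for pair in pairs:
    print(pair["kind"], "|", pair["target"], "|", pair["text"])
```

In practice `generate` would wrap a real LLM call, and the sampled negatives would need a sanity check so that a label close in meaning to a present sound is not marked absent.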

Key Findings

Training with synthesized positive and negative samples reaches 77.5% audio-hallucination accuracy on the evaluated benchmark, versus 66.0% for Qwen-Audio-Chat.

Numbers: Acc: Qwen-Audio-Chat 66.0% -> Ours (Positive+Negative) 77.5% (Table 3)

Weighted F1 on yes/no sound-presence questions, which rewards correct 'no' answers for absent sounds, is 77.1% when training with positive+negative samples, versus 63.7% for Qwen-Audio-Chat.

Numbers: F1 (W): Qwen-Audio-Chat 63.7% -> Ours (Positive+Negative) 77.1% (Table 3)

The method uses only 316.4 hours of audio (308K samples) for training, versus the ~10,000 hours reported for Qwen-Audio.

Numbers: Total (Ours) 316.4 hrs vs Qwen-Audio ~10,000 hrs (Table 2)

The reported gains were achieved by training only the audio adapter while the backbone LLM stayed frozen.

Numbers: Method: adapter-only (Qformer + linear projection); backbone LLaMA-3.1-8B frozen (Sec. 3.2)
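Adapter-only training can be illustrated with a small PyTorch sketch. It is a simplification under stated assumptions, not the paper's architecture: a single cross-attention layer with learnable queries stands in for the Qformer, a tiny linear layer stands in for the frozen LLaMA-3.1-8B backbone, random tensors stand in for frozen encoder features, and all dimensions and the `AudioAdapter` name are made up for the example. The point it shows is that only the adapter's parameters reach the optimizer and receive gradients.

```python
import torch
from torch import nn


class AudioAdapter(nn.Module):
    """Learnable queries cross-attend to audio features, then project to LLM width."""

    def __init__(self, audio_dim=64, llm_dim=128, n_queries=8, n_heads=4):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, n_queries, audio_dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(audio_dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(audio_dim)
        self.proj = nn.Linear(audio_dim, llm_dim)  # the linear projection

    def forward(self, audio_feats):
        # audio_feats: (batch, frames, audio_dim) from a frozen audio encoder
        q = self.queries.expand(audio_feats.size(0), -1, -1)
        attended, _ = self.cross_attn(q, audio_feats, audio_feats)
        return self.proj(self.norm(q + attended))  # (batch, n_queries, llm_dim)


torch.manual_seed(0)
adapter = AudioAdapter()

# Stand-in for the frozen backbone LLM: no parameter here is trainable.
backbone = nn.Linear(128, 10)
backbone.requires_grad_(False)

# Only adapter parameters are handed to the optimizer.
optimizer = torch.optim.AdamW(adapter.parameters(), lr=1e-3)

audio_feats = torch.randn(2, 50, 64)    # fake encoder output
targets = torch.randint(0, 10, (2, 8))  # fake token targets

prefix = adapter(audio_feats)
logits = backbone(prefix)
loss = nn.functional.cross_entropy(logits.reshape(-1, 10), targets.reshape(-1))
loss.backward()
optimizer.step()

trainable = sum(p.numel() for p in adapter.parameters())
frozen = sum(p.numel() for p in backbone.parameters())
print(f"trainable adapter params: {trainable}, frozen backbone params: {frozen}")
```

With a real backbone the frozen parameter count would dwarf the adapter's, which is where the compute savings come from.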

Results

Accuracy

Value: 77.5% (Ours, Positive+Negative)

Baseline: 66.0% (Qwen-Audio-Chat)

Weighted F1 (presence binary)

Value: 77.1% (Ours, Positive+Negative)

Baseline: 63.7% (Qwen-Audio-Chat)

Accuracy

Value: 84.3% (Ours, Positive+Negative)

Baseline: 74.9% (Qwen-Audio-Chat)

Training data volume

Value: 316.4 hours (reported total used)

Baseline: ~10,000 hours (Qwen-Audio reference)
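To put the reported figures side by side, the short calculation below derives absolute gains in percentage points, relative gains over the baseline, and the share of training audio used. It only does arithmetic on the numbers listed above; the row names and the `gains` helper are illustrative, and the ~10,000-hour figure is approximate in the source.

```python
# (baseline, value) pairs copied from the results above, in percent.
results = {
    "accuracy (first row)": (66.0, 77.5),
    "weighted F1": (63.7, 77.1),
    "accuracy (second row)": (74.9, 84.3),
}


def gains(baseline: float, value: float):
    """Return (absolute gain in points, relative gain in percent of baseline)."""
    absolute = value - baseline
    return absolute, absolute / baseline * 100


for name, (baseline, value) in results.items():
    absolute, relative = gains(baseline, value)
    print(f"{name}: +{absolute:.1f} points ({relative:.1f}% relative)")

ours_hours, baseline_hours = 316.4, 10_000  # baseline is approximate
data_fraction = ours_hours / baseline_hours
print(f"audio used: {data_fraction:.1%} of the baseline's hours")
```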

Who Should Care

What To Try In 7 Days

Generate positive and negative text descriptions from existing audio metadata using your LLM.

Train a small audio-to-LLM adapter (Qformer + linear layer) with the synthesized data while keeping the LLM frozen.

Evaluate on a simple presence/absence task and compare precision on 'no' answers with and without negatives.
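For the evaluation step, a presence/absence comparison can be scored in plain Python. The sketch below computes per-class precision, recall, and F1 for 'yes' and 'no', plus accuracy and support-weighted F1. The gold labels and the two prediction lists are toy data invented for the example, and `class_prf` and `presence_report` are hypothetical helper names.

```python
from typing import Dict, List


def class_prf(gold: List[str], pred: List[str], label: str) -> Dict[str, float]:
    """Precision, recall and F1 for one answer class ('yes' or 'no')."""
    tp = sum(g == label and p == label for g, p in zip(gold, pred))
    fp = sum(g != label and p == label for g, p in zip(gold, pred))
    fn = sum(g == label and p != label for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def presence_report(gold: List[str], pred: List[str]) -> dict:
    """Per-class scores plus accuracy and support-weighted F1."""
    labels = ("yes", "no")
    report = {label: class_prf(gold, pred, label) for label in labels}
    n = len(gold)
    report["accuracy"] = sum(g == p for g, p in zip(gold, pred)) / n
    # Weight each class F1 by how often that class appears in the gold labels.
    report["weighted_f1"] = sum(report[l]["f1"] * gold.count(l) for l in labels) / n
    return report


# Toy data: the first model over-answers 'yes' (hallucinated sounds).
gold = ["yes", "yes", "no", "no", "no", "no"]
pred_without_negatives = ["yes", "yes", "yes", "yes", "no", "no"]
pred_with_negatives = ["yes", "yes", "no", "no", "no", "yes"]

report_without = presence_report(gold, pred_without_negatives)
report_with = presence_report(gold, pred_with_negatives)
for name, report in [("without negatives", report_without), ("with negatives", report_with)]:
    print(name, report["no"], f"acc={report['accuracy']:.3f}", f"wF1={report['weighted_f1']:.3f}")
```

A model that hallucinates sounds tends to answer 'yes' too often, which shows up as low recall on 'no' and low precision on 'yes', so it is worth looking at both classes rather than a single number.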

Agent Features

Tool Use

  • LLM self-generation for dataset creation

Frameworks

  • LISTEN (contrastive-like training)
  • LoRA

Architectures

  • Qformer audio adapter
  • Whisper encoder (frozen)
  • LLaMA-3.1-8B (frozen backbone)

Optimization Features

Model Optimization

  • adapter-only training to avoid LLM fine-tuning

Training Optimization

  • self-generated synthetic data reduces required audio hours (316.4 hrs reported)

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Focuses on audio-event hallucinations only; speech hallucination not addressed.
  • Evaluation restricted to a set of audio benchmarks; out-of-domain audio not tested.
  • Relies on synthetic labels produced by the backbone LLM, which may inherit model biases.

When Not To Use

  • When you need full end-to-end LLM adaptation for a new text-heavy task.
  • When your use case requires verified human-written audio descriptions rather than synthetic ones.
  • For multimodal tasks that require visual context; this work is audio-only.

Failure Modes

  • Poorly generated negative samples could teach the model wrong absences and harm performance.
  • Synthetic data biases from the backbone LLM may transfer to the adapter.
  • Adapter alignment may fail on audio types not covered by the small training set.

Core Entities

Models

  • LLaMA-3.1-8B
  • Gemini-1.5-Pro
  • Qwen-Audio-Chat
  • SALMONN-7B
  • SALMONN-13B
  • LTU-AS
  • Whisper (encoder)

Metrics

  • Accuracy
  • F1 (yes)
  • F1 (no)
  • Weighted F1
  • Precision
  • Recall

Datasets

  • AudioSet-20K
  • AudioCaps
  • Clotho
  • MACS
  • FSD50K
  • ESC50
  • UrbanSound8K
  • VocalSound

Benchmarks

  • Audio hallucination benchmark
  • Clotho-AQA (audio QA)
  • Synonym and Hypernym Test