Audio Flamingo: an audio-aware LLM with few-shot learning, retrieval, and multi-turn chat

February 2, 2024 · 7 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.5

Citation Count

4

Authors

Zhifeng Kong, Arushi Goel, Rohan Badlani, Wei Ping, Rafael Valle, Bryan Catanzaro

Links

Abstract / PDF

Why It Matters For Business

Audio Flamingo combines strong audio understanding and fast few-shot adaptation in a single, relatively small model. That makes it useful for building audio assistants, content moderation, audio search, and music tools without costly per-task fine-tuning.

Summary TLDR

Audio Flamingo is a Flamingo-style audio language model that fuses sliding-window audio features into a decoder LM via gated cross-attention. It is trained on ~5.9M audio–text pairs with a two-stage pipeline (pretraining, then supervised fine-tuning), and it adds retrieval-augmented in-context learning (RAG) and GPT-4–generated multi-turn dialogue data. The 2.2B-parameter model sets new state-of-the-art results on many audio benchmarks, improves few-shot accuracy when given retrieved examples, and yields a strong chat model for multi-turn audio dialogue.

Problem Statement

Current LLMs mostly ignore non-speech sounds and non-verbal audio. Prior audio–LLMs either lose temporal detail or lack few-shot and multi-turn dialogue abilities. The paper aims to build a single model that understands diverse audio, adapts quickly via in-context examples and retrieval, and supports multi-turn chat without per-task fine-tuning.

Main Contribution

Audio Flamingo: a Flamingo-like architecture that conditions a decoder LM on sliding-window audio via gated cross-attention.

A training recipe that combines pretraining followed by supervised fine-tuning, interleaved ICL samples, and retrieval-based ICL to enable few-shot generalization.

Two GPT-4–generated multi-turn dialogue datasets and a chat fine-tune that give strong multi-turn audio Q&A performance.

Key Findings

Outperforms prior SOTA on many audio benchmarks (captioning, QA, classification).

Numbers: Clotho-v2 CIDEr 0.465 vs 0.441; ClothoAQA unanimous 86.9% vs 74.9%

Large gains on some perceptual tasks, notably audio quality and source prediction.

Numbers: NSynth quality F1 66.7% vs 46.3% (+20.4pp); NSynth source ACC 78.7% vs 60.1% (+18.6pp)

Retrieval-augmented in-context learning (RAG+ICL) substantially improves few-shot performance.

Numbers: AudioCaps 4-shot CIDEr 0.518 vs RECAP 0.359 (+0.159); classification avg >10% improvement over zero-shot

Strong multi-turn dialogue capability after dialogue fine-tuning.

Numbers: Dialogue CIDEr (AF-Dialogue-A) 1.622 vs LTU† 0.823 (+0.799)

Can adapt to completely new labels via few-shot retrieval.

Numbers: BG-Gun-Sound 1.6% → 53.5% (zero-shot → few-shot)

Results

CIDEr (Clotho-v2 captioning)

Value: 0.465

Baseline: 0.441

Accuracy (ClothoAQA unanimous)

Value: 86.9%

Baseline: 74.9%

F1 (NSynth quality)

Value: 66.7%

Baseline: 46.3%

CIDEr (AudioCaps zero-shot)

Value: 0.502

Baseline: 0.281

CIDEr (AudioCaps retrieval-augmented 4-shot)

Value: 0.518

Baseline: 0.359 (RECAP, 4-shot)

CIDEr (Multi-turn dialogue AF-Dialogue-A)

Value: 1.622

Baseline: 0.823 (LTU†)

Accuracy (BG-Gun-Sound, few-shot)

Value: 53.5%

Baseline: 1.6% (zero-shot)

F1, approximate (FSD50k multilabel)

Value: 69.7%

Baseline: 65.6%

Who Should Care

What To Try In 7 Days

Index a small local audio dataset with LAION-CLAP embeddings and test RAG + 8-shot prompts to evaluate few-shot gains.

Run the open-source demo/code to reproduce captioning and QA on a few in-house audio samples.

Run supervised fine-tuning (SFT) on the chat model with a small set of domain dialogues to prototype an audio-aware assistant.

Agent Features

Memory

  • in-context learning (ICL) via interleaved examples
  • retrieval-augmented context (RAG)

Tool Use

  • Faiss for k-NN retrieval
  • LAION-CLAP embeddings for retrieval
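
The retrieval step listed above can be sketched as a nearest-neighbour lookup over audio embeddings. The snippet below is a pure-Python stand-in, not the paper's pipeline: the toy 3-dimensional vectors take the place of LAION-CLAP embeddings, the brute-force cosine search takes the place of a Faiss index, and the names `cosine`, `retrieve_shots`, and `index` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieve_shots(query_emb, index, k):
    """Return the k (similarity, text) pairs closest to the query embedding."""
    scored = [(cosine(query_emb, emb), text) for emb, text in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

# Toy vectors standing in for audio embeddings (assumption: real ones
# would come from an audio encoder such as LAION-CLAP).
index = [
    ([1.0, 0.0, 0.0], "a dog barks"),
    ([0.9, 0.1, 0.0], "a puppy yelps"),
    ([0.0, 1.0, 0.0], "rain falls on a roof"),
    ([0.0, 0.0, 1.0], "a piano plays"),
]

shots = retrieve_shots([1.0, 0.05, 0.0], index, k=2)
few_shot_texts = [text for _, text in shots]
```

The retrieved texts (and their paired audio) would then be placed before the query as in-context examples. At realistic index sizes, a dedicated k-NN library would replace the brute-force loop.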

Architectures

  • decoder-only LM (OPT-IML-MAX-1.3B)
  • gated xattn-dense cross-attention
  • sliding-window audio extractor (ClapCap)
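
To make the gated cross-attention idea concrete, here is a minimal single-head sketch in plain Python. It assumes the tanh-gated residual form used in Flamingo-style models and deliberately omits learned projections, multiple heads, and the gated dense (feed-forward) sub-layer; `cross_attend`, `gated_xattn`, and the toy inputs are illustrative names and values, not the paper's code.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(text, audio):
    """Each text vector queries the audio vectors (keys and values)."""
    d = len(audio[0])
    out = []
    for q in text:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in audio]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, audio)) for j in range(d)])
    return out

def gated_xattn(text, audio, alpha):
    """Residual update: text + tanh(alpha) * attention(text, audio)."""
    gate = math.tanh(alpha)
    attended = cross_attend(text, audio)
    return [[t + gate * a for t, a in zip(t_vec, a_vec)]
            for t_vec, a_vec in zip(text, attended)]

text = [[1.0, 0.0], [0.0, 1.0]]                  # two text-token states
audio = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]    # three audio-window features

closed = gated_xattn(text, audio, alpha=0.0)  # gate closed: text passes through
opened = gated_xattn(text, audio, alpha=1.0)  # gate open: audio is mixed in
```

With the gate at zero the layer is an identity on the text states, so a pretrained LM behaves as before until training opens the gate.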

Optimization Features

Token Efficiency

  • cross-attention conditioning scales linearly with the number of audio windows (m), whereas prepending audio tokens to the text sequence scales quadratically
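
A rough way to see the scaling difference is to count attention score entries per layer. The sketch below makes a simplifying assumption that each audio window contributes one embedding, and `attention_pairs` is a hypothetical helper for illustration only.

```python
def attention_pairs(n_text, m_audio):
    """Count query-key pairs per layer for the two conditioning schemes.

    cross:   text self-attention (n*n) plus text-to-audio cross-attention (n*m)
    prepend: self-attention over the concatenated sequence ((n + m)**2)
    """
    cross = n_text * n_text + n_text * m_audio
    prepend = (n_text + m_audio) ** 2
    return cross, prepend

for m in (16, 32, 64):
    cross, prepend = attention_pairs(n_text=64, m_audio=m)
    print(f"m={m:3d}  cross-attention={cross:6d}  prepend={prepend:6d}")
```

For 64 text tokens, each additional audio window adds a constant 64 pairs under cross-attention, while the prepend cost grows with the square of the combined length.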

Infra Optimization

  • SFT

Model Optimization

  • instruction-tuned decoder LM backbone
  • trainable audio transformation layers (3 self-attention layers)

Training Optimization

  • two-stage: pretraining (freeze LM) then supervised fine-tune (unfreeze LM)
  • interleaved ICL samples and dataset-weighted sampling
  • block upper-triangular cross-attention masks for interleaved conditioning
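
The block mask for interleaved samples can be illustrated as follows. This sketch assumes the rule is that the text of sample j may attend to the audio of samples 0..j and nothing later; the function name and the toy lengths are hypothetical. The mask is laid out with text tokens as rows, so it appears block lower-triangular here; with audio positions as rows it is the block upper-triangular form named above.

```python
def block_cross_attention_mask(text_lens, audio_lens):
    """mask[t][a] is True when text token t may attend to audio position a."""
    # Record which interleaved sample each token / audio position belongs to.
    text_owner = [j for j, n in enumerate(text_lens) for _ in range(n)]
    audio_owner = [j for j, n in enumerate(audio_lens) for _ in range(n)]
    return [[a_owner <= t_owner for a_owner in audio_owner]
            for t_owner in text_owner]

# Three interleaved samples: text lengths 2, 1, 2 and audio lengths 1, 2, 1.
mask = block_cross_attention_mask(text_lens=[2, 1, 2], audio_lens=[1, 2, 1])

for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Each row is one text token; an `x` marks an audio position it is allowed to see, so later samples see progressively more of the audio context.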

Reproducibility

Data Urls

  • AudioSet, AudioCaps, Clotho, LAION-CLAP (public datasets referenced in paper)

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Maximum supported audio per run is ~33.25s (16 windows); longer audio is cropped.
  • Some benchmarks show lower scores than specific baselines (e.g., CochlScene accuracy lower vs prior SOTA).
  • Model focuses on general audio and music; complex speech tasks beyond transcription are not fully addressed.
  • Training stability required freezing the audio encoder; unfreezing caused instability in early experiments.
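
The ~33.25 s limit follows from sliding-window arithmetic. The sketch below assumes 7-second windows with 75% overlap; these values are not stated in this summary and are used because they reproduce the 33.25 s figure for 16 windows. `split_into_chunks` is a hypothetical pre-segmentation helper for longer recordings.

```python
def covered_seconds(num_windows, window_s=7.0, overlap_fraction=0.75):
    """Audio duration spanned by overlapping sliding windows."""
    hop_s = window_s * (1.0 - overlap_fraction)  # 1.75 s under the assumed values
    return window_s + (num_windows - 1) * hop_s

def split_into_chunks(total_s, max_s=33.25):
    """Start/end times of consecutive chunks, each at most max_s seconds long."""
    chunks, start = [], 0.0
    while start < total_s:
        end = min(start + max_s, total_s)
        chunks.append((start, end))
        start = end
    return chunks

limit = covered_seconds(16)        # 7 + 15 * 1.75
chunks = split_into_chunks(80.0)   # an 80-second recording
```

Each chunk can then be passed to the model separately; how to merge per-chunk outputs is task-dependent and not covered here.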

When Not To Use

  • For audio longer than ~33s without pre-segmentation.
  • When the task demands advanced speech processing (ASR, diarization) beyond captioning/QA.
  • If strict computational or latency limits rule out a 2.2B-parameter model with cross-attention and retrieval.

Failure Modes

  • Performance is dataset-dependent; few-shot gains vary by dataset.
  • Possible hallucinations in open-ended captions or QA when retrieval is noisy or absent.
  • Biases and gaps from heterogeneous training data can affect rare labels.
  • Dialogue behavior depends on quality of synthetic GPT-4 dialogues; generated data may inject artifacts.

Core Entities

Models

  • Audio Flamingo
  • OPT-IML-MAX-1.3B
  • ClapCap
  • LAION-CLAP
  • Pengi
  • Qwen-Audio
  • LTU
  • MU-LLaMA
  • RECAP

Metrics

  • CIDEr
  • Accuracy
  • F1
  • Bleu4
  • Rouge-L

Datasets

  • AudioCaps
  • Clotho-v2
  • ClothoAQA
  • NSynth
  • FSD50k
  • Medley-solos-DB
  • AudioSet
  • LP-MusicCaps
  • MusicCaps
  • MusicAVQA
  • CREMA-D
  • Ravdess
  • US8K
  • GTZAN
  • AF-Dialogue-AudioSetSL
  • AF-Dialogue-MusicCaps

Benchmarks

  • Audio captioning (CIDEr)
  • Classification and QA (Accuracy)
  • Multi-turn dialogue (CIDEr/Bleu4/Rouge-L)