Audio Flamingo: an audio-aware LLM with few-shot learning, retrieval, and multi-turn chat

February 2, 2024 | 7 min

Overview

Decision Snapshot: Needs Validation

The paper provides numeric comparisons across many public benchmarks and ablations; results are convincing for 1.3–2.2B scale but generalization to larger LMs and long audio (>33s) needs testing.

Citations: 4

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 8/8

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 70%

Novelty: 60%

Authors

Zhifeng Kong, Arushi Goel, Rohan Badlani, Wei Ping, Rafael Valle, Bryan Catanzaro

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Audio Flamingo brings strong audio understanding and fast few-shot adaptation into a single, relatively small model. That makes it useful for building audio assistants, content moderation, audio search, and music tools without costly per-task fine-tuning.

Summary TLDR

Audio Flamingo is a Flamingo-style audio language model that fuses sliding-window audio features into a decoder LM via gated cross-attention. It is trained on ~5.9M audio-text pairs in two stages (pretraining, then supervised fine-tuning), and it adds retrieval-augmented in-context learning and GPT-4-generated multi-turn dialogue data. The model (2.2B params) sets new state-of-the-art results on many audio benchmarks, improves few-shot accuracy with retrieved examples, and yields a strong chat model for multi-turn audio dialogue.

Problem Statement

Current LLMs mostly ignore non-speech sounds and non-verbal audio. Prior audio–LLMs either lose temporal detail or lack few-shot and multi-turn dialogue abilities. The paper aims to build a single model that understands diverse audio, adapts quickly via in-context examples and retrieval, and supports multi-turn chat without per-task fine-tuning.

Main Contribution

Audio Flamingo: a Flamingo-like architecture that conditions a decoder LM on sliding-window audio via gated cross-attention.

A training recipe that combines pretraining and supervised fine-tuning with interleaved ICL samples and retrieval-based ICL to enable few-shot generalization.
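To make the gated cross-attention idea concrete, here is a minimal NumPy sketch of one conditioning step. It is not the paper's implementation: the single attention head, the weight names (`Wq`, `Wk`, `Wv`), the toy dimensions, and the scalar `alpha` gate are assumptions for illustration, and the gated feed-forward ("dense") half of a Flamingo-style layer is left out.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gated_cross_attention(text_h, audio_h, Wq, Wk, Wv, alpha):
    """Text hidden states attend to audio features; a tanh gate scales the update.

    text_h:  (T, d) decoder hidden states
    audio_h: (A, d) audio features from all sliding windows, concatenated
    alpha:   scalar gate parameter (hypothetical name); tanh(0) = 0, so a
             zero gate leaves the text hidden states unchanged
    """
    q = text_h @ Wq
    k = audio_h @ Wk
    v = audio_h @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (T, A) text-to-audio scores
    attended = softmax(scores, axis=-1) @ v   # (T, d) audio summary per text token
    return text_h + np.tanh(alpha) * attended # gated residual update

# Toy inputs: random placeholders, not real model activations.
rng = np.random.default_rng(0)
d = 8
text_h = rng.normal(size=(5, d))
audio_h = rng.normal(size=(12, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))

out_closed = gated_cross_attention(text_h, audio_h, Wq, Wk, Wv, alpha=0.0)
out_open = gated_cross_attention(text_h, audio_h, Wq, Wk, Wv, alpha=1.0)
```

With the gate at zero the layer is an identity on the text stream, which is one plausible reason such gates suit a frozen-LM pretraining stage: the audio pathway can be learned without disturbing the language model at the start.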

Key Findings

Outperforms prior SOTA on many audio benchmarks (captioning, QA, classification).

Numbers: Clotho-v2 CIDEr 0.465 vs 0.441; ClothoAQA unanimous 86.9% vs 74.9%

Practical Use: Use Audio Flamingo as a strong off-the-shelf model for audio captioning and many audio QA/classification tasks.

Evidence Ref: Table 2

Large gains on some perceptual tasks, notably audio quality and source prediction.

Numbers: NSynth quality F1 66.7% vs 46.3% (+20.4pp); NSynth source ACC 78.7% vs 60.1% (+18.6pp)

Practical Use: Choose Audio Flamingo for tasks that need fine-grained perceptual labels (quality, source).

Evidence Ref: Table 2

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
CIDEr (Clotho-v2 captioning) | 0.465 | 0.441 | +0.024 | Clotho-v2 | Table 2: Clotho-v2 CIDEr | Table 2
Accuracy | 86.9% | 74.9% | +12.0pp | ClothoAQA (unanimous) | Table 2: ClothoAQA unanimous accuracy | Table 2

What To Try In 7 Days

Index a small local audio dataset with LAION-CLAP embeddings and test RAG + 8-shot prompts to evaluate few-shot gains.

Run the open-source demo/code to reproduce captioning and QA on a few in-house audio samples.

Run supervised fine-tuning (SFT) of the chat model on a small set of domain dialogues to prototype an audio-aware assistant.
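The retrieval experiment in the first item can be prototyped before any real embeddings exist. The sketch below is an assumption-heavy stand-in: random vectors replace LAION-CLAP embeddings, brute-force cosine similarity in NumPy replaces a Faiss index, and the `<audio:...> caption:` prompt layout and both helper functions are hypothetical, not the format the released code expects.

```python
import numpy as np

def top_k_neighbors(query, index, k):
    """Return indices of the k rows of `index` most similar to `query` (cosine)."""
    q = query / np.linalg.norm(query)
    m = index / np.linalg.norm(index, axis=1, keepdims=True)
    sims = m @ q
    return np.argsort(-sims)[:k].tolist()

def build_few_shot_prompt(retrieved, query_name):
    """Interleave retrieved (audio name, caption) pairs ahead of the query audio."""
    lines = [f"<audio:{name}> caption: {caption}" for name, caption in retrieved]
    lines.append(f"<audio:{query_name}> caption:")
    return "\n".join(lines)

# Placeholder data: random vectors stand in for audio embeddings.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(100, 16))
captions = [f"caption {i}" for i in range(100)]

# A query that is a slightly perturbed copy of clip 42.
query_vec = embeddings[42] + 0.01 * rng.normal(size=16)

neighbors = top_k_neighbors(query_vec, embeddings, k=8)
prompt = build_few_shot_prompt(
    [(f"clip_{i}", captions[i]) for i in neighbors], "query_clip"
)
```

For a dataset of a few thousand clips, brute-force search like this is usually fast enough to measure few-shot gains; an approximate index only becomes necessary at larger scale.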

Agent Features

Memory
in-context learning (ICL) via interleaved examples
retrieval-augmented context (RAG)
Tool Use
Faiss for k-NN retrieval
LAION-CLAP embeddings for retrieval
Architectures
decoder-only LM (OPT-IML-MAX-1.3B)
gated xattn-dense cross-attention
sliding-window audio extractor (ClapCap)

Optimization Features

Token Efficiency

Cross-attention conditioning scales linearly in the number of audio windows (m), whereas prepending audio tokens to the LM input makes attention cost quadratic in the combined sequence length.
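A rough count of attention score entries illustrates the difference. This is a back-of-the-envelope sketch, not a measurement: `tokens_per_window` is a hypothetical parameter, causal masking is ignored, and only a single attention layer is counted.

```python
def cross_attention_pairs(n_text, m_windows, tokens_per_window):
    """Score entries when text self-attends and cross-attends to audio."""
    n_audio = m_windows * tokens_per_window
    return n_text * n_text + n_text * n_audio  # linear in m_windows

def prepend_pairs(n_text, m_windows, tokens_per_window):
    """Score entries when audio tokens are prepended to the text sequence."""
    total = n_text + m_windows * tokens_per_window
    return total * total                       # quadratic in m_windows

# Hypothetical sizes: 64 text tokens, 8 audio tokens per window.
cross_8 = cross_attention_pairs(64, 8, 8)
cross_16 = cross_attention_pairs(64, 16, 8)
prepend_8 = prepend_pairs(64, 8, 8)
prepend_16 = prepend_pairs(64, 16, 8)
```

Under these assumed sizes, doubling the number of windows adds a fixed increment to the cross-attention count, while the prepend count more than doubles.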

Infra Optimization
SFT
Model Optimization
instruction-tuned decoder LM backbone
trainable audio transformation layers (3 self-attention layers)
Training Optimization
two-stage: pretraining (freeze LM) then supervised fine-tune (unfreeze LM)
interleaved ICL samples and dataset-weighted sampling
block upper-triangular cross-attention masks for interleaved conditioning
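One way to read the block upper-triangular mask is that the text of the j-th interleaved sample may attend only to audio from samples 1 through j. The NumPy sketch below builds such a mask under that reading; the function name, the (audio, text) axis order, and the toy lengths are assumptions, not details taken from the released code.

```python
import numpy as np

def interleaved_cross_attention_mask(text_lens, audio_lens):
    """Build a boolean mask of shape (total_audio, total_text).

    mask[a, t] is True when text position t may attend to audio position a,
    i.e. when the audio belongs to the same or an earlier interleaved sample.
    """
    # Sample index of every audio position and every text position.
    audio_sample = np.repeat(np.arange(len(audio_lens)), audio_lens)
    text_sample = np.repeat(np.arange(len(text_lens)), text_lens)
    return audio_sample[:, None] <= text_sample[None, :]

# Three interleaved samples: text lengths 2, 3, 2 and audio lengths 1, 1, 2.
mask = interleaved_cross_attention_mask(text_lens=[2, 3, 2], audio_lens=[1, 1, 2])
```

With audio on the rows and text on the columns the allowed region is block upper-triangular; transposing the mask gives the equivalent block lower-triangular layout some attention implementations expect.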

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Data URLs

AudioSet, AudioCaps, Clotho, LAION-CLAP (public datasets referenced in paper)

Risks & Boundaries

Limitations

Maximum supported audio per run is ~33.25s (16 windows); longer audio is cropped.

On some benchmarks the model scores below specific baselines (e.g., CochlScene accuracy is lower than prior SOTA).

When Not To Use

For audio longer than ~33s without pre-segmentation.

When the task demands advanced speech processing (ASR, diarization) beyond captioning/QA.
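For the long-audio case, pre-segmentation can be as simple as cutting a clip into pieces no longer than the supported maximum. The sketch below assumes 7-second windows with a 1.75-second hop, values chosen here because 16 such windows cover exactly the 33.25 s limit stated above; both helper functions are hypothetical, and non-overlapping chunks are only one possible segmentation policy.

```python
def window_coverage(num_windows, window_s, hop_s):
    """Total seconds covered by overlapping sliding windows."""
    return window_s + (num_windows - 1) * hop_s

def split_into_chunks(duration_s, max_chunk_s):
    """Return (start, end) pairs covering the clip, each at most max_chunk_s long."""
    chunks = []
    start = 0.0
    while start < duration_s:
        end = min(start + max_chunk_s, duration_s)
        chunks.append((start, end))
        start = end
    return chunks

# Assumed window settings (see lead-in); 16 windows is the stated limit.
MAX_AUDIO_S = window_coverage(num_windows=16, window_s=7.0, hop_s=1.75)

# Split a hypothetical 80-second recording into model-sized pieces.
chunks = split_into_chunks(80.0, MAX_AUDIO_S)
```

Each chunk can then be captioned or queried separately, with the per-chunk outputs merged downstream. Hard cuts may split an event across two chunks, so overlapping chunks are worth testing if boundary events matter.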

Failure Modes

Performance is dataset-dependent; few-shot gains vary by dataset.

Possible hallucinations in open-ended captions or QA when retrieval is noisy or absent.

Core Entities

Models

Audio Flamingo, OPT-IML-MAX-1.3B, ClapCap, LAION-CLAP, Pengi, Qwen-Audio, LTU, MU-LLaMA, RECAP

Metrics

CIDEr, Accuracy, F1, Bleu4, Rouge-L

Datasets

AudioCaps, Clotho-v2, ClothoAQA, NSynth, FSD50k, Medley-solos-DB, AudioSet, LP-MusicCaps, MusicCaps, MusicAVQA, CREMA-D, Ravdess, US8K, GTZAN, AF-Dialogue-AudioSetSL, AF-Dialogue-MusicCaps

Benchmarks

Audio captioning (CIDEr), Accuracy, Multi-turn dialogue (CIDEr/Bleu4/Rouge-L)