Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.5
Citation Count
4
Why It Matters For Business
Audio Flamingo combines strong audio understanding with fast few-shot adaptation in a single, relatively small model. That makes it useful for building audio assistants, content moderation, audio search, and music tools without costly per-task fine-tuning.
Summary TLDR
Audio Flamingo is a Flamingo-style audio language model that fuses sliding-window audio features into a decoder LM via gated cross-attention. It is trained on ~5.9M audio–text pairs in a two-stage pipeline (pretraining, then supervised fine-tuning), and adds retrieval-augmented in-context learning (RAG + ICL) and GPT-4–generated multi-turn dialogue data. The model (2.2B params) sets new state-of-the-art results on many audio benchmarks, improves few-shot accuracy with retrieved examples, and yields a strong chat model for multi-turn audio dialogue.
Problem Statement
Current LLMs mostly ignore non-speech sounds and non-verbal audio. Prior audio–LLMs either lose temporal detail or lack few-shot and multi-turn dialogue abilities. The paper aims to build a single model that understands diverse audio, adapts quickly via in-context examples and retrieval, and supports multi-turn chat without per-task fine-tuning.
Main Contribution
Audio Flamingo: a Flamingo-like architecture that conditions a decoder LM on sliding-window audio via gated cross-attention.
A training recipe that combines pretraining followed by supervised fine-tuning, interleaved ICL samples, and retrieval-based ICL to enable few-shot generalization.
Two GPT-4–generated multi-turn dialogue datasets and a chat fine-tune that give strong multi-turn audio Q&A performance.
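The gated cross-attention named in the first contribution can be illustrated with a minimal NumPy sketch. It assumes the Flamingo-style design, where a tanh gate on a learnable scalar starts at zero so the language model is initially unchanged. It is single-head and omits the dense (feed-forward) half, layer norm, and masking. All shapes and weights here are made up for illustration and are not taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gated_cross_attention(text, audio, Wq, Wk, Wv, alpha):
    """Single-head tanh-gated cross-attention with a residual connection.

    text:  (n_text, d) hidden states from the language model
    audio: (n_audio, d) features from the audio windows
    alpha: scalar gate parameter; tanh(0) = 0, so the block starts as identity
    """
    q = text @ Wq            # queries come from the text stream
    k = audio @ Wk           # keys and values come from the audio stream
    v = audio @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    attended = softmax(scores) @ v
    return text + np.tanh(alpha) * attended

# Hypothetical sizes: 5 text tokens, 16 audio feature vectors, width 8.
rng = np.random.default_rng(0)
d = 8
text = rng.normal(size=(5, d))
audio = rng.normal(size=(16, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

closed = gated_cross_attention(text, audio, Wq, Wk, Wv, alpha=0.0)
opened = gated_cross_attention(text, audio, Wq, Wk, Wv, alpha=1.0)
```

With `alpha=0.0` the output equals the input text states, which is why this kind of gate lets audio conditioning be introduced gradually during training.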
Key Findings
Outperforms prior SOTA on many audio benchmarks (captioning, QA, classification).
Large gains on some perceptual tasks, notably audio quality and source prediction.
Retrieval-augmented in-context learning (RAG+ICL) substantially improves few-shot performance.
Strong multi-turn dialogue capability after dialogue fine-tuning.
Can adapt to completely new labels via few-shot retrieval.
Results
CIDEr (Clotho-v2 captioning)
Accuracy
F1 (NSynth quality)
CIDEr (AudioCaps zero-shot)
CIDEr (AudioCaps retrieval-augmented 4-shot)
CIDEr (Multi-turn dialogue AF-Dialogue-A)
Accuracy
F1 approx (FSD50k multilabel)
Who Should Care
What To Try In 7 Days
Index a small local audio dataset with LAION-CLAP embeddings and test RAG + 8-shot prompts to evaluate few-shot gains.
Run the open-source demo/code to reproduce captioning and QA on a few in-house audio samples.
Run supervised fine-tuning (SFT) of the chat model on a small set of domain dialogues to prototype an audio-aware assistant.
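The first experiment above (index embeddings, retrieve neighbors, build a few-shot prompt) can be prototyped before wiring in real components. The sketch below swaps in brute-force cosine similarity in NumPy for Faiss, and uses tiny hand-written vectors in place of LAION-CLAP embeddings. The `<audio_i>` prompt layout and the helper names are hypothetical, not the format used by Audio Flamingo.

```python
import numpy as np

def top_k_neighbors(query, index, k):
    """Return indices of the k rows of `index` most cosine-similar to `query`."""
    index_n = index / np.linalg.norm(index, axis=1, keepdims=True)
    query_n = query / np.linalg.norm(query)
    sims = index_n @ query_n
    return np.argsort(-sims)[:k]

def build_few_shot_prompt(captions, neighbor_ids):
    """Place retrieved captions before the query as in-context examples.

    The "<audio_i>" markers are placeholders for wherever the model
    expects audio inputs; the real template is an assumption here.
    """
    shots = [f"<audio_{i}> {captions[i]}" for i in neighbor_ids]
    return "\n".join(shots + ["<audio_query>"])

# Placeholder 2-D "embeddings"; real audio embeddings would be much wider.
index = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]])
captions = ["dog barking", "dog growling", "rain on a roof", "engine idling"]
query = np.array([1.0, 0.05])

ids = top_k_neighbors(query, index, k=2)
prompt = build_few_shot_prompt(captions, ids)
print(prompt)
```

For a larger collection, the brute-force search would be the piece to replace with a Faiss index; the prompt-building step stays the same.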
Agent Features
Memory
- in-context learning (ICL) via interleaved examples
- retrieval-augmented context (RAG)
Tool Use
- Faiss for k-NN retrieval
- LAION-CLAP embeddings for retrieval
Architectures
- decoder-only LM (OPT-IML-MAX-1.3B)
- gated xattn-dense cross-attention
- sliding-window audio extractor (ClapCap)
Optimization Features
Token Efficiency
- cross-attention conditioning scales linearly with the number of audio windows (m), whereas prepending audio tokens to the LM input scales quadratically
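A rough count of attention score entries shows where the linear-versus-quadratic difference comes from. This is a back-of-the-envelope sketch for one attention layer: the text length (64) and tokens per audio window (32) are hypothetical values, and constant factors such as heads and hidden width are ignored.

```python
def prepend_attention_pairs(n_text, m_windows, tokens_per_window):
    """Self-attention over one sequence with all audio tokens prepended."""
    seq_len = n_text + m_windows * tokens_per_window
    return seq_len * seq_len

def cross_attention_pairs(n_text, m_windows, tokens_per_window):
    """Text self-attention plus text-to-audio cross-attention."""
    n_audio = m_windows * tokens_per_window
    return n_text * n_text + n_text * n_audio

# Compare how each count grows as the number of audio windows doubles.
for m in (4, 8, 16):
    print(m, prepend_attention_pairs(64, m, 32), cross_attention_pairs(64, m, 32))
```

Each extra window adds a fixed `n_text * tokens_per_window` entries under cross-attention, while under prepending the cost of each extra window grows with the sequence length already in place.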
Infra Optimization
- SFT
Model Optimization
- instruction-tuned decoder LM backbone
- trainable audio transformation layers (3 self-attention layers)
Training Optimization
- two-stage: pretraining (LM frozen), then supervised fine-tuning (LM unfrozen)
- interleaved ICL samples and dataset-weighted sampling
- block upper-triangular cross-attention masks for interleaved conditioning
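The block-structured mask for interleaved samples can be sketched as follows, assuming its purpose is to let the text of the j-th interleaved sample attend only to audio from samples 1 through j. The function name and the token counts are hypothetical. The mask is laid out here as text rows by audio columns, which makes the blocks lower-triangular; the transpose (audio by text) is the upper-triangular form, and which orientation the paper uses is an assumption.

```python
def interleaved_cross_attention_mask(text_lengths, audio_lengths):
    """Build a boolean mask of shape (total_text, total_audio).

    Entry [i][k] is True when text token i may attend to audio token k,
    i.e. when that audio token belongs to the same or an earlier sample.
    """
    # Tag every token with the index of the interleaved sample it belongs to.
    text_sample = [j for j, n in enumerate(text_lengths) for _ in range(n)]
    audio_sample = [j for j, n in enumerate(audio_lengths) for _ in range(n)]
    return [[a <= t for a in audio_sample] for t in text_sample]

# Three interleaved (audio, text) samples with made-up token counts.
mask = interleaved_cross_attention_mask(text_lengths=[2, 1, 2],
                                        audio_lengths=[1, 2, 1])
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

In an attention implementation, the `False` entries would be set to a large negative value before the softmax so that later audio cannot leak into earlier text predictions.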
Reproducibility
Data Urls
- AudioSet, AudioCaps, Clotho, LAION-CLAP (public datasets referenced in paper)
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Maximum supported audio per run is ~33.25s (16 windows); longer audio is cropped.
- Some benchmarks show lower scores than specific baselines (e.g., CochlScene accuracy lower vs prior SOTA).
- Model focuses on general audio and music; complex speech tasks beyond transcription are not fully addressed.
- Training stability required freezing the audio encoder; unfreezing caused instability in early experiments.
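The ~33.25s limit in the first item follows from simple sliding-window arithmetic. The sketch below assumes 7-second windows with a 1.75-second hop (75% overlap). Those two values are not stated in this summary and are chosen because they reproduce 33.25s for 16 windows. The helper names are hypothetical.

```python
def max_audio_seconds(num_windows, window_s, hop_s):
    """Total span covered by overlapping windows laid end to end."""
    return window_s + (num_windows - 1) * hop_s

def window_starts(duration_s, window_s, hop_s, max_windows):
    """Start times of the windows used for a clip.

    Windows are added until the clip is covered or the cap is reached;
    anything beyond the last window is cropped.
    """
    starts = [0.0]
    while len(starts) < max_windows and starts[-1] + window_s < duration_s:
        starts.append(starts[-1] + hop_s)
    return starts

print(max_audio_seconds(16, 7.0, 1.75))          # span of 16 windows
print(len(window_starts(60.0, 7.0, 1.75, 16)))   # a 60 s clip hits the cap
print(window_starts(10.0, 7.0, 1.75, 16))        # a 10 s clip needs few windows
```

For clips longer than the cap, the same helper can be run on pre-segmented chunks, which is the workaround implied by the "When Not To Use" guidance.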
When Not To Use
- For audio longer than ~33s without pre-segmentation.
- When the task demands advanced speech processing (ASR, diarization) beyond captioning/QA.
- If strict compute or latency limits rule out a 2.2B-parameter model with cross-attention and retrieval.
Failure Modes
- Performance is dataset-dependent; few-shot gains vary by dataset.
- Possible hallucinations in open-ended captions or QA when retrieval is noisy or absent.
- Biases and gaps from heterogeneous training data can affect rare labels.
- Dialogue behavior depends on quality of synthetic GPT-4 dialogues; generated data may inject artifacts.
Core Entities
Models
- Audio Flamingo
- OPT-IML-MAX-1.3B
- ClapCap
- LAION-CLAP
- Pengi
- Qwen-Audio
- LTU
- MU-LLaMA
- RECAP
Metrics
- CIDEr
- Accuracy
- F1
- Bleu4
- Rouge-L
Datasets
- AudioCaps
- Clotho-v2
- ClothoAQA
- NSynth
- FSD50k
- Medley-solos-DB
- AudioSet
- LP-MusicCaps
- MusicCaps
- MusicAVQA
- CREMA-D
- Ravdess
- US8K
- GTZAN
- AF-Dialogue-AudioSetSL
- AF-Dialogue-MusicCaps
Benchmarks
- Audio captioning (CIDEr)
- Accuracy
- Multi-turn dialogue (CIDEr/Bleu4/Rouge-L)

