OWL: teach audio LLMs room geometry (depth + RIR) so they localize sound better and explain why

September 30, 2025 · 7 min

Overview

Decision Snapshot: Needs Validation

The method shows consistent improvements on synthetic benchmarks and a clear ablation path; real-world robustness is untested and needs validation before production.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.78

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 5/5

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 75%

Authors

Subrata Biswas, Mohammad Nur Hossain Khan, Bashima Islam

Links

Abstract / PDF

Why It Matters For Business

Geometry-aware audio models give more accurate and explainable direction/distance estimates from binaural audio, which helps applications like robot navigation, AR audio placement, and multi-source monitoring.

Summary TLDR

The authors introduce SAGE, an audio encoder trained to align binaural audio with room geometry (panoramic depth plus simulated room impulse responses, RIRs), and OWL, an audio LLM that combines SAGE with chain-of-thought (CoT) training. They also release BiDepth (28K RIR-depth pairs, ≈1.1M QA items). Geometry supervision cuts localization error (mean DoA error drops by about 11°) and improves QA and reasoning (up to ≈25% higher accuracy than BAT). The system needs only binaural audio at inference and uses a three-stage curriculum to move from perception to multi-step spatial reasoning.

Problem Statement

Current audio LLMs can recognize sounds but fail at precise spatial reasoning because audio encoders ignore room geometry and models use single-step inference. This limits direction and distance accuracy and yields poor, uninterpretable reasoning in multi-source scenes.

Main Contribution

BiDepth: a synthetic, geometry-grounded dataset (≈1.1M QA pairs, 28K RIR+panoramic depth pairs) for perception and multi-step spatial reasoning.

SAGE: a geometry-aware binaural audio encoder trained with an auxiliary RIR reconstruction loss (uses depth at train time, audio-only at inference).
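
The auxiliary objective described for SAGE can be pictured as a perception loss plus a weighted RIR reconstruction term. The sketch below is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the choice of cross-entropy and mean squared error, and the weight `eta` with its default value are all hypothetical.

```python
import math

def cross_entropy(logits, target_index):
    """Softmax cross-entropy for a single example (numerically stable)."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target_index]

def mse(predicted, target):
    """Mean squared error between two equal-length sequences."""
    if len(predicted) != len(target):
        raise ValueError("predicted and target must have the same length")
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)

def joint_loss(class_logits, class_target, rir_pred, rir_target, eta=0.5):
    """Perception loss plus eta-weighted RIR reconstruction loss.

    eta is a hypothetical weight chosen for illustration only.
    """
    perception = cross_entropy(class_logits, class_target)
    geometry = mse(rir_pred, rir_target)
    return perception + eta * geometry

# Toy example: 3 direction classes and a 4-sample RIR target (made-up values).
loss = joint_loss([2.0, 0.5, -1.0], 0, [0.9, 0.4, 0.1, 0.0], [1.0, 0.5, 0.0, 0.0])
```

Because the geometry term only shapes the encoder during training, the RIR prediction branch can be dropped at inference, which matches the audio-only inference described above.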

Key Findings

Geometry supervision improves angular localization.

Numbers: 11° reduction in mean angular error (DoA)

Practical Use: Train audio encoders with depth/RIR supervision to cut angular error and get tighter source direction estimates for tracking and AR.

Evidence Ref: Abstract; Section 6.2; Table 2

Geometry-aware training reduces distance errors.

Numbers: 33.5% drop in distance error rate (DER)

Practical Use: Adding RIR/depth objectives helps estimate relative distances better, which is useful when meter-level distance estimates matter (e.g., robot navigation).

Evidence Ref: Abstract; Section 6.2; Table 2
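
To make the two headline numbers concrete, the sketch below computes a mean angular error with wraparound and a relative reduction. The wraparound convention and all sample values are assumptions for illustration; the paper's exact metric definitions (for example, how DER decides that a distance estimate is wrong) are not reproduced here.

```python
def angular_error_deg(pred_deg, true_deg):
    """Smallest absolute difference between two azimuths, in [0, 180] degrees."""
    diff = abs(pred_deg - true_deg) % 360.0
    return min(diff, 360.0 - diff)

def mean_angular_error(preds, truths):
    """Average wraparound-aware angular error over paired predictions."""
    errors = [angular_error_deg(p, t) for p, t in zip(preds, truths)]
    return sum(errors) / len(errors)

def relative_reduction(baseline, improved):
    """Fractional drop relative to the baseline (0.335 means 33.5%)."""
    return (baseline - improved) / baseline

# Made-up azimuths, only to show the arithmetic.
preds = [350.0, 10.0, 95.0]
truths = [10.0, 350.0, 90.0]
mae = mean_angular_error(preds, truths)  # per-item errors: 20, 20, 5
```

The two findings use different kinds of delta: the 11° figure is an absolute reduction in degrees, while the 33.5% figure is a relative decrease, which is what `relative_reduction` computes.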

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
DoA mean angular error (MAE) | −11° | Spatial-AST / prior baselines | 11° reduction | evaluated on BiDepth and SpatialSoundQA | Abstract; Section 6.2; Table 2 | Table 2
Distance Error Rate (DER) | −33.5% | prior baselines | 33.5% relative decrease | evaluated on BiDepth / SpatialSoundQA | Abstract; Section 6.2; Table 2 | Table 2

What To Try In 7 Days

Run SAGE-style auxiliary supervision: train your binaural encoder to predict RIR-related targets using synthetic RIRs and depth.

Plug geometry-aware audio embeddings into a frozen LLM with a small Q-Former and LoRA adapters for quick prototyping.

Add chain-of-thought supervision for relational queries (left/right, closer/farther) and compare binary accuracy vs single-step prompts.
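
For the chain-of-thought experiment in the last step, one way to prototype is to generate relational QA items from known source positions and score answers with binary accuracy. The sketch below is a hypothetical template: the function names, field names, and rationale wording are assumptions and are not taken from BiDepth.

```python
def relational_item(name_a, az_a, dist_a, name_b, az_b, dist_b):
    """Build one closer/farther QA item with a step-by-step rationale.

    Azimuths are in degrees, distances in meters. The template wording is
    hypothetical and only illustrates the idea of CoT supervision.
    """
    if dist_a == dist_b:
        raise ValueError("equal distances have no closer source")
    closer = name_a if dist_a < dist_b else name_b
    near, far = min(dist_a, dist_b), max(dist_a, dist_b)
    rationale = (
        f"Step 1: {name_a} is at {az_a:.0f} degrees, {dist_a:.1f} m. "
        f"Step 2: {name_b} is at {az_b:.0f} degrees, {dist_b:.1f} m. "
        f"Step 3: {near:.1f} m < {far:.1f} m, so {closer} is closer."
    )
    return {
        "question": f"Which is closer, the {name_a} or the {name_b}?",
        "rationale": rationale,
        "answer": closer,
    }

def binary_accuracy(predictions, answers):
    """Fraction of predictions that exactly match the reference answers."""
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)

item = relational_item("dog", 30, 2.0, "phone", 300, 4.5)
```

Training once on `question` plus `answer` and once on `question` plus `rationale` plus `answer`, then comparing `binary_accuracy` on held-out items, gives the single-step versus CoT comparison suggested above.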

Agent Features

Tool Use: LoRA, Q-Former
Frameworks: LLaMA-2-7B, ResNet-18, SoundSpaces
Architectures: Audio-Language Models, Transformer encoder, Q-Former

Optimization Features

Token Efficiency: Q-Former reduces sequence length to 64 query tokens
Infra Optimization: training reported on 4×A100 GPUs
Model Optimization: LoRA; frozen audio encoder to preserve learned geometry
System Optimization: AdamW optimizer, FP16 mixed-precision, cosine LR schedule
Training Optimization: three-stage curriculum (perception → relative geometry → CoT); joint audio + RIR reconstruction loss (η2 geometric weight); staged pretraining (AudioMAE init → joint fine-tuning)
Inference Optimization: depth and RIR used only at train time, inference uses audio only; Q-Former compresses audio features into fixed query tokens
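
The token-compression idea can be illustrated with a single cross-attention step in which a fixed set of query vectors attends over a variable number of audio frames. This is a simplified sketch, not the actual Q-Former: a real Q-Former stacks multiple layers with multi-head attention, self-attention among queries, and feed-forward blocks. The feature width `d`, the random weights standing in for learned parameters, and the function names are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def compress_with_queries(audio_feats, queries, w_k, w_v):
    """Single-head cross-attention: fixed queries attend over audio frames.

    audio_feats: (T, d) frame features from the audio encoder
    queries:     (Q, d) query vectors (Q = 64 in the summary above)
    w_k, w_v:    (d, d) key and value projections
    Returns an array of shape (Q, d), independent of T.
    """
    keys = audio_feats @ w_k
    values = audio_feats @ w_v
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])
    return softmax(scores, axis=-1) @ values

rng = np.random.default_rng(0)
d, num_queries = 32, 64  # d is an arbitrary illustrative width
queries = rng.normal(size=(num_queries, d))
w_k = rng.normal(size=(d, d)) / np.sqrt(d)
w_v = rng.normal(size=(d, d)) / np.sqrt(d)

# Clips of different lengths map to the same number of output tokens.
short_clip = compress_with_queries(rng.normal(size=(100, d)), queries, w_k, w_v)
long_clip = compress_with_queries(rng.normal(size=(512, d)), queries, w_k, w_v)
```

Both outputs have 64 rows regardless of clip length, which is why the LLM sees a fixed, short audio prefix.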

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

BiDepth is simulation-based; real-world generalization is not demonstrated.

Evaluation focuses on single-turn QA; multi-turn dialog and interactive scenarios remain untested.

When Not To Use

When you need guaranteed real-world performance without domain adaptation

For interactive multi-turn audio dialogues (system only trained for single-turn QA)

Failure Modes

Localization degrades under unseen real-world RIRs or extreme reverberation beyond simulated RT60 range

Distance estimates may fail for sources beyond 10 m or in highly occluded scenes

Core Entities

Models

SAGE, OWL, LLaMA-2-7B, Q-Former, BAT, SELDNet, Spatial-AST, Gemini (API baselines)

Metrics

mAP, Mean Angular Error (MAE), Error Rate at 20° (ER 20°), Distance Error Rate (DER), Accuracy

Datasets

BiDepth, SpatialSoundQA, AudioSet, Matterport3D, SoundSpaces

Benchmarks

BiDepth evaluation splits, SpatialSoundQA

Context Entities

Models

AudioFlamingo2, RAVEN, VideoLLaMA2, Gemini-1.5-Pro, Gemini-2.5

Datasets

SpatialVLM, SoundScape Pano-IR