OWL: teach audio LLMs room geometry (depth + RIR) so they localize sound better and explain why

September 30, 2025 · 7 min read

Overview

Production Readiness

0.6

Novelty Score

0.75

Cost Impact Score

0.5

Citation Count

0

Authors

Subrata Biswas, Mohammad Nur Hossain Khan, Bashima Islam

Why It Matters For Business

Geometry-aware audio models give more accurate and explainable direction/distance estimates from binaural audio, which helps applications like robot navigation, AR audio placement, and multi-source monitoring.

Summary TLDR

The authors introduce SAGE, an audio encoder trained to align binaural audio with room geometry (panoramic depth plus simulated room impulse responses, RIRs), and OWL, an audio LLM that combines SAGE with chain-of-thought (CoT) training. They also release BiDepth (28K RIR-depth pairs, ≈1.1M QA items). Geometry supervision cuts mean direction-of-arrival (DoA) error by about 11° and improves QA and reasoning accuracy by up to ≈25% compared with BAT. The system needs only binaural audio at inference and uses a three-stage curriculum to progress from perception to multi-step spatial reasoning.

Problem Statement

Current audio LLMs can recognize sounds but fail at precise spatial reasoning because audio encoders ignore room geometry and models use single-step inference. This limits direction and distance accuracy and yields poor, uninterpretable reasoning in multi-source scenes.

Main Contribution

BiDepth: a synthetic, geometry-grounded dataset (≈1.1M QA pairs, 28K RIR+panoramic depth pairs) for perception and multi-step spatial reasoning.

SAGE: a geometry-aware binaural audio encoder trained with an auxiliary RIR reconstruction loss (uses depth at train time, audio-only at inference).

OWL: an audio LLM that combines SAGE, a Q-Former projector, LLaMA-2-7B with LoRA, and chain-of-thought curriculum training for interpretable spatial reasoning.
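The SAGE objective described above pairs an audio task loss with an auxiliary RIR reconstruction term. The following is a minimal sketch of that idea in plain Python, not the authors' implementation: it assumes mean squared error for the reconstruction term, and the names `joint_loss` and `eta_geo` and the default weight of 0.5 are hypothetical, since this summary does not give the exact loss form or weight value.

```python
from typing import Sequence


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between two non-empty, equal-length sequences."""
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("pred and target must be non-empty and equal in length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def joint_loss(audio_task_loss: float,
               rir_pred: Sequence[float],
               rir_target: Sequence[float],
               eta_geo: float = 0.5) -> float:
    """Audio task loss plus a weighted RIR reconstruction penalty.

    eta_geo is a placeholder weight (assumption), not a value from the paper.
    """
    return audio_task_loss + eta_geo * mse(rir_pred, rir_target)


# Toy values: in training, rir_target would come from the simulator.
# At inference the reconstruction branch is unused, so only audio is needed.
loss = joint_loss(1.2, rir_pred=[0.9, 0.4, 0.1], rir_target=[1.0, 0.5, 0.0])
```

Because the geometry term only shapes the encoder during training, dropping it at inference costs nothing, which is consistent with the audio-only deployment the summary describes.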

Key Findings

Geometry supervision improves angular localization.

Numbers: 11° reduction in mean angular error (DoA)

Geometry-aware training reduces distance errors.

Numbers: 33.5% drop in distance error rate (DER)

Chain-of-Thought (CoT) plus curriculum lifts reasoning accuracy.

Numbers: ≈11.3% absolute gain in reasoning accuracy with CoT supervision

Large geometry-grounded training data was created.

Numbers: 1.1M QA pairs; 28K RIR-depth pairs

Results

DoA mean angular error (MAE)

Value: −11°

Baseline: Spatial-AST / prior baselines

Distance Error Rate (DER)

Value: −33.5%

Baseline: prior baselines

Event detection (mAP)

Value: +1.7% (approx)

Baseline: Spatial-AST / SELDNet

Accuracy

Value: 76.53% (final model; Detection 79.04%, Direction 86.76%)

Baseline: BAT and open-source baselines

BiDepth dataset size

Value: ≈1.1M QA pairs; 28K RIR-depth pairs

Baseline: larger than prior spatial audio corpora

Who Should Care

What To Try In 7 Days

Run SAGE-style auxiliary supervision: train your binaural encoder to predict RIR-related targets using synthetic RIRs and depth.

Plug geometry-aware audio embeddings into a frozen LLM with a small Q-Former and LoRA adapters for quick prototyping.

Add chain-of-thought supervision for relational queries (left/right, closer/farther) and compare binary accuracy vs single-step prompts.
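For the third experiment, one way to set up the comparison is to generate a chain-of-thought target and a single-step target for the same relational question, then score both with binary accuracy. The sketch below is illustrative only: the question wording, the "Step N" format, and the function names are assumptions, not the templates used in BiDepth.

```python
from typing import Dict, Sequence


def build_closer_example(name_a: str, dist_a_m: float,
                         name_b: str, dist_b_m: float) -> Dict[str, str]:
    """Build one closer/farther QA item with CoT and direct targets."""
    a_is_closer = dist_a_m < dist_b_m  # ties count as "no"
    answer = "yes" if a_is_closer else "no"
    relation = "less than" if a_is_closer else "not less than"
    question = f"Is the {name_a} closer to the listener than the {name_b}?"
    reasoning = (
        f"Step 1: the {name_a} is about {dist_a_m:.1f} m away. "
        f"Step 2: the {name_b} is about {dist_b_m:.1f} m away. "
        f"Step 3: {dist_a_m:.1f} m is {relation} {dist_b_m:.1f} m."
    )
    return {
        "question": question,
        "cot_target": f"{reasoning} Answer: {answer}",
        "direct_target": answer,
    }


def binary_accuracy(preds: Sequence[str], labels: Sequence[str]) -> float:
    """Fraction of yes/no predictions that match the labels."""
    if len(preds) != len(labels) or len(labels) == 0:
        raise ValueError("preds and labels must be non-empty and equal in length")
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


example = build_closer_example("dog bark", 1.5, "alarm", 4.0)
```

Training one model on `cot_target` and another on `direct_target`, then comparing `binary_accuracy` on the final yes/no answers, keeps the two conditions on the same questions and labels.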

Agent Features

Tool Use

  • LoRA
  • Q-Former

Frameworks

  • LLaMA-2-7B
  • ResNet-18
  • SoundSpaces

Architectures

  • Audio-Language Models
  • Transformer encoder
  • Q-Former

Optimization Features

Token Efficiency

  • Q-Former reduces sequence length to 64 query tokens
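The token saving comes from cross-attention: a fixed set of query vectors attends over however many audio frames the encoder emits, so the LLM always receives the same number of tokens. The sketch below shows only that property with single-head attention in plain Python. A real Q-Former also has learned queries, projections, self-attention, and feed-forward layers. The feature dimension (8) and frame count (500) are arbitrary assumptions, and only the 64-query figure comes from the summary.

```python
import math
import random
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries: List[Vector], features: List[Vector]) -> List[Vector]:
    """Each query attends over all frames and returns a weighted average."""
    dim = len(features[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for q in queries:
        scores = [scale * sum(qi * fi for qi, fi in zip(q, f)) for f in features]
        weights = softmax(scores)
        outputs.append(
            [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]
        )
    return outputs


rng = random.Random(0)
dim, num_frames, num_queries = 8, 500, 64  # dim and num_frames are assumptions
features = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_frames)]
queries = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_queries)]
tokens = cross_attend(queries, features)  # 64 output tokens for any num_frames
```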

Infra Optimization

  • training reported on 4×A100 GPUs

Model Optimization

  • LoRA
  • frozen audio encoder to preserve learned geometry

System Optimization

  • AdamW optimizer, FP16 mixed-precision, cosine LR schedule
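For readers unfamiliar with the cosine schedule named above, here is a minimal sketch of its shape in plain Python. The base learning rate, the optional linear warmup, and the minimum rate are all assumptions for illustration, as the summary does not report these hyperparameters.

```python
import math


def cosine_lr(step: int, total_steps: int, base_lr: float = 1e-4,
              warmup_steps: int = 0, min_lr: float = 0.0) -> float:
    """Learning rate at `step` under linear warmup followed by cosine decay."""
    if step < warmup_steps:
        # Linear ramp from base_lr / warmup_steps up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Sample the schedule at a few points of a hypothetical 1000-step run.
schedule = [cosine_lr(s, total_steps=1000, warmup_steps=100) for s in (0, 100, 550, 1000)]
```

The rate starts small during warmup, peaks at `base_lr` when warmup ends, and falls smoothly to `min_lr` by the final step.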

Training Optimization

  • three-stage curriculum: perception → relative geometry → CoT
  • joint audio + RIR reconstruction loss (η2 geometric weight)
  • staged pretraining (AudioMAE init → joint fine-tuning)
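The three-stage curriculum can be pictured as a schedule that switches the task mix as training progresses. The sketch below is one hypothetical way to encode it: the stage names follow the list above, but the task labels and the two-epochs-per-stage budget are placeholders, not values from the paper.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Stage:
    name: str
    task_types: Tuple[str, ...]
    epochs: int  # placeholder budget (assumption)


CURRICULUM = (
    Stage("perception", ("detection", "direction", "distance"), 2),
    Stage("relative_geometry", ("left_right", "closer_farther"), 2),
    Stage("chain_of_thought", ("multi_step_reasoning",), 2),
)


def stage_for_epoch(epoch: int, curriculum: Sequence[Stage] = CURRICULUM) -> Stage:
    """Return the stage active at a zero-based epoch index."""
    boundary = 0
    for stage in curriculum:
        boundary += stage.epochs
        if epoch < boundary:
            return stage
    return curriculum[-1]  # stay in the final stage once the budget is spent


active = [stage_for_epoch(e).name for e in range(6)]
```

Ordering the stages this way means the model sees multi-step reasoning targets only after it has been trained on the perceptual quantities those steps rely on.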

Inference Optimization

  • depth and RIR are used only at training time; inference uses binaural audio only
  • Q-Former compresses audio features into fixed query tokens

Reproducibility

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • BiDepth is simulation-based; real-world generalization is not demonstrated.
  • Evaluation focuses on single-turn QA; multi-turn dialog and interactive scenarios remain untested.
  • Elevation coverage is biased toward the horizontal plane, so overhead and below-floor sources are underrepresented.

When Not To Use

  • When you need guaranteed real-world performance without domain adaptation
  • For interactive multi-turn audio dialogues (the system was trained only for single-turn QA)
  • In highly reverberant or unusual outdoor acoustics not represented in Matterport3D simulations

Failure Modes

  • Localization degrades under unseen real-world RIRs or extreme reverberation beyond simulated RT60 range
  • Distance estimates may fail for sources beyond 10 m or in highly occluded scenes
  • Reasoning may be brittle if earlier perceptual steps (DoA/distance) are incorrect

Core Entities

Models

  • SAGE
  • OWL
  • LLaMA-2-7B
  • Q-Former
  • BAT
  • SELDNet
  • Spatial-AST
  • Gemini (API baselines)

Metrics

  • mAP
  • Mean Angular Error (MAE)
  • Error Rate at 20° (ER 20°)
  • Distance Error Rate (DER)
  • Accuracy
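The localization metrics above can be computed along the following lines. This is a sketch under stated assumptions, not the paper's evaluation code: it treats direction as azimuth only (ignoring elevation), and the 0.5 m distance tolerance is a placeholder, since the summary does not define the DER threshold.

```python
from typing import Sequence


def angular_error_deg(pred_deg: float, true_deg: float) -> float:
    """Smallest absolute difference between two azimuths, in [0, 180]."""
    diff = abs(pred_deg - true_deg) % 360.0
    return min(diff, 360.0 - diff)


def mean_angular_error(preds: Sequence[float], trues: Sequence[float]) -> float:
    """Mean of wrap-around angular errors, in degrees."""
    errors = [angular_error_deg(p, t) for p, t in zip(preds, trues)]
    return sum(errors) / len(errors)


def error_rate_at(preds: Sequence[float], trues: Sequence[float],
                  threshold_deg: float = 20.0) -> float:
    """Fraction of predictions whose angular error exceeds the threshold."""
    errors = [angular_error_deg(p, t) for p, t in zip(preds, trues)]
    return sum(e > threshold_deg for e in errors) / len(errors)


def distance_error_rate(preds_m: Sequence[float], trues_m: Sequence[float],
                        tolerance_m: float = 0.5) -> float:
    """Fraction of distance predictions off by more than tolerance_m (assumed)."""
    misses = [abs(p - t) > tolerance_m for p, t in zip(preds_m, trues_m)]
    return sum(misses) / len(misses)


doa_pred, doa_true = [355.0, 90.0, 200.0], [10.0, 100.0, 170.0]
mae = mean_angular_error(doa_pred, doa_true)
er20 = error_rate_at(doa_pred, doa_true)
der = distance_error_rate([1.0, 3.2, 6.0], [1.2, 3.0, 4.5])
```

The wrap-around step matters: a prediction of 355° against a true azimuth of 10° is a 15° error, not 345°.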

Datasets

  • BiDepth
  • SpatialSoundQA
  • AudioSet
  • Matterport3D
  • SoundSpaces

Benchmarks

  • BiDepth evaluation splits
  • SpatialSoundQA

Context Entities

Models

  • AudioFlamingo2
  • RAVEN
  • VideoLLaMA2
  • Gemini-1.5-Pro
  • Gemini-2.5

Datasets

  • SpatialVLM
  • SoundScape Pano-IR