Use an LLM to write a structured audio script, compile it to code, and run specialist audio models to generate narrated, mixed audio scenes.

July 26, 2023 · 8 min

Overview

Decision Snapshot: Needs Validation

Good subjective evidence across two benchmarks, plus ablations, supports the claims, but the system depends on proprietary LLMs and adds inference cost; reliability varies with the choice of LLM.

Citations: 6

Evidence Strength: 0.80

Confidence: 0.86

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 70%

Authors

Xubo Liu, Zhongkai Zhu, Haohe Liu, Yi Yuan, Meng Cui, Qiushi Huang, Jinhua Liang, Yin Cao, Qiuqiang Kong, Mark D. Plumbley, Wenwu Wang

Links

Abstract / PDF / Code / Data

Why It Matters For Business

WavJourney turns natural language briefs into finished mixed audio by chaining existing specialist models, reducing the need to build large unified audio models and enabling faster prototyping of audio content.

Who Should Care

Summary TLDR

WavJourney uses an LLM (GPT-4) to turn a text instruction into a structured "audio script" (a JSON list of speech, music, and sound-effect entries), compiles that script into a short Python program, and then calls specialist audio models (TTS, text-to-music, text-to-audio) to generate and mix the pieces. No extra training is needed. On public datasets (AudioCaps, Clotho) and a new multi-genre storytelling benchmark, the system improves subjective quality over modern baselines and supports iterative human-in-the-loop edits. Major tradeoffs: longer run time and dependence on external models such as GPT-4 and Bark.
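The script-then-compile flow summarized above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not WavJourney's actual schema or compiler: the field names (`audio_type`, `layout`, `text`, `desc`, `vol`, `len`), the `compile_script` helper, and the generated call names (`tts`, `text_to_music`, `text_to_audio`, `concatenate`, `mix`) are all hypothetical.

```python
import json

# Hypothetical audio script, as an LLM might return it (field names are assumptions).
SCRIPT_JSON = """
[
  {"audio_type": "music", "layout": "background", "desc": "calm piano", "vol": -30},
  {"audio_type": "speech", "layout": "foreground", "character": "Narrator",
   "text": "It was a quiet morning."},
  {"audio_type": "sound_effect", "layout": "foreground", "desc": "door creaks open", "len": 2}
]
"""

# Map each audio type to a hypothetical specialist-model function name.
GENERATORS = {
    "speech": "tts",
    "music": "text_to_music",
    "sound_effect": "text_to_audio",
}


def compile_script(script):
    """Turn script nodes into the lines of a small Python program (as text)."""
    lines = []
    foreground, background = [], []
    for i, node in enumerate(script):
        func = GENERATORS[node["audio_type"]]
        prompt = node.get("text") or node["desc"]
        var = f"clip_{i}"
        lines.append(f"{var} = {func}({prompt!r})")
        # Foreground clips play in sequence; background clips sit underneath.
        (foreground if node["layout"] == "foreground" else background).append(var)
    lines.append(f"fg = concatenate([{', '.join(foreground)}])")
    lines.append(f"out = mix(fg, [{', '.join(background)}])")
    return lines


program = compile_script(json.loads(SCRIPT_JSON))
print("\n".join(program))
```

Because the compiler here is ordinary deterministic code rather than a second LLM call, the same script always yields the same program, which is the property the summary attributes to the hand-crafted compiler.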

Problem Statement

Existing audio generators target narrow tasks (speech, music, or effects) and struggle to produce coherent, multi-element audio scenes from a single text instruction. The problem is to combine specialist models into a controllable, interpretable pipeline that produces composed audio stories from text without retraining.

Main Contribution

A pipeline that prompts an LLM to produce a structured audio script, compiles it to program code, and executes specialist audio models to synthesize composed audio.

Empirical results showing improved subjective quality over AudioGen and AudioLDM on AudioCaps and a new multi-genre storytelling benchmark.

Key Findings

WavJourney beats AudioGen and AudioLDM in human subjective scores on AudioCaps.

Numbers: OVL 3.75 vs AudioGen 3.56; REL 3.74 vs 3.52

Practical Use: Use the script+specialist-model approach to get better perceived audio quality from captions than end-to-end baselines on AudioCaps.

Evidence Ref: Table 3 (AudioCaps subjective)

WavJourney achieves state-of-the-art on Clotho in both objective and subjective metrics.

Numbers: Clotho FAD 1.75 (↓), IS 9.15 (↑); OVL 3.61 vs baselines 3.41

Practical Use: For high-quality caption-based generation on curated datasets, orchestrating models via an LLM can yield measurable fidelity and diversity gains.

Evidence Ref: Table 3 (Clotho)

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| AudioCaps OVL | 3.75 | AudioGen 3.56 | +0.19 | AudioCaps (subjective) | Table 3 left | Table 3 (AudioCaps subjective) |
| AudioCaps REL | 3.74 | AudioGen 3.52 | +0.22 | AudioCaps (subjective) | Table 3 left | Table 3 (AudioCaps subjective) |
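As a quick arithmetic check, the deltas in the results rows are the WavJourney score minus the AudioGen baseline. The snippet below recomputes them from the reported values; the dictionary layout and variable names are illustrative only.

```python
# Scores as reported in the results rows: (WavJourney, AudioGen baseline).
scores = {
    "AudioCaps OVL": (3.75, 3.56),
    "AudioCaps REL": (3.74, 3.52),
}

# Delta = system score minus baseline score, rounded to two decimals.
deltas = {
    metric: round(system - baseline, 2)
    for metric, (system, baseline) in scores.items()
}

for metric, delta in deltas.items():
    print(f"{metric}: {delta:+.2f}")
```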

What To Try In 7 Days

Prototype an LLM-written audio script for a short ad or podcast scene and compile it to call TTS and music models.

Compare human subjective ratings for scripted vs. end-to-end audio on 20 captions from your domain.

Replace GPT-4 with an open LLM and measure compilation reliability and fixpoint errors.
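For the rating comparison suggested in the list above, one reasonable analysis is a paired mean difference with a percentile bootstrap interval. This is a sketch, not the evaluation protocol used in the paper: the `paired_bootstrap_ci` helper is hypothetical and the ratings are made-up numbers for illustration.

```python
import random
import statistics


def paired_bootstrap_ci(scripted, end_to_end, n_resamples=2000, alpha=0.05, seed=0):
    """Mean per-caption rating difference and a percentile bootstrap interval."""
    if len(scripted) != len(end_to_end):
        raise ValueError("ratings must be paired per caption")
    diffs = [a - b for a, b in zip(scripted, end_to_end)]
    rng = random.Random(seed)
    # Resample the per-caption differences with replacement and record each mean.
    means = sorted(
        statistics.fmean(rng.choices(diffs, k=len(diffs)))
        for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return statistics.fmean(diffs), lo, hi


# Made-up 1-5 ratings for 20 captions, purely for illustration.
scripted = [4, 4, 3, 5, 4, 3, 4, 4, 5, 3, 4, 4, 3, 4, 5, 4, 3, 4, 4, 4]
end_to_end = [3, 4, 3, 4, 4, 3, 3, 4, 4, 3, 4, 3, 3, 4, 4, 4, 3, 3, 4, 4]

mean_diff, lo, hi = paired_bootstrap_ci(scripted, end_to_end)
print(f"mean difference {mean_diff:.2f}, 95% interval [{lo:.2f}, {hi:.2f}]")
```

Pairing by caption removes caption-to-caption variation from the comparison, which matters with only 20 items. An interval that excludes zero suggests the difference is not just rater noise.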

Agent Features

Memory
no explicit long-term memory described; uses script as short-term state
Planning
task decomposition into audio elements; conversion of script nodes into execution plan
Tool Use
calls text-to-speech; calls text-to-music; calls text-to-audio; audio mix/concatenate utilities
Frameworks
prompt templates; Python execution
Is Agentic

Yes

Architectures
LLM controller (script writer + compiler); modular audio expert stack
Collaboration
human-in-the-loop multi-round editing; voice preset assignment via prompts
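The mix and concatenate utilities listed under Tool Use can be illustrated with plain Python lists of samples. This is a toy sketch under assumptions: the function names, the looping of the background track, the fixed `background_gain`, and the hard clipping to [-1, 1] are illustrative choices, not the behavior of the paper's code.

```python
def concatenate(clips):
    """Join clips (lists of float samples) end to end."""
    out = []
    for clip in clips:
        out.extend(clip)
    return out


def mix(foreground, background, background_gain=0.3):
    """Overlay a looped, attenuated background under the foreground track."""
    if not background:
        return list(foreground)
    mixed = []
    for i, sample in enumerate(foreground):
        value = sample + background_gain * background[i % len(background)]
        mixed.append(max(-1.0, min(1.0, value)))  # clip to the valid range
    return mixed


speech = [0.5, -0.5, 0.25]
effect = [0.75, -0.75]
music = [1.0, -1.0]

# Foreground clips play back to back; music runs underneath at half gain.
track = mix(concatenate([speech, effect]), music, background_gain=0.5)
print(track)  # [1.0, -1.0, 0.75, 0.25, -0.25]
```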

Optimization Features

Infra Optimization
recommend parallel calls to reduce wall-clock time (not implemented)
Model Optimization
none (uses off-the-shelf models)
System Optimization
hand-crafted compiler to avoid LLM-based code instability
Training Optimization
training-free orchestration; no fine-tuning of LLM or audio models
Inference Optimization
none reported; authors note parallelism as future work
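The parallelism noted above as future work could look roughly like the following. It is a sketch, not part of the system: `fake_generate` is a hypothetical stand-in that sleeps instead of running a model. Threads help mainly when the generation calls are I/O-bound (for example, remote API requests); local GPU models may need a different strategy.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_generate(prompt):
    """Stand-in for a specialist-model call; sleeps instead of running a model."""
    time.sleep(0.1)
    return f"audio<{prompt}>"


prompts = ["calm piano", "door creaks open", "rain on a window"]

# Clips that do not depend on each other can be requested concurrently.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=len(prompts)) as pool:
    clips = list(pool.map(fake_generate, prompts))  # map preserves input order
elapsed = time.perf_counter() - start

print(clips)
print(f"elapsed: {elapsed:.2f}s")
```

Because `pool.map` returns results in input order, the downstream concatenate and mix steps can run unchanged once all clips are ready.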

Risks & Boundaries

Limitations

Extensibility: script+compiler design is rigid and needs engineering to add new functions.

Artificial composition: remixed audio can differ from natural multi-track recordings, especially music.

When Not To Use

When you need real-time or low-latency generation.

When music requires complex multi-track composition and professional mixing.

Failure Modes

LLM fails to follow JSON format or hallucinates, breaking compilation.

TTS voice mismatch or unnatural prosody reveals synthetic speech in Turing tests.
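The first failure mode above (malformed or off-schema JSON) can be caught before compilation, and the pass rate doubles as a simple reliability measure when swapping LLMs. A minimal sketch, assuming a hypothetical schema in which each node carries an `audio_type` from a fixed set; `is_valid_script`, `compile_success_rate`, and the sample outputs are illustrative, not taken from the paper.

```python
import json

ALLOWED_TYPES = ("speech", "music", "sound_effect")  # assumed type names


def is_valid_script(raw):
    """Return True if raw text parses as a non-empty JSON list of known nodes."""
    try:
        script = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(script, list) or not script:
        return False
    return all(
        isinstance(node, dict) and node.get("audio_type") in ALLOWED_TYPES
        for node in script
    )


def compile_success_rate(raw_outputs):
    """Fraction of LLM outputs that pass validation."""
    return sum(is_valid_script(r) for r in raw_outputs) / len(raw_outputs)


# Hypothetical LLM outputs: valid, prose wrapped around JSON, unknown type, valid.
samples = [
    '[{"audio_type": "speech", "text": "Hello"}]',
    'Sure! Here is the script: [{"audio_type": "music"}]',
    '[{"audio_type": "video", "desc": "a clip"}]',
    '[{"audio_type": "music", "desc": "soft jazz"}]',
]
print(compile_success_rate(samples))  # 0.5
```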

Core Entities

Models

GPT-4, Llama2-70B-Chat, AudioGen, AudioLDM, MusicGen, Bark (zero-shot TTS), VoiceFixer

Metrics

FAD, KL, Inception Score (IS), OVL (Overall Impression), REL (Audio-Text Relation), Turing perceived-as-real rate

Datasets

AudioCaps, Clotho, Freesound

Benchmarks

AudioCaps, Clotho, Multi-genre storytelling benchmark (this paper)