Use an LLM to write a structured audio script, compile it to code, and run specialist audio models to generate narrated, mixed audio scenes.

July 26, 2023 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.7

Cost Impact Score

0.5

Citation Count

6

Authors

Xubo Liu, Zhongkai Zhu, Haohe Liu, Yi Yuan, Meng Cui, Qiushi Huang, Jinhua Liang, Yin Cao, Qiuqiang Kong, Mark D. Plumbley, Wenwu Wang

Links

Abstract / PDF

Why It Matters For Business

WavJourney turns natural language briefs into finished mixed audio by chaining existing specialist models, reducing the need to build large unified audio models and enabling faster prototyping of audio content.

Summary TLDR

WavJourney uses an LLM (GPT-4) to turn a text instruction into a structured "audio script" (JSON list of speech/music/effects), compiles that script into a short Python program, and then calls specialist audio models (TTS, text-to-music, text-to-audio) to generate and mix the pieces. No extra training is needed. On public datasets (AudioCaps, Clotho) and a new multi-genre storytelling benchmark the system improves subjective quality vs. modern baselines and enables iterative human-in-the-loop edits. Major tradeoffs: higher run-time and dependence on external models like GPT-4 and Bark.
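To make the "audio script" idea concrete, the following is a minimal sketch of what such a JSON list could look like and how it might be checked before compilation. The field names (`audio_type`, `layout`, `text`, `desc`, `character`, `vol`, `len`) and the helper `parse_audio_script` are assumptions chosen for illustration, not the paper's confirmed schema.

```python
import json

# Hypothetical schema: these names are illustrative assumptions.
ALLOWED_TYPES = {"speech", "music", "sound_effect"}
ALLOWED_LAYOUTS = {"foreground", "background"}

EXAMPLE_SCRIPT = """
[
  {"audio_type": "music", "layout": "background", "desc": "calm piano", "vol": -30},
  {"audio_type": "speech", "layout": "foreground", "character": "Narrator",
   "text": "It was a quiet morning."},
  {"audio_type": "sound_effect", "layout": "foreground", "desc": "door creaks open", "len": 2}
]
"""


def parse_audio_script(raw: str) -> list[dict]:
    """Parse LLM output and reject anything a compiler could not handle."""
    script = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) if malformed
    if not isinstance(script, list):
        raise ValueError("audio script must be a JSON list")
    for i, node in enumerate(script):
        if not isinstance(node, dict):
            raise ValueError(f"node {i}: expected a JSON object")
        if node.get("audio_type") not in ALLOWED_TYPES:
            raise ValueError(f"node {i}: unknown audio_type {node.get('audio_type')!r}")
        if node.get("layout") not in ALLOWED_LAYOUTS:
            raise ValueError(f"node {i}: unknown layout {node.get('layout')!r}")
        # Speech needs the words to say; music and effects need a description.
        required = "text" if node["audio_type"] == "speech" else "desc"
        if not node.get(required):
            raise ValueError(f"node {i}: missing {required!r}")
    return script


script = parse_audio_script(EXAMPLE_SCRIPT)
print([node["audio_type"] for node in script])
```

Validating the script up front keeps format errors from surfacing later, in the middle of slow audio generation.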

Problem Statement

Existing audio generators target narrow tasks (speech, music, or effects) and struggle to produce coherent, multi-element audio scenes from a single text instruction. The problem is to combine specialist models into a controllable, interpretable pipeline that produces composed audio stories from text without retraining.

Main Contribution

A pipeline that prompts an LLM to produce a structured audio script, compiles it to program code, and executes specialist audio models to synthesize composed audio.

Empirical results showing improved subjective quality over AudioGen and AudioLDM on AudioCaps and a new multi-genre storytelling benchmark.

A multi-genre storytelling benchmark and subjective metrics (engagement, creativity, relevance, emotion, pacing) plus demos of multi-turn human-machine co-creation.

Key Findings

WavJourney beats AudioGen and AudioLDM in human subjective scores on AudioCaps.

Numbers: OVL 3.75 vs AudioGen 3.56; REL 3.74 vs 3.52

WavJourney achieves state-of-the-art results on Clotho in both objective and subjective metrics.

Numbers: Clotho FAD 1.75 (↓), IS 9.15 (↑); OVL 3.61 vs baselines 3.41

On AudioCaps, WavJourney marginally exceeds the ground-truth recordings in overall impression.

Numbers: OVL 3.75 vs Ground Truth 3.73

In human Turing tests, generated audio is often judged real, but less often than real recordings are.

Numbers: Perceived-as-real 53.8% (AudioCaps) vs GT 65.8%

Hand-crafted compiler is far faster and more reliable than LLM-based code generation.

Numbers: Hand-crafted EER 0% and 0.03s vs GPT-4-based EER 56% and 63.16s
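One way to see why a hand-crafted compiler can be both fast and reliable: it is a deterministic template expansion, so there is no model call and no free-form code to go wrong. The sketch below illustrates that idea only. The backend names (`tts`, `text_to_music`, `text_to_audio`, `concat`, `mix`) and the script fields are hypothetical, not the paper's actual implementation.

```python
import ast

# Hypothetical mapping from script node type to a backend function name.
GENERATORS = {
    "speech": "tts",
    "music": "text_to_music",
    "sound_effect": "text_to_audio",
}


def compile_script(script: list[dict]) -> str:
    """Deterministically translate script nodes into Python source, one call per node."""
    lines, foreground, background = [], [], []
    for i, node in enumerate(script):
        func = GENERATORS[node["audio_type"]]  # KeyError on an unknown type
        prompt = node.get("text") or node.get("desc", "")
        var = f"clip_{i}"
        # repr() quotes the prompt safely, so odd characters cannot break the program.
        lines.append(f"{var} = {func}({prompt!r})")
        (foreground if node["layout"] == "foreground" else background).append(var)
    lines.append(f"foreground = concat([{', '.join(foreground)}])")
    lines.append(f"result = mix(foreground, [{', '.join(background)}])")
    source = "\n".join(lines) + "\n"
    ast.parse(source)  # sanity check: the emitted program is syntactically valid
    return source


demo = [
    {"audio_type": "music", "layout": "background", "desc": "calm piano"},
    {"audio_type": "speech", "layout": "foreground", "text": "It was a quiet morning."},
]
print(compile_script(demo))
```

Because the same script always yields the same program, compilation failures reduce to malformed scripts, which can be caught before any audio model is invoked.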

WavJourney adds run-time cost versus single-model baselines.

Numbers: Script write 8.1s + audio gen 45.3s vs AudioGen audio gen 23.0s
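The timing figures above can be combined into a single overhead ratio. This is plain arithmetic on the reported numbers; the variable names are illustrative, and the AudioGen figure covers audio generation only.

```python
# Reported timings in seconds, taken from the numbers above.
script_write_s = 8.1
wavjourney_audio_s = 45.3
audiogen_audio_s = 23.0

total_s = script_write_s + wavjourney_audio_s
ratio = total_s / audiogen_audio_s

print(f"{total_s:.1f}s total, {ratio:.2f}x AudioGen")
# prints "53.4s total, 2.32x AudioGen"
```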

Results

AudioCaps OVL

Value: 3.75

Baseline: AudioGen 3.56

AudioCaps REL

Value: 3.74

Baseline: AudioGen 3.52

Clotho FAD

Value: 1.75

Baseline: AudioGen 2.55

Clotho IS

Value: 9.15

Baseline: AudioGen 7.41

Turing perceived-as-real (AudioCaps)

Value: 53.8%

Baseline: Ground Truth 65.8%

Who Should Care

What To Try In 7 Days

Prototype an LLM-written audio script for a short ad or podcast scene and compile it to call TTS and music models.

Compare human subjective ratings for scripted vs. end-to-end audio on 20 captions from your domain.

Replace GPT-4 with an open LLM and measure how often its scripts fail to parse or compile.
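For the LLM-swap experiment, a simple reliability metric is the share of model outputs that do not parse as a usable script. The sketch below assumes the script is a JSON list of objects that each carry an `audio_type` key. That schema and the helper names are illustrative assumptions.

```python
import json


def is_valid_script(raw: str) -> bool:
    """Return True if raw parses as a JSON list of objects with an audio_type key."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(data, list) and all(
        isinstance(node, dict) and "audio_type" in node for node in data
    )


def error_rate(outputs: list[str]) -> float:
    """Fraction of LLM outputs that fail validation."""
    if not outputs:
        raise ValueError("need at least one output to compute a rate")
    failures = sum(not is_valid_script(raw) for raw in outputs)
    return failures / len(outputs)


# Made-up sample outputs, for illustration only.
samples = [
    '[{"audio_type": "speech", "text": "Hello"}]',  # valid
    "Sure! Here is your script: [...]",             # chatty preamble, not JSON
    '{"audio_type": "music"}',                      # an object, not a list
    '[{"audio_type": "music", "desc": "jazz"}]',    # valid
]
print(f"error rate: {error_rate(samples):.0%}")
# prints "error rate: 50%"
```

Running the same prompts through each candidate LLM and comparing this rate gives a like-for-like reliability number before any audio is generated.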

Agent Features

Memory

  • no explicit long-term memory described
  • uses script as short-term state

Planning

  • task decomposition into audio elements
  • conversion of script nodes into execution plan

Tool Use

  • calls text-to-speech
  • calls text-to-music
  • calls text-to-audio
  • audio mix/concatenate utilities
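The mix and concatenate utilities listed above can be sketched on raw sample lists. This is a deliberately naive illustration, assuming mono float samples in [-1.0, 1.0] at a shared sample rate. The function names and the `bg_gain` parameter are hypothetical, and a real pipeline would more likely use an audio library.

```python
def concat(clips: list[list[float]]) -> list[float]:
    """Play clips back to back by joining their samples."""
    return [sample for clip in clips for sample in clip]


def mix(foreground: list[float], background: list[float], bg_gain: float = 0.3) -> list[float]:
    """Overlay an attenuated background under the foreground.

    The shorter track is padded with silence and the sum is clipped to [-1.0, 1.0].
    """
    length = max(len(foreground), len(background))
    mixed = []
    for i in range(length):
        fg = foreground[i] if i < len(foreground) else 0.0
        bg = background[i] if i < len(background) else 0.0
        mixed.append(max(-1.0, min(1.0, fg + bg_gain * bg)))
    return mixed


speech = concat([[0.5, 0.5], [0.25]])
music = [1.0, 1.0, 1.0, 1.0]
print(mix(speech, music, bg_gain=0.5))
# prints [1.0, 1.0, 0.75, 0.5]
```

Sample-wise summing like this is also where the compound mixing artifacts mentioned under failure modes can creep in, since it ignores phase and timing alignment.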

Frameworks

  • prompt templates
  • Python execution

Is Agentic

true

Architectures

  • LLM controller (script writer + compiler)
  • modular audio expert stack

Collaboration

  • human-in-the-loop multi-round editing
  • voice preset assignment via prompts

Optimization Features

Infra Optimization

  • recommend parallel calls to reduce wall-clock time (not implemented)
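Since the parallel calls are a recommendation rather than something implemented, here is one possible sketch using Python's standard thread pool. `fake_generate` is a hypothetical stand-in for a slow model or API call. Threads help mainly when calls are I/O-bound (for example, remote APIs); local GPU models would need a different strategy.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_generate(prompt: str) -> str:
    """Stand-in for a slow model call (hypothetical; it sleeps instead of synthesizing)."""
    time.sleep(0.05)
    return f"audio<{prompt}>"


def generate_all(prompts: list[str], max_workers: int = 4) -> list[str]:
    """Run independent generation calls concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order of the inputs,
        # so clip i still corresponds to script node i.
        return list(pool.map(fake_generate, prompts))


prompts = ["calm piano", "door creaks open", "rain on a window"]
start = time.perf_counter()
clips = generate_all(prompts)
elapsed = time.perf_counter() - start
print(clips)
print(f"elapsed: {elapsed:.2f}s")
```

Only nodes without dependencies on each other can run this way. Concatenation and mixing still have to wait for all clips.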

Model Optimization

  • none (uses off-the-shelf models)

System Optimization

  • hand-crafted compiler to avoid LLM-based code instability

Training Optimization

  • training-free orchestration; no fine-tuning of LLM or audio models

Inference Optimization

  • none reported; authors note parallelism as future work

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Extensibility: script+compiler design is rigid and needs engineering to add new functions.
  • Artificial composition: remixed audio can differ from natural multi-track recordings, especially music.
  • Efficiency: script writing + multiple model calls roughly double inference time vs single-model baselines.
  • Proprietary dependence: GPT-4 and Bark used, limiting reproducible open-source deployment.

When Not To Use

  • When you need real-time or low-latency generation.
  • When music requires complex multi-track composition and professional mixing.
  • When you must avoid proprietary APIs or require fully open-source stacks.

Failure Modes

  • LLM fails to follow JSON format or hallucinates, breaking compilation.
  • TTS voice mismatch or unnatural prosody revealing synthetic speech in Turing tests.
  • Compound audio artifacts from naive mixing (phase, timing) reduce realism.
  • Longer pipelines increase chances of network/API timeouts and execution errors.
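A common mitigation for the first and last failure modes is to validate the LLM output and retry with backoff. The sketch below is one possible approach, not something the paper describes. `call_llm` is a hypothetical callable supplied by the caller, and the script is assumed to be a JSON list.

```python
import json
import time


def generate_script_with_retries(call_llm, prompt: str, attempts: int = 3,
                                 base_delay: float = 1.0) -> list:
    """Call the LLM until it returns a JSON list, backing off between attempts."""
    last_error = None
    for attempt in range(attempts):
        try:
            script = json.loads(call_llm(prompt))  # JSONDecodeError is a ValueError
            if not isinstance(script, list):
                raise ValueError("expected a JSON list")
            return script
        except (ValueError, TimeoutError) as err:
            last_error = err
            time.sleep(base_delay * (2 ** attempt))  # exponential backoff
    raise RuntimeError(f"no valid script after {attempts} attempts") from last_error


# Demo with a stub that fails once, then returns a valid script.
responses = iter(["not json", '[{"audio_type": "speech", "text": "Hi"}]'])


def flaky_llm(prompt: str) -> str:
    return next(responses)


script = generate_script_with_retries(flaky_llm, "a short greeting", base_delay=0.0)
print(script)
```

Capping the number of attempts matters here, because each retry adds to a run time that is already higher than single-model baselines.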

Core Entities

Models

  • GPT-4
  • Llama2-70B-Chat
  • AudioGen
  • AudioLDM
  • MusicGen
  • Bark (zero-shot TTS)
  • VoiceFixer

Metrics

  • FAD
  • KL
  • Inception Score (IS)
  • OVL (Overall Impression)
  • REL (Audio-Text Relation)
  • Turing perceived-as-real rate

Datasets

  • AudioCaps
  • Clotho
  • Freesound

Benchmarks

  • AudioCaps
  • Clotho
  • Multi-genre storytelling benchmark (this paper)