Overview
Production Readiness
0.6
Novelty Score
0.7
Cost Impact Score
0.5
Citation Count
6
Why It Matters For Business
WavJourney turns natural language briefs into finished mixed audio by chaining existing specialist models, reducing the need to build large unified audio models and enabling faster prototyping of audio content.
Summary TLDR
WavJourney uses an LLM (GPT-4) to turn a text instruction into a structured "audio script" (JSON list of speech/music/effects), compiles that script into a short Python program, and then calls specialist audio models (TTS, text-to-music, text-to-audio) to generate and mix the pieces. No extra training is needed. On public datasets (AudioCaps, Clotho) and a new multi-genre storytelling benchmark the system improves subjective quality vs. modern baselines and enables iterative human-in-the-loop edits. Major tradeoffs: higher run-time and dependence on external models like GPT-4 and Bark.
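The script-then-compile flow described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the WavJourney implementation: the JSON field names (`audio_type`, `text`, `length`, `voice`), the generator names (`tts`, `ttm`, `tta`, `concat`), and `compile_script` itself are all hypothetical.

```python
import json

# Hypothetical audio script. The field names are illustrative assumptions,
# not the schema used by WavJourney.
SCRIPT_JSON = """
[
  {"audio_type": "music",  "text": "calm piano intro", "length": 5},
  {"audio_type": "speech", "text": "Welcome to the show.", "voice": "narrator"},
  {"audio_type": "effect", "text": "door creaks open", "length": 2}
]
"""

# Maps each audio type to the (hypothetical) generator function it compiles to.
GENERATORS = {"speech": "tts", "music": "ttm", "effect": "tta"}


def compile_script(script):
    """Deterministically turn script nodes into lines of Python source."""
    lines = []
    for i, node in enumerate(script):
        func = GENERATORS[node["audio_type"]]  # raises KeyError on unknown types
        args = [repr(node["text"])]
        if "length" in node:
            args.append(f"length={node['length']}")
        if "voice" in node:
            args.append(f"voice={node['voice']!r}")
        lines.append(f"clip_{i} = {func}({', '.join(args)})")
    clips = ", ".join(f"clip_{i}" for i in range(len(script)))
    lines.append(f"output = concat([{clips}])")
    return "\n".join(lines)


program = compile_script(json.loads(SCRIPT_JSON))
print(program)
```

Because the compiler here is a plain function rather than a second LLM call, the same script always yields the same program, which is the property that makes a hand-written compiler attractive in this kind of pipeline.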
Problem Statement
Existing audio generators target narrow tasks (speech, music, or effects) and struggle to produce coherent, multi-element audio scenes from a single text instruction. The problem is to combine specialist models into a controllable, interpretable pipeline that produces composed audio stories from text without retraining.
Main Contribution
A pipeline that prompts an LLM to produce a structured audio script, compiles it to program code, and executes specialist audio models to synthesize composed audio.
Empirical results showing improved subjective quality over AudioGen and AudioLDM on AudioCaps and a new multi-genre storytelling benchmark.
A multi-genre storytelling benchmark and subjective metrics (engagement, creativity, relevance, emotion, pacing) plus demos of multi-turn human-machine co-creation.
Key Findings
WavJourney beats AudioGen and AudioLDM in human subjective scores on AudioCaps.
WavJourney achieves state-of-the-art on Clotho in both objective and subjective metrics.
On AudioCaps, WavJourney marginally exceeded the ground-truth audio in overall impression.
Human Turing tests show generated audio is often judged to be real, though less often than genuine recordings.
A hand-crafted compiler is far faster and more reliable than LLM-based code generation.
WavJourney adds run-time cost versus single-model baselines.
Results
AudioCaps OVL
AudioCaps REL
Clotho FAD
Clotho IS
Turing perceived-as-real (AudioCaps)
Who Should Care
What To Try In 7 Days
Prototype an LLM-written audio script for a short ad or podcast scene and compile it to call TTS and music models.
Compare human subjective ratings for scripted vs. end-to-end audio on 20 captions from your domain.
Replace GPT-4 with an open LLM and measure how reliably its scripts parse, compile, and run.
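For the third experiment, one simple reliability measure is the fraction of raw LLM outputs that parse into a usable script. The sketch below assumes a hypothetical schema in which a script is a non-empty JSON list of objects carrying at least `audio_type` and `text`; the sample strings are made-up placeholders, not outputs from any real model.

```python
import json

# Hypothetical minimum schema; adjust to whatever your script format requires.
REQUIRED_KEYS = {"audio_type", "text"}


def is_valid_script(raw):
    """Return True if raw text is a non-empty JSON list of well-formed nodes."""
    try:
        script = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(script, list) or not script:
        return False
    return all(
        isinstance(node, dict) and REQUIRED_KEYS.issubset(node)
        for node in script
    )


def parse_success_rate(outputs):
    """Fraction of raw outputs that pass is_valid_script."""
    if not outputs:
        raise ValueError("no outputs to score")
    return sum(is_valid_script(o) for o in outputs) / len(outputs)


# Placeholder outputs illustrating common failure shapes.
samples = [
    '[{"audio_type": "speech", "text": "Hello"}]',  # valid
    'Sure! Here is your script: [...]',              # chatty preamble, not JSON
    '[{"audio_type": "music"}]',                     # missing "text"
    '{"audio_type": "effect", "text": "rain"}',      # object instead of list
]
rate = parse_success_rate(samples)
print(f"parse success rate: {rate:.2f}")
```

Running the same prompts through each candidate LLM and comparing the resulting rates gives a first, coarse reliability number before any audio is generated.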
Agent Features
Memory
- no explicit long-term memory described
- uses script as short-term state
Planning
- task decomposition into audio elements
- conversion of script nodes into execution plan
Tool Use
- calls text-to-speech
- calls text-to-music
- calls text-to-audio
- audio mix/concatenate utilities
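The mix and concatenate utilities can be illustrated with a dependency-free sketch. It assumes clips are plain lists of float samples in [-1, 1] at a shared sample rate; the function names and the `background_gain` parameter are hypothetical, and a real pipeline would more likely operate on arrays loaded from audio files.

```python
def concatenate(clips):
    """Join clips end to end (foreground elements played in sequence)."""
    out = []
    for clip in clips:
        out.extend(clip)
    return out


def overlay(foreground, background, background_gain=0.5):
    """Sum two clips sample by sample, padding the shorter one with silence."""
    n = max(len(foreground), len(background))
    mixed = []
    for i in range(n):
        fg = foreground[i] if i < len(foreground) else 0.0
        bg = background[i] if i < len(background) else 0.0
        sample = fg + background_gain * bg
        mixed.append(max(-1.0, min(1.0, sample)))  # hard clip to [-1, 1]
    return mixed


speech = concatenate([[0.5, 0.5], [0.25]])
music = [0.5, 0.5, 0.5, 0.5]
mix = overlay(speech, music)
print(mix)
```

Plain summation with hard clipping is the naive approach; it ignores loudness matching and timing alignment, so treat it as a baseline rather than a finished mixer.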
Frameworks
- prompt templates
- Python execution
Is Agentic
true
Architectures
- LLM controller (script writer + compiler)
- modular audio expert stack
Collaboration
- human-in-the-loop multi-round editing
- voice preset assignment via prompts
Optimization Features
Infra Optimization
- recommend parallel calls to reduce wall-clock time (not implemented)
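Since parallel calls are only recommended and not implemented, the following is a speculative sketch of what that could look like using Python's standard thread pool. `fake_generate` is a hypothetical stand-in for a slow model call, and the node fields are illustrative assumptions.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_generate(node):
    """Stand-in for a slow model call (hypothetical); returns a label, not audio."""
    time.sleep(0.05)  # simulate network or model latency
    return f"{node['audio_type']}:{node['text']}"


def generate_all(nodes, max_workers=4):
    """Run independent generation calls concurrently, keeping input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fake_generate, nodes))  # map preserves input order


nodes = [
    {"audio_type": "speech", "text": "Welcome back."},
    {"audio_type": "music", "text": "soft jazz"},
    {"audio_type": "effect", "text": "glass clinking"},
]
clips = generate_all(nodes)
print(clips)
```

Threads tend to help when the calls are I/O-bound, such as requests to remote model APIs. Models running locally on a single GPU may contend for the device instead, so the gain there is less certain.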
Model Optimization
- none (uses off-the-shelf models)
System Optimization
- hand-crafted compiler to avoid LLM-based code instability
Training Optimization
- training-free orchestration; no fine-tuning of LLM or audio models
Inference Optimization
- none reported; authors note parallelism as future work
Reproducibility
Data Urls
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Extensibility: script+compiler design is rigid and needs engineering to add new functions.
- Artificial composition: remixed audio can differ from natural multi-track recordings, especially music.
- Efficiency: script writing + multiple model calls roughly double inference time vs single-model baselines.
- Proprietary dependence: GPT-4 and Bark used, limiting reproducible open-source deployment.
When Not To Use
- When you need real-time or low-latency generation.
- When music requires complex multi-track composition and professional mixing.
- When you must avoid proprietary APIs or require fully open-source stacks.
Failure Modes
- LLM fails to follow JSON format or hallucinates, breaking compilation.
- TTS voice mismatch or unnatural prosody reveals synthetic speech in Turing tests.
- Compound audio artifacts from naive mixing (phase, timing) reduce realism.
- Longer pipelines increase chances of network/API timeouts and execution errors.
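One common mitigation for the timeout and execution-error risk is to wrap each model call in a retry with exponential backoff. The sketch below is a generic pattern, not something the paper describes; `call_with_retry` and `FlakyService` are hypothetical names, and the tiny default delay is chosen only so the demo runs quickly.

```python
import time


def call_with_retry(func, *args, retries=3, base_delay=0.01, **kwargs):
    """Call func, retrying on timeout or connection errors with backoff."""
    for attempt in range(retries):
        try:
            return func(*args, **kwargs)
        except (TimeoutError, ConnectionError):
            if attempt == retries - 1:
                raise  # out of attempts: surface the last error
            time.sleep(base_delay * 2 ** attempt)  # 1x, 2x, 4x, ...


class FlakyService:
    """Simulated model endpoint that times out a fixed number of times."""

    def __init__(self, failures):
        self.failures = failures
        self.calls = 0

    def generate(self, text):
        self.calls += 1
        if self.calls <= self.failures:
            raise TimeoutError("simulated timeout")
        return f"audio for {text!r}"


service = FlakyService(failures=2)
result = call_with_retry(service.generate, "rain on a window")
print(result, "after", service.calls, "calls")
```

Retries address transient failures only. A script that fails to parse will fail again on every attempt, so format errors are better handled by validating or regenerating the script.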
Core Entities
Models
- GPT-4
- Llama2-70B-Chat
- AudioGen
- AudioLDM
- MusicGen
- Bark (zero-shot TTS)
- VoiceFixer
Metrics
- FAD
- KL
- Inception Score (IS)
- OVL (Overall Impression)
- REL (Audio-Text Relation)
- Turing perceived-as-real rate
Datasets
- AudioCaps
- Clotho
- Freesound
Benchmarks
- AudioCaps
- Clotho
- Multi-genre storytelling benchmark (this paper)

