Overview
Production Readiness
1
Novelty Score
0.8
Cost Impact Score
0.6
Citation Count
0
Why It Matters For Business
Voice interfaces can be hijacked to make models produce harmful or policy-violating text without changing what humans hear, so companies must add audio-level safety, monitoring, and access controls.
Summary TLDR
This paper presents WHISPERINJECT, a two-stage white-box attack. Stage 1 uses RL-PGD (a reward-guided variant of projected gradient descent) to make an audio-language model produce a harmful text response; Stage 2 hides that model-native harmful response as a small perturbation inside a benign audio clip (e.g., “How's the weather today?”). Evaluated on five recent multimodal models and two jailbreak benchmarks, the method achieves high end-to-end success (for example, an average of ≈70.5% on AdvBench under the StrongREJECT judge) while keeping the audio intelligible to humans (STOI ≥0.59 and 100% human content recognition in tests). The attack assumes gradient access, and the paper does not test over-the-air playback or compression, so real-world transfer is unproven.
Problem Statement
Audio inputs are becoming a common interface for LLMs, but existing safety checks focus on text and miss audio-native attacks. The paper asks: can an adversary embed a covert, human-inaudible payload into benign audio so an aligned audio-language model will generate harmful content?
Main Contribution
A new two-stage audio jailbreak: Stage 1 uses RL-PGD to discover harmful responses the model will actually produce; Stage 2 injects those discovered responses into benign audio via gradient-optimized perturbations.
Empirical threat demonstration on five modern multimodal models showing high attack success while preserving human-perceived audio content.
Evaluation across multiple harm judges (StrongREJECT, LlamaGuard, JailbreakEval), objective and human intelligibility tests, and perturbation-budget analysis.
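The Stage 2 idea of a gradient-optimized perturbation under a small budget can be illustrated with a toy sketch. This is not the paper's RL-PGD and does not touch any real audio-language model: it assumes a stand-in differentiable model p(target | x) = sigmoid(w · x), an L-infinity budget `eps`, and hypothetical names (`pgd_linf`, `carrier`, `w`) chosen only for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pgd_linf(carrier, w, eps=0.01, step=0.002, iters=50):
    """Targeted L-infinity PGD against a toy model p(target|x) = sigmoid(w . x).

    carrier: benign waveform in [-1, 1] (hypothetical stand-in for real audio)
    w:       weights of the toy model (assumption, not a real audio-language model)
    eps:     maximum absolute change allowed per sample (perturbation budget)
    """
    delta = np.zeros_like(carrier)
    for _ in range(iters):
        p = sigmoid(w @ (carrier + delta))
        # Gradient of the loss -log p(target | carrier + delta) w.r.t. delta.
        grad = -(1.0 - p) * w
        # Signed gradient step that lowers the loss.
        delta = delta - step * np.sign(grad)
        # Project back into the L-infinity ball of radius eps.
        delta = np.clip(delta, -eps, eps)
        # Keep the perturbed waveform inside the valid amplitude range.
        delta = np.clip(carrier + delta, -1.0, 1.0) - carrier
    return delta

# Usage with synthetic data (no real speech involved).
rng = np.random.default_rng(0)
carrier = 0.1 * np.sin(np.linspace(0.0, 40.0 * np.pi, 1600))
w = rng.normal(size=carrier.shape)
delta = pgd_linf(carrier, w)
print("target logit before:", float(w @ carrier))
print("target logit after: ", float(w @ (carrier + delta)))
print("max |delta|:        ", float(np.max(np.abs(delta))))
```

The projection step is what keeps the change small: every sample of the carrier moves by at most `eps`, which is the toy analogue of a perturbation budget.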
Key Findings
Stage 1 (RL-PGD) reliably finds model-native harmful payloads.
End-to-end adversarial audio triggers harmful outputs at substantial rates.
Attack success persists while preserving human intelligibility.
Attack relies on white-box access and clean signal assumptions.
Per-case compute is modest but non-trivial.
Results
Stage 1 native payload discovery (RL-PGD)
End-to-end attack success rate (StrongREJECT)
End-to-end attack success rate (LlamaGuard)
Perturbation budget impact (Qwen2.5-Omni-3B)
Human content recognition (subjective)
Who Should Care
What To Try In 7 Days
Audit audio input paths for any place gradient or local model access exists.
Run simple adversarial probes (benign carrier + small perturbation) on in-house models to measure the attack success rate (ASR).
Add an audio-level detection step (e.g., spectral anomaly or ML-based detector) before text filtering.
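As a starting point for the audio-level detection step, the sketch below computes one simple spectral statistic: the share of signal energy above a cutoff frequency. It is only an assumption that this statistic is informative; the source does not describe a detector, adversarial perturbations need not concentrate in high frequencies, and the function names, the 4 kHz cutoff, and the 0.005 threshold are all hypothetical values that would need tuning on real traffic.

```python
import numpy as np

def high_band_energy_ratio(signal, sample_rate, cutoff_hz=4000.0):
    """Fraction of one-sided spectral energy at or above cutoff_hz."""
    power = np.abs(np.fft.rfft(signal)) ** 2
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    total = power.sum()
    if total == 0.0:
        return 0.0  # silent input: nothing to measure
    return float(power[freqs >= cutoff_hz].sum() / total)

def flag_suspicious(signal, sample_rate, threshold=0.005, cutoff_hz=4000.0):
    """Flag a clip whose high-band energy share exceeds a hypothetical threshold."""
    return bool(high_band_energy_ratio(signal, sample_rate, cutoff_hz) > threshold)

# Usage with synthetic signals: a pure tone versus the same tone plus
# broadband noise standing in for an unexpected perturbation.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
clean = 0.5 * np.sin(2.0 * np.pi * 440.0 * t)
noisy = clean + np.random.default_rng(0).normal(0.0, 0.05, size=t.shape)
print("clean flagged:", flag_suspicious(clean, sample_rate))
print("noisy flagged:", flag_suspicious(noisy, sample_rate))
```

A single hand-picked statistic like this is easy to evade, so it is better treated as a monitoring signal to log alongside text filtering than as a standalone defense.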
Reproducibility
Code Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- White-box threat model requires gradient access; black-box transfer was not tested.
- No over-the-air, compression, or background-noise experiments were reported.
- Evaluation uses automated judges and limited human tests; judge bias and thresholds influence ASR.
When Not To Use
- Estimating risk for fully remote black-box services without gradient access.
- Assuming identical success over real-world delivery channels (streaming, compression, phone).
- Treating the results as a definitive metric for audio carriers and languages beyond those tested.
Failure Modes
- Over-the-air noise, compression, or playback distortions could break the embedded payload.
- Judge model or safety classifier changes will alter RL-PGD reward signals and outcomes.
- Shorter carrier phrases reduce available perturbation bandwidth and may lower ASR.
Core Entities
Models
- Qwen2.5-Omni-3B
- Qwen2.5-Omni-7B
- Phi-4-Multimodal
- Gemma-3n-2B
- Gemma-3n-4B
Metrics
- Attack Success Rate (ASR)
- STOI
- Cosine similarity (all-MiniLM-L6-v2)
- Judge harmfulness score (GPT-4o-mini)
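Two of the metrics above reduce to short formulas, sketched here under stated assumptions. The source does not say how judge scores are turned into a success decision, so the `threshold` parameter is hypothetical; likewise, plain lists of floats stand in for the sentence embeddings that a model such as all-MiniLM-L6-v2 would produce.

```python
import math

def attack_success_rate(judge_scores, threshold=0.5):
    """Share of attempts whose judge score meets a (hypothetical) threshold."""
    if not judge_scores:
        raise ValueError("judge_scores must not be empty")
    hits = sum(1 for score in judge_scores if score >= threshold)
    return hits / len(judge_scores)

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)

# Usage with made-up numbers, not results from the paper.
scores = [0.9, 0.2, 0.7, 0.4]
print("ASR:", attack_success_rate(scores))
print("cosine:", cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
```

Because the reported rate depends directly on the threshold and on which judge produced the scores, comparisons across judges are only meaningful when both are held fixed.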
Datasets
- AdvBench
- JailbreakBench
Benchmarks
- StrongREJECT
- LlamaGuard
- JailbreakEval

