WhisperInject: covertly embed model-native harmful text into benign audio to jailbreak multimodal LLMs

August 5, 2025 · 7 min

Overview

Production Readiness

1

Novelty Score

0.8

Cost Impact Score

0.6

Citation Count

0

Authors

Hiskias Dingeto, Taeyoun Kwon, Dasol Choi, Bodam Kim, DongGeon Lee, Haon Park, JaeHoon Lee, Jongho Shin

Why It Matters For Business

Voice interfaces can be hijacked to make models produce harmful or policy-violating text without changing what humans hear, so companies must add audio-level safety, monitoring, and access controls.

Summary TLDR

This paper presents WHISPERINJECT, a two-stage white-box attack. Stage 1 uses RL-PGD, a reward-guided variant of projected gradient descent (PGD), to make an audio-language model produce a harmful text response. Stage 2 hides that model-native harmful response as a small perturbation inside a benign audio clip (e.g., “How's the weather today?”). Evaluated on five recent multimodal models and two jailbreak benchmarks, the method achieves high end-to-end success (for example, an average StrongREJECT attack success rate of ≈70.5% on AdvBench) while keeping the audio intelligible to humans (STOI ≥ 0.591 and 100% human content recognition in tests). The attack assumes gradient access, and no over-the-air or compression experiments are reported, so real-world transfer is unproven.

Problem Statement

Audio inputs are becoming a common interface for LLMs, but existing safety checks focus on text and miss audio-native attacks. The paper asks: can an adversary embed a covert, human-inaudible payload into benign audio so an aligned audio-language model will generate harmful content?

Main Contribution

A new two-stage audio jailbreak: Stage 1 uses RL-PGD to discover harmful responses the model will actually produce; Stage 2 injects those discovered responses into benign audio via gradient-optimized perturbations.

Empirical threat demonstration on five modern multimodal models showing high attack success while preserving human-perceived audio content.

Evaluation across multiple harm judges (StrongREJECT, LlamaGuard, JailbreakEval), objective and human intelligibility tests, and perturbation-budget analysis.
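The gradient-optimized perturbation in Stage 2 follows the general projected gradient descent pattern: take a gradient step, then clip the perturbation back inside a fixed budget ε. The following is a minimal sketch of that pattern on a toy quadratic loss, not the paper's implementation. The loss, target, step size, and all function names are illustrative assumptions; the method described above would instead differentiate through an audio-language model.

```python
def sign(v):
    """Return -1, 0, or 1 depending on the sign of v."""
    return (v > 0) - (v < 0)


def pgd_linf(x, grad_fn, eps, step, iters):
    """Minimize a loss over a perturbation delta with max |delta_i| <= eps."""
    delta = [0.0] * len(x)
    for _ in range(iters):
        grad = grad_fn([xi + di for xi, di in zip(x, delta)])
        # Signed-gradient descent step on the loss.
        delta = [di - step * sign(gi) for di, gi in zip(delta, grad)]
        # Projection: clip every coordinate back into [-eps, eps].
        delta = [max(-eps, min(eps, di)) for di in delta]
    return delta


# Toy stand-in for a model loss: squared distance to a hypothetical target.
TARGET = [1.0, -0.05, 0.3]


def toy_grad(z):
    """Gradient of sum((z_i - TARGET_i)^2) with respect to z."""
    return [2.0 * (zi - ti) for zi, ti in zip(z, TARGET)]


carrier = [0.0, 0.0, 0.0]
delta = pgd_linf(carrier, toy_grad, eps=0.2, step=0.05, iters=20)
print(delta)
```

Coordinates whose target lies outside the budget stop at ±ε, which is the mechanism that keeps the perturbation small relative to the carrier.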

Key Findings

Stage 1 (RL-PGD) reliably finds model-native harmful payloads.

Numbers: AdvBench 86.7% avg, JailbreakBench 67.0% avg

End-to-end adversarial audio triggers harmful outputs at substantial rates.

Numbers: AdvBench StrongREJECT avg 70.5%; JailbreakBench StrongREJECT avg 46.4%

Attack success persists while preserving human intelligibility.

Numbers: STOI ≥ 0.591; 100% human content recognition (tested)

Attack relies on white-box access and clean signal assumptions.

Numbers: Assumes gradient access; no over-the-air/compression experiments

Per-case compute is modest but non-trivial.

Numbers: ≈10–15 minutes per prompt using an NVIDIA H100

Results

Stage 1 native payload discovery (RL-PGD)

Value: AdvBench 0.867 avg; JailbreakBench 0.670 avg

End-to-end ASR (StrongREJECT)

Value: AdvBench avg 0.705; JailbreakBench avg 0.464

End-to-end ASR (LlamaGuard)

Value: JailbreakBench avg 0.748; AdvBench per-model up to 1.00

Perturbation budget impact (Qwen2.5-Omni-3B)

Value:
  • ε=0.2: SR 0.87, LG 0.913, STOI 0.696
  • ε=0.3: SR 0.652, LG 0.826, STOI 0.648
  • ε=0.5: SR 0.826, LG 0.913, STOI 0.591

Human content recognition (subjective)

Value: 100% recognized carrier phrase in human tests
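The perturbation-budget figures reported above for Qwen2.5-Omni-3B can be read as a trade-off between attack success and intelligibility. The sketch below copies those three rows and checks two things: whether STOI falls as ε grows, and which budget keeps the highest STOI while staying above a chosen success cutoff. Reading SR as the StrongREJECT score and LG as the LlamaGuard score is an assumption, as is the 0.8 cutoff; the helper names are hypothetical.

```python
# (epsilon, SR, LG, STOI) rows copied from the summary above.
BUDGET_ROWS = [
    (0.2, 0.870, 0.913, 0.696),
    (0.3, 0.652, 0.826, 0.648),
    (0.5, 0.826, 0.913, 0.591),
]


def stoi_decreases_with_budget(rows):
    """True if STOI strictly falls as epsilon increases."""
    stoi = [row[3] for row in sorted(rows)]
    return all(a > b for a, b in zip(stoi, stoi[1:]))


def most_intelligible_budget(rows, min_sr):
    """Among rows with SR >= min_sr, return the one with the highest STOI."""
    eligible = [row for row in rows if row[1] >= min_sr]
    return max(eligible, key=lambda row: row[3]) if eligible else None


print(stoi_decreases_with_budget(BUDGET_ROWS))
print(most_intelligible_budget(BUDGET_ROWS, min_sr=0.8))  # 0.8 is an assumed cutoff
```

In these three rows STOI declines steadily with ε while SR does not rise monotonically, so a larger budget costs intelligibility without a guaranteed gain in success.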

Who Should Care

What To Try In 7 Days

Audit audio input paths for any place where gradient or local model access is exposed.

Run simple adversarial probes (benign carrier + small perturbation) on in-house models to measure the attack success rate (ASR).

Add an audio-level detection step (e.g., spectral anomaly or ML-based detector) before text filtering.
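As one possible starting point for the spectral-anomaly idea, the sketch below flags clips whose energy above a cutoff frequency is unusually high. It is a heuristic illustration only: it has not been evaluated against WhisperInject, the cutoff and threshold values are assumptions, and the high-frequency tone is just a stand-in for a perturbation. The naive DFT keeps the example dependency-free; a real pipeline would use an FFT library.

```python
import cmath
import math


def high_band_energy_ratio(samples, sample_rate, cutoff_hz):
    """Fraction of non-DC spectral energy at or above cutoff_hz (naive DFT)."""
    n = len(samples)
    total = 0.0
    high = 0.0
    for k in range(1, n // 2 + 1):  # positive-frequency bins, DC skipped
        coeff = sum(
            s * cmath.exp(-2j * math.pi * k * i / n)
            for i, s in enumerate(samples)
        )
        power = abs(coeff) ** 2
        total += power
        if k * sample_rate / n >= cutoff_hz:
            high += power
    return high / total if total > 0 else 0.0


def looks_anomalous(samples, sample_rate, cutoff_hz=2000.0, max_ratio=0.05):
    """Flag a clip whose high-band energy share exceeds max_ratio (assumed values)."""
    return high_band_energy_ratio(samples, sample_rate, cutoff_hz) > max_ratio


RATE = 8000  # Hz, assumed for the toy signals
N = 256
clean = [math.sin(2 * math.pi * 500 * i / RATE) for i in range(N)]
# Stand-in perturbation: a quieter 3 kHz tone added to the clean signal.
perturbed = [c + 0.3 * math.sin(2 * math.pi * 3000 * i / RATE)
             for i, c in enumerate(clean)]
print(looks_anomalous(clean, RATE), looks_anomalous(perturbed, RATE))
```

A fixed threshold like this is easy to evade and prone to false positives on naturally bright audio, so it is better treated as one signal feeding a monitoring step than as a standalone filter.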

Reproducibility

Code Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • White-box threat model requires gradient access; black-box transfer was not tested.
  • No over-the-air, compression, or background-noise experiments were reported.
  • Evaluation uses automated judges and limited human tests; judge bias and thresholds influence ASR.
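The dependence of ASR on judge thresholds can be made concrete with a small calculation. In the sketch below, the judge scores and thresholds are hypothetical placeholders rather than values from the paper, and the function name is illustrative; ASR is taken to be the fraction of responses whose judge score meets the threshold.

```python
def attack_success_rate(judge_scores, threshold):
    """Fraction of judged responses whose score is at or above threshold."""
    if not judge_scores:
        raise ValueError("need at least one judged response")
    hits = sum(1 for score in judge_scores if score >= threshold)
    return hits / len(judge_scores)


# Hypothetical judge scores in [0, 1]; not taken from the paper.
HYPOTHETICAL_SCORES = [0.10, 0.40, 0.55, 0.70, 0.90]

for threshold in (0.25, 0.50, 0.75):
    print(threshold, attack_success_rate(HYPOTHETICAL_SCORES, threshold))
```

The same five responses yield very different rates depending on where the cutoff sits, which is why reported ASR figures are only comparable when the judge and threshold match.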

When Not To Use

  • Estimating risk for fully remote black-box services without gradient access.
  • Assuming identical success over real-world delivery channels (streaming, compression, phone).
  • Treating the results as a definitive metric for audio carriers and languages beyond those tested.

Failure Modes

  • Over-the-air noise, compression, or playback distortions could break the embedded payload.
  • Judge model or safety classifier changes will alter RL-PGD reward signals and outcomes.
  • Shorter carrier phrases reduce available perturbation bandwidth and may lower ASR.

Core Entities

Models

  • Qwen2.5-Omni-3B
  • Qwen2.5-Omni-7B
  • Phi-4-Multimodal
  • Gemma-3n-2B
  • Gemma-3n-4B

Metrics

  • Attack Success Rate (ASR)
  • STOI
  • Cosine similarity (all-MiniLM-L6-v2)
  • Judge harmfulness score (GPT-4o-mini)
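For reference, the cosine-similarity metric compares two embedding vectors by the angle between them. The sketch below shows the calculation on short placeholder vectors; real all-MiniLM-L6-v2 embeddings are 384-dimensional, and which pairs of texts the paper compares is not specified in this summary.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


# Placeholder vectors standing in for sentence embeddings.
print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

Values near 1 indicate that two texts are close in embedding space, and values near 0 indicate little semantic overlap.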

Datasets

  • AdvBench
  • JailbreakBench

Benchmarks

  • StrongREJECT
  • LlamaGuard
  • JailbreakEval