WhisperInject: covertly embed model-native harmful text into benign audio to jailbreak multimodal LLMs

Overview

Decision Snapshot: Ready For Pilot

The method is novel for audio-language models and shows measurable success on recent models, but it requires white-box gradient access and lacks over-the-air and compression tests, lowering real-world readiness.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 0/5

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 100%

Novelty: 80%

Authors

Hiskias Dingeto, Taeyoun Kwon, Dasol Choi, Bodam Kim, DongGeon Lee, Haon Park, JaeHoon Lee, Jongho Shin

Why It Matters For Business

Voice interfaces can be hijacked to make models produce harmful or policy-violating text without changing what humans hear, so companies must add audio-level safety, monitoring, and access controls.

Who Should Care

CTO, Product Manager, ML Engineer, Engineering Lead, Data Scientist

Summary TLDR

This paper presents WhisperInject, a two-stage white-box attack. Stage 1 uses RL-PGD, a reward-guided variant of projected gradient descent (PGD), to make an audio-language model produce a harmful text response. Stage 2 hides that model-native harmful response as a tiny perturbation inside a benign audio clip (e.g., “How's the weather today?”). Evaluated on five recent multimodal models and two jailbreak benchmarks, the method achieves high end-to-end success (for example, an AdvBench StrongREJECT average of about 70.5%) while keeping the audio intelligible to humans (STOI ≥ 0.59 and 100% human content recognition in tests). The attack assumes gradient access and was not tested over the air or under compression, so real-world transfer is unproven.

Problem Statement

Audio inputs are becoming a common interface for LLMs, but existing safety checks focus on text and miss audio-native attacks. The paper asks: can an adversary embed a covert, human-inaudible payload into benign audio so an aligned audio-language model will generate harmful content?

Main Contribution

A new two-stage audio jailbreak: Stage 1 uses RL-PGD to discover harmful responses the model will actually produce; Stage 2 injects those discovered responses into benign audio via gradient-optimized perturbations.

Empirical threat demonstration on five modern multimodal models showing high attack success while preserving human-perceived audio content.
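Stage 1 is described as a reward-guided PGD, and both stages depend on gradient-based optimization of a small perturbation. The toy sketch below shows only the generic signed-step-and-project loop of PGD on a made-up linear score. It is not the authors' RL-PGD: there is no audio model, no reward signal, and no target text, and the names `pgd_linf`, `grad_fn`, `eps`, `step`, and the weights are assumptions chosen for illustration.

```python
def sign(value):
    """Return -1, 0, or 1 according to the sign of value."""
    return (value > 0) - (value < 0)


def pgd_linf(x, grad_fn, eps, step, iters):
    """Maximize a score by signed gradient ascent inside an L-infinity ball.

    x       : clean input (list of floats)
    grad_fn : hypothetical callable returning d(score)/d(input) at a point
    eps     : maximum absolute change allowed per sample
    step    : size of each signed update
    iters   : number of update-and-project rounds
    """
    delta = [0.0] * len(x)
    for _ in range(iters):
        grad = grad_fn([xi + di for xi, di in zip(x, delta)])
        # Signed ascent step, then projection back into [-eps, eps].
        delta = [
            max(-eps, min(eps, di + step * sign(gi)))
            for di, gi in zip(delta, grad)
        ]
    return delta


# Toy linear score w . x, whose gradient is simply w (illustrative only).
weights = [0.5, -1.0, 2.0]
clean = [0.1, 0.2, 0.3]
delta = pgd_linf(clean, lambda point: weights, eps=0.01, step=0.004, iters=5)
# Every entry of delta ends up pinned at +eps or -eps, following the gradient sign.
```

The projection step is what keeps the change small: however many iterations run, no sample moves by more than `eps`, which is the general reason such perturbations can stay hard to hear.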

Key Findings

Stage 1 (RL-PGD) reliably finds model-native harmful payloads.

Numbers: AdvBench 86.7% avg, JailbreakBench 67.0% avg

Practical Use: Automate payload selection. Defenders should assume attackers can find harmful replies without hand-crafting targets.

Evidence Ref: Table 2; Sec. 4.2

End-to-end adversarial audio triggers harmful outputs at substantial rates.

Numbers: AdvBench StrongREJECT avg 70.5%; JailbreakBench StrongREJECT avg 46.4%

Practical Use: Audio channels can bypass text filters on current multimodal models; prioritize audio-level safety checks.

Evidence Ref: Table 1; Sec. 4.1 and 4.4

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Stage 1 native payload discovery (RL-PGD)	AdvBench 0.867 avg; JailbreakBench 0.670 avg	—	—	AdvBench / JailbreakBench	Table 2; Sec. 4.2	Table 2
End-to-end ASR (StrongREJECT)	AdvBench avg 0.705; JailbreakBench avg 0.464	—	—	AdvBench / JailbreakBench	Table 1; Sec. 4.1 and 4.4	Table 1
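In this table ASR means attack success rate, not automatic speech recognition. As a rough sketch of how such averages are typically computed, the snippet below turns per-prompt judge verdicts into a per-model rate and then an unweighted mean across models. Whether the paper's "avg" is exactly this macro average is an assumption, and the model names and verdicts are made up for illustration; they are not data from the paper.

```python
def attack_success_rate(verdicts):
    """Fraction of attempts that a judge marked as successful (True)."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    return sum(1 for v in verdicts if v) / len(verdicts)


def macro_average(rates):
    """Unweighted mean of per-model rates."""
    if not rates:
        raise ValueError("need at least one model")
    return sum(rates.values()) / len(rates)


# Made-up verdicts for two hypothetical models; not results from the paper.
verdicts_by_model = {
    "model_a": [True, True, False, True],
    "model_b": [True, False, False, False],
}
per_model = {
    name: attack_success_rate(verdicts)
    for name, verdicts in verdicts_by_model.items()
}
average = macro_average(per_model)
```

Reporting the per-model rates alongside the average matters here, because an average near 0.7 can hide one model that is far more or far less vulnerable than the rest.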

What To Try In 7 Days

Audit audio input paths for any place where an attacker could obtain gradient or local model access.

Run simple adversarial probes (benign carrier + small perturbation) on in-house models to measure attack success rate (ASR).

Add an audio-level detection step (e.g., spectral anomaly or ML-based detector) before text filtering.
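One very simple form of the spectral anomaly check mentioned above is to measure how much of a frame's energy sits above a cutoff frequency and flag frames where that share is unusually high. The sketch below is a heuristic under stated assumptions: the paper does not evaluate any detector, nothing here shows that WhisperInject perturbations concentrate in high frequencies, and `high_band_energy_ratio`, the 3 kHz cutoff, and the synthetic tones are all illustrative choices. It uses a naive DFT to stay dependency-free; a real pipeline would use an FFT library.

```python
import cmath
import math


def high_band_energy_ratio(frame, sample_rate, cutoff_hz):
    """Fraction of spectral energy at or above cutoff_hz (naive O(n^2) DFT)."""
    n = len(frame)
    total = 0.0
    high = 0.0
    for k in range(n // 2 + 1):  # non-negative frequency bins only
        coeff = sum(
            sample * cmath.exp(-2j * math.pi * k * t / n)
            for t, sample in enumerate(frame)
        )
        energy = abs(coeff) ** 2
        total += energy
        if k * sample_rate / n >= cutoff_hz:
            high += energy
    return high / total if total > 0 else 0.0


def tone(freq_hz, amplitude, n, sample_rate):
    """Synthetic sine wave used as stand-in audio for the demo."""
    return [
        amplitude * math.sin(2 * math.pi * freq_hz * t / sample_rate)
        for t in range(n)
    ]


SAMPLE_RATE = 8000
N = 256
carrier = tone(500, 1.0, N, SAMPLE_RATE)    # stand-in for benign speech
hiss = tone(3500, 0.1, N, SAMPLE_RATE)      # stand-in for an added perturbation
suspicious = [a + b for a, b in zip(carrier, hiss)]

clean_ratio = high_band_energy_ratio(carrier, SAMPLE_RATE, cutoff_hz=3000)
suspicious_ratio = high_band_energy_ratio(suspicious, SAMPLE_RATE, cutoff_hz=3000)
```

In this toy setup the clean carrier has essentially no energy above the cutoff, while the mixed signal carries about 1% there. Any threshold would need to be calibrated on real benign traffic, and an adaptive attacker could optimize against a fixed rule like this one.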

Reproducibility

Code Available: Yes

Data Available: No

Open Source Status: Partial

License: Unknown

Risks & Boundaries

Limitations

White-box threat model requires gradient access; black-box transfer was not tested.

No over-the-air, compression, or background-noise experiments were reported.

When Not To Use

Estimating risk for fully remote black-box services without gradient access.

Assuming identical success over real-world delivery channels (streaming, compression, phone).

Failure Modes

Over-the-air noise, compression, or playback distortions could break the embedded payload.

Judge model or safety classifier changes will alter RL-PGD reward signals and outcomes.
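Because the paper reports no over-the-air or compression experiments, a team probing its own models may want to pass candidate audio through simple channel distortions first. The sketch below offers two crude stand-ins, additive white noise at a target SNR and uniform bit-depth quantization; neither models a real codec or a real room, and `add_noise`, `quantize`, and the sample values are hypothetical helpers for illustration.

```python
import math
import random


def mean_power(signal):
    """Average squared amplitude of the signal."""
    return sum(s * s for s in signal) / len(signal)


def add_noise(signal, snr_db, seed=0):
    """Add white Gaussian noise scaled to a target signal-to-noise ratio."""
    rng = random.Random(seed)
    sigma = math.sqrt(mean_power(signal) / (10 ** (snr_db / 10)))
    return [s + rng.gauss(0.0, sigma) for s in signal]


def quantize(signal, bits):
    """Round samples in [-1, 1] onto a uniform grid of step 2**-(bits - 1)."""
    levels = 2 ** (bits - 1)
    return [max(-1.0, min(1.0, round(s * levels) / levels)) for s in signal]


# A perturbation smaller than half a quantization step is erased entirely.
clean = [0.25, -0.5, 0.75]
perturbed = [s + 0.001 for s in clean]
survives_8bit = quantize(perturbed, 8) != quantize(clean, 8)
survives_16bit = quantize(perturbed, 16) != quantize(clean, 16)
```

Here the 0.001 offset disappears at 8 bits (step of about 0.0078) but survives at 16 bits, which illustrates why results obtained on pristine digital audio may not carry over to lossy delivery channels.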

Core Entities

Models

Qwen2.5-Omni-3B, Qwen2.5-Omni-7B, Phi-4-Multimodal, Gemma-3n-2B, Gemma-3n-4B

Metrics

Attack Success Rate (ASR), STOI, Cosine similarity (all-MiniLM-L6-v2), Judge harmfulness score (GPT-4o-mini)
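Of these metrics, cosine similarity is the simplest to state precisely: it is the dot product of two embedding vectors divided by the product of their lengths. The sketch below computes it on plain lists. This page does not say which texts the paper compares, and the step that would turn text into all-MiniLM-L6-v2 embeddings is not shown, so the short vectors here are hypothetical placeholders.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical 2-dimensional "embeddings"; real sentence embeddings are far longer.
same_direction = cosine_similarity([3.0, 4.0], [6.0, 8.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

Because the measure ignores vector length, two texts whose embeddings point the same way score 1 regardless of magnitude, while unrelated directions score near 0.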

Datasets

AdvBench, JailbreakBench

Benchmarks

StrongREJECT, LlamaGuard, JailbreakEval
