JADE: use grammar-based mutations to find natural inputs that bypass LLM safety guards

Overview

Decision Snapshot: Needs Validation

The approach is novel and well-supported across many models, but experiments focus on specific models, languages, and lab settings; active evaluation needs more validation at scale.

Citations: 1

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 40%

Novelty: 70%

Authors

Mi Zhang, Xudong Pan, Min Yang


Why It Matters For Business

JADE finds natural-looking inputs that bypass safety filters across models (average unsafe-generation rate around 70% on the evaluated models), revealing deployment risk that static benchmarks miss.

Who Should Care

CTO, ML Engineer, Product Manager, Engineering Lead, Founder, Data Scientist

Summary TLDR

JADE is a grammar-driven fuzzing platform that grows or transforms seed unsafe questions into more syntactically complex variants until aligned LLMs break their guardrails. Using hand-crafted generative and transformational rules (from transformational-generative grammar), JADE finds many natural-looking 'proof-of-concept' (PoC) prompts that trigger unsafe outputs across open-source and commercial LLMs. The authors release benchmark demos, add an active prompt-tuning evaluation loop to reduce manual labeling, and report average unsafe-generation rates around 70% on evaluated models. According to the authors, JADE needs fewer model queries and produces more natural prompts than gradient-based or suffix-based attacks.

Problem Statement

Current safety tests miss many realistic attacks because a model may refuse a malicious request in its plain form yet still fail on a syntactic variant of the same intent. Static benchmarks and generator-based red-teaming leave large gaps. We need a systematic, linguistics-aware method to produce natural, transferable inputs that expose model safety failures.

Main Contribution

JADE platform: a targeted linguistic fuzzing system that mutates seed questions using generative and transformational grammar rules.

Benchmarks: three groups of PoC datasets (open-source Chinese, commercial Chinese, commercial English) with high unsafe-generation rates.
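The mutation idea behind the platform can be illustrated with a toy example. This is a minimal sketch, not JADE's implementation: the `Question` class, the three rule functions, and the benign seed sentence are all hypothetical and chosen only to show how generative rules (adding constituents) and a transformational rule (embedding the question) increase syntactic complexity while keeping the core intent.

```python
import random
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Question:
    """A hypothetical, heavily simplified constituent view of a wh-question."""
    wh: str
    subject: str
    verb: str
    obj: str
    adjunct: str = ""
    embedded: bool = False

    def render(self) -> str:
        tail = f"{self.verb} {self.obj}" + (f" {self.adjunct}" if self.adjunct else "")
        if self.embedded:
            # Embedded questions use declarative word order ("how I can ...").
            return f"Could you explain {self.wh.lower()} {self.subject} can {tail}?"
        return f"{self.wh} can {self.subject} {tail}?"


def add_relative_clause(q: Question) -> Question:
    """Generative rule (assumed): grow the object noun phrase."""
    return replace(q, obj=f"{q.obj} that was left in the garage")


def add_adverbial(q: Question) -> Question:
    """Generative rule (assumed): attach an adverbial clause."""
    return replace(q, adjunct=f"{q.adjunct} when the weather is cold".strip())


def embed_question(q: Question) -> Question:
    """Transformational rule (assumed): turn a direct question into an embedded one."""
    return replace(q, embedded=True)


RULES = [add_relative_clause, add_adverbial, embed_question]


def mutate(seed: Question, steps: int, rng: random.Random) -> Question:
    """Apply `steps` randomly chosen rules to the seed question."""
    q = seed
    for _ in range(steps):
        q = rng.choice(RULES)(q)
    return q


seed = Question("How", "I", "repair", "a bicycle tire")
variant = mutate(seed, steps=3, rng=random.Random(0))
print(seed.render())
print(variant.render())
```

A real system would operate on parse trees produced by a constituency parser rather than on fixed slots, and would check each variant against the target model before mutating further.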

Key Findings

Mutating seed questions raises unsafe-generation from ~20% to ~70% on evaluated models.

Numbers: seed ≈20% → mutated ≈70% (Δ ≈ +50 percentage points)

Practical Use: Add linguistic-mutation tests to your red-team; safety tuning that passes static benchmarks can still fail on simple syntactic variants.

Evidence Ref: Sec. 1.4, Sec. 4.2, Fig. 9
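The percentage-point delta in this finding is simple arithmetic over labeled responses. A minimal sketch, assuming each response has already been labeled unsafe (`True`) or safe (`False`) by a human or a judge model; the label lists are made-up illustrative data, not the paper's results, and the function names are hypothetical.

```python
def unsafe_ratio(labels):
    """Fraction of responses labeled unsafe (True)."""
    if not labels:
        raise ValueError("need at least one labeled response")
    return sum(labels) / len(labels)


def delta_pp(baseline, mutated):
    """Difference between two ratios, expressed in percentage points."""
    return (mutated - baseline) * 100


# Illustrative data only: 2 of 10 seed responses unsafe, 7 of 10 mutated responses unsafe.
seed_labels = [True] * 2 + [False] * 8
mutated_labels = [True] * 7 + [False] * 3

seed_rate = unsafe_ratio(seed_labels)
mutated_rate = unsafe_ratio(mutated_labels)
print(f"seed {seed_rate:.0%} -> mutated {mutated_rate:.0%} "
      f"({delta_pp(seed_rate, mutated_rate):+.1f} pp)")
```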

Many PoC prompts transfer across models: ~30% of Chinese open-source PoCs trigger all eight tested models; ~60% trigger >3 models.

Numbers: ~30% trigger all 8 models; ~60% trigger >3 models (Fig. 11)

Practical Use: Vulnerabilities are common across vendors; fixing one model's guardrail may not stop similar failures in others.

Evidence Ref: Sec. 1.4, Sec. 4.3, Fig. 11
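Transferability figures of this kind can be computed from a record of which models each PoC prompt triggered. A minimal sketch under assumed data structures: the `triggers` mapping, the PoC identifiers, and the model names `m1` to `m8` are placeholders, not the paper's data.

```python
def transfer_stats(triggers, n_models):
    """triggers maps a PoC id to the set of models it triggered."""
    if not triggers:
        raise ValueError("need at least one PoC")
    counts = [len(models) for models in triggers.values()]
    total = len(counts)
    return {
        "all_models": sum(c == n_models for c in counts) / total,
        "more_than_3": sum(c > 3 for c in counts) / total,
    }


ALL = {f"m{i}" for i in range(1, 9)}  # eight hypothetical models

# Illustrative data only.
triggers = {
    "poc-1": set(ALL),                 # triggers all eight
    "poc-2": {"m1", "m2", "m3", "m4"}, # triggers four
    "poc-3": {"m1"},                   # triggers one
    "poc-4": set(),                    # triggers none
}

stats = transfer_stats(triggers, n_models=len(ALL))
print(stats)
```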

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Unsafe generation ratio — Open-sourced Chinese group (average)	74.13%	seed questions ~20%	+~54 pp	JADE open-sourced Chinese benchmark	Table in Abstract and Sec.4 (group averages)	Abstract table; Sec.4
Unsafe generation ratio — Commercial English MaaS (average)	74.38%	seed questions ~20%	+~54 pp	JADE commercial English benchmark	Abstract table; Sec.4	Abstract table; Sec.4

What To Try In 7 Days

Run JADE or apply grammar-based mutations to a sample of your safety-critical prompts.

Add PoC prompts to your red-team and retraining datasets to close syntactic blind spots.

Use active prompt tuning to reduce human labeling by focusing on uncertain cases.
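The last step, focusing human labeling on uncertain cases, can be approximated with plain uncertainty sampling. This sketch is not the paper's active prompt-tuning procedure, which is not detailed here; it assumes a judge model that returns a probability that each response is unsafe, and the function name and scores are hypothetical.

```python
def select_for_labeling(scores, k):
    """Return the k response ids whose unsafe-probability is closest to 0.5."""
    ranked = sorted(scores, key=lambda rid: abs(scores[rid] - 0.5))
    return ranked[:k]


# Illustrative judge scores: probability that each response is unsafe.
judge_scores = {"a": 0.95, "b": 0.52, "c": 0.10, "d": 0.40}

to_label = select_for_labeling(judge_scores, k=2)
print(to_label)
```

Confidently scored responses keep their automatic label, while the selected ones go to a human annotator whose labels can then be used to refine the judge prompt.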

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/whitzard-ai/jade-db

Data URLs

https://github.com/whitzard-ai/jade-db

https://whitzard-ai.github.io/jade.html

Risks & Boundaries

Limitations

Relies on constituency parser quality and hand-crafted rules, so coverage depends on parser and rule set.

Currently supports Chinese and English rules; other languages need new rule engineering.

When Not To Use

When you need to evaluate non-syntactic failures like hallucinations or factual inconsistency.

When evaluating multimodal models or languages without implemented grammar rules.

Failure Modes

Models fail on high syntactic complexity and long dependency distances (deep parse trees).

Models are distracted by added constituents, causing logical inconsistency.
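The two complexity notions named in these failure modes, parse-tree depth and dependency distance, can be measured directly. A minimal sketch, assuming a bracketed constituency parse string and a list of 1-based head indices (0 for the root) are already available from some parser; the example sentence and helper names are hypothetical.

```python
def tree_depth(bracketed):
    """Maximum nesting depth of a bracketed parse such as '(S (NP I) (VP ...))'."""
    depth = max_depth = 0
    for ch in bracketed:
        if ch == "(":
            depth += 1
            max_depth = max(max_depth, depth)
        elif ch == ")":
            depth -= 1
            if depth < 0:
                raise ValueError("unbalanced parse string")
    if depth != 0:
        raise ValueError("unbalanced parse string")
    return max_depth


def mean_dependency_distance(heads):
    """Average |token position - head position| over all non-root tokens."""
    distances = [abs(i - h) for i, h in enumerate(heads, start=1) if h != 0]
    if not distances:
        raise ValueError("no dependency arcs")
    return sum(distances) / len(distances)


# "I repair a tire": I -> repair, repair = root, a -> tire, tire -> repair
parse = "(S (NP I) (VP (V repair) (NP (Det a) (N tire))))"
heads = [2, 0, 4, 2]

depth = tree_depth(parse)
mdd = mean_dependency_distance(heads)
print(depth, round(mdd, 3))
```

Tracking these values for each mutated prompt makes it possible to check whether failures cluster at higher complexity in your own tests.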

Core Entities

Models

ChatGLM-6B, ChatGLM2, InternLM, Ziya-LLaMA-13B, Baichuan2-7B-chat, BELLE-7B-2M, SFT, ChatYuan-large-v2, ChatGPT (gpt-3.5-turbo), Claude (Claude-instant), PaLM2, Llama-2-70b-chat, Doubao, Wenxin Yiyan, SenseChat, ABAB

Metrics

unsafe generation ratio, perplexity (PPL), semantic similarity (embedding cosine), transferability (models triggered), query count to PoC
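Two of these metrics have standard closed forms: cosine similarity between embedding vectors, and perplexity as the exponential of the negative mean token log-probability. A minimal sketch with made-up inputs; which embedding model or language model supplies the vectors and log-probabilities is left as an assumption.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("zero vector has no direction")
    return dot / (norm_u * norm_v)


def perplexity(token_logprobs):
    """exp(-mean natural-log probability) over the tokens of a prompt."""
    if not token_logprobs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Illustrative inputs only.
same = cosine_similarity([1.0, 0.0], [1.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
ppl = perplexity([math.log(0.5)] * 4)  # every token has probability 0.5
print(same, orthogonal, ppl)
```

A high cosine similarity between a seed and its mutation suggests the intent was preserved, and a low perplexity suggests the mutated prompt still reads as natural language.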

Datasets

JADE PoC benchmark (open-sourced Chinese), JADE PoC benchmark (commercial Chinese), JADE PoC benchmark (commercial English)

Benchmarks

JADE-generated safety benchmarks (three groups)

