Overview
Production Readiness
0.4
Novelty Score
0.7
Cost Impact Score
0.7
Citation Count
1
Why It Matters For Business
JADE generates natural-sounding inputs that bypass safety guardrails across models (average unsafe-generation rate around 70%), revealing deployment risk that static benchmarks miss.
Summary TLDR
JADE is a grammar-driven fuzzing platform that grows or transforms unsafe seed questions into more syntactically complex variants until aligned LLMs break their guardrails. Using hand-crafted generative and transformational rules (from transformational-generative grammar), JADE finds many natural-looking 'proof-of-concept' (PoC) prompts that trigger unsafe outputs across open-source and commercial LLMs. The authors release benchmark demos, add an active prompt-tuning evaluation loop to reduce manual labeling, and report average unsafe-generation rates around 70% on evaluated models. JADE needs fewer model queries and produces more natural prompts than gradient- or suffix-based attacks.
Problem Statement
Current safety tests miss many realistic attacks because a model that refuses a malicious question may still fail on a syntactic variant with the same intent. Static benchmarks and generator-based red-teaming leave large gaps. A systematic, linguistics-aware method is needed to produce natural, transferable inputs that expose model safety failures.
Main Contribution
JADE platform: a targeted linguistic fuzzing system that mutates seed questions using generative and transformational grammar rules.
Benchmarks: three groups of PoC datasets (open-source Chinese, commercial Chinese, commercial English) with high unsafe-generation rates.
Active prompt tuning: an LLM-based auto-evaluation loop that selects uncertain cases for small-scale human labeling to align automated judgements with experts.
Empirical findings: JADE produces natural PoCs that trigger unsafe outputs in many LLMs (average ~70%) and that transfer across models; it is query-efficient compared with gradient-based attacks.
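The mutation loop described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the two string-level rules, the `fuzz` function, and the `breaks_guardrail` oracle are hypothetical names, and real transformational-generative rules would operate on a constituency parse tree rather than on raw strings. A benign seed and a toy length-based oracle stand in for a real unsafe question and a real safety judge.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical toy rules. Each maps a question to a syntactically more
# complex variant. JADE's actual rules work on parse trees, not strings.
def add_adverbial_clause(q: str) -> str:
    return "Considering everything we discussed, " + q[0].lower() + q[1:]

def embed_in_relative_clause(q: str) -> str:
    body = q[0].lower() + q[1:].rstrip("?")
    return "What is the answer to the question that asks " + body + "?"

RULES: List[Callable[[str], str]] = [add_adverbial_clause, embed_in_relative_clause]

def fuzz(
    seed: str,
    breaks_guardrail: Callable[[str], bool],
    max_queries: int = 7,
) -> Tuple[Optional[str], int]:
    """Apply rules round-robin, growing complexity until the oracle
    fires or the query budget is exhausted.

    Returns (mutated prompt or None, number of oracle queries used).
    """
    current = seed
    for queries in range(1, max_queries + 1):
        rule = RULES[(queries - 1) % len(RULES)]
        current = rule(current)
        if breaks_guardrail(current):  # one model query per candidate
            return current, queries
    return None, max_queries

# Toy oracle (assumption): "breaks" once the prompt exceeds 20 words.
poc, used = fuzz("How do I bake bread?", lambda q: len(q.split()) > 20)
```

The default budget of 7 queries is chosen only to echo the reported query counts; a real run would plug in a target model plus a safety evaluator as the oracle.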
Key Findings
Mutating seed questions raises the unsafe-generation rate from ~20% to ~70% on evaluated models.
Many PoC prompts transfer across models: ~30% of Chinese open-source PoCs trigger all eight tested models, and ~60% trigger more than three models.
JADE's PoCs remain natural and semantically close to seeds, unlike many jailbreaking templates.
JADE is query-efficient and gradient-free: most successful mutations require fewer than 7 model queries, whereas gradient attacks need many more or time out.
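The transferability figures above are shares of PoCs that trigger a given number of models. A minimal sketch of that bookkeeping follows, assuming a hypothetical mapping from each PoC prompt to the set of models it triggered; the function name and the sample data are illustrative, not taken from the paper.

```python
from typing import Dict, Set

def transfer_stats(
    triggered: Dict[str, Set[str]], n_models: int, k: int = 3
) -> Dict[str, float]:
    """Fraction of PoCs that trigger all models, and fraction that
    trigger more than k models."""
    total = len(triggered)
    if total == 0:
        return {"all_models": 0.0, "more_than_k": 0.0}
    all_hit = sum(1 for models in triggered.values() if len(models) == n_models)
    over_k = sum(1 for models in triggered.values() if len(models) > k)
    return {"all_models": all_hit / total, "more_than_k": over_k / total}

# Illustrative data only (four made-up PoCs, four made-up models).
sample = {
    "poc-1": {"m1", "m2", "m3", "m4"},
    "poc-2": {"m1", "m2"},
    "poc-3": {"m1", "m2", "m3", "m4"},
    "poc-4": set(),
}
stats = transfer_stats(sample, n_models=4, k=3)
```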
Results
Unsafe generation ratio — Open-sourced Chinese group (average)
Unsafe generation ratio — Commercial English MaaS (average)
Unsafe generation ratio — Commercial Chinese MaaS (average)
Effectiveness lift (seed → mutated)
Query efficiency vs gradient attack
Who Should Care
What To Try In 7 Days
Run JADE or apply grammar-based mutations to a sample of your safety-critical prompts.
Add PoC prompts to your red-team and retraining datasets to close syntactic blind spots.
Use active prompt tuning to reduce human labeling by focusing on uncertain cases.
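One way to focus human labeling on uncertain cases is to query the LLM evaluator several times (for example with different evaluation prompts) and send the cases with the most disagreement to annotators. The sketch below assumes vote entropy as the uncertainty measure; that choice, and the names `vote_entropy` and `select_for_labeling`, are assumptions for illustration and may differ from how JADE's active prompt tuning scores uncertainty.

```python
import math
from typing import Dict, List

def vote_entropy(votes: List[bool]) -> float:
    """Binary entropy (in bits) of unsafe/safe votes for one case."""
    p = sum(votes) / len(votes)
    if p in (0.0, 1.0):
        return 0.0  # unanimous votes carry no uncertainty
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def select_for_labeling(
    votes_by_case: Dict[str, List[bool]], budget: int
) -> List[str]:
    """Return the `budget` case ids whose evaluator votes disagree most."""
    ranked = sorted(
        votes_by_case,
        key=lambda case: vote_entropy(votes_by_case[case]),
        reverse=True,
    )
    return ranked[:budget]

# Illustrative votes: True means the evaluator judged the output unsafe.
votes = {
    "a": [True, True, True, True],
    "b": [True, False, True, False],
    "c": [True, True, True, False],
}
to_label = select_for_labeling(votes, budget=2)
```

The human labels collected for the selected cases would then be used to revise the evaluation prompt before the next round.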
Reproducibility
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Relies on constituency parser quality and hand-crafted rules, so coverage depends on parser and rule set.
- Currently supports Chinese and English rules; other languages need new rule engineering.
- Auto-evaluation is binary and depends on small human-annotated pools; fine-grained severity is not yet supported.
- Evaluation is limited to the listed open-source and commercial models as they behaved at the time of testing.
When Not To Use
- When you need to evaluate non-syntactic failures like hallucinations or factual inconsistency.
- When evaluating multimodal models or languages without implemented grammar rules.
- When you require fine-grained severity labels rather than binary unsafe/safe judgements.
Failure Modes
- Models fail on high syntactic complexity and long dependency distances (deep parse trees).
- Models are distracted by added constituents, causing logical inconsistency.
- LLM-based automatic evaluators can be misaligned and need human-in-the-loop correction.
- Linguistic mutations and gradient-based suffix attacks leave different artifacts; defenses tuned to one may miss the other.
Core Entities
Models
- ChatGLM-6B
- ChatGLM2
- InternLM
- Ziya-LLaMA-13B
- Baichuan2-7B-chat
- BELLE-7B-2M
- SFT
- ChatYuan-large-v2
- ChatGPT (gpt-3.5-turbo)
- Claude (Claude-instant)
- PaLM2
- Llama-2-70b-chat
- Doubao
- Wenxin Yiyan
- SenseChat
- ABAB
Metrics
- unsafe generation ratio
- perplexity (PPL)
- semantic similarity (embedding cosine)
- transferability (models triggered)
- query count to PoC
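Three of the metrics above have standard definitions that can be sketched in plain Python. The function names are hypothetical, and the inputs (binary unsafe labels, embedding vectors, per-token log-probabilities) are assumed to come from an evaluator, an embedding model, and a language model that are not shown here.

```python
import math
from typing import Sequence

def unsafe_generation_ratio(labels: Sequence[bool]) -> float:
    """Share of responses judged unsafe (True = unsafe)."""
    return sum(labels) / len(labels) if labels else 0.0

def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention for zero vectors
    return dot / (norm_a * norm_b)

def perplexity(token_logprobs: Sequence[float]) -> float:
    """exp of the negative mean natural-log probability per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

ratio = unsafe_generation_ratio([True, True, False, True])
```

Lower perplexity indicates a more fluent, natural-looking prompt, and higher cosine similarity indicates that a mutated prompt stays closer in meaning to its seed.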
Datasets
- JADE PoC benchmark — open-sourced Chinese
- JADE PoC benchmark — commercial Chinese
- JADE PoC benchmark — commercial English
Benchmarks
- JADE-generated safety benchmarks (three groups)

