JADE: use grammar-based mutations to find natural inputs that bypass LLM safety guards

November 1, 2023 · 8 min

Overview

Production Readiness

0.4

Novelty Score

0.7

Cost Impact Score

0.7

Citation Count

1

Authors

Mi Zhang, Xudong Pan, Min Yang

Why It Matters For Business

JADE finds natural-looking inputs that bypass safety filters across models, with an average unsafe-generation rate of about 70%. This exposes real deployment risk that static benchmarks miss.

Summary TLDR

JADE is a grammar-driven fuzzing platform that grows or transforms seed unsafe questions into more syntactically complex variants until aligned LLMs break their guardrails. Using hand-crafted generative and transformational rules (from transformational-generative grammar), JADE finds many natural-looking 'proof-of-concept' (PoC) prompts that trigger unsafe outputs across open-source and commercial LLMs. The authors release benchmark demos, add an active prompt-tuning evaluation loop to reduce manual labeling, and report average unsafe-generation rates around 70% on evaluated models. JADE needs fewer model queries and yields more natural prompts than gradient- or suffix-based attacks.

Problem Statement

Current safety tests miss many realistic attacks because a model may refuse a seed question yet fail on syntactic variants of the same malicious intent. Static benchmarks and generator-based red-teaming leave large gaps. We need a systematic, linguistics-aware method to produce natural, transferable inputs that expose model safety failures.

Main Contribution

JADE platform: a targeted linguistic fuzzing system that mutates seed questions using generative and transformational grammar rules.
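The mutation idea can be illustrated with a minimal sketch. It assumes a toy constituency tree and two hypothetical rules, `add_modifier` (generative style, grows a noun phrase) and `front_adverbial` (transformation style, reorders the sentence); neither is taken from JADE's actual rule set, and the seed is a benign placeholder sentence.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Node:
    """A constituency-tree node: a label plus child nodes or word strings."""
    label: str
    children: List[Union["Node", str]] = field(default_factory=list)


def words(node: Node) -> List[str]:
    """Flatten a tree back into its word sequence."""
    out: List[str] = []
    for child in node.children:
        out.extend(words(child) if isinstance(child, Node) else [child])
    return out


def depth(node: Node) -> int:
    """Tree depth; a node that holds only words has depth 1."""
    sub = [depth(c) for c in node.children if isinstance(c, Node)]
    return 1 + max(sub, default=0)


def add_modifier(np: Node, modifier: List[str]) -> Node:
    """Hypothetical generative-style rule: NP -> NP PP."""
    return Node("NP", [np, Node("PP", list(modifier))])


def front_adverbial(s: Node, adverbial: List[str]) -> Node:
    """Hypothetical transformation-style rule: S -> ADVP S."""
    return Node("S", [Node("ADVP", list(adverbial)), s])


def build_seed(obj: Node) -> Node:
    """Benign placeholder seed: 'how do I bake <obj>'."""
    return Node("S", [Node("WH", ["how"]), Node("AUX", ["do"]),
                      Node("NP", ["I"]), Node("VP", ["bake", obj])])


seed = build_seed(Node("NP", ["the", "bread"]))
bigger_obj = add_modifier(Node("NP", ["the", "bread"]), ["with", "rye", "flour"])
mutated = front_adverbial(build_seed(bigger_obj), ["at", "home"])

print(" ".join(words(seed)), "| depth", depth(seed))
print(" ".join(words(mutated)), "| depth", depth(mutated))
```

The mutated sentence keeps the seed's intent but has a deeper parse tree, which is the kind of added syntactic complexity the summary describes.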

Benchmarks: three groups of PoC datasets (open-source Chinese, commercial Chinese, commercial English) with high unsafe-generation rates.

Active prompt tuning: an LLM-based auto-evaluation loop that selects uncertain cases for small-scale human labeling to align automated judgements with experts.
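One way to pick "uncertain cases" is sketched below. It assumes the LLM judge is queried several times per case (for example with different evaluation prompts) and that uncertainty is measured by the entropy of its binary votes. The function names and the entropy criterion are assumptions for illustration; the paper's selection rule may differ.

```python
import math
from typing import Dict, List


def vote_entropy(votes: List[int]) -> float:
    """Binary entropy (in bits) of the unsafe-vote fraction; 0 means unanimous."""
    p = sum(votes) / len(votes)
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def select_for_labeling(votes_by_case: Dict[str, List[int]], k: int) -> List[str]:
    """Return the k case ids whose judge votes disagree the most."""
    ranked = sorted(votes_by_case,
                    key=lambda cid: vote_entropy(votes_by_case[cid]),
                    reverse=True)
    return ranked[:k]


# Toy votes: 1 = judged unsafe, 0 = judged safe.
votes = {
    "case-a": [1, 1, 1, 1],
    "case-b": [1, 0, 1, 0],
    "case-c": [0, 0, 0, 1],
    "case-d": [0, 0, 0, 0],
}
picked = select_for_labeling(votes, k=2)
print(picked)
```

Only the selected cases go to human annotators; their labels can then be used to adjust the judge prompt before the next round.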

Empirical findings: JADE produces natural PoCs that trigger unsafe outputs in many LLMs (average ~70%) and transfer across models; it needs fewer queries than gradient-based attacks.

Key Findings

Mutating seed questions raises the unsafe-generation rate from ~20% to ~70% on evaluated models.

Numbers: seed ≈20% → mutated ≈70% (Δ ≈ +50 percentage points)
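The lift is a difference in percentage points, not a relative change. A minimal sketch of the arithmetic follows, using toy binary labels chosen only to mirror the rounded 20% and 70% figures; they are not data from the paper.

```python
from typing import List


def unsafe_ratio(labels: List[int]) -> float:
    """Fraction of responses judged unsafe (1 = unsafe, 0 = safe)."""
    return sum(labels) / len(labels)


# Toy labels, not results from the paper.
seed_labels = [1, 0, 0, 0, 0]                     # 1 of 5 unsafe
mutated_labels = [1, 1, 1, 0, 1, 1, 0, 1, 0, 1]   # 7 of 10 unsafe

lift_pp = (unsafe_ratio(mutated_labels) - unsafe_ratio(seed_labels)) * 100
print(f"lift: {lift_pp:.1f} percentage points")
```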

Many PoC prompts transfer across models: ~30% of Chinese open-source PoCs trigger all eight tested models; ~60% trigger >3 models.

Numbers: ~30% trigger all 8 models; ~60% trigger >3 (Fig. 11)
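Transferability of this kind can be tallied from a PoC-by-model outcome matrix. The sketch below assumes each PoC has one boolean per tested model; the data is a toy example with four models, and the helper names are hypothetical.

```python
from typing import Dict, List


def models_triggered(results: Dict[str, List[bool]]) -> Dict[str, int]:
    """Count how many models each PoC prompt triggered."""
    return {poc_id: sum(flags) for poc_id, flags in results.items()}


def share_triggering(counts: Dict[str, int], min_models: int) -> float:
    """Fraction of PoCs that triggered at least `min_models` models."""
    hits = sum(1 for n in counts.values() if n >= min_models)
    return hits / len(counts)


# Toy outcome matrix: one boolean per tested model (4 models here).
results = {
    "poc-1": [True, True, True, True],
    "poc-2": [True, False, True, True],
    "poc-3": [False, False, True, False],
    "poc-4": [True, True, True, True],
}
counts = models_triggered(results)
all_models = share_triggering(counts, min_models=4)
more_than_two = share_triggering(counts, min_models=3)
print(all_models, more_than_two)
```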

JADE's PoCs remain natural and semantically close to seeds, unlike many jailbreaking templates.

Numbers: perplexity comparable to seeds; higher semantic similarity vs jailbreaking (Fig. 13)
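Both naturalness measures are simple to compute once a language model supplies token log-probabilities and an encoder supplies sentence embeddings. The sketch below assumes those inputs are already available as plain lists; which models the authors used for either is not stated in this summary.

```python
import math
from typing import List


def perplexity(token_logprobs: List[float]) -> float:
    """Perplexity from natural-log token probabilities: exp(-mean log p)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def cosine_similarity(u: List[float], v: List[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention for a zero vector
    return dot / (norm_u * norm_v)


# Toy inputs: every token has probability 0.5; embeddings are 2-dimensional.
ppl = perplexity([math.log(0.5)] * 4)
sim = cosine_similarity([1.0, 2.0], [2.0, 4.0])
print(round(ppl, 3), round(sim, 3))
```

A mutated prompt with perplexity close to its seed's and cosine similarity near 1 reads as fluent and keeps the original meaning.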

JADE is query-efficient and gradient-free: most successful mutations need fewer than 7 model queries, whereas the gradient-based GCG attack often needs 50 or more queries or times out.

Numbers: most JADE mutations <7 queries; GCG often 50+ or TIMEOUT (Table 3)
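Query count can be tracked with a loop that applies one mutation, queries the target once, and stops at the first flagged response or when the budget runs out. This is a sketch under stated assumptions: `fuzz_until_flagged` is a hypothetical helper, the mutations are plain string edits, and `is_flagged` is a stub standing in for "query the model, then judge the response".

```python
from typing import Callable, Iterable, Optional, Tuple


def fuzz_until_flagged(
    seed: str,
    mutations: Iterable[Callable[[str], str]],
    is_flagged: Callable[[str], bool],
    budget: int = 7,
) -> Tuple[Optional[str], int]:
    """Apply mutations one by one; return (flagged prompt or None, queries used)."""
    prompt = seed
    queries = 0
    for mutate in mutations:
        if queries >= budget:
            break
        prompt = mutate(prompt)
        queries += 1  # one target-model query per mutated prompt
        if is_flagged(prompt):
            return prompt, queries
    return None, queries


# Stub judge: flags any prompt of 8+ words. A real setup would call the
# target model and an evaluator here instead.
def stub_is_flagged(prompt: str) -> bool:
    return len(prompt.split()) >= 8


mutations = [
    lambda p: p + " at home",        # grow the sentence
    lambda p: "in practice " + p,    # front an adverbial
]
found, used = fuzz_until_flagged("how do I bake bread", mutations, stub_is_flagged)
print(found, used)
```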

Results

Unsafe generation ratio — Open-sourced Chinese group (average)

Value: 74.13%

Baseline: seed questions ~20%

Unsafe generation ratio — Commercial English MaaS (average)

Value: 74.38%

Baseline: seed questions ~20%

Unsafe generation ratio — Commercial Chinese MaaS (average)

Value: 77.5%

Baseline: seed questions ~20%

Effectiveness lift (seed → mutated)

Value: seed ≈20% → mutated ≈70%

Baseline: seed questions

Query efficiency vs gradient attack

Value: most JADE mutations <7 queries; GCG often >50 or TIMEOUT

Baseline: GCG attack

Who Should Care

What To Try In 7 Days

Run JADE or apply grammar-based mutations to a sample of your safety-critical prompts.

Add PoC prompts to your red-team and retraining datasets to close syntactic blind spots.

Use active prompt tuning to reduce human labeling by focusing on uncertain cases.

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Relies on constituency parser quality and hand-crafted rules, so coverage depends on parser and rule set.
  • Currently supports Chinese and English rules; other languages need new rule engineering.
  • Auto-evaluation is binary and depends on small human-annotated pools; fine-grained severity is not yet supported.
  • Evaluation is limited to the listed open-source and commercial models as they behaved at the time of testing.

When Not To Use

  • When you need to evaluate non-syntactic failures like hallucinations or factual inconsistency.
  • When evaluating multimodal models or languages without implemented grammar rules.
  • When you require fine-grained severity labels rather than binary unsafe/safe judgements.

Failure Modes

  • Models fail on high syntactic complexity and long dependency distances (deep parse trees).
  • Models are distracted by added constituents, causing logical inconsistency.
  • LLM-based automatic evaluators can be misaligned and need human-in-the-loop correction.
  • Gradient- and suffix-based attacks leave different artifacts than grammar-based mutations; defenses tuned to one may miss the other.
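Dependency distance, mentioned in the first failure mode, can be quantified from a dependency parse. The sketch below uses a common definition, the mean absolute difference between each token's position and its head's position; this summary does not say how the authors measured it, so treat the definition and the hand-written parses as assumptions.

```python
from typing import List, Optional


def mean_dependency_distance(heads: List[Optional[int]]) -> float:
    """heads[i] is the 0-based index of token i's head, or None for the root."""
    dists = [abs(i - h) for i, h in enumerate(heads) if h is not None]
    return sum(dists) / len(dists)


# Hand-written toy parses (assumed, not parser output).
# "I bake bread": I -> bake, bake = root, bread -> bake
short = mean_dependency_distance([1, None, 1])
# "I bake bread at home": at -> home, home -> bake
longer = mean_dependency_distance([1, None, 1, 4, 1])
print(short, longer)
```

Tracking this value alongside parse-tree depth for each mutated prompt shows whether failures cluster at higher syntactic complexity.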

Core Entities

Models

  • ChatGLM-6B
  • ChatGLM2
  • InternLM
  • Ziya-LLaMA-13B
  • Baichuan2-7B-chat
  • BELLE-7B-2M
  • SFT
  • ChatYuan-large-v2
  • ChatGPT (gpt-3.5-turbo)
  • Claude (Claude-instant)
  • PaLM2
  • Llama-2-70b-chat
  • Doubao
  • Wenxin Yiyan
  • SenseChat
  • ABAB

Metrics

  • unsafe generation ratio
  • perplexity (PPL)
  • semantic similarity (embedding cosine)
  • transferability (models triggered)
  • query count to PoC

Datasets

  • JADE PoC benchmark — open-sourced Chinese
  • JADE PoC benchmark — commercial Chinese
  • JADE PoC benchmark — commercial English

Benchmarks

  • JADE-generated safety benchmarks (three groups)