Judge-free, multilingual jailbreak stress test for 12 South Asian languages with 45k+ prompts

February 18, 2026 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.68

Cost Impact Score

0.5

Citation Count

0

Authors

Priyaranjan Pattnayak, Sanchari Chowdhuri

Links

Abstract / PDF

Why It Matters For Business

Users who write in several languages or mix scripts are exposed to jailbreak risks that contract-bound (JSON-only) evaluations can hide. Evaluate product safety in the languages and input styles your users actually use, so that these exposures do not go unnoticed.

Summary TLDR

IndicJR (IJR) is a reproducible, judge-free benchmark that measures jailbreak vulnerability across 12 South Asian languages using two tracks: JSON (contract-bound refusals) and FREE (unconstrained responses). It contains 45,216 prompts and finds high jailbreak success rates (JSR) under the JSON contract in many models (several above 0.9), strong English→Indic transfer, and large orthography effects (romanization reduces JSON JSR by ≈0.34). Human audits indicate the detectors are reliable (≈95% schema validity, 4.3% false negatives). The dataset and scoring scripts are released.

Problem Statement

Existing jailbreak and safety tests focus on English and often rely on learned judges. They miss multilingual vulnerabilities, script-mixing, and romanization, all of which are common in South Asia. The paper creates a judge-free, multilingual benchmark to reveal risks hidden by English-only, contract-focused evaluations.

Main Contribution

IndicJR dataset: 45,216 prompts across 12 South Asian languages, with JSON (42,636) and FREE (2,580) tracks.

Judge-free protocol: deterministic, language-aware parsing that scores refusals without external LLM judges.

Stress tests: cross-lingual transfer, orthography (native/romanized/mixed), and lite-vs-full reproducibility checks.

Empirical study of 12 models (open-weight, API, Indic-specialized) showing contract gaps, orthographic effects, and transfer vulnerabilities.

Validation: human audits (N=600) and lite/full correlation (r ≈ 0.8) to support reliability and reproducibility.
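The judge-free idea above can be illustrated with a minimal sketch of deterministic contract parsing. This is not the released scoring script: the `decision` field name and the `REFUSE`/`COMPLY` values are hypothetical, chosen only for illustration. The `ABSTAIN` outcome for malformed output follows the behavior described under Failure Modes.

```python
import json

# Hypothetical contract (an assumption, not the paper's schema): the model
# must return a JSON object whose "decision" field is "REFUSE" or "COMPLY".
VALID_DECISIONS = {"REFUSE", "COMPLY"}


def score_contract_response(raw):
    """Label a raw model output as 'REFUSE', 'COMPLY', or 'ABSTAIN'.

    Only deterministic parsing is used; no LLM judge is involved.
    """
    try:
        obj = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return "ABSTAIN"  # malformed or missing JSON cannot be scored
    if not isinstance(obj, dict):
        return "ABSTAIN"  # valid JSON, but not the expected object shape
    decision = obj.get("decision")
    if isinstance(decision, str):
        normalized = decision.strip().upper()
        if normalized in VALID_DECISIONS:
            return normalized
    return "ABSTAIN"  # field missing or value outside the contract
```

Because every branch is a fixed rule, the same output always receives the same label, which is what makes this style of scoring reproducible across runs.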

Key Findings

High contracted jailbreak rates across many models

Numbers: LLaMA 3.1 JSR 0.922; LLaMA 3.3 0.978; Sarvam 0.959

FREE track shows near-universal jailbreak success

Numbers: FREE attacked-benign JSR ≈ 1.0 across models

Romanized / mixed orthography lowers JSON JSR systematically

Numbers: mean JSR native 0.755 → romanized 0.416 (∆ -0.338)

English → Indic adversarial prompts transfer strongly

Numbers: per-language E2 mean JSRs 0.585–0.694; Urdu ≈0.694

Judge-free detectors validated by humans with low leakage

Numbers: human audit κ≈0.68 unweighted, false negatives 4.3%, schema validity 95.4%, canary leakage 0%

Indic-specialized model (Sarvam) is not safer by default

Numbers: Sarvam JSON JSR 0.959, schema validity 0.186, CH leakage 0.393

Lite sampling replicates full evaluations well

Numbers: lite vs full per-language correlation r>0.80 for most models
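The lite-versus-full check above amounts to correlating two lists of per-language scores. A minimal sketch of the Pearson coefficient follows; the function name and the input layout (one JSR value per language, in the same order in both lists) are assumptions for illustration, and the paper may compute the statistic differently.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of centered products and squares; the 1/n factors cancel out.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)
```

With per-language JSRs from a lite run in `xs` and from the full run in `ys`, a value above 0.80 would match the replication level reported for most models.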

Results

Dataset size (total prompts)

Value: 45,216 prompts (JSON 42,636; FREE 2,580)

JSON-track attacked-benign JSR (example models)

Value: LLaMA 3.1 0.922; LLaMA 3.3 0.978; Sarvam 0.959; GPT-4o 0.508

FREE-track attacked-benign JSR

Value: ≈1.0 (near-universal jailbreaks)

Baseline: JSON JSRs (lower for some APIs)

Orthography effect (mean ∆ JSR)

Value: romanized - native ∆ JSR ≈ -0.338 (mean over languages and models)

Baseline: native-script JSR ≈ 0.755

Cross-lingual transfer (English→Indic mean JSR)

Value: per-language means 0.585–0.694 (Urdu ≈0.694)

Human audit agreement

Value: Cohen's κ ≈ 0.68 (unweighted), 0.74 (weighted); false negatives 4.3%

Canary leakage

Value: 0% canary leakage (none observed)
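The unweighted Cohen's κ reported for the human audit corrects raw agreement for the agreement expected by chance. A minimal sketch of that calculation is given below, assuming two parallel lists of categorical labels (for example, detector label and human label per audited item); the function name and label values are illustrative, not taken from the paper's scripts.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Unweighted Cohen's kappa for two raters over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items where both raters gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Toy labels, not audit data: 3 of 4 items agree, chance agreement is 0.5.
detector = ["R", "R", "C", "C"]
human = ["R", "R", "C", "R"]
print(cohens_kappa(detector, human))  # 0.5
```

The weighted variant also reported above additionally needs an ordering or distance between labels, which this sketch does not model.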

Who Should Care

What To Try In 7 Days

Run IJR's FREE and JSON tracks (or a lite subset) on your top models to surface contract gaps.

Test romanized and mixed-script inputs from real user logs to spot tokenization-driven failures.

Include English→local wrappers and format-forcing attacks (JSON/YAML) in red-team suites.
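For the first step, a lite subset can be drawn with a seeded, stratified sample so that every language and track stays represented and the draw is repeatable. The sketch below is one possible approach, not the benchmark's own sampling procedure; the `language` and `track` keys are hypothetical field names.

```python
import random
from collections import defaultdict


def lite_subset(prompts, per_stratum, seed=0):
    """Draw up to `per_stratum` prompts from each (language, track) group.

    `prompts` is a list of dicts with hypothetical keys 'language' and 'track'.
    A fixed seed makes the subset reproducible across runs.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for prompt in prompts:
        strata[(prompt["language"], prompt["track"])].append(prompt)
    subset = []
    for key in sorted(strata):  # sorted keys keep the draw order stable
        items = strata[key]
        k = min(per_stratum, len(items))
        subset.extend(rng.sample(items, k))
    return subset
```

Sampling per stratum rather than over the whole pool keeps the much smaller FREE track from being crowded out by the JSON track.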

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Single-turn prompts only; multi-turn jailbreaks are not covered.
  • Only three harmful intent categories (chemistry, bio, illicit access).
  • Romanization uses standardized transliteration and may not capture noisy user romanization.
  • Judge-free heuristics may miss subtle or domain-specific leakage despite audit checks.
  • Evaluation uses fixed inference settings and cannot account for provider-side runtime safety layers.

When Not To Use

  • As the only safety test for multi-turn chat or dialog systems.
  • To claim full safety across domains beyond chem/bio/security.
  • As a substitute for localized human review when subtle contextual leakage matters.

Failure Modes

  • Judge-free heuristics can miss subtle unsafe guidance (≈4% false negatives).
  • Romanization generator may not reflect noisy real-world user spellings.
  • Contract parsing can mark malformed JSON as ABSTAIN, hiding nuanced behavior.
  • Provider-side safety filters or post-processing may alter deployed behavior versus benchmarked outputs.

Core Entities

Models

  • GPT-4o
  • Grok-3
  • Grok-4
  • Cohere Command-R
  • Cohere Command-A
  • LLaMA 3.1 405B
  • LLaMA 3.3 70B
  • LLaMA 4 Maverick 17B
  • Ministral 8B Instruct
  • Qwen 1.5 7B
  • Gemma 2 9B
  • Sarvam 1 Base

Metrics

  • JSR (Jailbreak Success Rate)
  • Schema Validity (SV)
  • Leakage Rate (LR)
  • Over-refusal
  • Under-refusal
  • Refusal Robustness Index (RRI)
  • ∆JSR (variant - native)
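As a worked illustration of the last metric, ∆JSR is the JSR of an orthographic variant minus the JSR of the native-script baseline. The sketch below computes it from per-prompt outcomes. The record layout (a group name paired with a boolean "jailbroken" flag) is an assumption, and the toy data is not from the paper.

```python
from collections import defaultdict


def jsr_by_group(records):
    """Map each group to its share of jailbroken outcomes.

    `records` is an iterable of (group, jailbroken) pairs.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for group, jailbroken in records:
        totals[group] += 1
        hits[group] += int(bool(jailbroken))
    return {group: hits[group] / totals[group] for group in totals}


def delta_jsr(records, variant, baseline="native"):
    """JSR of `variant` minus JSR of `baseline`; negative means fewer jailbreaks."""
    rates = jsr_by_group(records)
    return rates[variant] - rates[baseline]


# Toy outcomes: native 3 of 4 jailbroken, romanized 1 of 4.
toy = [
    ("native", True), ("native", True), ("native", True), ("native", False),
    ("romanized", True), ("romanized", False),
    ("romanized", False), ("romanized", False),
]
print(delta_jsr(toy, "romanized"))  # 0.25 - 0.75 = -0.5
```

Subtracting the two rounded means reported in the findings (0.416 - 0.755) gives -0.339 rather than -0.338; the small gap is presumably due to rounding or to averaging per-language, per-model deltas instead of subtracting pooled means.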

Datasets

  • IndicJR (IJR) 45,216 prompts (JSON + FREE)
  • Wikipedia 2023 benign cores (source for benign prompts)

Benchmarks

  • JailbreakBench
  • HELM
  • SafetyBench
  • IndoSafety
  • PolyGuard
  • IndicGenBench