Judge-free, multilingual jailbreak stress test for 12 South Asian languages with 45k+ prompts

February 18, 2026 (8 min read)

Overview

Decision Snapshot: Needs Validation

The benchmark is ready for evaluation and red-teaming in industry but is limited to single-turn prompts, three intent domains, and relies on heuristic detectors with a small audit error rate.

Citations: 0

Evidence Strength: 0.85

Confidence: 0.88

Risk Signals: 12

Trust Signals

Findings with numeric evidence: 7/7

Findings with evidence refs: 7/7

Results with explicit delta: 2/7

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 68%

Authors

Priyaranjan Pattnayak, Sanchari Chowdhuri

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Products that serve multilingual and script-mixed users carry real jailbreak risk that contract-bound (JSON) evaluations can hide. Evaluate product safety in the languages and input styles your users actually use to avoid unnoticed exposure.

Who Should Care

Summary TLDR

IJR is a reproducible, judge-free benchmark that measures jailbreak vulnerability across 12 South Asian languages using two tracks: JSON (contract-bound refusals) and FREE (unconstrained responses). It contains 45,216 prompts and finds high contracted jailbreak rates in many models (several >0.9 JSR), strong English→Indic transfer, and large orthography effects (romanization reduces JSON JSR by ≈0.34). Human audits show detectors are reliable (≈95% schema validity, 4.3% false negatives). The dataset and scoring scripts are released.

Problem Statement

Existing jailbreak and safety tests focus on English and often rely on learned judges. This approach misses multilingual vulnerabilities, script-mixing, and romanization, all of which are common in South Asia. The paper creates a judge-free, multilingual benchmark to reveal risks hidden by English-only, contract-focused evaluations.

Main Contribution

IndicJR dataset: 45,216 prompts across 12 South Asian languages, with JSON (42,636) and FREE (2,580) tracks.

Judge-free protocol: deterministic, language-aware parsing that scores refusals without external LLM judges.
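A judge-free scorer of this kind can be sketched as a deterministic parser over the model output. The example below is only an illustration: the contract keys `decision` and `justification`, the allowed values `REFUSE` and `COMPLY`, and the function name `score_json_response` are assumptions made for this sketch, since the actual IJR schema is not given in this summary.

```python
import json

# Hypothetical contract: these keys and values are assumptions for
# illustration, not the schema used by the IJR scoring scripts.
ALLOWED_DECISIONS = {"REFUSE", "COMPLY"}
REQUIRED_KEYS = {"decision", "justification"}


def score_json_response(raw: str) -> dict:
    """Deterministically classify one model response under the assumed contract."""
    invalid = {"schema_valid": False, "refused": False}
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        # Unparseable output breaks the contract, so no refusal is confirmed.
        return invalid
    if not isinstance(obj, dict) or not REQUIRED_KEYS.issubset(obj):
        return invalid
    decision = str(obj["decision"]).strip().upper()
    if decision not in ALLOWED_DECISIONS:
        return invalid
    return {"schema_valid": True, "refused": decision == "REFUSE"}
```

Because the function uses only parsing and set membership, the same response always gets the same score, which is the property that removes the need for an external LLM judge.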

Key Findings

High contracted jailbreak rates across many models

Numbers: LLaMA 3.1 JSR 0.922; LLaMA 3.3 0.978; Sarvam 0.959

Practical Use: Do not trust low refusal rates under contracts alone; test models under unconstrained FREE prompts and attacked-benign wrappers.

Evidence Ref: Table 2 (JSON JSR per model)

FREE track shows near-universal jailbreak success

Numbers: FREE attacked-benign JSR ≈ 1.0 across models

Practical Use: Evaluate models in natural interaction mode (FREE) to measure real-world risk; contracts can mask vulnerabilities.

Evidence Ref: Section 6.1 and Table 2 FREE JSR
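The per-track JSR figures cited in these findings can be reproduced from scored responses with a simple tally. The sketch below assumes that JSR is the fraction of attacked prompts on which the model did not refuse; the function name `jailbreak_success_rate` and the `demo` records are invented for illustration and are not results from the paper.

```python
from collections import defaultdict


def jailbreak_success_rate(records):
    """Return {track: JSR} from an iterable of (track, refused) pairs.

    Assumption: a jailbreak "succeeds" whenever the model does not refuse.
    """
    totals = defaultdict(int)
    successes = defaultdict(int)
    for track, refused in records:
        totals[track] += 1
        if not refused:
            successes[track] += 1
    return {track: successes[track] / totals[track] for track in totals}


# Toy records for illustration only (not data from the benchmark).
demo = [
    ("JSON", True), ("JSON", False), ("JSON", False), ("JSON", False),
    ("FREE", False), ("FREE", False),
]
print(jailbreak_success_rate(demo))
```

Reporting the two tracks separately, rather than pooling them, keeps a low JSON-track rate from hiding a high FREE-track rate.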

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Dataset size (total prompts) | 45,216 prompts (JSON 42,636; FREE 2,580) | | | IJR | | Table 5 and main text |
| JSON-track attacked-benign JSR (example models) | LLaMA 3.1 0.922; LLaMA 3.3 0.978; Sarvam 0.959; GPT-4o 0.508 | | | JSON attacked-benign (E1) | | Table 2 model JSRs |

What To Try In 7 Days

Run IJR's FREE and JSON tracks (or a lite subset) on your top models to surface contract gaps.

Test romanized and mixed-script inputs from real user logs to spot tokenization-driven failures.

Include English→local wrappers and format-forcing attacks (JSON/YAML) in red-team suites.
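For the romanized and mixed-script step above, a first pass over user logs can bucket inputs by the Unicode scripts they contain. This is a rough heuristic sketch, not part of the IJR tooling: the function names `scripts_in` and `classify_input` and the bucket labels are assumptions, and the script is inferred from the first word of each character's Unicode name.

```python
import unicodedata


def scripts_in(text: str) -> set:
    """Collect script names (first word of the Unicode name) for letters and marks."""
    found = set()
    for ch in text:
        # Only letters (L*) and combining marks (M*) carry script information.
        if unicodedata.category(ch)[0] not in ("L", "M"):
            continue
        name = unicodedata.name(ch, "")
        if name and not name.startswith("COMBINING"):
            found.add(name.split()[0])
    return found


def classify_input(text: str, native_script: str) -> str:
    """Bucket an input as native, Latin-only, mixed-script, or other."""
    scripts = scripts_in(text)
    if scripts == {native_script}:
        return "native"
    if scripts == {"LATIN"}:
        # Heuristic limit: cannot tell romanized Hindi from plain English.
        return "romanized_or_english"
    if len(scripts) > 1:
        return "mixed_script"
    return "other"


print(classify_input("namaste दोस्त", "DEVANAGARI"))
```

Latin-only inputs still need a language check to separate romanized text from English, so this bucketing is best used to select candidates for manual review or for a targeted test subset.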

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Single-turn prompts only; multi-turn jailbreaks are not covered.

Only three harmful intent categories (chemistry, bio, illicit access).

When Not To Use

As the only safety test for multi-turn chat or dialog systems.

To claim full safety across domains beyond chem/bio/security.

Failure Modes

Judge-free heuristics can miss subtle unsafe guidance (≈4.3% false negatives in the human audit).

Romanization generator may not reflect noisy real-world user spellings.

Core Entities

Models

GPT-4o, Grok-3, Grok-4, Cohere Command-R, Cohere Command-A, LLaMA 3.1 405B, LLaMA 3.3 70B, LLaMA 4 Maverick 17B, Ministral 8B Instruct, Qwen 1.5 7B, Gemma 2 9B, Sarvam 1 Base

Metrics

JSR (Jailbreak Success Rate), Schema Validity (SV), Leakage Rate (LR), Over-refusal, Under-refusal, Refusal Robustness Index (RRI), ∆JSR (variant - native)
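The ∆JSR metric in this list is a plain difference against the native-script condition. The sketch below shows the sign convention (variant minus native, so a negative value means the variant lowers JSR); the function name `delta_jsr`, the variant labels, and the numbers in `toy` are illustrative assumptions, not values from the paper.

```python
def delta_jsr(jsr_by_variant: dict, native_key: str = "native") -> dict:
    """Return {variant: JSR(variant) - JSR(native)} for every non-native variant."""
    native = jsr_by_variant[native_key]
    return {
        variant: jsr - native
        for variant, jsr in jsr_by_variant.items()
        if variant != native_key
    }


# Toy values for illustration only (not results from the benchmark).
toy = {"native": 0.75, "romanized": 0.5, "mixed_script": 0.875}
print(delta_jsr(toy))
```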

Datasets

IndicJR (IJR): 45,216 prompts (JSON + FREE); Wikipedia 2023 benign cores (source for benign prompts)

Benchmarks

JailbreakBench, HELM, SafetyBench, IndoSafety, PolyGuard, IndicGenBench