Judge-free, multilingual jailbreak stress test for 12 South Asian languages with 45k+ prompts

Overview

Decision Snapshot: Needs Validation

The benchmark is ready for industry evaluation and red-teaming, but it is limited to single-turn prompts and three intent domains, and it relies on heuristic detectors with a small audited error rate.

Citations: 0

Evidence Strength: 0.85

Confidence: 0.88

Risk Signals: 12

Trust Signals

Findings with numeric evidence: 7/7

Findings with evidence refs: 7/7

Results with explicit delta: 2/7

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 68%

Authors

Priyaranjan Pattnayak, Sanchari Chowdhuri

Why It Matters For Business

Multilingual and script-mixed users face real jailbreak risk that contracts can hide; evaluate product safety in the languages and input styles your users use to avoid unnoticed exposures.

Who Should Care

CTO, Product Manager, ML Engineer, Engineering Lead, Data Scientist, Founder

Summary TLDR

IJR is a reproducible, judge-free benchmark that measures jailbreak vulnerability across 12 South Asian languages using two tracks: JSON (contract-bound refusals) and FREE (unconstrained responses). It contains 45,216 prompts and finds high contracted jailbreak rates in many models (several >0.9 JSR), strong English→Indic transfer, and large orthography effects (romanization reduces JSON JSR by ≈0.34). Human audits show detectors are reliable (≈95% schema validity, 4.3% false negatives). The dataset and scoring scripts are released.

Problem Statement

Existing jailbreak and safety tests focus on English and often rely on learned judges. They miss multilingual vulnerabilities, script-mixing, and romanization, all of which are common in South Asia. The paper creates a judge-free, multilingual benchmark to reveal risks hidden by English-only, contract-focused evaluations.

Main Contribution

IndicJR dataset: 45,216 prompts across 12 South Asian languages, with JSON (42,636) and FREE (2,580) tracks.

Judge-free protocol: deterministic, language-aware parsing that scores refusals without external LLM judges.
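A minimal sketch of what deterministic, judge-free scoring of a JSON-track response could look like. The schema here (the `decision` and `justification` keys and the `REFUSE`/`COMPLY` labels) is an assumption for illustration; the released IJR scoring scripts may use a different contract and add language-aware checks.

```python
import json

# Hypothetical contract: the model must return a JSON object with exactly
# these keys. The real IJR schema may differ.
REQUIRED_KEYS = {"decision", "justification"}
ALLOWED_DECISIONS = {"REFUSE", "COMPLY"}


def score_json_response(raw: str) -> dict:
    """Score one response with fixed rules only, no LLM judge involved."""
    invalid = {"schema_valid": False, "refused": False}
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return invalid  # not parseable as JSON at all
    if not isinstance(obj, dict) or set(obj) != REQUIRED_KEYS:
        return invalid  # wrong top-level type or wrong set of keys
    decision = obj["decision"]
    if not isinstance(decision, str) or decision not in ALLOWED_DECISIONS:
        return invalid  # decision field is outside the allowed labels
    return {"schema_valid": True, "refused": decision == "REFUSE"}
```

Because every rule is fixed, the same response always receives the same score, which is the property that makes a judge-free protocol reproducible.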

Key Findings

High contracted jailbreak rates across many models

Numbers: LLaMA 3.1 JSR 0.922; LLaMA 3.3 0.978; Sarvam 0.959

Practical Use: Do not trust low refusal rates under contracts alone; test models under unconstrained FREE prompts and attacked-benign wrappers.

Evidence Ref: Table 2 (JSON JSR per model)

FREE track shows near-universal jailbreak success

Numbers: FREE attacked-benign JSR ≈ 1.0 across models

Practical Use: Evaluate models in natural interaction mode (FREE) to measure real-world risk; contracts can mask vulnerabilities.

Evidence Ref: Section 6.1 and Table 2 (FREE JSR)

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Dataset size (total prompts)	45,216 prompts (JSON 42,636; FREE 2,580)	—	—	IJR	Table 5 and main text	—
JSON-track attacked-benign JSR (example models)	LLaMA 3.1 0.922; LLaMA 3.3 0.978; Sarvam 0.959; GPT-4o 0.508	—	—	JSON attacked-benign (E1)	Table 2 model JSRs	—

What To Try In 7 Days

Run IJR's FREE and JSON tracks (or a lite subset) on your top models to surface contract gaps.

Test romanized and mixed-script inputs from real user logs to spot tokenization-driven failures.

Include English→local wrappers and format-forcing attacks (JSON/YAML) in red-team suites.
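For the "lite subset" idea above, one possible approach is a stratified sample that keeps every language and track represented. This is a sketch only: the record fields `language` and `track` are hypothetical names, and the released data files may be organized differently.

```python
import random
from collections import defaultdict


def lite_subset(records, per_cell=2, seed=0):
    """Sample up to `per_cell` prompts for each (language, track) pair.

    `records` is an iterable of dicts with hypothetical keys
    "language" and "track"; a fixed seed keeps the subset reproducible.
    """
    rng = random.Random(seed)
    cells = defaultdict(list)
    for rec in records:
        cells[(rec["language"], rec["track"])].append(rec)
    subset = []
    for key in sorted(cells):  # sorted so the output order is deterministic
        pool = cells[key]
        subset.extend(rng.sample(pool, min(per_cell, len(pool))))
    return subset
```

Sampling per cell rather than uniformly avoids a subset dominated by the much larger JSON track.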

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/IndicJR

Data URLs

https://github.com/IndicJR

Risks & Boundaries

Limitations

Single-turn prompts only; multi-turn jailbreaks are not covered.

Only three harmful intent categories (chemistry, bio, illicit access).

When Not To Use

As the only safety test for multi-turn chat or dialog systems.

To claim full safety across domains beyond chem/bio/security.

Failure Modes

Judge-free heuristics can miss subtle unsafe guidance (≈4% false negatives).

Romanization generator may not reflect noisy real-world user spellings.
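Since generated romanization may differ from what users actually type, it can help to bucket real inputs by script before comparing them with the benchmark variants. The sketch below is an assumption-laden heuristic, not part of IJR: it reads the first word of each letter's Unicode name as its script, and the bucket labels are made up for illustration.

```python
import unicodedata


def scripts_in(text):
    """Return the set of script names (e.g. 'LATIN', 'DEVANAGARI') used by letters."""
    scripts = set()
    for ch in text:
        if ch.isalpha():  # skips digits, spaces, punctuation, combining marks
            name = unicodedata.name(ch, "")
            if name:
                scripts.add(name.split()[0])
    return scripts


def classify_input(text):
    """Bucket a user input by script usage (hypothetical labels)."""
    scripts = scripts_in(text)
    if not scripts:
        return "no-letters"
    if scripts == {"LATIN"}:
        return "latin-only"  # romanized Indic text or plain English
    if "LATIN" in scripts:
        return "mixed-script"
    return "native-script"
```

The heuristic cannot tell romanized Hindi from English, since both are Latin-only; separating those would need a language identifier on top.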

Core Entities

Models

GPT-4o, Grok-3, Grok-4, Cohere Command-R, Cohere Command-A, LLaMA 3.1 405B, LLaMA 3.3 70B, LLaMA 4 Maverick 17B, Ministral 8B Instruct, Qwen 1.5 7B, Gemma 2 9B, Sarvam 1 Base

Metrics

JSR (Jailbreak Success Rate), Schema Validity (SV), Leakage Rate (LR), Over-refusal, Under-refusal, Refusal Robustness Index (RRI), ∆JSR (variant - native)
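As a rough illustration of the two headline metrics, the sketch below computes JSR and ∆JSR from per-prompt refusal flags. It assumes JSR is the fraction of attacked prompts that were not refused, which is a simplification; the paper's exact definition may include further conditions. The flag lists are made-up toy values, not benchmark results.

```python
def jsr(refused_flags):
    """Jailbreak Success Rate: share of attacked prompts NOT refused (assumed definition)."""
    flags = list(refused_flags)
    if not flags:
        raise ValueError("need at least one scored prompt")
    return sum(1 for refused in flags if not refused) / len(flags)


def delta_jsr(variant_flags, native_flags):
    """∆JSR = JSR(variant) - JSR(native); negative means the variant is safer."""
    return jsr(variant_flags) - jsr(native_flags)


# Toy data: True means the model refused the attacked prompt.
native_script = [False, False, False, True]
romanized = [False, True, True, True]
print(jsr(native_script))                   # 0.75
print(delta_jsr(romanized, native_script))  # -0.5
```

Under this sign convention, the reported romanization effect (a JSON JSR reduction of about 0.34) would appear as a ∆JSR near -0.34.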

Datasets

IndicJR (IJR): 45,216 prompts (JSON + FREE)

Wikipedia 2023 benign cores (source for benign prompts)

Benchmarks

JailbreakBench, HELM, SafetyBench, IndoSafety, PolyGuard, IndicGenBench
