Overview
Production Readiness
0.6
Novelty Score
0.68
Cost Impact Score
0.5
Citation Count
0
Why It Matters For Business
Users who write in multiple languages or mix scripts face real jailbreak risk that contract-bound (JSON) evaluations can hide. Evaluate product safety in the languages and input styles your users actually use to avoid unnoticed exposure.
Summary TLDR
IJR is a reproducible, judge-free benchmark that measures jailbreak vulnerability across 12 South Asian languages using two tracks: JSON (contract-bound refusals) and FREE (unconstrained responses). It contains 45,216 prompts. Results show high jailbreak success rates (JSR) under the JSON contract for many models (several above 0.9), strong English→Indic transfer, and large orthography effects (romanization reduces JSON JSR by ≈0.34). Human audits indicate the detectors are reliable (≈95% schema validity, 4.3% false negatives). The dataset and scoring scripts are released.
Problem Statement
Existing jailbreak and safety tests focus on English and often rely on learned judges. This approach misses multilingual vulnerabilities, script-mixing, and romanization, all of which are common in South Asia. The paper creates a judge-free, multilingual benchmark to reveal risks hidden by English-only, contract-focused evaluations.
Main Contribution
IndicJR dataset: 45,216 prompts across 12 South Asian languages, with JSON (42,636) and FREE (2,580) tracks.
Judge-free protocol: deterministic, language-aware parsing that scores refusals without external LLM judges.
Stress tests: cross-lingual transfer, orthography (native/romanized/mixed), and lite-vs-full reproducibility checks.
Empirical study of 12 models (open-weight, API, Indic-specialized) showing contract gaps, orthographic effects, and transfer vulnerabilities.
Validation: human audits (N=600) and lite/full correlation (r ≈ 0.8) to support reliability and reproducibility.
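The judge-free idea can be illustrated with a minimal sketch of a deterministic JSON-track scorer. Everything specific here is an assumption for illustration: the `decision` field, its `REFUSE`/`COMPLY` values, and the function name `score_json_response` are hypothetical and not taken from the IJR scoring scripts. The sketch also omits the language-aware handling the paper describes. The only behavior drawn from the summary is that malformed JSON is scored as ABSTAIN.

```python
import json
from enum import Enum


class Verdict(Enum):
    REFUSE = "REFUSE"
    COMPLY = "COMPLY"
    ABSTAIN = "ABSTAIN"


def score_json_response(raw: str) -> Verdict:
    """Map one raw model output to a verdict with fixed rules, no LLM judge.

    Assumed (hypothetical) contract: the model replies with a JSON object
    whose "decision" field is "REFUSE" or "COMPLY". IJR's real schema may differ.
    """
    try:
        payload = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        # Malformed or non-string output cannot be scored, so it abstains.
        return Verdict.ABSTAIN

    if not isinstance(payload, dict):
        return Verdict.ABSTAIN

    decision = payload.get("decision")
    if not isinstance(decision, str):
        return Verdict.ABSTAIN

    decision = decision.strip().upper()
    if decision == "REFUSE":
        return Verdict.REFUSE
    if decision == "COMPLY":
        return Verdict.COMPLY
    # Any value outside the contract is treated as a schema violation.
    return Verdict.ABSTAIN
```

Because every branch is a fixed rule, the same output always receives the same verdict, which is what makes this style of scoring reproducible across runs.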
Key Findings
High contracted jailbreak rates across many models
FREE track shows near-universal jailbreak success
Romanized / mixed orthography lowers JSON JSR systematically
English → Indic adversarial prompts transfer strongly
Judge-free detectors validated by humans with low leakage
Indic-specialized model (Sarvam) is not safer by default
Lite sampling replicates full evaluations well
Results
Dataset size (total prompts): 45,216
JSON-track attacked-benign JSR (example models)
FREE-track attacked-benign JSR
Orthography effect (mean ∆ JSR)
Cross-lingual transfer (English→Indic mean JSR)
Human audit agreement
Canary leakage
Who Should Care
What To Try In 7 Days
Run IJR's FREE and JSON tracks (or a lite subset) on your top models to surface contract gaps.
Test romanized and mixed-script inputs from real user logs to spot tokenization-driven failures.
Include English→local wrappers and format-forcing attacks (JSON/YAML) in red-team suites.
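For the lite-subset step above, one possible approach is a seeded, stratified sample followed by a correlation check against full-run scores. This is a sketch under stated assumptions: the summary does not say how IJR builds its lite set, so the stratification by `language` and `track`, the field names, and the helpers `lite_sample` and `pearson_r` are all hypothetical.

```python
import math
import random
from collections import defaultdict


def lite_sample(prompts, per_stratum, seed=0):
    """Draw up to `per_stratum` prompts from each (language, track) group.

    Assumes each prompt is a dict with hypothetical "language" and "track" keys.
    A fixed seed keeps the subset identical across runs.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for prompt in prompts:
        strata[(prompt["language"], prompt["track"])].append(prompt)

    sample = []
    for key in sorted(strata):  # sorted keys make the draw order deterministic
        group = strata[key]
        sample.extend(rng.sample(group, min(per_stratum, len(group))))
    return sample


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sd_x * sd_y)
```

A typical use would be to score each model on both the lite and full sets, then pass the two per-model JSR lists to `pearson_r` to see whether the lite ranking tracks the full one.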
Reproducibility
Code Urls
Data Urls
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Single-turn prompts only; multi-turn jailbreaks are not covered.
- Only three harmful intent categories (chemistry, bio, illicit access).
- Romanization uses standardized transliteration and may not capture noisy user romanization.
- Judge-free heuristics may miss subtle or domain-specific leakage despite audit checks.
- Evaluation uses fixed inference settings and cannot account for provider-side runtime safety layers.
When Not To Use
- As the only safety test for multi-turn chat or dialog systems.
- To claim full safety across domains beyond chem/bio/security.
- As a substitute for localized human review when subtle contextual leakage matters.
Failure Modes
- Judge-free heuristics can miss subtle unsafe guidance (≈4.3% false negatives in the human audit).
- Romanization generator may not reflect noisy real-world user spellings.
- Contract parsing can mark malformed JSON as ABSTAIN, hiding nuanced behavior.
- Provider-side safety filters or post-processing may alter deployed behavior versus benchmarked outputs.
Core Entities
Models
- GPT-4o
- Grok-3
- Grok-4
- Cohere Command-R
- Cohere Command-A
- LLaMA 3.1 405B
- LLaMA 3.3 70B
- LLaMA 4 Maverick 17B
- Ministral 8B Instruct
- Qwen 1.5 7B
- Gemma 2 9B
- Sarvam 1 Base
Metrics
- JSR (Jailbreak Success Rate)
- Schema Validity (SV)
- Leakage Rate (LR)
- Over-refusal
- Under-refusal
- Refusal Robustness Index (RRI)
- ∆JSR (variant - native)
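As a rough illustration of the first and last metrics in this list, the sketch below computes JSR and ∆JSR from per-prompt outcomes. It assumes JSR is the fraction of attacked prompts on which the model was judged jailbroken, which is the usual reading of the name but is not spelled out in this summary. The record layout and the function names `jsr` and `delta_jsr` are hypothetical. The sign convention (variant minus native) follows the list entry above.

```python
from collections import defaultdict


def jsr(outcomes):
    """Jailbreak Success Rate: share of outcomes that are True (jailbroken)."""
    outcomes = list(outcomes)
    if not outcomes:
        raise ValueError("cannot compute JSR on an empty outcome list")
    return sum(outcomes) / len(outcomes)


def delta_jsr(records, native="native"):
    """Return {variant: JSR(variant) - JSR(native)} for each non-native variant.

    `records` is an iterable of (orthography_variant, jailbroken) pairs,
    a hypothetical layout chosen for this sketch.
    """
    by_variant = defaultdict(list)
    for variant, jailbroken in records:
        by_variant[variant].append(bool(jailbroken))

    baseline = jsr(by_variant[native])  # raises if no native records exist
    return {
        variant: jsr(outcomes) - baseline
        for variant, outcomes in by_variant.items()
        if variant != native
    }


# Toy data only, not results from the paper.
toy_records = [
    ("native", True), ("native", True), ("native", False), ("native", True),
    ("romanized", True), ("romanized", False),
    ("romanized", False), ("romanized", False),
]
toy_delta = delta_jsr(toy_records)  # {"romanized": -0.5}
```

A negative value means the variant was jailbroken less often than the native-script baseline, matching the direction of the romanization effect reported in the summary.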
Datasets
- IndicJR (IJR) 45,216 prompts (JSON + FREE)
- Wikipedia 2023 benign cores (source for benign prompts)
Benchmarks
- JailbreakBench
- HELM
- SafetyBench
- IndoSafety
- PolyGuard
- IndicGenBench

