Overview
Production Readiness: 0.3
Novelty Score: 0.7
Cost Impact Score: 0.4
Citation Count: 0
Why It Matters For Business
If you build Bangla NLP products that must reason with cultural metaphors or resolve wordplay, off-the-shelf LLMs are not yet reliable. Superficial word overlap can mask incorrect reasoning. Use targeted benchmarks like BANGLARIDDLEEVAL to validate models before deployment.
Summary TLDR
The authors release BANGLARIDDLEEVAL, a benchmark of 1,244 riddles instantiated across 4 tasks (4,976 artifacts in total) to test LLM reasoning on Bangla riddles. Strong models score high on surface similarity (BERTScore) but often give wrong answers: the best judged generative correctness is ~29%, and the best MCQ accuracy is ~56% against an 83% human baseline. Ambiguity resolution and explanation quality show mixed results. The dataset, code, and scripts are on GitHub.
Problem Statement
Current LLMs are rarely tested on figurative, culturally grounded text in low-resource languages. Bangla riddles use metaphors, wordplay, and local cultural cues that standard benchmarks miss. The paper builds a focused, multi-task benchmark to measure how well models actually reason about Bangla riddles rather than match surface patterns.
Main Contribution
Created BANGLARIDDLEEVAL: 1,244 unique Bangla riddles, instantiated across four tasks (generative QA, reasoning/explanation, MCQ, semantic ambiguity), totalling 4,976 artifacts.
Built an LLM-driven pipeline to generate step-by-step explanations (Chain-of-Thought), semantically plausible distractors, and fine-grained ambiguity options with quality checks.
Evaluated a diverse set of open-source and closed-source LLMs (e.g., Gemini-2.5-Flash, GPT-OSS-20B, Qwen3-14B) under zero-shot, few-shot, and CoT prompts, and compared to human baselines.
Released data, code, and evaluation scripts publicly (GitHub) to enable follow-up work in low-resource, figurative reasoning.
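The three prompting modes named above (zero-shot, few-shot, CoT) can be sketched for the MCQ task roughly as follows. This is a minimal illustration, not the authors' templates: the instruction wording, the `build_prompt` and `format_question` helpers, and the `(riddle, options, answer_index)` example format are all assumptions made for this sketch.

```python
from string import ascii_uppercase
from typing import Sequence, Tuple

# Hypothetical few-shot example: (riddle text, options, index of the correct option).
Example = Tuple[str, Sequence[str], int]

MODES = ("zero_shot", "few_shot", "cot")


def format_question(riddle: str, options: Sequence[str]) -> str:
    """Render a riddle and its options as a lettered multiple-choice block."""
    lines = [f"Riddle: {riddle}"]
    lines += [f"{ascii_uppercase[i]}. {opt}" for i, opt in enumerate(options)]
    return "\n".join(lines)


def build_prompt(
    riddle: str,
    options: Sequence[str],
    mode: str = "zero_shot",
    examples: Sequence[Example] = (),
) -> str:
    """Build an MCQ prompt; the wording is illustrative, not taken from the paper."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    parts = ["Choose the option that answers the Bangla riddle."]
    if mode == "few_shot":
        # Solved examples come before the target riddle.
        for ex_riddle, ex_options, ex_answer in examples:
            parts.append(format_question(ex_riddle, ex_options))
            parts.append(f"Answer: {ascii_uppercase[ex_answer]}")
    parts.append(format_question(riddle, options))
    if mode == "cot":
        # Ask for intermediate reasoning before the final letter.
        parts.append("Think step by step, then state the final option letter.")
    parts.append("Answer:")
    return "\n\n".join(parts)
```

Keeping the question rendering in one helper means the three modes differ only in the solved examples and the reasoning instruction, which makes per-mode comparisons easier to attribute.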
Key Findings
Dataset size and structure
MCQ performance lags humans
Generative answers show surface match but low semantic correctness
Ambiguity resolution is moderate and uneven
Explanation quality concentrates in top models
Prompting effects are inconsistent
Results
Accuracy
Generative QA semantic overlap (BERTScore F1 best)
Generative QA judged correctness (LLM-as-a-Judge best)
Semantic Ambiguity Resolution (best)
Reasoning quality (LLM-as-a-Judge score best)
What To Try In 7 Days
Run your model(s) on a sampled subset of BANGLARIDDLEEVAL to spot surface-match failures quickly.
Compare BERTScore vs human/LLM-judge labels to find cases where lexical similarity hides wrong answers.
Test zero-shot, few-shot, and CoT prompts per model, and measure each task separately (MCQ, ambiguity resolution, explanation).
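The first two steps can be sketched as a seeded sample plus a filter for answers that score high on BERTScore but are judged incorrect. This is a minimal sketch under stated assumptions: the record fields `bertscore_f1` and `judge_correct`, the helper names, and the 0.85 threshold are hypothetical choices, not values or a schema from the benchmark.

```python
import random
from typing import Any, Dict, Iterable, List


def sample_subset(items: Iterable[Any], k: int, seed: int = 0) -> List[Any]:
    """Draw a reproducible random sample of at most k items."""
    pool = list(items)
    rng = random.Random(seed)  # local RNG so the global random state is untouched
    return rng.sample(pool, min(k, len(pool)))


def surface_match_failures(
    records: Iterable[Dict[str, Any]],
    threshold: float = 0.85,  # assumed cut-off; tune it on your own score distribution
) -> List[Dict[str, Any]]:
    """Return records whose BERTScore F1 is high but whose answer was judged wrong.

    Each record is assumed to carry `bertscore_f1` (float) and
    `judge_correct` (bool, from a human or LLM judge).
    """
    return [
        r for r in records
        if r["bertscore_f1"] >= threshold and not r["judge_correct"]
    ]
```

Reviewing the flagged records by hand is the point of the exercise: they are the cases where a similarity metric alone would have reported success.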
Agent Features
Tool Use
- LLMs used as data generators and evaluators (GPT-4o, Gemini-2.5-Flash)
- OCR pipeline for book digitization
Architectures
- multilingual LLMs
- instruction-tuned models
Reproducibility
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Riddles sourced from a limited set of printed books; may under-represent dialects and regional variants.
- A large portion of annotations (explanations, distractors, ambiguity labels) and the judge rely on proprietary LLMs (e.g., GPT-4o, Gemini), which can introduce bias.
- Evaluation covers a specific set of models and prompts; results are indicative, not definitive upper bounds.
- BERTScore overestimates correctness for generative outputs; human or judge validation remains necessary.
When Not To Use
- Do not use BANGLARIDDLEEVAL as the sole test for dialectal or regional Bangla coverage.
- Avoid using benchmark results as the only signal for high-stakes decisions (legal, medical) without human verification.
Failure Modes
- Surface-match bias: high BERTScore despite a semantically incorrect answer.
- Distractor vulnerability: plausible wrong options often fool models.
- CoT prompting harms lexical disambiguation in some cases; chain-of-thought is not universally helpful.
- Judge bias: reliance on a single LLM judge can propagate evaluation artifacts.
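One common way to reduce dependence on a single judge is to collect verdicts from several judges and flag items where they disagree. The sketch below illustrates that idea only; it is not part of the paper's evaluation, and the judge names and helper functions are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple


def aggregate_verdicts(verdicts: Dict[str, bool]) -> Tuple[Optional[bool], bool]:
    """Combine per-judge correctness verdicts for one answer.

    Returns (majority, unanimous). `majority` is None when the vote is tied.
    """
    if not verdicts:
        raise ValueError("at least one verdict is required")
    ranked = Counter(verdicts.values()).most_common()
    tied = len(ranked) > 1 and ranked[0][1] == ranked[1][1]
    majority = None if tied else ranked[0][0]
    unanimous = len(ranked) == 1
    return majority, unanimous


def disagreement_rate(all_verdicts: List[Dict[str, bool]]) -> float:
    """Fraction of items on which the judges were not unanimous."""
    if not all_verdicts:
        raise ValueError("no items to score")
    flagged = sum(1 for v in all_verdicts if not aggregate_verdicts(v)[1])
    return flagged / len(all_verdicts)
```

Items with a tied or non-unanimous vote are natural candidates for human review.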
Core Entities
Models
- Gemini-2.5-Flash
- GPT-OSS-20B
- Qwen3-14B
- Qwen3-8B
- Qwen3-4B
- Gemma3-12B
- Gemma3-4B
- DeepSeek-R1-14B
- DeepSeek-R1-7B
- GPT-4o
Metrics
- Accuracy
- BERTScore F1
- LLM-as-a-Judge reasoning score (0-10)
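Because results differ by task and prompt style, accuracy is more useful when reported per group than as one pooled number. A minimal sketch of grouped accuracy follows, assuming each result is a dict with hypothetical `task`, `prompt`, and `correct` fields (these names are not from the released scripts).

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, Sequence, Tuple


def accuracy_by_group(
    results: Iterable[Dict[str, Any]],
    keys: Sequence[str] = ("task", "prompt"),
) -> Dict[Tuple[Any, ...], float]:
    """Compute accuracy separately for each combination of the given keys."""
    totals: Dict[Tuple[Any, ...], int] = defaultdict(int)
    hits: Dict[Tuple[Any, ...], int] = defaultdict(int)
    for r in results:
        group = tuple(r[k] for k in keys)
        totals[group] += 1
        hits[group] += int(bool(r["correct"]))
    # Every group in `totals` has at least one item, so no division by zero.
    return {group: hits[group] / totals[group] for group in totals}
```

Passing `keys=("task",)` gives per-task accuracy, while the default also splits by prompt style.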
Datasets
- BANGLARIDDLEEVAL
Benchmarks
- BANGLARIDDLEEVAL
Context Entities
Datasets
- Printed Bangla riddle books (1000 dhadha; ধাঁধা মোনেই বাঁধা; মজার ধাঁধা)

