New Bangla riddle benchmark shows LLMs often copy surface words but fail real riddle reasoning

December 23, 2025 · 8 min

Overview

Production Readiness

0.3

Novelty Score

0.7

Cost Impact Score

0.4

Citation Count

0

Authors

Nurul Labib Sayeedi, Md. Faiyaz Abdullah Sayeedi, Khushnur Binte Jahangir, Swakkhar Shatabda, Sarah Masud Preum

Why It Matters For Business

If you build Bangla NLP products that must reason with cultural metaphors or resolve wordplay, off-the-shelf LLMs are not yet reliable. Superficial word overlap can mask incorrect reasoning. Use targeted benchmarks like BANGLARIDDLEEVAL to validate models before deployment.

Summary TLDR

The authors release BANGLARIDDLEEVAL, a benchmark of 1,244 riddles (4 tasks, 4,976 artifacts) for testing LLM reasoning on Bangla riddles. Strong models match the reference wording (BERTScore F1 up to ~0.81) but often give wrong answers: the best judged correctness on generative QA is ~29%, and MCQ accuracy tops out at ~56% versus 83% for humans. Ambiguity resolution and explanation quality are mixed. The dataset, code, and evaluation scripts are on GitHub.

Problem Statement

Current LLMs are rarely tested on figurative, culturally grounded text in low-resource languages. Bangla riddles use metaphors, wordplay, and local cultural cues that standard benchmarks miss. The paper builds a focused, multi-task benchmark to measure how well models actually reason about Bangla riddles rather than match surface patterns.

Main Contribution

Created BANGLARIDDLEEVAL: 1,244 unique Bangla riddles, instantiated across four tasks (generative QA, reasoning/explanation, MCQ, semantic ambiguity), totalling 4,976 artifacts.

Built an LLM-driven pipeline to generate step-by-step explanations (Chain-of-Thought), semantically plausible distractors, and fine-grained ambiguity options with quality checks.

Evaluated a diverse set of open-source and closed-source LLMs (e.g., Gemini-2.5-Flash, GPT-OSS-20B, Qwen3-14B) under zero-shot, few-shot, and CoT prompts, and compared to human baselines.

Released data, code, and evaluation scripts publicly (GitHub) to enable follow-up work in low-resource, figurative reasoning.

Key Findings

Dataset size and structure

Numbers: 1,244 riddles × 4 tasks = 4,976 artifacts

MCQ performance lags humans

Numbers: Top MCQ accuracy 56.33% (Gemini CoT) vs human 83%

Generative answers show surface match but low semantic correctness

Numbers: BERTScore F1 ≈ 0.74–0.81 while LLM-as-a-Judge accuracy ≈ 2–29%
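
The gap between lexical similarity and judged correctness can be checked on your own outputs with a few lines of Python. The following is a minimal sketch, not the authors' evaluation script: the record layout, the field names, and the 0.75 threshold are all assumptions made for illustration, and the BERTScore values and judge verdicts are expected to be computed beforehand.

```python
from dataclasses import dataclass


@dataclass
class GenerativeResult:
    """One model answer with precomputed scores (hypothetical layout)."""
    riddle_id: str
    bertscore_f1: float   # similarity between model answer and reference answer
    judge_correct: bool   # verdict from a human or LLM judge


def surface_match_failures(results, f1_threshold=0.75):
    """Return answers that look right lexically but were judged wrong."""
    return [
        r for r in results
        if r.bertscore_f1 >= f1_threshold and not r.judge_correct
    ]


def summarize(results, f1_threshold=0.75):
    """Report mean BERTScore F1, judged accuracy, and hidden failures."""
    if not results:
        raise ValueError("results must not be empty")
    n = len(results)
    return {
        "mean_f1": sum(r.bertscore_f1 for r in results) / n,
        "judged_accuracy": sum(r.judge_correct for r in results) / n,
        "hidden_failures": len(surface_match_failures(results, f1_threshold)),
    }


# Toy data only; these are not results from the paper.
demo = [
    GenerativeResult("r1", 0.80, False),
    GenerativeResult("r2", 0.78, True),
    GenerativeResult("r3", 0.60, False),
    GenerativeResult("r4", 0.82, False),
]
print(summarize(demo))
```

A large `hidden_failures` count relative to the sample size is the pattern this finding describes: a high mean F1 sitting next to low judged accuracy.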

Ambiguity resolution is moderate and uneven

Numbers: Accuracy range ≈ 26%–68%, best zero-shot 67.67% (Gemini)

Explanation quality concentrates in top models

Numbers: Reasoning LLM-as-a-Judge scores up to 8.71/10 (Gemini), many models <6

Prompting effects are inconsistent

Numbers: CoT helps some models (MCQ/aggregate) but can hurt lexical disambiguation; zero-shot often competitive
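
Because prompting effects vary by model and task, it helps to generate all three prompt styles from one function so that the only thing that changes between runs is the prompting mode. The sketch below is illustrative: the instruction wording, the mode names, and the `build_prompt` helper are hypothetical and are not the prompts used in the paper.

```python
def build_prompt(riddle, mode, examples=()):
    """Build a prompt for one riddle under a given prompting mode.

    mode: "zero_shot", "few_shot", or "cot".
    examples: (riddle, answer) pairs, used only for few-shot prompts.
    All wording here is a placeholder, not the paper's prompt text.
    """
    if mode == "zero_shot":
        parts = ["Solve the following Bangla riddle. Reply with the answer only."]
        tail = f"Riddle: {riddle}\nAnswer:"
    elif mode == "few_shot":
        if not examples:
            raise ValueError("few_shot mode needs at least one example")
        parts = ["Solve the following Bangla riddle. Reply with the answer only."]
        # Solved examples come before the target riddle.
        parts.extend(f"Riddle: {q}\nAnswer: {a}" for q, a in examples)
        tail = f"Riddle: {riddle}\nAnswer:"
    elif mode == "cot":
        parts = [
            "Solve the following Bangla riddle. Think step by step, then give "
            "the final answer on a line starting with 'Answer:'."
        ]
        # The model is asked to reason first, so the prompt ends on 'Reasoning:'.
        tail = f"Riddle: {riddle}\nReasoning:"
    else:
        raise ValueError(f"unknown mode: {mode}")
    parts.append(tail)
    return "\n\n".join(parts)


for mode in ("zero_shot", "few_shot", "cot"):
    print(f"--- {mode} ---")
    print(build_prompt("<riddle text>", mode, examples=[("<example riddle>", "<example answer>")]))
```

Keeping the riddle text and answer format fixed across modes makes it easier to attribute a score change to the prompting style rather than to incidental wording differences.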

Results

MCQ accuracy (best)

Value: 56.33%

Baseline: 25% random
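
A 25% random baseline is what uniform guessing over four answer options yields. When scoring your own multiple-choice runs, it can be useful to report accuracy next to that chance level and a rough significance check. The sketch below assumes answers are encoded as option labels and uses a normal approximation to the binomial; the function names are illustrative.

```python
import math


def accuracy(predictions, gold):
    """Fraction of predictions that exactly match the gold labels."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("gold must not be empty")
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


def chance_baseline(num_options):
    """Expected accuracy of uniform random guessing."""
    return 1.0 / num_options


def z_score_vs_chance(acc, n, num_options):
    """Normal-approximation z-score of observed accuracy against chance."""
    p0 = chance_baseline(num_options)
    standard_error = math.sqrt(p0 * (1.0 - p0) / n)
    return (acc - p0) / standard_error


# Toy labels only; these are not results from the paper.
predictions = ["A", "B", "C", "D", "A", "B", "C", "D"]
gold = ["A", "B", "C", "A", "A", "C", "C", "B"]

acc = accuracy(predictions, gold)
print(f"accuracy={acc:.3f} chance={chance_baseline(4):.2f} "
      f"z={z_score_vs_chance(acc, len(gold), 4):.2f}")
```

The normal approximation is only reasonable for samples much larger than this toy example; on a small subset, an exact binomial test would be the safer choice.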

Generative QA semantic overlap (BERTScore F1 best)

Value: 0.814

Generative QA judged correctness (LLM-as-a-Judge best)

Value: 29.13%

Baseline: human performance expected to be much higher

Semantic Ambiguity Resolution (best)

Value: 67.67%

Baseline: 25% random

Reasoning quality (LLM-as-a-Judge score best)

Value: 8.71 / 10

What To Try In 7 Days

Run your model(s) on a sampled subset of BANGLARIDDLEEVAL to spot surface-match failures quickly.

Compare BERTScore vs human/LLM-judge labels to find cases where lexical similarity hides wrong answers.

Test zero-shot, few-shot, and CoT prompts per model, and measure each task separately (MCQ, ambiguity, explanation).
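
The steps above can be sketched as a small harness: draw a reproducible per-task sample, run your models on it, then aggregate accuracy per task and prompting mode. This is a rough outline under stated assumptions. The item and record fields (`task`, `prompt_mode`, `correct`) are hypothetical and may not match the released dataset's schema, and the model call itself is left out.

```python
import random
from collections import defaultdict


def sample_per_task(items, per_task, seed=0):
    """Draw up to `per_task` items from each task using a fixed seed."""
    rng = random.Random(seed)
    by_task = defaultdict(list)
    for item in items:
        by_task[item["task"]].append(item)
    sample = []
    for task in sorted(by_task):  # sorted for a stable, reproducible order
        pool = by_task[task]
        sample.extend(rng.sample(pool, min(per_task, len(pool))))
    return sample


def accuracy_grid(records):
    """Aggregate accuracy for every (task, prompt_mode) pair.

    records: dicts with "task", "prompt_mode", and a boolean "correct".
    """
    totals = defaultdict(lambda: [0, 0])  # key -> [hits, count]
    for record in records:
        cell = totals[(record["task"], record["prompt_mode"])]
        cell[0] += int(record["correct"])
        cell[1] += 1
    return {key: hits / count for key, (hits, count) in totals.items()}


# Toy items and scored records only; not data from the benchmark.
items = [{"id": i, "task": task} for task in ("mcq", "ambiguity") for i in range(10)]
subset = sample_per_task(items, per_task=3, seed=1)
print(len(subset), "items sampled")

records = [
    {"task": "mcq", "prompt_mode": "cot", "correct": True},
    {"task": "mcq", "prompt_mode": "cot", "correct": False},
    {"task": "mcq", "prompt_mode": "zero_shot", "correct": True},
    {"task": "ambiguity", "prompt_mode": "cot", "correct": False},
]
for key, value in sorted(accuracy_grid(records).items()):
    print(key, f"{value:.2f}")
```

Reporting the grid per cell, rather than as a single average, keeps a gain on one task from hiding a loss on another.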

Agent Features

Tool Use

  • LLMs used as data generators and evaluators (GPT-4o, Gemini-2.5-Flash)
  • OCR pipeline for book digitization

Architectures

  • multilingual LLMs
  • instruction-tuned models

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Riddles sourced from a limited set of printed books; may under-represent dialects and regional variants.
  • A large portion of annotations (explanations, distractors, ambiguity labels) and the judge rely on proprietary LLMs (e.g., GPT-4o, Gemini), which can introduce bias.
  • Evaluation covers a specific set of models and prompts; results are indicative, not definitive upper bounds.
  • BERTScore overestimates correctness for generative outputs; human or judge validation remains necessary.

When Not To Use

  • Do not use BANGLARIDDLEEVAL as the sole test for dialectal or regional Bangla coverage.
  • Avoid using benchmark results as the only signal for high-stakes decisions (legal, medical) without human verification.

Failure Modes

  • Surface-match bias: high BERTScore with incorrect semantic answer.
  • Distractor vulnerability: plausible wrong options often fool models.
  • CoT can hurt lexical disambiguation in some cases; chain-of-thought prompting is not universally helpful.
  • Judge bias: reliance on a single LLM judge can propagate evaluation artifacts.

Core Entities

Models

  • Gemini-2.5-Flash
  • GPT-OSS-20B
  • Qwen3-14B
  • Qwen3-8B
  • Qwen3-4B
  • Gemma3-12B
  • Gemma3-4B
  • DeepSeek-R1-14B
  • DeepSeek-R1-7B
  • GPT-4o

Metrics

  • Accuracy
  • BERTScore F1
  • LLM-as-a-Judge reasoning score (0-10)

Datasets

  • BANGLARIDDLEEVAL

Benchmarks

  • BANGLARIDDLEEVAL

Context Entities

Datasets

  • Printed Bangla riddle books (1000 dhadha; ধাঁধা মোনেই বাঁধা; মজার ধাঁধা)