Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.4
Citation Count
0
Why It Matters For Business
CausalAbstain reduces wrong answers across multiple languages by selectively using model feedback, improving trust in multilingual QA systems while trading off higher API cost for better safety.
Summary TLDR
The paper introduces CausalAbstain, a training-free way to decide whether to use LLM-generated feedback when an LLM evaluates its own answers in multiple languages. It measures the causal paths from an answer to the abstention decision using the natural direct effect (NDE) and the total indirect effect (TIE), both computed from repeated prompts and Jensen-Shannon divergence. Two modes are proposed: CAUSAL-NATIVE (native-language feedback) and CAUSAL-MULTI (related-language feedback with aggregated voting). Evaluated on multilingual M-MMLU and M-Hellaswag with ChatGPT, GPT-4o, and Aya-13B (plus additional tests on LLaMa/Phi), CAUSAL-MULTI improves abstain accuracy over strong baselines (avg +3.5% on evaluated setups).
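To make the Jensen-Shannon step concrete, here is a minimal sketch of scoring how far repeated abstain/answer decisions shift once feedback is added. This summary does not give the paper's exact NDE and TIE formulas, so the helper names (`empirical_distribution`, `jsd`, `prefer_feedback`) and the "use feedback when TIE exceeds NDE" rule are assumptions for illustration only.

```python
import math
from collections import Counter

def empirical_distribution(samples, labels):
    """Turn repeated decisions into a probability vector over `labels`."""
    counts = Counter(samples)
    return [counts[label] / len(samples) for label in labels]

def jsd(p, q):
    """Jensen-Shannon divergence with log base 2, so the result lies in [0, 1]."""
    mix = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i == 0 contribute nothing; y_i > 0 whenever x_i > 0.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, mix) + 0.5 * kl(q, mix)

def prefer_feedback(nde, tie):
    """ASSUMPTION: keep the feedback only when the feedback-mediated
    effect (TIE) is larger than the direct effect (NDE)."""
    return tie > nde

LABELS = ["abstain", "answer"]
# Hypothetical decisions from N=3 repeated prompts, without and with feedback.
without_feedback = empirical_distribution(["answer", "answer", "answer"], LABELS)
with_feedback = empirical_distribution(["abstain", "abstain", "answer"], LABELS)
shift = jsd(without_feedback, with_feedback)
```

Base-2 logarithms bound the score between 0 (identical distributions) and 1 (disjoint distributions), which keeps two effect scores directly comparable.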
Problem Statement
LLMs hallucinate more in low-resource languages. Existing feedback-based abstention methods simply trust model-generated feedback, which can be incorrect or biased across languages. The problem is deciding when to use generated feedback and which feedback to trust so the model abstains correctly instead of amplifying errors.
Main Contribution
A causality-based, training-free framework (CausalAbstain) that compares direct and feedback-mediated effects to decide whether to use LLM feedback.
Two operational modes: CAUSAL-NATIVE (native-language feedback) and CAUSAL-MULTI (multiple related languages with aggregated voting).
Extensive multilingual experiments showing CAUSAL-MULTI improves abstention accuracy over calibration, prompting, consistency, and feedback baselines on M-MMLU and M-Hellaswag.
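The aggregated voting in CAUSAL-MULTI can be sketched as a simple majority vote over per-language abstain decisions. The summary does not say how votes are weighted or how ties are resolved, so the unweighted vote, the `aggregate_votes` name, and the default of abstaining on a tie are assumptions.

```python
from collections import Counter

def aggregate_votes(decisions, abstain_on_tie=True):
    """Majority vote over per-language decisions.

    `decisions` maps a language code to True (abstain) or False (answer).
    ASSUMPTION: votes are unweighted and a tie falls back to `abstain_on_tie`.
    """
    if not decisions:
        raise ValueError("need at least one per-language decision")
    counts = Counter(decisions.values())
    if counts[True] == counts[False]:
        return abstain_on_tie
    return counts[True] > counts[False]

# Hypothetical decisions derived from feedback in three related languages.
final_abstain = aggregate_votes({"es": True, "pt": True, "it": False})
```

Defaulting to abstention on a tie is the conservative choice for a safety-oriented setup; flip `abstain_on_tie` if coverage matters more.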
Key Findings
CAUSAL-MULTI outperforms prior methods on the evaluated benchmarks, with an average abstain-accuracy gain of about 3.5% over strong baselines.
Filtering feedback via causal effect comparison is valuable; ignoring feedback hurts performance.
CAUSAL-MULTI is costlier in API calls but more robust across languages.
Results
Accuracy
Inference requests per query (cost)
Who Should Care
What To Try In 7 Days
Run CAUSAL-NATIVE (N=3) on a small multilingual QA sample to measure abstain accuracy vs your current setup.
Implement NDE vs TIE comparison with JSD scores to filter feedback before final decisions.
If cross-language robustness matters, run CAUSAL-MULTI on key languages and measure cost vs benefit using the 10-call protocol.
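For the comparison against your current setup, a small scoring helper is enough. The sketch below assumes the usual reading of abstain accuracy, where a decision counts as correct when the system abstains on a wrong answer or answers when the answer is right; this summary does not define the metric, so treat that definition and the `abstain_accuracy` name as assumptions.

```python
def abstain_accuracy(records):
    """Fraction of correct abstain decisions.

    `records` is a list of (abstained, answer_correct) boolean pairs.
    ASSUMPTION: a decision is correct when the system abstains on a wrong
    answer or keeps a right one, i.e. when the two flags differ.
    """
    if not records:
        raise ValueError("need at least one record")
    good = sum(1 for abstained, answer_correct in records
               if abstained != answer_correct)
    return good / len(records)

# Hypothetical sample: two good decisions, two bad ones.
sample = [
    (True, False),   # abstained on a wrong answer: good
    (False, True),   # answered correctly: good
    (True, True),    # abstained on a right answer: bad
    (False, False),  # answered wrongly: bad
]
score = abstain_accuracy(sample)
```

Scoring the same sample with both your current pipeline and the causal variant gives a like-for-like comparison.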
Reproducibility
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Higher inference cost: CAUSAL-MULTI uses ~10 calls per query at N=3, increasing API expense.
- Relies on LLM-generated feedback quality; severely incorrect feedback can still mislead if not properly filtered.
- Performance depends on the choice of related languages and on how well each language is represented in pretraining; low-resource languages remain challenging.
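The cost limitation is easy to size up before committing. The sketch below multiplies the ~10 calls per query reported for CAUSAL-MULTI at N=3 by a workload size and a per-call price; the price and workload figures are placeholders, not numbers from the paper.

```python
def total_calls(num_queries, calls_per_query=10):
    """Total inference requests; 10 per query is the CAUSAL-MULTI figure at N=3."""
    return num_queries * calls_per_query

def estimated_cost(num_queries, price_per_call, calls_per_query=10):
    """Rough spend estimate. ASSUMPTION: a flat price per call,
    ignoring token-length differences between prompts."""
    return total_calls(num_queries, calls_per_query) * price_per_call

# Hypothetical workload: 1,000 queries at a placeholder price of 0.002 per call.
calls = total_calls(1000)
cost = estimated_cost(1000, price_per_call=0.002)
```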
When Not To Use
- When API cost per query must be minimal and you cannot afford multiple feedback calls.
- When the target language lacks related languages in the chosen pool for CAUSAL-MULTI.
- For tasks where abstention decisions require external factual verification rather than self-reflection.
Failure Modes
- Noisy or adversarial feedback in multiple languages can skew aggregated votes.
- Incorrect language relatedness choices may reduce performance or produce wrong abstain choices.
- Smaller models may produce low-quality feedback; aggregation may not fully correct systematic biases.
Core Entities
Models
- ChatGPT
- GPT-4o
- Aya-13B
- LLaMa3.2
- Phi4
Metrics
- Accuracy
Datasets
- M-MMLU (Multilingual MMLU)
- M-Hellaswag (Multilingual Hellaswag)

