Overview
The method is straightforward and training-free, and it shows consistent gains on the evaluated multilingual QA sets; however, it increases the number of inference calls and depends on language selection and LLM behaviour.
Citations: 0
Evidence Strength: 0.80
Confidence: 0.82
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 3/3
Findings with evidence refs: 3/3
Results with explicit delta: 3/3
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 40%
Production readiness: 60%
Novelty: 60%
Why It Matters For Business
CausalAbstain reduces wrong answers across multiple languages by selectively using model feedback, improving trust in multilingual QA systems while trading off higher API cost for better safety.
Summary TLDR
The paper introduces CausalAbstain, a training-free way to decide whether to use LLM-generated feedback when an LLM evaluates its own answers in multiple languages. It measures the causal paths from an answer to the abstention decision using the natural direct effect (NDE) and the total indirect effect (TIE), both computed from repeated prompts and Jensen-Shannon divergence. Two modes are proposed: CAUSAL-NATIVE (native-language feedback) and CAUSAL-MULTI (related-language feedback with aggregated voting). Evaluated on multilingual M-MMLU and M-Hellaswag with ChatGPT, GPT-4o, and Aya-13B (with additional tests on LLaMa/Phi), CAUSAL-MULTI improves abstain accuracy over strong baselines (average +3.5% on the evaluated setups).
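The Jensen-Shannon divergence step can be illustrated with a minimal sketch. This summary does not give the paper's exact NDE and TIE formulas, so the code only shows the generic building block: turning repeated abstain/answer decisions into an empirical distribution and comparing two such distributions. The function names and the two-label set are assumptions made for illustration.

```python
import math
from collections import Counter

LABELS = ("abstain", "answer")  # assumed decision labels, not from the paper


def empirical_dist(decisions, labels=LABELS):
    """Turn a list of repeated decisions into a probability vector."""
    if not decisions:
        raise ValueError("need at least one decision")
    counts = Counter(decisions)
    return [counts[label] / len(decisions) for label in labels]


def kl(p, q):
    """KL divergence in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(p, q):
    """Jensen-Shannon divergence in bits, bounded between 0 and 1."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Hypothetical decisions from N=3 repeated prompts, with and without feedback.
without_feedback = empirical_dist(["answer", "answer", "abstain"])
with_feedback = empirical_dist(["abstain", "abstain", "abstain"])
shift = jsd(without_feedback, with_feedback)
```

A larger `shift` means the feedback moved the decision distribution further; how that quantity maps onto NDE or TIE depends on the paper's definitions.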
Problem Statement
LLMs hallucinate more in low-resource languages. Existing feedback-based abstention methods simply trust model-generated feedback, which can be incorrect or biased across languages. The problem is deciding when to use generated feedback and which feedback to trust so the model abstains correctly instead of amplifying errors.
Main Contribution
A causality-based, training-free framework (CausalAbstain) that compares direct and feedback-mediated effects to decide whether to use LLM feedback.
Two operational modes: CAUSAL-NATIVE (native-language feedback) and CAUSAL-MULTI (multiple related languages with aggregated voting).
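The aggregated voting in CAUSAL-MULTI could look roughly like the sketch below. The summary does not specify the aggregation rule, so this assumes a simple majority vote over per-language decisions, with ties resolved toward abstaining; the language codes and the `aggregate_votes` helper are hypothetical.

```python
from collections import Counter


def aggregate_votes(votes_by_language, tie_break="abstain"):
    """Majority vote over per-language decisions.

    votes_by_language maps a language code to "abstain" or "answer".
    Ties fall back to tie_break (an assumption, chosen to be conservative).
    """
    if not votes_by_language:
        raise ValueError("need at least one vote")
    ranked = Counter(votes_by_language.values()).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return tie_break
    return ranked[0][0]


# Hypothetical votes from three related languages.
decision = aggregate_votes({"sw": "abstain", "am": "answer", "yo": "abstain"})
```

Resolving ties toward abstention favours safety over coverage; a deployment that values coverage more could pass `tie_break="answer"`.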
Key Findings
CAUSAL-MULTI outperforms prior methods on the evaluated benchmarks.
Filtering feedback via causal effect comparison is valuable; ignoring feedback hurts performance.
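One way to picture the causal effect comparison is as a gate on the feedback-based decision. The sketch below assumes the rule "use the feedback-mediated decision only when TIE exceeds NDE"; the summary states that the two effects are compared but not the exact rule, so the direction of the comparison, the `margin` parameter, and the `Scored` record are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class Scored:
    nde: float               # direct-effect score (e.g. a JSD value)
    tie: float               # feedback-mediated effect score
    direct_decision: str     # decision made without feedback
    feedback_decision: str   # decision made with feedback


def final_decision(item, margin=0.0):
    """Assumed rule: trust feedback only if TIE beats NDE by more than margin."""
    use_feedback = item.tie > item.nde + margin
    return item.feedback_decision if use_feedback else item.direct_decision


# Hypothetical scores for one question.
example = Scored(nde=0.10, tie=0.40,
                 direct_decision="answer", feedback_decision="abstain")
result = final_decision(example)
```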
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | +3.5% vs best competing method | best competing method (varies by model/dataset) | +3.5% | aggregate across three LLMs × two datasets | CAUSAL-MULTI outperforms strongest baseline in 4/6 settings and averages +3.5% (Table 1) | Table 1; §4.2 |
| Accuracy | 0.738 | MULTI-RELATED 0.725 (example) | +0.013 | GPT-4o on M-MMLU overall | Table 1 reports GPT-4o CAUSAL-MULTI overall = 0.738 | Table 1 |
What To Try In 7 Days
Run CAUSAL-NATIVE (N=3) on a small multilingual QA sample to measure abstain accuracy vs your current setup.
Implement NDE vs TIE comparison with JSD scores to filter feedback before final decisions.
If cross-language robustness matters, run CAUSAL-MULTI on key languages and measure cost vs benefit using the 10-call protocol.
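To compare a pilot run against a current setup, a small metric helper is enough. The summary does not define abstain accuracy, so the sketch below assumes a common definition: a decision counts as correct when the model answers and the answer is right, or abstains and the answer would have been wrong. The record format is hypothetical.

```python
def abstain_accuracy(records):
    """Fraction of correct abstain/answer decisions.

    records: iterable of (answer_is_correct, abstained) boolean pairs.
    A decision is counted as good when exactly one of the two is True:
    answered a correct answer, or abstained on an incorrect one.
    """
    records = list(records)
    if not records:
        raise ValueError("need at least one record")
    good = sum(1 for correct, abstained in records if correct != abstained)
    return good / len(records)


# Hypothetical sample of four questions.
sample = [
    (True, False),   # correct answer, kept: good
    (False, True),   # wrong answer, abstained: good
    (True, True),    # correct answer, abstained: bad
    (False, False),  # wrong answer, kept: bad
]
score = abstain_accuracy(sample)
```

Running the same helper on the existing pipeline and on the CAUSAL-NATIVE or CAUSAL-MULTI outputs gives a like-for-like comparison on the same sample.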
Risks & Boundaries
Limitations
Higher inference cost: CAUSAL-MULTI uses ~10 calls per query at N=3, increasing API expense.
Relies on LLM-generated feedback quality; severely incorrect feedback can still mislead if not properly filtered.
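The cost limitation can be estimated with simple arithmetic. The sketch below takes the ~10 calls per query figure from above; the single-call baseline, the query volume, and the per-call price are placeholder assumptions to be replaced with real numbers.

```python
def call_multiplier(method_calls, baseline_calls=1):
    """How many times more calls the method makes than the baseline."""
    return method_calls / baseline_calls


def total_cost(queries, calls_per_query, cost_per_call):
    """Total spend for a batch of queries at a flat per-call price."""
    return queries * calls_per_query * cost_per_call


# ~10 calls per query at N=3 (from the limitation above);
# baseline of 1 call, 1000 queries, and 0.002 per call are assumptions.
multiplier = call_multiplier(10, baseline_calls=1)
baseline_spend = total_cost(1000, 1, 0.002)
causal_multi_spend = total_cost(1000, 10, 0.002)
```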
When Not To Use
When API cost per query must be minimal and you cannot afford multiple feedback calls.
When the target language lacks related languages in the chosen pool for CAUSAL-MULTI.
Failure Modes
Noisy or adversarial feedback in multiple languages can skew aggregated votes.
Incorrect language relatedness choices may reduce performance or produce wrong abstain choices.

