Use causal effects on multilingual feedback to decide when LLMs should abstain

May 31, 2025 · 6 min

Overview

Decision Snapshot: Ready For Pilot

The method is simple and training-free and shows consistent gains on the evaluated multilingual QA sets. However, it increases the number of inference calls and depends on language selection and LLM behaviour.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.82

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 3/3

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 40%

Production readiness: 60%

Novelty: 60%

Authors

Yuxi Sun, Aoqi Zuo, Wei Gao, Jing Ma

Links

Abstract / PDF / Code / Data

Why It Matters For Business

CausalAbstain reduces wrong answers across multiple languages by selectively using model feedback, improving trust in multilingual QA systems while trading off higher API cost for better safety.

Summary TLDR

The paper introduces CausalAbstain, a training-free method for deciding whether to use LLM-generated feedback when an LLM evaluates its own answers in multiple languages. It measures the causal paths from an answer to the abstention decision using the natural direct effect (NDE) and the total indirect effect (TIE), both computed from repeated prompts and Jensen-Shannon divergence. Two modes are proposed: CAUSAL-NATIVE (native-language feedback) and CAUSAL-MULTI (related-language feedback with aggregated voting). Evaluated on multilingual M-MMLU and M-Hellaswag with ChatGPT, GPT-4o, and Aya-13B (plus additional tests on LLaMa3.2 and Phi4), CAUSAL-MULTI improves abstain accuracy over strong baselines (average +3.5% on the evaluated setups).
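As a rough illustration of the divergence step, the sketch below estimates answer/abstain distributions from repeated prompts and compares them with Jensen-Shannon divergence. The function names, the two-label decision space, and the sample decisions are assumptions made for this example; how the paper combines such divergences into NDE and TIE is not reproduced here.

```python
import math
from collections import Counter

LABELS = ("answer", "abstain")  # assumed two-way decision space


def empirical_distribution(samples, labels=LABELS):
    """Turn repeated-prompt decisions into a probability vector over labels."""
    if not samples:
        raise ValueError("need at least one sample")
    counts = Counter(samples)
    return [counts[label] / len(samples) for label in labels]


def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_divergence(p, q):
    """Symmetric divergence, bounded in [0, 1] with base-2 logarithms."""
    mixture = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, mixture) + 0.5 * kl_divergence(q, mixture)


# Hypothetical decisions from N=3 repeated prompts, without and with feedback.
without_feedback = empirical_distribution(["answer", "answer", "abstain"])
with_feedback = empirical_distribution(["abstain", "abstain", "abstain"])
shift = jensen_shannon_divergence(without_feedback, with_feedback)
print(f"JSD between the two decision distributions: {shift:.3f}")
```

A value near 0 means feedback barely changed the decision distribution; a value near 1 means it changed it almost completely.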

Problem Statement

LLMs hallucinate more in low-resource languages. Existing feedback-based abstention methods simply trust model-generated feedback, which can be incorrect or biased across languages. The problem is deciding when to use generated feedback and which feedback to trust so the model abstains correctly instead of amplifying errors.

Main Contribution

A causality-based, training-free framework (CausalAbstain) that compares direct and feedback-mediated effects to decide whether to use LLM feedback.

Two operational modes: CAUSAL-NATIVE (native-language feedback) and CAUSAL-MULTI (multiple related languages with aggregated voting).
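The aggregated voting in CAUSAL-MULTI can be pictured as a majority vote over per-language decisions. The sketch below is a minimal version under stated assumptions: the language codes, the `aggregate_votes` helper, and the rule that ties resolve to abstaining are all illustrative choices, not details taken from the paper.

```python
from collections import Counter


def aggregate_votes(decisions, tie_breaker="abstain"):
    """Majority vote over {language: decision}; ties fall back to tie_breaker."""
    if not decisions:
        raise ValueError("need at least one per-language decision")
    ranked = Counter(decisions.values()).most_common()
    # A tie exists when the two most frequent decisions have equal counts.
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return tie_breaker
    return ranked[0][0]


# Hypothetical per-language decisions for one query.
per_language = {"kn": "abstain", "te": "abstain", "ta": "answer"}
print(aggregate_votes(per_language))
```

Resolving ties toward abstention is the conservative option; a deployment that values coverage over caution could pass `tie_breaker="answer"` instead.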

Key Findings

CAUSAL-MULTI outperforms prior methods on the evaluated benchmarks.

Numbers: Average improvement of +3.5% vs the best competing method (across 3 models × 2 datasets)

Practical Use: When deploying multilingual abstention, use CAUSAL-MULTI to gain modest but consistent abstain-accuracy gains over existing strategies on similar QA benchmarks.

Evidence Ref: Experimental Results §4.2; Table 1

Filtering feedback via causal effect comparison is valuable; ignoring feedback hurts performance.

Numbers: Ignoring feedback caused an average drop of ~4.5% and a maximum drop of 9.3% in some languages

Practical Use: Do not blindly apply all generated feedback. Use a selection step (NDE vs TIE) or you risk losing several points of abstention accuracy, especially in low-resource languages.

Evidence Ref: Ablation §4.3; Table 2 (example: Kannada drops from 57.9% to 48.6%)
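One way to picture the NDE vs TIE selection step is the small sketch below. It assumes the two effects have already been estimated as non-negative scores, and it assumes the rule "use the feedback-informed decision only when TIE exceeds NDE, otherwise keep the direct decision". The `EffectEstimate` type, the helper name, and the tie handling are hypothetical; the exact rule should be checked against the paper.

```python
from dataclasses import dataclass


@dataclass
class EffectEstimate:
    nde: float  # direct effect of the answer on the abstain decision
    tie: float  # effect that flows through the generated feedback


def select_decision(effects, direct_decision, feedback_decision):
    """Return the feedback-informed decision only when the mediated effect dominates.

    Assumption: when the two effects are equal, keep the direct decision.
    """
    if effects.tie > effects.nde:
        return feedback_decision
    return direct_decision


# Hypothetical scores: feedback carries more of the effect, so it is used.
print(select_decision(EffectEstimate(nde=0.10, tie=0.40), "answer", "abstain"))
```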

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| Accuracy | +3.5% vs best competing method | best competing method (varies by model/dataset) | +3.5% | aggregate across three LLMs × two datasets | CAUSAL-MULTI outperforms strongest baseline in 4/6 settings and averages +3.5% (Table 1) | Table 1; §4.2 |
| Accuracy | 0.738 | MULTI-RELATED 0.725 (example) | +0.013 | GPT-4o on M-MMLU overall | Table 1 reports GPT-4o CAUSAL-MULTI overall = 0.738 | Table 1 |

What To Try In 7 Days

Run CAUSAL-NATIVE (N=3) on a small multilingual QA sample to measure abstain accuracy vs your current setup.

Implement NDE vs TIE comparison with JSD scores to filter feedback before final decisions.

If cross-language robustness matters, run CAUSAL-MULTI on key languages and measure cost vs benefit using the 10-call protocol.
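To compare a pilot run against a current setup, some scoring function for abstain accuracy is needed. The sketch below uses a definition common in abstention work and assumed here rather than quoted from the paper: a decision counts as correct when the model abstains on a wrong answer or answers when its answer is right. The record format and sample data are hypothetical.

```python
def abstain_accuracy(records):
    """Score abstain decisions.

    records: iterable of (answer_is_correct, abstained) boolean pairs.
    A decision is good when exactly one of the two flags is true:
    answered and correct, or abstained and wrong.
    """
    records = list(records)
    if not records:
        raise ValueError("need at least one record")
    good = sum(1 for correct, abstained in records if correct != abstained)
    return good / len(records)


# Hypothetical pilot sample of four queries.
sample = [
    (True, False),   # correct answer, kept: good
    (False, True),   # wrong answer, abstained: good
    (True, True),    # correct answer, abstained: bad
    (False, False),  # wrong answer, kept: bad
]
print(abstain_accuracy(sample))
```

Running the same function over the current setup and over the CAUSAL-NATIVE or CAUSAL-MULTI pilot on an identical sample gives a like-for-like comparison.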

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Higher inference cost: CAUSAL-MULTI uses ~10 calls per query at N=3, increasing API expense.

Relies on LLM-generated feedback quality; severely incorrect feedback can still mislead if not properly filtered.
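The cost limitation can be sized with simple arithmetic. The sketch below takes the ~10 calls per query at N=3 from the limitation above and compares it with a single-call setup; the single-call baseline, the query volume, and the per-call price are placeholder assumptions, not figures from the paper.

```python
def monthly_cost(queries_per_month, calls_per_query, cost_per_call):
    """Total spend, assuming every call costs the same flat amount."""
    return queries_per_month * calls_per_query * cost_per_call


QUERIES_PER_MONTH = 100_000  # hypothetical traffic
COST_PER_CALL = 0.002        # hypothetical flat price per LLM call

single_call = monthly_cost(QUERIES_PER_MONTH, 1, COST_PER_CALL)
causal_multi = monthly_cost(QUERIES_PER_MONTH, 10, COST_PER_CALL)

print(f"single-call setup: {single_call:.2f}")
print(f"CAUSAL-MULTI (10 calls): {causal_multi:.2f}")
print(f"cost multiplier: {causal_multi / single_call:.1f}x")
```

Under a flat per-call price the multiplier equals the call count, so the question for a pilot is whether the accuracy gain justifies roughly ten times the inference spend on the affected traffic.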

When Not To Use

When API cost per query must be minimal and you cannot afford multiple feedback calls.

When the target language lacks related languages in the chosen pool for CAUSAL-MULTI.

Failure Modes

Noisy or adversarial feedback in multiple languages can skew aggregated votes.

Incorrect language relatedness choices may reduce performance or produce wrong abstain choices.

Core Entities

Models

ChatGPT, GPT-4o, Aya-13B, LLaMa3.2, Phi4

Metrics

Accuracy

Datasets

M-MMLU (Multilingual MMLU), M-Hellaswag (Multilingual Hellaswag)