Use causal effects on multilingual feedback to decide when LLMs should abstain

May 31, 2025 | 6 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.4

Citation Count

0

Authors

Yuxi Sun, Aoqi Zuo, Wei Gao, Jing Ma

Why It Matters For Business

CausalAbstain reduces wrong answers across multiple languages by using model feedback selectively. This improves trust in multilingual QA systems, at the price of higher API cost.

Summary TLDR

The paper introduces CausalAbstain, a training-free method for deciding whether to use LLM-generated feedback when an LLM evaluates its own answers in multiple languages. It measures the causal paths from an answer to the abstention decision as a natural direct effect (NDE) and a total indirect effect (TIE), both computed from repeated prompts and Jensen-Shannon divergence (JSD). Two modes are proposed: CAUSAL-NATIVE (native-language feedback) and CAUSAL-MULTI (related-language feedback with aggregated voting). Evaluated on multilingual M-MMLU and M-Hellaswag with ChatGPT, GPT-4o, and Aya-13B (with additional tests on LLaMa and Phi models), CAUSAL-MULTI improves abstain accuracy over strong baselines by an average of 3.5% on the evaluated setups.
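The summary names Jensen-Shannon divergence as the distance used to turn repeated prompts into effect scores. The sketch below shows only the standard JSD formula applied to abstain/answer decision counts. The function names and the two-outcome encoding are assumptions for illustration, and the paper's actual NDE and TIE estimators are not reproduced here.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in bits; terms where p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence of two discrete distributions (base 2, range 0..1)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]  # mixture distribution
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

def decision_distribution(decisions):
    """Map repeated 'abstain'/'answer' outcomes to [P(abstain), P(answer)].

    Hypothetical encoding: each repeated prompt yields one of two labels.
    """
    if not decisions:
        raise ValueError("need at least one decision")
    p_abstain = sum(d == "abstain" for d in decisions) / len(decisions)
    return [p_abstain, 1.0 - p_abstain]

# Example: compare decisions made without and with feedback over N=3 repeats.
without_feedback = decision_distribution(["answer", "answer", "answer"])
with_feedback = decision_distribution(["abstain", "abstain", "answer"])
shift = js_divergence(without_feedback, with_feedback)
```

A larger `shift` means the feedback moved the decision distribution further from the no-feedback one. How such distances map onto NDE and TIE is defined in the paper.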

Problem Statement

LLMs hallucinate more in low-resource languages. Existing feedback-based abstention methods simply trust model-generated feedback, which can be incorrect or biased across languages. The problem is deciding when to use generated feedback and which feedback to trust so the model abstains correctly instead of amplifying errors.

Main Contribution

A causality-based, training-free framework (CausalAbstain) that compares direct and feedback-mediated effects to decide whether to use LLM feedback.

Two operational modes: CAUSAL-NATIVE (native-language feedback) and CAUSAL-MULTI (multiple related languages with aggregated voting).

Extensive multilingual experiments showing CAUSAL-MULTI improves abstention accuracy over calibration, prompting, consistency, and feedback baselines on M-MMLU and M-Hellaswag.
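The aggregated voting in CAUSAL-MULTI can be pictured as a majority vote over per-language decisions. The sketch below is a plain majority vote under stated assumptions: the `votes` mapping, the language codes, and the tie-break toward abstaining are all hypothetical, and the paper may weight or filter votes differently.

```python
from collections import Counter

def aggregate_votes(votes, tie_break="abstain"):
    """Majority vote over per-language decisions.

    votes: dict mapping a language code to "abstain" or "answer" (hypothetical format).
    tie_break: decision returned on a tie (assumed: prefer the safer option).
    """
    if not votes:
        raise ValueError("need at least one vote")
    ranked = Counter(votes.values()).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return tie_break
    return ranked[0][0]

# Example with three related languages (codes chosen for illustration only).
decision = aggregate_votes({"es": "answer", "pt": "abstain", "it": "abstain"})
```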

Key Findings

CAUSAL-MULTI outperforms prior methods on the evaluated benchmarks.

Numbers: Average improvement of +3.5% vs the best competing method (across 3 models × 2 datasets)

Filtering feedback via causal effect comparison is valuable; ignoring feedback hurts performance.

Numbers: Ignoring feedback caused an average drop of ~4–5% and a maximum drop of 9.3% in some languages

CAUSAL-MULTI is costlier in API calls but more robust across languages.

Numbers: With N=3 iterations, CAUSAL-MULTI uses 10 LLM requests per query; CAUSAL-NATIVE uses 4
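To make the cost trade-off concrete, the request counts above can be turned into a rough budget estimate. This is a back-of-the-envelope sketch: the per-request price and query volume are placeholders to be replaced with your own figures, and only the 4 and 10 requests per query (at N=3) come from the text.

```python
# Requests per query at N=3, as reported in the text.
REQUESTS_PER_QUERY = {"CAUSAL-NATIVE": 4, "CAUSAL-MULTI": 10}

def estimated_cost(method, num_queries, price_per_request):
    """Rough API cost: requests per query x queries x price per request.

    price_per_request is a hypothetical flat price; real pricing is usually per token.
    """
    return REQUESTS_PER_QUERY[method] * num_queries * price_per_request

def cost_ratio(method_a, method_b):
    """How many times more requests method_a issues than method_b."""
    return REQUESTS_PER_QUERY[method_a] / REQUESTS_PER_QUERY[method_b]

ratio = cost_ratio("CAUSAL-MULTI", "CAUSAL-NATIVE")
```

Under these counts CAUSAL-MULTI issues 2.5 times as many requests as CAUSAL-NATIVE, so the accuracy gain has to justify that multiplier for your workload.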

Results

Accuracy

Value: +3.5% vs best competing method

Baseline: best competing method (varies by model/dataset)

Accuracy

Value: 0.738

Baseline: MULTI-RELATED 0.725 (example)

Inference requests per query (cost)

Value: CAUSAL-MULTI: 10 requests (N=3)

Baseline: CAUSAL-NATIVE: 4 requests (N=3)

Who Should Care

What To Try In 7 Days

Run CAUSAL-NATIVE (N=3) on a small multilingual QA sample to measure abstain accuracy vs your current setup.

Implement NDE vs TIE comparison with JSD scores to filter feedback before final decisions.

If cross-language robustness matters, run CAUSAL-MULTI on key languages and measure cost vs benefit using the 10-call protocol.
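For the NDE vs TIE step, a first prototype can be as small as a single comparison. The sketch below assumes you already have one NDE score and one TIE score per query (for example from JSD over repeated prompts) and that feedback is trusted when the feedback-mediated effect is the larger one. The comparison direction, the tie handling, and all names are assumptions; check them against the paper before relying on them.

```python
def choose_decision(nde, tie, direct_decision, feedback_decision):
    """Pick between the direct and the feedback-informed abstention decision.

    nde: score for the direct path (answer -> decision), assumed precomputed.
    tie: score for the feedback-mediated path, assumed precomputed.
    Assumed rule: follow feedback only when its effect strictly dominates.
    """
    if tie > nde:
        return feedback_decision
    return direct_decision

# Example: the feedback path dominates, so the feedback-informed decision wins.
final = choose_decision(nde=0.12, tie=0.41,
                        direct_decision="answer",
                        feedback_decision="abstain")
```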

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Higher inference cost: CAUSAL-MULTI uses ~10 calls per query at N=3, increasing API expense.
  • Relies on LLM-generated feedback quality; severely incorrect feedback can still mislead if not properly filtered.
  • Performance depends on choice of related languages and pretraining representation; low-resource languages remain challenging.

When Not To Use

  • When API cost per query must be minimal and you cannot afford multiple feedback calls.
  • When the target language lacks related languages in the chosen pool for CAUSAL-MULTI.
  • For tasks where abstention decisions require external factual verification rather than self-reflection.

Failure Modes

  • Noisy or adversarial feedback in multiple languages can skew aggregated votes.
  • Incorrect language relatedness choices may reduce performance or produce wrong abstain choices.
  • Smaller models may produce low-quality feedback; aggregation may not fully correct systematic biases.

Core Entities

Models

  • ChatGPT
  • GPT-4o
  • Aya-13B
  • LLaMa3.2
  • Phi4

Metrics

  • Accuracy
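The accuracy reported here is abstain accuracy. A common definition in abstention work, assumed here to match the paper's usage, counts a decision as right when the model answers a question it gets correct or abstains on one it would get wrong. The record format below is hypothetical.

```python
def abstain_accuracy(records):
    """Fraction of right abstention decisions.

    records: iterable of (answer_correct, abstained) boolean pairs (hypothetical format).
    A decision is right when exactly one of the two flags is True:
    answered and correct, or abstained and incorrect.
    """
    records = list(records)
    if not records:
        raise ValueError("need at least one record")
    right = sum(correct != abstained for correct, abstained in records)
    return right / len(records)

sample = [
    (True, False),   # answered, correct: right decision
    (False, True),   # abstained, would have been wrong: right decision
    (True, True),    # abstained on a correct answer: wrong decision
    (False, False),  # answered incorrectly: wrong decision
]
score = abstain_accuracy(sample)
```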

Datasets

  • M-MMLU (Multilingual MMLU)
  • M-Hellaswag (Multilingual Hellaswag)