Mask untruthful parts of context to cut hallucinations and keep helpful facts

March 12, 2024 · 7 min

Overview

Production Readiness

0.7

Novelty Score

0.5

Cost Impact Score

0.6

Citation Count

0

Authors

Tian Yu, Shaolei Zhang, Yang Feng

Links

Abstract / PDF

Why It Matters For Business

TACS reduces hallucinations caused by bad context and is cheap to add to retrieval or prompt pipelines; it raises answer correctness on tested QA tasks without retraining the base model.

Summary TLDR

The paper introduces TACS, a lightweight add-on that detects which tokens or sentences in an input context are likely true, then masks out the rest so the LLM ignores misleading snippets. TACS trains small SVM classifiers on layer activations inside an LLM, builds token- or sentence-level attention masks, and plugs the masks into generation. On TruthfulQA and ConflictQA, TACS raises answer accuracy. On TruthfulQA (single info), for example, Llama 2-Chat goes from 49.1% to 62.5% and Mistral from 54.7% to 77.1%. The classifiers generalize across similar 7B models and train in minutes. TACS does not inject new facts; it only filters context, so it helps when the model already knows the truth but may be misled by bad context.

Problem Statement

LLMs often follow coherent but false context from users or retrieval systems and produce hallucinations. We need a fast way to let LLMs accept helpful external facts while rejecting misleading or fabricated context without retraining the full model.

Main Contribution

TACS: a lightweight pipeline that detects truth at token/sentence level using classifiers on internal LLM activations and masks out low-truth tokens via attention masking.
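The masking step can be illustrated with a minimal sketch. It assumes per-token truth scores have already been produced by the classifiers; the threshold of 0.5, the function names, and the use of a mean score per sentence are hypothetical choices for illustration, not details confirmed by the summary.

```python
def token_mask(truth_scores, threshold=0.5):
    """Return 1 for tokens to keep and 0 for tokens to hide from attention."""
    return [1 if score >= threshold else 0 for score in truth_scores]


def sentence_mask(truth_scores, sentence_spans, threshold=0.5):
    """Keep or drop whole sentences based on their mean token truth score.

    sentence_spans holds (start, end) token indices, with end exclusive.
    Averaging token scores per sentence is an assumption of this sketch.
    """
    mask = [0] * len(truth_scores)
    for start, end in sentence_spans:
        scores = truth_scores[start:end]
        if not scores:
            continue  # skip empty spans instead of dividing by zero
        keep = sum(scores) / len(scores) >= threshold
        for i in range(start, end):
            mask[i] = 1 if keep else 0
    return mask


# Hypothetical scores for a six-token context made of two sentences.
scores = [0.9, 0.8, 0.2, 0.7, 0.1, 0.3]
spans = [(0, 3), (3, 6)]

token_level = token_mask(scores)
sentence_level = sentence_mask(scores, spans)
```

In this toy input, token-level masking drops the single low-scoring token inside the first sentence, while sentence-level masking keeps that sentence whole and drops the second one entirely. In a real pipeline the resulting list would be supplied as the attention mask over the context tokens at generation time.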

A new metric, Disturbance Adaptation Rate (DA Rate), to measure how well a model accepts truthful info and resists untruthful info.
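A rough sketch of how such a metric could be computed follows. It reads TA as the rate of accepting truthful information and UR as the rate of resisting untruthful information, and it assumes DA Rate is the plain average of the two. Those readings, the record format, and the function name are assumptions of this sketch; the paper's exact definitions may differ.

```python
def adaptation_rates(records):
    """Compute (TA, UR, DA) from (context_is_truthful, model_followed_context) pairs.

    TA: share of truthful-context cases where the model followed the context.
    UR: share of untruthful-context cases where the model did not follow it.
    DA: mean of TA and UR (an assumption of this sketch).
    """
    truthful = [followed for is_true, followed in records if is_true]
    untruthful = [followed for is_true, followed in records if not is_true]
    ta = sum(truthful) / len(truthful)
    ur = sum(not followed for followed in untruthful) / len(untruthful)
    return ta, ur, (ta + ur) / 2


# Hypothetical evaluation log: four truthful and four untruthful contexts.
records = [
    (True, True), (True, True), (True, False), (True, True),
    (False, False), (False, True), (False, False), (False, True),
]
ta, ur, da = adaptation_rates(records)
```

Tracking the two component rates separately shows whether a change makes the model more accepting, more resistant, or both, which a single accuracy number would hide.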

Empirical validation on TruthfulQA and ConflictQA showing consistent accuracy and factuality gains across several 7B models and practical speed (SVMs train in minutes).

Key Findings

TACS substantially improves multiple-choice accuracy when context may be misleading.

Numbers: Llama 2-Chat: Accuracy 49.1% → 62.5% (+13.4 pp) on TruthfulQA (single info)

Some models gain very large improvements from TACS.

Numbers: Mistral-7B-Instruct: Accuracy 54.7% → 77.1% (+22.4 pp) on TruthfulQA (single info)

TACS improves open-ended factuality scores (True*Info) and probabilistic selection (MC averages).

Numbers: Mistral True*Info 52.7% → 58.0% (+5.3 pp); MC avg 49.0 → 57.7 (+8.7 pp)

Training and ensembling the classifiers is fast, and the classifiers are compact.

Numbers: SVM classifiers train in about two minutes on TruthfulQA

Classifiers trained on one model generalize to similar models.

Numbers: Vicuna-v1.5 MC avg 37.1 → 45.6 (+8.5 pp) with TACS using SVMs trained on Llama 2-Chat

Results

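Accuracy (%) (TruthfulQA, single info; Llama 2-Chat)

Value: 62.5

Baseline: 49.1

Accuracy (%) (TruthfulQA, single info; Mistral-7B-Instruct)

Value: 77.1

Baseline: 54.7

Probabilistic multiple-choice MC average (TruthfulQA, single info; Mistral-7B-Instruct)

Value: 57.7

Baseline: 49.0

True*Info (%) (open-ended, TruthfulQA, single info; Mistral-7B-Instruct)

Value: 58.0

Baseline: 52.7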

Who Should Care

What To Try In 7 Days

Train token-level SVMs on layer activations from your 7B model using a small labeled set

Integrate attention masks from TACS into your generation step and compare accuracy on your QA data

Measure DA Rate (TA/UR/DA) to track acceptance vs resistance to retrieved facts
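As a sketch of the first step, the snippet below trains a tiny linear SVM on toy vectors standing in for layer activations. It is a standard-library illustration using full-batch subgradient descent on the hinge loss, not the paper's training code; a real pipeline would more likely extract activations from the model and use an off-the-shelf SVM implementation. The data, hyperparameters, and function names are all hypothetical.

```python
# Toy stand-ins for per-token activation vectors from one LLM layer.
# Real activations have thousands of dimensions; these values are made up.
truthful = [[2.0, 1.5], [1.8, 2.2], [2.5, 1.9], [1.6, 1.7]]
untruthful = [[-2.0, -1.5], [-1.7, -2.1], [-2.4, -1.8], [-1.5, -1.9]]

X = truthful + untruthful
y = [1] * len(truthful) + [-1] * len(untruthful)


def truth_score(w, b, x):
    """Signed distance-like score: positive means 'looks truthful'."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def train_linear_svm(X, y, lr=0.1, reg=0.01, epochs=50):
    """Full-batch subgradient descent on the L2-regularized hinge loss."""
    dim, n = len(X[0]), len(X)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w = [reg * wj for wj in w]
        grad_b = 0.0
        for xi, yi in zip(X, y):
            if yi * truth_score(w, b, xi) < 1:  # point is inside the margin
                for j in range(dim):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


w, b = train_linear_svm(X, y)
predictions = [1 if truth_score(w, b, x) > 0 else -1 for x in X]
train_accuracy = sum(p == t for p, t in zip(predictions, y)) / len(y)
```

The scores returned by `truth_score` are what a masking step would compare against a threshold. Because the toy classes are cleanly separated, this example says nothing about how separable real activations are; that has to be checked on your own labeled set.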

Agent Features

Memory

  • parametric memory (model internal knowledge)

Tool Use

  • retrieval augmentation

Frameworks

  • attention masking
  • SVM-based truth classifiers

Reproducibility

Data Urls

  • TruthfulQA
  • ConflictQA (constructed from PopQA/StrategyQA)

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • TACS only filters input context; it does not supply new or corrected facts if the model lacks knowledge
  • Performance depends on the model's internal representations and the chosen threshold; thresholds differ across datasets
  • Fine-grained masking risks removing true tokens when truth scores sit near the threshold
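The threshold sensitivity described above can be made concrete with a small sweep. The tokens, scores, and labels below are invented for illustration; the point is only to show the trade-off between dropping true tokens and keeping false ones as the threshold moves.

```python
# Hypothetical (token, truth_score, is_actually_true) triples.
tokens = [
    ("Paris", 0.92, True),
    ("is", 0.55, True),
    ("the", 0.48, True),
    ("capital", 0.51, True),
    ("of", 0.47, True),
    ("Mars", 0.12, False),
]


def sweep(tokens, thresholds):
    """For each threshold, count true tokens masked out and false tokens kept."""
    rows = []
    for t in thresholds:
        dropped_true = sum(1 for _, s, ok in tokens if ok and s < t)
        kept_false = sum(1 for _, s, ok in tokens if not ok and s >= t)
        rows.append((t, dropped_true, kept_false))
    return rows


rows = sweep(tokens, [0.1, 0.3, 0.5, 0.6])
for threshold, dropped_true, kept_false in rows:
    print(f"threshold={threshold}: dropped_true={dropped_true}, kept_false={kept_false}")
```

With these made-up scores, a threshold of 0.3 removes the false token without losing any true ones, while 0.5 already masks two true tokens whose scores sit just below it. Running the same kind of sweep on a held-out labeled set is one way to choose a threshold per dataset.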

When Not To Use

  • When you must inject new verified facts that the LLM lacks
  • When the context must be preserved verbatim for provenance or legal reasons
  • When you lack labeled examples to train even small truth classifiers

Failure Modes

  • Discarding true but low-score tokens that hover near the threshold, reducing helpful context
  • Overcautious self-judgment by the LLM (self-selection performed poorly in experiments)
  • Coherent adversarial fabrications with features similar to truthful text may still pass detection

Core Entities

Models

  • Llama 2-Chat 7B
  • Mistral-7B-Instruct-v0.2
  • Llama 2 7B
  • Vicuna-7B-v1.5

Metrics

  • Accuracy
  • MC1/MC2/MC3
  • True (%)
  • Info (%)
  • True*Info (%)
  • TA Rate
  • UR Rate
  • DA Rate

Datasets

  • TruthfulQA
  • ConflictQA
  • PopQA (subset used)
  • StrategyQA (source of ConflictQA)

Benchmarks

  • TruthfulQA
  • ConflictQA
  • Disturbance Adaptation Rate (DA Rate)