Mask untruthful parts of context to cut hallucinations and keep helpful facts

March 12, 2024 · 7 min

Overview

Decision Snapshot: Ready For Pilot

TACS is practical and low-cost: small SVMs, fast training, and clear accuracy gains on 7B models. Effect size depends on base model and dataset; it filters context but does not add new facts.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 70%

Novelty: 50%

Authors

Tian Yu, Shaolei Zhang, Yang Feng

Why It Matters For Business

TACS reduces hallucinations caused by bad context and is cheap to add to retrieval or prompt pipelines; it raises answer correctness on tested QA tasks without retraining the base model.

Who Should Care

Summary TLDR

The paper introduces TACS, a lightweight add-on that detects which tokens or sentences in an input context are likely true, then masks out the rest so the LLM ignores misleading snippets. TACS trains small SVM classifiers on layer activations inside an LLM, builds token- or sentence-level attention masks, and plugs the masks into generation. On TruthfulQA and ConflictQA, TACS raises answer accuracy (examples: Llama-2-Chat 49.1% → 62.5% on TruthfulQA single-info; Mistral 54.7% → 77.1%), generalizes across similar 7B models, and trains in minutes. It does not inject new facts — it only filters context — so it helps when the model already knows the truth but may be misled by bad context.
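As a rough illustration of the masking step described above, the sketch below turns per-token truth scores into a token-level and a sentence-level attention mask. The function names, the 0.5 threshold, the example scores, and the choice to score a sentence by the mean of its token scores are all assumptions for illustration; the summary does not specify them.

```python
from statistics import mean


def token_mask(scores, threshold=0.5):
    """Keep (1) tokens whose truth score reaches the threshold, mask (0) the rest."""
    return [1 if s >= threshold else 0 for s in scores]


def sentence_mask(scores, sentence_ids, threshold=0.5):
    """Keep or mask whole sentences based on the mean score of their tokens.

    Averaging token scores per sentence is an assumption of this sketch.
    """
    by_sentence = {}
    for s, sid in zip(scores, sentence_ids):
        by_sentence.setdefault(sid, []).append(s)
    keep = {sid: mean(vals) >= threshold for sid, vals in by_sentence.items()}
    return [1 if keep[sid] else 0 for sid in sentence_ids]


# Hypothetical truth scores for six context tokens spread over two sentences.
scores = [0.9, 0.8, 0.4, 0.2, 0.3, 0.6]
sentence_ids = [0, 0, 0, 1, 1, 1]

tok = token_mask(scores)                    # per-token decision
sent = sentence_mask(scores, sentence_ids)  # per-sentence decision
```

In this toy input the token-level mask drops the low-scoring third token but keeps the last one, while the sentence-level mask keeps the first sentence whole and drops the second. The resulting 0/1 list would then be combined with the model's usual attention mask over the context positions; how that is wired in depends on the model implementation.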

Problem Statement

LLMs often follow coherent but false context from users or retrieval systems and produce hallucinations. We need a fast way to let LLMs accept helpful external facts while rejecting misleading or fabricated context without retraining the full model.

Main Contribution

TACS: a lightweight pipeline that detects truth at token/sentence level using classifiers on internal LLM activations and masks out low-truth tokens via attention masking.

A new metric, Disturbance Adaptation Rate (DA Rate), to measure how well a model accepts truthful info and resists untruthful info.
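To make the metric concrete, here is a small sketch of how the three rates could be tallied. The summary does not give the formulas, so this assumes TA Rate is the share of truthful-context cases the model answers correctly, UR Rate is the share of untruthful-context cases it still answers correctly, and DA Rate is the mean of the two. The record format and the example data are hypothetical.

```python
def adaptation_rates(records):
    """Compute TA, UR and DA rates under the assumptions stated above.

    records: iterable of (context_is_truthful, answer_is_correct) pairs.
    """
    truthful = [ok for is_true, ok in records if is_true]
    untruthful = [ok for is_true, ok in records if not is_true]
    if not truthful or not untruthful:
        raise ValueError("need at least one truthful and one untruthful case")
    ta = sum(truthful) / len(truthful)      # accepted helpful context
    ur = sum(untruthful) / len(untruthful)  # resisted misleading context
    return {"TA": ta, "UR": ur, "DA": (ta + ur) / 2}


# Hypothetical evaluation log: four truthful and four untruthful contexts.
records = [
    (True, True), (True, True), (True, False), (True, True),
    (False, True), (False, False), (False, False), (False, True),
]
rates = adaptation_rates(records)
```

Reporting TA and UR alongside DA is useful because a model can score well on one by blindly trusting or blindly ignoring context; only the combined figure penalizes both behaviors.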

Key Findings

TACS substantially improves multiple-choice accuracy when context may be misleading.

Numbers: Llama 2-Chat: Accuracy 49.1% → 62.5% (+13.4 pp) on TruthfulQA (single info)

Practical Use: Add TACS to RAG or input preprocessing to get large accuracy gains when external context is noisy or adversarial.

Evidence Ref: Table 1 (generative multiple-choice, TruthfulQA single)

Some models gain very large improvements from TACS.

Numbers: Mistral-7B-Instruct: Accuracy 54.7% → 77.1% (+22.4 pp) on TruthfulQA (single info)

Practical Use: TACS can be especially valuable for models whose base answers are sensitive to added context; expect big wins on vulnerable models.

Evidence Ref: Table 1 (generative multiple-choice, TruthfulQA single)

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | 62.5% | 49.1% | +13.4 pp | TruthfulQA (single information) | Table 1: Llama 2-Chat + TACS-T vs baseline | Table 1
Accuracy | 77.1% | 54.7% | +22.4 pp | TruthfulQA (single information) | Table 1: Mistral-7B-Instruct + TACS-T vs baseline | Table 1
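The deltas above are percentage-point differences, not relative gains. A short check of the arithmetic, using only the four accuracy figures reported above (the helper name is ours):

```python
def gain(baseline, with_tacs):
    """Return (percentage-point delta, relative gain in percent), rounded to 0.1."""
    delta_pp = round(with_tacs - baseline, 1)
    relative_pct = round(100 * (with_tacs - baseline) / baseline, 1)
    return delta_pp, relative_pct


llama = gain(49.1, 62.5)    # Llama 2-Chat, TruthfulQA single info
mistral = gain(54.7, 77.1)  # Mistral-7B-Instruct, TruthfulQA single info
```

Relative to their baselines, the gains work out to roughly 27% for Llama 2-Chat and 41% for Mistral-7B-Instruct, which is worth keeping in mind when comparing against results reported as relative improvements.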

What To Try In 7 Days

Train token-level SVMs on layer activations from your 7B model using a small labeled set

Integrate attention masks from TACS into your generation step and compare accuracy on your QA data

Measure DA Rate (TA/UR/DA) to track acceptance vs resistance to retrieved facts
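The first step above can be prototyped before touching a real model. The sketch below trains a linear SVM by subgradient descent on the hinge loss, using synthetic 2-D vectors as stand-ins for layer activations. The data, the hyperparameters, and the omission of a bias term are assumptions for illustration; in practice an off-the-shelf SVM implementation (for example scikit-learn's) fitted on real activations would be the natural choice.

```python
import random


def train_linear_svm(xs, ys, lam=0.01, lr=0.1, epochs=20, seed=0):
    """Fit weights w by subgradient descent on L2-regularised hinge loss.

    No bias term: the toy data below is centred on the origin.
    """
    rng = random.Random(seed)
    w = [0.0] * len(xs[0])
    order = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            margin = ys[i] * sum(wj * xj for wj, xj in zip(w, xs[i]))
            # The regularisation term shrinks w on every step.
            w = [wj * (1 - lr * lam) for wj in w]
            if margin < 1:
                # Hinge loss is active: move w towards the correct side.
                w = [wj + lr * ys[i] * xj for wj, xj in zip(w, xs[i])]
    return w


def truth_score(w, x):
    """Raw decision value w.x; positive means 'likely truthful'."""
    return sum(wj * xj for wj, xj in zip(w, x))


# Synthetic stand-ins for activations: truthful tokens (+1) cluster near
# (2, 2), untruthful tokens (-1) near (-2, -2).
rng = random.Random(42)
xs, ys = [], []
for label, centre in ((1, 2.0), (-1, -2.0)):
    for _ in range(20):
        xs.append([centre + rng.uniform(-0.5, 0.5) for _ in range(2)])
        ys.append(label)

w = train_linear_svm(xs, ys)
accuracy = sum((truth_score(w, x) > 0) == (y > 0) for x, y in zip(xs, ys)) / len(xs)
```

The toy clusters are cleanly separable, so the classifier fits them perfectly; real activations will not be this tidy, which is why the second and third steps (end-to-end accuracy and DA Rate on your own QA data) matter more than classifier accuracy alone.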

Agent Features

Memory
parametric memory (model internal knowledge)
Tool Use
retrieval augmentation
Frameworks
attention masking, SVM-based truth classifiers

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Data URLs

TruthfulQA, ConflictQA (constructed from PopQA/StrategyQA)

Risks & Boundaries

Limitations

TACS only filters input context; it does not supply new or corrected facts if the model lacks knowledge

Performance depends on the model's internal representations and the chosen threshold; thresholds differ across datasets
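Because the threshold is dataset-dependent, one simple option is to tune it on a small labeled validation set. The sketch below picks the candidate threshold with the highest keep/mask accuracy. The selection criterion, the candidate grid, and the example data are assumptions for illustration, not the paper's procedure.

```python
def pick_threshold(scores, labels, candidates):
    """Return (threshold, accuracy) maximising agreement with truth labels.

    A token counts as 'kept' when score >= threshold; labels are True for
    truthful tokens. Ties go to the earliest candidate.
    """
    best_t, best_acc = None, -1.0
    for t in candidates:
        correct = sum((s >= t) == y for s, y in zip(scores, labels))
        acc = correct / len(scores)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc


# Hypothetical validation scores with their ground-truth labels.
scores = [0.95, 0.80, 0.65, 0.55, 0.45, 0.30, 0.10]
labels = [True, True, True, False, True, False, False]
candidates = [0.2, 0.4, 0.6, 0.8]

threshold, acc = pick_threshold(scores, labels, candidates)
```

In this toy set no candidate is perfect: a truthful token scores 0.45 and an untruthful one scores 0.55, so any cut-off between them misclassifies one of the two. Plain accuracy treats both errors equally; a pipeline that cares more about resisting false context than about keeping every helpful token could weight them differently.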

When Not To Use

When you must inject new verified facts that the LLM lacks

When the context must be preserved verbatim for provenance or legal reasons

Failure Modes

Discarding true but low-score tokens that hover near the threshold, reducing helpful context

Overcautious self-judgment by the LLM (self-selection performed poorly in experiments)

Core Entities

Models

Llama 2-Chat 7B, Mistral-7B-Instruct-v0.2, Llama 2 7B, Vicuna-7B-v1.5

Metrics

Accuracy, MC1/MC2/MC3, True (%), Info (%), True*Info (%), TA Rate, UR Rate, DA Rate

Datasets

TruthfulQA, ConflictQA, PopQA (subset used), StrategyQA (source of ConflictQA)

Benchmarks

TruthfulQA, ConflictQA, Disturbance Adaptation Rate (DA Rate)