Mask untruthful parts of context to cut hallucinations and keep helpful facts

Overview

Decision Snapshot: Ready For Pilot

TACS is practical and low-cost: small SVMs, fast training, and clear accuracy gains on 7B models. Effect size depends on base model and dataset; it filters context but does not add new facts.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 70%

Novelty: 50%

Authors

Tian Yu, Shaolei Zhang, Yang Feng

Why It Matters For Business

TACS reduces hallucinations caused by bad context and is cheap to add to retrieval or prompt pipelines; it raises answer correctness on tested QA tasks without retraining the base model.

Who Should Care

CTO, Product Manager, ML Engineer, Data Scientist

Summary TLDR

The paper introduces TACS, a lightweight add-on that detects which tokens or sentences in an input context are likely true, then masks out the rest so the LLM ignores misleading snippets. TACS trains small SVM classifiers on layer activations inside an LLM, builds token- or sentence-level attention masks, and plugs the masks into generation. On TruthfulQA and ConflictQA, TACS raises answer accuracy (examples: Llama-2-Chat 49.1% → 62.5% on TruthfulQA single-info; Mistral 54.7% → 77.1%), generalizes across similar 7B models, and trains in minutes. It does not inject new facts — it only filters context — so it helps when the model already knows the truth but may be misled by bad context.
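
A minimal sketch of the masking step described above, assuming per-token truth scores are already available (for example, SVM decision values where higher means more likely true). The function names, the zero threshold, and the mean-pooling used for sentence-level masks are illustrative assumptions, not the authors' implementation.

```python
from typing import Dict, List


def build_attention_mask(truth_scores: List[float], threshold: float = 0.0) -> List[int]:
    """Token-level mask: 1 keeps a token, 0 hides it from the model."""
    return [1 if score >= threshold else 0 for score in truth_scores]


def sentence_level_mask(
    truth_scores: List[float], sentence_ids: List[int], threshold: float = 0.0
) -> List[int]:
    """Sentence-level mask: average token scores per sentence (an assumption),
    then keep or drop every token of that sentence together."""
    totals: Dict[int, float] = {}
    counts: Dict[int, int] = {}
    for score, sid in zip(truth_scores, sentence_ids):
        totals[sid] = totals.get(sid, 0.0) + score
        counts[sid] = counts.get(sid, 0) + 1
    keep = {sid: totals[sid] / counts[sid] >= threshold for sid in totals}
    return [1 if keep[sid] else 0 for sid in sentence_ids]


# Hypothetical scores for six context tokens that form two sentences.
scores = [0.8, 0.3, -0.5, -0.9, -0.4, 0.6]
sentences = [0, 0, 0, 1, 1, 1]
token_mask = build_attention_mask(scores)               # [1, 1, 0, 0, 0, 1]
sentence_mask = sentence_level_mask(scores, sentences)  # [1, 1, 1, 0, 0, 0]
```

The token-level mask drops individual low-score tokens, while the sentence-level mask keeps the first sentence whole and drops the second, which trades precision for more readable surviving context.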

Problem Statement

LLMs often follow coherent but false context from users or retrieval systems and produce hallucinations. We need a fast way to let LLMs accept helpful external facts while rejecting misleading or fabricated context without retraining the full model.

Main Contribution

TACS: a lightweight pipeline that detects truth at token/sentence level using classifiers on internal LLM activations and masks out low-truth tokens via attention masking.

A new metric, Disturbance Adaptation Rate (DA Rate), to measure how well a model accepts truthful info and resists untruthful info.
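
This summary does not give the formulas behind DA Rate, so the sketch below rests on stated assumptions: TA Rate is taken as the share of truthful-context cases the model answers correctly, UR Rate as the share of untruthful-context cases it still answers correctly, and DA Rate as their plain average. The record format and function name are hypothetical; check the paper for the exact definitions before reporting numbers.

```python
from typing import Iterable, Tuple


def disturbance_adaptation(records: Iterable[Tuple[bool, bool]]) -> Tuple[float, float, float]:
    """Each record is (context_is_truthful, answer_is_correct).

    Returns (TA, UR, DA) under the simplified definitions assumed here.
    """
    truthful, untruthful = [], []
    for context_is_truthful, answer_is_correct in records:
        (truthful if context_is_truthful else untruthful).append(answer_is_correct)
    ta = sum(truthful) / len(truthful) if truthful else 0.0
    ur = sum(untruthful) / len(untruthful) if untruthful else 0.0
    da = (ta + ur) / 2  # assumed: unweighted mean of acceptance and resistance
    return ta, ur, da


# Invented evaluation log: four truthful-context cases, two untruthful ones.
log = [(True, True), (True, True), (True, False), (True, True),
       (False, True), (False, False)]
ta, ur, da = disturbance_adaptation(log)  # 0.75, 0.5, 0.625
```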

Key Findings

TACS substantially improves multiple-choice accuracy when context may be misleading.

Numbers: Llama 2-Chat: Accuracy 49.1% → 62.5% (+13.4 pp) on TruthfulQA (single info)

Practical Use: Add TACS to RAG or input-preprocessing to get large accuracy gains when external context is noisy or adversarial.

Evidence Ref: Table 1 (generative multiple-choice, TruthfulQA single)

Some models gain very large improvements from TACS.

Numbers: Mistral-7B-Instruct: Accuracy 54.7% → 77.1% (+22.4 pp) on TruthfulQA (single info)

Practical Use: TACS can be especially valuable for models whose base answers are sensitive to added context; expect big wins on vulnerable models.

Evidence Ref: Table 1 (generative multiple-choice, TruthfulQA single)

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	62.5	49.1	+13.4 pp	TruthfulQA (single information)	Table 1: Llama 2-Chat + TACS-T vs baseline	Table 1
Accuracy	77.1	54.7	+22.4 pp	TruthfulQA (single information)	Table 1: Mistral-7B-Instruct + TACS-T vs baseline	Table 1
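
The Delta column is in percentage points. A short check, using only the numbers in the table, recomputes it and also shows the relative gain over baseline; the relative figure is derived here and is not reported in the source table.

```python
def gains(baseline: float, value: float) -> tuple:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    delta_pp = round(value - baseline, 1)
    relative_pct = round(100 * (value - baseline) / baseline, 1)
    return delta_pp, relative_pct


llama_gain = gains(49.1, 62.5)    # (13.4, 27.3)
mistral_gain = gains(54.7, 77.1)  # (22.4, 41.0)
```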

What To Try In 7 Days

Train token-level SVMs on layer activations from your 7B model using a small labeled set

Integrate attention masks from TACS into your generation step and compare accuracy on your QA data

Measure DA Rate (TA/UR/DA) to track acceptance vs resistance to retrieved facts
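
For the first step in the list, a dependency-free sketch of training a linear SVM with hinge-loss SGD. The two-dimensional vectors are hand-made stand-ins for layer activations, the labels (+1 truthful, -1 untruthful) are invented, and the bias term is dropped on the assumption that features are centered. The source says only that TACS uses small SVMs on activations; the optimizer, hyperparameters, and function names here are hypothetical, and in practice a library SVM would be the usual choice.

```python
import random
from typing import List


def truth_score(w: List[float], x: List[float]) -> float:
    """Signed distance-like score: positive means 'looks truthful'."""
    return sum(wj * xj for wj, xj in zip(w, x))


def train_linear_svm(
    X: List[List[float]], y: List[int],
    epochs: int = 50, lr: float = 0.1, reg: float = 0.01, seed: int = 0,
) -> List[float]:
    """Bias-free linear SVM trained with SGD on the L2-regularized hinge loss."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            margin = y[i] * truth_score(w, X[i])
            # L2 shrinkage on every step.
            w = [wj * (1 - lr * reg) for wj in w]
            # Hinge-loss update only when the margin is violated.
            if margin < 1:
                w = [wj + lr * y[i] * xj for wj, xj in zip(w, X[i])]
    return w


# Stand-in "activations": three truthful tokens, three untruthful tokens.
X = [[1.0, 0.9], [0.8, 1.1], [1.2, 1.0],
     [-1.0, -0.8], [-0.9, -1.1], [-1.1, -1.0]]
y = [1, 1, 1, -1, -1, -1]

w = train_linear_svm(X, y)
scores = [truth_score(w, x) for x in X]  # positive for the first three, negative for the rest
```

With real data, each row of X would be the hidden-state vector of one context token at a chosen layer, and the resulting scores would feed the thresholding step that builds the attention mask.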

Agent Features

Memory

parametric memory (model internal knowledge)

Tool Use

retrieval augmentation

Frameworks

attention masking, SVM-based truth classifiers

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/ictnlp/TACS

Data URLs

TruthfulQA, ConflictQA (constructed from PopQA/StrategyQA)

Risks & Boundaries

Limitations

TACS only filters input context; it does not supply new or corrected facts if the model lacks knowledge

Performance depends on the model's internal representations and the chosen threshold; thresholds differ across datasets
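
Because the threshold is dataset-dependent, one hedged way to handle it is a small sweep on a labeled validation set. The sketch below is a hypothetical procedure, not the paper's: it scores each candidate threshold by how often "keep if score >= threshold" agrees with the truth label, and the scores, labels, and candidate grid are invented.

```python
from typing import List, Tuple


def pick_threshold(
    scores: List[float], labels: List[bool], candidates: List[float]
) -> Tuple[float, float]:
    """Return (best_threshold, accuracy); ties go to the earliest candidate."""
    best_t, best_acc = candidates[0], -1.0
    for t in candidates:
        # A token is "kept" when its score clears the threshold.
        correct = sum((s >= t) == is_true for s, is_true in zip(scores, labels))
        acc = correct / len(scores)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc


val_scores = [-1.2, -0.4, 0.1, 0.3, 0.9, 1.5]
val_labels = [False, False, False, True, True, True]
best = pick_threshold(val_scores, val_labels, [-0.5, 0.0, 0.2, 0.5])  # (0.2, 1.0)
```

Re-running the sweep per dataset, rather than reusing one value, is the simplest response to the limitation above.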

When Not To Use

When you must inject new verified facts that the LLM lacks

When the context must be preserved verbatim for provenance or legal reasons

Failure Modes

Discarding true but low-score tokens that hover near the threshold, reducing helpful context

Overcautious self-judgment by the LLM (self-selection performed poorly in experiments)

Core Entities

Models

Llama 2-Chat 7B, Mistral-7B-Instruct-v0.2, Llama 2 7B, Vicuna-7B-v1.5

Metrics

Accuracy, MC1/MC2/MC3, True (%), Info (%), True*Info (%), TA Rate, UR Rate, DA Rate

Datasets

TruthfulQA, ConflictQA, PopQA (subset used), StrategyQA (source of ConflictQA)

Benchmarks

TruthfulQA, ConflictQA, Disturbance Adaptation Rate (DA Rate)
