Overview
Production Readiness
0.7
Novelty Score
0.5
Cost Impact Score
0.6
Citation Count
0
Why It Matters For Business
TACS reduces hallucinations caused by bad context and is cheap to add to retrieval or prompt pipelines; it raises answer correctness on tested QA tasks without retraining the base model.
Summary TLDR
The paper introduces TACS, a lightweight add-on that detects which tokens or sentences in an input context are likely true and masks out the rest so the LLM ignores misleading snippets. TACS trains small SVM classifiers on the LLM's layer activations, builds token- or sentence-level attention masks, and applies those masks during generation. On TruthfulQA and ConflictQA, TACS raises answer accuracy (for example, Llama-2-Chat 49.1% → 62.5% on TruthfulQA single-info; Mistral 54.7% → 77.1%), generalizes across similar 7B models, and trains in minutes. It does not inject new facts; it only filters context, so it helps when the model already knows the truth but may be misled by bad context.
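To make the classifier step concrete, here is a minimal sketch of a linear SVM trained on activation vectors. It is not the paper's implementation: the summary does not state the SVM kernel, the layers used, or the training settings, so this sketch swaps in a hand-rolled linear SVM (hinge loss with an L2 penalty, fitted by batch subgradient descent) to stay dependency-free. The function names, the toy 2-D "activations", and all hyperparameters are hypothetical.

```python
def train_linear_svm(features, labels, epochs=200, lr=0.1, reg=0.01):
    """Fit a linear SVM (hinge loss + L2 penalty) by batch subgradient descent.

    features: list of activation vectors (lists of floats)
    labels:   list of +1 (truthful) or -1 (untruthful)
    Hyperparameters are illustrative assumptions, not values from the paper.
    """
    dim = len(features[0])
    n = len(features)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = [reg * wi for wi in w]  # gradient of the L2 penalty
        grad_b = 0.0
        for x, y in zip(features, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1.0:  # margin violated, so the hinge term is active
                for i in range(dim):
                    grad_w[i] -= y * x[i] / n
                grad_b -= y / n
        w = [wi - lr * gi for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def truth_score(w, b, activation):
    """Signed distance-like score: positive suggests truthful, negative untruthful."""
    return sum(wi * xi for wi, xi in zip(w, activation)) + b


# Toy stand-in for per-token activations; real ones would come from the LLM.
features = [[2.0, 1.5], [1.5, 2.5], [2.5, 2.0],
            [-2.0, -1.5], [-1.5, -2.5], [-2.5, -2.0]]
labels = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(features, labels)
```

In practice a library SVM would likely replace the hand-rolled trainer; the point of the sketch is only that the classifier maps an activation vector to a scalar truth score that a later masking step can threshold.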
Problem Statement
LLMs often follow coherent but false context from users or retrieval systems and produce hallucinations. We need a fast way to let LLMs accept helpful external facts while rejecting misleading or fabricated context without retraining the full model.
Main Contribution
TACS: a lightweight pipeline that detects truth at token/sentence level using classifiers on internal LLM activations and masks out low-truth tokens via attention masking.
A new metric, Disturbance Adaptation Rate (DA Rate), to measure how well a model accepts truthful info and resists untruthful info.
Empirical validation on TruthfulQA and ConflictQA showing consistent accuracy and factuality gains across several 7B models and practical speed (SVMs train in minutes).
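The masking step in the first contribution can be sketched as follows. This is an illustration under stated assumptions, not the paper's code: the threshold value, the helper names, and the choice to aggregate a sentence by the mean of its token scores are all assumptions made for the example.

```python
def token_mask(truth_scores, threshold):
    """Return 1 for tokens to keep (score >= threshold) and 0 for tokens to mask."""
    return [1 if s >= threshold else 0 for s in truth_scores]


def sentence_mask(truth_scores, sentence_spans, threshold):
    """Keep or drop whole sentences based on their mean token truth score.

    sentence_spans: list of (start, end) token index pairs, end exclusive.
    Mean aggregation is an assumption made for this sketch.
    """
    mask = [0] * len(truth_scores)
    for start, end in sentence_spans:
        mean_score = sum(truth_scores[start:end]) / (end - start)
        if mean_score >= threshold:
            for i in range(start, end):
                mask[i] = 1
    return mask


# Hypothetical truth scores for six context tokens forming two sentences.
scores = [0.9, 0.8, 0.7, -0.4, -0.6, 0.2]
spans = [(0, 3), (3, 6)]

tok = token_mask(scores, threshold=0.0)
sent = sentence_mask(scores, spans, threshold=0.0)
```

With these toy scores the token-level mask keeps the last token of the second sentence while the sentence-level mask drops the whole sentence, which shows the granularity trade-off. The resulting 0/1 vector would then be applied to the context positions of the attention mask at generation time, leaving the question tokens unmasked.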
Key Findings
TACS substantially improves multiple-choice accuracy when context may be misleading.
Some models gain very large improvements from TACS (for example, Mistral rises from 54.7% to 77.1% accuracy).
TACS improves open-ended factuality scores (True*Info) and probabilistic selection (MC averages).
Training the classifiers and combining them into an ensemble is fast (SVMs train in minutes), and the resulting classifiers are compact.
Classifiers trained on one model generalize to similar models.
Results
Accuracy
Probabilistic multiple-choice MC average (TruthfulQA, single info)
True*Info (%) (open-ended, TruthfulQA, single info)
Who Should Care
What To Try In 7 Days
Train token-level SVMs on layer activations from your 7B model using a small labeled set
Integrate attention masks from TACS into your generation step and compare accuracy on your QA data
Measure the TA, UR, and DA Rates to track how well the model accepts truthful retrieved context and resists untruthful context
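The last step above could be tracked with a small helper like the one below. The exact formulas are not given in this summary, so the definitions here are assumptions to check against the paper: TA Rate is taken as the share of correct answers when the supplied context is truthful, UR Rate as the share of correct answers when the supplied context is untruthful, and DA Rate as the mean of the two. The record fields are hypothetical.

```python
def da_rates(records):
    """Compute TA, UR, and DA rates from evaluation records.

    records: list of dicts with boolean keys
      'context_truthful' - whether the supplied context was truthful
      'answer_correct'   - whether the model's answer was correct
    The definitions used here are assumptions, not the paper's stated formulas.
    """
    truthful = [r for r in records if r["context_truthful"]]
    untruthful = [r for r in records if not r["context_truthful"]]
    # Acceptance: correct answers when the context was truthful.
    ta = sum(r["answer_correct"] for r in truthful) / len(truthful) if truthful else 0.0
    # Resistance: correct answers despite untruthful context.
    ur = sum(r["answer_correct"] for r in untruthful) / len(untruthful) if untruthful else 0.0
    return {"TA": ta, "UR": ur, "DA": (ta + ur) / 2}


# Toy evaluation log: 4 truthful-context items (3 correct), 4 untruthful (1 correct).
records = (
    [{"context_truthful": True, "answer_correct": True}] * 3
    + [{"context_truthful": True, "answer_correct": False}]
    + [{"context_truthful": False, "answer_correct": True}]
    + [{"context_truthful": False, "answer_correct": False}] * 3
)
rates = da_rates(records)
```

Comparing these three numbers before and after adding the masks shows whether a gain comes from accepting good context, resisting bad context, or both.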
Agent Features
Memory
- parametric memory (model internal knowledge)
Tool Use
- retrieval augmentation
Frameworks
- attention masking
- SVM-based truth classifiers
Reproducibility
Code Urls
Data Urls
- TruthfulQA
- ConflictQA (constructed from PopQA/StrategyQA)
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- TACS only filters input context; it does not supply new or corrected facts if the model lacks knowledge
- Performance depends on the model's internal representations and the chosen threshold; thresholds differ across datasets
- Fine-grained masking risks removing true tokens when truth scores sit near the threshold
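Because the threshold matters and differs across datasets, one option is to tune it per dataset on a small labeled validation set. The sketch below is an assumed procedure, not the paper's: it sweeps a hypothetical list of candidate thresholds and picks the one that best separates truthful from untruthful tokens.

```python
def best_threshold(scores, labels, candidates):
    """Pick the candidate threshold with the highest keep/mask accuracy.

    scores:     truth scores for validation tokens
    labels:     True if the token is truthful (should be kept)
    candidates: thresholds to try; ties go to the earliest candidate
    """
    best, best_acc = candidates[0], -1.0
    for t in candidates:
        # A token is "kept" when its score reaches the threshold.
        correct = sum((s >= t) == y for s, y in zip(scores, labels))
        acc = correct / len(scores)
        if acc > best_acc:
            best, best_acc = t, acc
    return best, best_acc


# Toy validation data: three truthful tokens, three untruthful ones.
val_scores = [0.9, 0.6, 0.4, 0.1, -0.2, -0.5]
val_labels = [True, True, True, False, False, False]
threshold, accuracy = best_threshold(val_scores, val_labels, [-0.3, 0.0, 0.3, 0.6])
```

On this toy data, moving the threshold one step in either direction from the best candidate misclassifies a token that sits near the boundary, which is the near-threshold risk described in the last limitation.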
When Not To Use
- When you must inject new verified facts that the LLM lacks
- When the context must be preserved verbatim for provenance or legal reasons
- When you lack labeled examples to train even small truth classifiers
Failure Modes
- Discarding true but low-score tokens that hover near the threshold, reducing helpful context
- Overcautious self-judgment by the LLM (self-selection performed poorly in experiments)
- Coherent adversarial fabrications with features similar to truthful text may still pass detection
Core Entities
Models
- Llama 2-Chat 7B
- Mistral-7B-Instruct-v0.2
- Llama 2 7B
- Vicuna-7B-v1.5
Metrics
- Accuracy
- MC1/MC2/MC3
- True (%)
- Info (%)
- True*Info (%)
- TA Rate
- UR Rate
- DA Rate
Datasets
- TruthfulQA
- ConflictQA
- PopQA (subset used)
- StrategyQA (source of ConflictQA)
Benchmarks
- TruthfulQA
- ConflictQA
- Disturbance Adaptation Rate (DA Rate)

