Overview
Production Readiness
0.6
Novelty Score
0.65
Cost Impact Score
0.45
Citation Count
5
Why It Matters For Business
Products that flag misinformation must be robust to attackers who use LLMs to change tone; adding style-variant training and content-focused cues reduces false negatives and improves trustworthiness.
Summary TLDR
Stylistic features help today's fake-news detectors but are also a weakness: attackers can use LLMs to restyle fake articles so they read like trustworthy reporting, cutting detector F1 by up to 38%. The paper introduces SheepDog, a modular method that (1) generates style-diverse training copies with an LLM, (2) enforces consistent veracity predictions across those reframings, and (3) extracts content-centered debunking cues from an LLM as pseudo-labels. SheepDog improves robustness across three benchmarks and several backbones without losing accuracy on the original articles.
Problem Statement
Text-based fake-news detectors often rely on writing style. Powerful LLMs let attackers rewrite fake articles to mimic trustworthy publishers. This style camouflage can sharply degrade detector performance, so detectors must learn to judge content rather than style.
Main Contribution
Demonstrates a new attack vector where LLMs restyle articles to evade text-based fake-news detectors.
Proposes SheepDog: a style-agnostic detector using LLM-generated reframings, style-alignment training, and LLM-sourced content attributions.
Shows through extensive experiments that SheepDog improves adversarial robustness across PolitiFact, GossipCop, and LUN and works with multiple LM/LLM backbones.
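The three ingredients named above (a veracity loss on the original article, a consistency term across reframings, and a multi-label loss on LLM-sourced cues) can be combined into one training objective. The following is a minimal sketch under stated assumptions, not the paper's confirmed formulation: the use of KL divergence for the consistency term, the equal default weights, and all function names are illustrative choices.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, label):
    # Veracity loss on a single article (label is a class index).
    return -math.log(softmax(logits)[label])

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q) between two discrete distributions.
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def binary_cross_entropy(logits, targets):
    # Multi-label loss: one independent sigmoid per debunking cue.
    loss = 0.0
    for z, t in zip(logits, targets):
        prob = 1.0 / (1.0 + math.exp(-z))
        prob = min(max(prob, 1e-12), 1.0 - 1e-12)
        loss -= t * math.log(prob) + (1 - t) * math.log(1.0 - prob)
    return loss / len(targets)

def combined_loss(orig_logits, reframed_logits, label,
                  cue_logits, cue_targets,
                  align_weight=1.0, cue_weight=1.0):
    # ASSUMPTION: weights and the KL-based alignment term are illustrative.
    ce = cross_entropy(orig_logits, label)
    p_orig = softmax(orig_logits)
    if reframed_logits:
        align = sum(kl_divergence(p_orig, softmax(r))
                    for r in reframed_logits) / len(reframed_logits)
    else:
        align = 0.0
    cue = binary_cross_entropy(cue_logits, cue_targets)
    return ce + align_weight * align + cue_weight * cue
```

The alignment term is zero when the detector assigns the same distribution to an article and to all of its reframings, and grows as the predictions diverge, which is the behavior a style-agnostic detector is meant to have.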
Key Findings
State-of-the-art text-only detectors suffer large drops under LLM style attacks.
SheepDog raises adversarial robustness across benchmarks.
LLM reframings largely preserve content claims.
The reframing component is critical to SheepDog's gains.
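The first finding can be checked on any detector by scoring the same labeled test set twice: once on the original articles and once on LLM-restyled copies. Below is a small sketch of that comparison. It assumes binary labels with 1 meaning fake, and it reports the absolute F1 difference; whether the figures in this summary are absolute or relative drops is not stated here.

```python
def f1_score(y_true, y_pred, positive=1):
    # F1 for the positive (fake) class, computed from raw counts.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def f1_under_attack(y_true, pred_original, pred_restyled):
    # Returns (F1 on originals, F1 on restyled copies, absolute drop).
    base = f1_score(y_true, pred_original)
    attacked = f1_score(y_true, pred_restyled)
    return base, attacked, base - attacked

# Toy data (made up for illustration): restyling hides two of four fake articles.
labels        = [1, 1, 1, 1, 0, 0, 0, 0]
pred_original = [1, 1, 1, 1, 0, 0, 0, 0]
pred_restyled = [1, 1, 0, 0, 0, 0, 0, 0]
base, attacked, drop = f1_under_attack(labels, pred_original, pred_restyled)
```

A drop driven mostly by missed fake articles (lower recall on restyled copies) is the signature of the style camouflage described above.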
Results
Adversarial F1 drop under LLM style attack (best baseline)
SheepDog adversarial F1 (PolitiFact, set A)
SheepDog unperturbed F1 (LUN original test)
Ablation: remove reframings (SheepDog -R) F1
Claim entailment between original and objective reframings
Who Should Care
What To Try In 7 Days
Generate a small set of LLM-based style reframings for your labeled news and measure detection drop.
Add a style-alignment loss: enforce consistent predictions across original and reframed copies.
Elicit a short set of content-focused debunking labels (e.g., 'lack of credible sources'); use them as weak supervision.
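The data-preparation side of these steps can be prototyped before touching the detector. The sketch below assumes a hypothetical `llm` callable that maps a prompt string to a completion string; the style names, the prompt wording, and every cue except 'lack of credible sources' are illustrative assumptions rather than the paper's exact choices.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# ASSUMPTION: illustrative tones, not necessarily the paper's prompt set.
STYLES = ["objective and professional", "neutral", "sensational"]

# ASSUMPTION: only the first cue appears in the text above; the rest are examples.
CUES = [
    "lack of credible sources",
    "false or misleading information",
    "biased opinion",
]

@dataclass
class Example:
    text: str
    label: int  # 1 = fake, 0 = real
    reframings: List[str] = field(default_factory=list)
    cue_targets: List[int] = field(default_factory=list)

def build_example(text: str, label: int, llm: Callable[[str], str]) -> Example:
    # Step 1: one style-diverse copy per tone.
    reframings = [
        llm(f"Rewrite the following article in a {style} tone, "
            f"keeping all claims unchanged:\n{text}")
        for style in STYLES
    ]
    # Step 3: multi-hot weak labels from a free-text LLM answer.
    answer = llm(
        "Which of these issues apply to the article? Options: "
        + "; ".join(CUES) + "\nArticle:\n" + text
    ).lower()
    cue_targets = [1 if cue in answer else 0 for cue in CUES]
    return Example(text, label, reframings, cue_targets)

def consistency_rate(predict: Callable[[str], int], ex: Example) -> float:
    # Step 2 (diagnostic): share of reframings predicted like the original.
    if not ex.reframings:
        return 1.0
    base = predict(ex.text)
    return sum(predict(r) == base for r in ex.reframings) / len(ex.reframings)
```

Substring matching on the LLM answer is deliberately crude; in practice a constrained output format would make the pseudo-labels less noisy. A low `consistency_rate` on held-out articles suggests the detector is reacting to style rather than content.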
Reproducibility
License
- CC BY 4.0
Data Urls
- FakeNewsNet (PolitiFact, GossipCop)
- LUN (Labeled Unreliable News)
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- Relies on external LLMs to generate reframings and pseudo-labels, adding cost and potential bias.
- LLM reframings preserve claims ~86–89% of the time; some reframings change content and can mislead the detector.
- Evaluation limited to text-only datasets; multimodal news not studied.
When Not To Use
- When reliable LLM access is unavailable or the budget for LLM calls is too small.
- On tiny labeled datasets where generating many reframings risks overfitting to synthetic style noise.
Failure Modes
- LLM-generated reframings that alter factual content create false negatives or incorrect pseudo-labels.
- Attribution pseudo-labels can be noisy and push the model toward LLM biases.
- Attackers could alter facts, not just style, which this style-agnostic method does not directly address.
Core Entities
Models
- SheepDog
- RoBERTa
- BERT
- DeBERTa
- GPT-3.5
- InstructGPT
- LLaMA2-13B
Metrics
- Accuracy
- Macro-F1
- F1 Score
Datasets
- PolitiFact
- GossipCop
- LUN
- FakeNewsNet
Benchmarks
- PolitiFact
- GossipCop
- LUN

