SheepDog: make fake-news detectors focus on content, not writing style, to resist LLM-based camouflage

Overview

Decision Snapshot: Ready For Pilot

The method is practical and modular, tested on three real benchmarks with ablations; costs rise where LLM API calls are used for reframing and attributions.

Citations: 5

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 3/5

Reproducibility

Status: Code + data available

Open source: Yes

License: CC BY 4.0

At A Glance

Cost impact: 45%

Production readiness: 60%

Novelty: 65%

Authors

Jiaying Wu, Jiafeng Guo, Bryan Hooi

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Products that flag misinformation must be robust to attackers who use LLMs to change tone; adding style-variant training and content-focused cues reduces false negatives and improves trustworthiness.

Who Should Care

Product Manager, CTO, ML Engineer, Data Scientist, Founder

Summary TLDR

Style-focused features help fake-news detectors today but are a weakness: attackers can use LLMs to restyle fake articles to look trustworthy and drop detector F1 by up to 38%. The paper introduces SheepDog, a modular method that (1) generates style-diverse training copies with an LLM, (2) enforces consistent veracity predictions across those reframings, and (3) extracts content-centered debunking cues from an LLM as pseudo-labels. SheepDog improves robustness across three benchmarks and various backbones without losing accuracy on original articles.

Problem Statement

Text-based fake-news detectors often rely on writing style. Powerful LLMs let attackers rewrite fake articles to mimic trustworthy publishers. This style camouflage can severely degrade detector performance, so detectors must learn to judge content rather than style.

Main Contribution

Demonstrates a new attack vector where LLMs restyle articles to evade text-based fake-news detectors.

Proposes SheepDog: a style-agnostic detector using LLM-generated reframings, style-alignment training, and LLM-sourced content attributions.
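The first component, LLM-generated reframings, can be pictured as a prompt builder that asks an LLM to restyle an article in several tones while keeping its claims. The sketch below is illustrative only and is not the authors' code: the tone lists, the prompt wording, and the names `ReframingRequest` and `build_reframing_prompts` are assumptions, and the actual LLM call is deliberately left out.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical tone lists (assumption): the summary only says the
# training copies are "style-diverse", not which tones are used.
RELIABLE_TONES = ["objective and professional", "neutral"]
UNRELIABLE_TONES = ["emotionally triggering", "sensational"]


@dataclass(frozen=True)
class ReframingRequest:
    article_id: str
    tone: str
    prompt: str


def build_reframing_prompts(article_id: str, text: str) -> List[ReframingRequest]:
    """Build one restyling prompt per tone for a single article.

    The prompt wording is an illustrative assumption, not a quoted prompt.
    """
    requests = []
    for tone in RELIABLE_TONES + UNRELIABLE_TONES:
        prompt = (
            f"Rewrite the following article in a {tone} tone. "
            "Keep every factual claim unchanged.\n\n"
            f"Article: {text}"
        )
        requests.append(ReframingRequest(article_id, tone, prompt))
    return requests
```

Each returned prompt would be sent to whichever LLM is available; the reframed copies inherit the veracity label of the original article, since only the style is meant to change.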

Key Findings

State-of-the-art text-only detectors suffer large drops under LLM style attacks.

Numbers: F1 drop up to 38.33%

Practical Use: Do not trust detector accuracy on current test sets alone; simulate LLM-style restyling when evaluating.

Evidence Ref: Table 1; Observation 1
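Measuring the drop amounts to scoring the same detector on the original test set and on a restyled copy, then comparing F1. The following is a minimal sketch under stated assumptions: it uses binary F1 with label 1 meaning "fake", it reports the drop in absolute percentage points (the summary does not say whether 38.33% is absolute or relative), and the function names are hypothetical.

```python
from typing import Sequence


def f1_score(y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1) -> float:
    """Binary F1 for the `positive` class (assumed here: 1 = fake)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # precision and recall are both zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def f1_drop_points(
    y_true: Sequence[int],
    pred_original: Sequence[int],
    pred_restyled: Sequence[int],
) -> float:
    """F1 lost under restyling, in absolute percentage points."""
    base = f1_score(y_true, pred_original)
    attacked = f1_score(y_true, pred_restyled)
    return 100.0 * (base - attacked)
```

Here `pred_original` and `pred_restyled` are the detector's predictions on the untouched articles and on their LLM-restyled versions; both are compared against the same ground-truth labels, since restyling should not change veracity.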

SheepDog raises adversarial robustness across benchmarks.

Numbers: Average F1 gains vs baselines: 2.59%, 2.77%, 15.70% (three benchmarks)

Practical Use: Add LLM-reframing and style-alignment training to improve real-world resilience to style-based evasion.

Evidence Ref: Section 6.2; Table 3

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Adversarial F1 drop under LLM style attack (best baseline)	up to -38.33%	state-of-the-art text detectors	—	PolitiFact / GossipCop / LUN (see Table 1)	Table 1 shows up to 38.33% F1 loss for LLaMA2 on LUN	Table 1; Observation 1
SheepDog adversarial F1 (PolitiFact, set A)	80.99%	best competitive baseline ~77.6%	+3.39% (vs baseline best)	PolitiFact (adversarial set A)	Table 3 reports SheepDog 80.99% F1 on PolitiFact A	Table 3

What To Try In 7 Days

Generate a small set of LLM-based style reframings for your labeled news and measure detection drop.

Add a style-alignment loss: enforce consistent predictions across original and reframed copies.

Elicit a short set of content-focused debunking labels (e.g., 'lack of credible sources'); use them as weak supervision.
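The second and third steps above can be combined into one training objective: a veracity loss on the original article, a consistency term that penalizes predictions that change under restyling, and a multi-label loss on the debunking pseudo-labels. The sketch below is a plain-Python illustration on single probabilities, not the paper's formulation: the squared-gap consistency term, the weights, and the name `combined_loss` are assumptions, and a real pipeline would compute the same quantities on tensors with automatic differentiation.

```python
import math
from typing import Sequence

EPS = 1e-12  # clamp to avoid log(0)


def bce(p: float, y: float) -> float:
    """Binary cross-entropy for one predicted probability `p` and target `y`."""
    p = min(max(p, EPS), 1.0 - EPS)
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))


def combined_loss(
    p_original: float,                   # P(fake) predicted for the original article
    p_reframed: Sequence[float],         # P(fake) predicted for each restyled copy
    label: int,                          # ground-truth veracity, 1 = fake
    attr_probs: Sequence[float],         # predicted probability per debunking cue
    attr_pseudo_labels: Sequence[int],   # LLM-provided 0/1 pseudo-label per cue
    align_weight: float = 1.0,           # assumed weight, not from the source
    attr_weight: float = 1.0,            # assumed weight, not from the source
) -> float:
    """Veracity loss + style-consistency penalty + weak attribution loss."""
    veracity = bce(p_original, label)
    # Consistency: mean squared gap between each reframed prediction and the
    # original one (an illustrative choice of alignment term).
    align = sum((p - p_original) ** 2 for p in p_reframed) / max(len(p_reframed), 1)
    # Weak supervision: mean BCE against the noisy pseudo-labels.
    attribution = sum(
        bce(p, y) for p, y in zip(attr_probs, attr_pseudo_labels)
    ) / max(len(attr_probs), 1)
    return veracity + align_weight * align + attr_weight * attribution
```

When the detector gives the same score to every restyled copy, the consistency term is zero and only the veracity and attribution terms remain; lowering `attr_weight` is one way to limit the influence of noisy pseudo-labels.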

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: CC BY 4.0

Code URLs

https://github.com/jiayingwu19/SheepDog

Data URLs

FakeNewsNet (PolitiFact, GossipCop)

LUN (Labeled Unreliable News)

Risks & Boundaries

Limitations

Relies on external LLMs to generate reframings and pseudo-labels, adding cost and potential bias.

LLM reframings preserve claims ~86–89% of the time; some reframings change content and can mislead the detector.

When Not To Use

When no reliable LLM access is available or the budget for LLM calls is too small.

On tiny labeled datasets where generating many reframings risks overfitting to synthetic style noise.

Failure Modes

LLM-generated reframings that alter factual content create false negatives or incorrect pseudo-labels.

Attribution pseudo-labels can be noisy and push the model toward LLM biases.

Core Entities

Models

SheepDog, RoBERTa, BERT, DeBERTa, GPT-3.5, InstructGPT, LLaMA2-13B

Metrics

Accuracy, Macro-F1, F1 Score

Datasets

PolitiFact, GossipCop, LUN, FakeNewsNet

Benchmarks

PolitiFact, GossipCop, LUN

