SheepDog: make fake-news detectors focus on content, not writing style, to resist LLM-based camouflage

October 16, 2023 | 7 min

Overview

Decision Snapshot: Ready For Pilot

The method is practical and modular, tested on three real benchmarks with ablations; costs rise where LLM API calls are used for reframing and attributions.

Citations: 5

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 3/5

Reproducibility

Status: Code + data available

Open source: Yes

License: CC BY 4.0

At A Glance

Cost impact: 45%

Production readiness: 60%

Novelty: 65%

Authors

Jiaying Wu, Jiafeng Guo, Bryan Hooi


Why It Matters For Business

Products that flag misinformation must be robust to attackers who use LLMs to change tone; adding style-variant training and content-focused cues reduces false negatives and improves trustworthiness.

Summary TLDR

Style-focused features help fake-news detectors today, but they are also a weakness: attackers can use LLMs to restyle fake articles so they look trustworthy, cutting detector F1 by up to 38%. The paper introduces SheepDog, a modular method that (1) generates style-diverse training copies with an LLM, (2) enforces consistent veracity predictions across those reframings, and (3) extracts content-centered debunking cues from an LLM as pseudo-labels. SheepDog improves robustness across three benchmarks and various backbones without losing accuracy on original articles.

Problem Statement

Text-based fake-news detectors often rely on writing style. Powerful LLMs let attackers rewrite fake articles to mimic trustworthy publishers. This style camouflage can sharply reduce detector performance, so detectors must learn to judge content rather than style.

Main Contribution

Demonstrates a new attack vector where LLMs restyle articles to evade text-based fake-news detectors.

Proposes SheepDog: a style-agnostic detector using LLM-generated reframings, style-alignment training, and LLM-sourced content attributions.
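The reframing step can be sketched as a small prompt builder. This is a minimal illustration under stated assumptions, not the paper's prompts: the style descriptions, the `STYLE_PROMPTS` table, the `TEMPLATE` wording, and `build_reframing_prompts` are all hypothetical, and the call to an actual LLM is left out.

```python
# Hypothetical style descriptions; the paper's exact prompts may differ.
STYLE_PROMPTS = {
    "neutral": "objective and professional",
    "sensational": "emotionally charged and sensational",
}

# Hypothetical template: asks for a restyle while keeping the claims fixed.
TEMPLATE = (
    "Rewrite the following article in a {tone} tone. "
    "Keep every factual claim unchanged.\n\n"
    "Article:\n{article}"
)


def build_reframing_prompts(article: str, styles=STYLE_PROMPTS) -> dict[str, str]:
    """Return one rewrite prompt per target style for a single article."""
    text = article.strip()
    if not text:
        raise ValueError("article must be non-empty")
    return {
        name: TEMPLATE.format(tone=tone, article=text)
        for name, tone in styles.items()
    }


prompts = build_reframing_prompts("City council approves new budget.")
```

Each prompt would then be sent to whichever LLM is available, and the returned rewrites stored alongside the original article with the same veracity label.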

Key Findings

State-of-the-art text-only detectors suffer large drops under LLM style attacks.

Numbers: F1 drop up to 38.33%

Practical Use: Do not trust detector accuracy on current test sets alone; simulate LLM-style restyling when evaluating.

Evidence Ref: Table 1; Observation 1
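A robustness check of this kind can be sketched by scoring the same labeled articles twice, once on the original text and once on the restyled text. The sketch below assumes macro-F1 as the metric and uses made-up predictions; `f1_drop` and the toy data are illustrative, not figures from the paper.

```python
def f1_score(y_true, y_pred, positive=1):
    """F1 for the `positive` class from two equal-length label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred, classes=(0, 1)):
    """Unweighted mean of per-class F1."""
    return sum(f1_score(y_true, y_pred, c) for c in classes) / len(classes)


def f1_drop(y_true, pred_original, pred_restyled):
    """Absolute macro-F1 lost when the same articles are restyled."""
    return macro_f1(y_true, pred_original) - macro_f1(y_true, pred_restyled)


# Toy labels: 1 = fake, 0 = real. Predictions are invented for illustration.
y_true = [1, 1, 1, 0, 0, 0]
pred_original = [1, 1, 1, 0, 0, 0]  # detector is perfect on the original text
pred_restyled = [0, 0, 1, 0, 0, 0]  # two fake articles slip through after restyling
drop = f1_drop(y_true, pred_original, pred_restyled)
```

In this toy case the drop is 0.375: two restyled fake articles slipping through is enough to erase more than a third of the macro-F1.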

SheepDog raises adversarial robustness across benchmarks.

Numbers: Average F1 gains vs baselines: 2.59%, 2.77%, 15.70% (three benchmarks)

Practical Use: Add LLM-reframing and style-alignment training to improve real-world resilience to style-based evasion.

Evidence Ref: Section 6.2; Table 3
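Style-alignment training can be illustrated with a consistency penalty between the detector's prediction on an article and its predictions on reframed copies. The sketch below uses KL divergence, which is one common choice and an assumption here; this summary does not give the paper's exact alignment objective, and `style_alignment_loss` is a hypothetical name.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def style_alignment_loss(orig_logits, reframed_logits_list):
    """Mean KL between the original prediction and each reframing's prediction."""
    p = softmax(orig_logits)
    kls = [kl_divergence(p, softmax(r)) for r in reframed_logits_list]
    return sum(kls) / len(kls)


# Logits are [real, fake]; values are invented for illustration.
consistent = style_alignment_loss([2.0, -1.0], [[2.0, -1.0], [2.0, -1.0]])
fooled = style_alignment_loss([2.0, -1.0], [[-1.0, 2.0]])
```

The penalty is zero when the reframed copies get the same prediction as the original and grows when a restyle flips the verdict, so adding it to the usual classification loss discourages reliance on style.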

Results

Metric: Adversarial F1 drop under LLM style attack (best baseline)
Value: up to -38.33%
Baseline: state-of-the-art text detectors
Delta: not given
Split / Dataset: PolitiFact / GossipCop / LUN (see Table 1)
Evidence: Table 1 shows up to 38.33% F1 loss for LLaMA2 on LUN
Evidence Ref: Table 1; Observation 1

Metric: SheepDog adversarial F1 (PolitiFact, set A)
Value: 80.99%
Baseline: best competitive baseline ~77.6%
Delta: +3.39% (vs. best baseline)
Split / Dataset: PolitiFact (adversarial set A)
Evidence: Table 3 reports SheepDog 80.99% F1 on PolitiFact A
Evidence Ref: Table 3

What To Try In 7 Days

Generate a small set of LLM-based style reframings for your labeled news and measure detection drop.

Add a style-alignment loss: enforce consistent predictions across original and reframed copies.

Elicit a short set of content-focused debunking labels (e.g., 'lack of credible sources'); use them as weak supervision.
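The weak-supervision step can be sketched as a multi-label target built from the cues an LLM returns, scored with binary cross-entropy. Only 'lack of credible sources' comes from the text above; the other labels, the `ATTRIBUTIONS` list, and the function names are illustrative assumptions, and the paper's actual label set and loss may differ.

```python
import math

ATTRIBUTIONS = [
    "lack of credible sources",         # example given in the text
    "false or misleading information",  # illustrative label
    "biased opinion",                   # illustrative label
]


def to_multi_hot(llm_labels, vocabulary=ATTRIBUTIONS):
    """Turn the labels an LLM returned into a 0/1 target vector."""
    chosen = {label.strip().lower() for label in llm_labels}
    return [1.0 if name in chosen else 0.0 for name in vocabulary]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def attribution_loss(logits, targets, eps=1e-12):
    """Mean binary cross-entropy over the attribution labels."""
    losses = []
    for z, t in zip(logits, targets):
        p = sigmoid(z)
        losses.append(-(t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps)))
    return sum(losses) / len(losses)


targets = to_multi_hot(["Lack of credible sources", "biased opinion"])
good = attribution_loss([4.0, -4.0, 4.0], targets)   # agrees with the pseudo-labels
bad = attribution_loss([-4.0, 4.0, -4.0], targets)   # contradicts the pseudo-labels
```

Because the targets come from an LLM rather than human annotators, this term is best treated as an auxiliary signal with a modest weight next to the main veracity loss.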

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Yes
License: CC BY 4.0

Data URLs

FakeNewsNet (PolitiFact, GossipCop)
LUN (Labeled Unreliable News)

Risks & Boundaries

Limitations

Relies on external LLMs to generate reframings and pseudo-labels, adding cost and potential bias.

LLM reframings preserve claims ~86–89% of the time; some reframings change content and can mislead the detector.

When Not To Use

When no reliable LLM access is available, or the budget for LLM calls is too small.

On tiny labeled datasets where generating many reframings risks overfitting to synthetic style noise.

Failure Modes

LLM-generated reframings that alter factual content create false negatives or incorrect pseudo-labels.

Attribution pseudo-labels can be noisy and push the model toward LLM biases.

Core Entities

Models

SheepDog, RoBERTa, BERT, DeBERTa, GPT-3.5, InstructGPT, LLaMA2-13B

Metrics

Accuracy, Macro-F1, F1 Score

Datasets

PolitiFact, GossipCop, LUN, FakeNewsNet

Benchmarks

PolitiFact, GossipCop, LUN