SheepDog: make fake-news detectors focus on content, not writing style, to resist LLM-based camouflage

October 16, 2023 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.65

Cost Impact Score

0.45

Citation Count

5

Authors

Jiaying Wu, Jiafeng Guo, Bryan Hooi

Links

Abstract / PDF

Why It Matters For Business

Products that flag misinformation must be robust to attackers who use LLMs to change tone; adding style-variant training and content-focused cues reduces false negatives and improves trustworthiness.

Summary TLDR

Style-focused features help fake-news detectors today but are a weakness: attackers can use LLMs to restyle fake articles to look trustworthy and drop detector F1 by up to 38%. The paper introduces SheepDog, a modular method that (1) generates style-diverse training copies with an LLM, (2) enforces consistent veracity predictions across those reframings, and (3) extracts content-centered debunking cues from an LLM as pseudo-labels. SheepDog improves robustness across three benchmarks and various backbones without losing accuracy on original articles.
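One way to read the three components is as three terms in a single training objective. The sketch below is plain Python and is not the authors' implementation: the choice of KL divergence for the consistency term, binary cross-entropy for the attribution pseudo-labels, the two weights, and every function name are assumptions made for illustration.

```python
import math

EPS = 1e-12  # guards against log(0)


def softmax(logits):
    """Turn raw scores into a probability distribution."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, label):
    """Negative log-likelihood of the true veracity label."""
    return -math.log(max(probs[label], EPS))


def kl_divergence(p, q):
    """KL(p || q): how far the reframed prediction q is from the original p."""
    return sum(pi * math.log(max(pi, EPS) / max(qi, EPS)) for pi, qi in zip(p, q))


def binary_cross_entropy(probs, targets):
    """Mean multi-label loss over attribution pseudo-labels (0 or 1 each)."""
    terms = [
        t * math.log(max(s, EPS)) + (1 - t) * math.log(max(1 - s, EPS))
        for s, t in zip(probs, targets)
    ]
    return -sum(terms) / len(terms)


def combined_loss(orig_logits, reframed_logits, label, attr_probs, attr_targets,
                  align_weight=1.0, attr_weight=1.0):
    """Veracity loss + style-consistency penalty + attribution loss (assumed form)."""
    p_orig = softmax(orig_logits)
    veracity = cross_entropy(p_orig, label)
    # Average the divergence over all style reframings of the same article.
    align = sum(kl_divergence(p_orig, softmax(r)) for r in reframed_logits)
    align /= len(reframed_logits)
    attribution = binary_cross_entropy(attr_probs, attr_targets)
    return veracity + align_weight * align + attr_weight * attribution


# A model that predicts the same label for the reframed copy is penalized less
# than one whose prediction flips when only the style changes.
consistent = combined_loss([2.0, -1.0], [[2.0, -1.0]], 0, [0.9, 0.1], [1, 0])
inconsistent = combined_loss([2.0, -1.0], [[-1.0, 2.0]], 0, [0.9, 0.1], [1, 0])
```

In a real training loop the logits would come from the detector backbone and the attribution targets from the LLM-produced cues; here they are hard-coded to keep the sketch self-contained.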

Problem Statement

Text-based fake-news detectors often rely on writing style. Powerful LLMs let attackers rewrite fake articles to mimic trustworthy publishers. This style camouflage can severely drop detector performance, so detectors must learn to judge content rather than style.
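To make the style-camouflage idea concrete, the following sketch builds one rewrite request per target style for a single article. The prompt wording, the style descriptions, and the `Reframing` and `build_reframing_prompts` names are hypothetical and are not taken from the paper; sending the prompts to an LLM is left out on purpose.

```python
from dataclasses import dataclass

# Hypothetical prompt wording, chosen for illustration only.
STYLE_PROMPT = (
    "Rewrite the following article in the style of {style}, "
    "keeping every factual claim unchanged:\n\n{article}"
)


@dataclass(frozen=True)
class Reframing:
    style: str   # target writing style
    prompt: str  # full text that would be sent to an LLM


def build_reframing_prompts(article, styles):
    """Return one rewrite prompt per target style for a single article."""
    cleaned = article.strip()
    if not cleaned:
        raise ValueError("article must be non-empty")
    return [
        Reframing(style, STYLE_PROMPT.format(style=style, article=cleaned))
        for style in styles
    ]


prompts = build_reframing_prompts(
    "Scientists confirm the moon is made of cheese.",
    ["a neutral wire-service report", "a sensational tabloid"],
)
```

The same builder serves both sides: an attacker would pick a trustworthy-sounding style for a fake article, while a defender would generate several styles per labeled article to probe or train a detector.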

Main Contribution

Demonstrates a new attack vector where LLMs restyle articles to evade text-based fake-news detectors.

Proposes SheepDog: a style-agnostic detector using LLM-generated reframings, style-alignment training, and LLM-sourced content attributions.

Shows through extensive experiments that SheepDog improves adversarial robustness on PolitiFact, GossipCop, and LUN, and that it works with multiple LM/LLM backbones.

Key Findings

State-of-the-art text-only detectors suffer large drops under LLM style attacks.

Numbers: F1 drop of up to 38.33%

SheepDog raises adversarial robustness across benchmarks.

Numbers: Average F1 gains vs. baselines of 2.59%, 2.77%, and 15.70% (three benchmarks)

LLM reframings largely preserve content claims.

Numbers: Claim entailment between original and objective reframings of 86.2%–89.2%

The reframing component is critical to SheepDog's gains.

Numbers: Removing reframings (SheepDog -R) drops LUN F1 from 85.63% to 53.27%

Results

Adversarial F1 drop under LLM style attack (best baseline)

Value: up to -38.33%

Baseline: state-of-the-art text detectors

SheepDog adversarial F1 (PolitiFact, set A)

Value: 80.99%

Baseline: best competitive baseline ~77.6%

SheepDog unperturbed F1 (LUN original test)

Value: 93.04%

Baseline: best baseline ~84.06%

Ablation: remove reframings (SheepDog -R) F1

Value: 53.27%

Baseline: SheepDog full 85.63%

Claim entailment between original and objective reframings

Value: 86.2%–89.2%
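The reported pairs are easier to compare as gaps in F1 points. The short calculation below uses only the figures listed in this section; the baseline values marked "~" are approximate, so the resulting gaps are approximate too. The dictionary keys and the `point_gap` helper are illustrative names.

```python
# (reported value, reference value) pairs copied from the Results entries.
results = {
    "PolitiFact adversarial F1, SheepDog vs. best baseline": (80.99, 77.6),
    "LUN unperturbed F1, SheepDog vs. best baseline": (93.04, 84.06),
    "Full SheepDog vs. SheepDog -R (no reframings)": (85.63, 53.27),
}


def point_gap(value, reference):
    """Difference in F1 percentage points, rounded to two decimals."""
    return round(value - reference, 2)


gaps = {name: point_gap(value, ref) for name, (value, ref) in results.items()}

for name, gap in gaps.items():
    print(f"{name}: {gap:+.2f} points")
```

The gap for the ablation row is by far the largest of the three, which matches the finding that the reframing component carries most of the robustness gain.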

Who Should Care

What To Try In 7 Days

Generate a small set of LLM-based style reframings for your labeled news and measure detection drop.

Add a style-alignment loss: enforce consistent predictions across original and reframed copies.

Elicit a short set of content-focused debunking labels (e.g., 'lack of credible sources'); use them as weak supervision.
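For the first step, measuring the detection drop only needs the gold labels and the detector's predictions on the original and restyled copies of the same articles. The sketch below is a minimal harness in plain Python; it assumes binary labels with 1 meaning fake, uses single-class F1 rather than the Macro-F1 listed under Metrics, and the function names and toy predictions are made up for illustration.

```python
def f1_score(y_true, y_pred, positive=1):
    """Binary F1 for the positive class; returns 0.0 when there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def robustness_report(y_true, pred_original, pred_restyled):
    """Compare detector output on original articles and their restyled copies."""
    if not (len(y_true) == len(pred_original) == len(pred_restyled)):
        raise ValueError("all inputs must have the same length")
    f1_original = f1_score(y_true, pred_original)
    f1_restyled = f1_score(y_true, pred_restyled)
    flipped = sum(1 for a, b in zip(pred_original, pred_restyled) if a != b)
    return {
        "f1_original": f1_original,
        "f1_restyled": f1_restyled,
        "f1_drop": f1_original - f1_restyled,
        # Share of articles whose prediction changed when only the style changed.
        "flip_rate": flipped / len(y_true),
    }


# Toy data: 1 = fake, 0 = real. Two fake articles slip through after restyling.
y_true = [1, 1, 1, 0, 0, 0]
pred_original = [1, 1, 1, 0, 0, 1]
pred_restyled = [1, 0, 0, 0, 0, 1]
report = robustness_report(y_true, pred_original, pred_restyled)
```

The flip rate is a useful companion to the F1 drop because it needs no labels: any prediction that changes when only the style changes points to style sensitivity.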

Reproducibility

License

  • CC BY 4.0

Data Urls

  • FakeNewsNet (PolitiFact, GossipCop)
  • LUN (Labeled Unreliable News)

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Relies on external LLMs to generate reframings and pseudo-labels, adding cost and potential bias.
  • LLM reframings preserve claims ~86–89% of the time; some reframings change content and can mislead the detector.
  • Evaluation limited to text-only datasets; multimodal news not studied.

When Not To Use

  • When no reliable LLM access is available or the budget for LLM calls is too small.
  • On tiny labeled datasets where generating many reframings risks overfitting to synthetic style noise.

Failure Modes

  • LLM-generated reframings that alter factual content create false negatives or incorrect pseudo-labels.
  • Attribution pseudo-labels can be noisy and push the model toward LLM biases.
  • Attackers could alter facts, not just style, which this style-agnostic method does not directly address.

Core Entities

Models

  • SheepDog
  • RoBERTa
  • BERT
  • DeBERTa
  • GPT-3.5
  • InstructGPT
  • LLaMA2-13B

Metrics

  • Accuracy
  • Macro-F1
  • F1 Score

Datasets

  • PolitiFact
  • GossipCop
  • LUN
  • FakeNewsNet

Benchmarks

  • PolitiFact
  • GossipCop
  • LUN