Mix CLIP multimodal features with prompt tuning to detect fake news with few labels

April 9, 2023 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.65

Citation Count

0

Authors

Ye Jiang, Xiaomin Yu, Yimin Wang, Xiaoman Xu, Xingyi Song, Diana Maynard

Why It Matters For Business

SAMPLE helps detect multimodal fake news with far fewer labels and far fewer trainable parameters than full fine-tuning, lowering data and compute costs for real-world monitoring systems.

Summary TLDR

SAMPLE is a prompt-learning framework that fuses CLIP image-text features with RoBERTa-based prompts. It weights the multimodal signal adaptively using standardized cosine similarity, which dampens noise from uncorrelated image-text pairs. On two benchmark datasets (PolitiFact and GossipCop), the SAMPLE variants (discrete, continuous, and mixed prompts) outperform standard RoBERTa fine-tuning, with the mixed prompt (M-SAMPLE) giving the largest gains, especially in few-shot settings. The method is lightweight (few trainable parameters) and practical for low-label regimes, though optimizing the soft verbalizer and handling uncorrelated modalities remain limitations.
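To make the three prompt variants concrete, here is a minimal sketch of how discrete, continuous, and mixed templates might be laid out around RoBERTa's mask slot. The template wording, the `[SOFT_i]` placeholder names, and the function names are all hypothetical illustrations, not the templates used in the paper; in a real setup each soft placeholder would map to a trainable embedding rather than a vocabulary word.

```python
MASK = "<mask>"  # RoBERTa's mask token; the model predicts a label word here


def _soft_tokens(n_soft: int) -> str:
    # Placeholder names only (hypothetical): each one stands for a learnable embedding.
    return " ".join(f"[SOFT_{i}]" for i in range(n_soft))


def discrete_prompt(text: str) -> str:
    """Hand-written template: fixed natural-language words around the mask."""
    return f"{text} This is a piece of {MASK} news."


def continuous_prompt(text: str, n_soft: int = 4) -> str:
    """Learned template: only soft placeholders surround the mask."""
    return f"{text} {_soft_tokens(n_soft)} {MASK}"


def mixed_prompt(text: str, n_soft: int = 2) -> str:
    """Mixed template: soft placeholders plus hand-written words."""
    return f"{text} {_soft_tokens(n_soft)} This is a piece of {MASK} news."


if __name__ == "__main__":
    headline = "Celebrity spotted on Mars."
    for build in (discrete_prompt, continuous_prompt, mixed_prompt):
        print(build(headline))
```

The mixed form keeps a human-readable hint about the task while leaving a few positions free for the optimizer, which is one intuition for why it could help when labels are scarce.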

Problem Statement

Fake news is hard to detect from text alone. Multimodal fusion helps but naïve fusion can add noise when image and text are weakly related. Fine-tuning large models needs lots of labeled data and updates many weights. The paper asks: can prompt learning plus a similarity-aware multimodal fusion (using CLIP) give strong fake-news detection with few labels and fewer trainable parameters?

Main Contribution

SAMPLE: a multimodal prompt-learning pipeline that combines discrete, continuous and mixed prompts with a soft verbalizer and RoBERTa.

A similarity-aware fusion step using CLIP features and standardized cosine similarity to scale multimodal intensity and reduce noisy cross-modal signals.

Empirical validation on PolitiFact and GossipCop showing consistent gains over fine-tuning and prior multimodal baselines in few-shot and data-rich settings.
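The similarity-aware fusion idea can be sketched in a few lines. This is one plausible reading, not the authors' implementation: it computes cosine similarity between each text and image feature vector, standardizes the similarities across the batch, squashes them with a sigmoid (an assumption), and uses the result to scale a concatenated text+image vector (also an assumption). The function names are hypothetical, and plain Python lists stand in for CLIP features.

```python
import math
from statistics import mean, pstdev


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def standardized_weights(text_feats, image_feats, eps=1e-8):
    """Per-pair weights in (0, 1): z-score the cosine similarities, then apply a sigmoid."""
    sims = [cosine(t, i) for t, i in zip(text_feats, image_feats)]
    mu, sigma = mean(sims), pstdev(sims)
    z = [(s - mu) / (sigma + eps) for s in sims]
    return [1.0 / (1.0 + math.exp(-x)) for x in z]


def fuse(text_feats, image_feats):
    """Scale each concatenated text+image vector by its similarity weight."""
    weights = standardized_weights(text_feats, image_feats)
    return [[w * x for x in t + i]
            for t, i, w in zip(text_feats, image_feats, weights)]


if __name__ == "__main__":
    text = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
    image = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]  # the second pair is unrelated
    print(standardized_weights(text, image))
```

Because the weights are relative to the batch, a weakly related pair is down-weighted rather than dropped, which matches the stated limitation that the step reduces noise without explicitly modeling uncorrelated pairs.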

Key Findings

Mixed prompting (M-SAMPLE) gives clear few-shot gains over standard fine-tuning.

Numbers: avg F1 +0.05 vs FT-RoBERTa (few-shot)

SAMPLE can substantially beat older multimodal models in some settings.

Numbers: up to +0.29 F1 vs CAFE (PolitiFact, 100-shot)

Standardizing similarity (the similarity-aware step) improves few-shot robustness.

Numbers: PolitiFact 2-shot, M-SAMPLE F1 0.47 with the similarity step vs 0.44 without it (drop of 0.03)

Results

Average F1 improvement (few-shot)

Value: M-SAMPLE +0.05 vs FT-RoBERTa

Baseline: FT-RoBERTa

PolitiFact data-rich F1

Value: 0.80

Baseline: FT-RoBERTa 0.79

GossipCop data-rich F1

Value: 0.64

Baseline: FT-RoBERTa 0.63

Who Should Care

What To Try In 7 Days

Run a small test: extract CLIP features and apply mixed-prompt RoBERTa on a labeled sample (k=8) to compare F1 vs your fine-tuned model.

Implement the standardized cosine-similarity scaling when fusing image+text to see if it reduces noisy cross-modal cases.

Measure trainable parameters and training time: try prompt tuning to cut costs compared to full fine-tuning.
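For the small comparison test suggested above, two helpers are usually enough: one to draw a k-shot training subset and one to score predictions with macro-F1. The sketch below is a minimal, dependency-free version. It assumes k means k examples per class (the text only says k=8) and that examples are `(text, label)` tuples; both are assumptions, and the function names are hypothetical.

```python
import random
from collections import defaultdict


def sample_k_shot(examples, k, seed=0):
    """Draw k examples per label from a list of (text, label) tuples."""
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    by_label = defaultdict(list)
    for example in examples:
        by_label[example[1]].append(example)
    subset = []
    for label in sorted(by_label):
        items = by_label[label]
        if len(items) < k:
            raise ValueError(f"label {label!r} has only {len(items)} examples")
        subset.extend(rng.sample(items, k))
    return subset


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    data = [(f"story {i}", "fake" if i % 2 else "real") for i in range(40)]
    train = sample_k_shot(data, k=8)
    print(len(train), "training examples")
```

Few-shot results tend to vary with the sampled subset, so repeating the draw with several seeds and reporting the mean and spread of macro-F1 gives a fairer comparison against a fine-tuned baseline.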

Optimization Features

Training Optimization

  • Prompt tuning reduces trainable parameters compared to full fine-tuning
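A back-of-the-envelope count shows the scale of the saving. The sketch below assumes a roberta-base-sized backbone (hidden size 768, roughly 125 million parameters) and 20 soft-prompt tokens; the backbone size and token count are illustrative assumptions, not figures from the paper, and a real SAMPLE-style setup would also train other small components such as the fusion layers and verbalizer.

```python
HIDDEN_SIZE = 768                   # roberta-base hidden size
BACKBONE_PARAMS = 125_000_000       # approximate roberta-base parameter count


def soft_prompt_params(n_tokens: int, hidden: int = HIDDEN_SIZE) -> int:
    """Each soft-prompt token is one trainable embedding vector."""
    return n_tokens * hidden


def trainable_fraction(trainable: int, total: int = BACKBONE_PARAMS) -> float:
    """Share of parameters updated relative to full fine-tuning of the backbone."""
    return trainable / total


if __name__ == "__main__":
    n = soft_prompt_params(20)  # 20 tokens is a hypothetical choice
    print(f"{n} trainable prompt parameters, "
          f"{trainable_fraction(n):.4%} of the backbone")
```

In a PyTorch experiment, the equivalent measurement is to sum `p.numel()` over the parameters whose `requires_grad` flag is set.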

Reproducibility

Data Urls

  • FakeNewsNet (Shu et al., 2018): the PolitiFact and GossipCop datasets

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Soft verbalizer is hard to optimize with large vocabularies in very low-data regimes.
  • Similarity-aware fusion reduces noise but does not explicitly model or correct uncorrelated cross-modal relations.
  • Evaluations limited to PolitiFact and GossipCop; other domains and languages not tested.

When Not To Use

  • When you have abundant labeled data and can afford full fine-tuning: gains shrink in data-rich regimes.
  • When images are consistently missing or irrelevant to the text: the visual modality can hurt few-shot performance.
  • When you need out-of-the-box code: authors do not publish code in the paper.

Failure Modes

  • Visual features can inject noise when image-text correlation is low, hurting few-shot F1.
  • Soft verbalizer may bias predictions if label-word coverage is incomplete.
  • Performance and stability vary by dataset; GossipCop showed higher variance.

Core Entities

Models

  • CLIP (ViT-B-32)
  • RoBERTa
  • BERT
  • ResNet
  • VGG19

Metrics

  • macro-F1
  • Accuracy

Datasets

  • PolitiFact (FakeNewsNet)
  • GossipCop (FakeNewsNet)

Benchmarks

  • PolitiFact
  • GossipCop

Context Entities

Models

  • SpotFake
  • CAFE
  • LDA-HAN
  • T-BERT
  • SAFE
  • RIVF
  • FT-RoBERTa

Metrics

  • F1
  • Accuracy

Datasets

  • FakeNewsNet (Shu et al., 2018)