Mix CLIP multimodal features with prompt tuning to detect fake news with few labels

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.65

Citation Count

Authors

Ye Jiang, Xiaomin Yu, Yimin Wang, Xiaoman Xu, Xingyi Song, Diana Maynard

Why It Matters For Business

SAMPLE helps detect multimodal fake news with far fewer labels and far fewer trainable parameters than full fine-tuning, lowering data and compute costs for real-world monitoring systems.

Summary TLDR

SAMPLE is a prompt-learning framework that fuses CLIP image-text features with RoBERTa-based prompts. It adaptively weights multimodal signals using standardized cosine similarity to reduce noisy, uncorrelated image-text pairs. On two benchmark datasets (PolitiFact, GossipCop) SAMPLE variants (discrete, continuous, mixed prompts) outperform standard RoBERTa fine-tuning, with the mixed prompt (M-SAMPLE) giving the largest gains, especially in few-shot settings. The method is lightweight (few trainable parameters) and practical for low-label regimes, though the soft verbalizer and uncorrelated modalities remain limitations.
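To make the prompt-learning idea concrete, here is a minimal sketch of a discrete prompt and a verbalizer. The template wording, the label words, and the function names are illustrative assumptions, not taken from the paper. The verbalizer here is a simple manual one that averages label-word scores, whereas SAMPLE uses a learned soft verbalizer.

```python
# Hypothetical discrete template; the paper's exact wording may differ.
TEMPLATE = "This is a piece of {mask} news. {text}"

# Hypothetical label words per class (a manual verbalizer, not the soft one in SAMPLE).
LABEL_WORDS = {
    "real": ["real", "true"],
    "fake": ["fake", "false"],
}


def build_prompt(text, mask_token="<mask>"):
    """Wrap the article text in the template; RoBERTa's mask token is <mask>."""
    return TEMPLATE.format(mask=mask_token, text=text)


def verbalize(mask_scores):
    """Map scores predicted at the mask position to a class.

    mask_scores: dict of token -> score, as a masked language model would
    produce for the mask slot. Each class score is the mean over its label words.
    """
    class_scores = {
        label: sum(mask_scores.get(word, 0.0) for word in words) / len(words)
        for label, words in LABEL_WORDS.items()
    }
    best = max(class_scores, key=class_scores.get)
    return best, class_scores


prompt = build_prompt("Aliens land in Ohio")
label, scores = verbalize({"fake": 0.6, "false": 0.2, "real": 0.1, "true": 0.1})
print(prompt)
print(label, scores)
```

In a real pipeline the scores would come from RoBERTa's output at the mask position. A continuous prompt would replace the fixed template words with trainable embeddings, and a mixed prompt would combine both.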

Problem Statement

Fake news is hard to detect from text alone. Multimodal fusion helps but naïve fusion can add noise when image and text are weakly related. Fine-tuning large models needs lots of labeled data and updates many weights. The paper asks: can prompt learning plus a similarity-aware multimodal fusion (using CLIP) give strong fake-news detection with few labels and fewer trainable parameters?

Main Contribution

SAMPLE: a multimodal prompt-learning pipeline that combines discrete, continuous and mixed prompts with a soft verbalizer and RoBERTa.

A similarity-aware fusion step using CLIP features and standardized cosine similarity to scale multimodal intensity and reduce noisy cross-modal signals.

Empirical validation on PolitiFact and GossipCop showing consistent gains over fine-tuning and prior multimodal baselines in few-shot and data-rich settings.
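The similarity-aware fusion step can be sketched roughly as follows. This is a plain-Python illustration under stated assumptions: the cosine similarity of each text-image pair is standardized across the batch, squashed with a sigmoid, and used to scale the concatenated features. The sigmoid, the batch-level statistics, the concatenation, and all function names are assumptions made for illustration; the paper's exact formulation may differ.

```python
import math
from statistics import mean, pstdev


def cosine(u, v, eps=1e-8):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)


def similarity_weights(text_feats, image_feats, eps=1e-8):
    """Standardize per-pair similarities over the batch, then map to (0, 1)."""
    sims = [cosine(t, i) for t, i in zip(text_feats, image_feats)]
    mu, sigma = mean(sims), pstdev(sims)
    z_scores = [(s - mu) / (sigma + eps) for s in sims]
    return [1.0 / (1.0 + math.exp(-z)) for z in z_scores]


def fuse(text_feats, image_feats):
    """Concatenate text and image features and scale by the similarity weight."""
    weights = similarity_weights(text_feats, image_feats)
    return [
        [w * x for x in t + i]  # list concatenation, then scaling
        for w, t, i in zip(weights, text_feats, image_feats)
    ]


# Toy batch: the first pair is aligned, the second is orthogonal.
text = [[1.0, 0.0], [1.0, 0.0]]
image = [[1.0, 0.0], [0.0, 1.0]]
print(similarity_weights(text, image))
print(fuse(text, image))
```

In practice the feature vectors would be CLIP text and image embeddings. Weakly related pairs receive a weight below 0.5, which dampens their contribution to the fused representation.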

Key Findings

Mixed prompting (M-SAMPLE) gives clear few-shot gains over standard fine-tuning.

Numbers: average F1 +0.05 vs FT-RoBERTa (few-shot)

SAMPLE can substantially beat older multimodal models in some settings.

Numbers: up to +0.29 F1 vs CAFE (PolitiFact, 100-shot)

Standardizing similarity (the similarity-aware step) improves few-shot robustness.

Numbers: PolitiFact 2-shot, M-SAMPLE F1 0.47 with similarity vs 0.44 without (drop of 0.03)

Results

Average F1 improvement (few-shot)

Value: M-SAMPLE +0.05 vs FT-RoBERTa

Baseline: FT-RoBERTa

PolitiFact data-rich F1

Value: 0.80

Baseline: FT-RoBERTa 0.79

GossipCop data-rich F1

Value: 0.64

Baseline: FT-RoBERTa 0.63

Who Should Care

ML Engineer, Data Scientist, Product Manager, CTO, Engineering Lead

What To Try In 7 Days

Run a small test: extract CLIP features and apply mixed-prompt RoBERTa on a labeled sample (k=8) to compare F1 vs your fine-tuned model.

Implement the standardized cosine-similarity scaling when fusing image+text to see if it reduces noisy cross-modal cases.

Measure trainable parameters and training time: try prompt tuning to cut costs compared to full fine-tuning.
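For the small test in the first step, a reproducible few-shot subset is needed. The sketch below draws a class-balanced k-shot sample with a fixed seed. It assumes k counts examples per class and that each example is a dict with a "label" key; both are assumptions for illustration, not details confirmed by the paper.

```python
import random
from collections import defaultdict


def sample_k_shot(examples, k, seed=0):
    """Draw k examples per label, reproducibly, and shuffle the result."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for example in examples:
        by_label[example["label"]].append(example)

    subset = []
    for label in sorted(by_label):  # sorted for a deterministic iteration order
        pool = by_label[label]
        if len(pool) < k:
            raise ValueError(f"label {label!r} has only {len(pool)} examples, need {k}")
        subset.extend(rng.sample(pool, k))
    rng.shuffle(subset)
    return subset


# Toy data: 10 "real" and 10 "fake" items.
data = [{"id": n, "label": "fake" if n % 2 else "real"} for n in range(20)]
train = sample_k_shot(data, k=8, seed=42)
print(len(train), sum(ex["label"] == "fake" for ex in train))
```

Repeating the draw with several seeds and averaging F1 gives a fairer comparison, since few-shot results can vary noticeably between samples.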

Optimization Features

Training Optimization

Prompt tuning reduces trainable parameters compared to full fine-tuning
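A back-of-the-envelope comparison shows why the trainable-parameter count drops. The numbers below are illustrative assumptions only: a backbone of roughly 125M parameters (about the size of RoBERTa-base), a hidden size of 768, a frozen backbone during prompt tuning, and a soft prompt of 20 tokens. None of these settings are confirmed by the paper.

```python
def trainable_fraction(param_groups):
    """Return (trainable, total, fraction) for groups of (count, is_trainable)."""
    total = sum(count for count, _ in param_groups.values())
    trainable = sum(count for count, flag in param_groups.values() if flag)
    return trainable, total, trainable / total


HIDDEN = 768  # hidden size of RoBERTa-base

# Assumed setup: everything is updated during full fine-tuning.
full_ft = {
    "backbone": (125_000_000, True),
    "classifier_head": (HIDDEN * 2 + 2, True),  # linear layer to 2 classes
}

# Assumed setup: backbone frozen, only 20 soft-prompt embeddings are trained.
prompt_tuning = {
    "backbone": (125_000_000, False),
    "soft_prompt": (20 * HIDDEN, True),
}

for name, groups in [("full fine-tuning", full_ft), ("prompt tuning", prompt_tuning)]:
    trainable, total, fraction = trainable_fraction(groups)
    print(f"{name}: {trainable:,} of {total:,} trainable ({fraction:.4%})")
```

With a real model, the same counts can be read from the framework's parameter list by checking which tensors require gradients.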

Reproducibility

Data Urls

FakeNewsNet (Shu et al., 2018) datasets PolitiFact and GossipCop

Data Available

Open Source Status

partial

Risks & Boundaries

Limitations

Soft verbalizer is hard to optimize with large vocabularies in very low-data regimes.
Similarity-aware fusion reduces noise but does not explicitly model or correct uncorrelated cross-modal relations.
Evaluations limited to PolitiFact and GossipCop; other domains and languages not tested.

When Not To Use

When you have abundant labeled data and can afford full fine-tuning: gains shrink in data-rich regimes.
When images are frequently missing or unrelated to the text: the visual modality can hurt few-shot performance.
When you need out-of-the-box code: authors do not publish code in the paper.

Failure Modes

Visual features can inject noise when image-text correlation is low, hurting few-shot F1.
Soft verbalizer may bias predictions if label-word coverage is incomplete.
Performance and stability vary by dataset; GossipCop showed higher variance.

Core Entities

Models

CLIP (ViT-B-32)
RoBERTa
BERT
ResNet
VGG19

Metrics

macro-F1
Accuracy
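
For reference, macro-F1 is the unweighted mean of the per-class F1 scores, so the minority class counts as much as the majority class. A minimal plain-Python sketch (the function name is illustrative):

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# 1 = fake, 0 = real
print(macro_f1([1, 1, 0, 0], [1, 0, 0, 0]))
```

In the toy call, class 1 has F1 = 2/3 and class 0 has F1 = 4/5, so the macro average is 11/15.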

Datasets

PolitiFact (FakeNewsNet)
GossipCop (FakeNewsNet)

Benchmarks

PolitiFact
GossipCop

Context Entities

Models

SpotFake
CAFE
LDA-HAN
T-BERT
SAFE
RIVF
FT-RoBERTa

Metrics

F1
Accuracy

Datasets

FakeNewsNet (Shu et al., 2018)