Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.65
Citation Count
0
Why It Matters For Business
SAMPLE helps detect multimodal fake news with far fewer labels and far fewer trainable parameters than full fine-tuning, lowering data and compute costs for real-world monitoring systems.
Summary TLDR
SAMPLE is a prompt-learning framework that fuses CLIP image-text features with RoBERTa-based prompts. It adaptively weights the multimodal signal using standardized cosine similarity, damping the influence of noisy, uncorrelated image-text pairs. On two benchmark datasets (PolitiFact and GossipCop), the SAMPLE variants (discrete, continuous, and mixed prompts) outperform standard RoBERTa fine-tuning, with the mixed prompt (M-SAMPLE) giving the largest gains, especially in few-shot settings. The method is lightweight (few trainable parameters) and practical for low-label regimes, though the soft verbalizer and uncorrelated modalities remain limitations.
Problem Statement
Fake news is hard to detect from text alone. Multimodal fusion helps, but naïve fusion can add noise when image and text are weakly related. Fine-tuning large models needs large amounts of labeled data and updates many weights. The paper asks: can prompt learning plus a similarity-aware multimodal fusion (using CLIP) give strong fake-news detection with few labels and fewer trainable parameters?
Main Contribution
SAMPLE: a multimodal prompt-learning pipeline that combines discrete, continuous and mixed prompts with a soft verbalizer and RoBERTa.
A similarity-aware fusion step using CLIP features and standardized cosine similarity to scale multimodal intensity and reduce noisy cross-modal signals.
Empirical validation on PolitiFact and GossipCop showing consistent gains over fine-tuning and prior multimodal baselines in few-shot and data-rich settings.
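The similarity-aware fusion step can be illustrated with a minimal sketch. This is not the authors' implementation: the batch-level z-scoring, the sigmoid squash, and the concatenate-then-scale fusion are assumptions chosen for illustration, and the short vectors stand in for real CLIP text and image embeddings. All function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)

def standardized_weights(text_feats, image_feats):
    """Z-score the per-pair similarities over the batch, then squash to (0, 1).

    Assumption: standardization is done across the batch and followed by a
    sigmoid; the summary only says the similarity is "standardized".
    """
    sims = [cosine(t, i) for t, i in zip(text_feats, image_feats)]
    mean = sum(sims) / len(sims)
    std = math.sqrt(sum((s - mean) ** 2 for s in sims) / len(sims))
    z = [(s - mean) / (std + 1e-12) for s in sims]
    return [1.0 / (1.0 + math.exp(-x)) for x in z]

def fuse(text_feats, image_feats):
    """Concatenate each text/image pair and scale it by its similarity weight."""
    weights = standardized_weights(text_feats, image_feats)
    return [
        [w * x for x in t + i]
        for w, t, i in zip(weights, text_feats, image_feats)
    ]

# Toy batch: pair 0 is well aligned, pair 1 is orthogonal, pair 2 is in between.
text = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
image = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = standardized_weights(text, image)
fused = fuse(text, image)
```

In this toy batch the aligned pair receives the largest weight and the orthogonal pair the smallest, so weakly related image-text pairs contribute less to the fused representation.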
Key Findings
Mixed prompting (M-SAMPLE) gives clear few-shot gains over standard fine-tuning.
SAMPLE can substantially outperform older multimodal models in some settings.
Standardizing similarity (the similarity-aware step) improves few-shot robustness.
Results
Average F1 improvement (few-shot)
PolitiFact data-rich F1
GossipCop data-rich F1
Who Should Care
What To Try In 7 Days
Run a small test: extract CLIP features and apply mixed-prompt RoBERTa on a labeled sample (k=8) to compare F1 vs your fine-tuned model.
Implement the standardized cosine-similarity scaling when fusing image+text to see if it reduces noisy cross-modal cases.
Measure trainable parameters and training time: try prompt tuning to cut costs compared to full fine-tuning.
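For the small test above, a reproducible few-shot split is the first thing to pin down. The sketch below draws k examples per label with a fixed seed. Reading "k=8" as eight examples per class is an assumption, and `sample_k_shot` is a hypothetical helper, not part of any library.

```python
import random
from collections import defaultdict

def sample_k_shot(examples, k=8, seed=0):
    """Pick k examples per label from a list of (item, label) pairs."""
    by_label = defaultdict(list)
    for item, label in examples:
        by_label[label].append((item, label))

    rng = random.Random(seed)  # fixed seed so the split can be reproduced
    subset = []
    for label in sorted(by_label):
        pool = by_label[label]
        if len(pool) < k:
            raise ValueError(
                f"label {label!r} has only {len(pool)} examples, need {k}"
            )
        subset.extend(rng.sample(pool, k))
    return subset

# Toy corpus with alternating labels; real items would be news text plus image.
examples = [(f"doc{i}", "fake" if i % 2 else "real") for i in range(40)]
few_shot = sample_k_shot(examples, k=8, seed=1)
```

Repeating the comparison over several seeds and reporting the mean and spread is a reasonable precaution, since few-shot results can vary noticeably between draws.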
Optimization Features
Training Optimization
- Prompt tuning reduces trainable parameters compared to full fine-tuning
Reproducibility
Data Urls
- FakeNewsNet (Shu et al., 2018): PolitiFact and GossipCop datasets
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Soft verbalizer is hard to optimize with large vocabularies in very low-data regimes.
- Similarity-aware fusion reduces noise but does not explicitly model or correct uncorrelated cross-modal relations.
- Evaluations limited to PolitiFact and GossipCop; other domains and languages not tested.
When Not To Use
- When you have abundant labeled data and can afford full fine-tuning: gains shrink in data-rich regimes.
- When images are consistently missing or irrelevant to the text: the visual modality can hurt few-shot performance.
- When you need out-of-the-box code: the authors do not publish code with the paper.
Failure Modes
- Visual features can inject noise when image-text correlation is low, hurting few-shot F1.
- Soft verbalizer may bias predictions if label-word coverage is incomplete.
- Performance and stability vary by dataset; GossipCop showed higher variance.
Core Entities
Models
- CLIP (ViT-B-32)
- RoBERTa
- BERT
- ResNet
- VGG19
Metrics
- macro-F1
- Accuracy
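As a reference for the two metrics, here is a minimal pure-Python sketch. It uses the standard definitions (per-class F1 averaged with equal weight for macro-F1), not code from the paper; the function names and the toy labels are illustrative.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); define it as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

y_true = ["fake", "fake", "real", "real"]
y_pred = ["fake", "real", "real", "real"]
acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
```

Macro-F1 weights both classes equally, which matters when the fake and real classes are imbalanced and accuracy alone would favor the majority class.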
Datasets
- PolitiFact (FakeNewsNet)
- GossipCop (FakeNewsNet)
Benchmarks
- PolitiFact
- GossipCop
Context Entities
Models
- SpotFake
- CAFE
- LDA-HAN
- T-BERT
- SAFE
- RIVF
- FT-RoBERTa
Metrics
- F1
- Accuracy
Datasets
- FakeNewsNet (Shu et al., 2018)

