MUSE: an LLM augmented with vision and credibility-aware web retrieval that corrects social media misinformation

Overview

Decision Snapshot: Needs Validation

The paper reports strong, multi-pronged evidence (expert annotations, ablations, user study) but is limited to English and X and depends on commercial LLMs and MBFC ratings.

Citations: 4

Evidence Strength: 0.80

Confidence: 0.90

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 70%

Novelty: 60%

Authors

Xinyi Zhou, Ashish Sharma, Amy X. Zhang, Tim Althoff

Links

Abstract / PDF / Code / Data

Why It Matters For Business

MUSE can automate credible, auditable misinformation corrections at scale, reducing dependence on slow, expensive human fact-checking and improving user belief accuracy for platforms and publishers.

Who Should Care

CTO, Product Manager, ML Engineer, Engineering Lead, Data Scientist, Founder

Summary TLDR

MUSE combines a large language model with image captioning and credibility-aware web retrieval to identify and explain misinformation on social media, grounding its corrections in linked sources. In expert evaluation on X Community Notes, MUSE outperforms GPT-4 (mean quality 8.1 vs 5.9) and high-helpfulness lay responses, reduces hallucinated links, handles both images and text, and improves end users' ability to spot misinformation by 9.8%. Runtime is about 2 minutes and cost is ≈ $0.5 per post (now ≈ $0.2).

Problem Statement

Social media posts often mix true, false, and misleading elements and can include images. Manual corrections scale poorly. Off-the-shelf LLMs are fluent but hallucinate, lack current knowledge, and struggle with images. We need a practical system that finds timely, credible evidence, handles visuals, and produces auditable corrections.

Main Contribution

MUSE system that augments an LLM with informative image captioning and credibility-aware web retrieval to produce grounded corrections.

A 13-dimension expert evaluation rubric covering identification, explanation, text quality, and references.

Key Findings

MUSE achieves higher overall expert-rated quality than baselines.

Numbers: Mean overall quality was 8.1 for MUSE, 5.9 for GPT-4, and 6.3 for laypeople (high).

Practical Use: Use retrieval and vision augmentation to substantially improve correction quality compared to raw LLM outputs or lay answers.

Evidence Ref: Fig. 2a; expert evaluation, N=232 tweets, N=464 responses

MUSE identifies and explains inaccuracies more often and more completely.

Numbers: 89% explicit identification; 61% full identification of all inaccuracies vs 38% for GPT-4

Practical Use: For posts that mix true and false claims, MUSE finds more of the problematic parts and gives clearer explanations, which suits automated moderation assistance.

Evidence Ref: Fig. 2b-d

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Expert overall quality (mean ± SD)	MUSE 8.1 ± 2.0; GPT-4 5.9 ± 2.7; lay (high) 6.3 ± 2.0	—	MUSE +37% vs GPT-4; +29% vs lay high	232 tweets; expert ratings	Fig.2a; Methods	Fig.2a
Explicit identification rate	MUSE 89% explicit identification	GPT-4 lower by 16 percentage points	MUSE +16pp vs GPT-4	expert-evaluated responses	Fig.2b-e	Fig.2b
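
The relative deltas in the first row follow directly from the reported means. A minimal sketch of that arithmetic is shown here; the helper name `relative_gain` is illustrative and not taken from the paper. The identification-rate delta is stated in percentage points (an absolute difference), so it is not computed this way.

```python
def relative_gain(value: float, baseline: float) -> float:
    """Return the relative improvement of value over baseline, in percent."""
    return (value - baseline) / baseline * 100.0


# Mean overall quality scores as reported in the table.
muse, gpt4, lay_high = 8.1, 5.9, 6.3

gain_vs_gpt4 = relative_gain(muse, gpt4)     # roughly 37%
gain_vs_lay = relative_gain(muse, lay_high)  # roughly 29%

print(f"MUSE vs GPT-4: +{gain_vs_gpt4:.0f}%")
print(f"MUSE vs lay (high): +{gain_vs_lay:.0f}%")
```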

What To Try In 7 Days

Prototype a retrieval-augmented LLM pipeline: generate search queries from posts, retrieve top pages, and feed extracted evidence into your LLM.

Add an informative image captioning step (OCR + celebrity tags) to make image content LLM-readable.

Run a small A/B user test (100–1,000 users) comparing raw LLM responses vs retrieval-grounded responses and measure belief change.
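
The first two steps above can be sketched as a small pipeline skeleton. This is a rough outline under stated assumptions, not MUSE's actual implementation: every name here (`Post`, `Evidence`, `build_prompt`, `correct_post`, and the stand-in components) is hypothetical, and the query generator, search backend, captioner, and LLM are passed in as plain callables so you can swap in whatever services you use.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Post:
    text: str
    image_path: Optional[str] = None


@dataclass
class Evidence:
    url: str
    snippet: str


def build_prompt(post_text: str, caption: Optional[str], evidence: List[Evidence]) -> str:
    """Assemble a prompt that lists numbered evidence snippets with their URLs."""
    lines = [f"Post: {post_text}"]
    if caption:
        lines.append(f"Image description: {caption}")
    lines.append("Evidence:")
    for i, ev in enumerate(evidence, start=1):
        lines.append(f"[{i}] {ev.url}: {ev.snippet}")
    lines.append("Identify any inaccuracies and explain them, citing evidence by number.")
    return "\n".join(lines)


def correct_post(
    post: Post,
    make_queries: Callable[[str], List[str]],
    search: Callable[[str], List[Evidence]],
    caption_image: Callable[[str], str],
    llm: Callable[[str], str],
    top_k: int = 3,
) -> str:
    """Caption the image (if any), retrieve evidence, and ask the LLM for a correction."""
    caption = caption_image(post.image_path) if post.image_path else None
    context = post.text if caption is None else f"{post.text}\n{caption}"
    evidence: List[Evidence] = []
    seen_urls = set()
    for query in make_queries(context):
        for ev in search(query)[:top_k]:
            if ev.url not in seen_urls:  # keep each source only once
                seen_urls.add(ev.url)
                evidence.append(ev)
    return llm(build_prompt(post.text, caption, evidence))


# Stand-in components (assumptions) so the sketch runs offline.
def fake_queries(context: str) -> List[str]:
    return [context[:40]]


def fake_search(query: str) -> List[Evidence]:
    return [
        Evidence("https://example.org/a", "Snippet A"),
        Evidence("https://example.org/a", "Duplicate of the same page"),
    ]


def fake_caption(path: str) -> str:
    return "A screenshot containing the text 'BREAKING'"


def echo_llm(prompt: str) -> str:
    return prompt  # returns the prompt so you can inspect what the LLM would see


demo_output = correct_post(
    Post("Claim X happened yesterday.", "img.png"),
    fake_queries, fake_search, fake_caption, echo_llm,
)
print(demo_output)
```

Passing the components in as callables keeps the control flow testable without network access, and makes it straightforward to compare a raw-LLM arm (empty evidence) against a retrieval-grounded arm in the A/B test.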

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/Social-Futures-Lab/MUSE

Data URLs

https://github.com/Social-Futures-Lab/MUSE (tweet IDs and code)

Risks & Boundaries

Limitations

No video input support; only text and images handled.

Evaluation is English-only and focused on X Community Notes posts.

When Not To Use

For video-based misinformation without an image snapshot.

In non-English deployments without retraining captioning and retrieval filters.

Failure Modes

Bad or biased retrieval sources lead to incorrect explanations.

Publisher credibility labels (MBFC) may be incomplete or contested.
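
One way to keep both failure modes visible is to filter retrieved URLs by publisher rating while tracking unrated domains separately instead of silently dropping or accepting them. The sketch below is an illustration only: the `RATINGS` table, its labels, and the function names are hypothetical, and a real deployment would load ratings from a source such as MBFC.

```python
from typing import Dict, Iterable, List, Set, Tuple
from urllib.parse import urlparse

# Hypothetical ratings table (assumption); real labels would come from a rating source.
RATINGS: Dict[str, str] = {
    "example-news.org": "high",
    "example-rumors.net": "low",
}
ALLOWED: Set[str] = {"high"}


def domain_of(url: str) -> str:
    """Extract the lowercase host from a URL, dropping a leading 'www.'."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def split_by_credibility(
    urls: Iterable[str],
    ratings: Dict[str, str] = RATINGS,
    allowed: Set[str] = ALLOWED,
) -> Tuple[List[str], List[str], List[str]]:
    """Return (kept, dropped, unrated) so missing labels are surfaced, not hidden."""
    kept, dropped, unrated = [], [], []
    for url in urls:
        rating = ratings.get(domain_of(url))
        if rating is None:
            unrated.append(url)
        elif rating in allowed:
            kept.append(url)
        else:
            dropped.append(url)
    return kept, dropped, unrated


kept, dropped, unrated = split_by_credibility([
    "https://www.example-news.org/story",
    "https://example-rumors.net/post",
    "https://unknown-site.example/page",
])
```

Logging the `unrated` bucket gives a rough measure of how often the credibility table has no opinion, which is the gap the second failure mode describes.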

Core Entities

Models

GPT-4, Llama-3 (70B), BLIP-2

Metrics

Expert overall quality score (0-10), Identification/comprehensiveness percentages, Reference reachability and relevance, End-user belief change (1-7 scale)

Datasets

X Community Notes (tweets and lay responses)

Context Entities

Models

msmarco-distilbert-base-tas-b, facebook/dino-vitb8, BLIP-2

Datasets

MBFC publisher ratings

You May Also Want to Read

Professional multilingual TruthfulQA shows truth gaps across languages but smaller than expected

Train a model to judge and correct its own facts with token-level rewards to cut hallucinations

TruthHypo benchmark and KnowHD detector to measure and filter hallucinated scientific hypotheses

Use weak or small models as judges: peer prediction rewards honesty and detects deception even when judges are far weaker

Induce a model to hallucinate, then penalize those hallucinations at decoding to reduce LLM fabrications
