MUSE: an LLM augmented with vision and credibility-aware web retrieval that corrects social media misinformation

March 17, 2024 · 8 min

Overview

Decision Snapshot: Needs Validation

The paper reports strong, multi-pronged evidence (expert annotations, ablations, user study) but is limited to English and X and depends on commercial LLMs and MBFC ratings.

Citations: 4

Evidence Strength: 0.80

Confidence: 0.90

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 70%

Novelty: 60%

Authors

Xinyi Zhou, Ashish Sharma, Amy X. Zhang, Tim Althoff

Links

Abstract / PDF / Code / Data

Why It Matters For Business

MUSE can automate credible, auditable misinformation corrections at scale, reducing dependence on slow, expensive human fact-checking and improving user belief accuracy for platforms and publishers.

Summary TLDR

MUSE combines a large language model with image captioning and credibility-aware web retrieval to identify and explain misinformation on social media, backing each correction with grounded links. In an expert evaluation on X Community Notes, MUSE outperforms GPT-4 (mean quality 8.1 vs 5.9) and high-helpfulness lay responses, reduces hallucinated links, works for both images and text, and improves end users' ability to spot misinformation by 9.8%. Runtime is about 2 minutes, and cost is about $0.50 per post (now about $0.20).

Problem Statement

Social media posts often mix true, false, and misleading elements and can include images. Manual corrections scale poorly. Off-the-shelf LLMs are fluent but hallucinate, lack current knowledge, and struggle with images. We need a practical system that finds timely, credible evidence, handles visuals, and produces auditable corrections.

Main Contribution

MUSE, a system that augments an LLM with informative image captioning and credibility-aware web retrieval to produce grounded corrections.

A 13-dimension expert evaluation rubric covering identification, explanation, text quality, and references.
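The credibility-aware retrieval step in the first contribution can be illustrated with a minimal sketch. It is not the authors' implementation: the `CREDIBILITY` table, its numeric scores, the `min_score` threshold, and the example domains are all hypothetical stand-ins for MBFC-style publisher ratings, and dropping unrated publishers is an assumption made here for conservatism.

```python
from urllib.parse import urlparse

# Hypothetical credibility table standing in for MBFC-style publisher ratings.
# Domains and scores are illustrative only, not taken from the paper.
CREDIBILITY = {
    "example-news.org": 0.9,
    "example-blog.net": 0.3,
}


def domain_of(url: str) -> str:
    """Return the host of a URL without a leading 'www.'."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def filter_by_credibility(results, min_score=0.6):
    """Keep results whose publisher is rated at or above min_score.

    Results are ordered by credibility first and retrieval relevance second.
    Unrated publishers are dropped (an assumption, not the paper's rule).
    """
    kept = []
    for r in results:
        score = CREDIBILITY.get(domain_of(r["url"]))
        if score is not None and score >= min_score:
            kept.append({**r, "credibility": score})
    return sorted(kept, key=lambda r: (r["credibility"], r["relevance"]), reverse=True)


results = [
    {"url": "https://www.example-news.org/a", "relevance": 0.7},
    {"url": "https://example-blog.net/b", "relevance": 0.9},
    {"url": "https://unknown.example/c", "relevance": 0.8},
]
kept = filter_by_credibility(results)
```

In this toy run only the page from the highly rated publisher survives, even though the low-rated page had the higher relevance score. A real deployment would need to decide how to treat unrated or contested publishers rather than silently dropping them.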

Key Findings

MUSE achieves higher overall expert-rated quality than baselines.

Numbers: Mean overall quality: MUSE 8.1, GPT-4 5.9, laypeople (high) 6.3

Practical Use: Use retrieval and vision augmentation to substantially improve correction quality compared to raw LLM outputs or lay answers.

Evidence Ref: Fig. 2a; expert evaluation, N=232 tweets, N=464 responses

MUSE identifies and explains inaccuracies more often and more completely.

Numbers: 89% explicit identification; 61% full identification of all inaccuracies vs GPT-4 38%

Practical Use: For posts that mix true and false claims, MUSE finds more of the problematic parts and gives clearer explanations, which makes it a good fit for automated moderation assistance.

Evidence Ref: Fig. 2b-d

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| Expert overall quality (mean ± SD) | MUSE 8.1 ± 2.0; GPT-4 5.9 ± 2.7; lay (high) 6.3 ± 2.0 | | MUSE +37% vs GPT-4; +29% vs lay (high) | 232 tweets; expert ratings | Fig. 2a; Methods | Fig. 2a |
| Explicit identification rate | MUSE 89% explicit identification | GPT-4 lower by 16 percentage points | MUSE +16 pp vs GPT-4 | Expert-evaluated responses | Fig. 2b-e | Fig. 2b |
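The quality deltas in the table above are relative gains over each baseline mean, while the identification delta is a difference in percentage points. A short check of that arithmetic, using only the means reported above (the helper name `relative_gain` is illustrative):

```python
def relative_gain(value: float, baseline: float) -> float:
    """Relative improvement of value over baseline, in percent."""
    return (value - baseline) / baseline * 100


muse, gpt4, lay_high = 8.1, 5.9, 6.3  # mean expert quality scores from the table

gain_vs_gpt4 = relative_gain(muse, gpt4)
gain_vs_lay = relative_gain(muse, lay_high)
print(f"{gain_vs_gpt4:.0f}% vs GPT-4, {gain_vs_lay:.0f}% vs lay (high)")
# prints: 37% vs GPT-4, 29% vs lay (high)

# The identification row is a percentage-point gap, not a relative gain:
# 89% for MUSE with GPT-4 lower by 16 pp implies roughly 73% for GPT-4.
gpt4_identification = 89 - 16
```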

What To Try In 7 Days

Prototype a retrieval-augmented LLM pipeline: generate search queries from posts, retrieve top pages, and feed extracted evidence into your LLM.

Add an informative image captioning step (OCR + celebrity tags) to make image content LLM-readable.

Run a small A/B user test (100–1,000 users) comparing raw LLM responses vs retrieval-grounded responses and measure belief change.
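The first two steps above can be prototyped as a thin orchestration layer before any real components are chosen. The sketch below is one possible shape, not MUSE's actual code: `correct_post`, `build_correction_prompt`, the prompt wording, and the `top_k` default are all assumptions, and the query generator, search function, and LLM are passed in as plain callables so the flow can be tested offline with stubs.

```python
from typing import Callable, List, Optional


def build_correction_prompt(post_text: str, evidence: List[dict],
                            image_caption: Optional[str] = None) -> str:
    """Assemble an LLM prompt from the post, an optional image caption,
    and numbered evidence snippets with their source URLs."""
    lines = ["Post: " + post_text]
    if image_caption:
        lines.append("Image description: " + image_caption)
    lines.append("Evidence:")
    for i, ev in enumerate(evidence, start=1):
        lines.append(f"[{i}] {ev['snippet']} (source: {ev['url']})")
    lines.append("Identify any inaccuracies in the post and explain them, "
                 "citing evidence by number.")
    return "\n".join(lines)


def correct_post(post_text: str,
                 make_queries: Callable[[str], List[str]],
                 search: Callable[[str], List[dict]],
                 llm: Callable[[str], str],
                 image_caption: Optional[str] = None,
                 top_k: int = 3) -> str:
    """Run query generation, retrieval, and grounded generation in sequence."""
    evidence, seen = [], set()
    for query in make_queries(post_text):
        for hit in search(query)[:top_k]:
            if hit["url"] not in seen:  # de-duplicate pages across queries
                seen.add(hit["url"])
                evidence.append(hit)
    return llm(build_correction_prompt(post_text, evidence, image_caption))


# Offline stubs so the sketch runs as-is; swap in real components later.
def fake_queries(text: str) -> List[str]:
    return [text[:40]]


def fake_search(query: str) -> List[dict]:
    return [{"url": "https://example.org/report", "snippet": "Example snippet."}]


def echo_llm(prompt: str) -> str:
    return prompt  # returns the prompt so the assembled input can be inspected


out = correct_post("Claim X happened in 2020.", fake_queries, fake_search,
                   echo_llm, image_caption="A chart with the text 'X'")
```

Passing the caption in as text keeps the image step independent of the LLM choice: whatever OCR or tagging produces the caption, the downstream prompt only sees a string.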

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

No video input support; only text and images handled.

Evaluation is English-only and focused on X Community Notes posts.

When Not To Use

For video-based misinformation without an image snapshot.

In non-English deployments without retraining captioning and retrieval filters.

Failure Modes

Bad or biased retrieval sources lead to incorrect explanations.

Publisher credibility labels (MBFC) may be incomplete or contested.

Core Entities

Models

GPT-4, Llama-3 (70B), BLIP-2

Metrics

Expert overall quality score (0-10), identification/comprehensiveness percentages, reference reachability and relevance, end-user belief change (1-7 scale)

Datasets

X Community Notes (tweets and lay responses)

Context Entities

Models

msmarco-distilbert-base-tas-b, facebook/dino-vitb8, BLIP-2

Datasets

MBFC publisher ratings