CARE-MI: a 1,612-sample Chinese benchmark to measure long-form misinformation in maternity and infant care

Overview

Decision Snapshot: Needs Validation

The benchmark and experiments provide clear evidence that retrieval improves automated judgments and that current Chinese LLMs lag experts; results are reproducible given the released code and data but the dataset is domain- and time-limited.

Citations: 4

Evidence Strength: 0.80

Confidence: 0.86

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 2/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 40%

Production readiness: 40%

Novelty: 45%

Authors

Tong Xiang, Liangzhi Li, Wangyue Li, Mingbai Bai, Lu Wei, Bowen Wang, Noa Garcia

Links

Abstract / PDF / Code / Data

Why It Matters For Business

If you deploy Chinese LLMs in maternal or infant health contexts, expect factual errors; CARE-MI helps measure and reduce that risk with an expert-validated benchmark and an automated judge that uses retrieved evidence.

Who Should Care

Product Manager, ML Engineer, Founder, Data Scientist

Summary TLDR

CARE-MI is a Chinese long-form (paragraph-level) benchmark focused on maternity and infant care to measure misinformation from LLMs. The authors built 1,612 expert-checked question samples from medical knowledge graphs and exam corpora, added retrieved supporting paragraphs, and provide trained "judgment" models that approximate human correctness and explanation scoring. Evaluations show that current LLMs tested in Chinese (GPT-4, GPT-3.5, ChatGLM, BELLE, MOSS, LLaMA variants) still lag human experts on correctness and reasoning, and that adding retrieved knowledge improves automatic judge performance (Pearson correlation of 0.868 on correctness with knowledge). Code, data, and judge models are published.
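To make the judge-agreement metric concrete, here is a minimal sketch of the Pearson correlation between human scores and automated judge scores. The score lists are made-up illustration values, not data from the paper, and the function name `pearson` is a hypothetical helper rather than part of the released code.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Covariance and variances (unnormalized; the 1/n factors cancel out).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)

# Made-up scores for illustration only; not taken from the paper.
human_scores = [1.0, 0.0, 0.5, 1.0, 0.0]
judge_scores = [0.9, 0.2, 0.6, 0.8, 0.1]
r = pearson(human_scores, judge_scores)
print(f"judge vs. human Pearson r = {r:.3f}")
```

A value near 1 means the judge ranks answers much as the human annotators do; it does not by itself show that the judge's absolute scores are calibrated.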

Problem Statement

Existing misinformation evaluations focus on short tasks (multiple choice, single-token completion) and are mostly in English. There is no Chinese benchmark that measures long-form, knowledge-heavy misinformation in a sensitive domain like maternity and infant care. This prevents reliable automated evaluation and comparison of Chinese LLMs on harmful medical misinformation.

Main Contribution

CARE-MI dataset: 1,612 expert-checked Chinese long-form questions for maternity and infant care with supporting retrieved knowledge.

A reproducible synthetic pipeline that creates true/false statements, generates TF and open-ended questions, retrieves evidence, and filters via medical experts.

Key Findings

CARE-MI contains 1,612 expert-validated long-form (LF) samples drawn from an initial pool of 5,779 synthetic samples.

Numbers: 1,612 final samples (5,779 initial; 1,624 passing thresholds before 12 linguistic exclusions)

Practical Use: You can benchmark long-form Chinese medical answers at scale without building gold data from scratch; expect to reuse the pipeline to create domain-specific benchmarks.

Evidence Ref: Section 3, Appendix A.1, Figure 6
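The filtering counts quoted for this finding can be checked with a few lines of arithmetic. The variable names are illustrative; only the three input numbers come from the summary above.

```python
initial_pool = 5779          # synthetic samples before filtering
passed_thresholds = 1624     # samples that passed the thresholds
linguistic_exclusions = 12   # samples removed afterwards on linguistic grounds

final_size = passed_thresholds - linguistic_exclusions
retention = final_size / initial_pool
print(f"final={final_size}, retention={retention:.1%}")
# prints "final=1612, retention=27.9%"
```

Roughly 28% of the synthetic pool survives, which is worth budgeting for if the pipeline is reused to build a benchmark in another domain.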

Top-performing models still fall short of a medical expert on correctness.

Numbers: GPT-4 correctness 0.867 vs human baseline 0.938 (human eval on 200 samples)

Practical Use: Do not rely on current LLMs alone for domain-sensitive medical advice; add expert review or retrieval-backed checks before deployment.

Evidence Ref: Table 4 (All column)
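One way to act on the expert-review advice is to route low-scoring answers to a human. The sketch below assumes a judge that returns a score in [0, 1] per answer; the `JudgedAnswer` type, the `triage` function, and the 0.8 threshold are all hypothetical and would need tuning against human labels.

```python
from dataclasses import dataclass

@dataclass
class JudgedAnswer:
    question_id: str
    judge_score: float  # assumed to lie in [0, 1]; higher means more likely correct

def triage(answers, review_threshold=0.8):
    """Split answers into (auto_accept, needs_review) by judge score."""
    auto_accept, needs_review = [], []
    for answer in answers:
        if answer.judge_score >= review_threshold:
            auto_accept.append(answer)
        else:
            needs_review.append(answer)
    return auto_accept, needs_review

# Made-up scores for illustration only.
batch = [JudgedAnswer("q1", 0.95), JudgedAnswer("q2", 0.40), JudgedAnswer("q3", 0.80)]
accepted, flagged = triage(batch)
print([a.question_id for a in flagged])
```

Because the judge itself is imperfect, a conservative threshold plus periodic spot checks of the auto-accepted set is a safer default than trusting the score outright.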

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
CARE-MI size	1,612 samples	—	—	CARE-MI final	Final benchmark after expert filtering	Section 3, Appendix A.1
Best model correctness (All)	0.867	Human baseline 0.938 (200 samples)	−0.071 vs human	All	GPT-4 overall correctness; human evaluated on 200 questions	Table 4

What To Try In 7 Days

Run CARE-MI on your Chinese model to baseline factuality.

Add simple paragraph retrieval (BM25) and re-evaluate; judge accuracy improves with knowledge.

Use the provided LLaMA-13B-T judgment model to triage outputs that need human review.
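The retrieval step can be prototyped without extra dependencies. The following is a minimal Okapi BM25 sketch, not the authors' retrieval code: the character-level `tokenize` is a crude stand-in for a real Chinese word segmenter, the `k1` and `b` values are common defaults rather than settings from the paper, and the paragraphs are toy strings.

```python
import math
from collections import Counter

def tokenize(text):
    # Assumption: character-level tokens, a crude stand-in for a Chinese segmenter.
    return [ch for ch in text if not ch.isspace()]

class BM25:
    def __init__(self, paragraphs, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.docs = [Counter(tokenize(p)) for p in paragraphs]
        self.lengths = [sum(d.values()) for d in self.docs]
        self.avg_len = sum(self.lengths) / len(self.docs)
        n = len(self.docs)
        # Document frequency: number of paragraphs containing each term.
        df = Counter(term for d in self.docs for term in d)
        # Smoothed IDF; the "+ 1" inside the log keeps it non-negative.
        self.idf = {t: math.log((n - f + 0.5) / (f + 0.5) + 1) for t, f in df.items()}

    def score(self, query, index):
        doc, length = self.docs[index], self.lengths[index]
        total = 0.0
        for term in set(tokenize(query)):  # simplification: each query term counted once
            tf = doc.get(term, 0)
            if tf == 0:
                continue
            norm = tf + self.k1 * (1 - self.b + self.b * length / self.avg_len)
            total += self.idf[term] * tf * (self.k1 + 1) / norm
        return total

    def top_k(self, query, k=1):
        ranked = sorted(range(len(self.docs)),
                        key=lambda i: self.score(query, i), reverse=True)
        return ranked[:k]

# Toy paragraphs for illustration only.
paragraphs = ["孕妇应补充叶酸", "婴儿六个月后添加辅食", "新生儿黄疸常见"]
bm25 = BM25(paragraphs)
print(bm25.top_k("孕妇叶酸", k=1))
```

The top-ranked paragraphs would then be passed to the judge alongside the model's answer; swapping in a proper segmenter is likely the first improvement to try.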

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/Meetyou-AI-Lab/CARE-MI

Data URLs

https://github.com/Meetyou-AI-Lab/CARE-MI

Risks & Boundaries

Limitations

Domain- and language-specific: only Chinese maternity and infant care long-form queries.

Not built from real user queries; may not reflect community question distribution.

When Not To Use

To evaluate models for other medical subdomains or non-Chinese languages.

As a substitute for clinical decision-making or patient-facing automated advice.

Failure Modes

Fluent but incorrect answers: models may give detailed wrong explanations.

Judge overconfidence: automated judges mirror labeler bias if training labels are biased.

Core Entities

Models

GPT-4, GPT-3.5-turbo, LLaMA-13B-T, ChatGLM-6B, SFT, BELLE-7B-2M, BELLE-7B-0.2M, BERT-Large, GPT-3-350M, GPT-3-6.7B

Metrics

correctness, interpretability, Pearson correlation, Accuracy, average score (0-1)

Datasets

CARE-MI, BIOS, CPubMed, MLEC-QA, MEDQA, Chinese Wikipedia, Medical books (Jin et al. 2020)

Benchmarks

CARE-MI

Context Entities

Models

Human expert baseline
