Overview
The benchmark and experiments provide clear evidence that retrieval improves automated judgments and that the evaluated LLMs lag human experts on Chinese maternity and infant care questions. Results are reproducible given the released code and data, but the dataset is domain- and time-limited.
Citations: 4
Evidence Strength: 0.80
Confidence: 0.86
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 2/5
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 40%
Production readiness: 40%
Novelty: 45%
Why It Matters For Business
If you deploy Chinese LLMs in maternal or infant health contexts, expect factual errors; CARE-MI helps measure and reduce that risk with an expert-validated benchmark and an automated judge that uses retrieved evidence.
Summary TLDR
CARE-MI is a Chinese long-form (paragraph-level) benchmark focused on maternity and infant care, designed to measure misinformation in LLM outputs. The authors built 1,612 expert-checked question samples from medical knowledge graphs and exam corpora, added retrieved supporting paragraphs, and provide trained "judgment" models that approximate human scoring of correctness and explanation. Evaluations show that the LLMs tested on Chinese input (GPT-4, GPT-3.5, ChatGLM, BELLE, MOSS, LLaMA variants) still lag human experts on correctness and reasoning, and that adding retrieved knowledge improves the automatic judges' agreement with human ratings (Pearson correlation of 0.868 on correctness with knowledge). Code, data, and judge models are published.
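The Pearson figure quoted above measures how closely an automated judge's scores track human scores. A minimal sketch of that calculation follows; the `human` and `judge` lists are made-up toy values for illustration, not numbers from the paper, and the function name is an assumption.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)

# Toy scores (hypothetical): human correctness ratings vs. judge-model scores.
human = [1.0, 0.0, 1.0, 0.5, 0.0]
judge = [0.9, 0.2, 0.8, 0.6, 0.1]
r = pearson(human, judge)
print(f"Pearson r = {r:.3f}")
```

A value near 1 means the judge ranks and spaces answers much as the human raters do; it says nothing about whether the human labels themselves are unbiased.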
Problem Statement
Existing misinformation evaluations focus on short tasks (multiple choice, single-token completion) and mainly English. There is no Chinese benchmark that measures long-form, knowledge-heavy misinformation in a sensitive domain like maternity and infant care. This prevents reliable automated evaluation and comparison of Chinese LLMs on harmful medical misinformation.
Main Contribution
CARE-MI dataset: 1,612 expert-checked Chinese long-form questions for maternity and infant care with supporting retrieved knowledge.
A reproducible synthetic pipeline that creates true and false statements, generates true/false (TF) and open-ended questions, retrieves evidence, and filters the results through medical experts.
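The first two pipeline stages can be pictured with a small sketch. This summary does not describe the authors' actual generation rules, so everything below is a hypothetical illustration: the `Triple` type, the distractor-substitution rule for false statements, and the question templates are all assumptions, and the example is in English for readability.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Triple:
    """A knowledge-graph fact (hypothetical schema for illustration)."""
    head: str
    relation: str
    tail: str

def true_statement(t: Triple) -> str:
    # Verbalize the triple as-is.
    return f"{t.head} {t.relation} {t.tail}."

def false_statement(t: Triple, distractor: str) -> str:
    # Assumed rule: swap the tail for a different entity to make the claim false.
    if distractor == t.tail:
        raise ValueError("distractor must differ from the true tail")
    return f"{t.head} {t.relation} {distractor}."

def tf_question(statement: str) -> str:
    return f"True or false: {statement}"

def open_question(t: Triple) -> str:
    return f"What can you say about how {t.head} relates to {t.tail}? Explain."

fact = Triple("Folic acid", "is a kind of", "vitamin")
samples = [
    (tf_question(true_statement(fact)), True),
    (tf_question(false_statement(fact, "mineral")), False),
]
```

In the described pipeline, samples like these would still pass through evidence retrieval and expert filtering before entering the benchmark.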
Key Findings
CARE-MI contains 1,612 expert-validated long-form (LF) samples, filtered from an initial pool of 5,779 synthetic samples.
Top-performing models still fall short of a medical expert on correctness: GPT-4 scores 0.867 overall, against a human baseline of 0.938 measured on 200 questions.
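As a quick check on how aggressive the expert filtering was, the retention rate follows directly from the two sample counts above. The variable names are illustrative.

```python
initial_pool = 5779   # synthetic samples before expert review
final_size = 1612     # samples kept in CARE-MI

retention = final_size / initial_pool
print(f"Retained {retention:.1%} of the synthetic pool")  # Retained 27.9% of the synthetic pool
```

Roughly one in four synthetic samples survived expert review.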
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| CARE-MI size | 1,612 samples | — | — | CARE-MI final | Final benchmark after expert filtering | Section 3, Appendix A.1 |
| Best model correctness (All) | 0.867 | Human baseline 0.938 (200 samples) | −0.071 vs human | All | GPT-4 overall correctness; human evaluated on 200 questions | Table 4 |
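The Delta column in the table is the model score minus the human baseline. A short sketch of that arithmetic, using only the two values reported in the row above:

```python
gpt4_correctness = 0.867    # best model, All split
human_correctness = 0.938   # human baseline on 200 samples

delta = round(gpt4_correctness - human_correctness, 3)
print(delta)  # -0.071

# Gap expressed relative to the human baseline.
relative_gap = abs(delta) / human_correctness
print(f"{relative_gap:.1%} below the human baseline")
```

Note that the human score comes from a 200-question subset, so the two numbers are not computed on identical samples.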
What To Try In 7 Days
Run CARE-MI on your Chinese model to baseline factuality.
Add simple paragraph retrieval (BM25) and re-evaluate; the automated judges agree more closely with human ratings when given retrieved knowledge.
Use the provided LLaMA-13B-T judgment model to triage outputs that need human review.
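For the retrieval step, a minimal BM25 (Okapi) ranker can be written without external dependencies. This is a sketch, not the benchmark's own retriever: the character-level tokenizer is a simplifying assumption for Chinese text (a word segmenter would usually do better), the `k1` and `b` values are common defaults, and the paragraphs and query are toy strings invented for illustration.

```python
import math
from collections import Counter

def tokenize(text):
    # Assumption: character-level tokens, whitespace dropped.
    return [ch for ch in text if not ch.isspace()]

class BM25:
    def __init__(self, docs, k1=1.5, b=0.75):
        if not docs:
            raise ValueError("need at least one document")
        self.k1, self.b = k1, b
        self.tokens = [tokenize(d) for d in docs]
        self.freqs = [Counter(t) for t in self.tokens]
        self.lens = [len(t) for t in self.tokens]
        self.avg_len = sum(self.lens) / len(docs)
        # Document frequency: number of docs containing each token.
        df = Counter()
        for toks in self.tokens:
            df.update(set(toks))
        n = len(docs)
        # Smoothed, always-positive IDF variant.
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}

    def score(self, query, i):
        norm = self.k1 * (1 - self.b + self.b * self.lens[i] / self.avg_len)
        total = 0.0
        for tok in tokenize(query):
            f = self.freqs[i].get(tok, 0)
            if f:
                total += self.idf[tok] * f * (self.k1 + 1) / (f + norm)
        return total

    def top_k(self, query, k=1):
        order = sorted(range(len(self.tokens)),
                       key=lambda i: self.score(query, i), reverse=True)
        return order[:k]

# Toy paragraphs (hypothetical, not from CARE-MI).
paragraphs = [
    "新生儿黄疸通常在出生后出现",
    "孕期应补充叶酸",
    "婴儿疫苗接种时间表",
]
index = BM25(paragraphs)
query = "孕妇需要补充叶酸吗"
best = index.top_k(query, k=1)
print(paragraphs[best[0]])
```

The retrieved paragraph would then be passed to the judge alongside the model's answer, and the evaluation rerun to compare scores with and without knowledge.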
Risks & Boundaries
Limitations
Domain- and language-specific: only Chinese maternity and infant care long-form queries.
Not built from real user queries; may not reflect community question distribution.
When Not To Use
To evaluate models for other medical subdomains or non-Chinese languages.
As a substitute for clinical decision-making or patient-facing automated advice.
Failure Modes
Fluent but incorrect answers: models may give detailed wrong explanations.
Judge overconfidence: automated judges mirror labeler bias if training labels are biased.

