CARE-MI: a 1,612-sample Chinese benchmark to measure long-form misinformation in maternity and infant care

July 4, 2023 · 8 min

Overview

Production Readiness

0.4

Novelty Score

0.45

Cost Impact Score

0.4

Citation Count

4

Authors

Tong Xiang, Liangzhi Li, Wangyue Li, Mingbai Bai, Lu Wei, Bowen Wang, Noa Garcia

Why It Matters For Business

If you deploy Chinese LLMs in maternal or infant health contexts, expect factual errors; CARE-MI helps measure and reduce that risk with an expert-validated benchmark and an automated judge that uses retrieved evidence.

Summary TLDR

CARE-MI is a Chinese long-form (paragraph-level) benchmark for measuring LLM misinformation in maternity and infant care. The authors built 1,612 expert-checked question samples from medical knowledge graphs and exam corpora, attached retrieved supporting paragraphs, and trained "judgment" models that approximate human scoring of correctness and interpretability. Evaluations show that the LLMs tested on Chinese questions (GPT-4, GPT-3.5, ChatGLM, BELLE, MOSS, LLaMA variants) still lag human experts on correctness and reasoning, and that giving the automatic judge retrieved knowledge improves it (Pearson for correctness rises from 0.779 to 0.868). Code, data, and judge models are published.

Problem Statement

Existing misinformation evaluations focus on short tasks (multiple choice, single-token completion) and are mainly in English. There is no Chinese benchmark that measures long-form, knowledge-heavy misinformation in a sensitive domain like maternity and infant care. This gap prevents reliable automated evaluation and comparison of Chinese LLMs on harmful medical misinformation.

Main Contribution

CARE-MI dataset: 1,612 expert-checked Chinese long-form questions for maternity and infant care with supporting retrieved knowledge.

A reproducible synthetic pipeline that creates true/false statements, generates TF and open-ended questions, retrieves evidence, and filters via medical experts.

Trained judgment models (LLaMA-13B-T backbone) that approximate human expert scoring for correctness and interpretability; knowledge-aware judges perform better.

Key Findings

CARE-MI contains 1,612 expert-validated LF samples from an initial pool of 5,779 synthetic samples.

Numbers: 1,612 final samples (5,779 initial; 1,624 passing thresholds before 12 linguistic exclusions)

Top-performing models still fall short of a medical expert on correctness.

Numbers: GPT-4 correctness 0.867 vs human baseline 0.938 (human eval on 200 samples)

Models perform much better on binary True/False (TF) questions than on open-ended (OE) questions.

Numbers: Most models achieve ~0.8 correctness on TF, but only GPT models reach ~0.6 on OE
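The TF versus OE comparison comes down to averaging a 0-1 score per question type. A minimal sketch of that aggregation follows, assuming each evaluated answer is stored as a record with a `type` field and a `correctness` field; both field names and the toy scores are hypothetical and do not come from the benchmark's files.

```python
from collections import defaultdict
from statistics import mean


def mean_score_by_type(records):
    """Average a 0-1 score per question type, plus an overall 'All' average."""
    buckets = defaultdict(list)
    for rec in records:
        buckets[rec["type"]].append(rec["correctness"])
    summary = {qtype: mean(scores) for qtype, scores in buckets.items()}
    summary["All"] = mean(rec["correctness"] for rec in records)
    return summary


# Made-up scores for illustration only; not taken from the paper.
toy_records = [
    {"type": "TF", "correctness": 1.0},
    {"type": "TF", "correctness": 0.5},
    {"type": "OE", "correctness": 0.5},
    {"type": "OE", "correctness": 0.0},
]
summary = mean_score_by_type(toy_records)
print(summary)
```

Reporting TF and OE separately alongside the overall average keeps a strong TF score from hiding a weak OE score.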

Including retrieved knowledge improves automated judgment models.

Numbers: Judgment model Pearson for correctness: 0.779 (w/o knowledge) → 0.868 (w/ knowledge)
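The Pearson figures measure how closely the judge's scores track human scores across the same answers. The sketch below computes a Pearson correlation in plain Python; it assumes both the judge and the human annotators emit one scalar score per answer, and the two toy score lists are invented for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


# Invented scores for five answers; not data from the paper.
human_scores = [1.0, 0.5, 0.0, 1.0, 0.5]
judge_scores = [0.9, 0.6, 0.2, 0.8, 0.4]
r = pearson(human_scores, judge_scores)
print(f"Pearson r = {r:.3f}")
```

Running the same comparison once with and once without retrieved knowledge in the judge's input is the kind of ablation the 0.779 versus 0.868 figures describe.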

Larger or more instruction-tuned models do not guarantee better factual correctness.

Numbers: BELLE-7B-0.2M (smaller instruction set) correctness ↑0.023 but interpretability ↓0.083 vs BELLE-7B-2M

Results

CARE-MI size

Value: 1,612 samples

Best model correctness (All)

Value: 0.867

Baseline: Human baseline 0.938 (200 samples)

Judgment model Pearson (correctness) w/ knowledge

Value: 0.868

Baseline: w/o knowledge 0.779

Average answer length (tokens)

Value: 123.1 tokens

Human-judge agreement (correctness)

Value: Fleiss' kappa 0.755
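Fleiss' kappa compares the observed agreement among several raters with the agreement expected by chance. The sketch below implements the standard formula in plain Python. It assumes every item is rated by the same number of annotators using categorical labels; how the benchmark's annotators' scores map to categories is not stated here, so the toy labels are purely illustrative.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each a list of categorical labels.

    Every item must carry the same number of ratings (at least two).
    """
    n_items = len(ratings)
    n_raters = len(ratings[0]) if ratings else 0
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("each item needs the same number (>= 2) of ratings")

    category_totals = Counter()
    agreement_sum = 0.0
    for item in ratings:
        counts = Counter(item)
        category_totals.update(counts)
        # Share of rater pairs that agree on this item.
        agreeing = sum(c * c for c in counts.values()) - n_raters
        agreement_sum += agreeing / (n_raters * (n_raters - 1))

    observed = agreement_sum / n_items
    total = n_items * n_raters
    expected = sum((c / total) ** 2 for c in category_totals.values())
    if expected == 1.0:
        raise ValueError("kappa is undefined when only one category is used")
    return (observed - expected) / (1.0 - expected)


# Invented labels from three hypothetical annotators on four answers.
toy_ratings = [[1, 1, 0], [0, 0, 0], [1, 1, 1], [0, 1, 0]]
kappa = fleiss_kappa(toy_ratings)
print(f"Fleiss' kappa = {kappa:.3f}")
```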

Who Should Care

What To Try In 7 Days

Run CARE-MI on your Chinese model to baseline factuality.

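Add simple paragraph retrieval (BM25) so the judge sees supporting evidence, then re-evaluate; the judgment model tracks expert scores more closely with knowledge (Pearson for correctness 0.779 → 0.868).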

Use the provided LLaMA-13B-T judgment model to triage outputs that need human review.
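The BM25 retrieval step suggested above can be prototyped without extra dependencies. The following is a minimal sketch of Okapi BM25 over a small paragraph collection, not the authors' retrieval code. It assumes character-level tokenization as a stand-in for a proper Chinese word segmenter; the class name `BM25`, the parameter defaults `k1=1.5` and `b=0.75`, and the three placeholder paragraphs are all illustrative choices.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: one token per non-whitespace character, a crude
    # stand-in for a real Chinese word segmenter.
    return [ch for ch in text if not ch.isspace()]


class BM25:
    """Minimal Okapi BM25 scorer over an in-memory list of paragraphs."""

    def __init__(self, docs, k1=1.5, b=0.75):
        if not docs:
            raise ValueError("need at least one document")
        self.k1, self.b = k1, b
        self.doc_tokens = [tokenize(d) for d in docs]
        self.doc_freqs = [Counter(tokens) for tokens in self.doc_tokens]
        self.avgdl = sum(len(t) for t in self.doc_tokens) / len(docs) or 1.0
        doc_count = Counter()
        for freqs in self.doc_freqs:
            doc_count.update(freqs.keys())  # each doc counts once per term
        n = len(docs)
        # Non-negative IDF variant: ln((N - n_t + 0.5) / (n_t + 0.5) + 1).
        self.idf = {
            term: math.log((n - c + 0.5) / (c + 0.5) + 1.0)
            for term, c in doc_count.items()
        }

    def score(self, query, idx):
        freqs = self.doc_freqs[idx]
        length_ratio = len(self.doc_tokens[idx]) / self.avgdl
        norm = self.k1 * (1 - self.b + self.b * length_ratio)
        total = 0.0
        for term in tokenize(query):
            f = freqs.get(term, 0)
            if f:
                total += self.idf[term] * f * (self.k1 + 1) / (f + norm)
        return total

    def top_k(self, query, k=3):
        order = sorted(
            range(len(self.doc_tokens)),
            key=lambda i: self.score(query, i),
            reverse=True,
        )
        return order[:k]


# Placeholder paragraph titles, not benchmark data.
paragraphs = ["孕期营养与饮食", "婴儿睡眠与护理", "疫苗接种时间安排"]
index = BM25(paragraphs)
best = index.top_k("婴儿睡眠问题", k=1)
print(paragraphs[best[0]])
```

The top-ranked paragraphs would then be passed to the judge alongside the question and the model's answer.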

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Domain- and language-specific: only Chinese maternity and infant care long-form queries.
  • Not built from real user queries; may not reflect community question distribution.
  • Knowledge can become outdated; reference answers reflect clinical consensus at the time the benchmark was built.
  • Human annotations still carry subjective bias despite reported agreement.

When Not To Use

  • To evaluate models for other medical subdomains or non-Chinese languages.
  • As a substitute for clinical decision-making or patient-facing automated advice.
  • For long-term monitoring without periodic benchmark updates.

Failure Modes

  • Fluent but incorrect answers: models may give detailed wrong explanations.
  • Judge overconfidence: automated judges mirror labeler bias if training labels are biased.
  • Knowledge staleness: benchmark facts may become outdated and produce false negatives.
  • Coverage gaps: benchmark questions are expert-focused and may miss common community queries.

Core Entities

Models

  • GPT-4
  • GPT-3.5-turbo
  • LLaMA-13B-T
  • ChatGLM-6B
  • SFT
  • BELLE-7B-2M
  • BELLE-7B-0.2M
  • BERT-Large
  • GPT-3-350M
  • GPT-3-6.7B

Metrics

  • correctness
  • interpretability
  • Pearson correlation
  • Accuracy
  • average score (0-1)

Datasets

  • CARE-MI
  • BIOS
  • CPubMed
  • MLEC-QA
  • MEDQA
  • Chinese Wikipedia
  • Medical books (Jin et al. 2020)

Benchmarks

  • CARE-MI

Context Entities

Models

  • Human expert baseline