ASTRID: three automated, scalable metrics (CF, RA, CR) to evaluate RAG clinical QA

January 14, 2025 · 8 min

Overview

Production Readiness: 0.7

Novelty Score: 0.4

Cost Impact Score: 0.6

Citation Count: 1

Authors

Mohita Chowdhury, Yajie Vera He, Jared Joselowitz, Aisling Higham, Ernest Lim

Why It Matters For Business

ASTRID gives an automated, clinically validated way to detect ungrounded, out-of-scope, or irrelevant answers; this reduces expensive clinician review and speeds safe iterative development of RAG-based clinical agents.

Summary TLDR

ASTRID is a simple, automatable triad of metrics for RAG-based clinical question answering: Conversational Faithfulness (CF) measures how much of the information in an answer is grounded in the retrieved context; Refusal Accuracy (RA) checks whether the system correctly declines out-of-scope questions; Context Relevance (CR) checks whether the retrieved context matches the question. On a cataract post-op dataset and clinician-augmented examples, CF aligns much better with human perceived faithfulness than the statement-level baseline RF (AUC 0.98 vs 0.83), and the triad plus a scope label predicts clinician-rated harmfulness (avg F1 ≈ 0.835) and helpfulness (avg F1 ≈ 0.715). Several large LLMs can compute these metrics automatically with C

Problem Statement

Existing RAG evaluation metrics break down in conversational clinical settings: they fragment responses into statements (losing nuance), mishandle empathetic or clarifying dialogue, or fail to detect when a system should refuse to answer. Human clinical review is accurate but too costly and slow for iterative development. Developers need automated, validated metrics that map to clinical risk and can be run continuously.

Main Contribution

A safety-oriented hazard analysis for clinical RAG QA systems guided by SACE principles.

ASTRID: three reference-free, LLM-based metrics—Conversational Faithfulness (CF), Refusal Accuracy (RA), Context Relevance (CR)—designed for conversational clinical QA.

Curated and released datasets: FaithfulnessQAC, UniqueQAC, and ClinicalQAC built from real post-op cataract patient questions plus clinician-augmented failure cases.

Empirical validation showing CF aligns with human perceived faithfulness better than a statement-level baseline and that the triad predicts clinician labels for harm, helpfulness, and inappropriateness.

Analysis showing several current large LLMs can automate ASTRID metrics, enabling scalable evaluation pipelines.
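To make the metric definitions concrete, here is a minimal sketch of how CF and RA could be computed once judge decisions are available. It assumes CF is the fraction of informational sentences that a judge marks as grounded in the retrieved context, and that RA is plain accuracy of refuse/answer decisions against scope labels. The judge callables `is_informational` and `is_grounded` are hypothetical stand-ins for LLM prompts, and the keyword-based toy judges exist only so the example runs; none of this is the authors' implementation.

```python
from typing import Callable, List


def conversational_faithfulness(
    sentences: List[str],
    context: str,
    is_informational: Callable[[str], bool],
    is_grounded: Callable[[str, str], bool],
) -> float:
    """Fraction of informational sentences judged grounded in the context.

    Assumption: conversational sentences (greetings, empathy, clarifying
    questions) are left out of the denominator.
    """
    informational = [s for s in sentences if is_informational(s)]
    if not informational:
        # Assumption: an answer that asserts nothing cannot be ungrounded.
        return 1.0
    grounded = sum(1 for s in informational if is_grounded(s, context))
    return grounded / len(informational)


def refusal_accuracy(refused: List[bool], out_of_scope: List[bool]) -> float:
    """Share of questions where the refuse/answer decision matches the scope label."""
    if not refused or len(refused) != len(out_of_scope):
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(r == o for r, o in zip(refused, out_of_scope))
    return correct / len(refused)


# Toy judges (hypothetical): a real pipeline would call an LLM here.
def toy_is_informational(sentence: str) -> bool:
    return not sentence.lower().startswith(("i'm sorry", "thanks", "hello"))


def toy_is_grounded(sentence: str, context: str) -> bool:
    # Crude proxy: every word longer than 4 characters must appear in the context.
    words = [w.strip(".,").lower() for w in sentence.split()]
    return all(w in context.lower() for w in words if len(w) > 4)


context = (
    "Mild itching is common in the first week after cataract surgery. "
    "Use the prescribed drops four times a day."
)
answer = [
    "I'm sorry to hear your eye feels uncomfortable.",
    "Mild itching is common in the first week after cataract surgery.",
    "You should stop using the drops immediately.",
]

cf = conversational_faithfulness(answer, context, toy_is_informational, toy_is_grounded)
ra = refusal_accuracy([True, False, False, True], [True, False, True, True])
print(cf, ra)  # 0.5 0.75
```

The empathetic opening sentence is excluded from the CF denominator, so only the two informational sentences are scored and one of them is ungrounded.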

Key Findings

Conversational Faithfulness (CF) matches human perceived faithfulness much better than statement-level faithfulness (RF).

Numbers: AUC CF=0.98 vs RF=0.83; Pearson CF vs PF=0.90, RF vs PF=0.57
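For teams that want to run the same kind of alignment check on their own data, the sketch below computes AUC and Pearson correlation between an automated metric and human judgments using only the standard library. The data is made up for illustration and is not from the paper; how the authors binarized perceived faithfulness for the AUC is not stated here, so the binary labels are an assumption.

```python
from math import sqrt
from typing import Sequence


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC as the probability that a positive example outscores a negative one.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Made-up toy data: automated metric scores, binary human labels, graded human ratings.
metric_scores = [0.1, 0.4, 0.35, 0.8]
human_faithful = [0, 0, 1, 1]        # assumption: 1 = judged faithful
human_rating = [0.0, 0.25, 0.75, 1.0]

print(roc_auc(metric_scores, human_faithful))  # 0.75
print(pearson(metric_scores, human_rating))
```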

ASTRID's triad (CF, CR, RA) plus a scope label predicts clinician-rated harmfulness and helpfulness accurately.

Numbers: Average F1 harmfulness ≈ 0.835; helpfulness ≈ 0.715
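The following is a minimal sketch of what "triad plus scope label predicts harmfulness" could look like in practice, scored with class-averaged F1. The hand-written rule in `predict_harmful`, the `AstridScores` record, and the 0.5 cutoff are all hypothetical; this summary does not say which classifier the authors used, and the labels below are made up.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AstridScores:
    cf: float       # Conversational Faithfulness in [0, 1]
    cr: bool        # Context Relevance (binary label)
    ra: bool        # refuse/answer decision was correct for this example
    in_scope: bool  # scope label for the question


def predict_harmful(s: AstridScores, cf_cutoff: float = 0.5) -> int:
    """Hypothetical rule: 1 = likely harmful, 0 = likely not harmful."""
    if not s.in_scope:
        # Out of scope: the risk is answering when the system should have refused.
        return 0 if s.ra else 1
    # In scope: the risk is an ungrounded answer or irrelevant retrieved context.
    return 1 if (s.cf < cf_cutoff or not s.cr) else 0


def f1_for_class(y_true: List[int], y_pred: List[int], cls: int) -> float:
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def average_f1(y_true: List[int], y_pred: List[int]) -> float:
    """Unweighted mean of per-class F1 over the classes present in y_true."""
    classes = sorted(set(y_true))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


examples = [
    AstridScores(cf=0.9, cr=True, ra=True, in_scope=True),
    AstridScores(cf=0.2, cr=True, ra=True, in_scope=True),
    AstridScores(cf=1.0, cr=True, ra=False, in_scope=False),
    AstridScores(cf=0.8, cr=False, ra=True, in_scope=True),
]
clinician_harmful = [0, 1, 1, 0]  # made-up clinician labels
predicted = [predict_harmful(e) for e in examples]
print(predicted, average_f1(clinician_harmful, predicted))
```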

Large LLMs can automate ASTRID metrics with reasonable agreement with human labels; smaller models struggle on CF.

Numbers: gpt-o3-mini: CR acc 0.87, RA acc 0.95; claude-3.5-sonnet CF F1 0.74; llama-3-8B CF F1 0.05

Results

CF vs RF (human alignment)

Value: CF AUC 0.98; RF AUC 0.83

Baseline: RF (statement-level)

Correlation CF vs PF

Value: Pearson 0.90; Spearman 0.90; Kendall Tau 0.84

Baseline: RF Pearson 0.57

Clinical harm prediction (triad + scope)

Value: Avg F1 harmfulness ≈ 0.835

Clinical helpfulness prediction (triad + scope)

Value: Avg F1 helpfulness ≈ 0.715

LLM automation example: CR & RA

Value: gpt-o3-mini CR acc 0.87; RA acc 0.95

LLM automation example: CF F1

Value: claude-3.5-sonnet CF F1 0.74; gemini-2-flash CF F1 0.77

Baseline: smaller models like llama-3-8B CF F1 0.05

Who Should Care

What To Try In 7 Days

Run ASTRID CF/CR/RA prompts on 200 recent RAG outputs to surface ungrounded answers.

Compare CF scores to a small clinician sample (50 examples) to calibrate thresholds.

Automate RA to block or flag out-of-scope replies before deployment.
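The calibration and flagging steps above could be wired together roughly as follows. This is a sketch under stated assumptions: the threshold is chosen by simple accuracy against clinician labels, the record fields (`cf`, `out_of_scope`, `refused`) are hypothetical names, and the out-of-scope label is assumed to come from a separate scope check. The numbers are made up.

```python
from typing import Dict, List, Sequence, Tuple


def best_cf_threshold(
    cf_scores: Sequence[float], clinician_faithful: Sequence[bool]
) -> Tuple[float, float]:
    """Return (threshold, accuracy) where 'cf >= threshold' best matches clinician labels.

    Candidate thresholds are the observed CF scores; the lowest best one wins ties.
    """
    if not cf_scores or len(cf_scores) != len(clinician_faithful):
        raise ValueError("inputs must be non-empty and of equal length")
    best_t, best_acc = 0.0, -1.0
    for t in sorted(set(cf_scores)):
        correct = sum((cf >= t) == label for cf, label in zip(cf_scores, clinician_faithful))
        acc = correct / len(cf_scores)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc


def flag_for_review(records: List[Dict], cf_threshold: float) -> List[int]:
    """Indices of outputs that look ungrounded or that answered an out-of-scope question."""
    flagged = []
    for i, rec in enumerate(records):
        ungrounded = rec["cf"] < cf_threshold
        missed_refusal = rec["out_of_scope"] and not rec["refused"]
        if ungrounded or missed_refusal:
            flagged.append(i)
    return flagged


# Made-up calibration sample: CF scores with clinician faithfulness labels.
sample_cf = [0.95, 0.9, 0.6, 0.55, 0.3, 0.1]
sample_labels = [True, True, True, False, False, False]
threshold, accuracy = best_cf_threshold(sample_cf, sample_labels)
print(threshold, accuracy)  # 0.6 1.0

# Made-up production outputs.
outputs = [
    {"cf": 0.9, "out_of_scope": False, "refused": False},
    {"cf": 0.4, "out_of_scope": False, "refused": False},
    {"cf": 1.0, "out_of_scope": True, "refused": False},
    {"cf": 1.0, "out_of_scope": True, "refused": True},
]
print(flag_for_review(outputs, threshold))  # [1, 2]
```

Accuracy is used here only because it is easy to read; with imbalanced labels a criterion such as F1 on the unfaithful class may be a better fit.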

Reproducibility

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Single-turn focus: does not evaluate multi-turn dialogue continuity or cumulative risk.
  • Single clinical domain: validated mainly on cataract post-op data; generalization to other specialties is untested.
  • Dataset size: datasets are modest (≈238 triplets total) and were balanced by augmentation.
  • Usability gaps: no metrics for empathy, latency, brevity, or transcription robustness.

When Not To Use

  • For end-to-end multi-turn safety assessment without extension to dialogue continuity.
  • For non-clinical or broad general-domain QA where different retrieval dynamics apply.
  • When you need user-experience metrics like empathy or satisfaction.

Failure Modes

  • LLM-as-judge bias: smaller models poorly detect ungrounded content, producing unreliable CF scores.
  • CF can miss conversational nuance if sentence categorization mislabels informational sentences.
  • CR binary label may hide partial or incomplete retrieved evidence needed for full answers.

Core Entities

Models

  • PaLM-2 (text-bison@002)
  • Mistral-7B
  • LLaMA-8B
  • GPT-4o
  • gpt-o3-mini
  • claude-3.5-sonnet
  • gemini-2-flash
  • mistral-large-2402
  • llama-3-8B
  • llama-3.3-70B

Metrics

  • Conversational Faithfulness (CF)
  • Refusal Accuracy (RA)
  • Context Relevance (CR)
  • Perceived Faithfulness (PF)
  • RF (statement-level faithfulness baseline)

Datasets

  • FaithfulnessQAC
  • UniqueQAC
  • ClinicalQAC
  • HealthSearchQA
  • Cataract post-op question set