ASTRID: three automated, scalable metrics (CF, RA, CR) to evaluate RAG clinical QA

Overview

Decision Snapshot: Ready For Pilot

ASTRID is methodologically clear and validated on real and clinician-augmented data; evidence is from small but realistic datasets and multiple LLMs, so it's promising for development pipelines but needs broader multi-turn and multi-specialty validation.

Citations: 1

Evidence Strength: 0.85

Confidence: 0.87

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 3/6

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 70%

Novelty: 40%

Authors

Mohita Chowdhury, Yajie Vera He, Jared Joselowitz, Aisling Higham, Ernest Lim

Why It Matters For Business

ASTRID gives an automated, clinically validated way to detect ungrounded, out-of-scope, or irrelevant answers; this reduces expensive clinician review and speeds safe iterative development of RAG-based clinical agents.

Who Should Care

Product Manager, ML Engineer, CTO, Engineering Lead, Data Scientist

Summary TLDR

ASTRID is a simple, automatable triad of metrics for RAG-based clinical question answering: Conversational Faithfulness (CF) measures how much information in an answer is grounded in retrieved context; Refusal Accuracy (RA) checks whether the system correctly declines out-of-scope questions; Context Relevance (CR) checks whether the retrieved context matches the question. On a cataract post-op dataset and clinician-augmented examples, CF aligns much better with human perceived faithfulness (AUC 0.98 vs 0.83) and the triad plus a scope label predicts clinician-rated harmfulness (avg F1 ≈ 0.835) and helpfulness (avg F1 ≈ 0.715). Several large LLMs can compute these metrics automatically with C

Problem Statement

Existing RAG evaluation metrics break in conversational clinical settings: they either fragment responses into statements (losing nuance), mis-handle empathetic or clarifying dialogue, or fail to detect when a system should rightly refuse to answer. Human clinical review is accurate but too costly and slow for iterative development. Developers need automated, validated metrics that map to clinical risk and can be run continuously.

Main Contribution

A safety-oriented hazard analysis for clinical RAG QA systems guided by SACE principles.

ASTRID: three reference-free, LLM-based metrics—Conversational Faithfulness (CF), Refusal Accuracy (RA), Context Relevance (CR)—designed for conversational clinical QA.
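The summary does not give formulas for the metrics, so the following is only a minimal sketch of one plausible reading: CF as the share of informational sentences that are grounded in the retrieved context, and RA as the share of questions where the system refused exactly when the question was out of scope. The `JudgedSentence` type, the per-sentence labels (which would come from an LLM judge), and the toy sentences are all hypothetical and invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class JudgedSentence:
    text: str
    informational: bool  # hypothetical judge label: sentence carries clinical information
    grounded: bool       # hypothetical judge label: sentence is supported by retrieved context


def conversational_faithfulness(sentences):
    """Share of informational sentences that are grounded (assumed formula)."""
    info = [s for s in sentences if s.informational]
    if not info:
        # Assumption: an answer with no informational content has nothing to ground.
        return 1.0
    return sum(s.grounded for s in info) / len(info)


def refusal_accuracy(examples):
    """Share of (in_scope, refused) pairs where refusal matches scope (assumed formula)."""
    if not examples:
        raise ValueError("no examples")
    correct = sum((not in_scope) == refused for in_scope, refused in examples)
    return correct / len(examples)


# Invented toy answer: one empathetic sentence, one grounded claim, one ungrounded claim.
answer = [
    JudgedSentence("Sorry to hear your eye feels sore.", False, False),
    JudgedSentence("Mild grittiness is common after surgery.", True, True),
    JudgedSentence("You should stop your drops today.", True, False),
]
cf_score = conversational_faithfulness(answer)

# Invented toy scope/refusal pairs: the third one answers an out-of-scope question.
ra_score = refusal_accuracy([(True, False), (False, True), (False, False), (True, False)])
```

Excluding the empathetic sentence from the denominator is the point of the sketch: it is not penalized for lacking support in the context, which is the kind of conversational content that statement-level faithfulness is said to mishandle.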

Key Findings

Conversational Faithfulness (CF) matches human perceived faithfulness much better than statement-level faithfulness (RF).

Numbers: AUC CF=0.98 vs RF=0.83; Pearson CF vs PF=0.90, RF vs PF=0.57

Practical Use: Use CF instead of statement-level faithfulness when evaluating conversational clinical answers to get automated scores that closely track human judgment.

Evidence Ref: Figure 5; Table 1 (ROC and correlations)

ASTRID's triad (CF, CR, RA) plus a scope label predicts clinician-rated harmfulness and helpfulness accurately.

Numbers: Average F1 harmfulness ≈ 0.835; helpfulness ≈ 0.715

Practical Use: Combine CF, CR, RA and a simple in/out-of-scope flag to triage or flag risky responses for clinician review during development.

Evidence Ref: Table 2 (classifier F1-scores)
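As a rough illustration of the triage idea, the sketch below combines the three scores and the scope flag with fixed rules. It is not the classifier behind the reported F1-scores: the `AstridScores` type, the `triage` function, and the cutoffs `cf_min` and `cr_min` are hypothetical and would need calibration against clinician labels.

```python
from dataclasses import dataclass


@dataclass
class AstridScores:
    cf: float       # Conversational Faithfulness, assumed to lie in [0, 1]
    cr: float       # Context Relevance, assumed to lie in [0, 1]
    refused: bool   # whether the system declined to answer
    in_scope: bool  # scope label for the question


def triage(scores, cf_min=0.8, cr_min=0.5):
    """Return the reasons a response should go to clinician review (empty list = pass)."""
    reasons = []
    if not scores.in_scope and not scores.refused:
        reasons.append("answered out-of-scope question")
    if scores.in_scope and scores.refused:
        reasons.append("refused in-scope question")
    if not scores.refused:
        # Faithfulness and relevance only matter when an answer was actually given.
        if scores.cf < cf_min:
            reasons.append("low faithfulness")
        if scores.cr < cr_min:
            reasons.append("low context relevance")
    return reasons


clean = triage(AstridScores(cf=0.9, cr=0.9, refused=False, in_scope=True))
risky = triage(AstridScores(cf=0.4, cr=0.9, refused=False, in_scope=False))
```

Returning reasons rather than a single boolean keeps the review queue explainable; a learned classifier over the same four inputs could replace the rules once enough labeled examples exist.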

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
CF vs RF (human alignment)	CF AUC 0.98; RF AUC 0.83	RF (statement-level)	AUC +0.15	FaithfulnessQAC	Figure 5	Figure 5
Correlation CF vs PF	Pearson 0.90; Spearman 0.90; Kendall Tau 0.84	RF Pearson 0.57	Pearson +0.33	FaithfulnessQAC	Table 1; correlation table	Table 1
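For readers who want to reproduce this kind of comparison on their own data, here is a minimal stdlib sketch of the two statistics in the table: ROC AUC of an automated score against binary human labels, and Pearson correlation between two score lists. The function names and the toy inputs are invented for illustration; only the delta arithmetic at the end uses the reported numbers.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def roc_auc(scores, labels):
    """Probability that a random positive outranks a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Invented toy data: automated scores and binary human faithfulness labels.
toy_scores = [0.9, 0.6, 0.4, 0.7, 0.2]
toy_labels = [1, 1, 0, 0, 0]
toy_auc = roc_auc(toy_scores, toy_labels)
toy_r = pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])

# Deltas recomputed from the values reported in the table above.
auc_delta = round(0.98 - 0.83, 2)
pearson_delta = round(0.90 - 0.57, 2)
```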

What To Try In 7 Days

Run ASTRID CF/CR/RA prompts on 200 recent RAG outputs to surface ungrounded answers.

Compare CF scores to a small clinician sample (50 examples) to calibrate thresholds.

Automate RA to block or flag out-of-scope replies before deployment.
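The calibration step in the list above can be sketched as follows. This is one possible approach, not a procedure taken from the paper: it picks the CF cutoff that maximizes Youden's J (true positive rate minus false positive rate) on a small clinician-labeled sample. The function name and the toy scores and labels are hypothetical.

```python
def best_threshold(cf_scores, clinician_faithful):
    """Return (cutoff, J) maximizing Youden's J = TPR - FPR for the rule `cf >= cutoff`."""
    pos = sum(clinician_faithful)
    neg = len(clinician_faithful) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need both faithful and unfaithful examples")
    best_t, best_j = None, -1.0
    # Every distinct observed score is a candidate cutoff.
    for t in sorted(set(cf_scores)):
        tp = sum(1 for s, y in zip(cf_scores, clinician_faithful) if y and s >= t)
        fp = sum(1 for s, y in zip(cf_scores, clinician_faithful) if not y and s >= t)
        j = tp / pos - fp / neg
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j


# Invented toy sample: CF scores paired with clinician judgments (True = faithful).
scores = [0.95, 0.9, 0.6, 0.85, 0.3, 0.5]
labels = [True, True, False, True, False, False]
cutoff, youden_j = best_threshold(scores, labels)
```

With only about 50 labeled examples the chosen cutoff will be noisy, so it is worth rechecking it as more clinician labels accumulate.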

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Partial

License: Unknown

Risks & Boundaries

Limitations

Single-turn focus: does not evaluate multi-turn dialogue continuity or cumulative risk.

Single clinical domain: validated mainly on cataract post-op data; generalization to other specialties is untested.

When Not To Use

For end-to-end multi-turn safety assessment without extension to dialogue continuity.

For non-clinical or broad general-domain QA where different retrieval dynamics apply.

Failure Modes

LLM-as-judge bias: smaller models poorly detect ungrounded content, producing unreliable CF scores.

CF can miss conversational nuance if sentence categorization mislabels informational sentences.

Core Entities

Models

PaLM-2 (text-bison@002), Mistral-7B, LLaMA-8B, GPT-4o, gpt-o3-mini, claude-3.5-sonnet, gemini-2-flash, mistral-large-2402, llama-3-8B, llama-3.3-70B

Metrics

Conversational Faithfulness (CF), Accuracy, Context Relevance (CR), Perceived Faithfulness (PF), RF (statement-level faithfulness baseline)

Datasets

FaithfulnessQAC, UniqueQAC, ClinicalQAC, HealthSearchQA, Cataract post-op question set
