Tree-structured TCM knowledge + self-reflective retrieval boosts GPT-4 exam accuracy by ~20 percentage points

Overview

Decision Snapshot: Needs Validation

The approach is practical and low-cost (no fine-tuning). Evidence is moderate: gains are strong on curated exam sets and in expert ratings, but there is no live clinical evaluation and no public dataset release to confirm broader generalization.

Citations: 1

Evidence Strength: 0.60

Confidence: 0.75

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: No open assets linked

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 50%

Authors

Chang Liu, Ying Chang, Jianmin Li, Yiqian Qu, Yu Li, Lingyong Cao, Shuyuan Lin

Links

Abstract / PDF

Why It Matters For Business

Structured, hybrid retrieval plus an iterate-and-verify loop can raise domain QA accuracy and expert trust without expensive model retraining, lowering deployment risk for regulated or knowledge-heavy applications.

Who Should Care

Product Manager, ML Engineer, CTO, Data Scientist

Summary TLDR

This paper introduces TOSRR: a tree-organized knowledge format (SPO-T) plus a self-reflective Retrieval-Augmented Generation loop for Traditional Chinese Medicine (TCM) Q&A. Without any model fine-tuning, TOSRR + GPT-4 raises TCM licensing exam accuracy from 55.83% to 75.67% (≈+19.8 pp), increases recall accuracy on a classics exam from 0.27 to 0.38, and improves expert-rated safety/consistency/explainability by 18.52 points. The work shows that structured knowledge plus iterative retrieval and critique can improve domain QA, but the system is evaluated only on curated exam sets and expert ratings, not live clinical deployment.

Problem Statement

LLMs alone can be inaccurate on TCM tasks because their pretraining data lacks reliable, hierarchical TCM knowledge. Existing RAG setups either inject long text passages or rely on sparse triples, and both degrade as the context grows. There is also no TCM-specific RAG framework or benchmark for measuring practical gains.

Main Contribution

SPO-T: a hybrid Subject-Predicate-Object-Text format that stores triples linked to original text chunks arranged in a tree hierarchy.

A self-reflective RAG loop that iteratively retrieves, checks whether answers are supported by retrieved SPO-Ts, reformulates questions, and re-retrieves as needed.
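The SPO-T idea described above can be pictured as a small data structure. The following is a minimal sketch, not the authors' implementation: the class names `SPOT` and `TreeNode`, the `flatten` helper, and the placeholder titles and passage are all hypothetical, and the paper's actual schema may differ.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class SPOT:
    """Hypothetical SPO-T record: a triple plus the text chunk that backs it."""
    subject: str
    predicate: str
    obj: str
    text: str  # original source passage the triple was extracted from


@dataclass(eq=False)
class TreeNode:
    """A node in the knowledge tree, e.g. book > chapter > section."""
    title: str
    records: List[SPOT] = field(default_factory=list)
    children: List["TreeNode"] = field(default_factory=list)
    parent: Optional["TreeNode"] = field(default=None, repr=False)

    def add_child(self, child: "TreeNode") -> "TreeNode":
        child.parent = self
        self.children.append(child)
        return child

    def path(self) -> str:
        """Root-to-node title path, usable as hierarchical context."""
        parts = []
        node: Optional["TreeNode"] = self
        while node is not None:
            parts.append(node.title)
            node = node.parent
        return " > ".join(reversed(parts))

    def walk(self) -> Iterator["TreeNode"]:
        """Depth-first traversal of this node and all descendants."""
        yield self
        for child in self.children:
            yield from child.walk()


def flatten(root: TreeNode) -> List[str]:
    """Render every record as 'path | S - P - O | text' for indexing."""
    return [
        f"{node.path()} | {r.subject} - {r.predicate} - {r.obj} | {r.text}"
        for node in root.walk()
        for r in node.records
    ]


# Placeholder content only; not taken from any TCM source.
root = TreeNode("Example Classic")
chapter = root.add_child(TreeNode("Chapter 1"))
chapter.records.append(
    SPOT("Formula A", "treats", "Pattern B",
         "Placeholder passage stating that Formula A treats Pattern B.")
)
print(flatten(root)[0])
```

Keeping the source passage next to each triple is what makes the format hybrid: the triple gives a compact, matchable fact, while the text and tree path supply the context a bare triple would lose.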

Key Findings

TOSRR with GPT-4 raised accuracy on the TCM Medical Licensing Examination dataset.

Numbers: GPT-4 55.83% -> TOSRR 75.67% (+19.84 pp)

Practical Use: Use a tree-organized knowledge base plus self-reflective retrieval to substantially raise closed-book exam accuracy without fine-tuning.

Evidence Ref: Table 1

Recall accuracy on the Classics Course Exam improved under SPO-T RAG.

Numbers: RAG recall 0.27 -> SPO-T RAG 0.38 (+0.11)

Practical Use: Hybrid keyword + dense retrieval over SPO-T retrieves more task-relevant evidence, helping domain answers.

Evidence Ref: Table 2

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	TOSRR 75.67%	GPT-4 55.83%	+19.84 pp	TCM MLE (600-question test)	Table 1 accuracy comparison	Table 1
Accuracy	SPO-T RAG 70.17%	RAG 49.83%	+20.34 pp (SPO-T RAG vs RAG)	TCM MLE (600-question test)	Table 1 shows SPO-T RAG improves over plain RAG	Table 1
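
The metrics for this work include 95% bootstrap confidence intervals. As a rough illustration of how such an interval could be computed for the accuracies in the table, here is a percentile-bootstrap sketch. It rests on assumptions: per-question outcomes are not published, so the correct-answer counts (335 and 454 out of 600) are back-derived from the reported percentages, and the resample count, seed, and function name are arbitrary choices, not the paper's procedure.

```python
import random
from typing import List, Tuple


def bootstrap_accuracy_ci(
    outcomes: List[int],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap CI for mean accuracy over 0/1 outcomes."""
    rng = random.Random(seed)
    n = len(outcomes)
    stats = []
    for _ in range(n_resamples):
        # Resample the questions with replacement and recompute accuracy.
        sample = rng.choices(outcomes, k=n)
        stats.append(sum(sample) / n)
    stats.sort()
    lo_idx = round(alpha / 2 * n_resamples)
    hi_idx = n_resamples - lo_idx - 1
    return stats[lo_idx], stats[hi_idx]


N = 600  # size of the TCM MLE test split
# Assumed counts, derived from the reported percentages:
baseline = [1] * 335 + [0] * (N - 335)  # 335/600 = 55.83% (GPT-4)
tosrr = [1] * 454 + [0] * (N - 454)     # 454/600 = 75.67% (TOSRR)

base_lo, base_hi = bootstrap_accuracy_ci(baseline)
tosrr_lo, tosrr_hi = bootstrap_accuracy_ci(tosrr)
print(f"GPT-4 accuracy 95% CI: [{base_lo:.3f}, {base_hi:.3f}]")
print(f"TOSRR accuracy 95% CI: [{tosrr_lo:.3f}, {tosrr_hi:.3f}]")
print(f"Delta: {(454 - 335) / N * 100:.2f} pp")
```

This treats the two systems as independent samples. With per-question results in hand, a paired bootstrap over the difference would be the more informative check.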

What To Try In 7 Days

Build a small SPO-T style tree for a targeted domain chapter and index text chunks with embeddings.

Combine keyword matching with dense retrieval (hybrid recall) and keep top-15 items for prompting.

Add a simple self-check: ask the LLM whether each claim is supported by retrieved items; if not, re-query or reformulate.
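
The three steps above can be prototyped in a few dozen lines. The sketch below is only a starting point under stated assumptions: `dense_score` is a token-count cosine standing in for real embedding similarity, the equal keyword/dense weighting (`alpha`) is arbitrary, and `generate`, `is_supported`, and `reformulate` are hypothetical callables you would back with LLM prompts. Only the top-15 cutoff comes from the summary.

```python
import math
from collections import Counter
from typing import Callable, List, Tuple

TOP_K = 15  # the summary suggests keeping the top 15 items for prompting


def tokenize(text: str) -> List[str]:
    return text.lower().split()


def keyword_score(query: str, doc: str) -> float:
    """Fraction of query tokens that also appear in the document."""
    q = set(tokenize(query))
    return len(q & set(tokenize(doc))) / len(q) if q else 0.0


def dense_score(query: str, doc: str) -> float:
    """Stand-in for embedding similarity: cosine over token counts."""
    a, b = Counter(tokenize(query)), Counter(tokenize(doc))
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def hybrid_retrieve(
    query: str, docs: List[str], k: int = TOP_K, alpha: float = 0.5
) -> List[str]:
    """Blend keyword and dense scores, keep the k best non-zero hits."""
    scored = [
        (alpha * keyword_score(query, d) + (1 - alpha) * dense_score(query, d), d)
        for d in docs
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [d for score, d in scored[:k] if score > 0]


def answer_with_reflection(
    question: str,
    docs: List[str],
    generate: Callable[[str, List[str]], str],
    is_supported: Callable[[str, List[str]], bool],
    reformulate: Callable[[str, str, List[str]], str],
    max_rounds: int = 3,
) -> Tuple[str, int]:
    """Retrieve, answer, self-check; re-query if the answer is unsupported."""
    query, answer = question, ""
    for round_no in range(1, max_rounds + 1):
        evidence = hybrid_retrieve(query, docs)
        answer = generate(question, evidence)
        if is_supported(answer, evidence):
            return answer, round_no
        query = reformulate(question, answer, evidence)
    return answer, max_rounds  # best effort after exhausting the budget


# Toy stand-ins for the LLM calls, with placeholder documents.
docs = [
    "formula a treats pattern b",
    "herb c is warm in nature",
    "pattern b presents with symptom d",
]
generate = lambda q, ev: ev[0] if ev else "unknown"
is_supported = lambda ans, ev: ans in ev
reformulate = lambda q, ans, ev: q + " formula"

answer, rounds = answer_with_reflection(
    "what treats pattern b", docs, generate, is_supported, reformulate
)
print(answer, rounds)
```

Capping the number of rounds matters in practice: each extra round costs another retrieval and at least two more LLM calls, so the budget bounds both latency and spend.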

Agent Features

Memory

retrieval memory (vector store of SPO-T and text chunks)

Tool Use

text embeddings (text-embedding-ada-002)

vector DB (Yandex HNSWLib)

keyword search (IK Analysis / Elasticsearch)

document segmentation (ERNIE-Layout)

Frameworks

TOSRR

SPO-T

SELF-RAG

Architectures

RAG

tree-structured knowledge base (SPO-T)

self-reflective retrieval loop

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Partial

License: Unknown

Risks & Boundaries

Limitations

No fine-tuning or model parameter changes; gains depend on curated knowledge base quality.

Evaluations are limited to exam-style questions and expert panels; real-world clinical safety not demonstrated.

When Not To Use

Direct clinical decision-making without human oversight or further validation.

Domains with little structured textual source material to build SPO-T nodes.

Failure Modes

Performance drop if retrieval returns many irrelevant items (naive RAG harmed GPT-4 here).

Errors persist when SPO-T extraction or expert review misses critical context.

Core Entities

Models

GPT-4

text-embedding-ada-002 (embedding model)

Metrics

Accuracy

expert average score (five-dimension, converted to 100)

95% bootstrap confidence intervals

Datasets

TCM Medical Licensing Examination (MLE) dataset (8,400 Qs; 600 sampled for test)

Classics Course Exam (CCE) dataset (1,892 Qs)

Benchmarks

TCM MLE (constructed for this work)

CCE (constructed for this work)

Context Entities

Models

Qibo (referenced TCM-tuned LLaMA-13B)

Hengqin-RA-v1 (referenced)

Metrics

pass thresholds for TCM MLE (historical human scores)

Datasets

TCMBench (referenced)

You May Also Want to Read

MTRAG: a human-made benchmark of multi-turn RAG conversations that stresses retrieval, unanswerables, and later-turn context.

Atomic fact-checking for medical RAG LLMs boosts factuality and traceability

Build query-specific evidence graphs on the fly to fix missing links and filter distractor facts

RAGLAB — an open, modular toolkit to reproduce, compare and develop RAG algorithms fairly

InsQABench: a Chinese insurance QA benchmark plus SQL-ReAct and RAG-ReAct methods
