Tree-structured TCM knowledge + self-reflective retrieval boosts GPT-4 exam accuracy by ~20 percentage points

February 13, 2025 | 8 min

Overview

Decision Snapshot: Needs Validation

The approach is practical and low-cost (no fine-tuning). Evidence is moderate: gains are strong on curated exam sets and expert ratings, but there is no live clinical evaluation and no public dataset release to confirm broader generalization.

Citations: 1

Evidence Strength: 0.60

Confidence: 0.75

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: No open assets linked

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 50%

Authors

Chang Liu, Ying Chang, Jianmin Li, Yiqian Qu, Yu Li, Lingyong Cao, Shuyuan Lin

Links

Abstract / PDF

Why It Matters For Business

Structured, hybrid retrieval plus an iterate-and-verify loop can raise domain QA accuracy and expert trust without expensive model retraining, lowering deployment risk for regulated or knowledge-heavy applications.

Who Should Care

Summary TLDR

This paper introduces TOSRR: a tree-organized knowledge format (SPO-T) plus a self-reflective Retrieval-Augmented Generation loop for Traditional Chinese Medicine (TCM) Q&A. Without any model fine-tuning, TOSRR + GPT-4 raises TCM licensing exam accuracy from 55.83% to 75.67% (≈+19.8 pp), increases recall accuracy on a classics exam from 0.27 to 0.38, and improves expert-rated safety/consistency/explainability by 18.52 points. The work shows that structured knowledge plus iterative retrieval and critique can improve domain QA, but the system is evaluated only on curated exam sets and expert ratings, not live clinical deployment.

Problem Statement

LLMs alone can be inaccurate on TCM tasks because their pretraining data lacks reliable, hierarchical TCM knowledge. Existing RAG setups either dump long text passages into the prompt or rely on sparse triples, and both degrade as context grows. There is also no TCM-specific RAG framework or benchmark to measure practical gains.

Main Contribution

SPO-T: a hybrid Subject-Predicate-Object-Text format that stores triples linked to original text chunks arranged in a tree hierarchy.

A self-reflective RAG loop that iteratively retrieves, checks whether answers are supported by retrieved SPO-Ts, reformulates questions, and re-retrieves as needed.
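The SPO-T idea described above can be pictured as a small tree of nodes, each holding triples that stay linked to their source text. The following is a minimal sketch under stated assumptions, not the authors' implementation: the class names `SpoT` and `TreeNode`, the field names, and the placeholder content ("Herb A", "Symptom B") are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class SpoT:
    """One SPO-T record: a triple plus the text chunk it was extracted from."""
    subject: str
    predicate: str
    obj: str
    text: str  # original text chunk backing the triple


@dataclass
class TreeNode:
    """A node in the knowledge tree (for example book -> chapter -> section)."""
    title: str
    records: List[SpoT] = field(default_factory=list)
    children: List["TreeNode"] = field(default_factory=list)

    def add_child(self, title: str) -> "TreeNode":
        child = TreeNode(title)
        self.children.append(child)
        return child

    def walk(self, path: Tuple[str, ...] = ()) -> Iterator[Tuple[Tuple[str, ...], SpoT]]:
        """Yield (path, record) pairs depth-first so each record keeps its place in the hierarchy."""
        here = path + (self.title,)
        for rec in self.records:
            yield here, rec
        for child in self.children:
            yield from child.walk(here)


# Placeholder content only; not taken from any TCM source.
root = TreeNode("Example textbook")
chapter = root.add_child("Chapter 1")
chapter.records.append(
    SpoT("Herb A", "treats", "Symptom B", "Herb A is used for Symptom B.")
)
flat = list(root.walk())
```

Flattening the tree with `walk` keeps the hierarchy path next to every triple, which is one plausible way to index both the triple and its source text for retrieval.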

Key Findings

TOSRR with GPT-4 raised accuracy on the TCM Medical Licensing Examination dataset.

Numbers: GPT-4 55.83% -> TOSRR 75.67% (+19.84 pp)

Practical Use: Use a tree-organized knowledge base plus self-reflective retrieval to substantially raise closed-book exam accuracy without finetuning.

Evidence Ref: Table 1

Recall accuracy on the Classics Course Exam improved under SPO-T RAG.

Numbers: RAG recall 0.27 -> SPO-T RAG 0.38 (+0.11)

Practical Use: Hybrid keyword + dense retrieval over SPO-T retrieves more task-relevant evidence, helping domain answers.

Evidence Ref: Table 2

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | TOSRR 75.67% | GPT-4 55.83% | +19.84 pp | TCM MLE (600-question test) | Table 1 accuracy comparison | Table 1
Accuracy | SPO-T RAG 70.17% | RAG 49.83% | +20.34 pp (SPO-T RAG vs RAG) | TCM MLE (600-question test) | Table 1 shows SPO-T RAG improves over plain RAG | Table 1
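The deltas above are percentage-point differences, which are easy to confuse with relative gains. As a small worked check using only the accuracies reported here, the sketch below computes both; the helper names `pp_delta` and `relative_gain` are hypothetical.

```python
def pp_delta(baseline_pct: float, system_pct: float) -> float:
    """Absolute difference in percentage points."""
    return round(system_pct - baseline_pct, 2)


def relative_gain(baseline_pct: float, system_pct: float) -> float:
    """Improvement relative to the baseline, in percent."""
    return round(100 * (system_pct - baseline_pct) / baseline_pct, 1)


tosrr_vs_gpt4 = pp_delta(55.83, 75.67)       # 19.84 pp
tosrr_relative = relative_gain(55.83, 75.67)  # 35.5 % relative
spot_vs_rag = pp_delta(49.83, 70.17)          # 20.34 pp
rag_vs_gpt4 = pp_delta(55.83, 49.83)          # -6.0 pp: plain RAG scored below GPT-4 alone
```

The last line makes the baseline ordering explicit: plain RAG (49.83%) sits six points below GPT-4 alone (55.83%), so the SPO-T RAG gain is measured against the weaker of the two baselines.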

What To Try In 7 Days

Build a small SPO-T style tree for a targeted domain chapter and index text chunks with embeddings.

Combine keyword matching with dense retrieval (hybrid recall) and keep top-15 items for prompting.

Add a simple self-check: ask the LLM whether each claim is supported by retrieved items; if not, re-query or reformulate.
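The three steps above can be sketched end to end in a few dozen lines. This is an illustration under loud assumptions, not the paper's pipeline: the bag-of-words cosine stands in for a real embedding model, the weighted sum with `alpha` is one possible way to merge keyword and dense scores (the source does not specify the merge rule), and `generate`, `is_supported`, and `reformulate` are hypothetical callables that would wrap LLM prompts in practice.

```python
import math
from collections import Counter
from typing import Callable, List, Tuple


def tokens(text: str) -> List[str]:
    return text.lower().split()


def keyword_score(query: str, doc: str) -> float:
    """Fraction of query tokens that appear in the document."""
    q = set(tokens(query))
    return len(q & set(tokens(doc))) / len(q) if q else 0.0


def dense_score(query: str, doc: str) -> float:
    """Cosine similarity over token counts; a stand-in for embedding similarity."""
    a, b = Counter(tokens(query)), Counter(tokens(doc))
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def hybrid_retrieve(query: str, docs: List[str], top_k: int = 15, alpha: float = 0.5) -> List[str]:
    """Score every doc with a weighted sum (assumed merge rule) and keep the top_k non-zero hits."""
    scored = [(alpha * keyword_score(query, d) + (1 - alpha) * dense_score(query, d), d) for d in docs]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [d for s, d in scored[:top_k] if s > 0]


def answer_with_reflection(
    question: str,
    docs: List[str],
    generate: Callable[[str, List[str]], str],
    is_supported: Callable[[str, List[str]], bool],
    reformulate: Callable[[str, str, List[str]], str],
    max_rounds: int = 3,
) -> Tuple[str, bool]:
    """Retrieve, answer, check support; reformulate and retry if the check fails."""
    query, answer = question, ""
    for _ in range(max_rounds):
        evidence = hybrid_retrieve(query, docs)
        answer = generate(question, evidence)
        if is_supported(answer, evidence):
            return answer, True
        query = reformulate(question, answer, evidence)
    return answer, False  # caller can flag unsupported answers for human review


# Toy run with stub callables in place of LLM calls.
docs = ["alpha beta gamma", "delta epsilon", "beta zeta"]
hits = hybrid_retrieve("beta", docs)
result = answer_with_reflection(
    "omega",
    docs,
    generate=lambda q, ev: f"ans:{len(ev)}",
    is_supported=lambda ans, ev: len(ev) > 0,
    reformulate=lambda q, ans, ev: "beta",
)
```

Returning a support flag alongside the answer keeps the loop bounded by `max_rounds` while still letting the caller treat unsupported answers differently from supported ones.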

Agent Features

Memory: retrieval memory (vector store of SPO-T and text chunks)
Tool Use: text embeddings (text-embedding-ada-002); vector DB (Yandex HNSWLib); keyword search (IK Analysis / Elasticsearch); document segmentation (ERNIE-Layout)
Frameworks: TOSRR; SPO-T; SELF-RAG
Architectures: RAG; tree-structured knowledge base (SPO-T); self-reflective retrieval loop

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

No fine-tuning or model parameter changes; gains depend on curated knowledge base quality.

Evaluations are limited to exam-style questions and expert panels; real-world clinical safety not demonstrated.

When Not To Use

Direct clinical decision-making without human oversight or further validation.

Domains with little structured textual source material to build SPO-T nodes.

Failure Modes

Performance drops if retrieval returns many irrelevant items (on TCM MLE, plain RAG scored 49.83% versus 55.83% for GPT-4 alone).

Errors persist when SPO-T extraction or expert review misses critical context.

Core Entities

Models

GPT-4
text-embedding-ada-002 (embedding model)

Metrics

Accuracy
expert average score (five-dimension, converted to 100)
95% bootstrap confidence intervals
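For readers unfamiliar with bootstrap confidence intervals on accuracy, here is a minimal sketch. It assumes a percentile bootstrap over per-question correct/incorrect outcomes; the summary does not say which bootstrap variant the paper uses, and the outcome list below is synthetic, not data from the paper.

```python
import random
from typing import List, Tuple


def bootstrap_ci(
    outcomes: List[int],
    n_resamples: int = 2000,
    level: float = 0.95,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap CI for mean accuracy over 0/1 outcomes."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(outcomes)
    # Resample questions with replacement and record the accuracy of each resample.
    means = sorted(sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples))
    lo_idx = int((1 - level) / 2 * n_resamples)
    hi_idx = int((1 + level) / 2 * n_resamples) - 1
    return means[lo_idx], means[hi_idx]


# Synthetic example: 80 correct answers out of 100 questions.
outcomes = [1] * 80 + [0] * 20
low, high = bootstrap_ci(outcomes)
```

The interval width shrinks as the number of questions grows, which is why a 600-question test split supports tighter intervals than a small pilot set.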

Datasets

TCM Medical Licensing Examination (MLE) dataset (8,400 Qs; 600 sampled for test)
Classics Course Exam (CCE) dataset (1,892 Qs)

Benchmarks

TCM MLE (constructed for this work)
CCE (constructed for this work)

Context Entities

Models

Qibo (referenced TCM-tuned LLaMA-13B)
Hengqin-RA-v1 (referenced)

Metrics

pass thresholds for TCM MLE (historical human scores)

Datasets

TCMBench (referenced)