Overview
The approach is practical and low-cost (no fine-tuning). Evidence is moderate: strong gains on curated exam sets and expert ratings, but no live clinical evaluation and no public dataset release to confirm broader generalization.
Citations: 1
Evidence Strength: 0.60
Confidence: 0.75
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 4/4
Reproducibility
Status: No open assets linked
Open source: Partial
At A Glance
Cost impact: 60%
Production readiness: 60%
Novelty: 50%
Why It Matters For Business
Structured, hybrid retrieval plus an iterate-and-verify loop can raise domain QA accuracy and expert trust without expensive model retraining, lowering deployment risk for regulated or knowledge-heavy applications.
Summary TLDR
This paper introduces TOSRR: a tree-organized knowledge format (SPO-T) plus a self-reflective Retrieval-Augmented Generation loop for Traditional Chinese Medicine (TCM) Q&A. Without any model fine-tuning, TOSRR + GPT-4 raises TCM licensing exam accuracy from 55.83% to 75.67% (≈+19.8 pp), increases recall accuracy on a classics exam from 0.27 to 0.38, and improves expert-rated safety/consistency/explainability by 18.52 points. The work shows that structured knowledge plus iterative retrieval and critique can improve domain QA, but the system is evaluated only on curated exam sets and expert ratings, not live clinical deployment.
Problem Statement
LLMs alone can be inaccurate on TCM tasks because their pretraining data lacks reliable, hierarchical TCM knowledge. Existing RAG setups retrieve either long text passages or sparse triples, and both degrade as the context grows. There is also no TCM-specific RAG framework or benchmark for measuring practical gains.
Main Contribution
SPO-T: a hybrid Subject-Predicate-Object-Text format that stores triples linked to original text chunks arranged in a tree hierarchy.
A self-reflective RAG loop that iteratively retrieves, checks whether answers are supported by retrieved SPO-Ts, reformulates questions, and re-retrieves as needed.
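The SPO-T idea described above can be pictured as a tree whose nodes carry both triples and the text chunk they came from. The following is a minimal sketch under assumptions, not the paper's schema: the class name `SPOTNode`, its fields, the `flatten` helper, and the placeholder content ("Herb A", "Condition X") are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Optional, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


@dataclass
class SPOTNode:
    """Hypothetical SPO-T node: a heading, its source text, and triples drawn from that text."""
    title: str
    text: str = ""
    triples: List[Triple] = field(default_factory=list)
    children: List["SPOTNode"] = field(default_factory=list)
    # Parent link is excluded from repr/eq to avoid infinite recursion.
    parent: Optional["SPOTNode"] = field(default=None, repr=False, compare=False)

    def add_child(self, child: "SPOTNode") -> "SPOTNode":
        """Attach a child node and return it so calls can be chained."""
        child.parent = self
        self.children.append(child)
        return child

    def path(self) -> str:
        """Return the heading path from the root down to this node."""
        parts = []
        node: Optional["SPOTNode"] = self
        while node is not None:
            parts.append(node.title)
            node = node.parent
        return " > ".join(reversed(parts))

    def walk(self) -> Iterator["SPOTNode"]:
        """Depth-first traversal of this node and all descendants."""
        yield self
        for child in self.children:
            yield from child.walk()


def flatten(root: SPOTNode) -> List[Dict[str, object]]:
    """Emit one retrievable record per triple, keeping its tree path and source text."""
    records: List[Dict[str, object]] = []
    for node in root.walk():
        for triple in node.triples:
            records.append({"path": node.path(), "triple": triple, "text": node.text})
    return records


# Placeholder content only; not real TCM knowledge.
book = SPOTNode("Book")
chapter = book.add_child(SPOTNode(
    "Chapter 1",
    text="Herb A is used for Condition X.",
    triples=[("Herb A", "treats", "Condition X")],
))
records = flatten(book)
```

Keeping the tree path and the original text next to each triple means a retrieved record can be shown to the LLM with its hierarchical context, which is the property the summary attributes to SPO-T.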
Key Findings
TOSRR with GPT-4 raised accuracy on the TCM Medical Licensing Examination dataset from 55.83% to 75.67% (+19.84 pp).
Recall accuracy on the Classics Course Exam improved from 0.27 to 0.38 under SPO-T RAG.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | TOSRR 75.67% | GPT-4 55.83% | +19.84 pp | TCM MLE (600-question test) | Table 1 accuracy comparison | Table 1 |
| Accuracy | SPO-T RAG 70.17% | RAG 49.83% | +20.34 pp (SPO-T RAG vs RAG) | TCM MLE (600-question test) | Table 1 shows SPO-T RAG improves over plain RAG | Table 1 |
What To Try In 7 Days
Build a small SPO-T style tree for a targeted domain chapter and index text chunks with embeddings.
Combine keyword matching with dense retrieval (hybrid recall) and keep top-15 items for prompting.
Add a simple self-check: ask the LLM whether each claim is supported by retrieved items; if not, re-query or reformulate.
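The hybrid recall and self-check steps above could be prototyped roughly as follows. This is a sketch under stated assumptions, not the paper's implementation: `dense_score` is a bag-of-words cosine standing in for a real embedding model, the blend weight `alpha` and `max_rounds` are arbitrary, and `generate`, `is_supported`, and `reformulate` are hypothetical callables that would wrap LLM calls in practice.

```python
import math
import re
from collections import Counter
from typing import Callable, List, Tuple

TOP_K = 15  # the summary suggests keeping the top-15 items for prompting


def tokenize(text: str) -> List[str]:
    return re.findall(r"\w+", text.lower())


def keyword_score(query: str, doc: str) -> float:
    """Fraction of distinct query terms that appear in the document."""
    terms = set(tokenize(query))
    if not terms:
        return 0.0
    return len(terms & set(tokenize(doc))) / len(terms)


def dense_score(query: str, doc: str) -> float:
    """Stand-in for embedding similarity: cosine over raw term counts (assumption)."""
    a, b = Counter(tokenize(query)), Counter(tokenize(doc))
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def hybrid_retrieve(query: str, docs: List[str], top_k: int = TOP_K, alpha: float = 0.5) -> List[str]:
    """Blend keyword and dense scores, drop zero-score docs, return the best top_k."""
    scored = [(alpha * keyword_score(query, d) + (1 - alpha) * dense_score(query, d), d) for d in docs]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [d for score, d in scored[:top_k] if score > 0]


def answer_with_self_check(
    question: str,
    docs: List[str],
    generate: Callable[[str, List[str]], str],
    is_supported: Callable[[str, List[str]], bool],
    reformulate: Callable[[str, str, List[str]], str],
    max_rounds: int = 3,
) -> Tuple[str, bool]:
    """Retrieve, answer, check support; on failure reformulate the query and retry."""
    query, answer = question, ""
    for _ in range(max_rounds):
        context = hybrid_retrieve(query, docs)
        answer = generate(question, context)
        if is_supported(answer, context):
            return answer, True
        query = reformulate(question, answer, context)
    return answer, False  # last answer, flagged as unsupported
```

Returning the support flag alongside the answer lets a caller route unsupported answers to human review rather than presenting them as verified.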
Risks & Boundaries
Limitations
No fine-tuning or model parameter changes; gains depend on curated knowledge base quality.
Evaluations are limited to exam-style questions and expert panels; real-world clinical safety not demonstrated.
When Not To Use
Direct clinical decision-making without human oversight or further validation.
Domains with little structured textual source material to build SPO-T nodes.
Failure Modes
Performance drops if retrieval returns many irrelevant items (here, naive RAG lowered GPT-4 accuracy on the TCM MLE test from 55.83% to 49.83%).
Errors persist when SPO-T extraction or expert review misses critical context.

