Overview
Production Readiness
0.6
Novelty Score
0.5
Cost Impact Score
0.6
Citation Count
1
Why It Matters For Business
Structured, hybrid retrieval plus an iterate-and-verify loop can raise domain QA accuracy and expert trust without expensive model retraining, lowering deployment risk for regulated or knowledge-heavy applications.
Summary TLDR
This paper introduces TOSRR: a tree-organized knowledge format (SPO-T) plus a self-reflective Retrieval-Augmented Generation loop for Traditional Chinese Medicine (TCM) Q&A. Without any model fine-tuning, TOSRR + GPT-4 raises TCM licensing exam accuracy from 55.83% to 75.67% (≈+19.8 pp), increases recall accuracy on a classics exam from 0.27 to 0.38, and improves expert-rated safety/consistency/explainability by 18.52 points. The work shows that structured knowledge plus iterative retrieval and critique can improve domain QA, but the system is evaluated only on curated exam sets and expert ratings, not live clinical deployment.
Problem Statement
LLMs alone can be inaccurate on TCM tasks because their pretraining data lacks reliable, hierarchically organized TCM knowledge. Existing RAG setups feed the model either long raw text or sparse triples, and both degrade as the context grows. There is also no TCM-specific RAG framework or benchmark for measuring practical gains.
Main Contribution
SPO-T: a hybrid Subject-Predicate-Object-Text format that stores triples linked to original text chunks arranged in a tree hierarchy.
A self-reflective RAG loop that iteratively retrieves, checks whether answers are supported by retrieved SPO-Ts, reformulates questions, and re-retrieves as needed.
A curated evaluation on two TCM datasets (TCM Medical Licensing Examination and Classics Course Exam) plus multi-dimensional expert scoring for safety, consistency, explainability, compliance, and self-consistency.
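The SPO-T format described above can be pictured as a tree whose nodes each carry one triple and the text chunk it was extracted from. The following is a minimal sketch under that reading, not the authors' implementation: the class name `SPOTNode`, its fields, and the placeholder strings are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class SPOTNode:
    """Hypothetical SPO-T node: one triple plus the text chunk that backs it."""
    subject: str
    predicate: str
    obj: str
    text: str  # original source passage the triple was extracted from
    children: List["SPOTNode"] = field(default_factory=list)
    # Excluded from repr/eq so parent<->child links do not recurse forever.
    parent: Optional["SPOTNode"] = field(default=None, repr=False, compare=False)

    def add_child(self, child: "SPOTNode") -> "SPOTNode":
        """Attach a child node and return it so calls can be chained."""
        child.parent = self
        self.children.append(child)
        return child

    def path(self) -> List[str]:
        """Subjects from the root down to this node (the hierarchy context)."""
        out: List[str] = []
        node: Optional["SPOTNode"] = self
        while node is not None:
            out.append(node.subject)
            node = node.parent
        return out[::-1]

    def walk(self) -> Iterator["SPOTNode"]:
        """Depth-first traversal, useful for indexing every node's text."""
        yield self
        for child in self.children:
            yield from child.walk()


# Placeholder content only; a real tree would hold extracted domain triples.
root = SPOTNode("Chapter A", "contains", "Section A.1", "chapter-level text")
section = root.add_child(
    SPOTNode("Section A.1", "describes", "Entity X", "section-level text")
)
leaf = section.add_child(
    SPOTNode("Entity X", "has_property", "Property Y", "source sentence")
)
```

Keeping the source text on each node means a retrieved triple can always be shown to the model (or a reviewer) together with the passage that supports it, and `path()` recovers where in the hierarchy it sits.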
Key Findings
TOSRR with GPT-4 raised accuracy on the TCM Medical Licensing Examination dataset from 55.83% to 75.67%.
Recall accuracy on the Classics Course Exam improved from 0.27 to 0.38 under SPO-T RAG.
Expert-rated answer quality increased by 18.52 points.
Naive RAG can underperform the base LLM when the recalled information is noisy.
Results
Accuracy
Expert manual total score (100-point)
Who Should Care
What To Try In 7 Days
Build a small SPO-T style tree for a targeted domain chapter and index text chunks with embeddings.
Combine keyword matching with dense retrieval (hybrid recall) and keep top-15 items for prompting.
Add a simple self-check: ask the LLM whether each claim is supported by retrieved items; if not, re-query or reformulate.
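The three steps above can be sketched end to end. This is a toy illustration under stated assumptions, not the paper's pipeline: `toy_embed` is a bag-of-words stand-in for a real embedding model, the equal keyword/dense weighting (`alpha=0.5`) is arbitrary, and `generate`, `is_supported`, and `reformulate` are hypothetical callables that would wrap LLM calls in practice.

```python
import math
from collections import Counter
from typing import Callable, List, Optional, Tuple


def tokenize(text: str) -> List[str]:
    return text.lower().split()


def keyword_score(query: str, doc: str) -> float:
    """Fraction of query tokens that also appear in the document."""
    q = set(tokenize(query))
    return len(q & set(tokenize(doc))) / len(q) if q else 0.0


def toy_embed(text: str) -> Counter:
    """Stand-in for a real embedding model: bag-of-words counts."""
    return Counter(tokenize(text))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def hybrid_recall(
    query: str, docs: List[str], top_k: int = 15, alpha: float = 0.5
) -> List[str]:
    """Blend keyword and dense scores, keep the top_k items with a nonzero score."""
    q_vec = toy_embed(query)
    scored = [
        (alpha * keyword_score(query, d) + (1 - alpha) * cosine(q_vec, toy_embed(d)), d)
        for d in docs
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [d for score, d in scored[:top_k] if score > 0]


def answer_with_self_check(
    question: str,
    docs: List[str],
    generate: Callable[[str, List[str]], str],
    is_supported: Callable[[str, List[str]], bool],
    reformulate: Callable[[str, str, List[str]], str],
    max_rounds: int = 3,
) -> Tuple[Optional[str], List[str]]:
    """Retrieve, answer, check support; re-query with a reformulation if unsupported."""
    query = question
    for _ in range(max_rounds):
        context = hybrid_recall(query, docs)
        answer = generate(question, context)
        if is_supported(answer, context):
            return answer, context
        query = reformulate(question, answer, context)
    return None, []  # abstain rather than return an unsupported answer
```

Returning `None` after `max_rounds` is one possible design choice: it makes the abstain case explicit so a caller can route the question to a human instead of surfacing an unverified answer.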
Agent Features
Memory
- retrieval memory (vector store of SPO-T and text chunks)
Tool Use
- text embeddings (text-embedding-ada-002)
- vector DB (Yandex HNSWLib)
- keyword search (IK Analysis / Elasticsearch)
- document segmentation (ERNIE-Layout)
Frameworks
- TOSRR
- SPO-T
- SELF-RAG
Architectures
- RAG
- tree-structured knowledge base (SPO-T)
- self-reflective retrieval loop
Reproducibility
Open Source Status
- partial
Risks & Boundaries
Limitations
- No fine-tuning or model parameter changes; gains depend on curated knowledge base quality.
- Evaluations are limited to exam-style questions and expert panels; real-world clinical safety not demonstrated.
- Recall evaluation relied on manual expert scoring and may not generalize; datasets/code are not openly released.
When Not To Use
- Direct clinical decision-making without human oversight or further validation.
- Domains with little structured textual source material to build SPO-T nodes.
- When full auditability of all knowledge sources is required and retrieval provenance must be cryptographically verifiable.
Failure Modes
- Performance drop if retrieval returns many irrelevant items (naive RAG harmed GPT-4 here).
- Errors persist when SPO-T extraction or expert review misses critical context.
- Overconfidence: model may assert unsupported claims if the self-check step is imperfect.
Core Entities
Models
- GPT-4
- text-embedding-ada-002 (embedding model)
Metrics
- Accuracy
- expert average score (five-dimension, converted to 100)
- 95% bootstrap confidence intervals
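For readers unfamiliar with bootstrap intervals on accuracy, here is a minimal percentile-bootstrap sketch. It assumes per-question 0/1 correctness labels; the function name, resample count, and seed are illustrative choices, and the summary does not state which bootstrap variant the paper used.

```python
import random
from typing import List, Tuple


def bootstrap_accuracy_ci(
    correct: List[int],
    n_resamples: int = 2000,
    level: float = 0.95,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap CI for mean accuracy over 0/1 per-question outcomes."""
    if not correct:
        raise ValueError("need at least one outcome")
    rng = random.Random(seed)
    n = len(correct)
    # Resample questions with replacement and recompute accuracy each time.
    stats = sorted(
        sum(rng.choices(correct, k=n)) / n for _ in range(n_resamples)
    )
    lo_idx = max(0, int((1 - level) / 2 * n_resamples))
    hi_idx = min(n_resamples - 1, int((1 + level) / 2 * n_resamples) - 1)
    return stats[lo_idx], stats[hi_idx]


# Synthetic outcomes for illustration only (not the paper's data).
outcomes = [1] * 450 + [0] * 150
low, high = bootstrap_accuracy_ci(outcomes)
```

The interval narrows as the number of test questions grows, which is worth keeping in mind when comparing systems on a sampled test set.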
Datasets
- TCM Medical Licensing Examination (MLE) dataset (8,400 Qs; 600 sampled for test)
- Classics Course Exam (CCE) dataset (1,892 Qs)
Benchmarks
- TCM MLE (constructed for this work)
- CCE (constructed for this work)
Context Entities
Models
- Qibo (referenced TCM-tuned LLaMA-13B)
- Hengqin-RA-v1 (referenced)
Metrics
- pass thresholds for TCM MLE (historical human scores)
Datasets
- TCMBench (referenced)

