Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.5
Citation Count
0
Why It Matters For Business
You can raise domain accuracy of off-the-shelf LLMs without costly fine-tuning by adding a small retrieval corpus, re-ranking, and stepwise prompting, lowering risk and time-to-value for domain AI features.
Summary TLDR
ARCoT (Adaptable Retrieval-based Chain of Thought) joins a hybrid retrieval step (including a ‘step-back’ query), a re-ranking transformer, and chain-of-thought prompting to improve domain accuracy without fine-tuning. Evaluated on 128 multiple-choice medical physics questions, ARCoT raised GPT-4 from 67% to 90% and gave an average improvement of ~47% over base models and ~15% over RAG alone. The pipeline uses ~60 domain documents (≈10k vectors), OpenAI embeddings, Pinecone, and a Cohere re-ranker, and it deliberately limits context to ~8 top passages to avoid long-context degradation.
Problem Statement
General-purpose LLMs lack up-to-date, detailed domain knowledge and can hallucinate. Fine-tuning is costly and risky. The paper asks: can we get specialist-level accuracy in medical physics by combining retrieval and prompting without retraining?
Main Contribution
ARCoT framework combining hybrid retrieval, step-back prompting, re-ranking, and chain-of-thought reasoning.
Practical retrieval pipeline: ~60 open-source domain docs → ~10k vectors, OpenAI embeddings, Pinecone store, Cohere re-ranker, keep top 8 passages.
Strict multiple-choice benchmark protocol (128 RAPHEX questions) that requires five identical correct answers to count.
Empirical gains: ARCoT improves average model scores and brings GPT-4 above reported human average without fine-tuning.
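The retrieval flow described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the bag-of-words similarity stands in for OpenAI embeddings, the Pinecone lookup, and the Cohere re-ranker. All function names and the n_candidates parameter are hypothetical. "Hybrid" is read here as merging hits for the original and step-back queries, which is an interpretation of the summary.

```python
import math
import re
from collections import Counter

TOP_K = 8  # the summary reports keeping roughly 8 top passages


def bow(text):
    """Toy bag-of-words vector; a stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, passages, n):
    """Return indices of the n passages most similar to the query."""
    qv = bow(query)
    ranked = sorted(
        range(len(passages)),
        key=lambda i: cosine(qv, bow(passages[i])),
        reverse=True,
    )
    return ranked[:n]


def arcot_context(question, step_back_question, passages,
                  n_candidates=20, top_k=TOP_K):
    # 1. Retrieve with both the original and the broader step-back query,
    #    merging hits without duplicates.
    candidates = []
    for q in (question, step_back_question):
        for idx in retrieve(q, passages, n_candidates):
            if idx not in candidates:
                candidates.append(idx)
    # 2. Re-rank the merged candidates against the original question
    #    (toy re-ranker; a cross-encoder would go here in practice).
    qv = bow(question)
    candidates.sort(key=lambda i: cosine(qv, bow(passages[i])), reverse=True)
    # 3. Keep only the top_k passages to limit context length.
    return [passages[i] for i in candidates[:top_k]]


def build_prompt(question, context):
    """Assemble a chain-of-thought style prompt around the kept passages."""
    numbered = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(context))
    return (
        "Use the reference passages to answer the question.\n"
        f"{numbered}\n\n"
        f"Question: {question}\n"
        "Think step by step, then give the final answer letter."
    )
```

In a real deployment the step-back question would itself be generated by an LLM, and the exact prompt wording used in the paper is not given in this summary.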
Key Findings
GPT-4 score rose from 67% (base) to 90% with ARCoT.
ARCoT gave an average improvement of ~47% over base models and ~15% over RAG alone.
Smaller models benefit most from ARCoT.
Results
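GPT-4 overall score: 67% (base) → 90% (ARCoT)
Average improvement: ~47% over base models, ~15% over RAG alone
Scoring rigour: a question counts only if all five answers are identical and correct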
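The strict scoring rule (five identical correct answers for a question to count) could be implemented along these lines. This sketch assumes the five answers come from five repeated runs of the same question and that answers are single option letters. The function names are hypothetical.

```python
def question_is_correct(responses, answer_key, required_runs=5):
    """Return True only if every one of the required runs matches the key."""
    if len(responses) != required_runs:
        raise ValueError(f"expected {required_runs} responses, got {len(responses)}")
    key = answer_key.strip().upper()
    return all(r.strip().upper() == key for r in responses)


def percent_correct(all_responses, answer_keys):
    """Percent of questions answered correctly under the strict rule."""
    correct = sum(
        question_is_correct(runs, key)
        for runs, key in zip(all_responses, answer_keys)
    )
    return 100.0 * correct / len(answer_keys)
```

Under this rule a model that answers correctly four times out of five gets no credit for that question, so the reported scores are a lower bound on single-run accuracy.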
Who Should Care
What To Try In 7 Days
Assemble a small, high-quality domain doc set (~50–100 texts) and chunk into vectors.
Embed queries and also generate a 'step-back' simpler query to expand retrieval hits.
Use a re-ranker and keep ~6–8 top passages to avoid long-context degradation and reduce token costs.
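For the first step, a simple word-based chunker with overlap is one way to turn documents into passages for embedding. The chunk size and overlap below are assumptions, since the summary does not state the values used. For scale, ~10k vectors from ~60 documents works out to roughly 170 chunks per document on average.

```python
def chunk_words(text, chunk_size=200, overlap=40):
    """Split text into overlapping chunks of at most chunk_size words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current chunk reaches the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks
```

Each chunk would then be embedded and stored in the vector index. The overlap keeps sentences that straddle a boundary retrievable from at least one chunk.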
Optimization Features
Token Efficiency
- re-rank to limit context to ~8 passages
Reproducibility
Open Source Status
- partial
Risks & Boundaries
Limitations
- Small, non-comprehensive domain corpus (~60 open-source docs); broader data may change results.
- Benchmark excludes image/table questions; not multimodal.
- Evaluated only on commercial LLMs; open-source models not tested.
- No code or dataset release reported, limiting exact reproducibility.
When Not To Use
- When your domain data includes essential images or tables (ARCoT is text-only).
- When you can afford/need full fine-tuning for a production-grade domain model.
- When you lack a reliable source of domain documents to index.
Failure Modes
- Wrong or missing retrieved documents lead to incorrect answers (hallucination persists).
- Too many context passages cause 'lost in the middle' and degraded performance.
- Models that strictly ignore external context (or constrain answers) can underperform despite retrieval.
Core Entities
Models
- GPT-4
- GPT-3.5
- Claude 2.1
- Gemini Pro 1.0
Metrics
- Percent correct
- Average improvement vs base
- Improvement vs RAG
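An improvement figure can be read as an absolute gain in percentage points or as a gain relative to the baseline. The summary does not say which definition the ~47% and ~15% averages use. A small sketch of both, applied to the reported GPT-4 scores (function names are hypothetical):

```python
def absolute_gain(base_pct, new_pct):
    """Gain in percentage points."""
    return new_pct - base_pct


def relative_gain(base_pct, new_pct):
    """Gain as a percentage of the baseline score."""
    return 100.0 * (new_pct - base_pct) / base_pct


base, arcot = 67.0, 90.0  # GPT-4 base vs ARCoT, as reported in the summary
print(absolute_gain(base, arcot))            # 23.0 percentage points
print(round(relative_gain(base, arcot), 1))  # 34.3 percent relative to base
```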
Datasets
- RAPHEX 2023 Therapy multiple-choice exam (128-question subset)
- Collection of ~60 open-source medical-physics documents (AAPM Task Groups, MPPGs, IAEA text)
Benchmarks
- Medical physics multiple-choice exam (128 RAPHEX questions)

