ARCoT: hybrid retrieval + step-back + chain-of-thought boosts LLMs on a medical physics exam to 90%

May 17, 2024 · 7 min

Overview

Decision Snapshot: Needs Validation

The pipeline is ready to prototype with managed services and shows clear gains on one benchmark, but it depends on a curated vector corpus and proprietary model APIs, so expect integration and data-quality work before production.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 3/3

Reproducibility

Status: No open assets linked

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 60%

Authors

Jace Grandinetti, Rafe McBeth

Why It Matters For Business

You can raise domain accuracy of off-the-shelf LLMs without costly fine-tuning by adding a small retrieval corpus, re-ranking, and stepwise prompting, lowering risk and time-to-value for domain AI features.

Summary TLDR

ARCoT (Adaptable Retrieval-based Chain of Thought) combines a hybrid retrieval step (including a ‘step-back’ query), a re-ranking transformer, and chain-of-thought prompting to improve domain accuracy without fine-tuning. Evaluated on 128 multiple-choice medical physics questions, ARCoT raised GPT-4 from 67% to 90% and delivered an average improvement of ~47% over base models and ~15% over RAG alone. The pipeline uses ~60 domain documents (≈10k vectors), OpenAI embeddings, Pinecone, and a Cohere re-ranker, and it deliberately limits context to the ~8 top-ranked passages to avoid long-context degradation.

Problem Statement

General-purpose LLMs lack up-to-date, detailed domain knowledge and can hallucinate. Fine-tuning is costly and risky. The paper asks: can we get specialist-level accuracy in medical physics by combining retrieval and prompting without retraining?

Main Contribution

ARCoT framework combining hybrid retrieval, step-back prompting, re-ranking, and chain-of-thought reasoning.

Practical retrieval pipeline: ~60 open-source domain docs → ~10k vectors, OpenAI embeddings, Pinecone store, Cohere re-ranker, keep top 8 passages.
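The retrieval flow listed above can be sketched in a few functions. This is a minimal, self-contained illustration and not the authors' code: the bag-of-words `embed` function stands in for OpenAI embeddings, `retrieve` stands in for a Pinecone query, and `rerank` stands in for the Cohere re-ranker. All function names, the candidate count `k=20`, and the reading of "hybrid retrieval" as merging hits from the original and step-back queries are assumptions made for illustration; only the top-8 cutoff comes from the summary.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words vector; an assumed stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, corpus, k):
    """Assumed stand-in for a vector-store query: top-k passages by similarity."""
    q = embed(query)
    return sorted(corpus, key=lambda p: cosine(q, embed(p)), reverse=True)[:k]


def rerank(question, passages, top_n):
    """Assumed stand-in for a re-ranking model: re-score against the original question."""
    q = embed(question)
    return sorted(passages, key=lambda p: cosine(q, embed(p)), reverse=True)[:top_n]


def arcot_context(question, step_back_question, corpus, k=20, top_n=8):
    """Merge hits for both queries, drop duplicates, keep the top_n re-ranked passages."""
    candidates = retrieve(question, corpus, k) + retrieve(step_back_question, corpus, k)
    unique = list(dict.fromkeys(candidates))  # order-preserving de-duplication
    return rerank(question, unique, top_n)


def build_prompt(question, passages):
    """Assemble a chain-of-thought prompt over the numbered context passages."""
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        f"Context:\n{context}\n\nQuestion: {question}\n"
        "Think step by step, then give the final answer choice."
    )
```

In a real deployment each stand-in would be replaced by the corresponding managed service call, while the merge, de-duplicate, re-rank, and truncate structure would stay the same.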

Key Findings

GPT-4 score rose from 67% (base) to 90% with ARCoT.

Numbers: 67% → 90% (+23 percentage points)

Practical Use: Apply ARCoT to advanced LLMs to achieve human-level or better accuracy on medical-physics multiple-choice tasks without retraining.

Evidence Ref: Results / Table 1

ARCoT gave an average improvement over base models and RAG.

Numbers: avg +47% vs base; +15% vs RAG alone

Practical Use: If you already use RAG, add step-back prompts and re-ranking for another ~15% gain on similar benchmarks.

Evidence Ref: Results section (average improvements)

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
GPT-4 overall score | Base 67% → ARCoT 90% | Base GPT-4 67% | +23 pp | 128 RAPHEX questions | Table 1; Results | Table 1
Average improvement | avg +47% vs base; +15% vs RAG | Base models and RAG-only runs | +47% / +15% | All models on 128-question exam | Results section | Results
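To make the GPT-4 delta concrete, the short calculation below separates percentage points from relative gain and converts the reported scores into approximate question counts. The function names are illustrative. The counts are estimates because the reported percentages are rounded, and the summary does not state whether the +47% and +15% averages are percentage points or relative gains, so they are left out of the calculation.

```python
def percentage_point_delta(base_pct, new_pct):
    """Absolute difference between two scores, in percentage points."""
    return new_pct - base_pct


def relative_gain_pct(base_pct, new_pct):
    """Improvement expressed as a percentage of the baseline score."""
    return (new_pct - base_pct) / base_pct * 100


def approx_correct(pct, n_questions):
    """Approximate number of correct answers implied by a rounded percentage."""
    return round(pct / 100 * n_questions)


base, arcot, n = 67, 90, 128
print(percentage_point_delta(base, arcot))            # 23
print(round(relative_gain_pct(base, arcot), 1))       # 34.3
print(approx_correct(base, n), approx_correct(arcot, n))  # 86 115
```

On a 128-question exam, the 23-point gain therefore corresponds to roughly 29 additional correct answers.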

What To Try In 7 Days

Assemble a small, high-quality domain doc set (~50–100 texts) and chunk into vectors.

Embed queries and also generate a 'step-back' simpler query to expand retrieval hits.

Use a re-ranker and keep ~6–8 top passages to avoid long-context degradation and reduce token costs.
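The first two steps above can be prototyped with very little code. The sketch below shows one way to chunk documents into overlapping word windows and to phrase a step-back prompt. The chunk size, the overlap, and the prompt wording are assumptions chosen for illustration; the summary does not report the values or the prompt used in the paper.

```python
def chunk_text(text, chunk_size=200, overlap=40):
    """Split text into overlapping windows of words (sizes are assumed, not from the paper)."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def step_back_prompt(question):
    """Build a prompt asking an LLM for a broader, simpler version of the question."""
    return (
        "Rewrite the question below as a broader, simpler question about the "
        "underlying concept. Return only the rewritten question.\n\n"
        f"Question: {question}"
    )
```

Each chunk would then be embedded and stored, and the LLM's answer to `step_back_prompt` would be embedded as a second retrieval query alongside the original question.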

Optimization Features

Token Efficiency
re-rank to limit context to ~8 passages

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Small, non-comprehensive domain corpus (~60 open-source docs); broader data may change results.

Benchmark excludes image/table questions; not multimodal.

When Not To Use

When your domain data includes essential images or tables (ARCoT is text-only).

When you can afford/need full fine-tuning for a production-grade domain model.

Failure Modes

Wrong or missing retrieved documents lead to incorrect answers (hallucination persists).

Too many context passages cause 'lost in the middle' and degraded performance.
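Both failure modes suggest a simple guard at the point where context is assembled: cap the number of passages and drop weak matches, so that an empty result can be flagged instead of answered. The sketch below assumes passages arrive as `(text, score)` pairs from a re-ranker. The `min_score` threshold is a hypothetical value that would need tuning per re-ranker, and the guard itself is an illustration rather than part of the paper's pipeline.

```python
def select_context(scored_passages, max_passages=8, min_score=0.3):
    """Keep at most max_passages passages whose relevance score meets min_score."""
    kept = [(text, score) for text, score in scored_passages if score >= min_score]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [text for text, _ in kept[:max_passages]]


def needs_review(scored_passages, **kwargs):
    """True when no passage clears the threshold, i.e. retrieval likely missed."""
    return not select_context(scored_passages, **kwargs)
```

A caller might route questions where `needs_review` returns `True` to a human or to a "not enough evidence" response rather than letting the model answer from an irrelevant context.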

Core Entities

Models

GPT-4, GPT-3.5, Claude 2.1, Gemini Pro 1.0

Metrics

Percent correct, Average improvement vs base, Improvement vs RAG

Datasets

RAPHEX 2023 Therapy multiple-choice (128 Q subset)

Collection of ~60 open-source medical-physics documents (AAPM Task Groups, MPPGs, IAEA text)

Benchmarks

Medical physics multiple-choice exam (128 RAPHEX questions)