ARCoT: hybrid retrieval + step-back + chain-of-thought boosts LLMs on a medical physics exam to 90%

Overview

Decision Snapshot: Needs Validation

The pipeline can be prototyped with managed services and shows clear gains on one benchmark. It depends on a curated vector corpus and proprietary model APIs, so expect integration and data-quality work before production.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 3/3

Reproducibility

Status: No open assets linked

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 60%

Authors

Jace Grandinetti, Rafe McBeth

Why It Matters For Business

You can raise domain accuracy of off-the-shelf LLMs without costly fine-tuning by adding a small retrieval corpus, re-ranking, and stepwise prompting, lowering risk and time-to-value for domain AI features.

Who Should Care

ML Engineer, Data Scientist, Engineering Lead, CTO

Summary TLDR

ARCoT (Adaptable Retrieval-based Chain of Thought) joins a hybrid retrieval step (including a ‘step-back’ query), a re-ranking transformer, and chain-of-thought prompting to improve domain accuracy without fine-tuning. Evaluated on 128 multiple-choice medical physics questions, ARCoT raised GPT-4 from 67% to 90% and gave an average improvement of ~47% over base models and ~15% over RAG alone. The pipeline uses ~60 domain documents (≈10k vectors), OpenAI embeddings, Pinecone, and a Cohere re-ranker, and it deliberately limits context to ~8 top passages to avoid long-context degradation.

Problem Statement

General-purpose LLMs lack up-to-date, detailed domain knowledge and can hallucinate. Fine-tuning is costly and risky. The paper asks: can we get specialist-level accuracy in medical physics by combining retrieval and prompting without retraining?

Main Contribution

ARCoT framework combining hybrid retrieval, step-back prompting, re-ranking, and chain-of-thought reasoning.

Practical retrieval pipeline: ~60 open-source domain docs → ~10k vectors, OpenAI embeddings, Pinecone store, Cohere re-ranker, keep top 8 passages.
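The flow described above (retrieve with the original and a step-back query, re-rank, keep the top passages, then prompt for step-by-step reasoning) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `step_back`, `search`, and `rerank` are hypothetical callables standing in for the LLM, the vector store, and the re-ranker, `per_query_k=20` is an assumed value, and the prompt wording is invented for the example. Only the cap of about 8 passages comes from the summary.

```python
from typing import Callable, Dict, List

Passage = Dict[str, str]  # hypothetical shape: {"id": ..., "text": ...}


def arcot_context(
    question: str,
    step_back: Callable[[str], str],                       # stand-in for an LLM call
    search: Callable[[str, int], List[Passage]],           # stand-in for the vector store
    rerank: Callable[[str, List[Passage]], List[Passage]], # stand-in for the re-ranker
    per_query_k: int = 20,  # assumed value, not from the source
    keep: int = 8,          # the summary reports keeping ~8 top passages
) -> List[Passage]:
    """Retrieve with the original and step-back queries, merge, re-rank, truncate."""
    queries = [question, step_back(question)]
    merged: Dict[str, Passage] = {}
    for query in queries:
        for passage in search(query, per_query_k):
            merged.setdefault(passage["id"], passage)  # de-duplicate by passage id
    ranked = rerank(question, list(merged.values()))
    return ranked[:keep]


def build_prompt(question: str, passages: List[Passage]) -> str:
    """Assemble a chain-of-thought style prompt (wording is illustrative)."""
    context = "\n\n".join(f"[{i + 1}] {p['text']}" for i, p in enumerate(passages))
    return (
        "Use the numbered reference passages to answer the multiple-choice question.\n"
        "Reason step by step, then give the final answer letter.\n\n"
        f"References:\n{context}\n\nQuestion: {question}\n"
    )
```

Keeping the three external services behind plain callables makes it possible to test the merge and truncation logic offline before wiring in real embedding, vector-store, and re-ranking clients.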

Key Findings

GPT-4 score rose from 67% (base) to 90% with ARCoT.

Numbers: 67% → 90% (+23 percentage points)

Practical Use: Apply ARCoT to advanced LLMs to achieve human-level or better accuracy on medical-physics multiple-choice tasks without retraining.

Evidence Ref: Results / Table 1
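A gain quoted in percentage points differs from a relative gain, and this summary does not say which convention the averaged improvement figures use. The short calculation below shows both readings for the GPT-4 numbers only; the function name `gain` is made up for the example.

```python
from typing import Tuple


def gain(base_pct: float, new_pct: float) -> Tuple[float, float]:
    """Return (percentage-point gain, relative gain in percent)."""
    points = new_pct - base_pct
    relative = 100.0 * points / base_pct
    return points, relative


points, relative = gain(67.0, 90.0)
print(f"{points:.0f} pp, {relative:.1f}% relative")  # prints "23 pp, 34.3% relative"
```

When comparing against other reports, it helps to confirm which of the two conventions each number follows.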

ARCoT gave an average improvement over base models and RAG.

Numbers: avg +47% vs base; +15% vs RAG alone

Practical Use: If you already use RAG, add step-back prompts and re-ranking for another ~15% gain on similar benchmarks.

Evidence Ref: Results section (average improvements)

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
GPT-4 overall score	Base 67% → ARCoT 90%	Base GPT-4 67%	+23 pp	128 RAPHEX questions	Table 1; Results	Table 1
Average improvement	avg +47% vs base; +15% vs RAG	Base models and RAG-only runs	+47% / +15%	All models on 128-question exam	Results section	Results

What To Try In 7 Days

Assemble a small, high-quality domain doc set (~50–100 texts) and chunk into vectors.

Embed queries and also generate a 'step-back' simpler query to expand retrieval hits.

Use a re-ranker and keep ~6–8 top passages to avoid long-context degradation and reduce token costs.
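For the first two steps, a starting point might look like the sketch below. It is an illustration under stated assumptions: the summary does not give the chunk size, overlap, or step-back prompt wording, so the `size=200` and `overlap=40` defaults and the prompt text are hypothetical, as are the function names.

```python
from typing import List


def chunk_words(text: str, size: int = 200, overlap: int = 40) -> List[str]:
    """Split text into overlapping word windows (default sizes are illustrative)."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    words = text.split()
    step = size - overlap
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


def step_back_prompt(question: str) -> str:
    """Prompt asking an LLM for a broader search query (wording is hypothetical)."""
    return (
        "Rewrite the question below as a more general question about the "
        "underlying concept or principle, suitable for document search.\n\n"
        f"Question: {question}\nGeneral question:"
    )
```

Each chunk would then be embedded and stored, and the model's reply to `step_back_prompt` would be embedded alongside the original question at query time.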

Optimization Features

Token Efficiency

re-rank to limit context to ~8 passages

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Partial

License: Unknown

Risks & Boundaries

Limitations

Small, non-comprehensive domain corpus (~60 open-source docs); broader data may change results.

Benchmark excludes image/table questions; not multimodal.

When Not To Use

When your domain data includes essential images or tables (ARCoT is text-only).

When you can afford/need full fine-tuning for a production-grade domain model.

Failure Modes

Wrong or missing retrieved documents lead to incorrect answers (hallucination persists).

Too many context passages cause 'lost in the middle' and degraded performance.
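One way to guard against both failure modes is to cap the context and flag questions where nothing relevant was retrieved. The sketch below is a suggestion, not something the paper describes: the `min_score=0.3` threshold and the `select_context` helper are hypothetical, and a real threshold would need tuning against the re-ranker's score scale.

```python
from typing import List, Tuple


def select_context(
    scored: List[Tuple[float, str]],  # (relevance score, passage text) pairs
    min_score: float = 0.3,           # hypothetical cut-off, needs tuning
    max_passages: int = 8,            # cap to limit 'lost in the middle' effects
) -> Tuple[List[str], bool]:
    """Return the best passages above the threshold and whether any survived."""
    relevant = [item for item in scored if item[0] >= min_score]
    relevant.sort(key=lambda item: item[0], reverse=True)
    kept = relevant[:max_passages]
    return [text for _, text in kept], bool(kept)
```

When the second return value is `False`, the caller could fall back to answering without context, or decline to answer, instead of passing weak passages to the model.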

Core Entities

Models

GPT-4, GPT-3.5, Claude 2.1, Gemini Pro 1.0

Metrics

Percent correct, Average improvement vs base, Improvement vs RAG

Datasets

RAPHEX 2023 Therapy multiple-choice (128 Q subset)

Collection of ~60 open-source medical-physics documents (AAPM Task Groups, MPPGs, IAEA text)

Benchmarks

Medical physics multiple-choice exam (128 RAPHEX questions)

