ARCoT: hybrid retrieval + step-back + chain-of-thought boosts LLMs on a medical physics exam to 90%

May 17, 2024 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.5

Citation Count

0

Authors

Jace Grandinetti, Rafe McBeth

Links

Abstract / PDF

Why It Matters For Business

You can raise domain accuracy of off-the-shelf LLMs without costly fine-tuning by adding a small retrieval corpus, re-ranking, and stepwise prompting, lowering risk and time-to-value for domain AI features.

Summary TLDR

ARCoT (Adaptable Retrieval-based Chain of Thought) combines a hybrid retrieval step (including a ‘step-back’ query), a re-ranking transformer, and chain-of-thought prompting to improve domain accuracy without fine-tuning. Evaluated on 128 multiple-choice medical physics questions, ARCoT raised GPT-4 from 67% to 90% and, averaged across the tested models, improved scores by ~47% over the base models and ~15% over RAG alone. The pipeline uses ~60 domain documents (≈10k vectors), OpenAI embeddings, Pinecone, and a Cohere re-ranker, and it deliberately limits context to the ~8 top-ranked passages to avoid long-context degradation.

Problem Statement

General-purpose LLMs lack up-to-date, detailed domain knowledge and can hallucinate. Fine-tuning is costly and risky. The paper asks: can we get specialist-level accuracy in medical physics by combining retrieval and prompting without retraining?

Main Contribution

ARCoT framework combining hybrid retrieval, step-back prompting, re-ranking, and chain-of-thought reasoning.

Practical retrieval pipeline: ~60 open-source domain docs → ~10k vectors, OpenAI embeddings, Pinecone store, Cohere re-ranker, keep top 8 passages.

Strict multiple-choice benchmark protocol (128 RAPHEX questions): a question counts as correct only if the model gives the same correct answer in all five runs.

Empirical gains: ARCoT improves average model scores and brings GPT-4 above reported human average without fine-tuning.
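
The retrieval pipeline listed above can be sketched as follows. This is a minimal, runnable illustration under stated assumptions, not the authors' implementation: a bag-of-words cosine similarity stands in for the OpenAI embeddings and the Pinecone store, a second similarity pass stands in for the Cohere re-ranker, and every name and parameter (`hybrid_retrieve`, `k_each`, `k_final`) is hypothetical.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, chunks, k):
    """Return the indices of the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(range(len(chunks)),
                    key=lambda i: cosine(q, embed(chunks[i])),
                    reverse=True)
    return ranked[:k]

def hybrid_retrieve(question, step_back_question, chunks, k_each=20, k_final=8):
    """Pool hits for the original and step-back queries, re-rank, keep k_final."""
    candidates = set(retrieve(question, chunks, k_each))
    candidates |= set(retrieve(step_back_question, chunks, k_each))
    # Toy re-ranker (assumption): re-score each candidate against the
    # original question; ties are broken by chunk index for determinism.
    q = embed(question)
    reranked = sorted(candidates, key=lambda i: (-cosine(q, embed(chunks[i])), i))
    return [chunks[i] for i in reranked[:k_final]]
```

In a real deployment the two stand-ins would be replaced by calls to an embedding service, a vector store, and a dedicated re-ranking model; the shape of the flow (two queries, pooled candidates, a hard cap of about eight passages) is the part taken from the summary.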

Key Findings

GPT-4 score rose from 67% (base) to 90% with ARCoT.

Numbers: 67% → 90% (+23 percentage points)

Averaged across the tested models, ARCoT outperformed both the base models and RAG alone.

Numbers: average +47% vs. base; +15% vs. RAG alone

Smaller models benefit most from ARCoT.

Numbers: GPT-3.5 showed up to +68% improvement
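
When reading the figures above, it helps to separate percentage-point gains from relative gains. The GPT-4 result is stated in percentage points; this summary does not say which convention the averaged figures use. The short sketch below (helper names are hypothetical) shows both calculations for the GPT-4 numbers only.

```python
def point_gain(base_pct, new_pct):
    """Absolute gain in percentage points."""
    return new_pct - base_pct

def relative_gain(base_pct, new_pct):
    """Gain relative to the base score, in percent."""
    return (new_pct - base_pct) / base_pct * 100

print(point_gain(67, 90))                  # 23
print(round(relative_gain(67, 90), 1))     # 34.3
```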

Results

GPT-4 overall score

Value: Base 67% → ARCoT 90%

Baseline: Base GPT-4 67%

Average improvement

Value: average +47% vs. base; +15% vs. RAG

Baseline: Base models and RAG-only runs

Scoring rigour

Value: An answer counts only if the same correct answer is given in all 5 runs

Baseline: Single-run scoring
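
The strict scoring rule can be expressed in a few lines. This is a sketch assuming each run is stored as a mapping from question ID to the chosen option; the function name and data layout are illustrative, not taken from the paper.

```python
def strict_score(runs, answer_key):
    """Fraction of questions answered correctly in every run.

    runs: list of dicts mapping question ID -> chosen option, one dict per run.
    answer_key: dict mapping question ID -> correct option.
    """
    correct = 0
    for qid, truth in answer_key.items():
        # A missing answer or a single wrong run disqualifies the question.
        if all(run.get(qid) == truth for run in runs):
            correct += 1
    return correct / len(answer_key)
```

Under this rule a model that answers a question correctly in four of five runs gets no credit for it, so the strict score can only be equal to or lower than any single-run score.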

Who Should Care

What To Try In 7 Days

Assemble a small, high-quality domain doc set (~50–100 texts) and chunk into vectors.

Embed each query, and also generate a simpler 'step-back' version of it to broaden the retrieval hits.

Use a re-ranker and keep ~6–8 top passages to avoid long-context degradation and reduce token costs.
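
The prompting side of these steps might look like the sketch below. The prompt wording, function names, and the `max_passages` default are assumptions for illustration; the summary does not reproduce the paper's actual prompts. The strings returned here would be sent to whichever LLM you are evaluating.

```python
def step_back_prompt(question):
    """Prompt asking an LLM for a simpler, more general 'step-back' query."""
    return (
        "Rewrite the question below as a simpler, more general question "
        "about the underlying concept. Return only the rewritten question.\n\n"
        f"Question: {question}"
    )

def cot_prompt(question, choices, passages, max_passages=8):
    """Chain-of-thought prompt over a capped number of re-ranked passages."""
    # Cap the context so the prompt stays short and cheap.
    kept = passages[:max_passages]
    context = "\n\n".join(f"[{i}] {p}" for i, p in enumerate(kept, start=1))
    options = "\n".join(f"{label}. {text}" for label, text in choices.items())
    return (
        "Use the reference passages to answer the multiple-choice question.\n\n"
        f"Passages:\n{context}\n\n"
        f"Question: {question}\n{options}\n\n"
        "Think through the problem step by step, then finish with a line "
        "of the form 'Answer: <letter>'."
    )
```

Asking for a fixed final line such as `Answer: <letter>` makes the output easy to parse when the same question is run several times for consistency checks.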

Optimization Features

Token Efficiency

  • re-rank to limit context to ~8 passages

Reproducibility

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Small, non-comprehensive domain corpus (~60 open-source docs); broader data may change results.
  • Benchmark excludes image/table questions; not multimodal.
  • Evaluated only on commercial LLMs; open-source models not tested.
  • No code or dataset release reported, limiting exact reproducibility.

When Not To Use

  • When your domain data includes essential images or tables (ARCoT is text-only).
  • When you can afford/need full fine-tuning for a production-grade domain model.
  • When you lack a reliable source of domain documents to index.

Failure Modes

  • Wrong or missing retrieved documents lead to incorrect answers (hallucination persists).
  • Too many context passages cause 'lost in the middle' and degraded performance.
  • Models that strictly ignore external context (or constrain answers) can underperform despite retrieval.

Core Entities

Models

  • GPT-4
  • GPT-3.5
  • Claude 2.1
  • Gemini Pro 1.0

Metrics

  • Percent correct
  • Average improvement vs base
  • Improvement vs RAG

Datasets

  • RAPHEX 2023 Therapy multiple-choice (128 Q subset)
  • Collection of ~60 open-source medical-physics documents (AAPM Task Groups, MPPGs, IAEA text)

Benchmarks

  • Medical physics multiple-choice exam (128 RAPHEX questions)