Overview
The pipeline can be prototyped with managed services and shows clear gains on one benchmark, but it depends on a curated vector corpus and proprietary model APIs, so expect integration and data-quality work before production.
Citations: 0
Evidence Strength: 0.70
Confidence: 0.85
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 3/3
Findings with evidence refs: 3/3
Results with explicit delta: 3/3
Reproducibility
Status: No open assets linked
Open source: Partial
At A Glance
Cost impact: 50%
Production readiness: 60%
Novelty: 60%
Why It Matters For Business
You can raise domain accuracy of off-the-shelf LLMs without costly fine-tuning by adding a small retrieval corpus, re-ranking, and stepwise prompting, lowering risk and time-to-value for domain AI features.
Who Should Care
Summary TLDR
ARCoT (Adaptable Retrieval-based Chain of Thought) joins a hybrid retrieval step (including a ‘step-back’ query), a re-ranking transformer, and chain-of-thought prompting to improve domain accuracy without fine-tuning. Evaluated on 128 multiple-choice medical physics questions, ARCoT raised GPT-4 from 67% to 90% and gave an average improvement of ~47% over base models and ~15% over RAG alone. The pipeline uses ~60 domain documents (≈10k vectors), OpenAI embeddings, Pinecone, and a Cohere re-ranker, and it deliberately limits context to ~8 top passages to avoid long-context degradation.
Problem Statement
General-purpose LLMs lack up-to-date, detailed domain knowledge and can hallucinate. Fine-tuning is costly and risky. The paper asks: can we get specialist-level accuracy in medical physics by combining retrieval and prompting without retraining?
Main Contribution
ARCoT framework combining hybrid retrieval, step-back prompting, re-ranking, and chain-of-thought reasoning.
Practical retrieval pipeline: ~60 open-source domain docs → ~10k vectors, OpenAI embeddings, Pinecone store, Cohere re-ranker, keep top 8 passages.
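The retrieve-then-re-rank flow described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the bag-of-words `embed`, the in-memory list of chunks, and the term-coverage `rerank` are toy stand-ins (assumptions) for OpenAI embeddings, the Pinecone store, and the Cohere re-ranker. Only the two-stage shape and the cap on final passages come from the source.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words vector; a stand-in for a real embedding model."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, chunks, k=20):
    """Stage 1: rank every chunk by vector similarity, keep k candidates."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

def rerank(query, candidates, top_n=8):
    """Stage 2: re-score candidates and keep top_n (8 in the source).

    Scoring by query-term coverage is a toy stand-in for a re-ranking model.
    """
    terms = set(query.lower().split())

    def coverage(chunk):
        return len(terms & set(chunk.lower().split())) / len(terms) if terms else 0.0

    return sorted(candidates, key=coverage, reverse=True)[:top_n]

# Hypothetical chunks, invented for illustration only.
chunks = [
    "the half value layer depends on beam energy",
    "linear accelerators produce megavoltage photon beams",
    "dose falls with the inverse square of distance",
    "the half value layer is the thickness that halves beam intensity",
]
query = "what is the half value layer"
context = rerank(query, retrieve(query, chunks, k=3), top_n=2)
```

In a real build, each stand-in would be replaced by the corresponding managed service, with the wide first-stage pool and the narrow final cut kept as separate, tunable parameters.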
Key Findings
GPT-4 score rose from 67% (base) to 90% with ARCoT.
ARCoT gave an average improvement of ~47% over base models and ~15% over RAG alone.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| GPT-4 overall score | Base 67% → ARCoT 90% | Base GPT-4 67% | +23 pp | 128 RAPHEX questions | Table 1; Results | Table 1 |
| Average improvement | avg +47% vs base; +15% vs RAG | Base models and RAG-only runs | +47% / +15% | All models on 128-question exam | Results section | Results |
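When reading the deltas above, it helps to keep percentage points separate from relative gains. The sketch below works through the GPT-4 row only. The source does not state whether the +47% and +15% averages are relative gains or percentage points, so they are left out.

```python
def absolute_gain_pp(base_pct, new_pct):
    """Difference in percentage points between two scores."""
    return new_pct - base_pct

def relative_gain_pct(base_pct, new_pct):
    """Gain expressed as a percentage of the baseline score."""
    return 100.0 * (new_pct - base_pct) / base_pct

base, arcot = 67.0, 90.0  # GPT-4 scores reported in the table
pp = absolute_gain_pp(base, arcot)
rel = relative_gain_pct(base, arcot)
print(f"+{pp:.0f} pp absolute, +{rel:.1f}% relative")
# prints: +23 pp absolute, +34.3% relative
```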
What To Try In 7 Days
Assemble a small, high-quality domain doc set (~50–100 texts), split it into chunks, and embed the chunks as vectors.
Embed each query, and also generate a simpler 'step-back' query to expand retrieval hits.
Use a re-ranker and keep ~6–8 top passages to avoid long-context degradation and reduce token costs.
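The first two steps above might look like the sketch below: split documents into overlapping word windows, then pool hits from the original query and the step-back query before re-ranking. The window size, overlap, and interleaving strategy are assumptions chosen for illustration, and the hit identifiers are hypothetical. The source does not specify how chunks were cut or how the two result sets were combined.

```python
from itertools import zip_longest

def chunk_text(text, size=50, overlap=10):
    """Split text into overlapping windows of `size` words."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    words = text.split()
    if not words:
        return []
    step = size - overlap
    last_start = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + size]) for i in range(0, last_start, step)]

def merge_hits(primary, step_back):
    """Interleave hits from both queries, dropping duplicates.

    The result is a candidate pool intended to be passed to a re-ranker.
    """
    merged, seen = [], set()
    for pair in zip_longest(primary, step_back):
        for hit in pair:
            if hit is not None and hit not in seen:
                seen.add(hit)
                merged.append(hit)
    return merged

doc = " ".join(f"w{i}" for i in range(120))  # placeholder 120-word document
chunks = chunk_text(doc)

# Hypothetical chunk IDs returned for the original and step-back queries.
pool = merge_hits(["a", "b", "c"], ["b", "d"])
```

Interleaving keeps the best hits from both queries near the front of the pool, so neither query dominates if the pool is truncated before re-ranking.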
Optimization Features
Token Efficiency
Reproducibility
Risks & Boundaries
Limitations
Small, non-comprehensive domain corpus (~60 open-source docs); broader data may change results.
Benchmark excludes image/table questions; not multimodal.
When Not To Use
When your domain data includes essential images or tables (ARCoT is text-only).
When you can afford/need full fine-tuning for a production-grade domain model.
Failure Modes
Wrong or missing retrieved documents lead to incorrect answers (hallucination persists).
Too many context passages cause 'lost in the middle' and degraded performance.
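Both failure modes can be guarded against at prompt-assembly time. The sketch below caps the number of passages and refuses to build a prompt when retrieval returns nothing. It assumes `passages` arrives already sorted by re-ranker score, and the prompt wording is illustrative rather than taken from the paper.

```python
MAX_PASSAGES = 8  # cap reported in the source

def build_prompt(question, choices, passages, max_passages=MAX_PASSAGES):
    """Assemble a chain-of-thought prompt from the top re-ranked passages."""
    if not passages:
        # Missing retrieval would leave the answer ungrounded.
        raise ValueError("no retrieved passages for this question")
    kept = passages[:max_passages]
    context = "\n".join(f"[{i}] {p}" for i, p in enumerate(kept, start=1))
    options = "\n".join(f"{label}. {text}" for label, text in choices.items())
    return (
        "Use only the numbered passages below.\n"
        f"{context}\n\n"
        f"Question: {question}\n{options}\n\n"
        "Think step by step, cite passage numbers, then give the final answer "
        "letter. If the passages do not contain the answer, say so."
    )

# Placeholder inputs, invented for illustration.
passages = [f"passage {i}" for i in range(1, 13)]
prompt = build_prompt(
    "Which option is correct?",
    {"A": "first option", "B": "second option"},
    passages,
)
```

Raising on an empty result makes retrieval gaps visible in logs instead of letting the model answer from memory alone.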

