Survey: hybrid LLM architectures (RAG, agents, verifiers) for complex question answering

Overview

Decision Snapshot: Needs Validation

The survey synthesizes many practical, field-proven patterns (RAG + agents + verifiers) that are production-usable but require careful engineering for cost, privacy, and evaluation.

Citations: 6

Evidence Strength: 0.75

Confidence: 0.84

Risk Signals: 13

Trust Signals

Findings with numeric evidence: 2/4

Findings with evidence refs: 4/4

Results with explicit delta: 1/2

Reproducibility

Status: No open assets linked

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 50%

Authors

Xavier Daull, Patrice Bellot, Emmanuel Bruno, Vincent Martin, Elisabeth Murisasco

Why It Matters For Business

For real-world complex Q&A, LLMs must be combined with retrieval, tools, verifiers and human feedback to get accuracy, auditability and privacy. This reduces risk and improves trust but raises cost and latency.

Who Should Care

CTO, Product Manager, ML Engineer, Data Scientist, Engineering Lead, Founder

Summary TLDR

This is a practical survey of methods to make large language models (LLMs) answer complex, non-factoid questions. It catalogs required skills (decomposition, retrieval, reasoning, memory), evaluation suites and metrics (HELM, BBH, MMLU-Pro, FActScore/RAGAS), training and preference objectives (SFT, RLHF, DPO/ORPO/KTO, RFT), and a taxonomy of hybrid architectures (RAG, tools/code interpreters, verifier loops, multi-agent controllers). The paper concludes the current best practice for robust complex QA is an agentic, retrieval-grounded, verifier-guided pipeline that allocates extra inference “reasoning-time” for hard queries and reports evidence/attribution alongside answers. It also flags big

Problem Statement

Off-the-shelf LLMs are excellent at many single-step tasks but fail on complex, multi-step, domain-specific questions that need decomposition, multi-source grounding, deep reasoning, explainability, sensitive-data controls, and human alignment. The field lacks stable end-to-end benchmarks, standardized metrics for long-form faithfulness, and deployment patterns that balance cost, privacy, and accuracy.

Main Contribution

Systematic review and taxonomy of skills, tasks, and limits for complex QA with LLMs.

Survey and critique of evaluation metrics, living leaderboards, and datasets for complex QA.

Key Findings

Best-practice stacks now couple agentic controllers, retrieval-grounding, and verifier/PRM loops to answer complex questions.

Practical Use: Build pipelines that orchestrate LLM planners, retrievers, tool calls and verifiers; expect to trade latency and cost for higher factuality and auditability.

Evidence Ref: Conclusion; §8 (agentic meta-architectures); §6 (hybridization); §4 (evaluation)
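The orchestration described in this finding can be sketched as a small controller loop. This is a minimal illustration, not the survey's implementation: `answer_with_verifier`, the `Answer` type, and the four callables (`plan`, `retrieve`, `generate`, `verify`) are hypothetical names, and the lambdas at the bottom are toy stand-ins for a real planner, retriever, LLM and verifier.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Answer:
    text: str
    evidence: List[str] = field(default_factory=list)
    verified: bool = False
    attempts: int = 0


def answer_with_verifier(
    question: str,
    plan: Callable[[str], List[str]],
    retrieve: Callable[[str], List[str]],
    generate: Callable[[str, List[str]], str],
    verify: Callable[[str, str, List[str]], bool],
    max_attempts: int = 3,
) -> Answer:
    """Plan, retrieve, draft, then verify; retry until verified or budget spent."""
    evidence: List[str] = []
    draft = ""
    for attempt in range(1, max_attempts + 1):
        # Plan: split the question into sub-queries (hypothetical planner).
        for sub_query in plan(question):
            # Act/observe: gather evidence, keeping order and dropping duplicates.
            for snippet in retrieve(sub_query):
                if snippet not in evidence:
                    evidence.append(snippet)
        draft = generate(question, evidence)
        if verify(question, draft, evidence):
            return Answer(draft, evidence, verified=True, attempts=attempt)
    # Budget exhausted: return the last draft, flagged as unverified.
    return Answer(draft, evidence, verified=False, attempts=max_attempts)


# Toy stand-ins (assumptions) so the sketch runs without any model or index.
corpus = {
    "capital": "Paris is the capital of France.",
    "river": "The Seine flows through Paris.",
}
plan = lambda q: [w for w in ("capital", "river") if w in q.lower()]
retrieve = lambda sub_query: [corpus[sub_query]]
generate = lambda q, ev: " ".join(ev)
verify = lambda q, a, ev: bool(ev) and all(e in a for e in ev)

result = answer_with_verifier(
    "Which river crosses the capital of France?", plan, retrieve, generate, verify
)
print(result.verified, result.attempts, result.evidence)
```

Each retry costs another generation and verification pass, which is where the latency and cost trade-off named above comes from. With deterministic stubs a retry changes nothing; in a real system the generator would sample or the planner would revise its sub-queries between attempts.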

Preference tuning (SFT + human feedback) materially improves answer quality and alignment; Instruct-style RLHF led to strong human preference gains in practice.

Numbers: 85% of human raters preferred InstructGPT outputs (reported example)

Practical Use: Add preference optimization (RLHF or RL-free DPO/ORPO/KTO) when you need helpfulness and alignment; calibrate labeler expertise to your domain.

Evidence Ref: §5.6 (RLHF example referencing [168])
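As an illustration of the RL-free option mentioned above, the per-pair DPO objective can be written in a few lines. This is a sketch of the standard DPO loss, not code from the survey: the function name `dpo_loss` and the default `beta=0.1` are assumptions, and the inputs are taken to be summed token log-probabilities of a full response under the trained policy and under a frozen reference model.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed strength of the pull toward the reference model
) -> float:
    """Per-pair DPO loss: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(m)) equals log(1 + exp(-m)); this form avoids overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# The loss starts at log(2) when policy and reference agree, and falls as the
# policy shifts probability toward the preferred response.
print(dpo_loss(-1.0, -1.0, -1.0, -1.0))
print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

No reward model or sampling loop is involved: the loss needs only labeled preference pairs and two forward passes per response, which is why this family is often cheaper to try than full RLHF.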

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Human preference for Instruct-style alignment	85% preferred (reported example for InstructGPT SFT+RLHF)	GPT-3 pre-alignment	—	—	Instruct-style RLHF example reported higher human preference for aligned outputs	§5.6 (citing [168])
BBH sample task performance (logic / reasoning)	Top open LLM scores vary by task; example: logical_deduction_three_objects 98.4% (top open LLM) vs human average 73.6%	Human average 73.6%	+24.8 pp (top open LLM vs human avg) on that task	BIG-bench Hard (BBH) example in Table 1	Table 1 BBH comparisons of human vs top open LLM scores across tasks	§3.3 (Table 1)

What To Try In 7 Days

Prototype a RAG pipeline: index a small domain corpus + dense retriever + LLM prompt with retrieved snippets and cited evidence.

Add a simple verifier (few-shot LLM-as-judge) to rerank two candidate answers and inspect disagreements.

Replace full fine-tuning with QLoRA or LoRA adapters on a 4-bit quantized model for a quick domain adaptation proof-of-concept.
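The first two steps above can be prototyped without any external service. The sketch below is a toy under stated assumptions: `embed` is a bag-of-words stand-in for a real dense encoder, the three documents are invented, and the `judge` passed to `rerank` is a hypothetical scoring function that a few-shot LLM-as-judge call would replace.

```python
import math
import re
from collections import Counter
from typing import Callable, List, Tuple


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a stand-in for a real dense encoder."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, docs: List[str], k: int = 2) -> List[Tuple[int, str]]:
    """Return the k most similar documents with their ids for citation."""
    q = embed(query)
    ranked = sorted(enumerate(docs), key=lambda d: cosine(q, embed(d[1])), reverse=True)
    return ranked[:k]


def build_prompt(query: str, hits: List[Tuple[int, str]]) -> str:
    """Number each snippet so the model can cite it as [id]."""
    context = "\n".join(f"[{i}] {text}" for i, text in hits)
    return (
        "Answer using only the sources below and cite them as [id].\n"
        f"{context}\nQuestion: {query}\nAnswer:"
    )


def rerank(candidates: List[str], judge: Callable[[str], float]) -> List[str]:
    """Order candidate answers by a judge score, best first."""
    return sorted(candidates, key=judge, reverse=True)


docs = [
    "LoRA adds low-rank adapter matrices to frozen weights.",
    "Dense retrievers embed queries and passages in one vector space.",
    "Verifiers score candidate answers before they are returned.",
]
query = "How do dense retrievers embed queries?"
hits = retrieve(query, docs, k=2)
prompt = build_prompt(query, hits)

# Hypothetical judge: counts citation markers. A real one would be an LLM call.
candidates = ["It embeds text.", "It embeds queries and passages [1]."]
best = rerank(candidates, judge=lambda answer: answer.count("["))[0]
print(prompt)
print(best)
```

Logging the cases where the judge's top choice differs from the first candidate is a cheap way to do the disagreement inspection suggested above.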

Agent Features

Memory

short-term KV memory (session), long-term external memory / index, episodic memory for reflection

Planning

task decomposition & plan generation, adaptive compute allocation (reasoning-time), hierarchical task routing

Tool Use

function calling / API orchestration, code interpreters for computation, search and retrieval tools

Frameworks

MetaGPT-style workflows, DSPy for compiling pipelines, AgentLab / Deep Researcher patterns

Is Agentic

Yes

Architectures

planner-worker multi-agent, hierarchical agents, controller (plan-act-observe-reflect) + tools

Collaboration

multi-agent coordination, role-based workers (planner/researcher/reviewer)

Optimization Features

Token Efficiency

train-short/test-long (ALiBi) for long contexts, context compression and retrieval gating

Infra Optimization

sparse MoE to reduce FLOPs per request, GPU/TPU memory-aware training schedules (LoftQ)

Model Optimization

MoE, weight pruning (SparseGPT/Sparse methods), LoRA
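For LoRA, the saving comes from replacing a full weight update with a low-rank one. The block below gives the standard formulation with a worked count; the symbols d, k, r, α and the 4096 / rank-8 example are illustrative assumptions, not values taken from the survey.

```latex
W' = W_0 + \frac{\alpha}{r} B A, \qquad B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k},\; r \ll \min(d, k)

\text{trainable parameters: } r(d + k) \quad \text{vs.} \quad dk \text{ for full fine-tuning}

d = k = 4096,\; r = 8: \quad 8 \cdot 8192 = 65{,}536 \quad \text{vs.} \quad 4096^2 = 16{,}777{,}216 \quad (\approx 0.39\%)
```

Only A and B are trained while W_0 stays frozen, which is what lets the frozen weights be stored in 4-bit form in QLoRA-style setups.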

System Optimization

LoRA, federated/offsite tuning to protect private data

Training Optimization

mixture-of-denoisers (U-PaLM style), mid-training for retrieval/tool-awareness, compute-optimal token/model scaling (Chinchilla principles)

Inference Optimization

reasoning-time allocation (cascades, self-consistency), speculative decoding and FlashAttention for speed, adaptive early-exit cascades
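Reasoning-time allocation with self-consistency and early exit can be illustrated with a small voting loop. This is a sketch under assumptions: `self_consistent_answer`, its thresholds, and the scripted samplers are hypothetical; in practice `sample` would call an LLM at non-zero temperature and extract the final answer.

```python
from collections import Counter
from typing import Callable, Tuple


def self_consistent_answer(
    sample: Callable[[], str],
    min_samples: int = 3,
    max_samples: int = 9,
    agree_threshold: float = 0.7,
) -> Tuple[str, float, int]:
    """Sample answers until the majority share passes the threshold or the budget ends.

    Returns (majority answer, its share of the votes, number of samples drawn).
    """
    if max_samples < 1:
        raise ValueError("max_samples must be at least 1")
    votes: Counter = Counter()
    for n in range(1, max_samples + 1):
        votes[sample().strip().lower()] += 1
        answer, count = votes.most_common(1)[0]
        if n >= min_samples and count / n >= agree_threshold:
            return answer, count / n, n  # easy query: stop early and save compute
    answer, count = votes.most_common(1)[0]
    return answer, count / max_samples, max_samples


# Scripted samplers (assumptions) standing in for repeated LLM calls.
easy = iter(["42", "42", "42"])
hard = iter(["a", "b", "a", "c", "a", "a", "b", "a", "a"])

easy_result = self_consistent_answer(lambda: next(easy))
hard_result = self_consistent_answer(lambda: next(hard))
print(easy_result)
print(hard_result)
```

The easy query stops after the minimum number of samples, while the contested one spends the whole budget and still reports a low agreement share, which a pipeline could use as a signal to escalate to a stronger model or a verifier.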

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Partial

License: Unknown

Risks & Boundaries

Limitations

Hallucination remains a core failure mode; verifiers and retrieval reduce but do not eliminate it.

Evaluation blind spots for long-form free answers and explainability; automatic metrics are imperfect.

When Not To Use

When very low latency and minimal cost are mandatory (simple models or cached answers suffice).

For single-fact or short-span QA where standard IR+smaller models already work well.

Failure Modes

Ungrounded hallucinations despite citations.

Retrieval errors or index poisoning leading to wrong evidence.

Core Entities

Models

GPT-family, RETRO, Atlas, DeepSeek-R1, InstructGPT, PaLM / U-PaLM, MoE, LLaMA / Llama-3

Metrics

Accuracy, Recall@k / nDCG / MAP / MRR, BERTScore / BARTScore / T5Score, FActScore (factual precision), RAGAS (groundedness), Calibration (RMS)
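The retrieval metrics in this list are simple to compute once a ranking and a set of relevant ids are available. The sketch below uses the standard binary-relevance definitions; the function names and the example ranking are assumptions. MRR is the mean of `reciprocal_rank` over a set of queries, and MAP is omitted for brevity.

```python
import math
from typing import List, Set


def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of relevant items that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant item, or 0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """nDCG with binary gains: DCG of the ranking over DCG of an ideal ranking."""
    dcg = sum(
        1.0 / math.log2(i + 1)
        for i, doc_id in enumerate(ranked[:k], start=1)
        if doc_id in relevant
    )
    ideal = sum(1.0 / math.log2(i + 1) for i in range(1, min(k, len(relevant)) + 1))
    return dcg / ideal if ideal else 0.0


# Invented example: two relevant documents, retrieved at ranks 2 and 4.
ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
r3 = recall_at_k(ranked, relevant, k=3)
rr = reciprocal_rank(ranked, relevant)
ndcg4 = ndcg_at_k(ranked, relevant, k=4)
print(r3, rr, round(ndcg4, 4))
```

Recall@k ignores order inside the top k, while reciprocal rank and nDCG reward placing relevant evidence early, which matters when only the first few snippets fit in the prompt.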

Datasets

HELM, BIG-bench (BBH/BBEH), MMLU-Pro, Humanity's Last Exam, Eli5, SQuAD, MS MARCO, LongBench, HotpotQA, CRAG

Benchmarks

HELM Capabilities, BBH / BBEH, Humanity's Last Exam (HLE), MMLU-Pro, Arena-Hard / Chatbot Arena Elo

Context Entities

Models

WebGPT, GopherCite, MetaGPT, Deep Researcher, AlphaCode

Metrics

ROUGE / BLEU / METEOR, MAUVE, Human preference (Elo/Arena), Factuality/grounding judge-based scores

Datasets

LongBench v2, CRAG, MoreHopQA, BookSum, BioASQ, NarrativeQA

Benchmarks

KILT, SWE-bench, TrustLLM

