Survey: hybrid LLM architectures (RAG, agents, verifiers) for complex question answering

February 17, 2023 | 9 min

Overview

Decision Snapshot: Needs Validation

The survey synthesizes many practical, field-proven patterns (RAG + agents + verifiers) that are production-usable but require careful engineering for cost, privacy, and evaluation.

Citations: 6

Evidence Strength: 0.75

Confidence: 0.84

Risk Signals: 13

Trust Signals

Findings with numeric evidence: 2/4

Findings with evidence refs: 4/4

Results with explicit delta: 1/2

Reproducibility

Status: No open assets linked

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 50%

Authors

Xavier Daull, Patrice Bellot, Emmanuel Bruno, Vincent Martin, Elisabeth Murisasco

Why It Matters For Business

For real-world complex Q&A, LLMs must be combined with retrieval, tools, verifiers, and human feedback to achieve accuracy, auditability, and privacy. This reduces risk and improves trust but raises cost and latency.

Summary TLDR

This is a practical survey of methods for making large language models (LLMs) answer complex, non-factoid questions. It catalogs required skills (decomposition, retrieval, reasoning, memory), evaluation suites and metrics (HELM, BBH, MMLU-Pro, FActScore/RAGAS), training and preference objectives (SFT, RLHF, DPO/ORPO/KTO, RFT), and a taxonomy of hybrid architectures (RAG, tools/code interpreters, verifier loops, multi-agent controllers). The paper concludes that the current best practice for robust complex QA is an agentic, retrieval-grounded, verifier-guided pipeline that allocates extra inference “reasoning-time” to hard queries and reports evidence/attribution alongside answers. It also flags big

Problem Statement

Off-the-shelf LLMs are excellent at many single-step tasks but fail on complex, multi-step, domain-specific questions that need decomposition, multi-source grounding, deep reasoning, explainability, sensitive-data controls, and human alignment. The field lacks stable end-to-end benchmarks, standardized metrics for long-form faithfulness, and deployment patterns that balance cost, privacy, and accuracy.

Main Contribution

Systematic review and taxonomy of skills, tasks, and limits for complex QA with LLMs.

Survey and critique of evaluation metrics, living leaderboards, and datasets for complex QA.

Key Findings

Best-practice stacks now couple agentic controllers, retrieval-grounding, and verifier/PRM loops to answer complex questions.

Practical Use: Build pipelines that orchestrate LLM planners, retrievers, tool calls, and verifiers; expect to trade latency and cost for higher factuality and auditability.

Evidence Ref: Conclusion; §8 (agentic meta-architectures); §6 (hybridization); §4 (evaluation)

Preference tuning (SFT + human feedback) materially improves answer quality and alignment; Instruct-style RLHF led to strong human preference gains in practice.

Numbers: 85% of human raters preferred InstructGPT outputs (reported example)

Practical Use: Add preference optimization (RLHF or RL-free DPO/ORPO/KTO) when you need helpfulness and alignment; calibrate labeler expertise to your domain.

Evidence Ref: §5.6 (RLHF example referencing [168])
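
To make the RL-free option above concrete, here is a minimal sketch of the per-pair DPO loss in its commonly published form. It is an illustration under stated assumptions, not the survey's own formulation: the function names, the argument names, and the default `beta=0.1` are hypothetical, and the inputs are assumed to be summed token log-probabilities of a chosen and a rejected answer under the policy being tuned and under a frozen reference model.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value; tune per task
) -> float:
    """Per-pair DPO loss: -log sigmoid(beta * (chosen margin - rejected margin))."""
    # How much more the policy likes each answer than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    return -log_sigmoid(beta * (chosen_margin - rejected_margin))


# A policy identical to the reference gives log(2); shifting probability
# toward the chosen answer lowers the loss.
neutral = dpo_loss(-10.0, -10.0, -10.0, -10.0)
improved = dpo_loss(-8.0, -12.0, -10.0, -10.0)
print(neutral, improved)
```

The loss only needs log-probabilities from two forward passes per answer, which is why this family of methods avoids training a separate reward model.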

Results

Metric: Human preference for Instruct-style alignment
Value: 85% preferred (reported example for InstructGPT SFT+RLHF)
Baseline: GPT-3 pre-alignment
Delta: none listed
Split / Dataset: none listed
Evidence: Instruct-style RLHF example reported higher human preference for aligned outputs
Evidence Ref: §5.6 (citing [168])

Metric: BBH sample task performance (logic / reasoning)
Value: Top open LLM scores vary by task; example: logical_deduction_three_objects 98.4% (top open LLM) vs human average 73.6%
Baseline: Human average 73.6%
Delta: +24.8 pp (top open LLM vs human avg) on that task
Split / Dataset: BIG-bench Hard (BBH) example in Table 1
Evidence: Table 1 BBH comparisons of human vs top open LLM scores across tasks
Evidence Ref: §3.3 (Table 1)

What To Try In 7 Days

Prototype a RAG pipeline: index a small domain corpus + dense retriever + LLM prompt with retrieved snippets and cited evidence.

Add a simple verifier (few-shot LLM-as-judge) to rerank two candidate answers and inspect disagreements.

Replace full fine-tuning with QLoRA or LoRA adapters on a 4-bit quantized model for a quick domain adaptation proof-of-concept.
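
The RAG prototype in the first item above can be sketched with the Python standard library alone. This is a minimal sketch under loud assumptions: the corpus, the names `retrieve` and `build_prompt`, and the prompt wording are all hypothetical, and the bag-of-words cosine scorer is a stand-in for the dense retriever the item calls for. The LLM call itself is left out; the sketch stops at the evidence-cited prompt.

```python
import math
import re
from collections import Counter

# Hypothetical toy corpus (illustrative only); swap in your own domain documents.
CORPUS = {
    "doc1": "LoRA adds low-rank adapter matrices to frozen model weights.",
    "doc2": "Retrieval-augmented generation grounds answers in retrieved passages.",
    "doc3": "Speculative decoding uses a small draft model to speed up inference.",
}


def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(count * b[token] for token, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query, corpus, k=2):
    """Return up to k (doc_id, text) pairs with non-zero similarity to the query."""
    q = Counter(tokenize(query))
    scored = sorted(
        ((cosine(q, Counter(tokenize(text))), doc_id) for doc_id, text in corpus.items()),
        reverse=True,
    )
    return [(doc_id, corpus[doc_id]) for score, doc_id in scored[:k] if score > 0]


def build_prompt(question, snippets):
    """Assemble a prompt that asks the model to cite snippet ids."""
    evidence = "\n".join(f"[{doc_id}] {text}" for doc_id, text in snippets)
    return (
        "Answer using only the evidence below and cite snippet ids in brackets.\n"
        f"Evidence:\n{evidence}\n"
        f"Question: {question}\nAnswer:"
    )


QUESTION = "How does retrieval-augmented generation ground answers?"
snippets = retrieve(QUESTION, CORPUS)
prompt = build_prompt(QUESTION, snippets)
print(prompt)
```

Keeping the snippet ids in the prompt is what makes the later verifier step cheap: a judge can check each bracketed id against the retrieved text instead of re-searching the corpus.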

Agent Features

Memory
short-term KV memory (session); long-term external memory / index; episodic memory for reflection
Planning
task decomposition & plan generation; adaptive compute allocation (reasoning-time); hierarchical task routing
Tool Use
function calling / API orchestration; code interpreters for computation; search and retrieval tools
Frameworks
MetaGPT-style workflows; DSPy for compiling pipelines; AgentLab / Deep Researcher patterns
Is Agentic

Yes

Architectures
planner-worker multi-agent; hierarchical agents; controller (plan-act-observe-reflect) + tools
Collaboration
multi-agent coordination; role-based workers (planner/researcher/reviewer)
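
The controller pattern listed under Architectures (plan, act, observe, reflect, plus tools) can be sketched as a small loop. Everything here is a hypothetical stand-in: `plan` returns a fixed plan where a real system would call an LLM planner, the two tools are toy functions, and `verify` re-derives the result instead of calling a learned verifier. The two figures in `FACTS` are the BBH numbers quoted in the Results section of this digest.

```python
# Figures quoted in the Results section of this digest.
FACTS = {"bbh_top_open_llm": "98.4", "bbh_human_avg": "73.6"}


def lookup(key):
    """Toy retrieval tool: fetch a stored fact by key."""
    return FACTS[key]


def subtract(args):
    """Toy compute tool: 'a,b' -> a - b, formatted to one decimal."""
    left, right = (float(x) for x in args.split(","))
    return f"{left - right:.1f}"


TOOLS = {"lookup": lookup, "subtract": subtract}


def plan(question, feedback=None):
    """Hypothetical planner: a fixed plan standing in for an LLM call.

    A real planner would use `question` and any verifier `feedback`.
    "$i" refers to the i-th earlier observation.
    """
    return [
        ("lookup", "bbh_top_open_llm"),
        ("lookup", "bbh_human_avg"),
        ("subtract", "$0,$1"),
    ]


def verify(answer, observations):
    """Toy verifier: independently recompute the difference and compare."""
    expected = float(observations[0]) - float(observations[1])
    return abs(float(answer) - expected) < 0.05


def run_controller(question, max_attempts=2):
    feedback = None
    observations = []
    for _ in range(max_attempts):
        observations = []
        for tool_name, arg in plan(question, feedback):       # plan
            for i, obs in enumerate(observations):
                arg = arg.replace(f"${i}", obs)               # fill in earlier results
            observations.append(TOOLS[tool_name](arg))        # act + observe
        answer = observations[-1]
        if verify(answer, observations):                      # verify
            return answer, observations
        feedback = f"verifier rejected answer {answer}"       # reflect, then replan
    return None, observations


answer, trace = run_controller("How far above the human average is the top open LLM?")
print(answer, trace)
```

Returning the observation trace alongside the answer is one simple way to support the evidence and attribution reporting the survey recommends.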

Optimization Features

Token Efficiency
train-short/test-long (ALiBi) for long contexts; context compression and retrieval gating
Infra Optimization
sparse MoE to reduce FLOPs per request; GPU/TPU memory-aware training schedules (LoftQ)
Model Optimization
MoE; weight pruning (SparseGPT/Sparse methods); LoRA
System Optimization
LoRA; federated/offsite tuning to protect private data
Training Optimization
mixture-of-denoisers (U-PaLM style); mid-training for retrieval/tool-awareness; compute-optimal token/model scaling (Chinchilla principles)
Inference Optimization
reasoning-time allocation (cascades, self-consistency); speculative decoding and FlashAttention for speed; adaptive early-exit cascades
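
The reasoning-time allocation item under Inference Optimization combines two ideas that are easy to show together: self-consistency (sample several answers, take the majority) and a cascade (escalate to a stronger model only when the cheap one disagrees with itself). The sketch below is an assumption-laden illustration: `cheap_model` and `strong_model` are deterministic stubs in place of sampled LLM calls, and the `threshold=0.8` agreement cutoff is an arbitrary choice.

```python
from collections import Counter


def self_consistency(sample_answer, question, n_samples=5):
    """Sample n answers and return (majority answer, agreement rate)."""
    votes = Counter(sample_answer(question, i) for i in range(n_samples))
    answer, count = votes.most_common(1)[0]
    return answer, count / n_samples


def cascade(question, cheap_model, strong_model, threshold=0.8, n_samples=5):
    """Answer with the cheap model unless its samples disagree too much."""
    answer, agreement = self_consistency(cheap_model, question, n_samples)
    if agreement >= threshold:
        return answer, "cheap"
    # Low agreement: spend extra inference compute on the stronger model.
    answer, _ = self_consistency(strong_model, question, n_samples)
    return answer, "strong"


# Hypothetical stubs; `i` is the sample index, standing in for sampling noise.
def cheap_model(question, i):
    if question.startswith("easy"):
        return "4"
    return ["12", "15", "12", "9", "15"][i % 5]  # inconsistent on hard input


def strong_model(question, i):
    return "4" if question.startswith("easy") else "12"


easy_result = cascade("easy: what is 2 + 2?", cheap_model, strong_model)
hard_result = cascade("hard: multi-step word problem", cheap_model, strong_model)
print(easy_result, hard_result)
```

Agreement among samples is only a proxy for correctness, so a production cascade would typically pair it with a verifier score before deciding whether to escalate.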

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Hallucination remains a core failure mode; verifiers and retrieval reduce but do not eliminate it.

Evaluation blind spots for long-form free answers and explainability; automatic metrics are imperfect.

When Not To Use

When very low latency and minimal cost are mandatory (simple models or cached answers suffice).

For single-fact or short-span QA where standard IR+smaller models already work well.

Failure Modes

Ungrounded hallucinations despite citations.

Retrieval errors or index poisoning leading to wrong evidence.

Core Entities

Models

GPT-family; RETRO; Atlas; DeepSeek-R1; InstructGPT; PaLM / U-PaLM; MoE; LLaMA / Llama-3

Metrics

Accuracy; Recall@k / nDCG / MAP / MRR; BERTScore / BARTScore / T5Score; FActScore (factual precision); RAGAS (groundedness); Calibration (RMS)
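
For the retrieval metrics named above, a minimal sketch of Recall@k, reciprocal rank (averaged over queries to give MRR), and nDCG@k using their standard definitions. The function names and the example ranking are hypothetical, and binary relevance gains are assumed in the example.

```python
import math


def recall_at_k(ranked, relevant, k):
    """Fraction of relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant result, or 0 if none is retrieved."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """MRR over a list of (ranked, relevant) pairs, one pair per query."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


def ndcg_at_k(ranked, gains, k):
    """nDCG@k with gains given as {doc_id: gain}; missing docs have gain 0."""
    dcg = sum(gains.get(doc, 0) / math.log2(i + 2) for i, doc in enumerate(ranked[:k]))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg else 0.0


# Hypothetical ranking for one query; d1 and d2 are the relevant documents.
ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}

recall2 = recall_at_k(ranked, relevant, k=2)
rr = reciprocal_rank(ranked, relevant)
ndcg4 = ndcg_at_k(ranked, {"d1": 1, "d2": 1}, k=4)
print(recall2, rr, round(ndcg4, 4))
```

Recall@k ignores ordering inside the top k, while reciprocal rank and nDCG reward placing relevant evidence early, which matters when only the first few snippets fit in the prompt.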

Datasets

HELM; BIG-bench (BBH/BBEH); MMLU-Pro; Humanity's Last Exam; Eli5; SQuAD; MS MARCO; LongBench; HotpotQA; CRAG

Benchmarks

HELM Capabilities; BBH / BBEH; Humanity's Last Exam (HLE); MMLU-Pro; Arena-Hard / Chatbot Arena Elo

Context Entities

Models

WebGPT; GopherCite; MetaGPT; Deep Researcher; AlphaCode

Metrics

ROUGE / BLEU / METEOR; MAUVE; Human preference (Elo/Arena); Factuality/grounding judge-based scores

Datasets

LongBench v2; CRAG; MoreHopQA; BookSum; BioASQ; NarrativeQA

Benchmarks

KILT; SWE-bench; TrustLLM