Survey: hybrid LLM architectures (RAG, agents, verifiers) for complex question answering

February 17, 2023 · 9 min

Overview

Production Readiness

0.7

Novelty Score

0.5

Cost Impact Score

0.8

Citation Count

6

Authors

Xavier Daull, Patrice Bellot, Emmanuel Bruno, Vincent Martin, Elisabeth Murisasco

Why It Matters For Business

For real-world complex Q&A, LLMs must be combined with retrieval, tools, verifiers, and human feedback to achieve accuracy, auditability, and privacy. This reduces risk and improves trust but raises cost and latency.

Summary TLDR

This is a practical survey of methods to make large language models (LLMs) answer complex, non-factoid questions. It catalogs required skills (decomposition, retrieval, reasoning, memory), evaluation suites and metrics (HELM, BBH, MMLU-Pro, FActScore/RAGAS), training and preference objectives (SFT, RLHF, DPO/ORPO/KTO, RFT), and a taxonomy of hybrid architectures (RAG, tools/code interpreters, verifier loops, multi-agent controllers). The paper concludes the current best practice for robust complex QA is an agentic, retrieval-grounded, verifier-guided pipeline that allocates extra inference “reasoning-time” for hard queries and reports evidence/attribution alongside answers. It also flags big

Problem Statement

Off-the-shelf LLMs are excellent at many single-step tasks but fail on complex, multi-step, domain-specific questions that need decomposition, multi-source grounding, deep reasoning, explainability, sensitive-data controls, and human alignment. The field lacks stable end-to-end benchmarks, standardized metrics for long-form faithfulness, and deployment patterns that balance cost, privacy, and accuracy.

Main Contribution

Systematic review and taxonomy of skills, tasks, and limits for complex QA with LLMs.

Survey and critique of evaluation metrics, living leaderboards, and datasets for complex QA.

Categorization of hybrid architectural patterns that augment LLMs (retrieval, tools, verifiers, agents).

Practical recipes across training (pre/mid/post), prompting, PEFT, and inference-time strategies (reasoning-time, cascades).

Identification of persistent research gaps: hallucination, data multi-sensitivity, cost, and robust decomposition.

Key Findings

Best-practice stacks now couple agentic controllers, retrieval-grounding, and verifier/PRM loops to answer complex questions.

Preference tuning (SFT + human feedback) materially improves answer quality and alignment; Instruct-style RLHF led to strong human preference gains in practice.

Numbers: 85% of human raters preferred InstructGPT outputs (reported example)

Retrieval-aware mid-training (Self-RAG) and process reward models (PRMs) increase grounding and reduce hallucination in RAG-style systems.

Compute-aware training tricks can cut large pre-training costs: mixture-of-denoisers (U-PaLM) matched PaLM performance at ~half compute, saving ~4.4M TPUv4 hours (reported example).

Numbers: ≈4.4M TPUv4 hours saved (U-PaLM vs PaLM 540B)

Results

Human preference for Instruct-style alignment

Value: 85% preferred (reported example for InstructGPT SFT+RLHF)

Baseline: GPT-3 pre-alignment

BBH sample task performance (logic / reasoning)

Value: Top open LLM scores vary by task; for example, logical_deduction_three_objects reaches 98.4% (top open LLM) vs. a human average of 73.6%

Baseline: Human average 73.6%

Who Should Care

What To Try In 7 Days

Prototype a RAG pipeline: index a small domain corpus + dense retriever + LLM prompt with retrieved snippets and cited evidence.
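
The retrieve-then-prompt step above can be sketched in a few lines. This is a minimal illustration, not the survey's implementation: the `embed` function is a toy bag-of-words stand-in for a real dense encoder, the three-document `corpus` is invented for the example, and the final LLM call is deliberately left out.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for a dense encoder (assumption): a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, corpus, k=2):
    """Return the k (doc_id, text) pairs most similar to the query."""
    q = embed(query)
    ranked = sorted(corpus.items(), key=lambda item: cosine(q, embed(item[1])), reverse=True)
    return ranked[:k]


def build_prompt(query, hits):
    """Number each snippet so the model can cite evidence as [1], [2], ..."""
    lines = ["Answer using only the numbered sources and cite them like [1]."]
    for i, (doc_id, text) in enumerate(hits, start=1):
        lines.append(f"[{i}] ({doc_id}) {text}")
    lines.append(f"Question: {query}")
    return "\n".join(lines)


# Invented example corpus; a real prototype would index domain documents.
corpus = {
    "policy.txt": "Refunds are issued within 14 days of a return request.",
    "shipping.txt": "Standard shipping takes five business days.",
    "warranty.txt": "The warranty covers manufacturing defects for two years.",
}
hits = retrieve("How long do refunds take?", corpus, k=2)
prompt = build_prompt("How long do refunds take?", hits)
print(prompt)  # this string would be sent to the LLM of your choice
```

Numbering the snippets in the prompt is what makes cited evidence checkable afterwards: each `[n]` in the answer maps back to a document id.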

Add a simple verifier (few-shot LLM-as-judge) to rerank two candidate answers and inspect disagreements.
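
One way to wire up such a verifier is sketched below. The `stub_judge` function is a hypothetical placeholder for a few-shot LLM-as-judge call (it scores token overlap with the evidence, which a real judge would not do), and "disagreement" is taken here to mean the judge preferring the second candidate over the generator's first choice; both are assumptions for illustration.

```python
import re


def tokens(text):
    """Lowercased word/number tokens as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def stub_judge(question, answer, evidence):
    """Hypothetical placeholder for an LLM-as-judge call.
    Scores an answer by the fraction of its tokens that appear in the evidence."""
    answer_tokens = tokens(answer)
    if not answer_tokens:
        return 0.0
    return len(answer_tokens & tokens(evidence)) / len(answer_tokens)


def rerank_pair(question, candidates, evidence, judge=stub_judge):
    """Score every candidate and return (index of the best one, all scores)."""
    scores = [judge(question, c, evidence) for c in candidates]
    best = max(range(len(candidates)), key=lambda i: scores[i])
    return best, scores


def find_disagreements(cases, judge=stub_judge):
    """Collect cases where the judge does not pick the generator's first candidate."""
    flagged = []
    for case in cases:
        best, scores = rerank_pair(case["question"], case["candidates"], case["evidence"], judge)
        if best != 0:
            flagged.append({"question": case["question"], "scores": scores})
    return flagged


# Invented example case for illustration.
cases = [
    {
        "question": "How long do refunds take?",
        "candidates": ["Refunds take 30 days.", "Refunds are issued within 14 days."],
        "evidence": "Refunds are issued within 14 days of a return request.",
    },
]
flagged = find_disagreements(cases)
print(flagged)
```

Swapping `stub_judge` for a real model call keeps the rest of the loop unchanged, and the flagged list is the set of cases worth reading by hand.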

Replace full fine-tuning with QLoRA or LoRA adapters on a 4-bit quantized model for a quick domain adaptation proof-of-concept.
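
For intuition on why adapters are cheap, the standard LoRA formulation freezes the pretrained weight and learns only a low-rank update. The notation below follows the LoRA paper rather than this survey, and the layer sizes in the worked example are illustrative assumptions.

```latex
% Frozen weight W_0, trainable low-rank factors B and A, rank r, scaling alpha
h = W_0 x + \frac{\alpha}{r}\, B A x,
\qquad W_0 \in \mathbb{R}^{d \times k},\;
B \in \mathbb{R}^{d \times r},\;
A \in \mathbb{R}^{r \times k},\;
r \ll \min(d, k)

% Trainable parameters per adapted matrix
\underbrace{d\,k}_{\text{full fine-tuning}}
\quad \text{vs.} \quad
\underbrace{r\,(d + k)}_{\text{LoRA}}

% Worked example (assumed sizes): d = k = 4096, r = 8
4096 \times 4096 = 16{,}777{,}216
\quad \text{vs.} \quad
8 \times (4096 + 4096) = 65{,}536 \;\; (\approx 0.39\%)
```

QLoRA keeps the same update but stores the frozen weight in 4-bit precision, so the memory saving comes from the base model while the adapters stay in higher precision.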

Agent Features

Memory

  • short-term KV memory (session)
  • long-term external memory / index
  • episodic memory for reflection

Planning

  • task decomposition & plan generation
  • adaptive compute allocation (reasoning-time)
  • hierarchical task routing

Tool Use

  • function calling / API orchestration
  • code interpreters for computation
  • search and retrieval tools

Frameworks

  • MetaGPT-style workflows
  • DSPy for compiling pipelines
  • AgentLab / Deep Researcher patterns

Is Agentic

true

Architectures

  • planner-worker multi-agent
  • hierarchical agents
  • controller (plan-act-observe-reflect) + tools
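
The controller pattern in the last bullet can be sketched as a bounded loop. This is a minimal illustration under stated assumptions: `plan_fn` stands in for an LLM planner, the `multiply` tool and `scripted_planner` are invented for the demo, and "reflect" is modeled simply by handing the full history back to the planner on the next turn.

```python
def run_controller(goal, plan_fn, tools, max_steps=5):
    """Plan-act-observe-reflect loop with a hard step budget.

    plan_fn(goal, history) returns ("finish", answer) or (tool_name, tool_input).
    """
    history = []
    for _ in range(max_steps):
        # Plan (and reflect): the planner sees every earlier action and observation.
        action, payload = plan_fn(goal, history)
        if action == "finish":
            return {"answer": payload, "history": history, "stopped": "finished"}
        if action in tools:
            observation = tools[action](payload)  # act
        else:
            observation = f"error: unknown tool {action!r}"  # surface misuse to the planner
        history.append((action, payload, observation))  # observe
    # The budget is what prevents a runaway tool loop.
    return {"answer": None, "history": history, "stopped": "budget_exhausted"}


def multiply(expr):
    """Hypothetical tool: multiplies two integers written as 'a*b'."""
    a, b = expr.split("*")
    return str(int(a) * int(b))


def scripted_planner(goal, history):
    """Stand-in for an LLM planner: call the tool once, then finish with its result."""
    if not history:
        return ("multiply", "12*7")
    return ("finish", history[-1][2])


result = run_controller("What is 12 times 7?", scripted_planner, {"multiply": multiply})
print(result["answer"], result["stopped"])
```

Returning the history alongside the answer gives an audit trail of every tool call, which is useful when reporting evidence with the final answer.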

Collaboration

  • multi-agent coordination
  • role-based workers (planner/researcher/reviewer)

Optimization Features

Token Efficiency

  • train-short/test-long (ALiBi) for long contexts
  • context compression and retrieval gating

Infra Optimization

  • sparse MoE to reduce FLOPs per request
  • GPU/TPU memory-aware training schedules (LoftQ)

Model Optimization

  • MoE
  • weight pruning (SparseGPT and related sparsity methods)
  • LoRA

System Optimization

  • LoRA
  • federated/offsite tuning to protect private data

Training Optimization

  • mixture-of-denoisers (U-PaLM style)
  • mid-training for retrieval/tool-awareness
  • compute-optimal token/model scaling (Chinchilla principles)
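
As a rough guide to what the Chinchilla bullet implies, the commonly cited rules of thumb can be written down directly. Both approximations below are widely used simplifications (training FLOPs of about six per parameter per token, and about twenty training tokens per parameter), not exact results from this survey.

```latex
% N = parameters, D = training tokens, C = training compute in FLOPs
C \approx 6\,N\,D,
\qquad
D_{\text{opt}} \approx 20\,N

% Substituting gives the compute-optimal sizes for a fixed budget C
C \approx 120\,N^{2}
\;\;\Rightarrow\;\;
N_{\text{opt}} \approx \sqrt{C / 120},
\qquad
D_{\text{opt}} \approx 20\,\sqrt{C / 120}

% Sanity check with Chinchilla itself: N = 7 \times 10^{10}, D = 1.4 \times 10^{12}
D / N = 20,
\qquad
C \approx 6 \times 7 \times 10^{10} \times 1.4 \times 10^{12} \approx 5.9 \times 10^{23}\ \text{FLOPs}
```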

Inference Optimization

  • reasoning-time allocation (cascades, self-consistency)
  • speculative decoding and FlashAttention for speed
  • adaptive early-exit cascades
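
Self-consistency and cascading combine naturally: sample a cheap model several times, and escalate only when the samples disagree. The sketch below assumes hypothetical `cheap_sampler` and `strong_model` callables and an arbitrary 0.8 agreement threshold; the scripted answers are invented so the example is deterministic.

```python
from collections import Counter


def self_consistent_answer(samples):
    """Majority vote over sampled final answers; returns (answer, agreement in [0, 1])."""
    answer, votes = Counter(samples).most_common(1)[0]
    return answer, votes / len(samples)


def cascade(question, cheap_sampler, strong_model, n_samples=5, min_agreement=0.8):
    """Answer with the cheap model when its samples agree; otherwise escalate."""
    samples = [cheap_sampler(question) for _ in range(n_samples)]
    answer, agreement = self_consistent_answer(samples)
    if agreement >= min_agreement:
        return answer, "cheap"
    # Low agreement is treated as a signal that the query is hard.
    return strong_model(question), "strong"


# Scripted stand-ins for model calls (assumptions, for a deterministic demo).
easy_samples = iter(["4", "4", "4", "4", "4"])
hard_samples = iter(["17", "19", "17", "21", "18"])

easy_result = cascade("2 + 2?", lambda q: next(easy_samples), lambda q: "4")
hard_result = cascade("hard question", lambda q: next(hard_samples), lambda q: "19")
print(easy_result, hard_result)
```

The threshold trades cost against accuracy: raising it sends more queries to the expensive model, which is the "reasoning-time" dial in its simplest form.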

Reproducibility

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Hallucination remains a core failure mode; verifiers and retrieval reduce but do not eliminate it.
  • Evaluation blind spots for long-form free answers and explainability; automatic metrics are imperfect.
  • High compute and operational cost for training and for inference-heavy agentic pipelines.
  • Handling multi-sensitivity data safely needs defense-in-depth; solutions are complex to implement.
  • Robust decomposition for open-ended, domain-specific non-factoid questions is still an open research problem.

When Not To Use

  • When very low latency and minimal cost are mandatory (simple models or cached answers suffice).
  • For single-fact or short-span QA where standard IR+smaller models already work well.
  • If you lack capacity to enforce retrieval access controls for sensitive corpora.
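
To make the last point concrete, the minimum form of retrieval access control is a filter applied between the retriever and the prompt builder. The document schema (`allowed_groups` as a set) and the group names below are assumptions for illustration; production systems usually push this check into the index query itself.

```python
def filter_by_access(hits, user_groups):
    """Drop retrieved documents the user is not cleared to read, before prompt assembly."""
    return [doc for doc in hits if doc["allowed_groups"] & user_groups]


# Invented retrieval results with per-document access lists.
hits = [
    {"id": "handbook", "allowed_groups": {"staff", "contractors"}, "text": "Office hours are 9 to 5."},
    {"id": "salaries", "allowed_groups": {"hr"}, "text": "Salary bands by level."},
]
visible = filter_by_access(hits, user_groups={"staff"})
print([doc["id"] for doc in visible])
```

Filtering before the prompt is assembled matters because anything placed in the context window can be surfaced by the model, whatever the instructions say.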

Failure Modes

  • Ungrounded hallucinations despite citations.
  • Retrieval errors or index poisoning leading to wrong evidence.
  • Tool/API misuse or runaway tool loops in agentic systems.
  • Bias amplification from preference labels or labeler selection.
  • Privacy leakage via retrieval or prompt-injection attacks.

Core Entities

Models

  • GPT-family
  • RETRO
  • Atlas
  • DeepSeek-R1
  • InstructGPT
  • PaLM / U-PaLM
  • MoE
  • LLaMA / Llama-3

Metrics

  • Accuracy
  • Recall@k / nDCG / MAP / MRR
  • BERTScore / BARTScore / T5Score
  • FActScore (factual precision)
  • RAGAS (groundedness)
  • Calibration (RMS)
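
Two of the retrieval metrics listed above are simple enough to compute by hand, which helps when sanity-checking a retriever. The sketch below uses the standard definitions of Recall@k and MRR; the document ids and relevance sets are invented for the example.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of relevant documents that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    return len(set(ranked_ids[:k]) & set(relevant_ids)) / len(relevant_ids)


def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (ranked_ids, relevant_ids) pairs, one per query."""
    return sum(reciprocal_rank(ranked, relevant) for ranked, relevant in runs) / len(runs)


# Invented rankings for two queries.
runs = [
    (["d3", "d1", "d7", "d2"], {"d1", "d2"}),
    (["d5", "d9"], {"d5"}),
]
r_at_2 = recall_at_k(*runs[0], k=2)
r_at_4 = recall_at_k(*runs[0], k=4)
mrr = mean_reciprocal_rank(runs)
print(r_at_2, r_at_4, mrr)
```

Recall@k measures whether the evidence reached the prompt at all, while MRR rewards putting it near the top, so the two can diverge for the same retriever.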

Datasets

  • HELM
  • BIG-bench (BBH/BBEH)
  • MMLU-Pro
  • Humanity's Last Exam
  • ELI5
  • SQuAD
  • MS MARCO
  • LongBench
  • HotpotQA
  • CRAG

Benchmarks

  • HELM Capabilities
  • BBH / BBEH
  • Humanity's Last Exam (HLE)
  • MMLU-Pro
  • Arena-Hard / Chatbot Arena Elo

Context Entities

Models

  • WebGPT
  • GopherCite
  • MetaGPT
  • Deep Researcher
  • AlphaCode

Metrics

  • ROUGE / BLEU / METEOR
  • MAUVE
  • Human preference (Elo/Arena)
  • Factuality/grounding judge-based scores

Datasets

  • LongBench v2
  • CRAG
  • MoreHopQA
  • BookSum
  • BioASQ
  • NarrativeQA

Benchmarks

  • KILT
  • SWE-bench
  • TrustLLM