Overview
The survey synthesizes practical, field-proven patterns (RAG, agents, and verifiers) that are usable in production but require careful engineering for cost, privacy, and evaluation.
Citations: 6
Evidence Strength: 0.75
Confidence: 0.84
Risk Signals: 13
Trust Signals
Findings with numeric evidence: 2/4
Findings with evidence refs: 4/4
Results with explicit delta: 1/2
Reproducibility
Status: No open assets linked
Open source: Partial
At A Glance
Cost impact: 80%
Production readiness: 70%
Novelty: 50%
Why It Matters For Business
For real-world complex Q&A, LLMs must be combined with retrieval, tools, verifiers, and human feedback to achieve accuracy, auditability, and privacy. This combination reduces risk and improves trust, but it raises cost and latency.
Summary TLDR
This is a practical survey of methods to make large language models (LLMs) answer complex, non-factoid questions. It catalogs required skills (decomposition, retrieval, reasoning, memory), evaluation suites and metrics (HELM, BBH, MMLU-Pro, FActScore/RAGAS), training and preference objectives (SFT, RLHF, DPO/ORPO/KTO, RFT), and a taxonomy of hybrid architectures (RAG, tools/code interpreters, verifier loops, multi-agent controllers). The paper concludes the current best practice for robust complex QA is an agentic, retrieval-grounded, verifier-guided pipeline that allocates extra inference “reasoning-time” for hard queries and reports evidence/attribution alongside answers. It also flags big
Problem Statement
Off-the-shelf LLMs are excellent at many single-step tasks but fail on complex, multi-step, domain-specific questions that need decomposition, multi-source grounding, deep reasoning, explainability, sensitive-data controls, and human alignment. The field lacks stable end-to-end benchmarks, standardized metrics for long-form faithfulness, and deployment patterns that balance cost, privacy, and accuracy.
Main Contribution
Systematic review and taxonomy of skills, tasks, and limits for complex QA with LLMs.
Survey and critique of evaluation metrics, living leaderboards, and datasets for complex QA.
Key Findings
Best-practice stacks now couple agentic controllers, retrieval-grounding, and verifier/PRM loops to answer complex questions.
Preference tuning (SFT plus human feedback) materially improves answer quality and alignment; the reported InstructGPT example (SFT+RLHF) cites an 85% human preference over pre-alignment GPT-3.
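The verifier-guided pattern in the first finding, with extra inference budget spent only on hard queries, can be sketched roughly as follows. This is a minimal illustration and not the survey's implementation: `answer_with_verifier`, `toy_generate`, `toy_verify`, the 0.8 threshold, and the sample budgets are all hypothetical names and values chosen for the example. In a real system `generate` would call an LLM and `verify` would call a verifier or process reward model.

```python
from typing import Callable, Tuple


def answer_with_verifier(
    question: str,
    generate: Callable[[str, int], str],   # hypothetical: (question, attempt) -> answer
    verify: Callable[[str, str], float],   # hypothetical: (question, answer) -> score in [0, 1]
    base_samples: int = 1,
    max_samples: int = 4,
    accept_threshold: float = 0.8,         # assumed cutoff, tune per application
) -> Tuple[str, float, int]:
    """Sample answers until the verifier accepts one or the budget runs out.

    Returns (best_answer, best_score, samples_used). Easy questions stop
    after `base_samples`; hard ones consume up to `max_samples`.
    """
    best_answer, best_score, used = "", float("-inf"), 0
    for attempt in range(max_samples):
        candidate = generate(question, attempt)
        score = verify(question, candidate)
        used += 1
        if score > best_score:
            best_answer, best_score = candidate, score
        # Stop once the minimum budget is spent and the verifier is satisfied.
        if used >= base_samples and best_score >= accept_threshold:
            break
    return best_answer, best_score, used


# Toy stand-ins so the sketch runs without any model.
def toy_generate(question: str, attempt: int) -> str:
    drafts = [
        "I am not sure.",
        "Paris is the capital.",
        "The capital of France is Paris [doc1].",
    ]
    return drafts[min(attempt, len(drafts) - 1)]


def toy_verify(question: str, answer: str) -> float:
    score = 0.0
    if "Paris" in answer:
        score += 0.5   # answer content looks right
    if "[doc" in answer:
        score += 0.5   # answer carries an evidence citation
    return score


result = answer_with_verifier(
    "What is the capital of France?", toy_generate, toy_verify
)
print(result)
```

The loop keeps the best-scoring candidate rather than the last one, so a late, worse sample cannot displace an earlier good answer.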
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Human preference for Instruct-style alignment | 85% preferred (reported example for InstructGPT SFT+RLHF) | GPT-3 pre-alignment | — | — | Instruct-style RLHF example reported higher human preference for aligned outputs | §5.6 (citing [168]) |
| BBH sample task performance (logic / reasoning) | Top open LLM scores vary by task; example: logical_deduction_three_objects 98.4% (top open LLM) vs human average 73.6% | Human average 73.6% | +24.8 pp (top open LLM vs human avg) on that task | BIG-bench Hard (BBH) example in Table 1 | Table 1 BBH comparisons of human vs top open LLM scores across tasks | §3.3 (Table 1) |
What To Try In 7 Days
Prototype a RAG pipeline: index a small domain corpus + dense retriever + LLM prompt with retrieved snippets and cited evidence.
Add a simple verifier (few-shot LLM-as-judge) to rerank two candidate answers and inspect disagreements.
Replace full fine-tuning with QLoRA or LoRA adapters on a 4-bit quantized model for a quick domain adaptation proof-of-concept.
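As a starting point for the first two items, the sketch below wires retrieval, a cited prompt, and a toy reranker together using only the Python standard library. It rests on several assumptions: bag-of-words cosine similarity stands in for a dense retriever, `pick_grounded` is a crude citation-counting stand-in for a few-shot LLM-as-judge, and the corpus, document ids, and all function names are invented for illustration. The LLM call itself is omitted; the two candidate answers are hard-coded.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple


def tokenize(text: str) -> List[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query: str, corpus: Dict[str, str], k: int = 2) -> List[Tuple[str, float]]:
    """Rank documents by bag-of-words cosine (stand-in for a dense retriever)."""
    q = Counter(tokenize(query))
    scored = [(doc_id, cosine(q, Counter(tokenize(text)))) for doc_id, text in corpus.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


def build_prompt(query: str, corpus: Dict[str, str], hits: List[Tuple[str, float]]) -> str:
    """Assemble an LLM prompt that exposes snippet ids so answers can cite them."""
    snippets = "\n".join(f"[{doc_id}] {corpus[doc_id]}" for doc_id, _ in hits)
    return (
        "Answer using only the snippets below and cite snippet ids in brackets.\n"
        f"{snippets}\n\nQuestion: {query}\nAnswer:"
    )


def pick_grounded(candidates: List[str], hits: List[Tuple[str, float]]) -> str:
    """Toy verifier: prefer the candidate that cites the most retrieved ids."""
    retrieved = {doc_id for doc_id, _ in hits}

    def cited(answer: str) -> int:
        return len(retrieved & set(re.findall(r"\[(\w+)\]", answer)))

    return max(candidates, key=cited)


# Hypothetical three-document corpus.
corpus = {
    "d1": "LoRA adds low-rank adapter matrices to a frozen model.",
    "d2": "Dense retrievers embed queries and passages into one vector space.",
    "d3": "Verifiers score candidate answers before one is returned.",
}
query = "How do dense retrievers embed queries?"
hits = retrieve(query, corpus, k=2)
prompt = build_prompt(query, corpus, hits)

# In a real pipeline these would be two LLM completions for `prompt`.
candidates = [
    "Dense retrievers use vectors.",
    "They embed queries and passages into one vector space [d2].",
]
chosen = pick_grounded(candidates, hits)
print(prompt)
print(chosen)
```

When the two candidates disagree, logging both alongside the retrieved snippets gives the disagreement cases the second item suggests inspecting.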
Agent Features
Memory
Planning
Tool Use
Frameworks
Is Agentic
Yes
Architectures
Collaboration
Optimization Features
Token Efficiency
Infra Optimization
Model Optimization
System Optimization
Training Optimization
Inference Optimization
Risks & Boundaries
Limitations
Hallucination remains a core failure mode; verifiers and retrieval reduce but do not eliminate it.
Evaluation has blind spots for long-form free-text answers and explainability; automatic metrics are imperfect.
When Not To Use
When very low latency and minimal cost are mandatory (simple models or cached answers suffice).
For single-fact or short-span QA where standard IR+smaller models already work well.
Failure Modes
Ungrounded hallucinations despite citations.
Retrieval errors or index poisoning leading to wrong evidence.

