Overview
Production Readiness
0.7
Novelty Score
0.5
Cost Impact Score
0.8
Citation Count
6
Why It Matters For Business
For real-world complex Q&A, LLMs must be combined with retrieval, tools, verifiers, and human feedback to achieve accuracy, auditability, and privacy. This combination reduces risk and improves trust, but it raises cost and latency.
Summary TLDR
This is a practical survey of methods for making large language models (LLMs) answer complex, non-factoid questions. It catalogs required skills (decomposition, retrieval, reasoning, memory), evaluation suites and metrics (HELM, BBH, MMLU-Pro, FActScore/RAGAS), training and preference objectives (SFT, RLHF, DPO/ORPO/KTO, RFT), and a taxonomy of hybrid architectures (RAG, tools/code interpreters, verifier loops, multi-agent controllers). The paper concludes that the current best practice for robust complex QA is an agentic, retrieval-grounded, verifier-guided pipeline that allocates extra inference “reasoning-time” to hard queries and reports evidence/attribution alongside answers. It also flags big
Problem Statement
Off-the-shelf LLMs are excellent at many single-step tasks but fail on complex, multi-step, domain-specific questions that need decomposition, multi-source grounding, deep reasoning, explainability, sensitive-data controls, and human alignment. The field lacks stable end-to-end benchmarks, standardized metrics for long-form faithfulness, and deployment patterns that balance cost, privacy, and accuracy.
Main Contribution
Systematic review and taxonomy of skills, tasks, and limits for complex QA with LLMs.
Survey and critique of evaluation metrics, living leaderboards, and datasets for complex QA.
Categorization of hybrid architectural patterns that augment LLMs (retrieval, tools, verifiers, agents).
Practical recipes across training (pre-, mid-, and post-training), prompting, PEFT, and inference-time strategies (reasoning-time allocation, cascades).
Identification of persistent research gaps: hallucination, data multi-sensitivity, cost, and robust decomposition.
Key Findings
Best-practice stacks now couple agentic controllers, retrieval-grounding, and verifier/PRM loops to answer complex questions.
Preference tuning (SFT + human feedback) materially improves answer quality and alignment; Instruct-style RLHF led to strong human preference gains in practice.
Retrieval-aware mid-training (Self-RAG) and process reward models (PRMs) increase grounding and reduce hallucination in RAG-style systems.
Compute-aware training tricks can cut large pre-training costs: mixture-of-denoisers (U-PaLM) matched PaLM performance at ~half compute, saving ~4.4M TPUv4 hours (reported example).
Results
Human preference for Instruct-style alignment
BBH sample task performance (logic / reasoning)
Who Should Care
What To Try In 7 Days
Prototype a RAG pipeline: index a small domain corpus + dense retriever + LLM prompt with retrieved snippets and cited evidence.
Add a simple verifier (few-shot LLM-as-judge) to rerank two candidate answers and inspect disagreements.
Replace full fine-tuning with QLoRA or LoRA adapters on a 4-bit quantized model for a quick domain adaptation proof-of-concept.
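The first two items above can be sketched with the standard library alone. This is a minimal illustration under loud assumptions: the three-document `CORPUS` is made up, bag-of-words cosine similarity stands in for a dense retriever, and `overlap_judge` stands in for a few-shot LLM-as-judge. None of these names come from the paper.

```python
import math
import re
from collections import Counter

# Hypothetical toy corpus; a real prototype would index domain documents.
CORPUS = {
    "doc1": "LoRA adds low-rank adapter matrices to frozen model weights.",
    "doc2": "Dense retrievers embed queries and passages into one vector space.",
    "doc3": "A verifier model scores candidate answers against retrieved evidence.",
}

def tokenize(text):
    """Lowercase and split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())

def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, k=2):
    """Return up to k document ids with non-zero similarity, best first.
    Assumption: bag-of-words scoring replaces a dense embedding model."""
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(text))), doc_id) for doc_id, text in CORPUS.items()]
    scored.sort(reverse=True)
    return [doc_id for score, doc_id in scored[:k] if score > 0]

def build_prompt(query, doc_ids):
    """Assemble a prompt that shows each snippet with a citable id."""
    evidence = "\n".join(f"[{d}] {CORPUS[d]}" for d in doc_ids)
    return (
        "Answer using only the evidence below and cite ids in brackets.\n"
        f"{evidence}\nQuestion: {query}\nAnswer:"
    )

def overlap_judge(answer, doc_ids):
    """Stand-in for an LLM judge: fraction of answer tokens present in the evidence."""
    evidence_tokens = set()
    for d in doc_ids:
        evidence_tokens.update(tokenize(CORPUS[d]))
    tokens = tokenize(answer)
    return sum(t in evidence_tokens for t in tokens) / len(tokens) if tokens else 0.0

def rerank(candidates, doc_ids, judge=overlap_judge):
    """Pick the candidate answer that the judge scores highest."""
    return max(candidates, key=lambda ans: judge(ans, doc_ids))
```

Calling `retrieve(question)`, passing the ids to `build_prompt`, and then running `rerank` over two model outputs gives a full loop to inspect. Swapping `overlap_judge` for a real judge call only requires a function with the same two arguments.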
Agent Features
Memory
- short-term KV memory (session)
- long-term external memory / index
- episodic memory for reflection
Planning
- task decomposition & plan generation
- adaptive compute allocation (reasoning-time)
- hierarchical task routing
Tool Use
- function calling / API orchestration
- code interpreters for computation
- search and retrieval tools
Frameworks
- MetaGPT-style workflows
- DSPy for compiling pipelines
- AgentLab / Deep Researcher patterns
Is Agentic
true
Architectures
- planner-worker multi-agent
- hierarchical agents
- controller (plan-act-observe-reflect) + tools
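A controller of the plan-act-observe-reflect kind can be sketched as a bounded loop over a tool registry. This is an illustrative sketch, not the survey's implementation: `scripted_policy` is a hypothetical stand-in for the LLM planner, the two tools are toys, and `max_steps` is an assumed guard against runaway tool loops.

```python
from typing import Callable, Dict, List, Tuple

Step = Tuple[str, str, str]  # (action, argument, observation)

def calculator(expression: str) -> str:
    """Toy tool: sums integers separated by '+'. Hypothetical, not a real API."""
    return str(sum(int(part) for part in expression.split("+")))

def echo(text: str) -> str:
    """Toy tool: returns its input unchanged."""
    return text

TOOLS: Dict[str, Callable[[str], str]] = {"calculator": calculator, "echo": echo}

def scripted_policy(question: str, history: List[Step]) -> Tuple[str, str]:
    """Stand-in for an LLM planner: call the calculator once, then finish
    with the last observation. A real policy would read the question."""
    if not history:
        return ("calculator", "2+3")
    return ("finish", history[-1][2])

def run_controller(question: str, policy, max_steps: int = 5):
    """Run plan -> act -> observe until the policy finishes or the budget ends."""
    history: List[Step] = []
    for _ in range(max_steps):
        action, arg = policy(question, history)           # plan
        if action == "finish":
            return arg, history
        if action not in TOOLS:
            observation = f"error: unknown tool {action}"  # fed back for reflection
        else:
            observation = TOOLS[action](arg)               # act
        history.append((action, arg, observation))         # observe
    return "stopped: step budget exhausted", history
```

The step cap is the design choice worth noting: a policy that never emits `finish` still terminates, and the returned history makes the trace auditable.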
Collaboration
- multi-agent coordination
- role-based workers (planner/researcher/reviewer)
Optimization Features
Token Efficiency
- train-short/test-long (ALiBi) for long contexts
- context compression and retrieval gating
Infra Optimization
- sparse MoE to reduce FLOPs per request
- GPU/TPU memory-aware training schedules (LoftQ)
Model Optimization
- MoE
- weight pruning (SparseGPT and related sparsity methods)
- LoRA
System Optimization
- LoRA
- federated/offsite tuning to protect private data
Training Optimization
- mixture-of-denoisers (U-PaLM style)
- mid-training for retrieval/tool-awareness
- compute-optimal token/model scaling (Chinchilla principles)
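For the compute-optimal scaling item, a back-of-the-envelope calculation may help. The sketch below relies on two commonly quoted approximations that are assumptions here, not figures taken from this survey: training compute C ≈ 6·N·D FLOPs for N parameters and D tokens, and a rule of thumb of roughly 20 training tokens per parameter.

```python
import math

# Assumed rule of thumb, often attributed to the Chinchilla analysis.
TOKENS_PER_PARAM = 20.0
# Assumed approximation: about 6 FLOPs per parameter per training token.
FLOPS_PER_PARAM_TOKEN = 6.0

def compute_optimal_split(compute_flops: float):
    """Split a FLOP budget into (parameters, tokens) under the two assumptions.
    From C = 6 * N * D and D = 20 * N it follows that N = sqrt(C / 120)."""
    n_params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    n_tokens = TOKENS_PER_PARAM * n_params
    return n_params, n_tokens

def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute for a dense model."""
    return FLOPS_PER_PARAM_TOKEN * n_params * n_tokens
```

Under these assumptions, a budget equal to `training_flops(70e9, 1.4e12)` splits back into about 70 billion parameters and 1.4 trillion tokens. Real budgets depend on architecture and data quality, so treat the output as a starting estimate only.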
Inference Optimization
- reasoning-time allocation (cascades, self-consistency)
- speculative decoding and FlashAttention for speed
- adaptive early-exit cascades
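Reasoning-time allocation with self-consistency and a cascade can be sketched as follows. This is a hedged illustration: `cheap_fn` and `expensive_fn` are hypothetical callables standing in for a small and a large model, and the agreement threshold of 0.8 is an arbitrary assumption.

```python
from collections import Counter

def self_consistency(sample_fn, question, n_samples=5):
    """Sample several answers and return the majority answer with its vote share.
    sample_fn(question, i) is an assumed interface; i identifies the sample."""
    answers = [sample_fn(question, i) for i in range(n_samples)]
    answer, votes = Counter(answers).most_common(1)[0]
    return answer, votes / n_samples

def cascade(question, cheap_fn, expensive_fn, threshold=0.8, n_samples=5):
    """Answer with the cheap model when its samples agree; otherwise escalate.
    Returns (answer, route) so the routing decision can be logged."""
    answer, agreement = self_consistency(cheap_fn, question, n_samples)
    if agreement >= threshold:
        return answer, "cheap"
    return expensive_fn(question), "expensive"
```

The vote share acts as a rough difficulty signal: easy queries exit early, and only low-agreement queries pay for the larger model. Agreement is not the same as correctness, so a verifier could be applied to the escalated answer as well.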
Reproducibility
Open Source Status
- partial
Risks & Boundaries
Limitations
- Hallucination remains a core failure mode; verifiers and retrieval reduce but do not eliminate it.
- Evaluation blind spots for long-form free answers and explainability; automatic metrics are imperfect.
- High compute and operational cost for training and for inference-heavy agentic pipelines.
- Handling multi-sensitivity data safely needs defense-in-depth; solutions are complex to implement.
- Robust decomposition for open-ended, domain-specific non-factoid questions is still an open research problem.
When Not To Use
- When very low latency and minimal cost are mandatory (simple models or cached answers suffice).
- For single-fact or short-span QA where standard IR+smaller models already work well.
- If you lack capacity to enforce retrieval access controls for sensitive corpora.
Failure Modes
- Ungrounded hallucinations despite citations.
- Retrieval errors or index poisoning leading to wrong evidence.
- Tool/API misuse or runaway tool loops in agentic systems.
- Bias amplification from preference labels or labeler selection.
- Privacy leakage via retrieval or prompt-injection attacks.
Core Entities
Models
- GPT-family
- RETRO
- Atlas
- DeepSeek-R1
- InstructGPT
- PaLM / U-PaLM
- MoE
- LLaMA / Llama-3
Metrics
- Accuracy
- Recall@k / nDCG / MAP / MRR
- BERTScore / BARTScore / T5Score
- FActScore (factual precision)
- RAGAS (groundedness)
- Calibration (RMS calibration error)
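Several of the metrics listed above are short enough to compute by hand. The sketch below assumes binary relevance labels for the ranking metrics and equal-width confidence bins for the calibration error; the survey does not specify either choice, so both are assumptions.

```python
import math

def recall_at_k(ranked, relevant, k):
    """Fraction of relevant documents that appear in the top k results."""
    hits = len(set(ranked[:k]) & set(relevant))
    return hits / len(relevant) if relevant else 0.0

def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant document; MRR is the mean over queries."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked, relevant, k):
    """nDCG with binary gains and a log2 position discount."""
    dcg = sum(1.0 / math.log2(i + 2) for i, doc in enumerate(ranked[:k]) if doc in relevant)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(k, len(relevant))))
    return dcg / ideal if ideal else 0.0

def rms_calibration_error(confidences, correct, n_bins=10):
    """Root of the bin-weighted mean squared gap between accuracy and confidence.
    Assumption: equal-width bins over [0, 1]."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    total = len(confidences)
    squared = 0.0
    for members in bins:
        if members:
            mean_conf = sum(c for c, _ in members) / len(members)
            accuracy = sum(1 for _, ok in members if ok) / len(members)
            squared += len(members) / total * (accuracy - mean_conf) ** 2
    return math.sqrt(squared)
```

For example, with the ranking `["d3", "d1", "d7", "d2"]` and relevant set `{"d1", "d2"}`, `recall_at_k(..., k=2)` and `reciprocal_rank(...)` both reflect that the first relevant document sits at rank 2.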
Datasets
- HELM
- BIG-bench (BBH/BBEH)
- MMLU-Pro
- Humanity's Last Exam
- ELI5
- SQuAD
- MS MARCO
- LongBench
- HotpotQA
- CRAG
Benchmarks
- HELM Capabilities
- BBH / BBEH
- Humanity's Last Exam (HLE)
- MMLU-Pro
- Arena-Hard / Chatbot Arena Elo
Context Entities
Models
- WebGPT
- GopherCite
- MetaGPT
- Deep Researcher
- AlphaCode
Metrics
- ROUGE / BLEU / METEOR
- MAUVE
- Human preference (Elo/Arena)
- Factuality/grounding judge-based scores
Datasets
- LongBench v2
- CRAG
- MoreHopQA
- BookSum
- BioASQ
- NarrativeQA
Benchmarks
- KILT
- SWE-bench
- TrustLLM

