Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.5
Citation Count
1
Why It Matters For Business
HGOT reduces factual errors by structuring multi-step retrieval and scoring the evidence it retrieves; this can raise factual accuracy in fact-verification and QA systems with modest engineering and tuning effort.
Summary TLDR
HGOT builds a multi-level dependency graph of sub-questions (a hierarchical "graph of thoughts") and uses retrieval plus citation-aware scoring to pick answers. It weights majority voting by the quality of each generated rationale (measured via citation precision/recall) and re-ranks retrieved passages by citation frequency, thought quality, and retrieval rank. On public factual QA benchmarks HGOT improves factual accuracy (notably FEVER) and matches top baselines on Open-SQuAD and HotPotQA when tuned.
Problem Statement
LLMs hallucinate when answering fact-heavy queries. Simple retrieval-augmented prompts can miss interdependent facts or rank weak evidence. The paper asks: can a structured, multi-level plan plus citation-aware scoring reduce hallucinations and boost factual accuracy in retrieval-augmented in-context learning?
Main Contribution
HGOT: a method that builds a hierarchical dependency graph of sub-questions using LLM planning and searches those sub-questions recursively.
Thought-quality weighting: modifies self-consistency majority voting to weight answers by citation precision and recall of their rationales.
Retrieval scoring: a passage score that combines citation frequency, thought citation quality, self-consistency confidence, and retrieval rank.
Extensive evaluation stratified by question length on FEVER, Open-SQuAD, and HotPotQA, including ablations and hyperparameter search.
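The retrieval scoring contribution above can be illustrated with a minimal sketch. This summary does not give the paper's exact formula, so the linear combination, the weights (`w_cite`, `w_quality`, `w_rank`), the reciprocal-rank term, and the `Passage` fields are all assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Passage:
    text: str
    retrieval_rank: int     # 1 = first search result
    citation_count: int     # sampled rationales that cited this passage
    thought_quality: float  # mean citation quality (0..1) of those rationales


def passage_score(p, num_rationales, confidence,
                  w_cite=1.0, w_quality=1.0, w_rank=1.0):
    """Hypothetical score: not the paper's formula, only the same four signals."""
    cite_freq = p.citation_count / num_rationales if num_rationales else 0.0
    rank_term = 1.0 / p.retrieval_rank  # higher-ranked results get a larger term
    evidence = w_cite * cite_freq + w_quality * p.thought_quality
    # Scale citation-based evidence by the self-consistency confidence.
    return confidence * evidence + w_rank * rank_term


def rerank(passages, num_rationales, confidence, top_k=3):
    """Sort passages by the hypothetical score and keep the top_k."""
    ranked = sorted(
        passages,
        key=lambda p: passage_score(p, num_rationales, confidence),
        reverse=True,
    )
    return ranked[:top_k]


passages = [
    Passage("top search hit, never cited", 1, 0, 0.0),
    Passage("lower hit, cited by 4 of 5 rationales", 3, 4, 0.9),
]
best = rerank(passages, num_rationales=5, confidence=0.8, top_k=1)
```

With these illustrative weights, the frequently cited passage outranks the uncited top search hit, which is the behavior the contribution describes.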
Key Findings
HGOT increases FEVER exact-match (EM) accuracy versus baselines.
HGOT matches or slightly surpasses leading baselines on Open-SQuAD and HotPotQA when tuned.
Incorporating thought and retrieval quality improves performance in ablation tests.
Results
FEVER EM (Overall)
Open-SQuAD EM (Overall)
HotPotQA EM (Overall)
Open-SQuAD EM (medium, abl.)
Who Should Care
What To Try In 7 Days
Run HGOT's Probe+Plan+Infer loop on a small FEVER-like subset to compare EM gains versus your current retrieval pipeline.
Weight your self-consistency voting by citation precision and recall, and measure how the top-K results change.
Swap SerpApi for your production search and compare passage re-ranking with the paper's weighted citation score.
Agent Features
Memory
- retrieval-based context collected per sub-query
Planning
- LLM emergent planning to split queries
- divide-and-conquer sub-query plans
- topological sorting of dependency graph
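The planning steps above can be sketched as a dependency-ordered traversal. The source lists NetworkX for graph operations; this sketch swaps in Python's standard-library `graphlib` to stay dependency-free. The `plan` dictionary and `answer_fn` callback are hypothetical stand-ins for the LLM-generated plan and the retrieval-plus-reasoning step.

```python
from graphlib import TopologicalSorter

# Hypothetical plan: each sub-question maps to the sub-questions it depends on.
plan = {
    "q1": set(),
    "q2": set(),
    "q3": {"q1", "q2"},
}


def answer_in_dependency_order(plan, answer_fn):
    """Answer sub-questions so that every dependency is resolved first."""
    answers = {}
    # static_order() yields each node only after all of its predecessors.
    for q in TopologicalSorter(plan).static_order():
        deps = {d: answers[d] for d in plan.get(q, ())}
        answers[q] = answer_fn(q, deps)
    return answers


visited = []


def toy_answer(question, deps):
    # Stand-in for retrieval + LLM inference on one sub-question.
    visited.append(question)
    return f"answer({question})"


answers = answer_in_dependency_order(plan, toy_answer)
```

`TopologicalSorter` raises `graphlib.CycleError` if the plan contains a cycle, which is one cheap way to catch a malformed dependency graph before spending retrieval calls on it.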
Tool Use
- search engine retrieval (SerpApi/Google)
- NLI model for citation detection
- LM for planning and reasoning
Frameworks
- DSP (framework used)
- NetworkX for graph operations
Is Agentic
true
Architectures
- hierarchical graph (multi-layer DAG of sub-queries)
- dependency graph with topological traversal
Collaboration
- weighted self-consistency majority voting
- demonstration selection (balanced sampling / KNN)
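The weighted self-consistency vote listed above can be sketched as follows. The mixing parameter `alpha` and the way precision and recall are combined into a single weight are assumptions; the summary states only that votes are weighted by citation precision and recall.

```python
from collections import defaultdict


def weighted_vote(candidates, alpha=0.5):
    """candidates: list of (answer, citation_precision, citation_recall).

    Returns the winning answer and its share of the total weight.
    The weight alpha * precision + (1 - alpha) * recall is an assumed form.
    """
    if not candidates:
        raise ValueError("no candidate answers to vote on")
    totals = defaultdict(float)
    for answer, precision, recall in candidates:
        totals[answer] += alpha * precision + (1 - alpha) * recall
    best = max(totals, key=totals.get)
    total_weight = sum(totals.values())
    confidence = totals[best] / total_weight if total_weight else 0.0
    return best, confidence


samples = [
    ("A", 0.2, 0.2),  # two poorly cited rationales agree on "A"
    ("A", 0.1, 0.3),
    ("B", 0.9, 0.9),  # one well-cited rationale says "B"
]
winner, confidence = weighted_vote(samples)
```

An unweighted majority vote would pick "A" here; weighting by citation quality lets the single well-supported rationale win instead.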
Optimization Features
Token Efficiency
- rewrite sub-queries to include prior answers to limit context growth
Inference Optimization
- select top-K passages for final prompt
- stop planning early when steps are similar
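Two of the optimizations above, rewriting sub-queries with prior answers and stopping planning on similar steps, can be sketched together. The placeholder syntax (`{q1}`), the use of `difflib.SequenceMatcher` as the similarity measure, the comparison against earlier steps, and the `0.9` threshold are all assumptions; the summary does not specify any of them.

```python
from difflib import SequenceMatcher


def rewrite_subquery(template, prior_answers):
    """Inline earlier answers so the prompt need not carry their full context."""
    return template.format_map(prior_answers)


def is_redundant_step(new_step, previous_steps, threshold=0.9):
    """Return True if new_step closely matches an earlier step (assumed criterion)."""
    return any(
        SequenceMatcher(None, new_step.lower(), old.lower()).ratio() >= threshold
        for old in previous_steps
    )


rewritten = rewrite_subquery("Where was {q1} born?", {"q1": "Marie Curie"})
same = is_redundant_step("Who directed Inception?", ["who directed inception?"])
different = is_redundant_step("Where was Marie Curie born?", ["Who directed Inception?"])
```

A production system might prefer embedding similarity over character-level matching; the string matcher is used here only to keep the sketch self-contained.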
Reproducibility
Code Urls
Data Urls
- FEVER
- Open-SQuAD
- HotPotQA
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Evaluated only with ChatGPT (gpt-3.5-turbo-1106); other LMs such as Gemini or Llama 2 were not tested.
- Retrieval was limited to SerpApi Google Search; results may change with different search engines or domain sources.
- Additional latency and API costs due to recursive planning, multiple retrievals, and NLI passes.
When Not To Use
- Low-latency or low-cost services where multiple retrieval steps and LM calls are infeasible.
- Tasks that do not require external factual grounding or where a single passage suffices.
- Domains with no reliable external search index or NLI model for citation detection.
Failure Modes
- Incorrect dependency graph from planning leads to wrong sub-queries and cascading errors.
- Citation detection/NLI errors can mis-weight good evidence and promote wrong answers.
- Over-reliance on web search ranking may surface high-ranked but irrelevant passages.
Core Entities
Models
- ChatGPT (gpt-3.5-turbo-1106)
- text-davinci-002 (for ReAct baseline)
- NLI model (TRUE-style Honovich et al. 2022)
Metrics
- EM
- F1
- citation precision
- citation recall
- self-consistency confidence
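Citation precision and recall from the list above can be computed roughly as in the sketch below. It assumes simplified definitions: recall is the share of statements supported by at least one cited passage, and precision is the share of citations that support their statement. The `entails` callable is a hypothetical stand-in for the NLI model; the toy substring check exists only to make the example runnable.

```python
def citation_precision_recall(statements, entails):
    """statements: list of (statement_text, [cited_passage_text, ...]).

    entails(premise, hypothesis) -> bool is a pluggable NLI check (assumed API).
    """
    total_citations = 0
    supporting_citations = 0
    supported_statements = 0
    for text, cited in statements:
        hits = [bool(entails(passage, text)) for passage in cited]
        total_citations += len(hits)
        supporting_citations += sum(hits)
        supported_statements += any(hits)
    precision = supporting_citations / total_citations if total_citations else 0.0
    recall = supported_statements / len(statements) if statements else 0.0
    return precision, recall


def toy_entails(premise, hypothesis):
    # Toy stand-in for an NLI model: substring match, not real entailment.
    return hypothesis.lower() in premise.lower()


rationale = [
    ("Paris is in France",
     ["Paris is in France and is its capital.", "Berlin is in Germany."]),
    ("Rome is in Spain", ["Rome is in Italy."]),
]
precision, recall = citation_precision_recall(rationale, toy_entails)
```

In this toy rationale, one of three citations supports its statement and one of two statements is supported, so precision is 1/3 and recall is 1/2.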
Datasets
- FEVER
- Open-SQuAD
- HotPotQA
Benchmarks
- Exact Match (EM)
- F1 score
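For reference, EM and token-level F1 are commonly computed as in the sketch below. The SQuAD-style normalization (lowercasing, stripping punctuation and English articles) is an assumption; the summary does not say which normalization the paper applies.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, gold):
    return normalize(prediction) == normalize(gold)


def f1_score(prediction, gold):
    """Token-overlap F1 between normalized prediction and gold answer."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


em = exact_match("The Eiffel Tower.", "eiffel tower")
f1 = f1_score("Eiffel Tower in Paris", "the Eiffel Tower")
```

EM rewards only a full normalized match, while F1 gives partial credit for overlapping tokens, which is why the two are usually reported together.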

