Overview
HGOT shows practical gains on fact-heavy tasks by combining LLM planning with citation-aware voting and passage re-ranking, but it depends on an external search service, needs tuning, and adds LLM call costs.
Citations: 1
Evidence Strength: 0.80
Confidence: 0.80
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 3/3
Findings with evidence refs: 3/3
Results with explicit delta: 4/4
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 50%
Production readiness: 60%
Novelty: 60%
Why It Matters For Business
HGOT reduces factual errors by structuring multi-step retrieval and scoring the evidence behind each answer. This can raise factual accuracy in verification and QA systems with modest engineering and tuning.
Who Should Care
Summary TLDR
HGOT builds a multi-level dependency graph of sub-questions (a hierarchical "graph of thoughts") and uses retrieval plus citation-aware scoring to pick answers. It weights majority voting by the quality of each generated rationale (measured via citation precision/recall) and re-ranks retrieved passages by citation frequency, thought quality, and retrieval rank. On public factual QA benchmarks HGOT improves factual accuracy (notably FEVER) and matches top baselines on Open-SQuAD and HotPotQA when tuned.
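The passage re-ranking step described above can be sketched as follows. This is a minimal illustration, not the paper's scoring formula: the `Passage` fields, the weights `w_cite`, `w_quality`, `w_rank`, and the linear combination are all assumptions chosen to show how the three signals (citation frequency, thought quality, retrieval rank) might be blended.

```python
from dataclasses import dataclass


@dataclass
class Passage:
    text: str
    retrieval_rank: int          # 1 = top hit from the search engine
    citation_count: int          # how many generated rationales cited this passage
    mean_thought_quality: float  # average quality (0..1) of the citing rationales


def rerank(passages, w_cite=1.0, w_quality=1.0, w_rank=1.0):
    """Sort passages by a hypothetical weighted score (assumed form, not the paper's)."""
    # Normalize citation counts to 0..1; fall back to 1 to avoid dividing by zero.
    max_cites = max((p.citation_count for p in passages), default=0) or 1

    def score(p):
        cite_term = p.citation_count / max_cites
        rank_term = 1.0 / p.retrieval_rank  # reciprocal rank favors early hits
        return w_cite * cite_term + w_quality * p.mean_thought_quality + w_rank * rank_term

    return sorted(passages, key=score, reverse=True)
```

With equal weights, a passage that ranks second in search but is cited often by high-quality rationales can overtake the top search hit. The weights would need tuning per dataset.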
Problem Statement
LLMs hallucinate when answering fact-heavy queries. Simple retrieval-augmented prompts can miss interdependent facts or rank weak evidence. The paper asks: can a structured, multi-level plan plus citation-aware scoring reduce hallucinations and boost factual accuracy in retrieval-augmented in-context learning?
Main Contribution
HGOT: a method that builds a hierarchical dependency graph of sub-questions using LLM planning and searches those sub-questions recursively.
Thought-quality weighting: modifies self-consistency majority voting to weight answers by citation precision and recall of their rationales.
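The thought-quality weighting can be illustrated with a small sketch. It assumes each sampled answer comes with the passage ids its rationale cites and a set of passage ids judged to support it (for example, by an entailment check). The mixing parameter `alpha` and the linear blend of precision and recall are illustrative assumptions, not the paper's exact weighting.

```python
from collections import defaultdict


def citation_precision_recall(cited, supporting):
    """cited: passage ids the rationale cites; supporting: ids judged to support it."""
    cited, supporting = set(cited), set(supporting)
    hits = len(cited & supporting)
    precision = hits / len(cited) if cited else 0.0
    recall = hits / len(supporting) if supporting else 0.0
    return precision, recall


def weighted_vote(samples, alpha=0.5):
    """samples: list of (answer, cited_ids, supporting_ids) tuples.

    Each sample adds a quality weight (assumed blend of precision and recall)
    to its answer instead of the flat +1 of plain self-consistency voting.
    """
    totals = defaultdict(float)
    for answer, cited, supporting in samples:
        precision, recall = citation_precision_recall(cited, supporting)
        totals[answer] += alpha * precision + (1 - alpha) * recall
    return max(totals, key=totals.get) if totals else None
```

Under this scheme a single well-cited answer can outvote several answers whose rationales cite unsupported passages, which is the behavior plain majority voting cannot provide.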
Key Findings
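HGOT increases FEVER exact-match (EM) accuracy versus baselines: HGOT+Sampling reaches 61.50% against 58.35% for Retrieve-then-Read (+3.15 pp).
HGOT matches or slightly surpasses leading baselines on Open-SQuAD and HotPotQA when tuned; on Open-SQuAD, HGOT+KNN reaches 24.10% EM against 22.51% for Retrieve-then-Read (+1.59 pp).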
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| FEVER EM (Overall) | 61.50% | Retrieve-then-Read 58.35% | +3.15 pp | FEVER overall | Table 2: HGOT+Sampling 61.50 vs Retrieve-then-Read 58.35 | — |
| Open-SQuAD EM (Overall) | 24.10% | Retrieve-then-Read 22.51% | +1.59 pp | Open-SQuAD overall | Table 2: HGOT+KNN 24.10 vs Retrieve-then-Read 22.51 | — |
What To Try In 7 Days
Run HGOT's Probe+Plan+Infer loop on a small FEVER-like subset to compare EM gains versus your current retrieval pipeline.
Weight your self-consistency voting by citation precision and recall, and measure how the top-K answers change.
Swap SerpApi for your production search and compare passage re-ranking with the paper's weighted citation score.
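To compare EM between an existing pipeline and a candidate on a small subset, a harness along these lines may be enough. The normalization (lowercasing, stripping punctuation and English articles) follows the common SQuAD-style convention; whether the paper normalizes answers the same way is an assumption, and the function names are illustrative.

```python
import re
import string


def normalize(text):
    """SQuAD-style normalization: lowercase, drop punctuation and articles, squeeze spaces."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, gold_answers):
    """True if the prediction equals any gold answer after normalization."""
    return any(normalize(prediction) == normalize(gold) for gold in gold_answers)


def em_score(predictions, references):
    """EM as a percentage; references is a list of gold-answer lists."""
    if not predictions:
        return 0.0
    hits = sum(exact_match(p, refs) for p, refs in zip(predictions, references))
    return 100.0 * hits / len(predictions)


def em_delta(baseline_preds, candidate_preds, references):
    """Difference in percentage points (candidate minus baseline)."""
    return em_score(candidate_preds, references) - em_score(baseline_preds, references)
```

On a small subset the delta is noisy, so it is worth repeating the comparison over several random samples before drawing conclusions.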
Agent Features
Memory
Planning
Tool Use
Frameworks
Is Agentic: Yes
Architectures
Collaboration
Optimization Features
Token Efficiency
Inference Optimization
Risks & Boundaries
Limitations
Evaluated only with ChatGPT (gpt-3.5-turbo-1106); other LMs such as Gemini or Llama 2 were not tested.
Retrieval limited to SerpApi Google Search; results may change with different search engines or domain sources.
When Not To Use
Low-latency or low-cost services where multiple retrieval steps and LM calls are infeasible.
Tasks that do not require external factual grounding or where a single passage suffices.
Failure Modes
Incorrect dependency graph from planning leads to wrong sub-queries and cascading errors.
Citation detection/NLI errors can mis-weight good evidence and promote wrong answers.
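One inexpensive guard against the first failure mode is to validate the planner's dependency graph before issuing any sub-queries. The sketch below assumes the plan is available as a mapping from each sub-question id to the ids it depends on; that representation and the function name `execution_order` are assumptions for illustration. It catches structural errors only (undefined dependencies, cycles), not semantically wrong sub-questions.

```python
from graphlib import CycleError, TopologicalSorter


def execution_order(plan):
    """Return sub-question ids in an order that respects dependencies.

    plan: dict mapping each sub-question id to a list of ids it depends on
    (an assumed representation of the planner's output).
    """
    known = set(plan)
    for node, deps in plan.items():
        missing = set(deps) - known
        if missing:
            raise ValueError(
                f"{node!r} depends on undefined sub-questions: {sorted(missing)}"
            )
    try:
        # static_order yields dependencies before the nodes that need them.
        return list(TopologicalSorter(plan).static_order())
    except CycleError as exc:
        raise ValueError(f"plan contains a dependency cycle: {exc.args[1]}") from exc
```

Rejecting or re-planning on a `ValueError` stops a malformed plan before it triggers retrieval and LLM calls, which limits how far a planning error can cascade.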

