Short, guided retrieval loops that unify text, tables and KGs for faster, auditable multi-hop QA

October 23, 2025 · 9 min

Overview

Decision Snapshot: Needs Validation

The approach is practical: it unifies text, tables, and KG triples into one reversible sequence and runs learned, budget-constrained selection over it. Empirical gains are shown across multiple benchmarks with detailed ablations. The results are well reported but depend on supervised fine-tuning (SFT), careful calibration of sufficiency thresholds, and the engineering effort behind HSEQ.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 3/4

Findings with evidence refs: 4/4

Results with explicit delta: 8/8

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 70%

Authors

Ruiyi Yang, Hao Xue, Imran Razzak, Hakim Hacid, Flora D. Salim

Links

Abstract / PDF / Code / Data

Why It Matters For Business

RELOOP reduces wasted retrieval work and provides explicit provenance. This yields more accurate multi-step answers across mixed data formats while keeping latency and token/tool costs predictable.

Who Should Care

Summary TLDR

RELOOP turns retrieval into short, guided iterations over a single, reversible hierarchical sequence that encodes text, tables and knowledge-graph triples. A lightweight planner (head) gives a retrieval prior, a trained iterator picks small windows of segments and predicts when evidence is sufficient, and a canonicalizer packages provenance for the answerer. The system improves multi-hop accuracy across text/table/KG QA while keeping short, predictable loops and explicit evidence provenance.
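The canonicalizer step described above can be pictured as a small, deterministic packaging function. The following is only a sketch under stated assumptions: the dictionary keys (`id`, `kind`, `text`, `source`) and the function name `canonicalize` are hypothetical and are not taken from the paper or its code.

```python
import json


def canonicalize(selected: list[dict], max_chars: int = 80) -> str:
    """Deduplicate selected segments by ID, order them, and emit compact JSON.

    Assumed (hypothetical) segment shape:
    {"id": int, "kind": str, "text": str, "source": str}
    """
    # Later duplicates overwrite earlier ones, so each ID appears once.
    unique = {seg["id"]: seg for seg in selected}
    package = [
        {
            "id": seg_id,
            "kind": seg["kind"],
            "snippet": seg["text"][:max_chars],  # short snippet keeps the prompt small
            "source": seg["source"],             # provenance pointer for auditing
        }
        for seg_id, seg in sorted(unique.items())
    ]
    # sort_keys makes the output byte-identical for identical evidence sets.
    return json.dumps(package, sort_keys=True)


paragraph = {"id": 1, "kind": "paragraph", "text": "Alice directed the film.", "source": "doc-a:0-24"}
row = {"id": 2, "kind": "table_row", "text": "film=Example | year=1999", "source": "table-b:row-3"}

package = canonicalize([row, paragraph, row])
```

Because the package is ordered by ID and deduplicated, the same evidence set yields the same string regardless of the order in which segments were selected, which is what makes the evidence easy to log and compare.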

Problem Statement

Current RAG pipelines either run a single big retrieval step that misses multi-step evidence chains, or use unconstrained agentic loops that explode tool/token costs and lack a clear stop rule. Different data formats (text, tables, KGs) also force separate retrievers and controllers, complicating deployment and audit.

Main Contribution

HSEQ: a reversible hierarchical sequence that linearizes text, table rows, and KG triples into typed segments with parent pointers and offsets for provenance.

A budget-aware iterator (RELOOP-I) that selects small windows of segments, expands structure-aware neighborhoods, and predicts a sufficiency stop signal.
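To make the HSEQ idea concrete, here is a minimal sketch of a typed segment stream with parent pointers and offsets. The field names (`seg_id`, `kind`, `parent`, `start`, `end`) and the helpers `linearize` and `recover` are hypothetical illustrations, not the authors' schema; the only properties taken from the description above are typed segments, parent pointers, and offsets that make the sequence reversible.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Segment:
    seg_id: int
    kind: str               # assumed type tags: "document", "paragraph", "table_row", "triple"
    text: str               # linearized content exposed to the iterator
    parent: Optional[int]   # seg_id of the enclosing source; None for the root
    start: int              # character offsets into the parent's source text
    end: int


def linearize(doc_id: int, source: str, spans: list[tuple[str, int, int]]) -> list[Segment]:
    """Flatten (kind, start, end) spans of one source string into a segment list."""
    root = Segment(doc_id, "document", "", None, 0, len(source))
    children = [
        Segment(doc_id + i + 1, kind, source[s:e], doc_id, s, e)
        for i, (kind, s, e) in enumerate(spans)
    ]
    return [root, *children]


def recover(segment: Segment, sources: dict[int, str]) -> str:
    """Reversibility: map a non-root segment back to its exact source span."""
    return sources[segment.parent][segment.start:segment.end]


# Toy mixed-format source; offsets are computed rather than hard-coded.
pieces = [
    ("paragraph", "Paris is the capital of France."),
    ("table_row", "city=Paris | country=France"),
    ("triple", "(Paris, capital_of, France)"),
]
source = "\n".join(text for _, text in pieces)
spans, cursor = [], 0
for kind, text in pieces:
    spans.append((kind, cursor, cursor + len(text)))
    cursor += len(text) + 1  # skip the newline separator

hseq = linearize(100, source, spans)
```

Every non-root segment can be traced back to the exact characters it came from via `recover(seg, {100: source})`, which is the property that supports provenance checks.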

Key Findings

RELOOP yields higher QA accuracy than strong baselines across heterogeneous benchmarks.

Numbers: HybridQA acc 66.4 / F1 72.1; TAT-QA acc 75.7 / F1 83.5; HotpotQA acc 56.3 / F1 58.6 (Table 2)

Practical Use: Use RELOOP when you need cross-format multi-hop QA; it improves answer quality on evaluated datasets compared with single-pass and many RAG baselines.

Evidence Ref: Table 2

Guided, budgeted iteration keeps loops short while retaining multi-hop power.

Numbers: RELOOP best: ~3–4 iterations; HotpotQA latency 6.2k ms vs ToG 22.7k ms (Table 3)

Practical Use: If latency or token/tool budgets matter, prefer RELOOP to unconstrained graph traversals; expect a few short iterations rather than many expansion steps.

Evidence Ref: Table 3
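Taking the two HotpotQA latencies quoted for this finding at face value, the implied ratio works out as follows. This is plain arithmetic on the reported figures, not an additional measurement.

```latex
\[
\frac{t_{\text{ToG}}}{t_{\text{RELOOP}}}
  = \frac{22.7\,\text{k ms}}{6.2\,\text{k ms}}
  \approx 3.7
\]
```

In other words, the reported RELOOP latency is a little over a quarter of the reported ToG latency on this benchmark.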

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
--- | --- | --- | --- | --- | --- | ---
Accuracy | 66.4 | best RAG baselines (HippoRAG 65.8) | +0.6pp vs top baselines | HybridQA test | Table 2: RELOOP (best) achieves 66.4 acc | Table 2
HybridQA F1 | 72.1 | HippoRAG 72.4 (slightly higher) | -0.3pp vs top F1 | HybridQA test | Table 2: RELOOP F1 72.1 | Table 2

What To Try In 7 Days

Convert a small mixed corpus to HSEQ: record paragraphs, table rows, and triples with simple offsets and parent pointers.

Fine-tune a small iterator LLM via LoRA to emit compact JSON actions and a sufficiency score over a held-out dev set.

Add a tiny planner that emits short, cached guidance templates and test iteration depth vs accuracy to pick a production pair.
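For the second step above, it helps to fix the action format before fine-tuning so malformed outputs are caught early. The sketch below assumes a hypothetical JSON schema with two keys, `select` and `sufficiency`; neither the key names nor the `max_select` budget come from the paper.

```python
import json
from dataclasses import dataclass


@dataclass
class IteratorAction:
    select: list[int]    # segment IDs chosen this step (hypothetical key)
    sufficiency: float   # score in [0, 1] that evidence is enough (hypothetical key)


def parse_action(raw: str, max_select: int = 4) -> IteratorAction:
    """Parse and validate one JSON action; raise ValueError on malformed output."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"iterator did not emit valid JSON: {exc}") from exc
    if not isinstance(obj, dict):
        raise ValueError("action must be a JSON object")

    select = obj.get("select")
    score = obj.get("sufficiency")

    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(select, list) or not all(
        isinstance(i, int) and not isinstance(i, bool) for i in select
    ):
        raise ValueError("'select' must be a list of integer segment IDs")
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        raise ValueError("'sufficiency' must be a number")
    if not 0.0 <= score <= 1.0:
        raise ValueError("'sufficiency' must lie in [0, 1]")

    # Enforce the per-step budget by truncating, preserving the model's order.
    return IteratorAction(select[:max_select], float(score))


action = parse_action('{"select": [3, 7, 9, 12, 15], "sufficiency": 0.82}')
```

Rejecting malformed actions outright, rather than repairing them, keeps the training signal and the evaluation logs unambiguous; whether to retry or fall back on failure is a separate design choice.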

Agent Features

Memory: Retrieval memory over HSEQ segments; short-term windowed exposure (bounded per step)
Planning: Head-generated guidance (2–4 sentence plans); heuristic guidance templates
Tool Use: Iterator emits JSON actions and queries the head planner; optional verifier triggers short refinement loops
Frameworks: RELOOP-I (iterator); RELOOP-H (head); HSEQ-Adapter (HSEQ-A); κ canonicalizer
Is Agentic: Yes
Architectures: LLM-based iterator + head agent; hierarchical segment stream (HSEQ)
Collaboration: Single multi-module agent (head + iterator + canonicalizer); no multi-agent negotiation reported

Optimization Features

Token Efficiency: Canonical evidence packages (snippets and IDs) reduce prompt size for the head; cached guidance avoids repeated planner calls
Infra Optimization: Experiments fit on up to 4 H200 GPUs with mixed precision and 4-bit quantization
Model Optimization: LoRA; mixed precision and 4-bit weight quantization for memory efficiency
System Optimization: Deterministic JSON outputs to simplify downstream parsing and evaluation
Training Optimization: Teacher-forcing supervision for early steps; weak-positive labeling and per-step weighting to reduce noise; curriculum of short-to-long episodes and dataset-mixing quotas
Inference Optimization: Windowed candidate stream to cap per-step context; top-k selection per step and deterministic refresh to bound work; early stopping via learned sufficiency head
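The inference-time controls listed above (a bounded window per step, top-k selection, and early stopping) can be sketched as a single loop. This is a toy illustration, not the authors' iterator: the lexical `overlap` scorer stands in for the trained model, query-term coverage stands in for the learned sufficiency head, and all parameter names are assumptions.

```python
from typing import Callable

Scorer = Callable[[set[str], str], float]


def overlap(terms: set[str], text: str) -> float:
    """Toy relevance score: how many still-missing query terms the text contains."""
    return float(len(terms & set(text.lower().split())))


def budgeted_retrieve(
    query_terms: set[str],
    segments: list[str],
    score: Scorer = overlap,
    window: int = 4,      # candidates exposed per step (caps per-step context)
    top_k: int = 1,       # segments kept per step
    max_steps: int = 3,   # hard iteration budget
    stop_at: float = 1.0, # stand-in sufficiency threshold on term coverage
) -> tuple[list[int], int]:
    """Return (selected segment indices, number of steps used)."""
    selected: list[int] = []
    covered: set[str] = set()
    steps = 0
    for start in range(0, len(segments), window):
        if steps == max_steps:
            break
        steps += 1
        missing = query_terms - covered
        candidates = range(start, min(start + window, len(segments)))
        ranked = sorted(candidates, key=lambda i: score(missing, segments[i]), reverse=True)
        for i in ranked[:top_k]:
            if score(missing, segments[i]) > 0:
                selected.append(i)
                covered |= query_terms & set(segments[i].lower().split())
        # Early stop once the stand-in sufficiency signal clears the threshold.
        if len(covered) >= stop_at * len(query_terms):
            break
    return selected, steps


segments = [
    "alice directed the film", "the weather was mild",
    "bob wrote a novel", "tables list revenue",
    "the film premiered in berlin", "an unrelated note",
    "another note", "more filler",
    "late segment one", "late segment two",
    "late segment three", "late segment four",
]
result = budgeted_retrieve({"alice", "berlin"}, segments)
```

In this toy run the loop stops after the second window, so the last four segments are never inspected; that is the mechanism by which a sufficiency signal bounds both latency and tokens inspected.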

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Data URLs

HotpotQA (public), HybridQA (public), TAT-QA (public), MetaQA (public)

Risks & Boundaries

Limitations

Sufficiency judge can fail under noisy or partial evidence; authors note hallucination risk in the sufficiency head.

The framework assumes HSEQ metadata (offsets, row indices, triple fields) is available, which is costly to produce for arbitrary corpora.

When Not To Use

For trivial single-hop queries where LLM-only or single-pass retrieval is enough and latency is critical.

When you cannot construct reversible segment metadata (offsets, schema, parent pointers).

Failure Modes

Sufficiency false positives cause premature stopping and wrong answers.

Poor guidance (cache miss or bad planner output) can force extra iterations or miss supporting evidence.
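One simple way to limit premature stops is to choose the sufficiency threshold on a held-out dev set so that "stop" decisions are right at least some target fraction of the time. The sketch below is a generic precision-based calibration, not the authors' procedure; the function name, the `min_precision` target, and the toy scores are all assumptions.

```python
from typing import Optional


def calibrate_threshold(
    scores: list[float],
    was_sufficient: list[bool],
    min_precision: float = 0.9,
) -> Optional[float]:
    """Return the lowest threshold whose stop decisions meet the precision target.

    scores[i] is the sufficiency score at a candidate stop point, and
    was_sufficient[i] says whether the evidence really was enough there.
    Returns None if no threshold reaches the target.
    """
    for threshold in sorted(set(scores)):
        stops = [ok for s, ok in zip(scores, was_sufficient) if s >= threshold]
        if stops and sum(stops) / len(stops) >= min_precision:
            return threshold
    return None


# Toy dev-set scores and labels, made up for illustration only.
dev_scores = [0.2, 0.4, 0.6, 0.7, 0.8, 0.9]
dev_labels = [False, False, True, False, True, True]

strict = calibrate_threshold(dev_scores, dev_labels, min_precision=0.9)
relaxed = calibrate_threshold(dev_scores, dev_labels, min_precision=0.75)
```

A stricter target pushes the threshold up, trading extra iterations for fewer false-positive stops; the right trade-off depends on the latency budget.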

Core Entities

Models

Qwen3-4B-Instruct-2507, Falcon-H1-7B-Instruct, Falcon3-10B-instruct, Llama-3.1-8B-Instruct, Falcon3-3B-instruct, Llama-3.2-3B-Instruct

Metrics

Accuracy, F1, Iteration Steps, Latency (ms), Tokens inspected

Datasets

HotpotQA, HybridQA, TAT-QA, MetaQA-2Hop, MetaQA-3Hop

Benchmarks

HotpotQA (text multi-hop), HybridQA (table+text), TAT-QA (financial table+text), MetaQA (KG 2/3-hop)

Context Entities

Models

Falcon-H1 family, DeepSeek-R1-Distill-Qwen-7B