Short, guided retrieval loops that unify text, tables and KGs for faster, auditable multi-hop QA

Overview

Decision Snapshot: Needs Validation

The approach is practical: it unifies formats into one reversible sequence and runs learned, budget-bounded selection over it. Empirical gains are shown across multiple benchmarks with detailed ablations. The results are well reported but depend on supervised fine-tuning (SFT), careful calibration of sufficiency thresholds, and the engineering effort needed to build HSEQ.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 3/4

Findings with evidence refs: 4/4

Results with explicit delta: 8/8

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 70%

Authors

Ruiyi Yang, Hao Xue, Imran Razzak, Hakim Hacid, Flora D. Salim

Why It Matters For Business

RELOOP reduces wasted retrieval work and provides explicit provenance. This yields more accurate multi-step answers across mixed data formats while keeping latency and token/tool costs predictable.

Who Should Care

Product Manager, ML Engineer, CTO, Founder

Summary TLDR

RELOOP turns retrieval into short, guided iterations over a single, reversible hierarchical sequence that encodes text, tables and knowledge-graph triples. A lightweight planner (head) gives a retrieval prior, a trained iterator picks small windows of segments and predicts when evidence is sufficient, and a canonicalizer packages provenance for the answerer. The system improves multi-hop accuracy across text/table/KG QA while keeping short, predictable loops and explicit evidence provenance.
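The loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: `guided_retrieval`, `overlap_score`, and `coverage` are hypothetical names, the keyword-overlap scorer stands in for the trained iterator, the coverage ratio stands in for the learned sufficiency head, and the `guidance` string stands in for the head's retrieval prior.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Set


@dataclass(frozen=True)
class Segment:
    seg_id: str   # hypothetical ID scheme, e.g. "p0", "t0.r0", "kg0"
    kind: str     # "text", "table_row", or "triple"
    content: str


def overlap_score(query: str, seg: Segment) -> float:
    """Toy relevance score: number of query words found in the segment."""
    return float(len(set(query.lower().split()) & set(seg.content.lower().split())))


def coverage(query: str, evidence: List[Segment]) -> float:
    """Toy sufficiency score: fraction of query words covered by the evidence."""
    words = set(query.lower().split())
    if not words:
        return 1.0
    covered: Set[str] = set()
    for seg in evidence:
        covered |= words & set(seg.content.lower().split())
    return len(covered) / len(words)


def guided_retrieval(
    guidance: str,
    segments: Sequence[Segment],
    score: Callable[[str, Segment], float] = overlap_score,
    sufficiency: Callable[[str, List[Segment]], float] = coverage,
    window: int = 2,       # segments exposed per step
    top_k: int = 1,        # segments kept per step
    max_steps: int = 4,    # hard iteration budget
    threshold: float = 1.0,
) -> dict:
    selected: List[Segment] = []
    steps = 0
    stopped_early = False
    for start in range(0, len(segments), window):
        if steps >= max_steps:
            break
        steps += 1
        # Expose only a small window of the stream at each step.
        candidates = list(segments[start:start + window])
        ranked = sorted(candidates, key=lambda s: score(guidance, s), reverse=True)
        selected.extend(s for s in ranked[:top_k] if score(guidance, s) > 0)
        # Stop as soon as the sufficiency estimate clears the threshold.
        if sufficiency(guidance, selected) >= threshold:
            stopped_early = True
            break
    # Package evidence with IDs so the answerer can cite provenance.
    package = [{"id": s.seg_id, "kind": s.kind, "snippet": s.content} for s in selected]
    return {"evidence": package, "steps": steps, "stopped_early": stopped_early}


stream = [
    Segment("p0", "text", "Marie Curie was born in Warsaw"),
    Segment("p1", "text", "The Eiffel Tower is in Paris"),
    Segment("t0.r0", "table_row", "city Warsaw country Poland"),
    Segment("kg0", "triple", "Paris capital_of France"),
    Segment("p2", "text", "Unrelated note about weather"),
]
result = guided_retrieval("curie born warsaw country", stream)
print(result["steps"], [e["id"] for e in result["evidence"]])
```

In this toy run the loop picks one text segment and one table row, then stops after two steps because the evidence covers every guidance keyword. The remaining window is never inspected, which is the behavior the budgeted design aims for.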

Problem Statement

Current RAG pipelines either run a single big retrieval step that misses multi-step evidence chains, or use unconstrained agentic loops that explode tool/token costs and lack a clear stop rule. Different data formats (text, tables, KGs) also force separate retrievers and controllers, complicating deployment and audit.

Main Contribution

HSEQ: a reversible hierarchical sequence that linearizes text, table rows, and KG triples into typed segments with parent pointers and offsets for provenance.

A budget-aware iterator (RELOOP-I) that selects small windows of segments, expands structure-aware neighborhoods, and predicts a sufficiency stop signal.
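One way to picture HSEQ is as a flat list of typed segments, each carrying a parent pointer and an offset. The sketch below is an assumption-laden illustration, not the paper's format: the class `HSeqSegment`, the ID scheme, the `key=value` row rendering, and the helper names are all hypothetical. It shows how character offsets let a paragraph segment be mapped back to the source text and how parent pointers give a provenance chain.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class HSeqSegment:
    seg_id: str
    kind: str                # "doc", "paragraph", "table", "row", "kg", "triple"
    parent: Optional[str]    # parent pointer; None for root segments
    content: str
    offset: Tuple[int, int]  # char span for text; index span for rows and triples


def linearize_text(doc_id: str, text: str) -> List[HSeqSegment]:
    segs = [HSeqSegment(doc_id, "doc", None, "", (0, len(text)))]
    pos = 0
    for i, para in enumerate(text.split("\n\n")):
        start = text.index(para, pos)  # record where the paragraph sits in the source
        end = start + len(para)
        segs.append(HSeqSegment(f"{doc_id}.p{i}", "paragraph", doc_id, para, (start, end)))
        pos = end
    return segs


def linearize_table(table_id: str, header: Sequence[str],
                    rows: Sequence[Sequence[str]]) -> List[HSeqSegment]:
    segs = [HSeqSegment(table_id, "table", None, " | ".join(header), (0, len(rows)))]
    for r, row in enumerate(rows):
        cells = "; ".join(f"{h}={v}" for h, v in zip(header, row))
        segs.append(HSeqSegment(f"{table_id}.r{r}", "row", table_id, cells, (r, r + 1)))
    return segs


def linearize_triples(kg_id: str,
                      triples: Sequence[Tuple[str, str, str]]) -> List[HSeqSegment]:
    segs = [HSeqSegment(kg_id, "kg", None, "", (0, len(triples)))]
    for i, (head, rel, tail) in enumerate(triples):
        segs.append(HSeqSegment(f"{kg_id}.t{i}", "triple", kg_id,
                                f"{head} | {rel} | {tail}", (i, i + 1)))
    return segs


def provenance(seg: HSeqSegment, index: Dict[str, HSeqSegment]) -> List[str]:
    """Follow parent pointers from a segment back to its root."""
    chain = [seg.seg_id]
    while seg.parent is not None:
        seg = index[seg.parent]
        chain.append(seg.seg_id)
    return chain


text = "Curie was born in Warsaw.\n\nShe later moved to Paris."
hseq = (
    linearize_text("d0", text)
    + linearize_table("t0", ["city", "country"], [["Warsaw", "Poland"], ["Paris", "France"]])
    + linearize_triples("kg0", [("Paris", "capital_of", "France")])
)
index = {s.seg_id: s for s in hseq}

p1 = index["d0.p1"]
recovered = text[p1.offset[0]:p1.offset[1]]  # offsets map the segment back to the source
print(recovered == p1.content, provenance(index["t0.r1"], index))
```

Keeping all three formats in one list means a single selector can score them uniformly, while the `kind` and `parent` fields preserve enough structure to expand to neighbors (sibling rows, the enclosing document) when needed.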

Key Findings

RELOOP yields higher QA accuracy than strong baselines across heterogeneous benchmarks.

Numbers: HybridQA acc 66.4 / F1 72.1; TAT-QA acc 75.7 / F1 83.5; HotpotQA acc 56.3 / F1 58.6 (Table 2)

Practical Use: Use RELOOP when you need cross-format multi-hop QA; it improves answer quality on the evaluated datasets compared with single-pass and many RAG baselines.

Evidence Ref: Table 2

Guided, budgeted iteration keeps loops short while retaining multi-hop power.

Numbers: RELOOP (best): ~3–4 iterations; HotpotQA latency 6.2k ms vs ToG 22.7k ms (Table 3)

Practical Use: If latency or token/tool budgets matter, prefer RELOOP to unconstrained graph traversals; expect a few short iterations rather than many expansion steps.

Evidence Ref: Table 3

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
HybridQA accuracy	66.4	best RAG baselines (HippoRAG 65.8)	+0.6pp vs top baselines	HybridQA test	Table 2: RELOOP (best) achieves 66.4 acc	Table 2
HybridQA F1	72.1	HippoRAG 72.4 (slightly higher)	-0.3pp vs top F1	HybridQA test	Table 2: RELOOP F1 72.1	Table 2

What To Try In 7 Days

Convert a small mixed corpus to HSEQ: record paragraphs, table rows, and triples with simple offsets and parent pointers.

Fine-tune a small iterator LLM via LoRA to emit compact JSON actions and a sufficiency score over a held-out dev set.

Add a tiny planner that emits short, cached guidance templates and test iteration depth vs accuracy to pick a production pair.
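For the iterator step above, it helps to validate every emitted action before acting on it. The sketch below assumes a hypothetical action schema with two fields, `select` (segment IDs) and `sufficiency` (a score in [0, 1]); the source does not specify the actual schema, and `parse_action` is an illustrative helper name.

```python
import json
from typing import List, Set, TypedDict


class IteratorAction(TypedDict):
    select: List[str]
    sufficiency: float


def parse_action(raw: str, visible_ids: Set[str], top_k: int = 2) -> IteratorAction:
    """Parse and validate one iterator output against the current window."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(obj, dict):
        raise ValueError("action must be a JSON object")

    select = obj.get("select")
    if not isinstance(select, list) or not all(isinstance(s, str) for s in select):
        raise ValueError("'select' must be a list of segment IDs")
    # Reject IDs the model could not have seen in this step.
    unknown = [s for s in select if s not in visible_ids]
    if unknown:
        raise ValueError(f"IDs outside the current window: {unknown}")

    suff = obj.get("sufficiency")
    if isinstance(suff, bool) or not isinstance(suff, (int, float)) or not 0.0 <= suff <= 1.0:
        raise ValueError("'sufficiency' must be a number in [0, 1]")

    # Enforce the per-step selection budget even if the model over-selects.
    return {"select": select[:top_k], "sufficiency": float(suff)}


window_ids = {"p0", "t0.r1", "kg0.t0"}
action = parse_action('{"select": ["p0", "t0.r1", "kg0.t0"], "sufficiency": 0.35}', window_ids)
print(action)
```

Rejecting out-of-window IDs catches hallucinated segment references early, and truncating to `top_k` keeps the per-step budget enforceable regardless of what the model emits.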

Agent Features

Memory

Retrieval memory over HSEQ segments

Short-term windowed exposure (bounded per step)

Planning

Head-generated guidance (2–4 sentence plans)

Heuristic guidance templates

Tool Use

Iterator emits JSON actions and queries the head planner

Optional verifier triggers short refinement loops

Frameworks

RELOOP-I (iterator)

RELOOP-H (head)

HSEQ-Adapter (HSEQ-A)

κ canonicalizer

Is Agentic

Yes

Architectures

LLM-based iterator + head agent

Hierarchical segment stream (HSEQ)

Collaboration

Single multi-module agent (head + iterator + canonicalizer)

No multi-agent negotiation reported

Optimization Features

Token Efficiency

Canonical evidence packages (snippets and IDs) reduce prompt size for the head

Cached guidance avoids repeated planner calls

Infra Optimization

Experiments fit on up to 4 H200 GPUs with mixed precision and 4-bit quantization

Model Optimization

LoRA

Mixed precision and 4-bit weight quantization for memory efficiency

System Optimization

Deterministic JSON outputs to simplify downstream parsing and evaluation

Training Optimization

Teacher-forcing supervision for early steps

Weak-positive labeling and per-step weighting to reduce noise

Curriculum: short-to-long episodes and dataset-mixing quotas

Inference Optimization

Windowed candidate stream to cap per-step context

Top-k selection per step and deterministic refresh to bound work

Early stopping via learned sufficiency head

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://anonymous.4open.science/r/RELOOP

Data URLs

HotpotQA (public)

HybridQA (public)

TAT-QA (public)

MetaQA (public)

Risks & Boundaries

Limitations

Sufficiency judge can fail under noisy or partial evidence; authors note hallucination risk in the sufficiency head.

The framework assumes HSEQ metadata (offsets, row indices, triple fields) is available, which is costly to produce for arbitrary corpora.

When Not To Use

For trivial single-hop queries where LLM-only or single-pass retrieval is enough and latency is critical.

When you cannot construct reversible segment metadata (offsets, schema, parent pointers).

Failure Modes

Sufficiency false positives cause premature stopping and wrong answers.

Poor guidance (cache miss or bad planner output) can force extra iterations or miss supporting evidence.
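Premature stopping can be reduced by calibrating the sufficiency threshold on a held-out dev set. The sketch below is one simple way to do that, not the authors' procedure: it assumes you have logged `(sufficiency_score, evidence_was_actually_sufficient)` pairs, and the function names and the sample data are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

Record = Tuple[float, bool]  # (predicted sufficiency, evidence truly sufficient?)


def false_positive_rate(records: Sequence[Record], threshold: float) -> float:
    """Share of truly insufficient cases the judge would stop on at this threshold."""
    negatives = [score for score, ok in records if not ok]
    if not negatives:
        return 0.0
    return sum(score >= threshold for score in negatives) / len(negatives)


def pick_threshold(records: Sequence[Record], candidates: Sequence[float],
                   max_fpr: float) -> Optional[float]:
    """Lowest candidate threshold whose false-positive rate stays within max_fpr."""
    for t in sorted(candidates):
        if false_positive_rate(records, t) <= max_fpr:
            return t
    return None  # no candidate is safe enough; gather more data or retrain


# Illustrative dev-set log, made up for this example.
dev: List[Record] = [
    (0.95, True), (0.90, True), (0.85, False), (0.70, True),
    (0.65, False), (0.40, False), (0.30, False),
]
chosen = pick_threshold(dev, [0.5, 0.6, 0.7, 0.8, 0.9], max_fpr=0.25)
print(chosen, false_positive_rate(dev, 0.5))
```

Choosing the lowest threshold that meets the false-positive budget keeps loops as short as possible while bounding how often the system stops on insufficient evidence.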

Core Entities

Models

Qwen3-4B-Instruct-2507

Falcon-H1-7B-Instruct

Falcon3-10B-instruct

Llama-3.1-8B-Instruct

Falcon3-3B-instruct

Llama-3.2-3B-Instruct

Metrics

Accuracy

F1

Iteration Steps

Latency (ms)

Tokens inspected

Datasets

HotpotQA

HybridQA

TAT-QA

MetaQA-2Hop

MetaQA-3Hop

Benchmarks

HotpotQA (text multi-hop)

HybridQA (table+text)

TAT-QA (financial table+text)

MetaQA (KG 2/3-hop)

Context Entities

Models

Falcon-H1 family

DeepSeek-R1-Distill-Qwen-7B
