Short, guided retrieval loops that unify text, tables and KGs for faster, auditable multi-hop QA

October 23, 2025 · 9 min

Overview

Production Readiness

0.6

Novelty Score

0.7

Cost Impact Score

0.6

Citation Count

0

Authors

Ruiyi Yang, Hao Xue, Imran Razzak, Hakim Hacid, Flora D. Salim

Why It Matters For Business

RELOOP reduces wasted retrieval work and provides explicit provenance. This yields more accurate multi-step answers across mixed data formats while keeping latency and token/tool costs predictable.

Summary TLDR

RELOOP turns retrieval into short, guided iterations over a single, reversible hierarchical sequence that encodes text, tables and knowledge-graph triples. A lightweight planner (head) gives a retrieval prior, a trained iterator picks small windows of segments and predicts when evidence is sufficient, and a canonicalizer packages provenance for the answerer. The system improves multi-hop accuracy across text, table and KG QA while keeping loops short and predictable and evidence provenance explicit.

Problem Statement

Current RAG pipelines either run a single big retrieval step that misses multi-step evidence chains, or use unconstrained agentic loops that explode tool/token costs and lack a clear stop rule. Different data formats (text, tables, KGs) also force separate retrievers and controllers, complicating deployment and audit.

Main Contribution

HSEQ: a reversible hierarchical sequence that linearizes text, table rows, and KG triples into typed segments with parent pointers and offsets for provenance.

A budget-aware iterator (RELOOP-I) that selects small windows of segments, expands structure-aware neighborhoods, and predicts a sufficiency stop signal.

A head planner (RELOOP-H) that provides short guidance plans to steer iteration, plus an evidence canonicalizer that packages provenance for the final answer and supports optional contradiction-driven refinement.

An open implementation recipe using parameter-efficient fine-tuning (LoRA), deterministic JSON action outputs for the iterator, and cached guidance to reduce overhead.

Empirical results showing consistent QA gains across text, table+text, and KG benchmarks with controllable accuracy-latency tradeoffs.
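To make the HSEQ idea above more concrete, here is a minimal sketch of linearizing paragraphs, table rows and triples into one list of typed segments with parent pointers and offsets. The `Segment` fields, the `build_hseq` helper, the "document" and "table" root kinds, and the toy data are all illustrative assumptions, not the authors' schema.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Segment:
    seg_id: int
    kind: str                # "document", "paragraph", "table", "table_row", "triple"
    text: str                # linearized content of the segment
    parent: Optional[int]    # seg_id of the containing unit, None for roots
    source: str              # hypothetical identifier of the originating document/table/graph
    offset: Tuple[int, int]  # char span for text, row/triple index span otherwise


def build_hseq(doc_id, paragraphs, table_id, header, rows, triples) -> List[Segment]:
    """Linearize mixed inputs into one ordered list of typed segments."""
    segs: List[Segment] = []

    def add(kind, text, parent, source, offset):
        segs.append(Segment(len(segs), kind, text, parent, source, offset))
        return segs[-1].seg_id

    doc_root = add("document", doc_id, None, doc_id, (0, 0))
    cursor = 0
    for p in paragraphs:
        add("paragraph", p, doc_root, doc_id, (cursor, cursor + len(p)))
        cursor += len(p) + 1  # +1 for the newline that joins paragraphs

    table_root = add("table", " | ".join(header), None, table_id, (0, 0))
    for i, row in enumerate(rows):
        cells = "; ".join(f"{h}={v}" for h, v in zip(header, row))
        add("table_row", cells, table_root, table_id, (i, i + 1))

    for i, (s, r, o) in enumerate(triples):
        add("triple", f"{s} -[{r}]-> {o}", None, "kg", (i, i + 1))
    return segs


# Toy, made-up corpus used only to exercise the sketch.
paragraphs = ["Acme was founded in 1990.", "It is based in Oslo."]
hseq = build_hseq(
    "doc1", paragraphs,
    "tab1", ["year", "revenue"], [["2020", "10"], ["2021", "12"]],
    [("Acme", "based_in", "Oslo")],
)
```

Because each paragraph keeps a character span into the newline-joined source text, slicing the source with a segment's offset gives back the segment text, which is the property the reversibility claim relies on.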

Key Findings

RELOOP yields higher QA accuracy than strong baselines across heterogeneous benchmarks.

Numbers: HybridQA acc 66.4 / F1 72.1; TAT-QA acc 75.7 / F1 83.5; HotpotQA acc 56.3 / F1 58.6 (Table 2)

Guided, budgeted iteration keeps loops short while retaining multi-hop power.

Numbers: RELOOP best: ~3–4 iterations; HotpotQA latency 6.2k ms vs ToG 22.7k ms (Table 3)

Supervised fine-tuning of the iterator and guidance both matter for accuracy.

Numbers: Ablation: w/o SFT drops HybridQA acc 66.4 → 57.3 (−9.1pp); w/o guidance 66.4 → 59.2 (−7.2pp) (Table 4)

A single learned policy can operate across text, tables and KGs via HSEQ.

Results

HybridQA Accuracy

Value: 66.4

Baseline: best RAG baseline (HippoRAG 65.8)

HybridQA F1

Value: 72.1

Baseline: HippoRAG 72.4 (slightly higher)

TAT-QA Accuracy

Value: 75.7

Baseline: TAT-LLM 73.1

HotpotQA Accuracy

Value: 56.3

Baseline: AdaptiveRAG 50.3

Accuracy

Value: 95.9

Baseline: AdaptiveRAG 88.2

Accuracy

Value: 93.4

Baseline: AdaptiveRAG 84.5

Iteration steps (HotpotQA)

Value: 4.00

Baseline: Think on Graph (ToG) 13.28 steps

Latency (HotpotQA)

Value: 6247.0 ms

Baseline: ToG 22708.2 ms

Who Should Care

What To Try In 7 Days

Convert a small mixed corpus to HSEQ: record paragraphs, table rows, and triples with simple offsets and parent pointers.

Fine-tune a small iterator LLM via LoRA to emit compact JSON actions and a sufficiency score over a held-out dev set.

Add a tiny planner that emits short, cached guidance templates and test iteration depth vs accuracy to pick a production pair.
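For the second step, it may help to fix the action format before any fine-tuning. The sketch below validates one iterator output against a small schema. The key names (`select`, `expand`, `sufficiency`), the `parse_action` helper and the per-step selection cap are hypothetical choices for illustration; the source only states that the iterator emits compact JSON actions and a sufficiency score.

```python
import json

# Hypothetical action schema: which segments to keep, whether to expand
# their structural neighborhood, and a sufficiency score in [0, 1].
ALLOWED_KEYS = {"select", "expand", "sufficiency"}


def parse_action(raw: str, window_ids: set, max_select: int = 3) -> dict:
    """Parse one iterator output; raise ValueError on anything off-schema."""
    try:
        action = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc

    if not isinstance(action, dict) or set(action) != ALLOWED_KEYS:
        raise ValueError("action must have exactly the keys select, expand, sufficiency")

    select = action["select"]
    if not isinstance(select, list) or len(select) > max_select:
        raise ValueError(f"select must be a list of at most {max_select} ids")
    if not all(type(i) is int and i in window_ids for i in select):
        raise ValueError("select may only reference segment ids in the current window")

    if not isinstance(action["expand"], bool):
        raise ValueError("expand must be a boolean")

    score = action["sufficiency"]
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        raise ValueError("sufficiency must be a number")
    if not 0.0 <= score <= 1.0:
        raise ValueError("sufficiency must lie in [0, 1]")
    return action


ok = parse_action('{"select": [4, 6], "expand": false, "sufficiency": 0.35}', {3, 4, 5, 6})
```

Rejecting off-schema outputs at parse time keeps malformed generations out of the retrieval loop and gives a simple format-validity rate to track on the dev set.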

Agent Features

Memory

  • Retrieval memory over HSEQ segments
  • Short-term windowed exposure (bounded per step)

Planning

  • Head-generated guidance (2–4 sentence plans)
  • Heuristic guidance templates

Tool Use

  • Iterator emits JSON actions and queries the head planner
  • Optional verifier triggers short refinement loops

Frameworks

  • RELOOP-I (iterator)
  • RELOOP-H (head)
  • HSEQ-Adapter (HSEQ-A)
  • κ canonicalizer

Is Agentic

true

Architectures

  • LLM-based iterator + head agent
  • Hierarchical segment stream (HSEQ)

Collaboration

  • Single multi-module agent (head + iterator + canonicalizer)
  • No multi-agent negotiation reported

Optimization Features

Token Efficiency

  • Canonical evidence packages (snippets and IDs) reduce prompt size for the head
  • Cached guidance avoids repeated planner calls
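One way cached guidance could work is to key the cache on a coarse question template, so that questions differing only in numbers or quoted strings reuse one plan. The sketch below assumes this design; `question_template`, `call_planner`, the placeholder plan text and the example questions are all hypothetical and stand in for a real planner call.

```python
import re
from functools import lru_cache

PLANNER_CALLS = {"count": 0}  # counter used only to show cache hits


def question_template(question: str) -> str:
    """Collapse a question to a coarse template shared by similar questions."""
    q = question.lower().strip()
    q = re.sub(r'"[^"]*"', "<quoted>", q)    # mask quoted spans
    q = re.sub(r"\d+(\.\d+)?", "<num>", q)   # mask numbers
    return re.sub(r"\s+", " ", q)


def call_planner(template: str) -> str:
    """Placeholder for the head planner; a real system would call an LLM here."""
    PLANNER_CALLS["count"] += 1
    return f"Locate the named entity, then follow one linked row or triple. [{template}]"


@lru_cache(maxsize=1024)
def guidance_for(template: str) -> str:
    """Return cached guidance for a template, calling the planner on a miss."""
    return call_planner(template)


g1 = guidance_for(question_template("What was revenue in 2020?"))
g2 = guidance_for(question_template("What was revenue in 2021?"))
```

Both questions map to the same template, so the planner stub runs once and the second lookup is served from the cache.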

Infra Optimization

  • Experiments fit on up to 4 H200 GPUs with mixed precision and 4-bit quantization

Model Optimization

  • LoRA
  • Mixed precision and 4-bit weight quantization for memory efficiency

System Optimization

  • Deterministic JSON outputs to simplify downstream parsing and evaluation

Training Optimization

  • Teacher-forcing supervision for early steps
  • Weak-positive labeling and per-step weighting to reduce noise
  • Curriculum: short-to-long episodes and dataset-mixing quotas

Inference Optimization

  • Windowed candidate stream to cap per-step context
  • Top-k selection per step and deterministic refresh to bound work
  • Early stopping via learned sufficiency head
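The three controls above combine into one bounded loop. The sketch below is a simplified stand-in: `score` and `sufficiency` are plain callables supplied by the caller, whereas in the described system both come from the trained iterator, and the parameter names and default values are assumptions chosen for illustration.

```python
from typing import Callable, List, Sequence, Tuple


def bounded_retrieval(
    segments: Sequence[str],
    score: Callable[[str], float],
    sufficiency: Callable[[List[str]], float],
    window: int = 4,      # segments exposed per step
    top_k: int = 2,       # segments kept per step
    max_steps: int = 4,   # hard iteration budget
    stop_at: float = 0.8  # sufficiency threshold for early stopping
) -> Tuple[List[str], int]:
    """Scan segments window by window, keep top-k per step, stop when sufficient."""
    evidence: List[str] = []
    steps = 0
    for start in range(0, len(segments), window):
        if steps >= max_steps:
            break
        steps += 1
        candidates = segments[start:start + window]
        ranked = sorted(candidates, key=score, reverse=True)
        evidence.extend(ranked[:top_k])
        if sufficiency(evidence) >= stop_at:
            break
    return evidence, steps


# Toy stand-ins for the learned scorer and sufficiency head.
def relevant(s: str) -> bool:
    return "oslo" in s or "acme" in s

segs = ["a", "b oslo", "c", "d", "e acme", "f", "g", "h", "i", "j"]
evidence, steps = bounded_retrieval(
    segs,
    score=lambda s: float(relevant(s)),
    sufficiency=lambda ev: sum(relevant(e) for e in ev) / 2,
)
```

The work per query is bounded by `max_steps * window` segments inspected and `max_steps * top_k` segments kept, regardless of corpus size, which is what makes latency predictable.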

Reproducibility

Data Urls

  • HotpotQA (public)
  • HybridQA (public)
  • TAT-QA (public)
  • MetaQA (public)

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Sufficiency judge can fail under noisy or partial evidence; authors note hallucination risk in the sufficiency head.
  • Framework assumes HSEQ metadata (offsets, row indices, triple fields) is available; this is costly to produce for arbitrary corpora.
  • Latency is still higher than single-pass, LLM-only answering; the tradeoff depends on the chosen agent pair and budgets.
  • Experiments use single-turn QA; multi-turn and streaming corpora are left for future work.

When Not To Use

  • For trivial single-hop queries where LLM-only or single-pass retrieval is enough and latency is critical.
  • When you cannot construct reversible segment metadata (offsets, schema, parent pointers).
  • When you lack resources to fine-tune or reliably calibrate the sufficiency head.

Failure Modes

  • Sufficiency false positives cause premature stopping and wrong answers.
  • Poor guidance (cache miss or bad planner output) can force extra iterations or miss supporting evidence.
  • Weak supervision for iterator trajectories can induce noisy policies if positive pools are poor.
  • Canonicalizer bugs or incorrect offsets break audit trails and reproducibility.
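The last failure mode can be caught with a cheap check before evidence reaches the answerer. The sketch below re-slices each source with the recorded offsets and flags items whose snippet no longer matches. The package layout (`id`, `source`, `offset`, `snippet`) and the `verify_package` helper are hypothetical, and the data is made up for the example.

```python
from typing import Dict, List


def verify_package(package: List[dict], sources: Dict[str, str]) -> List[int]:
    """Return ids of evidence items whose offsets do not reproduce their snippet."""
    broken = []
    for item in package:
        text = sources.get(item["source"])
        start, end = item["offset"]
        if text is None or text[start:end] != item["snippet"]:
            broken.append(item["id"])
    return broken


# Toy source and evidence package; the second offset is off by one on purpose.
sources = {"doc1": "Acme was founded in 1990.\nIt is based in Oslo."}
package = [
    {"id": 1, "source": "doc1", "offset": (0, 25), "snippet": "Acme was founded in 1990."},
    {"id": 2, "source": "doc1", "offset": (25, 45), "snippet": "It is based in Oslo."},
]
broken = verify_package(package, sources)
```

Running a check like this in CI and at serving time turns a silent audit-trail break into an explicit error.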

Core Entities

Models

  • Qwen3-4B-Instruct-2507
  • Falcon-H1-7B-Instruct
  • Falcon3-10B-instruct
  • Llama-3.1-8B-Instruct
  • Falcon3-3B-instruct
  • Llama-3.2-3B-Instruct

Metrics

  • Accuracy
  • F1
  • Iteration Steps
  • Latency (ms)
  • Tokens inspected

Datasets

  • HotpotQA
  • HybridQA
  • TAT-QA
  • MetaQA-2Hop
  • MetaQA-3Hop

Benchmarks

  • HotpotQA (text multi-hop)
  • HybridQA (table+text)
  • TAT-QA (financial table+text)
  • MetaQA (KG 2/3-hop)

Context Entities

Models

  • Falcon-H1 family
  • DeepSeek-R1-Distill-Qwen-7B