Overview
Production Readiness
0.6
Novelty Score
0.7
Cost Impact Score
0.6
Citation Count
0
Why It Matters For Business
RELOOP reduces wasted retrieval work and provides explicit provenance. This yields more accurate multi-step answers across mixed data formats while keeping latency and token/tool costs predictable.
Summary TLDR
RELOOP turns retrieval into short, guided iterations over a single, reversible hierarchical sequence that encodes text, tables, and knowledge-graph triples. A lightweight planner (head) gives a retrieval prior, a trained iterator picks small windows of segments and predicts when evidence is sufficient, and a canonicalizer packages provenance for the answerer. The system improves multi-hop accuracy across text/table/KG QA while keeping loops short and predictable and evidence provenance explicit.
Problem Statement
Current RAG pipelines either run a single big retrieval step that misses multi-step evidence chains, or use unconstrained agentic loops that explode tool/token costs and lack a clear stop rule. Different data formats (text, tables, KGs) also force separate retrievers and controllers, complicating deployment and audit.
Main Contribution
HSEQ: a reversible hierarchical sequence that linearizes text, table rows, and KG triples into typed segments with parent pointers and offsets for provenance.
A budget-aware iterator (RELOOP-I) that selects small windows of segments, expands structure-aware neighborhoods, and predicts a sufficiency stop signal.
A head planner (RELOOP-H) that provides short guidance plans to steer iteration, plus an evidence canonicalizer that packages provenance for the final answer and for optional contradiction-driven refinement.
An open implementation recipe using parameter-efficient fine-tuning (LoRA), deterministic JSON action outputs for the iterator, and cached guidance to reduce overhead.
Empirical results showing consistent QA gains across text, table+text, and KG benchmarks with controllable accuracy-latency tradeoffs.
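The HSEQ idea in the first contribution can be illustrated with a minimal sketch. The `Segment` fields, the `build_hseq` and `recover_text` helpers, and the ` | ` cell separator are hypothetical names chosen for illustration, not the paper's actual schema; the sketch also assumes paragraphs are joined by a single newline in the source.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Segment:
    seg_id: int
    kind: str                # "document", "paragraph", "table", "table_row", "kg", "triple"
    text: str
    parent: Optional[int]    # seg_id of the containing segment, None for roots
    offset: Tuple[int, int]  # char span for paragraphs, index span for rows and triples


def build_hseq(
    paragraphs: Sequence[str],
    header: Sequence[str],
    rows: Sequence[Sequence[str]],
    triples: Sequence[Tuple[str, str, str]],
) -> List[Segment]:
    """Linearize mixed inputs into one typed segment list (illustrative only)."""
    segs: List[Segment] = []

    def add(kind: str, text: str, parent: Optional[int], offset: Tuple[int, int]) -> int:
        segs.append(Segment(len(segs), kind, text, parent, offset))
        return segs[-1].seg_id

    doc = add("document", "", None, (0, 0))
    pos = 0
    for p in paragraphs:
        add("paragraph", p, doc, (pos, pos + len(p)))
        pos += len(p) + 1  # assumption: paragraphs are separated by one newline

    tab = add("table", " | ".join(header), None, (0, len(rows)))
    for i, row in enumerate(rows):
        add("table_row", " | ".join(row), tab, (i, i + 1))

    kg = add("kg", "", None, (0, len(triples)))
    for i, (subj, rel, obj) in enumerate(triples):
        add("triple", f"{subj} | {rel} | {obj}", kg, (i, i + 1))
    return segs


def recover_text(segs: Sequence[Segment]) -> str:
    """Rebuild the original text from paragraph segments (the 'reversible' part)."""
    return "\n".join(s.text for s in segs if s.kind == "paragraph")
```

Each row and triple points back to its table or graph root through `parent`, and each paragraph's `offset` slices the original text exactly, which is what lets a downstream answer cite where its evidence came from.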
Key Findings
RELOOP yields higher QA accuracy than strong baselines across heterogeneous benchmarks.
Guided, budgeted iteration keeps loops short while retaining multi-hop power.
Supervised fine-tuning of the iterator and guidance both matter for accuracy.
A single learned policy can operate across text, tables and KGs via HSEQ.
Results
Accuracy
HybridQA F1
Accuracy
Accuracy
Accuracy
Accuracy
Iteration steps (HotpotQA)
Latency (HotpotQA)
Who Should Care
What To Try In 7 Days
Convert a small mixed corpus to HSEQ: record paragraphs, table rows, and triples with simple offsets and parent pointers.
Fine-tune a small iterator LLM via LoRA to emit compact JSON actions and a sufficiency score, then evaluate it on a held-out dev set.
Add a tiny planner that emits short, cached guidance templates and test iteration depth vs accuracy to pick a production pair.
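For the second step above, it may help to validate the iterator's JSON output before acting on it. The sketch below assumes a hypothetical action schema with two keys, `select` (segment IDs) and `sufficient` (a score in [0, 1]); the source does not specify the actual schema, so treat the key names and the `parse_action` helper as illustrative.

```python
import json
from typing import Dict, Iterable, Optional


def parse_action(raw: str, window_ids: Iterable[int]) -> Optional[Dict[str, object]]:
    """Parse one iterator action; return None if it is malformed.

    Assumed (hypothetical) schema: {"select": [int, ...], "sufficient": float in [0, 1]}.
    """
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(obj, dict):
        return None

    select = obj.get("select")
    suff = obj.get("sufficient")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(select, list) or not all(
        isinstance(i, int) and not isinstance(i, bool) for i in select
    ):
        return None
    if isinstance(suff, bool) or not isinstance(suff, (int, float)):
        return None
    if not 0 <= suff <= 1:
        return None

    # Keep only IDs that were actually shown in the current window,
    # deduplicated and sorted so the result is deterministic.
    allowed = set(window_ids)
    kept = sorted({i for i in select if i in allowed})
    return {"select": kept, "sufficient": float(suff)}
```

Rejecting malformed actions outright, and dropping IDs the model was never shown, keeps a bad generation from silently widening the retrieval budget.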
Agent Features
Memory
- Retrieval memory over HSEQ segments
- Short-term windowed exposure (bounded per step)
Planning
- Head-generated guidance (2–4 sentence plans)
- Heuristic guidance templates
Tool Use
- Iterator emits JSON actions and queries the head planner
- Optional verifier triggers short refinement loops
Frameworks
- RELOOP-I (iterator)
- RELOOP-H (head)
- HSEQ-Adapter (HSEQ-A)
- κ canonicalizer
Is Agentic
true
Architectures
- LLM-based iterator + head agent
- Hierarchical segment stream (HSEQ)
Collaboration
- Single multi-module agent (head + iterator + canonicalizer)
- No multi-agent negotiation reported
Optimization Features
Token Efficiency
- Canonical evidence packages (snippets and IDs) reduce prompt size for the head
- Cached guidance avoids repeated planner calls
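The cached-guidance point can be sketched as a small memoizing wrapper around the planner. How RELOOP keys its cache is not stated here, so the `key_fn` below (a question-type key) and the `GuidanceCache` class are assumptions made purely for illustration.

```python
from typing import Callable, Dict


class GuidanceCache:
    """Memoize planner guidance under a caller-supplied key (illustrative)."""

    def __init__(self, planner: Callable[[str], str], key_fn: Callable[[str], str]):
        self._planner = planner
        self._key_fn = key_fn
        self._store: Dict[str, str] = {}
        self.planner_calls = 0  # counts cache misses, i.e. real planner invocations

    def get(self, question: str) -> str:
        key = self._key_fn(question)
        if key not in self._store:
            self.planner_calls += 1
            self._store[key] = self._planner(question)
        return self._store[key]


def question_type(question: str) -> str:
    # Hypothetical cache key: the lowercased first word ("who", "where", ...).
    words = question.strip().lower().split()
    return words[0] if words else ""
```

With a coarse key like this, two "who" questions share one planner call. The tradeoff is that a key that is too coarse can return guidance that does not fit the question, which is the cache-related failure mode noted under Risks & Boundaries.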
Infra Optimization
- Experiments fit on up to 4 H200 GPUs with mixed precision and 4-bit quantization
Model Optimization
- LoRA
- Mixed precision and 4-bit weight quantization for memory efficiency
System Optimization
- Deterministic JSON outputs to simplify downstream parsing and evaluation
Training Optimization
- Teacher-forcing supervision for early steps
- Weak-positive labeling and per-step weighting to reduce noise
- Curriculum: short-to-long episodes and dataset-mixing quotas
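One way to read "weak-positive labeling and per-step weighting" is as a weighted binary cross-entropy over per-step selection probabilities. The sketch below is an assumption about the general shape of such a loss, not the paper's objective; the label names, the `weak_weight` default, and the normalization by total step weight are all illustrative choices.

```python
import math
from typing import List, Sequence


def weighted_step_loss(
    probs: Sequence[Sequence[float]],
    labels: Sequence[Sequence[str]],
    step_weights: Sequence[float],
    weak_weight: float = 0.5,
) -> float:
    """Weighted BCE over iteration steps (illustrative, not the paper's loss).

    probs[t][i]  : predicted probability of selecting candidate i at step t
    labels[t][i] : "pos", "weak" (noisy positive), or "neg"
    step_weights : one weight per step, e.g. larger for early steps
    """
    eps = 1e-12
    total = 0.0
    for p_t, y_t, w_t in zip(probs, labels, step_weights):
        terms: List[float] = []
        for p, y in zip(p_t, y_t):
            if y == "neg":
                terms.append(-math.log(max(1.0 - p, eps)))
            else:
                # Weak positives use target 1 but contribute less to the loss.
                scale = weak_weight if y == "weak" else 1.0
                terms.append(-scale * math.log(max(p, eps)))
        if terms:
            total += w_t * sum(terms) / len(terms)
    norm = sum(step_weights)
    return total / norm if norm > 0 else 0.0
```

Down-weighting weak positives limits how much a noisy label can pull the policy, and the per-step weights give a simple knob for emphasizing the steps where supervision is most reliable.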
Inference Optimization
- Windowed candidate stream to cap per-step context
- Top-k selection per step and deterministic refresh to bound work
- Early stopping via learned sufficiency head
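The three points above (a bounded window, top-k selection, and a sufficiency-based stop) combine into a short loop. In the sketch below, token overlap stands in for the trained iterator and question-term coverage stands in for the learned sufficiency head; both are deliberate simplifications, and all function names and default budgets are hypothetical.

```python
import re
from typing import List, Sequence, Set, Tuple


def tokens(text: str) -> Set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_score(question: str, segment: str) -> int:
    # Stand-in for the trained iterator's relevance judgment.
    return len(tokens(question) & tokens(segment))


def coverage(question: str, evidence: Sequence[str]) -> float:
    # Stand-in for the learned sufficiency head: share of question terms covered.
    q = tokens(question)
    if not q:
        return 0.0
    seen: Set[str] = set()
    for seg in evidence:
        seen |= tokens(seg)
    return len(q & seen) / len(q)


def iterate(
    question: str,
    segments: Sequence[str],
    window: int = 4,
    top_k: int = 2,
    max_steps: int = 3,
    threshold: float = 1.0,
) -> Tuple[List[str], int]:
    """Scan fixed-size windows, keep at most top_k segments per step, stop early."""
    selected: List[str] = []
    steps = 0
    for start in range(0, len(segments), window):
        if steps >= max_steps:
            break  # hard budget on the number of iterations
        steps += 1
        view = segments[start:start + window]  # caps per-step context
        ranked = sorted(view, key=lambda s: overlap_score(question, s), reverse=True)
        selected.extend(s for s in ranked[:top_k] if overlap_score(question, s) > 0)
        if coverage(question, selected) >= threshold:
            break  # evidence judged sufficient
    return selected, steps
```

Per-step work is bounded by `window` and `top_k`, and total work by `max_steps`, so cost stays predictable even when the stop signal never fires.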
Reproducibility
Data Urls
- HotpotQA (public)
- HybridQA (public)
- TAT-QA (public)
- MetaQA (public)
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Sufficiency judge can fail under noisy or partial evidence; authors note hallucination risk in the sufficiency head.
- Framework assumes HSEQ metadata (offsets, row indices, triple fields) is available, which is costly to produce for arbitrary corpora.
- Latency is still higher than a single-pass, LLM-only baseline; the tradeoff depends on the chosen agent pair and budgets.
- Experiments use single-turn QA; multi-turn and streaming corpora are left for future work.
When Not To Use
- For trivial single-hop queries where LLM-only or single-pass retrieval is enough and latency is critical.
- When you cannot construct reversible segment metadata (offsets, schema, parent pointers).
- When you lack resources to fine-tune or reliably calibrate the sufficiency head.
Failure Modes
- Sufficiency false positives cause premature stopping and wrong answers.
- Poor guidance (cache miss or bad planner output) can force extra iterations or miss supporting evidence.
- Weak supervision for iterator trajectories can induce noisy policies if positive pools are poor.
- Canonicalizer bugs or incorrect offsets break audit trails and reproducibility.
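The last failure mode is cheap to guard against with an offset check at ingestion or before answering. The sketch below assumes each segment is a dict with hypothetical `id`, `text`, and `offset` (character start and end) fields; the real metadata layout may differ.

```python
from typing import Dict, List, Sequence


def verify_offsets(source: str, segments: Sequence[Dict[str, object]]) -> List[object]:
    """Return the IDs of segments whose offsets do not reproduce their text."""
    bad: List[object] = []
    for seg in segments:
        start, end = seg["offset"]  # assumed (start, end) character span
        in_range = 0 <= start <= end <= len(source)
        if not in_range or source[start:end] != seg["text"]:
            bad.append(seg["id"])
    return bad
```

An empty result means every cited span can be traced back to the source verbatim; any returned ID marks evidence whose provenance should not be trusted.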
Core Entities
Models
- Qwen3-4B-Instruct-2507
- Falcon-H1-7B-Instruct
- Falcon3-10B-instruct
- Llama-3.1-8B-Instruct
- Falcon3-3B-instruct
- Llama-3.2-3B-Instruct
Metrics
- Accuracy
- F1
- Iteration Steps
- Latency (ms)
- Tokens inspected
Datasets
- HotpotQA
- HybridQA
- TAT-QA
- MetaQA-2Hop
- MetaQA-3Hop
Benchmarks
- HotpotQA (text multi-hop)
- HybridQA (table+text)
- TAT-QA (financial table+text)
- MetaQA (KG 2/3-hop)
Context Entities
Models
- Falcon-H1 family
- DeepSeek-R1-Distill-Qwen-7B

