Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.4
Citation Count
0
Why It Matters For Business
JERR improves answer accuracy and long-range recall on long documents while producing an interpretable graph of facts. A graph can be built once and reused across queries to amortize its construction cost.
Summary TLDR
JERR is an agent-style pipeline that turns long documents into chunk synopses, deduplicates entities, builds a directed acyclic graph (DAG) of facts, then uses Monte Carlo Tree Search (MCTS) to pick the most relevant graph nodes for answering questions. On three long‑context QA benchmarks (QuALITY, MuSiQue, NarrativeQA) JERR improves accuracy and recall metrics versus retrieval and agent baselines. It costs more tokens during graph construction but supports graph reuse to cut per‑query cost.
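To make the graph-of-facts idea concrete, here is a minimal sketch that stores facts as nodes and checks that the result is acyclic. It assumes facts are plain strings and that an edge points from a supporting fact to the fact that depends on it. The function name `build_fact_dag` and the sample facts are hypothetical and do not come from the paper.

```python
from graphlib import TopologicalSorter, CycleError

def build_fact_dag(edges):
    """Return (predecessor map, topological order) for (src, dst) fact pairs.

    Raises CycleError if the edges do not form a DAG.
    """
    preds = {}
    for src, dst in edges:
        preds.setdefault(src, set())
        preds.setdefault(dst, set()).add(src)
    # static_order() yields predecessors before the facts that depend on them
    order = list(TopologicalSorter(preds).static_order())
    return preds, order

# Hypothetical facts, for illustration only
edges = [("A met B", "B hired A"), ("B hired A", "A moved to Paris")]
preds, order = build_fact_dag(edges)
print(order)
```

Rejecting cycles at build time keeps the later graph search finite, since every walk along the edges must terminate.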
Problem Statement
Transformer LLMs struggle to reason over very long inputs and to retrieve the small set of facts needed for complex questions. We need a practical, interpretable method that filters redundancy, preserves causal links, and finds relevant facts without retraining the LLM.
Main Contribution
JERR: a three-stage pipeline of synopsis extraction, deduplication plus DAG construction, and MCTS-based graph search for reasoning.
A two-stage deduplication step (Bloom Filter + Trie for exact matches, SimHash for near-duplicates) that compresses facts before graph building.
Use of MCTS to select top‑k relevant graph nodes for a question, combined with synopses and selected original chunks to generate answers.
Empirical gains on three long-context QA datasets, ablations showing that MCTS and the top-k setting matter, and a token-cost analysis that discusses graph reuse.
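The two-stage deduplication described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's implementation: a Python `set` stands in for the Trie that confirms Bloom filter hits, tokens are whitespace-split words, and the filter size, hash count, and Hamming threshold of 3 are arbitrary choices.

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter backed by a Python int used as a bit array."""
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size, self.k, self.bits = size_bits, num_hashes, 0

    def _positions(self, item):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for p in self._positions(item):
            self.bits |= 1 << p

    def __contains__(self, item):
        return all((self.bits >> p) & 1 for p in self._positions(item))

def simhash(text, bits=64):
    """Bag-of-words SimHash: each token votes +1 or -1 on every bit."""
    weights = [0] * bits
    for tok in text.lower().split():
        h = int.from_bytes(hashlib.md5(tok.encode()).digest()[:8], "big")
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i, w in enumerate(weights) if w > 0)

def dedupe(facts, max_hamming=3):
    bloom, seen_exact = BloomFilter(), set()  # set = assumed stand-in for the Trie
    kept, fingerprints = [], []
    for fact in facts:
        key = " ".join(fact.lower().split())  # normalize case and whitespace
        # Stage 1: cheap Bloom check, confirmed exactly to rule out false positives
        if key in bloom and key in seen_exact:
            continue
        bloom.add(key)
        seen_exact.add(key)
        # Stage 2: near-duplicate check by Hamming distance between fingerprints
        fp = simhash(key)
        if any(bin(fp ^ other).count("1") <= max_hamming for other in fingerprints):
            continue
        kept.append(fact)
        fingerprints.append(fp)
    return kept

facts = [
    "The cat sat on the mat",
    "the cat  sat on the mat",   # exact duplicate after normalization
    "on the mat the cat sat",    # same bag of words, so identical SimHash
    "Revenue grew in the third quarter",
]
print(dedupe(facts))
```

Because this SimHash variant ignores word order, it treats reordered sentences as duplicates; a shingle-based tokenizer would be stricter.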
Key Findings
JERR yields the best accuracy on QuALITY multi-choice QA.
JERR improves long‑range recall on multi‑hop QA (MuSiQue).
MCTS selection beats PageRank for node selection.
Graph construction increases token cost but can be amortized by reuse.
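The amortization argument is simple arithmetic: a one-time build cost is spread over every query that reuses the graph. The sketch below uses made-up token counts purely for illustration; none of the numbers come from the paper.

```python
def amortized_tokens_per_query(build_tokens, query_tokens, num_queries):
    """Average tokens per query when one graph build serves num_queries queries."""
    if num_queries < 1:
        raise ValueError("num_queries must be at least 1")
    return build_tokens / num_queries + query_tokens

# Hypothetical figures: 200k tokens to build the graph, 3k tokens per query
for n in (1, 10, 100):
    print(n, amortized_tokens_per_query(200_000, 3_000, n))
```

The build term shrinks as 1/n, so the benefit depends on how many questions are actually asked against the same document.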
Results
Accuracy
MuSiQue LR-1 / LR-2 / F1
NarrativeQA R-1 / R-2 / F1
Accuracy
Token consumption per query
Who Should Care
What To Try In 7 Days
Chunk a sample long document, extract synopses, and build a small DAG with Bloom+SimHash dedupe to test graph quality.
Run MCTS to pick top‑k nodes (k=5 default) and compare answers to a simple retrieval baseline.
Measure token and latency cost; test graph reuse to see per‑query savings.
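For the MCTS step, a generic UCT search over a small fact graph is enough to start experimenting. The sketch below is not the paper's search procedure: the keyword-overlap reward, the exploration constant, ranking by visit count, and the names `mcts_top_k`, `graph`, and `texts` are all assumptions made for illustration. It expects `graph` to map each node id to a list of distinct successor ids in a DAG.

```python
import math
import random

class TreeNode:
    def __init__(self, fact, parent=None):
        self.fact, self.parent = fact, parent  # fact is None for the virtual root
        self.children, self.visits, self.value = [], 0, 0.0

def overlap(question, text):
    """Assumed reward: fraction of question words that appear in the text."""
    q, t = set(question.lower().split()), set(text.lower().split())
    return len(q & t) / len(q) if q else 0.0

def mcts_top_k(graph, texts, question, k=5, iterations=200, c=1.4, seed=0):
    rng = random.Random(seed)
    root = TreeNode(None)

    def successors(node):
        return list(texts) if node.fact is None else graph.get(node.fact, [])

    for _ in range(iterations):
        node = root
        # Selection: descend through fully expanded nodes by UCT score
        while node.children and len(node.children) == len(successors(node)):
            parent = node
            node = max(parent.children, key=lambda ch: ch.value / ch.visits
                       + c * math.sqrt(math.log(parent.visits) / ch.visits))
        # Expansion: add one untried successor, if any
        tried = {ch.fact for ch in node.children}
        untried = [s for s in successors(node) if s not in tried]
        if untried:
            child = TreeNode(rng.choice(untried), parent=node)
            node.children.append(child)
            node = child
        # Simulation: random walk along DAG edges, keeping the best overlap seen
        current = node.fact
        reward = overlap(question, texts[current]) if current is not None else 0.0
        while current is not None and graph.get(current):
            current = rng.choice(graph[current])
            reward = max(reward, overlap(question, texts[current]))
        # Backpropagation
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent

    # Rank graph nodes by total visits across every place they appear in the tree
    stats, stack = {}, list(root.children)
    while stack:
        n = stack.pop()
        stats[n.fact] = stats.get(n.fact, 0) + n.visits
        stack.extend(n.children)
    return sorted(stats, key=lambda f: (-stats[f], f))[:k]

# Hypothetical toy graph
texts = {"n1": "Alice founded the company", "n2": "The company moved to Berlin",
         "n3": "Bob likes tea", "n4": "Berlin office opened in 2010"}
graph = {"n1": ["n2"], "n2": ["n4"], "n3": [], "n4": []}
top = mcts_top_k(graph, texts, "When did the Berlin office open", k=2)
print(top)
```

Comparing `top` against the nodes returned by a plain similarity ranking on the same question gives a quick read on whether the search adds anything for your documents.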
Agent Features
Memory
- synopsis summaries (short structured memory)
Planning
- Monte Carlo Tree Search (MCTS)
Tool Use
- autogen chunking
- qwen-plus-128k
- text-embedding-v3
Frameworks
- graph-based agent
Is Agentic
true
Architectures
- DAG (directed acyclic graph)
Optimization Features
Token Efficiency
- synopsis compression
- graph reuse to amortize cost
Reproducibility
Data Urls
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Validated only on QuALITY, MuSiQue and NarrativeQA; generalization to other domains unknown.
- Relies on the Qwen API for node extraction, so graph quality depends on that pipeline.
- Graph construction is complex and adds up‑front token cost per dataset.
When Not To Use
- When per‑query token or latency budgets are very tight and you cannot amortize graph construction.
- For short documents where full-context LLMs already handle the input efficiently.
- If you cannot call the external APIs required for node extraction.
Failure Modes
- Noisy or incorrect node extraction leads to wrong edges and hallucinated answers.
- MCTS may focus on keyword overlap and miss semantically relevant nodes if keywords are sparse.
- Large upfront token use during graph building can make experiments expensive without reuse.
Core Entities
Models
- qwen-plus-128k
- GPT-4-128k
Metrics
- Accuracy
- ROUGE-1
- ROUGE-2
- ROUGE-L
- F1
- LR-1
- LR-2
Datasets
- QuALITY
- MuSiQue
- NarrativeQA
Benchmarks
- QuALITY
- MuSiQue
- NarrativeQA

