Build a DAG of chunk synopses and use MCTS to find relevant facts for long‑context QA

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.4

Citation Count

0

Authors

Zhirui Chen, Wei Shen, Jiashui Huang, Ling Shao

Links

Abstract / PDF

Why It Matters For Business

JERR improves answer accuracy and long‑range recall on long documents while producing an interpretable graph of facts; build once and reuse graphs to amortize cost.

Summary TLDR

JERR is an agent-style pipeline that turns long documents into chunk synopses, deduplicates entities, builds a directed acyclic graph (DAG) of facts, then uses Monte Carlo Tree Search (MCTS) to pick the most relevant graph nodes for answering questions. On three long‑context QA benchmarks (QuALITY, MuSiQue, NarrativeQA) JERR improves accuracy and recall metrics versus retrieval and agent baselines. It costs more tokens during graph construction but supports graph reuse to cut per‑query cost.
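As a rough illustration of the graph stage described above, the sketch below stores one synopsis per node and refuses any edge that would close a cycle, so the structure stays a DAG. The `SynopsisDAG` class, its method names, and the node ids are hypothetical; this summary does not say how JERR represents its graph or decides which facts to link.

```python
from collections import defaultdict


class SynopsisDAG:
    """Hypothetical container: node id -> synopsis text, plus directed edges."""

    def __init__(self):
        self.synopses = {}
        self.children = defaultdict(set)

    def add_node(self, node_id, synopsis):
        self.synopses[node_id] = synopsis

    def _reachable(self, start, target):
        # Iterative depth-first search from start, looking for target.
        stack, seen = [start], set()
        while stack:
            node = stack.pop()
            if node == target:
                return True
            if node in seen:
                continue
            seen.add(node)
            stack.extend(self.children[node])
        return False

    def add_edge(self, src, dst):
        """Add src -> dst unless it would create a cycle. Returns True if added."""
        if src not in self.synopses or dst not in self.synopses:
            raise KeyError("both endpoints must be added as nodes first")
        if self._reachable(dst, src):
            return False  # dst already leads back to src, so this edge is rejected
        self.children[src].add(dst)
        return True

    def topological_order(self):
        # Kahn's algorithm; sorting keeps the output deterministic.
        indegree = {node: 0 for node in self.synopses}
        for src in self.synopses:
            for dst in self.children[src]:
                indegree[dst] += 1
        ready = sorted(node for node, deg in indegree.items() if deg == 0)
        order = []
        while ready:
            node = ready.pop(0)
            order.append(node)
            for dst in sorted(self.children[node]):
                indegree[dst] -= 1
                if indegree[dst] == 0:
                    ready.append(dst)
        return order
```

Checking reachability on every insert is simple but costs a traversal per edge; for large graphs a batch build followed by a single cycle check may be a better fit.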

Problem Statement

Transformer LLMs struggle to reason over very long inputs and to retrieve the small set of facts needed for complex questions. We need a practical, interpretable method that filters redundancy, preserves causal links, and finds relevant facts without retraining the LLM.

Main Contribution

JERR: a three-stage pipeline of synopsis extraction, deduplication plus DAG construction, and MCTS-based graph search for reasoning.

A two-stage deduplication (Bloom Filter + Trie for exact matches, SimHash for near-duplicates) to compress facts before graph building.

Use of MCTS to select top‑k relevant graph nodes for a question, combined with synopses and selected original chunks to generate answers.

Empirical gains on three long‑context QA datasets, ablations showing that MCTS and the top-k setting matter, and a token cost analysis that includes graph reuse.
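The two-stage deduplication listed above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: a plain Python set stands in for the Trie that confirms Bloom filter hits, and `BloomFilter`, `simhash`, `deduplicate`, the whitespace tokenizer, and the `max_hamming=3` threshold are all assumptions made for the example.

```python
import hashlib


def _hash64(text, salt=""):
    """64-bit hash of text; the salt yields independent hash functions."""
    digest = hashlib.blake2b((salt + text).encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


class BloomFilter:
    """Probabilistic set: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item):
        return [_hash64(item, salt=str(i)) % self.num_bits
                for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def simhash(text):
    """64-bit SimHash over whitespace tokens (assumed tokenizer)."""
    weights = [0] * 64
    for token in text.lower().split():
        h = _hash64(token)
        for bit in range(64):
            weights[bit] += 1 if (h >> bit) & 1 else -1
    return sum(1 << bit for bit in range(64) if weights[bit] > 0)


def hamming(a, b):
    return bin(a ^ b).count("1")


def deduplicate(facts, max_hamming=3):
    bloom, confirmed, fingerprints, kept = BloomFilter(), set(), [], []
    for fact in facts:
        key = " ".join(fact.lower().split())  # normalize case and spacing
        # Stage 1: exact match. The Bloom filter is the cheap pre-check and
        # the set (standing in for a Trie) rules out false positives.
        if key in bloom and key in confirmed:
            continue
        # Stage 2: near-duplicate if the fingerprints differ in few bits.
        fp = simhash(key)
        if any(hamming(fp, other) <= max_hamming for other in fingerprints):
            continue
        bloom.add(key)
        confirmed.add(key)
        fingerprints.append(fp)
        kept.append(fact)
    return kept
```

The linear scan over fingerprints is fine for a small sample; at scale, near-duplicate lookup is usually done by bucketing fingerprints on bit bands rather than comparing against every stored value.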

Key Findings

JERR yields the best accuracy on QuALITY multi-choice QA.

Numbers: 86.39% (JERR) vs 85.02% (GraphRAG) (Table 2)

JERR improves long‑range recall on multi‑hop QA (MuSiQue).

Numbers: LR-1 0.455 vs 0.410; LR-2 0.595 vs 0.550 (Table 3)

MCTS selection beats PageRank for node selection.

Numbers: 86.39% (MCTS) vs 81.69% (PageRank), +4.7 pp (Table 4)

Graph construction increases token cost but can be amortized by reuse.

Numbers: Avg tokens per query 98.54k (JERR) vs 79.98k (ReadAgent); with graph reuse, 44.33k (Table 6)
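To put the token figures in the last finding side by side, the short calculation below derives relative differences from the reported per-query averages. The variable and function names are illustrative, and the result assumes the 44.33k reuse figure is directly comparable to the other two; this summary does not say how construction cost is counted in it.

```python
# Reported average tokens per query, in thousands (Table 6 of the paper).
JERR_TOKENS_K = 98.54
READAGENT_TOKENS_K = 79.98
JERR_REUSE_TOKENS_K = 44.33


def relative_change(new, old):
    """Fractional change from old to new (positive means new is larger)."""
    return (new - old) / old


# Extra tokens JERR spends per query compared with ReadAgent, without reuse.
overhead_vs_readagent = relative_change(JERR_TOKENS_K, READAGENT_TOKENS_K)

# Share of JERR's own per-query tokens saved once the graph is reused.
saving_from_reuse = -relative_change(JERR_REUSE_TOKENS_K, JERR_TOKENS_K)

# How far below ReadAgent the reused-graph figure falls.
reuse_vs_readagent = -relative_change(JERR_REUSE_TOKENS_K, READAGENT_TOKENS_K)

print(f"JERR overhead vs ReadAgent: {overhead_vs_readagent:.1%}")
print(f"Saving from graph reuse:    {saving_from_reuse:.1%}")
print(f"Reuse vs ReadAgent:         {reuse_vs_readagent:.1%}")
```

On these averages JERR costs roughly 23% more tokens per query than ReadAgent without reuse, while reuse cuts JERR's own figure by about 55%, which lands about 45% below ReadAgent.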

Results

QuALITY accuracy

Value: 86.39%

Baseline: GraphRAG 85.02%

MuSiQue LR-1 / LR-2 / F1

Value: 0.455 / 0.595 / 0.505

Baseline: GraphRAG 0.410 / 0.550 / 0.488

NarrativeQA R-1 / R-2 / F1

Value: 0.234 / 0.215 / 0.269

Baseline: GraphRAG 0.221 / 0.205 / 0.254

QuALITY accuracy, MCTS vs PageRank node selection

Value: 86.39% (JERR w/ MCTS)

Baseline: JERR w/ PageRank 81.69%

Token consumption per query

Value: 98.54k (JERR)

Baseline: ReadAgent 79.98k; qwen-plus-128k 16.99k

Who Should Care

Product Manager
ML Engineer
Engineering Lead
Data Scientist

What To Try In 7 Days

Chunk a sample long document, extract synopses, and build a small DAG with Bloom+SimHash dedupe to test graph quality.

Run MCTS to pick top‑k nodes (k=5 default) and compare answers to a simple retrieval baseline.

Measure token and latency cost; test graph reuse to see per‑query savings.
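For the MCTS step in the list above, a small self-contained sketch of tree search over a DAG may help as a starting point. Everything specific here is an assumption rather than the paper's formulation: the keyword-overlap `relevance` reward, the UCB1 constant `c`, the discount `gamma`, the random rollout policy, and ranking nodes by mean backed-up value. The function `mcts_top_k` and its parameters are hypothetical names, and `roots` must be non-empty.

```python
import math
import random
from collections import defaultdict


def relevance(question, synopsis):
    """Assumed reward: fraction of question tokens found in the synopsis."""
    q = set(question.lower().split())
    s = set(synopsis.lower().split())
    return len(q & s) / len(q) if q else 0.0


class TreeNode:
    def __init__(self, graph_node, parent=None):
        self.graph_node = graph_node  # None marks the virtual root
        self.parent = parent
        self.children = []
        self.visits = 0
        self.total_reward = 0.0


def mcts_top_k(synopses, edges, roots, question, k=5,
               iterations=200, c=1.4, gamma=0.5, seed=0):
    rng = random.Random(seed)
    rel = {n: relevance(question, text) for n, text in synopses.items()}
    root = TreeNode(None)
    stats = defaultdict(lambda: [0, 0.0])  # graph node -> [visits, value sum]

    def successors(tree_node):
        if tree_node.graph_node is None:
            return list(roots)
        return list(edges.get(tree_node.graph_node, []))

    for _ in range(iterations):
        # Selection: follow UCB1 while the node is fully expanded.
        node = root
        while True:
            succ = successors(node)
            if not succ or len(node.children) < len(succ):
                break
            node = max(
                node.children,
                key=lambda ch: ch.total_reward / ch.visits
                + c * math.sqrt(math.log(node.visits) / ch.visits),
            )
        # Expansion: attach one successor that has not been tried yet.
        if succ:
            tried = {ch.graph_node for ch in node.children}
            untried = [s for s in succ if s not in tried]
            child = TreeNode(rng.choice(untried), parent=node)
            node.children.append(child)
            node = child
        # Simulation: random walk down the DAG, summing discounted relevance.
        current = node.graph_node
        value, discount = rel[current], gamma
        while edges.get(current):
            current = rng.choice(list(edges[current]))
            value += discount * rel[current]
            discount *= gamma
        # Backpropagation: each ancestor adds its own relevance on the way up.
        while node is not None:
            node.visits += 1
            if node.graph_node is not None:
                node.total_reward += value
                stats[node.graph_node][0] += 1
                stats[node.graph_node][1] += value
            parent = node.parent
            if parent is not None and parent.graph_node is not None:
                value = rel[parent.graph_node] + gamma * value
            node = parent

    ranked = sorted(stats, key=lambda n: (-stats[n][1] / stats[n][0], n))
    return ranked[:k]
```

A call such as `mcts_top_k(synopses, edges, roots=["a"], question="...", k=5)` expects `synopses` as a dict of node id to text and `edges` as a dict of node id to a list of child ids. Swapping the keyword reward for an embedding similarity is the obvious next experiment, given that sparse keywords are listed as a failure mode.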

Agent Features

Memory

synopsis summaries (short structured memory)

Planning

Monte Carlo Tree Search (MCTS)

Tool Use

autogen chunking
qwen-plus-128k
text-embedding v3

Frameworks

graph-based agent

Is Agentic

true

Architectures

DAG (directed acyclic graph)

Optimization Features

Token Efficiency

synopsis compression
graph reuse to amortize cost

Reproducibility

Data Urls

Data Available

Open Source Status

partial

Risks & Boundaries

Limitations

Validated only on QuALITY, MuSiQue and NarrativeQA; generalization to other domains unknown.
Relies on Qwen API for node extraction; graph quality depends on that pipeline.
Graph construction is complex and adds up‑front token cost per dataset.

When Not To Use

When per‑query token or latency budgets are very tight and you cannot amortize graph construction.
For short documents where full-context LLMs already handle the input efficiently.
If you cannot operate external APIs required for node extraction.

Failure Modes

Noisy or incorrect node extraction leads to wrong edges and hallucinated answers.
MCTS may focus on keyword overlap and miss semantically relevant nodes if keywords are sparse.
Large upfront token use during graph building can make experiments expensive without reuse.

Core Entities

Models

qwen-plus-128k
GPT-4-128k

Metrics

Accuracy
ROUGE-1
ROUGE-2
ROUGE-L
F1
LR-1
LR-2
