Build a DAG of chunk synopses and use MCTS to find relevant facts for long‑context QA

August 28, 2025 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.4

Citation Count

0

Authors

Zhirui Chen, Wei Shen, Jiashui Huang, Ling Shao

Why It Matters For Business

JERR improves answer accuracy and long‑range recall on long documents while producing an interpretable graph of facts; build once and reuse graphs to amortize cost.

Summary TLDR

JERR is an agent-style pipeline that turns long documents into chunk synopses, deduplicates entities, builds a directed acyclic graph (DAG) of facts, then uses Monte Carlo Tree Search (MCTS) to pick the most relevant graph nodes for answering questions. On three long‑context QA benchmarks (QuALITY, MuSiQue, NarrativeQA) JERR improves accuracy and recall metrics versus retrieval and agent baselines. It costs more tokens during graph construction but supports graph reuse to cut per‑query cost.
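To make the graph stage more concrete, here is a minimal sketch of a fact DAG built on Python's standard library. The `FactGraph` class, the node ids, and the sample facts are all hypothetical; the summary does not describe the paper's actual data structures, so treat this only as one plausible shape for "synopses as nodes, causal links as edges, no cycles allowed".

```python
from dataclasses import dataclass, field
from graphlib import CycleError, TopologicalSorter


@dataclass
class FactGraph:
    """Hypothetical container: node id -> synopsis text, plus directed edges."""
    facts: dict[str, str] = field(default_factory=dict)
    # preds[n] holds the nodes that n depends on (edges point pred -> n).
    preds: dict[str, set[str]] = field(default_factory=dict)

    def add_fact(self, node_id: str, text: str) -> None:
        self.facts[node_id] = text
        self.preds.setdefault(node_id, set())

    def add_edge(self, src: str, dst: str) -> bool:
        """Add src -> dst; refuse and roll back if it would create a cycle."""
        if src not in self.facts or dst not in self.facts:
            raise KeyError("both endpoints must be added as facts first")
        if src in self.preds[dst]:
            return True
        self.preds[dst].add(src)
        try:
            self.order()
        except CycleError:
            self.preds[dst].discard(src)
            return False
        return True

    def order(self) -> list[str]:
        """Return node ids in a topological order (causes before effects)."""
        return list(TopologicalSorter(self.preds).static_order())


# Illustrative facts only; not taken from any benchmark.
g = FactGraph()
g.add_fact("c1", "Alice inherits the farm.")
g.add_fact("c2", "Alice sells the farm to Bob.")
g.add_fact("c3", "Bob converts the farm into a vineyard.")
g.add_edge("c1", "c2")
g.add_edge("c2", "c3")
print(g.order())
```

Rejecting an edge that would close a cycle keeps the structure a DAG at every step, which is what lets a later search walk from causes to effects without looping.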

Problem Statement

Transformer LLMs struggle to reason over very long inputs and to retrieve the small set of facts needed for complex questions. We need a practical, interpretable method that filters redundancy, preserves causal links, and finds relevant facts without retraining the LLM.

Main Contribution

JERR: a three-stage pipeline of synopsis extraction, deduplication plus DAG construction, and MCTS-based graph search for reasoning.

A two-stage deduplication (Bloom Filter + Trie for exact matches, SimHash for near-duplicates) to compress facts before graph building.

Use of MCTS to select top‑k relevant graph nodes for a question, combined with synopses and selected original chunks to generate answers.

Empirical gains on three long‑context QA datasets, ablations showing that MCTS and the top-k setting matter, and a token-cost analysis that covers graph reuse.
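The two-stage deduplication listed above can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's code: a plain Python `set` of normalized strings stands in for the Bloom Filter + Trie exact-match stage, the near-duplicate stage is a textbook 64-bit SimHash over word tokens, and the `max_distance=3` Hamming threshold and the tokenizer are hypothetical choices.

```python
import hashlib
import re


def tokens(text: str) -> list[str]:
    """Lowercase alphanumeric word tokens (an assumed normalization)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def simhash(text: str, bits: int = 64) -> int:
    """Classic SimHash: sum signed bit votes over token hashes, keep the sign."""
    votes = [0] * bits
    for tok in tokens(text):
        digest = hashlib.blake2b(tok.encode(), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if votes[i] > 0)


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def dedupe(facts: list[str], max_distance: int = 3) -> list[str]:
    seen_exact: set[str] = set()   # stand-in for the Bloom Filter + Trie stage
    kept_hashes: list[int] = []
    kept: list[str] = []
    for fact in facts:
        key = " ".join(tokens(fact))
        if key in seen_exact:      # stage 1: exact match after normalization
            continue
        seen_exact.add(key)
        fp = simhash(fact)
        # Stage 2: drop facts whose fingerprint is close to one already kept.
        if any(hamming(fp, other) <= max_distance for other in kept_hashes):
            continue
        kept_hashes.append(fp)
        kept.append(fact)
    return kept


print(dedupe([
    "Alice sold the farm.",
    "alice sold the farm",     # exact duplicate after normalization
    "the farm Alice sold",     # same bag of words, so identical SimHash
]))
```

The linear scan over `kept_hashes` is fine for a small sample; at scale one would typically index fingerprints by bit bands instead, which this sketch leaves out.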

Key Findings

JERR yields the best accuracy on QuALITY multi-choice QA.

Numbers: 86.39% (JERR) vs 85.02% (GraphRAG) (Table 2)

JERR improves long‑range recall on multi‑hop QA (MuSiQue).

Numbers: LR-1 0.455 vs 0.410; LR-2 0.595 vs 0.550 (Table 3)

MCTS selection beats PageRank for node selection.

Numbers: 86.39% (MCTS) vs 81.69% (PageRank), +4.7 pp (Table 4)

Graph construction increases token cost but can be amortized by reuse.

Numbers: avg. tokens per query 98.54k (JERR) vs 79.98k (ReadAgent); JERR with graph reuse: 44.33k (Table 6)
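A quick check of what these reported figures imply, using only the numbers quoted above. The summary does not say how many queries the reuse figure is averaged over, so the ratios below are simple comparisons of the reported per-query averages and nothing more.

```python
# Reported averages, in thousands of tokens per query (Table 6).
jerr, jerr_reuse, readagent = 98.54, 44.33, 79.98
# Reported accuracy in percent (Table 4).
mcts_acc, pagerank_acc = 86.39, 81.69


def pct_drop(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before


reuse_vs_cold = round(pct_drop(jerr, jerr_reuse), 1)
reuse_vs_readagent = round(pct_drop(readagent, jerr_reuse), 1)
cold_overhead = round(100.0 * (jerr - readagent) / readagent, 1)
mcts_gain_pp = round(mcts_acc - pagerank_acc, 2)

print(reuse_vs_cold)        # 55.0  (% fewer tokens with graph reuse vs JERR without)
print(reuse_vs_readagent)   # 44.6  (% fewer tokens with graph reuse vs ReadAgent)
print(cold_overhead)        # 23.2  (% more tokens than ReadAgent without reuse)
print(mcts_gain_pp)         # 4.7   (percentage points, MCTS over PageRank)
```

So on these averages JERR costs about 23% more tokens than ReadAgent without reuse, and about 45% fewer with it.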

Results

Accuracy (QuALITY)

Value: 86.39%

Baseline: GraphRAG 85.02%

MuSiQue LR-1 / LR-2 / F1

Value: 0.455 / 0.595 / 0.505

Baseline: GraphRAG 0.410 / 0.550 / 0.488

NarrativeQA R-1 / R-2 / F1

Value: 0.234 / 0.215 / 0.269

Baseline: GraphRAG 0.221 / 0.205 / 0.254

Accuracy (MCTS vs. PageRank node selection)

Value: 86.39% (JERR w/ MCTS)

Baseline: JERR w/ PageRank 81.69%

Token consumption per query

Value: 98.54k (JERR)

Baseline: ReadAgent 79.98k; qwen-plus-128k 16.99k

What To Try In 7 Days

Chunk a sample long document, extract synopses, and build a small DAG with Bloom+SimHash dedupe to test graph quality.

Run MCTS to pick top‑k nodes (k=5 default) and compare answers to a simple retrieval baseline.

Measure token and latency cost; test graph reuse to see per‑query savings.
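For the MCTS step, a small self-contained prototype along these lines may be enough to start experimenting. It is a sketch under several assumptions that the summary does not confirm: the search state is the tuple of nodes picked so far, an action adds a graph neighbor of the current picks (or any node when nothing is picked yet), and the reward is the mean Jaccard keyword overlap between the question and the picked synopses. The names `mcts_top_k`, `legal_actions`, and `overlap` are hypothetical, and JERR's actual reward and search space may differ.

```python
import math
import random
import re
from collections import defaultdict


def overlap(question, text):
    """Assumed relevance score: Jaccard overlap of lowercase word sets."""
    q = set(re.findall(r"\w+", question.lower()))
    t = set(re.findall(r"\w+", text.lower()))
    return len(q & t) / len(q | t) if (q | t) else 0.0


def legal_actions(selected, facts, neighbors):
    """Next picks: graph neighbors of current picks, else any unpicked node."""
    chosen = set(selected)
    frontier = sorted({n for s in selected for n in neighbors[s]} - chosen)
    return frontier or sorted(set(facts) - chosen)


class TreeNode:
    def __init__(self, selected, parent=None):
        self.selected = selected   # tuple of picked graph node ids
        self.parent = parent
        self.children = {}         # action (node id) -> TreeNode
        self.visits = 0
        self.value = 0.0           # sum of rewards backed up through this node


def mcts_top_k(facts, edges, question, k=5, iterations=400, c=1.4, seed=0):
    rng = random.Random(seed)
    k = min(k, len(facts))
    if k <= 0:
        return []
    neighbors = defaultdict(set)
    for src, dst in edges:         # edge direction is ignored when expanding
        neighbors[src].add(dst)
        neighbors[dst].add(src)
    score = {n: overlap(question, text) for n, text in facts.items()}

    root = TreeNode(())
    for _ in range(iterations):
        node = root
        while len(node.selected) < k:
            acts = legal_actions(node.selected, facts, neighbors)
            untried = [a for a in acts if a not in node.children]
            if untried:            # expansion: add one new child and stop
                a = rng.choice(untried)
                node.children[a] = TreeNode(node.selected + (a,), node)
                node = node.children[a]
                break
            log_n = math.log(node.visits)   # selection: UCT over children
            node = max(
                node.children.values(),
                key=lambda ch: ch.value / ch.visits
                + c * math.sqrt(log_n / ch.visits),
            )
        selected = node.selected   # rollout: finish with random legal picks
        while len(selected) < k:
            selected += (rng.choice(legal_actions(selected, facts, neighbors)),)
        reward = sum(score[n] for n in selected) / k
        while node is not None:    # backpropagation up to the root
            node.visits += 1
            node.value += reward
            node = node.parent

    node, picks = root, ()
    while len(picks) < k and node.children:   # follow most-visited children
        node = max(node.children.values(), key=lambda ch: ch.visits)
        picks = node.selected
    while len(picks) < k:          # tree shallower than k: finish greedily
        acts = legal_actions(picks, facts, neighbors)
        picks += (max(acts, key=score.get),)
    return list(picks)


# Illustrative toy graph; facts and edge are made up for the example.
facts = {
    "a": "Bob planted a vineyard",
    "b": "Alice sold the farm to Bob",
    "c": "The river flooded in spring",
}
edges = [("b", "a")]
top = mcts_top_k(facts, edges, "Who sold the farm to Bob?", k=2)
print(top)
```

Because the reward here is pure keyword overlap, this prototype shares the failure mode of missing semantically relevant nodes with sparse keywords; swapping `overlap` for an embedding similarity is an obvious variation to compare against the retrieval baseline.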

Agent Features

Memory

  • synopsis summaries (short structured memory)

Planning

  • Monte Carlo Tree Search (MCTS)

Tool Use

  • autogen chunking
  • qwen-plus-128k
  • text-embedding v3

Frameworks

  • graph-based agent

Is Agentic

true

Architectures

  • DAG (directed acyclic graph)

Optimization Features

Token Efficiency

  • synopsis compression
  • graph reuse to amortize cost

Reproducibility

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Validated only on QuALITY, MuSiQue and NarrativeQA; generalization to other domains unknown.
  • Relies on Qwen API for node extraction; graph quality depends on that pipeline.
  • Graph construction is complex and adds up‑front token cost per dataset.

When Not To Use

  • When per‑query token or latency budgets are very tight and you cannot amortize graph construction.
  • For short documents where full-context LLMs already handle the input efficiently.
  • If you cannot operate external APIs required for node extraction.

Failure Modes

  • Noisy or incorrect node extraction leads to wrong edges and hallucinated answers.
  • MCTS may focus on keyword overlap and miss semantically relevant nodes if keywords are sparse.
  • Large upfront token use during graph building can make experiments expensive without reuse.

Core Entities

Models

  • qwen-plus-128k
  • GPT-4-128k

Metrics

  • Accuracy
  • ROUGE-1
  • ROUGE-2
  • ROUGE-L
  • F1
  • LR-1
  • LR-2

Datasets

  • QuALITY
  • MuSiQue
  • NarrativeQA

Benchmarks

  • QuALITY
  • MuSiQue
  • NarrativeQA