PIKE-RAG: make RAG work on industrial, domain-specific queries using 'atomic' knowledge and rationale-aware decomposition

January 20, 2025 · 8 min

Overview

Production Readiness

0.7

Novelty Score

0.7

Cost Impact Score

0.6

Citation Count

0

Authors

Jinyu Wang, Jingjing Fu, Rui Wang, Lei Song, Jiang Bian

Why It Matters For Business

PIKE-RAG turns heterogeneous, domain-specific documents into a structured knowledge base (KB) and reasons iteratively over atomized facts. The aim is to reduce incorrect answers in domains such as legal, medical, and engineering QA and to speed production deployment of RAG-powered tools.

Summary TLDR

PIKE-RAG is a modular RAG framework aimed at industrial, domain-specific tasks. It builds a multi-layer heterogeneous knowledge graph, extracts small "atomic" knowledge items (questions that each chunk can answer), and runs knowledge-aware task decomposition to iteratively retrieve and reason. The paper shows consistent gains on multi-hop open benchmarks (HotpotQA, 2WikiMultiHopQA, MuSiQue) and legal benchmarks by combining hierarchical retrieval, atomized knowledge, auto-tagging, and a trainable decomposition proposer. Code is released.

Problem Statement

Standard RAG systems rely on plain-text retrieval and generic chunking. They struggle with diverse industrial corpora (tables, figures, references), domain jargon, multi-hop linking, and tasks that need prediction or creative solutions. The paper asks how to extract, represent, and use specialized knowledge and rationale so that RAG systems can scale from simple factual QA to prediction and creative tasks.

Main Contribution

A staged RAG paradigm (L0–L4) that defines capability levels from knowledge-base construction to multi-agent creative reasoning.

PIKE-RAG framework: multi-layer heterogeneous graph + modular pipeline for parsing, extraction, retrieval, organization, and knowledge-centric reasoning.

Knowledge atomizing: tag each chunk with many atomic questions to bridge query-corpus phrasing gaps and enable fine-grained retrieval.

Knowledge-aware task decomposition: an iterative proposer that plans retrieval and reasoning using the available atomic knowledge, plus a data-collection procedure for training the decomposer.

Empirical evaluation: consistent improvements across three multi-hop open benchmarks and legal benchmarks; ablations show benefit of hierarchical/atomic retrieval and fine-tuned atomic proposers.
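
A minimal sketch of knowledge atomizing and the retrieval it enables, assuming the atomic questions for each chunk have already been generated (in the paper this is done by an LLM). The token-overlap scorer is a stand-in for an embedding model, the sample chunks are invented, and all names (`AtomicIndex`, `add_chunk`, `search`) are hypothetical rather than the released PIKE-RAG API.

```python
import re
from collections import defaultdict


def tokens(text):
    """Lowercase word set; a stand-in for an embedding model."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


class AtomicIndex:
    """Maps atomic questions to the chunks that can answer them (hypothetical)."""

    def __init__(self):
        self.chunks = {}      # chunk_id -> chunk text
        self.questions = []   # (atomic question, chunk_id) pairs

    def add_chunk(self, chunk_id, text, atomic_questions):
        self.chunks[chunk_id] = text
        for question in atomic_questions:
            self.questions.append((question, chunk_id))

    def search(self, query, top_k=2):
        """Score the query against atomic questions; return the best chunk ids."""
        q_tok = tokens(query)
        best = defaultdict(float)
        for question, chunk_id in self.questions:
            a_tok = tokens(question)
            union = q_tok | a_tok
            score = len(q_tok & a_tok) / len(union) if union else 0.0
            best[chunk_id] = max(best[chunk_id], score)
        ranked = sorted(best.items(), key=lambda kv: kv[1], reverse=True)
        return [cid for cid, score in ranked[:top_k] if score > 0]


index = AtomicIndex()
index.add_chunk("c1", "Acme Corp was founded in 1999 by Jane Roe.",
                ["Who founded Acme Corp?", "When was Acme Corp founded?"])
index.add_chunk("c2", "The warranty period for model X is 24 months.",
                ["How long is the warranty for model X?"])
hits = index.search("how long does the model X warranty last")
```

The query is matched against question-shaped text rather than the raw chunk, which is the phrasing gap that atomizing is meant to bridge.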

Key Findings

PIKE-RAG improves multi-hop QA accuracy over baselines on HotpotQA.

Numbers: Accuracy 87.6% (PIKE-RAG) vs 82.6% (Naive RAG w/ R)

PIKE-RAG yields the largest gains on harder multi-hop benchmarks.

Numbers: MuSiQue EM 46.4 vs 32.0 (Naive RAG w/ R); Acc 59.6 vs 44.4

On legal generation tasks PIKE-RAG achieves high semantic accuracy.

Numbers: LawBench task 1-1 Accuracy 90.12% (PIKE-RAG) vs 1.23% (Zero-Shot)

Fine-tuning small "atomic proposers" improves end-to-end performance.

Numbers: MuSiQue eval: GPT-4o+FT 62.14% vs GPT-4o 47.83% (using Llama-3.1-8B proposer baseline)
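
The multi-hop numbers above can be compared directly. A small sketch that computes absolute and relative gains over the Naive RAG w/ R baseline, using only the figures quoted in these findings (the variable names are illustrative):

```python
# (PIKE-RAG, Naive RAG w/ R) pairs as quoted in the findings above.
reported = {
    "HotpotQA Acc": (87.6, 82.6),
    "MuSiQue EM": (46.4, 32.0),
    "MuSiQue Acc": (59.6, 44.4),
}

gains = {}
for name, (pike, base) in reported.items():
    absolute = round(pike - base, 1)                  # percentage points
    relative = round(100 * (pike - base) / base, 1)   # percent of baseline
    gains[name] = (absolute, relative)
    print(f"{name}: +{absolute} points ({relative}% relative)")
```

On these figures the MuSiQue gains are roughly three times the HotpotQA gain in absolute points, which is what the "largest gains on harder benchmarks" finding refers to.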

Results

HotpotQA Accuracy: 87.6% (baseline: Naive RAG w/ R, 82.6%)

2WikiMultiHopQA Exact Match (EM): 66.8% (baseline: Naive RAG w/ R, 51.2%)

MuSiQue F1: 56.62 (baseline: Naive RAG w/ R, 43.31)

LawBench task 1-1 Accuracy: 90.12% (baseline: Zero-Shot CoT, 1.23%)

Accuracy: 98.59% (baseline: GraphRAG Local, 88.27%)

Who Should Care

What To Try In 7 Days

Build a small multi-layer KB for one domain: parse PDFs, extract chunks, and add atomic questions to test retrieval.

Implement auto-tagging: map plain-user terms to domain tags before retrieval to improve recall.

Run the iterative decomposition loop with an off-the-shelf LLM to see if atomic retrieval improves accuracy on a held-out set.
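
For the auto-tagging step, a minimal sketch of mapping plain user terms to domain tags before retrieval. The tag table is hand-written and hypothetical; it does not reproduce the paper's auto-tagging module, and a real system would likely curate or learn these mappings per domain.

```python
import re

# Hypothetical mapping from lay phrases to domain tags (assumed, not from the paper).
TAG_RULES = {
    "heart attack": "myocardial infarction",
    "high blood pressure": "hypertension",
    "blood thinner": "anticoagulant",
}


def auto_tag(query):
    """Return domain tags whose lay phrase appears in the query."""
    lowered = query.lower()
    return [tag for phrase, tag in TAG_RULES.items()
            if re.search(r"\b" + re.escape(phrase) + r"\b", lowered)]


def expand_query(query):
    """Append matched domain tags so retrieval can match corpus wording."""
    tags = auto_tag(query)
    return query if not tags else f"{query} [{'; '.join(tags)}]"


expanded = expand_query("Can I take a blood thinner after a heart attack?")
```

Comparing retrieval recall on a held-out query set with and without `expand_query` is one simple way to check whether tagging helps in a given domain.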

Agent Features

Memory

  • hierarchical knowledge base (graph + distilled layer)
  • atomic question index for chunks

Planning

  • task decomposition
  • knowledge-aware decomposition
  • iterative retrieval-generation loop
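
The planning items above can be sketched as a loop: propose an atomic sub-question, retrieve for it, record the answer, and stop when the proposer has nothing left to ask. The proposer and retriever here are plain-Python stubs standing in for LLM and embedding calls, and the tiny knowledge base is invented for illustration; none of these names come from the PIKE-RAG code.

```python
# Toy knowledge base: atomic question -> answer (invented for illustration).
KB = {
    "who is the ceo of acme": "Jane Roe",
    "where was jane roe born": "Springfield",
}


def propose(question, context):
    """Stub proposer: choose the next atomic question given what is known.
    A real system would prompt an LLM with candidate atomic questions."""
    if "ceo" not in context:
        return "ceo", "who is the ceo of acme"
    if "birthplace" not in context:
        return "birthplace", f"where was {context['ceo'].lower()} born"
    return None  # nothing left to ask


def retrieve(atomic_question):
    """Stub retriever: exact lookup instead of embedding search."""
    return KB.get(atomic_question)


def answer(question, max_steps=5):
    """Iterate propose -> retrieve until done or the step budget runs out."""
    context = {}
    for _ in range(max_steps):
        step = propose(question, context)
        if step is None:
            break
        slot, atomic_question = step
        result = retrieve(atomic_question)
        if result is None:
            break  # retrieval failed; stop rather than guess
        context[slot] = result
    return context.get("birthplace"), context


final, trace = answer("Where was the CEO of Acme born?")
```

The second sub-question depends on the answer to the first, which is why the decomposition is done step by step instead of all at once.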

Tool Use

  • LangChain (file parsing example)
  • LoRA
  • text-embedding-ada-002 (embeddings)

Frameworks

  • PIKE-RAG

Is Agentic

true

Architectures

  • multi-layer heterogeneous graph
  • hierarchical retriever
  • multi-agent planning (L4)

Collaboration

  • multi-agent planning module for multi-perspective reasoning

Optimization Features

Token Efficiency

  • store atomic questions as compact indices to reduce retrieval tokens

Training Optimization

  • LoRA

Inference Optimization

  • limit final context to top-K atomic chunks to control cost
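
One way to read the top-K item: keep only the K best-scoring chunks, deduplicated, under a size cap. A sketch under those assumptions; the scores, the word-count budget (a crude proxy for tokens), the sample chunks, and the function name are all hypothetical.

```python
def build_context(scored_chunks, top_k=3, max_words=50):
    """Select up to top_k distinct chunks by score within a word budget.

    scored_chunks: list of (score, chunk_text) pairs from the retriever.
    """
    selected, seen, used = [], set(), 0
    for score, text in sorted(scored_chunks, key=lambda p: p[0], reverse=True):
        if text in seen:
            continue  # several atomic questions can point at the same chunk
        words = len(text.split())
        if used + words > max_words:
            continue  # skip chunks that would exceed the budget
        selected.append(text)
        seen.add(text)
        used += words
        if len(selected) == top_k:
            break
    return "\n\n".join(selected)


candidates = [
    (0.91, "Clause 4 sets the notice period at 30 days."),
    (0.88, "Clause 4 sets the notice period at 30 days."),  # duplicate hit
    (0.75, "Clause 9 allows termination for material breach."),
    (0.40, "The agreement is governed by the laws of the state."),
    (0.10, "Signature page follows."),
]
context = build_context(candidates, top_k=2)
```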

Reproducibility

Code Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Building and maintaining a multi-layer heterogeneous graph and distilled knowledge is resource-intensive and costly to scale.
  • The approach still depends on the base LLM for complex domain reasoning; LLM limits (hallucination, specialized logic) remain a bottleneck.
  • Atomic-question extraction and decomposer training require labeled trajectories or costly interaction sampling for good performance.

When Not To Use

  • For tiny corpora where flat retrieval is sufficient, the added pipeline complexity may not justify the benefits.
  • When compute or engineering resources cannot support KB construction, atomization, and decomposer fine-tuning.
  • For tasks where no coherent external corpus exists or where answers are purely subjective/creative without factual grounding.

Failure Modes

  • Decomposer proposes low-quality atomic queries, causing retrieval of irrelevant chunks and wrong answers.
  • Knowledge atomizing can generate redundant or noisy atomic questions, increasing retrieval noise and cost.
  • Knowledge graph construction errors or missing multimodal parsing (tables, figures) lead to blind spots and incorrect retrieval.

Core Entities

Models

  • GPT-4 (used as generator and evaluator)
  • GPT-4o (used in experiments)
  • Llama-3.1-70B-Instruct
  • meta-llama/Llama-3.1-8B
  • Qwen2.5-14B
  • phi-4-14B
  • text-embedding-ada-002

Metrics

  • Exact Match (EM)
  • F1
  • Accuracy
  • Precision
  • Recall
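
For reference, Exact Match and token-level F1 are commonly computed as below in QA evaluation (SQuAD-style answer normalization). This is a sketch of the usual convention, not necessarily the exact scoring script used in the paper.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, reference):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def f1_score(prediction, reference):
    """Harmonic mean of token-level precision and recall."""
    pred = normalize(prediction).split()
    ref = normalize(reference).split()
    if not pred or not ref:
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


em = exact_match("The Eiffel Tower", "eiffel tower")
f1 = f1_score("the tall Eiffel Tower in Paris", "Eiffel Tower")
```

EM rewards only a full match after normalization, while F1 gives partial credit for overlapping tokens, so the two can diverge noticeably on long answers.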

Datasets

  • HotpotQA
  • 2WikiMultiHopQA
  • MuSiQue
  • LawBench
  • Open Australian Legal QA

Benchmarks

  • HotpotQA
  • 2WikiMultiHopQA
  • MuSiQue
  • LawBench
  • Open Australian Legal QA

Context Entities

Models

  • GraphRAG (compared baseline)
  • Self-Ask (compared baseline)
  • Naive RAG (baseline)