SKETCH: combine semantic chunking with knowledge graphs to improve RAG retrieval for complex, multi-context queries

December 19, 2024 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.5

Cost Impact Score

0.4

Citation Count

0

Authors

Aakash Mahalingam, Vinesh Kumar Gande, Aman Chadha, Vinija Jain, Divya Chaudhary

Links

Abstract / PDF

Why It Matters For Business

SKETCH gives more accurate, context-preserving retrieval for complex, multi-part queries, which improves downstream answers and traceability. The trade-off is added KG construction effort and heavier LLM usage.

Summary TLDR

SKETCH combines semantic chunking (split text into meaning-preserving units) with a knowledge graph (structured entities and relations) and a hybrid retriever to improve Retrieval-Augmented Generation (RAG). Evaluated using RAGAS metrics on four datasets (Italian Cuisine, QuALITY, QASPER, NarrativeQA), SKETCH raises answer relevancy and context precision versus Naive RAG and several baselines. Gains are largest for small-domain tests and long-document comprehension, but building large KGs and relying on GPT models raises cost and reproducibility concerns.

Problem Statement

Current RAG systems often lose context when they split text arbitrarily and struggle to combine evidence spread across distant parts of a corpus. This reduces answer relevancy and multi-hop reasoning for complex queries.

Main Contribution

SKETCH: a hybrid retrieval method that fuses semantic chunking (meaningful text chunks) with a knowledge graph (entities + relations).

A concrete indexing pipeline: semantic splitting, then recursive character splitting (100-token chunks, 16-token overlap), embeddings indexed in FAISS, and a KG built from extracted entities.
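The overlap step in that pipeline can be illustrated with a minimal sketch. It assumes tokens are already available as a list (whitespace tokens here, which is an assumption; the summary does not say which tokenizer the paper uses), and the function name `split_with_overlap` is hypothetical. It shows only fixed-size windowing with overlap, not the semantic or recursive splitting stages.

```python
def split_with_overlap(tokens, chunk_size=100, overlap=16):
    """Split a token list into windows of `chunk_size` that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap  # each new window starts this many tokens later
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks


# Synthetic tokens stand in for a real document (assumption for illustration).
tokens = [f"w{i}" for i in range(250)]
chunks = split_with_overlap(tokens)
print([len(c) for c in chunks])
```

With 100-token windows and a 16-token overlap, each chunk repeats the last 16 tokens of the previous one, so a sentence that straddles a boundary is still seen whole by at least one chunk.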

A hybrid query flow: GPT-4 NER feeds Cypher queries against the KG (structured retrieval), cosine similarity over the FAISS embeddings supplies unstructured hits, and the two result sets are merged using token overlap as confirmation.
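The merge step might look like the sketch below. This summary does not give the paper's exact merge rule, so the overlap measure, the `threshold` value, and the "confirmed chunks first" ordering are all assumptions, and `merge_results` is a hypothetical name. KG hits are represented as plain strings and embedding hits as `(text, similarity)` pairs.

```python
def tokenize(text):
    """Lowercased whitespace tokens as a set (a deliberately crude tokenizer)."""
    return set(text.lower().split())


def overlap_ratio(a, b):
    """Fraction of the smaller token set that also appears in the other text."""
    ta, tb = tokenize(a), tokenize(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / min(len(ta), len(tb))


def merge_results(kg_facts, chunks, threshold=0.5):
    """Rank embedding hits, placing those confirmed by a KG fact first.

    kg_facts: list of strings describing KG hits.
    chunks:   list of (text, similarity) pairs from the vector index.
    Returns a list of (text, similarity, confirmed) tuples.
    """
    scored = []
    for text, sim in chunks:
        best = max((overlap_ratio(text, fact) for fact in kg_facts), default=0.0)
        scored.append((best >= threshold, sim, text))
    # Confirmed chunks sort ahead of unconfirmed ones; similarity breaks ties.
    scored.sort(key=lambda item: (item[0], item[1]), reverse=True)
    return [(text, sim, confirmed) for confirmed, sim, text in scored]


# Toy inputs, invented for illustration only.
kg_facts = ["carbonara contains guanciale"]
chunks = [
    ("Pizza dough needs a long cold ferment", 0.82),
    ("Traditional carbonara contains guanciale and pecorino", 0.74),
]
merged = merge_results(kg_facts, chunks)
print(merged[0])
```

In this toy case the carbonara chunk is ranked first even though its similarity score is lower, because the KG fact confirms it.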

Empirical evaluation across four datasets (Italian Cuisine, QuALITY, QASPER, NarrativeQA) using RAGAS metrics and GPT‑3.5-turbo-16k as an automatic judge.

Key Findings

On the small Italian Cuisine test, SKETCH reached very high relevancy and precision.

Numbers: answer_relevancy=0.94; context_precision=0.99

On QuALITY (long-document comprehension), SKETCH improved answer relevancy over Naive RAG.

Numbers: answer_relevancy SKETCH=0.73 vs Naive RAG=0.49 (+49%)

On QASPER (scientific papers), SKETCH achieved very high faithfulness.

Numbers: faithfulness=0.93 (SKETCH) vs Naive RAG=0.61

On NarrativeQA (narrative understanding), SKETCH provided balanced gains in relevancy and faithfulness.

Numbers: answer_relevancy=0.50; faithfulness=0.87

SKETCH depends on GPT models for entity extraction and evaluation, which introduces variability and cost.

Numbers: uses GPT-4 for NER; GPT-3.5-turbo-16k as judge (stated in paper)

Results

Italian Cuisine - answer_relevancy

Value: 0.94

Baseline: Naive RAG=0.61

Italian Cuisine - context_precision

Value: 0.99

Baseline: Naive RAG=0.81

QuALITY - answer_relevancy

Value: 0.73

Baseline: Naive RAG=0.49

QASPER - faithfulness

Value: 0.93

Baseline: Naive RAG=0.61

NarrativeQA - answer_relevancy

Value: 0.50

Baseline: Naive RAG=0.08
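To compare these results on a common footing, the relative gain over Naive RAG can be computed as (value - baseline) / baseline. The short sketch below applies that formula to the numbers listed above; the dictionary layout and the `relative_gain` helper are illustrative choices, not part of the paper.

```python
# (SKETCH value, Naive RAG baseline), copied from the results listed above.
results = {
    ("Italian Cuisine", "answer_relevancy"): (0.94, 0.61),
    ("Italian Cuisine", "context_precision"): (0.99, 0.81),
    ("QuALITY", "answer_relevancy"): (0.73, 0.49),
    ("QASPER", "faithfulness"): (0.93, 0.61),
    ("NarrativeQA", "answer_relevancy"): (0.50, 0.08),
}


def relative_gain(value, baseline):
    """Percentage improvement of `value` over `baseline`."""
    return (value - baseline) / baseline * 100


gains = {key: round(relative_gain(v, b)) for key, (v, b) in results.items()}
for (dataset, metric), pct in gains.items():
    print(f"{dataset} {metric}: +{pct}%")
```

The QuALITY figure reproduces the +49% quoted in Key Findings. The NarrativeQA percentage is very large mainly because the baseline is only 0.08, so the absolute gain (0.42) is the more informative number there.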

Who Should Care

What To Try In 7 Days

Run semantic chunking on a small domain corpus, index the embeddings in FAISS, and compare retrieval quality against fixed-size chunking on the same queries.

Extract entities with an LLM for a small subset, build a toy KG, and run Cypher queries to compare structured vs unstructured hits.

Combine KG results and embedding results with a simple overlap-weight rule and measure relevancy on a handful of multi-context queries using an LLM judge.
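Before setting up FAISS or a graph database, the structured-versus-unstructured comparison can be prototyped in memory. The sketch below is a stand-in under stated assumptions: a list of triples replaces the KG and a Python filter replaces the Cypher query, hand-written 3-dimensional vectors replace real embeddings, and a brute-force cosine ranking replaces the FAISS search. All names and data are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k(query_vec, index, k=2):
    """Brute-force nearest neighbours over (text, vector) pairs."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def kg_lookup(entity, triples):
    """Return every (subject, relation, object) triple that mentions `entity`."""
    return [t for t in triples if entity in (t[0], t[2])]


# Toy KG: a real setup would store these in a graph database.
triples = [
    ("carbonara", "HAS_INGREDIENT", "guanciale"),
    ("carbonara", "ORIGINATES_FROM", "Rome"),
    ("risotto", "HAS_INGREDIENT", "arborio rice"),
]

# Toy "embeddings": hand-written vectors, not model outputs.
index = [
    ("chunk about carbonara", [0.9, 0.1, 0.0]),
    ("chunk about risotto", [0.1, 0.9, 0.0]),
    ("chunk about wine", [0.0, 0.2, 0.9]),
]

structured_hits = kg_lookup("carbonara", triples)
unstructured_hits = top_k([1.0, 0.2, 0.0], index, k=2)
print(structured_hits)
print(unstructured_hits)
```

Comparing the two hit lists per query shows where the KG returns precise facts that the embedding search misses, and where the embedding search surfaces context the KG does not encode.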

Reproducibility

Data Urls

  • QuALITY (public)
  • QASPER (public)
  • NarrativeQA (public)

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • KG construction is labor-intensive and may not scale cheaply to very large corpora.
  • Dependency on GPT models for NER and judging increases cost and adds variance from sampling and prompt sensitivity.
  • Faithfulness still lags Naive RAG on at least one benchmark (QuALITY), showing uneven gains across metrics.
  • Context recall can be lower than KG-only in some datasets, indicating a precision/recall trade-off.

When Not To Use

  • If you cannot afford LLM API costs or KG construction at scale.
  • When absolute recall is required and a simpler KG-only approach already gives higher recall.
  • If you need strict reproducibility and cannot run multiple judge seeds or aggregate judgments.
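Where repeated judge runs are affordable, aggregating them is straightforward. The sketch below reports the mean and sample standard deviation of one metric across runs; the scores are made-up placeholders rather than values from the paper, and `aggregate_judgments` is a hypothetical helper.

```python
from statistics import mean, stdev


def aggregate_judgments(scores):
    """Return (mean, sample standard deviation) over repeated judge runs."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate spread")
    return mean(scores), stdev(scores)


# Hypothetical answer_relevancy scores from five judge runs on the same outputs.
runs = [0.70, 0.74, 0.72, 0.76, 0.73]
avg, spread = aggregate_judgments(runs)
print(f"mean={avg:.3f} stdev={spread:.3f}")
```

Reporting the spread alongside the mean makes it easier to tell whether a difference between two systems exceeds the judge's own run-to-run noise.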

Failure Modes

  • Erroneous entity extraction from GPT NER leading to wrong KG traversals.
  • Sparse or incomplete KG causing missed multi-hop links and low recall.
  • LLM judge or NER hallucinations injecting noisy signals into evaluation and retrieval.
  • Added latency and cost from hybrid queries making real-time use impractical.

Core Entities

Models

  • GPT-4 (NER)
  • GPT-3.5-turbo-16k (automatic judge)

Metrics

  • answer_relevancy
  • faithfulness
  • context_precision
  • context_recall
  • context-F1

Datasets

  • Italian Cuisine (internal, 3 files)
  • QuALITY (validation/train sets used)
  • QASPER (validation)
  • NarrativeQA (validation)

Benchmarks

  • RAGAS framework (answer_relevancy, faithfulness, context_precision, context_recall)

Context Entities

Models

  • SBERT (referenced for embeddings/related work)