SKETCH: combine semantic chunking with knowledge graphs to improve RAG retrieval for complex, multi-context queries

Overview

Decision Snapshot: Needs Validation

SKETCH is a practical hybrid retrieval method with clear per-dataset gains; it is ready for prototyping but costly at scale due to KG build and LLM dependence.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 40%

Production readiness: 60%

Novelty: 50%

Authors

Aakash Mahalingam, Vinesh Kumar Gande, Aman Chadha, Vinija Jain, Divya Chaudhary

Why It Matters For Business

SKETCH gives more accurate, context-preserving retrieval for complex, multi-part queries, which improves downstream answers and traceability at the cost of higher KG construction and LLM use.

Who Should Care

CTO, Product Manager, ML Engineer, Data Scientist, Engineering Lead

Summary TLDR

SKETCH combines semantic chunking (split text into meaning-preserving units) with a knowledge graph (structured entities and relations) and a hybrid retriever to improve Retrieval-Augmented Generation (RAG). Evaluated using RAGAS metrics on four datasets (Italian Cuisine, QuALITY, QASPER, NarrativeQA), SKETCH raises answer relevancy and context precision versus Naive RAG and several baselines. Gains are largest for small-domain tests and long-document comprehension, but building large KGs and relying on GPT models raises cost and reproducibility concerns.
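As a rough illustration of what "meaning-preserving units" can mean in practice, the sketch below groups consecutive sentences while they stay similar and starts a new chunk when similarity drops. This is a generic similarity-breakpoint approach, not necessarily the splitter used in the paper. The `embed` function is a toy bag-of-words stand-in for a real sentence embedding, and the `threshold` value and example sentences are assumptions chosen for the demo.

```python
import math
import re
from collections import Counter


def embed(sentence):
    """Toy stand-in for a sentence embedding: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_chunks(sentences, threshold=0.2):
    """Append a sentence to the current chunk if it is similar enough
    to the previous sentence; otherwise start a new chunk."""
    chunks = []
    for sentence in sentences:
        if chunks and cosine(embed(chunks[-1][-1]), embed(sentence)) >= threshold:
            chunks[-1].append(sentence)
        else:
            chunks.append([sentence])
    return chunks


# Hypothetical example sentences covering two unrelated topics.
sentences = [
    "Risotto is a rice dish from northern Italy.",
    "The rice in risotto is cooked slowly in broth.",
    "FAISS stores vectors for fast similarity search.",
    "Similarity search over vectors returns the nearest chunks.",
]
result = semantic_chunks(sentences)
for chunk in result:
    print(chunk)
```

With a real embedding model in place of `embed`, the threshold would need tuning per corpus.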

Problem Statement

Current RAG systems often lose context when they split text arbitrarily and struggle to combine evidence spread across distant parts of a corpus. This reduces answer relevancy and multi-hop reasoning for complex queries.

Main Contribution

SKETCH: a hybrid retrieval method that fuses semantic chunking (meaningful text chunks) with a knowledge graph (entities + relations).

A concrete indexing pipeline: semantic splitting, recursive character splitting (100-token chunks, 16-token overlap), embeddings indexed in FAISS, and a KG built from extracted entities.
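The chunk size and overlap above can be pictured with a plain sliding window. The sketch below is a simplification: it uses a fixed window over a pre-tokenized list rather than a true recursive character splitter, and the function name `split_with_overlap` and the synthetic tokens are assumptions for illustration only.

```python
def split_with_overlap(tokens, chunk_size=100, overlap=16):
    """Slide a fixed-size window over `tokens`, repeating `overlap`
    tokens between neighbouring chunks."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the input
    return chunks


# Synthetic input: 250 placeholder tokens.
tokens = [f"tok{i}" for i in range(250)]
chunks = split_with_overlap(tokens)
print([len(c) for c in chunks])
```

Each chunk shares its first 16 tokens with the tail of the previous one, which is what keeps a sentence that straddles a boundary retrievable from either side.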

Key Findings

On the small Italian Cuisine test, SKETCH reached very high relevancy and precision.

Numbers: answer_relevancy=0.94; context_precision=0.99

Practical Use: For focused domain corpora, combining KG + semantic chunks can deliver near-perfect context precision and highly relevant answers; use SKETCH when high precision matters.

Evidence Ref: Table 1 (Italian Cuisine)

On QuALITY (long-document comprehension), SKETCH improved answer relevancy over Naive RAG.

Numbers: answer_relevancy SKETCH=0.73 vs Naive RAG=0.49 (+49%)

Practical Use: For long passages, semantic chunking + KG helps find relevant material across a long text. Try SKETCH when queries need whole-document understanding.

Evidence Ref: Table 2 (QuALITY)

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Italian Cuisine - answer_relevancy	0.94	Naive RAG=0.61	+54.1%	Italian Cuisine	Table 1: SKETCH=0.94 vs Naive RAG=0.61	Table 1
Italian Cuisine - context_precision	0.99	Naive RAG=0.81	+22.2%	Italian Cuisine	Table 1: SKETCH context_precision=0.99	Table 1
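
The Delta column reads as a relative gain over the Naive RAG baseline, i.e. (value - baseline) / baseline. A small check of that arithmetic, using only the scores reported above (the helper name `relative_gain` is an assumption):

```python
def relative_gain(value, baseline):
    """Relative improvement of `value` over `baseline`, in percent."""
    return (value - baseline) / baseline * 100.0


# (SKETCH score, Naive RAG score) pairs taken from the reported numbers.
reported = {
    "Italian Cuisine answer_relevancy": (0.94, 0.61),
    "Italian Cuisine context_precision": (0.99, 0.81),
    "QuALITY answer_relevancy": (0.73, 0.49),
}

gains = {name: round(relative_gain(v, b), 1) for name, (v, b) in reported.items()}
for name, gain in gains.items():
    print(f"{name}: +{gain}%")
```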

What To Try In 7 Days

Run semantic chunking on a small domain corpus and index embeddings into FAISS to see immediate gains in retrieval.

Extract entities with an LLM for a small subset, build a toy KG, and run Cypher queries to compare structured vs unstructured hits.

Combine KG results and embedding results with a simple overlap-weight rule and measure relevancy on a handful of multi-context queries using an LLM judge.
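
One possible reading of the "simple overlap-weight rule" in the last step is sketched below: sum the scores from both retrievers and add a bonus when a chunk is returned by both. The function name `merge_hits`, the `overlap_bonus` value, and the example scores are assumptions for illustration, not the paper's fusion rule.

```python
def merge_hits(kg_hits, vector_hits, overlap_bonus=0.5, top_k=3):
    """Fuse two retrievers' results.

    Both inputs map chunk_id -> score (assumed to be on comparable scales).
    A chunk found by both retrievers receives an extra `overlap_bonus`.
    """
    merged = {}
    for chunk_id in set(kg_hits) | set(vector_hits):
        score = kg_hits.get(chunk_id, 0.0) + vector_hits.get(chunk_id, 0.0)
        if chunk_id in kg_hits and chunk_id in vector_hits:
            score += overlap_bonus
        merged[chunk_id] = score
    # Sort by score descending, then by id so ties are deterministic.
    return sorted(merged.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


# Hypothetical scores from a KG lookup and an embedding search.
kg_hits = {"c1": 0.9, "c3": 0.4}
vector_hits = {"c2": 0.8, "c3": 0.5, "c4": 0.3}

ranked = merge_hits(kg_hits, vector_hits)
print(ranked)
```

Here `c3` ranks first because both retrievers agree on it, even though neither scored it highest on its own. The ranked chunks can then be passed to the generator and scored by the LLM judge.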

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Partial

License: Unknown

Data URLs

QuALITY (public)

QASPER (public)

NarrativeQA (public)

Risks & Boundaries

Limitations

KG construction is labor-intensive and may not scale cheaply to very large corpora.

Dependency on GPT models for NER and judging increases cost and adds variance from sampling and prompt sensitivity.

When Not To Use

If you cannot afford LLM API costs or KG construction at scale.

When absolute recall is required and a simpler KG-only approach already gives higher recall.

Failure Modes

Erroneous entity extraction from GPT NER leading to wrong KG traversals.

Sparse or incomplete KG causing missed multi-hop links and low recall.
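
The sparse-KG failure mode can be reproduced on a toy graph: if one extracted triple is missing, a two-hop question has no path to follow. The sketch below is a minimal in-memory stand-in for a graph database, and the triples and helper names (`build_graph`, `find_path`) are hypothetical.

```python
from collections import deque


def build_graph(triples):
    """Build an adjacency list from (head, relation, tail) triples."""
    graph = {}
    for head, relation, tail in triples:
        graph.setdefault(head, []).append((relation, tail))
        graph.setdefault(tail, [])
    return graph


def find_path(graph, start, goal, max_hops=3):
    """Breadth-first search; returns the list of triples linking
    start to goal, or None if no path exists within max_hops."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        if len(path) == max_hops:
            continue
        for relation, neighbour in graph.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, path + [(node, relation, neighbour)]))
    return None


# Hypothetical triples, as an LLM-based extractor might produce them.
triples = [
    ("Carbonara", "uses", "Guanciale"),
    ("Guanciale", "made_from", "Pork"),
]
full_kg = build_graph(triples)
sparse_kg = build_graph(triples[:1])  # second triple was never extracted

full_path = find_path(full_kg, "Carbonara", "Pork")
sparse_path = find_path(sparse_kg, "Carbonara", "Pork")
print(full_path)
print(sparse_path)
```

In the sparse graph the traversal returns nothing, so the question falls back entirely on the embedding retriever, which is one reason to measure recall separately for the KG path.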

Core Entities

Models

GPT-4 (NER)

GPT-3.5-turbo-16k (automatic judge)

Metrics

answer_relevancy, faithfulness, context_precision, context_recall, context-F1

Datasets

Italian Cuisine (internal, 3 files)

QuALITY (validation/train sets used)

QASPER (validation)

NarrativeQA (validation)

Benchmarks

RAGAS framework (answer_relevancy, faithfulness, context_precision, context_recall)

Context Entities

Models

SBERT (referenced for embeddings/related work)

