SKETCH: combine semantic chunking with knowledge graphs to improve RAG retrieval for complex, multi-context queries

December 19, 2024 · 8 min

Overview

Decision Snapshot: Needs Validation

SKETCH is a practical hybrid retrieval method with clear per-dataset gains; it is ready for prototyping but costly at scale because of knowledge graph (KG) construction and its dependence on LLMs.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 40%

Production readiness: 60%

Novelty: 50%

Authors

Aakash Mahalingam, Vinesh Kumar Gande, Aman Chadha, Vinija Jain, Divya Chaudhary

Why It Matters For Business

SKETCH gives more accurate, context-preserving retrieval for complex, multi-part queries, which improves downstream answers and traceability at the cost of higher KG construction and LLM use.

Summary TLDR

SKETCH combines semantic chunking (splitting text into meaning-preserving units) with a knowledge graph (structured entities and relations) and a hybrid retriever to improve Retrieval-Augmented Generation (RAG). Evaluated with RAGAS metrics on four datasets (Italian Cuisine, QuALITY, QASPER, NarrativeQA), SKETCH raises answer relevancy and context precision versus Naive RAG and several baselines. Gains are largest for small-domain tests and long-document comprehension, but building large KGs and relying on GPT models raise cost and reproducibility concerns.
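To make "semantic chunking" concrete, the following is a minimal sketch rather than the authors' implementation. It assumes a chunk boundary is placed wherever the similarity between consecutive sentences drops below a threshold. The bag-of-words cosine is a toy stand-in for a real sentence-embedding model, and the function names, the `threshold=0.2` value, and the sample sentences are all illustrative assumptions.

```python
import math
import re
from collections import Counter


def bow_vector(sentence):
    """Toy stand-in for a sentence embedding: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_chunks(sentences, threshold=0.2):
    """Group consecutive sentences; start a new chunk when the
    similarity to the previous sentence falls below `threshold`
    (an assumed, hand-picked value)."""
    chunks, current, prev_vec = [], [], None
    for sentence in sentences:
        vec = bow_vector(sentence)
        if current and cosine(prev_vec, vec) < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
        prev_vec = vec
    if current:
        chunks.append(" ".join(current))
    return chunks


# Hypothetical sample text, not taken from the paper's datasets.
sentences = [
    "Risotto is a rice dish from northern Italy.",
    "The rice in risotto is cooked slowly in broth.",
    "FAISS is a library for vector similarity search.",
    "A FAISS index stores dense vectors for fast search.",
]
chunks = semantic_chunks(sentences)
for chunk in chunks:
    print(chunk)
```

In a real prototype the `bow_vector` function would be replaced by an embedding model, and the threshold would need tuning per corpus.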

Problem Statement

Current RAG systems often lose context when they split text arbitrarily, and they struggle to combine evidence spread across distant parts of a corpus. This lowers answer relevancy and weakens multi-hop reasoning on complex queries.

Main Contribution

SKETCH: a hybrid retrieval method that fuses semantic chunking (meaningful text chunks) with a knowledge graph (entities + relations).

A concrete indexing pipeline: semantic splitting, recursive character splitting (100-token chunks with a 16-token overlap), embeddings indexed in FAISS, and a KG built from extracted entities.
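The chunk size and overlap quoted above can be illustrated with a short sketch. This is a simplified fixed-window splitter over a token list, not the recursive character splitter the pipeline names, and it is not the authors' code. The function name and the synthetic tokens are assumptions; only the 100/16 figures come from the summary.

```python
def split_with_overlap(tokens, chunk_size=100, overlap=16):
    """Slide a window of `chunk_size` tokens, repeating the last
    `overlap` tokens of each chunk at the start of the next one."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the window already reached the end of the input
    return chunks


# Synthetic tokens for illustration only.
tokens = [f"tok{i}" for i in range(250)]
chunks = split_with_overlap(tokens)
print([len(c) for c in chunks])  # [100, 100, 82]
```

The overlap means a sentence cut at a chunk boundary still appears intact in at least one neighbouring chunk, at the cost of indexing some tokens twice.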

Key Findings

On the small Italian Cuisine test, SKETCH reached very high relevancy and precision.

Numbers: answer_relevancy=0.94; context_precision=0.99

Practical Use: For focused domain corpora, combining KG + semantic chunks can deliver near-perfect context precision and highly relevant answers; use SKETCH when high precision matters.

Evidence Ref: Table 1 (Italian Cuisine)

On QuALITY (long-document comprehension), SKETCH improved answer relevancy over Naive RAG.

Numbers: answer_relevancy SKETCH=0.73 vs Naive RAG=0.49 (+49%)

Practical Use: For long passages, semantic chunking + KG helps find relevant material across a long text; try SKETCH when queries need whole-document understanding.

Evidence Ref: Table 2 (QuALITY)

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Italian Cuisine - answer_relevancy | 0.94 | Naive RAG=0.61 | +54.1% | Italian Cuisine | Table 1: SKETCH=0.94 vs Naive RAG=0.61 | Table 1
Italian Cuisine - context_precision | 0.99 | Naive RAG=0.81 | +22.2% | Italian Cuisine | Table 1: SKETCH context_precision=0.99 | Table 1
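
The deltas reported on this page are relative gains over the Naive RAG baseline, that is, (value - baseline) / baseline. The short check below recomputes them from the scores quoted above; the helper name is an assumption made for illustration.

```python
def relative_gain_pct(value, baseline):
    """Relative improvement of `value` over `baseline`, in percent."""
    return (value - baseline) / baseline * 100.0


# (label, SKETCH score, Naive RAG score) as quoted on this page.
rows = [
    ("Italian Cuisine answer_relevancy", 0.94, 0.61),
    ("Italian Cuisine context_precision", 0.99, 0.81),
    ("QuALITY answer_relevancy", 0.73, 0.49),
]
for label, sketch, naive in rows:
    print(f"{label}: {relative_gain_pct(sketch, naive):+.1f}%")
# Italian Cuisine answer_relevancy: +54.1%
# Italian Cuisine context_precision: +22.2%
# QuALITY answer_relevancy: +49.0%
```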

What To Try In 7 Days

Run semantic chunking on a small domain corpus and index embeddings into FAISS to see immediate gains in retrieval.

Extract entities with an LLM for a small subset, build a toy KG, and run Cypher queries to compare structured vs unstructured hits.

Combine KG results and embedding results with a simple overlap-weight rule and measure relevancy on a handful of multi-context queries using an LLM judge.
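
One possible reading of the "simple overlap-weight rule" in the last step is sketched below. It assumes each retriever returns a score in [0, 1] per chunk ID, and that chunks found by both retrievers get a fixed bonus. The function name, the `overlap_bonus` value, and the sample scores are hypothetical; the paper's actual fusion rule may differ.

```python
def fuse(kg_hits, vector_hits, overlap_bonus=0.5):
    """Merge two {chunk_id: score} dicts into one ranked list.

    A chunk keeps its best individual score, plus `overlap_bonus`
    (an assumed constant) when both retrievers returned it.
    """
    fused = {}
    for chunk_id in set(kg_hits) | set(vector_hits):
        score = max(kg_hits.get(chunk_id, 0.0), vector_hits.get(chunk_id, 0.0))
        if chunk_id in kg_hits and chunk_id in vector_hits:
            score += overlap_bonus
        fused[chunk_id] = score
    # Highest score first; ties broken by chunk ID for determinism.
    return sorted(fused.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical retrieval scores for three chunks.
kg_hits = {"c1": 0.6, "c3": 0.9}
vector_hits = {"c1": 0.7, "c2": 0.8}
ranked = fuse(kg_hits, vector_hits)
print([chunk_id for chunk_id, _ in ranked])  # ['c1', 'c3', 'c2']
```

Here `c1` outranks the individually stronger `c3` because both retrievers agree on it, which is the behaviour an overlap-weighted rule is meant to produce.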

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Partial
License: Unknown

Data URLs

QuALITY (public)
QASPER (public)
NarrativeQA (public)

Risks & Boundaries

Limitations

KG construction is labor-intensive and may not scale cheaply to very large corpora.

Dependency on GPT models for NER and judging increases cost and adds variance from sampling and prompt sensitivity.

When Not To Use

If you cannot afford LLM API costs or KG construction at scale.

When absolute recall is required and a simpler KG-only approach already gives higher recall.

Failure Modes

Erroneous entity extraction from GPT NER leading to wrong KG traversals.

Sparse or incomplete KG causing missed multi-hop links and low recall.
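
The sparse-KG failure mode can be seen in a toy example. The sketch below stores a graph as an adjacency dict and walks it breadth-first up to a hop limit; the entities, relation names, and function name are invented for illustration and do not come from the paper's KG.

```python
from collections import deque


def neighbors_within(graph, start, max_hops):
    """Return every entity reachable from `start` in at most `max_hops` edges."""
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue  # do not expand beyond the hop limit
        for _relation, target in graph.get(node, []):
            if target not in seen:
                seen.add(target)
                frontier.append((target, depth + 1))
    seen.discard(start)
    return seen


# Hypothetical KG: entity -> list of (relation, target entity).
full_kg = {
    "Risotto": [("HAS_INGREDIENT", "Arborio rice")],
    "Arborio rice": [("GROWN_IN", "Po Valley")],
}
# Same graph with one relation missed during extraction.
sparse_kg = {
    "Risotto": [("HAS_INGREDIENT", "Arborio rice")],
}

print(neighbors_within(full_kg, "Risotto", 2))
print(neighbors_within(sparse_kg, "Risotto", 2))
```

With the `GROWN_IN` edge missing, a two-hop question about where the dish's rice is grown can no longer be answered from the graph, so the retriever has to fall back on the embedding side alone.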

Core Entities

Models

GPT-4 (NER)
GPT-3.5-turbo-16k (automatic judge)

Metrics

answer_relevancy
faithfulness
context_precision
context_recall
context-F1
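
This page does not spell out how context-F1 is computed. Assuming it follows the usual F1 convention (the harmonic mean of context_precision and context_recall), it could be calculated as in the sketch below. Both the definition and the example inputs are assumptions, not figures from the paper.

```python
def context_f1(precision, recall):
    """Harmonic mean of context precision and context recall
    (assumed definition; returns 0.0 when both inputs are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative inputs only, not results reported for SKETCH.
print(round(context_f1(0.8, 0.6), 3))  # 0.686
```

Because the harmonic mean is pulled toward the lower of the two inputs, a retriever with very high precision but weak recall still scores modestly on this metric.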

Datasets

Italian Cuisine (internal, 3 files)
QuALITY (validation/train sets used)
QASPER (validation)
NarrativeQA (validation)

Benchmarks

RAGAS framework (answer_relevancy, faithfulness, context_precision, context_recall)

Context Entities

Models

SBERT (referenced for embeddings/related work)