Overview
Production Readiness
0.6
Novelty Score
0.5
Cost Impact Score
0.4
Citation Count
0
Why It Matters For Business
SKETCH gives more accurate, context-preserving retrieval for complex, multi-part queries, which improves downstream answers and traceability, at the cost of added knowledge graph (KG) construction effort and heavier LLM usage.
Summary TLDR
SKETCH combines semantic chunking (splitting text into meaning-preserving units) with a knowledge graph (structured entities and relations) and a hybrid retriever to improve Retrieval-Augmented Generation (RAG). Evaluated with RAGAS metrics on four datasets (Italian Cuisine, QuALITY, QASPER, NarrativeQA), SKETCH raises answer relevancy and context precision versus Naive RAG and several baselines. Gains are largest for small-domain tests and long-document comprehension, but building large KGs and relying on GPT models raise cost and reproducibility concerns.
Problem Statement
Current RAG systems often lose context when they split text arbitrarily, and they struggle to combine evidence spread across distant parts of a corpus. This lowers answer relevancy and weakens multi-hop reasoning on complex queries.
Main Contribution
SKETCH: a hybrid retrieval method that fuses semantic chunking (meaningful text chunks) with a knowledge graph (entities + relations).
A concrete indexing pipeline: semantic splitting, recursive character splitting (100-token chunks, 16-token overlap), embeddings indexed in FAISS, and a KG built from extracted entities.
A hybrid query flow: GPT‑4 NER -> Cypher queries on the KG (structured) plus cosine similarity over the FAISS-indexed embeddings (unstructured), then merging the two result sets using token overlap as confirmation.
Empirical evaluation across four datasets (Italian Cuisine, QuALITY, QASPER, NarrativeQA) using RAGAS metrics and GPT‑3.5-turbo-16k as an automatic judge.
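The chunking and similarity steps of the indexing pipeline can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: whitespace-separated words stand in for tokens, a bag-of-words counter stands in for a real embedding model and the FAISS index, and the function names (split_with_overlap, bow_vector, cosine, top_k) are hypothetical.

```python
import math
from collections import Counter


def split_with_overlap(tokens, chunk_size=100, overlap=16):
    """Slide a fixed-size window over tokens; neighbours share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the window already reached the end of the text
    return chunks


def bow_vector(tokens):
    """Toy bag-of-words vector (assumption: a stand-in for a learned embedding)."""
    return Counter(t.lower() for t in tokens)


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def top_k(query_tokens, chunks, k=3):
    """Return up to k (score, chunk_index) pairs, highest similarity first."""
    q = bow_vector(query_tokens)
    scored = [(cosine(q, bow_vector(c)), i) for i, c in enumerate(chunks)]
    scored.sort(key=lambda s: (-s[0], s[1]))
    return scored[:k]
```

With the defaults the window advances 84 tokens at a time, so each chunk repeats the last 16 tokens of the previous one. In a real setup the toy vectors would be replaced by model embeddings and the linear scan by an index lookup.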
Key Findings
On the small Italian Cuisine test, SKETCH reached very high relevancy and precision.
On QuALITY (long-document comprehension), SKETCH improved answer relevancy over Naive RAG.
On QASPER (scientific papers), SKETCH achieved very high faithfulness.
On NarrativeQA (narrative understanding), SKETCH provided balanced gains in relevancy and faithfulness.
SKETCH depends on GPT models for entity extraction and evaluation, which introduces variability and cost.
Results
Italian Cuisine - answer_relevancy
Italian Cuisine - context_precision
QuALITY - answer_relevancy
QASPER - faithfulness
NarrativeQA - answer_relevancy
Who Should Care
What To Try In 7 Days
Run semantic chunking on a small domain corpus and index the embeddings in FAISS to check whether retrieval improves over fixed-size chunking.
Extract entities with an LLM for a small subset, build a toy KG, and run Cypher queries to compare structured vs unstructured hits.
Combine KG results and embedding results with a simple overlap-weight rule and measure relevancy on a handful of multi-context queries using an LLM judge.
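One possible shape for the overlap-weight rule in the last step is sketched below. It is an assumption for experimentation, not the rule used in the paper: overlap is measured as Jaccard similarity over lowercase word sets, the weight of 0.5 is arbitrary, and the names token_overlap and merge_hits are hypothetical.

```python
def token_overlap(a, b):
    """Jaccard overlap between the lowercase word sets of two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def merge_hits(kg_hits, vector_hits, overlap_weight=0.5):
    """Re-score embedding hits by how strongly a KG fact confirms them.

    kg_hits:     list of strings (facts returned by the KG query)
    vector_hits: list of (text, similarity) pairs from the embedding search
    Returns (text, score) pairs sorted by score, highest first, where
    score = similarity + overlap_weight * best overlap with any KG fact.
    """
    merged = []
    for text, sim in vector_hits:
        best = max((token_overlap(text, fact) for fact in kg_hits), default=0.0)
        merged.append((text, sim + overlap_weight * best))
    merged.sort(key=lambda pair: -pair[1])
    return merged
```

A chunk that the KG also supports moves up the ranking, while chunks with no KG confirmation keep their original similarity. Comparing the ranking with overlap_weight set to 0 against a nonzero weight gives a quick read on whether the KG signal helps on the chosen queries.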
Reproducibility
Data Urls
- QuALITY (public)
- QASPER (public)
- NarrativeQA (public)
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- KG construction is labor-intensive and may not scale cheaply to very large corpora.
- Dependency on GPT models for NER and judging increases cost and adds variance from sampling and prompt sensitivity.
- Faithfulness still lags Naive RAG on at least one benchmark (QuALITY), showing uneven gains across metrics.
- Context recall can be lower than that of a KG-only retriever on some datasets, indicating a precision/recall trade-off.
When Not To Use
- If you cannot afford LLM API costs or KG construction at scale.
- When absolute recall is required and a simpler KG-only approach already gives higher recall.
- If you need strict reproducibility and cannot run multiple judge seeds or aggregate judgments.
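Where repeated judge runs are affordable, aggregating them might look like the sketch below. It assumes each run yields a single score in [0, 1] for the same answer; the function name aggregate_judgments and the run identifiers are hypothetical.

```python
from statistics import mean, stdev


def aggregate_judgments(scores_by_run):
    """Summarise repeated LLM-judge scores for one answer.

    scores_by_run: dict mapping a run identifier (e.g. a seed) to a score.
    Returns (mean score, sample standard deviation); the spread is 0.0
    when only one run is available.
    """
    values = list(scores_by_run.values())
    if not values:
        raise ValueError("no judge runs to aggregate")
    spread = stdev(values) if len(values) > 1 else 0.0
    return mean(values), spread
```

Reporting the spread next to the mean shows how much of a measured gain could be judge variance rather than a retrieval improvement.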
Failure Modes
- Erroneous entity extraction from GPT NER leading to wrong KG traversals.
- Sparse or incomplete KG causing missed multi-hop links and low recall.
- LLM judge or NER hallucinations injecting noisy signals into evaluation and retrieval.
- Added latency and cost from hybrid queries making real-time use impractical.
Core Entities
Models
- GPT-4 (NER)
- GPT-3.5-turbo-16k (automatic judge)
Metrics
- answer_relevancy
- faithfulness
- context_precision
- context_recall
- context-F1
Datasets
- Italian Cuisine (internal, 3 files)
- QuALITY (validation/train sets used)
- QASPER (validation)
- NarrativeQA (validation)
Benchmarks
- RAGAS framework (answer_relevancy, faithfulness, context_precision, context_recall)
Context Entities
Models
- SBERT (referenced for embeddings/related work)

