Distill retrieval+evidence and simple graphs from big LLMs into small LMs to cut hallucinations and inference cost

Overview

Decision Snapshot: Ready For Pilot

Method is practical and evaluated across many public benchmarks; expect engineering work to integrate teacher calls and redaction pipelines before production.

Citations: 0

Evidence Strength: 0.75

Confidence: 0.86

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 70%

Authors

Jennifer Chen, Aidar Myrzakhan, Yaxin Luo, Hassaan Muhammad Khan, Sondos Mahmoud Bsharat, Zhiqiang Shen

Links

Abstract / PDF / Code

Why It Matters For Business

DRAG lets organizations run fact-grounded generation on smaller, local models to cut cloud costs, reduce latency, and limit sensitive data exposure while keeping much of the accuracy of large RAG systems.

Who Should Care

ML Engineer, Product Manager, CTO, Engineering Lead, Data Scientist

Summary TLDR

DRAG is a practical distillation pipeline that uses a large LLM to generate ranked textual evidence and relationship triples, then feeds a filtered set of evidences and a compact knowledge graph to a small local model. The distilled small models (SLMs) gain much of the factual grounding of big RAG systems while using far less compute. Key wins: evidence-driven distillation outperforms graph-only distillation, ~15 evidences is a good trade-off, graphs cut token costs (~18%), and a privacy filter reduces injected PII by 95.7%. Code is released.

Problem Statement

Large retrieval-augmented systems give better factual answers but are too heavy and cloud-bound for many real deployments. Smaller local models lack that grounding and hallucinate. The paper asks: can we transfer RAG-style evidence and graph reasoning from big LLMs to small LMs to boost factuality and preserve privacy and efficiency?

Main Contribution

DRAG: a four-step distillation pipeline that uses a teacher LLM to produce evidence, rank it, extract relationship triples, and provide a compact prompt for a student SLM.

A privacy-preserving workflow and a synthetic privacy-leakage benchmark showing how SLMs can redact PII before querying cloud teachers.
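The four steps can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names (`distill`, `build_student_prompt`), the three teacher callables, and the prompt layout are all hypothetical, and the teacher is passed in as plain Python callables so the sketch runs without any API.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Triple = Tuple[str, str, str]  # (head, relation, tail)


@dataclass
class DistilledContext:
    evidences: List[str]
    triples: List[Triple]


def distill(
    question: str,
    generate_evidence: Callable[[str], List[str]],   # assumed wrapper around the teacher LLM
    score_evidence: Callable[[str, str], float],     # assumed teacher-side relevance score
    extract_triples: Callable[[str], List[Triple]],  # assumed teacher-side triple extraction
    top_k: int = 15,
) -> DistilledContext:
    # Step 1: the teacher produces candidate evidences.
    candidates = generate_evidence(question)
    # Step 2: rank the candidates and keep the top_k.
    ranked = sorted(candidates, key=lambda ev: score_evidence(question, ev), reverse=True)
    kept = ranked[:top_k]
    # Step 3: extract relationship triples from the kept evidences.
    triples = [t for ev in kept for t in extract_triples(ev)]
    return DistilledContext(evidences=kept, triples=triples)


def build_student_prompt(question: str, ctx: DistilledContext) -> str:
    # Step 4: assemble a compact prompt for the student SLM (layout is an assumption).
    lines = ["Evidence:"]
    lines += [f"- {ev}" for ev in ctx.evidences]
    lines.append("Relations:")
    lines += [f"- ({h}, {r}, {t})" for h, r, t in ctx.triples]
    lines.append(f"Question: {question}")
    return "\n".join(lines)
```

Keeping the teacher behind callables makes it easy to swap in a real API client later and to unit-test the ranking and prompt assembly offline.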

Key Findings

DRAG outperforms prior small-model RAG baselines on ARC-Challenge by up to +27.7% under the same backbones.

Numbers: ARC-C +27.7% vs MiniRAG (Table 1)

Practical Use: If you run a small LLM with DRAG, expect large boosts in multiple-choice science QA compared to simple MiniRAG-style methods; swap in DRAG-ranked evidence to improve accuracy fast.

Evidence Ref: Table 1

Converting evidences to a compact graph cuts token length by about 18.1% on average.

Numbers: average tokens, evidence 26.30 → graph 21.55 (−18.1%)

Practical Use: Use graph aggregation when token-budget or latency matters: you trade some raw text for structured triples and save ~18% token cost at inference.

Evidence Ref: Table 5
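The 18.1% figure follows directly from the two reported averages. A quick check, where the helper name `relative_reduction` is only illustrative:

```python
def relative_reduction(before: float, after: float) -> float:
    """Fraction of tokens saved when going from `before` to `after`."""
    return (before - after) / before


evidence_tokens = 26.30  # average tokens with raw evidence (reported)
graph_tokens = 21.55     # average tokens with the graph form (reported)

saving = relative_reduction(evidence_tokens, graph_tokens)
print(f"{saving:.1%}")  # prints 18.1%
```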

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	up to 94.1%	varies by SLM (original baseline e.g., 53–63%)	up to +27.7% vs MiniRAG under same backbone	ARC-C (multiple SLMs)	Table 1 reports DRAG ARC-C scores up to 94.1%	Table 1
Accuracy	Gemma-2-9b-it 53.45% (DRAG E)	46.44% (original)	+7.01% absolute	Open-LLM Leaderboard	Table 6 shows evidence-based distillation increases Gemma-2-9b-it from 46.44% to 53.45%	Table 6

What To Try In 7 Days

Run DRAG evidence-only distillation with ~15 evidences on one SLM and an available teacher LLM to test accuracy lift on a pilot dataset.

Add the simple graph aggregation to cut token use and measure latency and cost differences.

Implement a local PII-redaction step before teacher calls and measure PII leakage and accuracy trade-offs.
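For the third item, a pilot can start with something much simpler than the paper's SLM-based redaction. The sketch below swaps in plain regular expressions for two PII types (emails and US-style phone numbers) and adds a crude leakage measure; the pattern set, the placeholder labels, and the function names are assumptions for illustration only.

```python
import re
from typing import Iterable, List

# Hypothetical pattern set; real PII coverage needs far more than two regexes.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}


def redact(text: str) -> str:
    """Replace every match of each pattern with a bracketed label."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


def leakage_rate(outgoing_queries: List[str], known_pii: Iterable[str]) -> float:
    """Share of known PII strings that still appear in any outgoing query."""
    pii = list(known_pii)
    if not pii:
        return 0.0
    leaked = sum(1 for item in pii if any(item in q for q in outgoing_queries))
    return leaked / len(pii)


query = redact("Email jane.doe@example.com or call 555-123-4567 about ARC.")
print(query)  # Email [EMAIL] or call [PHONE] about ARC.
```

Run `redact` locally, send only the redacted query to the teacher, and track `leakage_rate` alongside accuracy to see the trade-off.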

Optimization Features

Token Efficiency

graph reduces avg tokens by ~18.1%

Model Optimization

distillation of retrieval reasoning into SLMs

finetuning-free evidence transfer (prompt-based)

System Optimization

push heavy retrieval to cloud teacher; keep SLM local for final generation

Training Optimization

teacher-generated evidence replaces large-scale retriever training

rank-and-filter to reduce training/inference inputs

Inference Optimization

graph aggregation to shorten contexts

top-K evidence/relationship filtering
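One way to picture graph aggregation: deduplicate the relationship triples and group them by head entity, so each entity is named once in the student prompt. This is a sketch under assumptions; the rendering format and the names `aggregate_triples` and `render_graph` are not taken from the paper.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (head, relation, tail)


def aggregate_triples(triples: List[Triple]) -> Dict[str, List[Tuple[str, str]]]:
    """Group (relation, tail) pairs under their head entity, dropping duplicates."""
    graph: Dict[str, List[Tuple[str, str]]] = {}
    for head, rel, tail in triples:
        edges = graph.setdefault(head, [])
        if (rel, tail) not in edges:
            edges.append((rel, tail))
    return graph


def render_graph(graph: Dict[str, List[Tuple[str, str]]]) -> str:
    """Render one line per head entity: 'head: rel tail; rel tail'."""
    lines = []
    for head, edges in graph.items():
        facts = "; ".join(f"{rel} {tail}" for rel, tail in edges)
        lines.append(f"{head}: {facts}")
    return "\n".join(lines)


triples = [
    ("water", "boils at", "100 C"),
    ("water", "is", "a liquid"),
    ("water", "boils at", "100 C"),  # duplicate, dropped during aggregation
]
print(render_graph(aggregate_triples(triples)))  # water: boils at 100 C; is a liquid
```

Measure the actual saving with the student model's own tokenizer; character or word counts are only a rough proxy.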

Reproducibility

Code Available: Yes

Data Available: No

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/VILA-Lab/DRAG

Risks & Boundaries

Limitations

Some nuanced or implicit knowledge from teacher LLMs may be lost during distillation, hurting creative or subjective tasks.

The distillation process itself requires non-trivial compute to generate and rank many evidences and graphs.

When Not To Use

Tasks that need deeply implicit, creative, or subjective reasoning where explicit evidence is not helpful.

Ultra-low-latency systems where any cloud teacher call is unacceptable.

Failure Modes

Teacher bias or incorrect evidence propagates into the student, causing consistent but wrong answers.

Overfiltering evidences can remove crucial facts and reduce accuracy.
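To guard against overfiltering, it may help to sweep the number of kept evidences on a held-out set and pick the smallest value that stays close to the best accuracy. The sketch below assumes a user-supplied `evaluate(k)` callable that returns accuracy for a given top-K; the helper names and the accuracy values in the usage lines are made up for illustration and are not results from the paper.

```python
from typing import Callable, Dict, Iterable


def sweep_top_k(evaluate: Callable[[int], float], ks: Iterable[int]) -> Dict[int, float]:
    """Run the (assumed) evaluation once per candidate K."""
    return {k: evaluate(k) for k in ks}


def smallest_k_within(results: Dict[int, float], tolerance: float = 0.01) -> int:
    """Smallest K whose accuracy is within `tolerance` of the best observed."""
    best = max(results.values())
    return min(k for k, acc in results.items() if acc >= best - tolerance)


# Illustrative, made-up accuracies standing in for a real evaluation run.
fake_accuracy = {5: 0.70, 10: 0.78, 15: 0.82, 20: 0.825}
results = sweep_top_k(fake_accuracy.get, [5, 10, 15, 20])
chosen_k = smallest_k_within(results, tolerance=0.01)
```

A sharp accuracy drop at small K is a sign that the filter is removing facts the student needs.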

Core Entities

Models

GPT-4o, GPT-4o-mini, DeepSeek-V3, Gemini 1.5 Flash, Claude 3.5 Sonnet, LLaMA-3.3-70B, Gemma-2-9b-it, Gemma-2-2b-it, Gemma-22B-it, Phi-3.5-mini-instruct, Qwen2.5-3B-Instruct, Qwen2.5-7B-Instruct, LLaMA-3.1-8B-Instruct, LLaMA-3.2-3B-Instruct, BLOOM-7b, GPT-3.5-Turbo

Metrics

Accuracy, token length, PII reduction

Datasets

ARC-Challenge, MedMCQA, GPQA, MMLU, Open-LLM-Leaderboard, AVERITEC, WebQuestions, MMLU-privacy-augmented (authors)

Benchmarks

ARC-C, MedMCQA, GPQA, MMLU, Open-LLM Leaderboard, AVERITEC, WebQuestions
