Distill retrieval+evidence and simple graphs from big LLMs into small LMs to cut hallucinations and inference cost

June 2, 2025 · 8 min

Overview

Decision Snapshot: Ready For Pilot

Method is practical and evaluated across many public benchmarks; expect engineering work to integrate teacher calls and redaction pipelines before production.

Citations: 0

Evidence Strength: 0.75

Confidence: 0.86

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 70%

Authors

Jennifer Chen, Aidar Myrzakhan, Yaxin Luo, Hassaan Muhammad Khan, Sondos Mahmoud Bsharat, Zhiqiang Shen

Why It Matters For Business

DRAG lets organizations run fact-grounded generation on smaller, local models to cut cloud costs, reduce latency, and limit sensitive data exposure while keeping much of the accuracy of large RAG systems.

Summary TLDR

DRAG is a practical distillation pipeline: a large teacher LLM generates ranked textual evidence and relationship triples, and a filtered set of that evidence plus a compact knowledge graph is fed to a small local model. The resulting small language models (SLMs) gain much of the factual grounding of big RAG systems while using far less compute. Key wins: evidence-driven distillation outperforms graph-only distillation, about 15 evidence items is a good trade-off, graphs cut token costs by about 18%, and a privacy filter reduces injected PII by 95.7%. Code is released.

Problem Statement

Large retrieval-augmented systems give better factual answers but are too heavy and cloud-bound for many real deployments. Smaller local models lack that grounding and hallucinate. The paper asks: can we transfer RAG-style evidence and graph reasoning from big LLMs to small LMs to boost factuality and preserve privacy and efficiency?

Main Contribution

DRAG: a four-step distillation pipeline that uses a teacher LLM to produce evidence, rank it, extract relationship triples, and provide a compact prompt for a student SLM.

A privacy-preserving workflow and a synthetic privacy-leakage benchmark showing how SLMs can redact PII before querying cloud teachers.
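The four steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' released code: the `Evidence` class, the function names, the prompt layout, and the stub teacher and student callables are all hypothetical, and the teacher-assigned `score` is assumed to be a simple relevance number where higher is better.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Triple = Tuple[str, str, str]  # (head, relation, tail)


@dataclass
class Evidence:
    text: str
    score: float  # hypothetical teacher-assigned relevance, higher is better


def top_k_evidence(evidences: List[Evidence], k: int) -> List[Evidence]:
    """Step 2: keep the k highest-scoring evidence items."""
    return sorted(evidences, key=lambda e: e.score, reverse=True)[:k]


def build_prompt(question: str, evidences: List[Evidence], triples: List[Triple]) -> str:
    """Step 4: assemble a compact prompt for the student SLM."""
    lines = ["Evidence:"]
    lines += [f"- {e.text}" for e in evidences]
    lines.append("Relations:")
    lines += [f"- ({h}; {r}; {t})" for h, r, t in triples]
    lines += [f"Question: {question}", "Answer:"]
    return "\n".join(lines)


def drag_answer(
    question: str,
    generate_evidence: Callable[[str], List[Evidence]],         # teacher call (assumed)
    extract_triples: Callable[[List[Evidence]], List[Triple]],  # teacher call (assumed)
    student: Callable[[str], str],                              # local SLM (assumed)
    k: int = 15,
) -> str:
    candidates = generate_evidence(question)  # step 1: teacher produces evidence
    kept = top_k_evidence(candidates, k)      # step 2: rank and filter
    triples = extract_triples(kept)           # step 3: relationship triples
    return student(build_prompt(question, kept, triples))


# Stubs standing in for real teacher and student models.
def stub_generate(question: str) -> List[Evidence]:
    return [
        Evidence("Bananas are yellow.", 0.1),
        Evidence("Water boils at 100 C at sea level.", 0.9),
    ]


def stub_triples(evidences: List[Evidence]) -> List[Triple]:
    return [("water", "boils_at", "100 C")]


def stub_student(prompt: str) -> str:
    return prompt  # echo the prompt so the assembled context is visible


demo = drag_answer(
    "At what temperature does water boil at sea level?",
    stub_generate, stub_triples, stub_student, k=1,
)
print(demo)
```

The default `k=15` mirrors the evidence count the summary reports as a good trade-off; the demo uses `k=1` only to show the low-scoring item being filtered out.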

Key Findings

DRAG outperforms prior small-model RAG baselines on ARC-Challenge by up to +27.7% under the same backbones.

Numbers: ARC-C: +27.7% vs MiniRAG (Table 1)

Practical Use: If you run a small LLM with DRAG, expect large boosts in multiple-choice science QA compared to simple MiniRAG-style methods; swap in DRAG-ranked evidence to improve accuracy fast.

Evidence Ref: Table 1

Converting evidences to a compact graph cuts token length by about 18.1% on average.

Numbers: Avg tokens: evidence 26.30 → graph 21.55 (−18.1%)

Practical Use: Use graph aggregation when token budget or latency matters: you trade some raw text for structured triples and save ~18% token cost at inference.

Evidence Ref: Table 5
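The reported saving follows directly from the two averages. A small check, using only the figures quoted above (the helper name `percent_reduction` is illustrative):

```python
def percent_reduction(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, in percent."""
    return (before - after) / before * 100.0


# Average token lengths quoted in the finding (Table 5).
evidence_tokens = 26.30
graph_tokens = 21.55

saving = percent_reduction(evidence_tokens, graph_tokens)
print(f"{saving:.1f}% fewer tokens")  # prints "18.1% fewer tokens"
```

The same helper can be pointed at your own measured averages when comparing evidence-only and graph prompts in a pilot.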

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | up to 94.1% | varies by SLM (original baseline e.g., 5363%) | up to +27.7% vs MiniRAG under same backbone | ARC-C (multiple SLMs) | Table 1 reports DRAG ARC-C scores up to 94.1% | Table 1
Accuracy | Gemma-2-9b-it 53.45% (DRAG E) | 46.44% (original) | +7.01% absolute | Open-LLM Leaderboard | Table 6 shows evidence-based distillation increases Gemma-2-9b-it from 46.44% to 53.45% | Table 6

What To Try In 7 Days

Run DRAG evidence-only distillation with ~15 evidences on one SLM and an available teacher LLM to test accuracy lift on a pilot dataset.

Add the simple graph aggregation to cut token use and measure latency and cost differences.

Implement a local PII-redaction step before teacher calls and measure PII leakage and accuracy trade-offs.
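For the redaction step, a first pilot could start from something like the sketch below. Note the swap: the paper's workflow has the SLM itself redact PII, whereas this example uses plain regular expressions as a stand-in, so it only catches the two patterns shown. The patterns, placeholder tags, and function names are assumptions for illustration, not part of DRAG.

```python
import re

# Deliberately simple patterns (assumptions); real PII detection needs more coverage.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text: str) -> str:
    """Replace matched emails and phone numbers before a teacher call."""
    text = EMAIL.sub("[EMAIL]", text)
    text = PHONE.sub("[PHONE]", text)
    return text


def count_pii(text: str) -> int:
    """Count pattern matches, as a crude leakage measure."""
    return len(EMAIL.findall(text)) + len(PHONE.findall(text))


raw = "Email jane.doe@example.com or call +1 (555) 010-9999 about ARC."
clean = redact(raw)
print(clean)                             # Email [EMAIL] or call [PHONE] about ARC.
print(count_pii(raw), count_pii(clean))  # 2 0
```

Comparing `count_pii` before and after redaction over a test set gives a rough leakage-reduction figure to set against any accuracy change.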

Optimization Features

Token Efficiency: graph reduces avg tokens by ~18.1%
Model Optimization: distillation of retrieval reasoning into SLMs; finetuning-free evidence transfer (prompt-based)
System Optimization: push heavy retrieval to cloud teacher; keep SLM local for final generation
Training Optimization: teacher-generated evidence replaces large-scale retriever training; rank-and-filter to reduce training/inference inputs
Inference Optimization: graph aggregation to shorten contexts; top-K evidence/relationship filtering

Reproducibility

Code Available: Yes
Data Available: No
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Some nuanced or implicit knowledge from teacher LLMs may be lost during distillation, hurting creative or subjective tasks.

The distillation process itself requires non-trivial compute to generate and rank many evidences and graphs.

When Not To Use

Tasks that need deeply implicit, creative, or subjective reasoning where explicit evidence is not helpful.

Ultra-low-latency systems where any cloud teacher call is unacceptable.

Failure Modes

Teacher bias or incorrect evidence propagates into the student, causing consistent but wrong answers.

Overfiltering evidences can remove crucial facts and reduce accuracy.

Core Entities

Models

GPT-4o, GPT-4o-mini, DeepSeek-V3, Gemini 1.5 Flash, Claude 3.5 Sonnet, LLaMA-3.3-70B, Gemma-2-9b-it, Gemma-2-2b-it, Gemma-22B-it, Phi-3.5-mini-instruct, Qwen2.5-3B-Instruct, Qwen2.5-7B-Instruct, LLaMA-3.1-8B-Instruct, LLaMA-3.2-3B-Instruct, BLOOM-7b, GPT-3.5-Turbo

Metrics

Accuracy, token length, PII reduction

Datasets

ARC-Challenge, MedMCQA, GPQA, MMLU, Open-LLM-Leaderboard, AVERITEC, WebQuestions, MMLU-privacy-augmented (authors)

Benchmarks

ARC-C, MedMCQA, GPQA, MMLU, Open-LLM Leaderboard, AVERITEC, WebQuestions