Overview
The method is practical and has been evaluated on many public benchmarks, but expect engineering work to integrate teacher calls and redaction pipelines before production.
Citations: 0
Evidence Strength: 0.75
Confidence: 0.86
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 4/4
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 80%
Production readiness: 70%
Novelty: 70%
Why It Matters For Business
DRAG lets organizations run fact-grounded generation on smaller, local models to cut cloud costs, reduce latency, and limit sensitive data exposure while keeping much of the accuracy of large RAG systems.
Summary TLDR
DRAG is a practical distillation pipeline that uses a large LLM to generate ranked textual evidence and relationship triples, then feeds a filtered set of evidences and a compact knowledge graph to a small local language model (SLM). The distilled SLMs gain much of the factual grounding of big RAG systems while using far less compute. Key wins: evidence-driven distillation outperforms graph-only distillation, ~15 evidences is a good trade-off, graphs cut token costs (~18%), and a privacy filter reduces injected PII by 95.7%. Code is released.
Problem Statement
Large retrieval-augmented systems give better factual answers but are too heavy and cloud-bound for many real deployments. Smaller local models lack that grounding and hallucinate. The paper asks: can we transfer RAG-style evidence and graph reasoning from big LLMs to small LMs to boost factuality and preserve privacy and efficiency?
Main Contribution
DRAG: a four-step distillation pipeline that uses a teacher LLM to produce evidence, rank it, extract relationship triples, and provide a compact prompt for a student SLM.
A privacy-preserving workflow and a synthetic privacy-leakage benchmark showing how SLMs can redact PII before querying cloud teachers.
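The four steps named above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function `build_student_prompt`, the three teacher callables, and the prompt layout are all hypothetical names chosen for this example, and the default of 15 kept evidences simply mirrors the trade-off reported in the summary.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Triple = Tuple[str, str, str]  # (head, relation, tail)


@dataclass
class Evidence:
    text: str
    score: float  # teacher-assigned relevance; higher is better


def build_student_prompt(
    question: str,
    generate_evidence: Callable[[str], List[str]],    # hypothetical teacher call
    score_evidence: Callable[[str, str], float],      # hypothetical teacher call
    extract_triples: Callable[[str], List[Triple]],   # hypothetical teacher call
    top_k: int = 15,
) -> str:
    # Step 1: the teacher LLM proposes candidate evidence for the question.
    candidates = generate_evidence(question)

    # Step 2: rank candidates by teacher score and keep the top_k.
    ranked = sorted(
        (Evidence(text, score_evidence(question, text)) for text in candidates),
        key=lambda ev: ev.score,
        reverse=True,
    )[:top_k]

    # Step 3: extract relationship triples from the kept evidence, de-duplicated.
    triples: List[Triple] = []
    for ev in ranked:
        for triple in extract_triples(ev.text):
            if triple not in triples:
                triples.append(triple)

    # Step 4: assemble a compact prompt for the student SLM (layout is assumed).
    lines = ["Evidence:"]
    lines += [f"- {ev.text}" for ev in ranked]
    lines.append("Graph:")
    lines += [f"({h}, {r}, {t})" for h, r, t in triples]
    lines.append(f"Question: {question}")
    return "\n".join(lines)
```

In practice the three callables would wrap calls to whichever teacher LLM is available; keeping them as plain function arguments makes the filtering logic testable with stubs before any cloud calls are wired in.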
Key Findings
DRAG outperforms prior small-model RAG baselines on ARC-Challenge by up to +27.7% under the same backbones.
Converting evidences to a compact graph cuts token length by about 18.1% on average.
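To check a token-saving figure like this on your own data, the relative reduction can be computed per example and averaged. The sketch below is an assumption-laden stand-in: it counts whitespace-separated words rather than using the student model's real tokenizer, and `token_reduction` is a hypothetical helper, not part of the released code.

```python
def token_reduction(evidence_text: str, graph_text: str) -> float:
    """Fraction of tokens saved when the graph form replaces the evidence text.

    Whitespace splitting stands in for a real tokenizer (an assumption);
    swap in the student model's tokenizer for production measurements.
    """
    before = len(evidence_text.split())
    after = len(graph_text.split())
    if before == 0:
        raise ValueError("evidence_text must contain at least one token")
    return (before - after) / before
```

A return value of 0.181 would correspond to the roughly 18.1% average cut reported above; negative values mean the graph form is longer than the original evidence.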
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | up to 94.1% | varies by SLM (original baseline e.g., 53–63%) | up to +27.7% vs MiniRAG under same backbone | ARC-C (multiple SLMs) | Table 1 reports DRAG ARC-C scores up to 94.1% | Table 1 |
| Accuracy | Gemma-2-9b-it 53.45% (DRAG E) | 46.44% (original) | +7.01 percentage points (absolute) | Open-LLM Leaderboard | Table 6 shows evidence-based distillation increases Gemma-2-9b-it from 46.44% to 53.45% | Table 6 |
What To Try In 7 Days
Run DRAG evidence-only distillation with ~15 evidences on one SLM and an available teacher LLM to test accuracy lift on a pilot dataset.
Add the simple graph aggregation to cut token use and measure latency and cost differences.
Implement a local PII-redaction step before teacher calls and measure PII leakage and accuracy trade-offs.
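For the third pilot step, a local redaction pass might start as simply as the sketch below. It is only a regex stand-in under stated assumptions: the summary does not describe how the paper's privacy filter works, the two patterns cover just email addresses and phone-like digit runs, and `redact` is a hypothetical helper name.

```python
import re
from typing import Tuple

# Deliberately simple patterns (assumptions); real PII coverage needs far more.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text: str) -> Tuple[str, int]:
    """Replace emails and phone-like numbers with placeholders.

    Returns the redacted text and the number of replacements, so the count
    can be logged as a rough leakage metric before any teacher call.
    """
    text, n_email = EMAIL.subn("[EMAIL]", text)
    text, n_phone = PHONE.subn("[PHONE]", text)
    return text, n_email + n_phone


query = "Mail jane.doe@example.com or call +1 555-123-4567 about the ARC question."
safe_query, n_redacted = redact(query)
# safe_query is what would be sent to the cloud teacher instead of query.
```

Comparing accuracy with `query` versus `safe_query` on the pilot set gives a first read on the privacy and accuracy trade-off mentioned above.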
Optimization Features
Token Efficiency
Model Optimization
System Optimization
Training Optimization
Inference Optimization
Risks & Boundaries
Limitations
Some nuanced or implicit knowledge from teacher LLMs may be lost during distillation, hurting creative or subjective tasks.
The distillation process itself requires non-trivial compute to generate and rank many evidences and graphs.
When Not To Use
Tasks that need deeply implicit, creative, or subjective reasoning where explicit evidence is not helpful.
Ultra-low-latency systems where any cloud teacher call is unacceptable.
Failure Modes
Teacher bias or incorrect evidence propagates into the student, causing consistent but wrong answers.
Overfiltering evidences can remove crucial facts and reduce accuracy.

