Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.7
Citation Count
0
Why It Matters For Business
Combining synthetic data distillation, LoRA, Muon, and GPTQ lets task-specific LLMs fit on constrained hardware with limited accuracy loss. The reported gains are roughly 2× lower memory use, about half the per-token latency, and lower inference cost.
Summary TLDR
The paper presents an end-to-end pipeline to make small LLMs ready for edge devices. It combines task-specific synthetic data generated by a large teacher, logit-based knowledge distillation into a compact student, LoRA for parameter-efficient fine-tuning, Optuna HPO, the Muon optimizer, and GPTQ 4-bit post-training quantization. Results across 8 benchmarks show that the pipeline usually outperforms naive GPTQ alone, yields about 2× memory compression (6.01 GB → 2.86 GB), and halves per-token latency. Muon fine-tuning also reduces the accuracy loss from quantization relative to Adam on most tasks.
Problem Statement
Large LLMs are too big and slow for edge devices. Engineers need a reproducible workflow that: (1) creates task-aligned training data when labels are scarce, (2) fine-tunes compact models efficiently, and (3) compresses them aggressively (4-bit) while keeping task accuracy high.
Main Contribution
A full pipeline that combines Self-Instruct synthetic data, logit-based knowledge distillation, LoRA fine-tuning, Bayesian HPO, Muon optimizer, and GPTQ 4-bit post-training quantization for edge-ready LLMs.
Empirical comparison across 8 benchmarks showing the integrated pipeline outperforms GPTQ-alone in final accuracy on most tasks.
Evidence that Muon-optimized LoRA fine-tuning reduces accuracy degradation after 4-bit quantization compared to Adam.
Practical throughput and memory results demonstrating ~2× memory reduction and ~50% per-token latency reduction after w4a16 quantization on an A40 GPU with vLLM.
Key Findings
Pipeline achieves roughly 2× model memory reduction with GPTQ w4a16.
Muon fine-tuning reduces quantization-induced accuracy drop versus Adam on most tasks.
HPO consistently selects pure KL-divergence distillation (no supervised CE) on synthetic data.
Quantization in the pipeline roughly halves per-token latency (TPOT) and increases throughput modestly.
Integrated pipeline usually beats GPTQ-only quantization on final accuracy.
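The pure KL-divergence finding above can be read as the α = 1 case of a mixed distillation objective. The sketch below is one common formulation, not the paper's confirmed loss: the mixing form `alpha * KL + (1 - alpha) * CE`, the KL direction (teacher to student), and the `temperature` parameter are all assumptions, and every function name is hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * (math.log(pi + eps) - math.log(qi + eps)) for pi, qi in zip(p, q))


def cross_entropy(q, label, eps=1e-12):
    """Negative log-likelihood of the ground-truth token index under q."""
    return -math.log(q[label] + eps)


def distillation_loss(student_logits, teacher_logits, label, alpha, temperature=1.0):
    """Assumed mixed objective: alpha weights the KL term, (1 - alpha) the CE term."""
    p_teacher = softmax(teacher_logits, temperature)
    q_student_soft = softmax(student_logits, temperature)
    q_student = softmax(student_logits)
    kd_term = kl_divergence(p_teacher, q_student_soft)
    ce_term = cross_entropy(q_student, label)
    return alpha * kd_term + (1.0 - alpha) * ce_term
```

With `alpha=1.0` the label never influences the loss, which is why a pure-KL setting relies entirely on the teacher's output distribution.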
Results
Model memory after quantization
Per-token latency (TPOT)
Accuracy
Who Should Care
What To Try In 7 Days
Generate a 600-sample synthetic dataset for one target task using a strong teacher and a seed prompt set.
LoRA fine-tune your 3B student with KL-distillation from a tokenizer-aligned teacher and run Optuna to tune α, rank, and learning rate.
Apply GPTQ w4a16 post-training quantization and compare accuracy and TPOT before and after; test Muon vs Adam for fine-tuning if available.
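The tuning step in the list above could be set up roughly as follows. This is a minimal sketch under stated assumptions: the search ranges, the candidate ranks, and `placeholder_validation_loss` are all hypothetical (the placeholder is an arbitrary function with no relation to the paper's results and should be replaced with a real fine-tune-and-evaluate run). The Optuna calls use the library's standard `create_study`, `suggest_float`, and `suggest_categorical` API with its TPE sampler. If Optuna is not installed, the sketch falls back to plain random search, which is a different and weaker technique.

```python
import math
import random

RANKS = [8, 16, 32, 64]  # assumed candidate LoRA ranks


def placeholder_validation_loss(alpha, rank, lr):
    """Hypothetical stand-in for 'fine-tune the student, return validation loss'."""
    return (alpha - 0.5) ** 2 + (math.log10(lr) + 4.0) ** 2 + abs(rank - 16) / 64.0


def run_search(n_trials=16, seed=0):
    """Search over (alpha, rank, lr); 16 trials mirrors the budget in the summary."""
    try:
        import optuna
    except ImportError:
        optuna = None

    if optuna is not None:
        def objective(trial):
            alpha = trial.suggest_float("alpha", 0.0, 1.0)
            rank = trial.suggest_categorical("rank", RANKS)
            lr = trial.suggest_float("lr", 1e-5, 1e-3, log=True)
            return placeholder_validation_loss(alpha, rank, lr)

        sampler = optuna.samplers.TPESampler(seed=seed)
        study = optuna.create_study(direction="minimize", sampler=sampler)
        study.optimize(objective, n_trials=n_trials)
        return study.best_params, study.best_value

    # Fallback when Optuna is unavailable: plain random search (not Bayesian).
    rng = random.Random(seed)
    best_params, best_value = None, float("inf")
    for _ in range(n_trials):
        params = {
            "alpha": rng.uniform(0.0, 1.0),
            "rank": rng.choice(RANKS),
            "lr": 10 ** rng.uniform(-5.0, -3.0),
        }
        value = placeholder_validation_loss(**params)
        if value < best_value:
            best_params, best_value = params, value
    return best_params, best_value
```

Calling `run_search()` returns the best parameter dictionary and its loss. Sampling the learning rate on a log scale is a common choice because useful values span orders of magnitude.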
Optimization Features
Token Efficiency
- Measured TPOT and ITL reductions after quantization
Infra Optimization
- Measured on 1x Ampere A40 GPU
Model Optimization
- GPTQ 4-bit post-training quantization (w4a16)
- LoRA
- Weight quantization on linear layers
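To make "w4a16" concrete: weights are stored as 4-bit integer codes with per-group scale and offset, while activations stay in 16-bit floating point. The sketch below uses simple round-to-nearest quantization, which is a deliberately simpler technique than GPTQ (GPTQ additionally compensates rounding error using second-order information). The group size of 128 and all function names are assumptions for illustration.

```python
def quantize_group_rtn(weights, bits=4):
    """Asymmetric round-to-nearest quantization of one group of weights."""
    levels = (1 << bits) - 1  # 15 distinct steps for 4-bit codes
    w_min, w_max = min(weights), max(weights)
    # Guard against a constant group, where the range would be zero.
    scale = (w_max - w_min) / levels if w_max > w_min else 1.0
    codes = [min(levels, max(0, round((w - w_min) / scale))) for w in weights]
    return codes, scale, w_min


def dequantize_group(codes, scale, zero):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale + zero for c in codes]


def quantize_row(row, group_size=128, bits=4):
    """Quantize then dequantize one weight row, group by group (assumed group size)."""
    restored = []
    for start in range(0, len(row), group_size):
        group = row[start:start + group_size]
        codes, scale, zero = quantize_group_rtn(group, bits)
        restored.extend(dequantize_group(codes, scale, zero))
    return restored
```

Each restored weight differs from the original by at most half a quantization step (`scale / 2`). That rounding error is the source of the accuracy loss that fine-tuning choices aim to reduce.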
System Optimization
- Shared tokenizer between teacher and student to reduce distribution shift
Training Optimization
- LoRA
- Adam baseline comparison
- Bayesian HPO via Optuna (16 trials)
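For readers unfamiliar with LoRA, the sketch below shows the standard low-rank update, `y = W x + (lora_alpha / rank) * B (A x)`, where only `A` and `B` are trained. The name `lora_alpha` is used to keep LoRA's scaling factor distinct from the distillation weight α mentioned elsewhere in this summary. The 3072-wide layer in the usage note is an illustrative assumption, not a figure from the paper.

```python
def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, lora_alpha, rank):
    """Frozen base projection plus a scaled low-rank correction.

    W: d_out x d_in (frozen), A: rank x d_in, B: d_out x rank (trainable).
    """
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scaling = lora_alpha / rank
    return [b + scaling * u for b, u in zip(base, update)]


def lora_trainable_params(d_in, d_out, rank):
    """Parameters in A and B for one adapted linear layer."""
    return rank * (d_in + d_out)


def full_params(d_in, d_out):
    """Parameters in the frozen dense weight matrix."""
    return d_in * d_out
```

For a hypothetical 3072 × 3072 projection at rank 16, `lora_trainable_params(3072, 3072, 16)` is 98,304 against `full_params(3072, 3072)` of 9,437,184, or about 1% of the layer.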
Inference Optimization
- vLLM deployment
- Prefill overhead noted for the MarlinLinearKernel
Reproducibility
Open Source Status
- unknown
Risks & Boundaries
Limitations
- Experiments use synthetic datasets of ~600 QAs per task; real-data generalization is untested.
- Only one student size (Llama 3.2 3B) and specific teacher models were evaluated.
- HPO used only 16 trials, which may not find global optima for all tasks.
- Results measured mainly on an A40 GPU; edge-device behavior on diverse hardware is not shown.
When Not To Use
- You need a general-purpose model instead of a task-specialized one.
- You lack a strong teacher model to generate high-quality synthetic data.
- Your deployment environment cannot support GPTQ or w4a16 quantized runtimes.
Failure Modes
- When HPO selects α = 1 (pure KL distillation), the student can inherit teacher biases and receives no ground-truth signal.
- Muon may not outperform Adam for all pretraining/fine-tuning combinations (authors note mixed results in literature).
- Aggressive 4-bit quantization can still degrade accuracy for some tasks despite Muon.
Core Entities
Models
- Llama 4 Scout 109B (T1, teacher for data gen)
- Llama 3.3 70B Instruct (T2, teacher for distillation)
- Llama 3.2 3B Instruct (S1, student)
Metrics
- Accuracy
- Model size (GB)
- Throughput (tokens/s)
- TPOT (ms/token)
- ITL (ms/token)
- Validation loss
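The latency metrics listed above can be derived from per-token arrival timestamps. The sketch below follows the definitions commonly used in LLM serving benchmarks, which are assumed here rather than taken from the paper: TPOT is decode time divided by the number of output tokens after the first, and ITL is the mean gap between consecutive tokens. The function and key names are hypothetical.

```python
def latency_metrics(request_start, token_times):
    """Compute serving metrics for one request.

    request_start: wall-clock time (seconds) when the request was sent.
    token_times: wall-clock times (seconds) at which each output token arrived.
    """
    if len(token_times) < 2:
        raise ValueError("need at least two output tokens to compute TPOT and ITL")
    n_tokens = len(token_times)
    ttft = token_times[0] - request_start      # time to first token (prefill cost)
    total = token_times[-1] - request_start    # end-to-end latency
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    return {
        "ttft_ms": ttft * 1000.0,
        "tpot_ms": (total - ttft) / (n_tokens - 1) * 1000.0,
        "itl_ms": sum(gaps) / len(gaps) * 1000.0,
        "throughput_tok_s": n_tokens / total,
    }
```

For a single request the mean ITL equals TPOT. The two typically diverge only when aggregated across many requests, because TPOT is averaged per request while ITL is averaged over individual token gaps.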
Datasets
- Synthetic Self-Instruct datasets (600 QA pairs per task, Alpaca format)
- MMLU
- ARC-e
- CommonsenseQA
- HellaSwag
- OpenBookQA
- PIQA
- SIQA
- WinoGrande
Benchmarks
- MMLU
- ARC-e
- CommonsenseQA
- HellaSwag
- OpenBookQA
- PIQA
- SIQA
- WinoGrande

