Pipeline that combines synthetic-data distillation, LoRA, Muon and GPTQ to make task-specialized LLMs fit on edge devices

January 14, 2026, 8 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

0

Authors

Jacob Sander, Brian Jalaian, Venkat R. Dasari

Why It Matters For Business

You can fit task-specific LLMs on constrained hardware with little accuracy loss by combining synthetic-data distillation, LoRA, Muon, and GPTQ. The reported result is roughly half the memory footprint and half the per-token latency, which lowers inference cost.

Summary TLDR

The paper presents an end-to-end pipeline for making small LLMs ready for edge devices. A large teacher generates task-specific synthetic data; a compact student is then trained with logit-based knowledge distillation and LoRA parameter-efficient fine-tuning, using Optuna for hyperparameter optimization (HPO) and the Muon optimizer, and is finally compressed with GPTQ 4-bit post-training quantization. Across 8 benchmarks, the pipeline usually outperforms naive GPTQ alone, yields about 2× memory compression (6.01 GB → 2.86 GB), and roughly halves per-token latency. Muon fine-tuning also loses less accuracy to quantization than Adam on most tasks.

Problem Statement

Large LLMs are too big and slow for edge devices. Engineers need a reproducible workflow that: (1) creates task-aligned training data when labels are scarce, (2) fine-tunes compact models efficiently, and (3) compresses them aggressively (4-bit) while keeping task accuracy high.

Main Contribution

A full pipeline that combines Self-Instruct synthetic data, logit-based knowledge distillation, LoRA fine-tuning, Bayesian HPO, Muon optimizer, and GPTQ 4-bit post-training quantization for edge-ready LLMs.

Empirical comparison across 8 benchmarks showing the integrated pipeline outperforms GPTQ-alone in final accuracy on most tasks.

Evidence that Muon-optimized LoRA fine-tuning reduces accuracy degradation after 4-bit quantization compared to Adam.

Practical throughput and memory results demonstrating ~2× memory reduction and ~50% per-token latency reduction after w4a16 quantization on an A40 GPU with vLLM.

Key Findings

Pipeline achieves roughly 2× model memory reduction with GPTQ w4a16 (4-bit weights, 16-bit activations).

Numbers: Model size 6.01 GB → 2.86 GB (Table 5)

Muon fine-tuning reduces quantization-induced accuracy drop versus Adam on most tasks.

Numbers: Muon loses less on quantization in 6 of 8 benchmarks; ARC-e: Adam drop 3.16% vs Muon 0.55% (Table 4)

HPO consistently selects pure KL-divergence distillation (no supervised CE) on synthetic data.

Numbers: Optuna found distillation weight α = 1 across tasks (Table 6)
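To make the role of α concrete, here is a minimal sketch of a blended distillation objective. It assumes the common convex-combination form α · KL(teacher ‖ student) + (1 − α) · CE(student, label). The summary only says that α is the distillation weight and that α = 1 means no supervised CE, so the exact formula, the function names, and the omission of a softmax temperature are assumptions rather than the authors' implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def cross_entropy(q, label):
    """Supervised cross-entropy of distribution q against a hard label index."""
    return -math.log(q[label])


def distillation_loss(student_logits, teacher_logits, label, alpha=1.0):
    """Assumed blend: alpha * KL(teacher || student) + (1 - alpha) * CE."""
    p_teacher = softmax(teacher_logits)
    q_student = softmax(student_logits)
    kd_term = kl_divergence(p_teacher, q_student)
    ce_term = cross_entropy(q_student, label)
    return alpha * kd_term + (1.0 - alpha) * ce_term
```

With `alpha=1.0` the label argument has no effect on the loss, which is what "pure KL-divergence distillation" means under this assumed form: the student is trained only to match the teacher's output distribution.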

Quantization in the pipeline roughly halves time per output token (TPOT) and raises throughput by about 24%.

Numbers: TPOT 17.49 ms → 8.82 ms (≈50% latency reduction); throughput 1387.64 → 1722.82 tok/s (Table 5)
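The headline ratios follow directly from the Table 5 figures quoted above. The short calculation below reproduces them; the variable and function names are illustrative only.

```python
# Figures quoted from Table 5 in the summary above.
size_before_gb, size_after_gb = 6.01, 2.86
tpot_before_ms, tpot_after_ms = 17.49, 8.82
throughput_before, throughput_after = 1387.64, 1722.82


def relative_change(before, after):
    """Signed fractional change from `before` to `after`."""
    return (after - before) / before


compression_ratio = size_before_gb / size_after_gb
tpot_change = relative_change(tpot_before_ms, tpot_after_ms)
throughput_change = relative_change(throughput_before, throughput_after)

print(f"memory compression: {compression_ratio:.2f}x")   # 2.10x
print(f"TPOT change: {tpot_change:+.1%}")                # -49.6%
print(f"throughput change: {throughput_change:+.1%}")    # +24.2%
```

Per-token latency drops by about half, while aggregate throughput rises by only about a quarter, so the two metrics should be reported separately rather than folded into a single "2× faster" claim.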

Integrated pipeline usually beats GPTQ-only quantization on final accuracy.

Numbers: Pipeline (Adam+Muon variants) outperforms GPTQ alone on 6 of 8 benchmarks (Figure 3 / Table 3)

Results

Model memory after quantization

Value: 2.86 GB (post-quant)

Baseline: 6.01 GB (pre-quant)

Per-token latency (TPOT)

Value: 8.82 ms/token (post-quant)

Baseline: 17.49 ms/token (pre-quant)

Accuracy drop after quantization

Value: Muon: typically 0.0–0.02 absolute drop; Adam: up to 0.0336 drop

Baseline: Per-task LoRA fine-tuned accuracy

Accuracy versus GPTQ alone

Value: Pipeline (Adam+Muon variants) wins 6/8 benchmarks

Baseline: GPTQ-alone accuracy

What To Try In 7 Days

Generate a 600-sample synthetic dataset for one target task using a strong teacher and a seed prompt set.

LoRA fine-tune your 3B student with KL-distillation from a tokenizer-aligned teacher and run Optuna to tune α, rank, and learning rate.

Apply GPTQ w4a16 post-training quantization and compare accuracy and TPOT before and after; test Muon vs Adam for fine-tuning if available.
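For the first step, the sketch below shows one way to collect teacher outputs into Alpaca-format records (`instruction`, `input`, `output`). It is a simplification under stated assumptions: `teacher` is a hypothetical callable that receives a few seed questions as in-context examples and returns a new `(question, answer)` pair, and exact-match deduplication stands in for whatever filtering the paper's Self-Instruct step actually applies.

```python
import json
import random


def make_alpaca_record(instruction, output, input_text=""):
    """Build one record with the three standard Alpaca fields."""
    return {"instruction": instruction, "input": input_text, "output": output}


def generate_dataset(seed_questions, teacher, n_samples=600, shots=3, seed=0):
    """Collect `n_samples` unique QA records from a teacher.

    `teacher` is a hypothetical callable: list[str] -> (question, answer).
    """
    rng = random.Random(seed)
    records, seen = [], set()
    attempts = 0
    # Cap the attempts so a repetitive teacher cannot loop forever.
    while len(records) < n_samples and attempts < 10 * n_samples:
        attempts += 1
        examples = rng.sample(seed_questions, k=min(shots, len(seed_questions)))
        question, answer = teacher(examples)
        key = question.strip().lower()
        if key in seen:  # drop exact duplicates (case-insensitive)
            continue
        seen.add(key)
        records.append(make_alpaca_record(question, answer))
    return records


def to_jsonl(records):
    """Serialize records as JSON Lines, one record per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)
```

In practice `teacher` would wrap a call to whichever large model is available, and the returned list may be shorter than `n_samples` if the teacher keeps repeating itself.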

Optimization Features

Token Efficiency

  • Measured TPOT and ITL reductions after quantization
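As a rough guide to what these two metrics measure, the sketch below computes them from per-token arrival timestamps. It assumes the definitions commonly used in serving benchmarks: ITL is the gap between consecutive output tokens, and TPOT is the mean time per output token after the first. The summary does not state the paper's exact measurement procedure, and the function names are illustrative.

```python
def inter_token_latencies_ms(token_times_s):
    """Gaps between consecutive output tokens, in milliseconds."""
    return [(later - earlier) * 1000.0
            for earlier, later in zip(token_times_s, token_times_s[1:])]


def tpot_ms(token_times_s):
    """Mean time per output token after the first one, in milliseconds."""
    if len(token_times_s) < 2:
        raise ValueError("need at least two tokens to compute TPOT")
    decode_time_s = token_times_s[-1] - token_times_s[0]
    return decode_time_s * 1000.0 / (len(token_times_s) - 1)


# Arrival times (seconds since request start) of four output tokens.
times = [0.100, 0.110, 0.125, 0.130]
print(inter_token_latencies_ms(times))
print(tpot_ms(times))
```

Under these definitions, the time before the first token (prefill) is excluded from both metrics, so a change in prefill cost would show up in throughput or time-to-first-token rather than in TPOT.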

Infra Optimization

  • Measured on 1x Ampere A40 GPU

Model Optimization

  • GPTQ 4-bit post-training quantization (w4a16)
  • LoRA
  • Weight quantization on linear layers
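To illustrate what 4-bit weight storage involves, here is a minimal round-to-nearest sketch for one group of weights. This is deliberately not GPTQ: GPTQ additionally uses second-order information to compensate for rounding error as it quantizes, which is omitted here. The asymmetric min/max scheme, the group handling, and all names are assumptions chosen for brevity.

```python
def quantize_group(weights, bits=4):
    """Round-to-nearest asymmetric quantization of one weight group.

    Returns integer codes in [0, 2**bits - 1] plus the scale and
    zero offset needed to reconstruct approximate weights.
    """
    qmax = 2 ** bits - 1
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / qmax
    if scale == 0.0:  # constant group: any positive scale works
        scale = 1.0
    codes = [min(qmax, max(0, round((w - w_min) / scale))) for w in weights]
    return codes, scale, w_min


def dequantize_group(codes, scale, zero):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale + zero for c in codes]


weights = [-0.8, -0.1, 0.0, 0.25, 0.7, 1.2]
codes, scale, zero = quantize_group(weights)
restored = dequantize_group(codes, scale, zero)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(codes, max_error)
```

Each weight is stored as a 4-bit code, with one scale and offset shared per group, and the reconstruction error per weight is bounded by half the scale. In a w4a16 runtime the weights are dequantized to 16-bit for the matrix multiply, so activations stay at 16-bit precision.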

System Optimization

  • Shared tokenizer between teacher and student to reduce distribution shift

Training Optimization

  • LoRA
  • Adam baseline comparison
  • Bayesian HPO via Optuna (16 trials)
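The parameter savings from LoRA can be estimated with simple arithmetic. LoRA replaces a full update to a d_out × d_in weight matrix with the product of two low-rank factors, so the trainable count per adapted matrix is rank × (d_in + d_out). The layer dimensions and rank below are illustrative values, not figures from the paper.

```python
def lora_trainable_params(d_in, d_out, rank):
    """Trainable parameters in the two LoRA factors for one weight matrix:
    A has shape (rank, d_in) and B has shape (d_out, rank)."""
    return rank * (d_in + d_out)


def full_params(d_in, d_out):
    """Parameters in the frozen full-rank weight matrix."""
    return d_in * d_out


# Hypothetical square projection layer and rank, for illustration only.
d_in, d_out, rank = 4096, 4096, 16
trainable = lora_trainable_params(d_in, d_out, rank)
frozen = full_params(d_in, d_out)
print(trainable, frozen, f"{trainable / frozen:.2%}")  # 131072 16777216 0.78%
```

Because the trainable count grows linearly with rank, rank is a natural hyperparameter to include in the search alongside α and the learning rate.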

Inference Optimization

  • vLLM deployment
  • MarlinLinearKernel prefill noted as overhead

Reproducibility

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Experiments use synthetic datasets of ~600 QAs per task; real-data generalization is untested.
  • Only one student size (Llama 3.2 3B) and specific teacher models were evaluated.
  • HPO used only 16 trials, which may not find global optima for all tasks.
  • Results measured mainly on an A40 GPU; edge-device behavior on diverse hardware is not shown.

When Not To Use

  • You need a general-purpose model instead of a task-specialized one.
  • You lack a strong teacher model to generate high-quality synthetic data.
  • Your deployment environment cannot support GPTQ or w4a16 quantized runtimes.

Failure Modes

  • HPO choosing α = 1 could transfer teacher biases and omit ground-truth signals.
  • Muon may not outperform Adam for all pretraining/fine-tuning combinations (authors note mixed results in literature).
  • Aggressive 4-bit quantization can still degrade accuracy for some tasks despite Muon.

Core Entities

Models

  • Llama 4 Scout 109B (T1, teacher for data gen)
  • Llama 3.3 70B Instruct (T2, teacher for distillation)
  • Llama 3.2 3B Instruct (S1, student)

Metrics

  • Accuracy
  • Model size (GB)
  • Throughput (tokens/s)
  • TPOT (ms/token)
  • ITL (ms/token)
  • Validation loss

Datasets

  • Synthetic Self-Instruct datasets (600 QA pairs per task, Alpaca format)
  • MMLU
  • ARC-e
  • CommonsenseQA
  • HellaSwag
  • OpenBookQA
  • PIQA
  • SIQA
  • WinoGrande

Benchmarks

  • MMLU
  • ARC-e
  • CommonsenseQA
  • HellaSwag
  • OpenBookQA
  • PIQA
  • SIQA
  • WinoGrande