Recover lost accuracy in corrupted small LMs by training tiny LoRA adapters with synthetic data and logit distillation

October 6, 2025 (7 min read)

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

0

Authors

Devleena Das, Rajeev Patwari, Ashish Sirasao

Why It Matters For Business

Deployments can silently corrupt weights during conversion or serialization. Recover-LoRA offers a low-cost way to restore accuracy without labeled data or full retraining, saving time and lowering risk for edge and on-device models.

Summary TLDR

Recover-LoRA trains small LoRA adapters using synthetic data and logit distillation to restore accuracy in functionally degraded small language models (SLMs). On three of the four SLMs tested, it recovered accuracy, with average AR% between +4.95 and +17.24, while using far less data and far fewer trainable parameters than full-model distillation or supervised finetuning. It reduced accuracy on the fourth model (Gemma2 2B), so adapter placement and a matching tokenizer for data generation matter.

Problem Statement

Deployment conversions or bad serialization can silently corrupt model weights and drop task accuracy. Full retraining or labeled data may be unavailable. How can we cheaply restore accuracy when weights are degraded, without labeled data?

Main Contribution

Recover-LoRA: a lightweight method that needs no labeled or real training data. It trains only LoRA adapters, on synthetic data, to align a degraded model with its full-precision reference via logit distillation.

Empirical study on four small models (1B–2B) across seven downstream tasks showing Recover-LoRA often recovers accuracy while using much less data and fewer trainable parameters than alternatives.

Practical guidance: synthetic hybrid sampling, adapter placement choices (e.g., K/V vs attention+MLP), and trade-offs for deployment.
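
The logit-distillation objective named in the first contribution can be sketched for a single token position as follows. This is a minimal illustration, not the authors' implementation: the summary does not state which divergence or temperature the paper uses, so KL(teacher || student) with temperature 1.0 is an assumption, and the logit values are made up.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_distillation_loss(teacher_logits, student_logits, temperature=1.0):
    """KL(teacher || student) for one token position.

    Assumption: KL divergence is used here for illustration only;
    the summary does not specify the paper's exact loss.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * (math.log(pi) - math.log(qi))
               for pi, qi in zip(p, q) if pi > 0.0)

# Made-up logits over a 3-token vocabulary.
teacher = [2.0, 0.5, -1.0]   # reference (uncorrupted) model
student = [1.2, 0.9, -0.3]   # degraded model with LoRA adapters attached
loss = kl_distillation_loss(teacher, student)
self_loss = kl_distillation_loss(teacher, teacher)  # identical logits
```

In a real training loop this quantity would be averaged over all positions of each synthetic sample, with gradients flowing only into the adapter weights.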

Key Findings

Recover-LoRA improved accuracy (positive AR%) on three of the four tested SLMs.

Numbers: AR% = +17.24 (AMD-OLMO-SFT 1B), +13.38 (Llama3.2 1B), +4.95 (DeepSeek-R1 1.5B)

Recover-LoRA failed or reduced accuracy on at least one model architecture (Gemma2 2B).

Numbers: AR% = -7.45 (Gemma2 2B)

Recover-LoRA uses much less synthetic data than supervised finetuning while staying parameter-efficient.

Numbers: Synthetic: 90k–120k samples vs SFT labeled: 3M samples

Distilling and updating all model parameters (LLM QAT* adaptation) worsened degradation in experiments.

Numbers: LLM QAT* AR% example: -10.34 (AMD-OLMO-SFT 1B)

The simulated corruption magnitude (L2 norm) is small but measurable between original and perturbed weights.

Numbers: L2 norms: 35.94–52.97 across models (Table 1)
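
To make the corruption metric concrete, here is a small sketch of an L2 distance between original and perturbed weights. It assumes a single global norm over all tensors, which the summary does not confirm (the paper may compute it differently), and the tensor names and values are invented for illustration.

```python
import math

def l2_weight_distance(original, perturbed):
    """L2 norm of (original - perturbed), accumulated over every tensor.

    Each tensor is given as a flat list of floats keyed by its name.
    Assumption: one global norm across all tensors.
    """
    total = 0.0
    for name, weights in original.items():
        other = perturbed[name]
        if len(weights) != len(other):
            raise ValueError(f"size mismatch for {name}")
        total += sum((a - b) ** 2 for a, b in zip(weights, other))
    return math.sqrt(total)

# Toy state dicts with hypothetical tensor names.
original = {"layer0.weight": [0.0, 1.0, 2.0], "layer1.weight": [1.0, 1.0]}
perturbed = {"layer0.weight": [3.0, 1.0, 2.0], "layer1.weight": [1.0, 5.0]}
distance = l2_weight_distance(original, perturbed)  # sqrt(3**2 + 4**2)
```

Running the same check between a reference checkpoint and a converted or re-serialized one is a cheap way to spot silent weight drift before measuring task accuracy.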

Results

LoRA
Value: AMD-OLMO-SFT 1B: +17.24%
Baseline: Degraded vs pretrained

LoRA
Value: Llama3.2 1B: +13.38%
Baseline: Degraded vs pretrained

LoRA
Value: Gemma2 2B: -7.45%
Baseline: Degraded vs pretrained

LoRA
Value: DeepSeek-R1-Distill-Qwen 1.5B: +4.95%
Baseline: Degraded vs pretrained

Average AR% (LLM QAT* baseline)
Value: Examples: -10.34% (AMD-OLMO-SFT), -14.75% (Llama3.2)
Baseline: Degraded vs pretrained

Synthetic data used
Value: Recover-LoRA: 90k (AMD) or 120k samples (others); SFT LoRA: 3M labeled samples

Who Should Care

What To Try In 7 Days

Reproduce: generate about 100k synthetic samples (the paper used 90k–120k) with the original model's tokenizer, then run Recover-LoRA on your degraded SLM.

Adapter search: test LoRA on K/V layers first, then try attention+MLP if recovery is limited.

Baseline check: compare AR% against a small supervised LoRA run (if labeled data exists) to validate the synthetic-data approach.
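
For the adapter search step, one way to switch between the two placements is to filter module names by suffix. The sketch below assumes Llama-style projection names as used in Hugging Face Transformers (`q_proj`, `k_proj`, and so on). Other architectures may name these layers differently, and the helper `select_adapter_targets` is hypothetical.

```python
# Assumption: Llama-style module names; check your model's actual names.
KV_SUFFIXES = ("k_proj", "v_proj")
ATTN_MLP_SUFFIXES = (
    "q_proj", "k_proj", "v_proj", "o_proj",   # attention projections
    "gate_proj", "up_proj", "down_proj",      # MLP projections
)

def select_adapter_targets(module_names, placement="kv"):
    """Return the module names that should receive LoRA adapters."""
    if placement == "kv":
        suffixes = KV_SUFFIXES
    elif placement == "attn_mlp":
        suffixes = ATTN_MLP_SUFFIXES
    else:
        raise ValueError(f"unknown placement: {placement}")
    # str.endswith accepts a tuple of candidate suffixes.
    return [name for name in module_names if name.endswith(suffixes)]

# Hypothetical module names for one transformer block.
names = [
    "layers.0.self_attn.q_proj",
    "layers.0.self_attn.k_proj",
    "layers.0.self_attn.v_proj",
    "layers.0.self_attn.o_proj",
    "layers.0.mlp.gate_proj",
    "layers.0.mlp.up_proj",
    "layers.0.mlp.down_proj",
    "layers.0.input_layernorm",
]
kv_targets = select_adapter_targets(names, "kv")
wide_targets = select_adapter_targets(names, "attn_mlp")
```

If you train with the PEFT library, a list like this would typically be passed as `target_modules` in `LoraConfig`. Starting with the narrower K/V set keeps the trainable parameter count low before widening the search.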

Optimization Features

Model Optimization

  • LoRA
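
As a rough illustration of why LoRA is parameter-efficient: a rank-r adapter on a d_out x d_in weight matrix adds two small matrices, A (r x d_in) and B (d_out x r), and only those are trained. The dimensions and rank below are hypothetical and are not taken from the paper.

```python
def lora_trainable_params(d_out, d_in, rank):
    """Parameters in the LoRA factors: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)

def dense_params(d_out, d_in):
    """Parameters in the frozen dense weight the adapter sits next to."""
    return d_out * d_in

# Hypothetical projection size and rank, chosen only for the arithmetic.
d_out, d_in, rank = 2048, 2048, 8
adapter = lora_trainable_params(d_out, d_in, rank)  # 8 * 4096 = 32768
dense = dense_params(d_out, d_in)                   # 2048 * 2048 = 4194304
fraction = adapter / dense                          # 1/128 of the dense weight
```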

Training Optimization

  • logit distillation with synthetic data
  • hybrid sampling (first 3–5 tokens greedy, rest stochastic)
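
The hybrid sampling schedule listed above (greedy for the first few tokens, stochastic afterwards) can be sketched as below. This is a toy under stated assumptions: `toy_model` is a hypothetical stand-in for a real language model, the greedy prefix of 4 is one value from the 3–5 range, and details such as temperature or top-k are not given in the summary and are left out.

```python
import random

def hybrid_sample(next_token_probs, prompt, max_new_tokens,
                  greedy_prefix=4, seed=0):
    """Generate tokens: argmax for the first `greedy_prefix` steps, then sample.

    `next_token_probs(tokens)` must return a dict mapping token -> probability.
    """
    rng = random.Random(seed)
    tokens = list(prompt)
    for step in range(max_new_tokens):
        probs = next_token_probs(tokens)
        if step < greedy_prefix:
            # Greedy phase: take the most likely token.
            token = max(probs, key=probs.get)
        else:
            # Stochastic phase: draw from the full distribution.
            candidates = list(probs)
            weights = [probs[c] for c in candidates]
            token = rng.choices(candidates, weights=weights, k=1)[0]
        tokens.append(token)
    return tokens

def toy_model(tokens):
    """Hypothetical fixed next-token distribution, ignoring the context."""
    return {"a": 0.6, "b": 0.3, "c": 0.1}

sample = hybrid_sample(toy_model, ["<s>"], max_new_tokens=8, greedy_prefix=4)
```

The greedy prefix keeps each sample anchored to a high-probability opening, while the stochastic tail adds diversity across the synthetic set.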

Reproducibility

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Evaluated only on small models (1B–2B); larger models (7B+) were not tested.
  • Model-dependent: adapter placement (K/V vs attention+MLP) materially affects success.
  • Tested corruption is simulated improper serialization; other degradation sources (heavy quantization, pruning) need more study.
  • Requires synthetic data generated with a matching tokenizer/vocabulary; mismatch hurts results.

When Not To Use

  • If the model architecture or tokenizer prevents matching synthetic-data generation.
  • If degradation is structural (missing layers) rather than small weight corruption.
  • If supervised labeled data and resources for full finetuning are available and preferred.

Failure Modes

  • Negative AR% (method can worsen performance) as observed for Gemma2 2B.
  • Overfitting when updating all parameters (LLM QAT*), leading to worse accuracy.
  • Sensitivity to synthetic data quality and quantity; too few or mismatched samples reduce recovery.

Core Entities

Models

  • AMD-OLMO-SFT 1B
  • Llama3.2 1B
  • Gemma2 2B
  • DeepSeek-R1-Distill-Qwen 1.5B

Metrics

  • Accuracy
  • L2 norm difference (weight perturbation)

Datasets

  • HellaSwag
  • MMLU (three subsets: Philosophy, Management, Astronomy)
  • ARC Challenge
  • WinoGrande
  • PIQA
  • OpenBookQA
  • BoolQ