Learned adapter pruning replaces grid search for cross-lingual LoRA merging

January 10, 2026 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

0

Authors

Besher Hassan, Xiuying Chen

Links

Abstract / PDF

Why It Matters For Business

GRASP LoRA learns the prune ratio online, which cuts the number of tuning runs and the amount of labeled dev data required. This lowers compute and development cost while often improving quality on low-resource language transfer.

Summary TLDR

GRASP LoRA learns a single global prune ratio for merged LoRA adapters using a lightweight GRPO controller that probes candidate sparsities on a tiny micro dev slice. This replaces expensive grid search with one controller pass plus one final fine-tune. On English→Arabic/Chinese transfer (XL-Sum summarization and MLQA QA) it finds fractional prune rates, improves generation and QA metrics over strong baselines, and cuts end-to-end runtime by 3.90×–7.45×. It is robust to very small micro dev slices but was validated on only one backbone and two tasks.

Problem Statement

When merging LoRA adapters for cross-lingual transfer, practitioners typically pick a global prune ratio by grid search. Grid search needs many full training runs and large dev sets, misses fractional optima that fall between grid points, and can be brittle. What is needed is a cheap, stable way to learn the right overall sparsity during training.

Main Contribution

GRASP LoRA: treat the global prune ratio as a learnable control variable and optimize it online with a GRPO controller.

A training pipeline that interleaves controller probing on a tiny micro dev slice with normal fine tuning, then performs one final prune+fine-tune at the chosen ratio.

Empirical gains: improved XL-Sum and MLQA metrics on Arabic and Chinese and a 3.90×–7.45× reduction in total runtime versus an 8-point grid search.
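
The controller idea above can be illustrated with a minimal sketch. It assumes a Gaussian policy over the prune ratio with a fixed standard deviation and a plain policy-gradient step on group-normalized rewards. The summary does not specify the controller's parameterization, so `group_relative_advantages`, `update_mean`, `sigma`, and `lr` are hypothetical names and settings, and the entropy bonus and anchoring terms mentioned elsewhere are omitted.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of probes (GRPO-style baseline)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def update_mean(mu, candidates, rewards, sigma=0.1, lr=0.01):
    """One policy-gradient step on the mean of an ASSUMED Gaussian policy."""
    advantages = group_relative_advantages(rewards)
    # Gradient of log N(p | mu, sigma^2) with respect to mu is (p - mu) / sigma^2.
    grads = [a * (p - mu) / sigma**2 for a, p in zip(advantages, candidates)]
    # Clamp so the result stays a valid fraction of pruned adapter entries.
    return min(1.0, max(0.0, mu + lr * statistics.fmean(grads)))


# Hypothetical probe: three candidate prune ratios and their micro dev scores.
candidates = [0.5, 0.6, 0.7]
rewards = [0.1, 0.2, 0.9]
new_mu = update_mean(0.6, candidates, rewards)
```

In this toy probe the highest reward sits at the largest candidate, so the mean moves above 0.6. Because the advantages are normalized within the group, only the relative ranking of probes matters, not the absolute scale of the reward.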

Key Findings

GRASP LoRA improves summarization metrics over the best grid-search baseline on XL-Sum.

Numbers: Arabic: +0.88 BERT-F1, +1.75 BLEU-4, +2.13 ROUGE-L; Chinese: +1.62 BERT-F1, +1.73 BLEU-4, +1.45 ROUGE-L

GRASP LoRA improves extractive QA over the best grid-search baseline on MLQA.

Numbers: Arabic: +0.56 BERT-F1, +2.67 EM, +2.22 token F1; Chinese: +1.98 BERT-F1, +1.50 EM, +0.67 token F1

One controller run plus one final run cuts end-to-end runtime versus an 8-point grid search.

Numbers: Speedups of 3.90× to 7.45× across tasks and languages

The micro dev slice can be very small without destabilizing results.

Numbers: Learned p⋆ stays within 64%–66%; F1 and ROUGE-L vary by ≤0.3 and ≤1.2 points, respectively, for m ∈ {4, 8, 16, 32}

Controller regularizers stabilize pruning and avoid harmful over-pruning.

Numbers: Removing the entropy bonus shifts p⋆ to ≈78–79% and reduces F1 and ROUGE-L (e.g., F1 75.32 → 73.81)

Results

XL-Sum Arabic BERTScore-F1

Value: 75.84 ± 0.13

Baseline: best grid-searched merge+prune, 74.96 ± 0.25

XL-Sum Chinese BERTScore-F1

Value: 33.62 ± 0.16

Baseline: best grid-searched merge+prune, 32.00 ± 0.36

MLQA Arabic Exact Match (EM)

Value: 41.00 ± 2.08

Baseline: best grid-searched merge+prune, 38.33 ± 0.58

MLQA Chinese BERTScore-F1

Value: 71.28 ± 0.29

Baseline: best grid-searched merge+prune, 69.30 ± 1.37

Who Should Care

What To Try In 7 Days

Run GRASP LoRA with a 16-example micro dev slice and compare end-to-end runtime vs your current grid search.

Apply GRASP LoRA to one existing English→target adapter pair and measure target dev metrics after the single final prune+fine-tune.

Ablate entropy/anchor settings to find a stable commit schedule for your data.
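
For the first experiment above, a fixed micro dev slice could be drawn as follows. This is a minimal sketch: `sample_micro_dev` and its `seed` parameter are hypothetical helpers, not part of any GRASP LoRA release, and the summary does not say how the authors select their slice.

```python
import random


def sample_micro_dev(dev_examples, m=16, seed=0):
    """Draw a fixed micro dev slice of m examples without replacement."""
    examples = list(dev_examples)
    if m > len(examples):
        raise ValueError("micro dev size exceeds the dev set size")
    # A local RNG keeps the slice reproducible without touching global state.
    rng = random.Random(seed)
    return rng.sample(examples, m)


# Placeholder dev set of 100 example IDs; substitute your real dev examples.
micro_dev = sample_micro_dev(range(100), m=16, seed=0)
```

Fixing the seed keeps the slice identical across controller runs, so differences in the learned ratio reflect the settings being compared rather than a different sample.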

Optimization Features

Infra Optimization

  • runtime savings 3.9×–7.45× on a single A100 setup

Model Optimization

  • adapter-level magnitude pruning
  • importance-based per-tensor top-k masking
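
As an illustration of per-tensor top-k masking, the sketch below keeps the largest-magnitude entries of a single flattened tensor. It assumes absolute value as the importance score; the summary says only "importance-based", so the scoring rule and the names `topk_magnitude_mask` and `apply_mask` are assumptions.

```python
def topk_magnitude_mask(weights, prune_ratio):
    """Return a 0/1 mask that drops the smallest-magnitude entries."""
    if not 0.0 <= prune_ratio <= 1.0:
        raise ValueError("prune_ratio must lie in [0, 1]")
    n_prune = int(prune_ratio * len(weights))  # round down to whole entries
    # Indices sorted by ascending |w|; the first n_prune are pruned.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(order[:n_prune])
    return [0 if i in pruned else 1 for i in range(len(weights))]


def apply_mask(weights, mask):
    """Zero out the masked entries of one flattened adapter tensor."""
    return [w * m for w, m in zip(weights, mask)]


# Toy tensor: half of the entries are pruned at prune_ratio=0.5.
weights = [0.1, -0.5, 0.3, -0.05]
mask = topk_magnitude_mask(weights, prune_ratio=0.5)
sparse = apply_mask(weights, mask)
```

Under a single global ratio, the same `prune_ratio` would be passed to every adapter tensor, with the mask computed separately per tensor.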

System Optimization

  • replaces N grid-search full runs with one controller pass plus one final run
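
The run-count argument can be turned into back-of-the-envelope arithmetic. The sketch below assumes every grid point costs one equal-length full run, which the summary does not state; `grid_speedup` and `implied_cost` are hypothetical helper names.

```python
def grid_speedup(n_grid, controller_cost, final_cost, full_run_cost=1.0):
    """Speedup of controller pass + final run over n_grid full runs."""
    return n_grid * full_run_cost / (controller_cost + final_cost)


def implied_cost(n_grid, speedup, full_run_cost=1.0):
    """Combined controller + final cost implied by a reported speedup."""
    return n_grid * full_run_cost / speedup


# Reported speedups versus an 8-point grid, in units of one full run.
slowest_case = implied_cost(8, 3.90)
fastest_case = implied_cost(8, 7.45)
```

Under the equal-cost assumption, the reported 3.90×–7.45× range corresponds to a combined controller-plus-final cost of roughly 2.05 down to 1.07 full-run equivalents.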

Training Optimization

  • learned global prune ratio optimized online
  • controller probes without gradient updates during evaluation

Inference Optimization

  • single final sparse adapter reduces served parameter count (implied)

Reproducibility

Code Urls

  • github: GRASP LoRA (paper indicates a GitHub repository reference)

Data Urls

  • XL-Sum (public)
  • MLQA (public)

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Single backbone (Llama 3 8B) and one hardware setup only.
  • Only two tasks (XL-Sum, MLQA) and two target languages (Arabic, Chinese).
  • Micro dev rewards depend on a tiny fixed slice; out-of-distribution micro devs may mislead the controller.
  • No deployment metrics reported (latency, memory, energy).
  • Dialectal and broader language family behaviors untested.

When Not To Use

  • You can afford a full grid search and want explicit control over discrete pruning points.
  • You need layer-wise or per-module pruning policies (GRASP learns a single global ratio).
  • You must certify worst-case behavior under extreme distribution shift.

Failure Modes

  • Over-pruning if the entropy bonus or anchoring is disabled (the controller collapses to a high p).
  • Mask thrashing if the max-commit ∆max or the commit rules are misconfigured.
  • An unrepresentative micro dev slice leads to a poorly chosen p and degraded target performance.
  • Optimizer state clearing for newly pruned entries may destabilize training if commits are frequent.
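
One plausible reading of a max-commit bound is a clamp on how far the committed prune ratio may move per commit, which limits mask thrashing. The summary names ∆max but does not give the rule, so the function `commit_prune_ratio` and the default `delta_max=0.05` below are assumptions for illustration only.

```python
def commit_prune_ratio(current, proposed, delta_max=0.05):
    """Move toward the proposed ratio by at most delta_max per commit."""
    if delta_max < 0:
        raise ValueError("delta_max must be non-negative")
    # Clamp the step so a single commit cannot change the mask drastically.
    step = max(-delta_max, min(delta_max, proposed - current))
    # Keep the committed ratio a valid fraction.
    return min(1.0, max(0.0, current + step))


# A large proposed jump is applied in bounded steps instead of all at once.
ratio = 0.50
for _ in range(3):
    ratio = commit_prune_ratio(ratio, proposed=0.80, delta_max=0.05)
```

A small `delta_max` trades slower convergence for fewer newly pruned entries per commit, which also limits how much optimizer state has to be cleared at once.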

Core Entities

Models

  • Llama 3 8B
  • LoRA

Metrics

  • BERTScore-F1
  • BLEU-4
  • ROUGE-L
  • Exact Match
  • Token F1
  • ROUGE-1
  • ROUGE-2
  • chrF

Datasets

  • XL-Sum
  • MLQA

Context Entities

Models

  • SparseGPT
  • LLM-Pruner
  • LoRA
  • AMC
  • HAQ

Datasets

  • mBART/mT5-style multilingual pretraining (cited)