Overview
Production Readiness
0.6
Novelty Score
0.4
Cost Impact Score
0.7
Citation Count
0
Why It Matters For Business
You can cut model storage by roughly half while keeping near-teacher translation quality, which lowers hosting cost and enables deployment on constrained hardware.
Summary TLDR
This paper compresses Qwen2-Audio-7B-Instruct for English→German and English→Chinese speech translation using a pipeline of full fine-tuning, iterative decoder-only layer pruning, 4-bit quantization with QLoRA, and sequence-level knowledge distillation. QLoRA + distillation reduced parameters and storage by >40% and yielded the best scores. A pruning-first pipeline reached ~50% reduction while retaining 97% (Chinese) to 100% (German) of teacher quality on the in-domain ACL 60/60 test split. Code and data links are provided.
Problem Statement
Large audio-language models work well for speech translation but are too big for resource-limited deployment. The paper asks: how to cut model size and storage while keeping translation quality usable for English→German and English→Chinese?
Main Contribution
A practical pipeline combining full fine-tuning, iterative decoder-only layer pruning, 4-bit quantization with QLoRA, and sequence-level knowledge distillation.
Empirical results on Qwen2-Audio-7B-Instruct showing >40% storage/parameter reduction with QLoRA+KD and up to 50% reduction with pruning+QLoRA while retaining most translation quality.
A set of ablations: decoder-only vs encoder+decoder pruning, iterative vs middle-layer pruning, effect of out-of-domain data size, and immediate vs post-pruning fine-tuning.
Key Findings
QLoRA (4-bit) + sequence-level knowledge distillation reduced model size by >40% while improving translation quality over the teacher baseline.
Iterative decoder-only layer pruning + QLoRA achieved ~50% compression while keeping translation quality near the teacher.
Decoder-only pruning beats pruning encoder+decoder for this task.
Iterative pruning guided by chrF/chrF++ outperforms middle-layer pruning and COMET-guided pruning on this in-domain dataset.
Pruning more than ~12–16 decoder layers degrades performance sharply even after fine-tuning.
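The iterative, performance-guided pruning described in the findings above can be sketched as a greedy loop. This is a minimal illustration and not the paper's implementation: it assumes that each round removes the single layer whose removal lowers a dev-set score the least. The function `iterative_layer_pruning`, the `IMPORTANCE` table, and `toy_score` are hypothetical stand-ins; in practice `score_fn` would run the pruned model on a dev set and return chrF or chrF++.

```python
from typing import Callable, List, Sequence, Tuple


def iterative_layer_pruning(
    layers: Sequence[int],
    score_fn: Callable[[Tuple[int, ...]], float],
    num_to_prune: int,
) -> Tuple[List[int], List[int]]:
    """Greedily drop the layer whose removal hurts the score least."""
    kept = list(layers)
    removed: List[int] = []
    for _ in range(num_to_prune):
        if len(kept) <= 1:
            break  # never prune the last remaining layer
        # Score every candidate that has exactly one more layer removed.
        candidates = [
            (score_fn(tuple(l for l in kept if l != layer)), layer)
            for layer in kept
        ]
        _, best_layer = max(candidates)  # highest score after removal
        kept.remove(best_layer)
        removed.append(best_layer)
    return kept, removed


# Hypothetical per-layer importance, standing in for a real chrF evaluation.
IMPORTANCE = {0: 9.0, 1: 1.0, 2: 0.5, 3: 2.0, 4: 8.0, 5: 7.0}


def toy_score(kept_layers: Tuple[int, ...]) -> float:
    # Assumption for the demo only: the score is the sum of kept importances.
    return sum(IMPORTANCE[l] for l in kept_layers)


kept, removed = iterative_layer_pruning(range(6), toy_score, num_to_prune=3)
print("kept:", kept, "removed:", removed)
```

Each round costs one evaluation per remaining layer, so a real run would likely use a small dev subset. A recovery fine-tune would follow once the target depth is reached.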
Results
LoRA
Quality retention after pruning+recovery
Inference speed effect of pruning
Who Should Care
What To Try In 7 Days
Full fine-tune Qwen2-Audio-7B-Instruct on your in-domain data for a few epochs.
Create teacher translations (sequence-level KD) and augment your training data with them.
Apply 4-bit QLoRA (nf4, double_quant) with LoRA rank=64, alpha=128, and rsLoRA to cut storage quickly; the setup is cheap to test on a single GPU node.
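The QLoRA settings in the last step could be expressed with the `transformers` and `peft` configuration classes roughly as follows. This is a configuration sketch under stated assumptions, not the paper's training script. The quantization type, double quantization, rank, alpha, and rsLoRA come from the summary above; the compute dtype, dropout, and target modules are not given there and are placeholders.

```python
import torch
from peft import LoraConfig
from transformers import BitsAndBytesConfig

# 4-bit NF4 quantization with double quantization (values from the summary).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # assumption: not stated in the summary
)

# LoRA adapter settings: rank 64, alpha 128, rank-stabilized scaling (rsLoRA).
lora_config = LoraConfig(
    r=64,
    lora_alpha=128,
    use_rslora=True,
    lora_dropout=0.05,  # assumption: not stated in the summary
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption
)
```

In a typical setup, `bnb_config` is passed as `quantization_config` when loading the model with `from_pretrained`, and `lora_config` is applied with `peft.get_peft_model`.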
Optimization Features
Infra Optimization
- LoRA
Model Optimization
- iterative decoder-only layer pruning (performance-guided)
- 4-bit quantization (nf4,double_quant) via BitsAndBytes
System Optimization
- LoRA
- oversample KD data to bias student toward teacher outputs
Training Optimization
- full-parameter fine-tuning before compression
- sequence-level knowledge distillation (teacher→student)
- LoRA
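The sequence-level distillation and KD oversampling listed above amount to a data-preparation step: the teacher translates the training audio, and those outputs are mixed with the gold references, repeated so the student sees them more often. The sketch below is a minimal illustration. The function `build_kd_training_set`, the record fields, and the oversampling factor are hypothetical, and `teacher_translate` stands in for decoding with the fine-tuned teacher.

```python
import random
from typing import Callable, Dict, List


def build_kd_training_set(
    gold_pairs: List[Dict[str, str]],
    teacher_translate: Callable[[str], str],
    kd_oversample: int = 2,
    seed: int = 0,
) -> List[Dict[str, str]]:
    """Mix gold references with teacher outputs, repeating the latter."""
    gold = [{**ex, "source": "gold"} for ex in gold_pairs]
    # Sequence-level KD: the teacher's full output becomes the training target.
    kd = [
        {"audio": ex["audio"], "target": teacher_translate(ex["audio"]), "source": "teacher"}
        for ex in gold_pairs
    ]
    mixed = gold + kd * kd_oversample  # oversample teacher targets
    random.Random(seed).shuffle(mixed)  # deterministic shuffle for reproducibility
    return mixed


# Toy data and a fake teacher, for illustration only.
gold_pairs = [
    {"audio": "utt1.wav", "target": "Hallo Welt"},
    {"audio": "utt2.wav", "target": "Guten Morgen"},
    {"audio": "utt3.wav", "target": "Vielen Dank"},
]
train_set = build_kd_training_set(gold_pairs, lambda path: f"teacher({path})")
print(len(train_set), sum(ex["source"] == "teacher" for ex in train_set))
```

With three gold pairs and an oversampling factor of 2, the mixed set holds nine examples, six of them teacher targets.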
Inference Optimization
- pruning reduces depth for faster inference (≈20–40% measured)
- note: 4-bit quantization reduces storage but can reduce inference speed
Reproducibility
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Experiments use a small in-domain corpus (784 training utterances); results may not generalize to large or very different domains.
- Work is limited to the Qwen2-Audio family; behavior on other audio-language models is untested.
- 4-bit quantization reduces storage but can hurt inference speed and requires careful BitsAndBytes configuration.
- Pruning heavily (≥16 decoder layers) causes large quality drops even after fine-tuning.
When Not To Use
- When your priority is the lowest possible latency, since 4-bit quantization overhead can slow inference.
- When you lack a fully fine-tuned teacher model for sequence-level KD.
- When you need proven performance on datasets far from ACL 60/60 without further validation.
Failure Modes
- Over-pruning: removing too many decoder layers (≥16) collapses performance.
- Fine-tuning immediately after each pruning step can overfit the small in-domain data and does not outperform a single fine-tune after all pruning is done.
- The metric that guides pruning matters: COMET-guided selection can misjudge layer importance and yield worse pruned models than chrF-guided selection.
Core Entities
Models
- Qwen2-Audio-7B-Instruct
- Qwen2-Audio-7B
- LoRA
- BitsAndBytes
Metrics
- BLEU
- chrF
- chrF++
- COMET
Datasets
- ACL 60/60
- CoVoST2

