Cut Qwen2-Audio translation models by ~40–50% storage while keeping ~97–100% quality

May 26, 2025 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.4

Cost Impact Score

0.7

Citation Count

0

Authors

Yasmin Moslem

Links

Abstract / PDF

Why It Matters For Business

You can cut model storage by roughly half while keeping near-teacher translation quality, which lowers hosting cost and enables deployment on constrained hardware.

Summary TLDR

This paper compresses Qwen2-Audio-7B-Instruct for English→German and English→Chinese speech translation using a pipeline of full fine-tuning, iterative decoder-only layer pruning, 4-bit quantization with QLoRA, and sequence-level knowledge distillation. QLoRA + distillation reduced parameters and storage by >40% and yielded the best scores. A pruning-first pipeline reached ~50% reduction while retaining 97% (Chinese) to 100% (German) of teacher quality on the in-domain ACL 60/60 test split. Code and data links are provided.

Problem Statement

Large audio-language models work well for speech translation but are too large for resource-limited deployment. The paper asks how far model size and storage can be cut while keeping English→German and English→Chinese translation quality usable.

Main Contribution

A practical pipeline combining full fine-tuning, iterative decoder-only layer pruning, 4-bit quantization with QLoRA, and sequence-level knowledge distillation.

Empirical results on Qwen2-Audio-7B-Instruct showing >40% storage/parameter reduction with QLoRA+KD and up to 50% reduction with pruning+QLoRA while retaining most translation quality.

A set of ablations: decoder-only vs encoder+decoder pruning, iterative vs middle-layer pruning, effect of out-of-domain data size, and immediate vs post-pruning fine-tuning.

Key Findings

QLoRA (4-bit) + sequence-level knowledge distillation reduced model size by >40% while improving translation quality over the teacher baseline.

Numbers: Params 8.40B→4.95B; Storage 16.79GB→9.64GB; EN-DE BLEU 39.28→43.25 (Table 1)

Iterative decoder-only layer pruning + QLoRA achieved ~50% compression while keeping translation quality near the teacher.

Numbers: Params 8.40B→4.12B; Storage 16.79GB→8.65GB; quality retained ≈100% (EN-DE) and ≈97% (EN-ZH)
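The ">40%" and "~50%" figures can be checked directly from the parameter and storage numbers quoted above. The short sketch below does only that arithmetic; the helper name `reduction_pct` and the dictionary layout are illustrative, not taken from the paper.

```python
def reduction_pct(before: float, after: float) -> float:
    """Percentage reduction going from `before` to `after`."""
    return 100.0 * (before - after) / before


# (parameters in billions, storage in GB), copied from the Numbers lines above.
BASELINE = (8.40, 16.79)
CONFIGS = {
    "QLoRA + KD": (4.95, 9.64),
    "pruning + QLoRA": (4.12, 8.65),
}

for name, (params, storage) in CONFIGS.items():
    print(
        f"{name}: params -{reduction_pct(BASELINE[0], params):.1f}%, "
        f"storage -{reduction_pct(BASELINE[1], storage):.1f}%"
    )
```

With these inputs, QLoRA + KD comes out at about 41% fewer parameters and 43% less storage, and pruning + QLoRA at about 51% and 48%, which is consistent with the rounded claims.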

Decoder-only pruning beats pruning encoder+decoder for this task.

Numbers: EN-DE BLEU 30.81 (decoder-only) vs 26.44 (encoder+decoder); params 6.78B vs 6.62B (Table 3)

Iterative pruning guided by chrF/chrF++ outperforms middle-layer pruning and COMET-guided pruning on this in-domain dataset.

Numbers: EN-ZH BLEU 1.3 (middle-layer) vs 42.52 (iterative) for an 8-layer prune (Table 5)
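To make "iterative, performance-guided pruning" concrete, here is a minimal sketch of one plausible reading: in each round, try removing every remaining decoder layer, score each candidate, and keep the one that scores best. This summary does not spell out the paper's exact selection rule, so the greedy loop, the function name `iterative_prune`, and the `score` callable are all assumptions. In practice `score` would run the pruned model on a dev set and return chrF++; here a toy scoring function stands in for it.

```python
from typing import Callable, List, Sequence


def iterative_prune(
    layers: Sequence[int],
    score: Callable[[List[int]], float],
    n_remove: int,
) -> List[int]:
    """Greedily drop `n_remove` layers, one per round.

    Each round scores every candidate that has one more layer removed
    and keeps the candidate with the highest score.
    """
    if not 0 <= n_remove < len(layers):
        raise ValueError("n_remove must be in [0, len(layers))")
    kept = list(layers)
    for _ in range(n_remove):
        candidates = [kept[:i] + kept[i + 1:] for i in range(len(kept))]
        kept = max(candidates, key=score)
    return kept


# Toy stand-in for "chrF++ of the model that keeps these layers".
# The importance values are made up for illustration only.
importance = {0: 5.0, 1: 0.2, 2: 0.1, 3: 3.0, 4: 0.4, 5: 4.0}


def toy_score(kept: List[int]) -> float:
    return sum(importance[i] for i in kept)


print(iterative_prune(range(6), toy_score, n_remove=3))  # keeps layers 0, 3, 5
```

Middle-layer pruning, by contrast, would remove a fixed contiguous block without consulting any score, which is the variant the Numbers line above reports as collapsing on EN-ZH.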

Pruning more than ~12–16 decoder layers degrades performance sharply even after fine-tuning.

Numbers: Pruning 16 layers drops EN-DE BLEU to 0.06 before recovery; after recovery it is still worse than shallower pruning (Table 6)

Results

LoRA

Value: 43.25

Baseline: Full FT 39.28

LoRA

Value: 59.60

Baseline: Full FT 58.54

LoRA

Value: Params 8.40B→4.95B; Storage 16.79GB→9.64GB

Baseline: Qwen2-Audio-7B

LoRA

Value: Params 8.40B→4.12B; Storage 16.79GB→8.65GB

Baseline: Qwen2-Audio-7B

Quality retention after pruning+recovery

Value: EN-DE ≈100%; EN-ZH ≈97%

Baseline: Fully fine-tuned teacher

Inference speed effect of pruning

Value: ≈20% faster (8-layer prune); ≈40% faster (16-layer prune)

Baseline: Unpruned baseline after FT

What To Try In 7 Days

Full fine-tune Qwen2-Audio-7B-Instruct on your in-domain data for a few epochs.

Create teacher translations (sequence-level KD) and augment your training data with them.

Apply 4-bit QLoRA (nf4, double_quant) with LoRA rank=64, alpha=128, and rsLoRA. This cuts storage quickly and is cheap to test on a single GPU node.
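The QLoRA settings named in the last step map onto the Hugging Face `transformers` and `peft` config classes roughly as follows. This is a sketch under assumptions: only the values quoted above (nf4, double quantization, rank 64, alpha 128, rsLoRA) come from the source, while the dictionary names and the `build_configs` helper are illustrative. The imports are deferred so the settings can be inspected without the GPU libraries installed.

```python
# Values taken from the step above; everything else is left at library defaults.
QUANT_SETTINGS = {
    "load_in_4bit": True,
    "bnb_4bit_quant_type": "nf4",        # 4-bit NormalFloat
    "bnb_4bit_use_double_quant": True,   # also quantize the quantization constants
}

LORA_SETTINGS = {
    "r": 64,
    "lora_alpha": 128,
    # rsLoRA scales adapters by lora_alpha / sqrt(r) instead of lora_alpha / r.
    "use_rslora": True,
    # target_modules is not stated in this summary; set it for your model.
}


def build_configs():
    """Build the quantization and adapter configs.

    Requires `transformers`, `peft`, and `bitsandbytes` to be installed.
    """
    from transformers import BitsAndBytesConfig
    from peft import LoraConfig

    return BitsAndBytesConfig(**QUANT_SETTINGS), LoraConfig(**LORA_SETTINGS)
```

The first returned object would typically be passed as `quantization_config` when loading the model with `from_pretrained`, and the second to `peft.get_peft_model`.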

Optimization Features

Infra Optimization

  • LoRA

Model Optimization

  • iterative decoder-only layer pruning (performance-guided)
  • 4-bit quantization (nf4,double_quant) via BitsAndBytes

System Optimization

  • LoRA
  • oversample KD data to bias student toward teacher outputs

Training Optimization

  • full-parameter fine-tuning before compression
  • sequence-level knowledge distillation (teacher→student)
  • LoRA
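Sequence-level knowledge distillation with oversampling can be sketched as a data-preparation step: the teacher translates each training utterance, and those teacher-labelled examples are added to the training set several times so the student sees them more often than the human references. The function name, the `audio`/`target` field names, the `teacher_translate` callable, and the `kd_copies` factor are all hypothetical; this summary does not give the oversampling ratio.

```python
from typing import Callable, Dict, List

Example = Dict[str, str]


def build_kd_training_set(
    examples: List[Example],
    teacher_translate: Callable[[str], str],
    kd_copies: int = 2,
) -> List[Example]:
    """Return the original examples plus `kd_copies` teacher-labelled copies of each."""
    out = list(examples)
    for ex in examples:
        kd_example = {
            "audio": ex["audio"],
            "target": teacher_translate(ex["audio"]),
        }
        out.extend([kd_example] * kd_copies)
    return out


# Toy usage with a stand-in teacher; a real teacher would be the fully
# fine-tuned model generating a translation for the audio file.
data = [
    {"audio": "utt1.wav", "target": "reference translation 1"},
    {"audio": "utt2.wav", "target": "reference translation 2"},
]
augmented = build_kd_training_set(data, lambda path: f"teacher output for {path}")
print(len(augmented))  # 2 originals + 2 * 2 teacher-labelled copies
```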

Inference Optimization

  • pruning reduces depth for faster inference (≈20–40% measured)
  • note: 4-bit quantization reduces storage but can reduce inference speed

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Experiments use a small in-domain corpus (784 training utterances); results may not generalize to large or very different domains.
  • Work is limited to the Qwen2-Audio family; behavior on other audio-language models is untested.
  • 4-bit quantization reduces storage but can hurt inference speed and requires careful BitsAndBytes configuration.
  • Pruning heavily (≥16 decoder layers) causes large quality drops even after fine-tuning.

When Not To Use

  • When your priority is lowest possible latency and quantization overhead would slow inference.
  • When you lack a fully fine-tuned teacher model for sequence-level KD.
  • When you need proven performance on datasets far from ACL 60/60 without further validation.

Failure Modes

  • Over-pruning: removing too many decoder layers (≥16) collapses performance.
  • Fine-tuning immediately after each pruning step can overfit small in-domain data and may not outperform a single post-pruning fine-tune.
  • The metric used to guide pruning (COMET vs chrF) can misjudge layer importance and produce poor pruned models.

Core Entities

Models

  • Qwen2-Audio-7B-Instruct
  • Qwen2-Audio-7B
  • LoRA
  • BitsAndBytes

Metrics

  • BLEU
  • chrF
  • chrF++
  • COMET

Datasets

  • ACL 60/60
  • CoVoST2