Cut Qwen2-Audio translation models by ~40–50% storage while keeping ~97–100% quality

May 26, 2025 · 7 min

Overview

Decision Snapshot: Ready For Pilot

The pipeline uses established techniques (pruning, LoRA, QLoRA, KD) and shows clear storage and quality trade-offs on a constrained in-domain dataset; broader generalization needs extra testing.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 6/6

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 40%

Authors

Yasmin Moslem

Links

Abstract / PDF / Code / Data

Why It Matters For Business

You can cut model storage by roughly half while keeping near-teacher translation quality, which lowers hosting cost and enables deployment on constrained hardware.

Who Should Care

Summary TLDR

This paper compresses Qwen2-Audio-7B-Instruct for English→German and English→Chinese speech translation using a pipeline of full fine-tuning, iterative decoder-only layer pruning, 4-bit quantization with QLoRA, and sequence-level knowledge distillation. QLoRA + distillation reduced parameters and storage by >40% and yielded the best scores. A pruning-first pipeline reached ~50% reduction while retaining 97% (Chinese) to 100% (German) of teacher quality on the in-domain ACL 60/60 test split. Code and data links are provided.

Problem Statement

Large audio-language models work well for speech translation but are too big for resource-limited deployment. The paper asks how far model size and storage can be cut while keeping translation quality usable for English→German and English→Chinese.

Main Contribution

A practical pipeline combining full fine-tuning, iterative decoder-only layer pruning, 4-bit quantization with QLoRA, and sequence-level knowledge distillation.

Empirical results on Qwen2-Audio-7B-Instruct showing >40% storage/parameter reduction with QLoRA+KD and up to 50% reduction with pruning+QLoRA while retaining most translation quality.

Key Findings

QLoRA (4-bit) + sequence-level knowledge distillation reduced model size by >40% while improving translation quality over the teacher baseline.

Numbers: Params 8.40B → 4.95B; Storage 16.79GB → 9.64GB; EN-DE BLEU 39.28 → 43.25 (Table 1)

Practical Use: If you need smaller storage and strong quality, fine-tune Qwen2-Audio, quantize to 4-bit (nf4, double_quant), and apply QLoRA plus teacher-generated KD data.

Evidence Ref: Table 1; Section 3.2

Iterative decoder-only layer pruning + QLoRA achieved ~50% compression while keeping translation quality near the teacher.

Numbers: Params 8.40B → 4.12B; Storage 16.79GB → 8.65GB; quality retained ≈100% (EN-DE) and ≈97% (EN-ZH)

Practical Use: To maximize compression, prune decoder layers iteratively, then recover with teacher KD and 4-bit QLoRA fine-tuning.

Evidence Ref: Table 2; Section 3.3
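The ">40%" and "~50%" figures can be checked directly from the parameter and storage numbers quoted in the two findings above. The sketch below only redoes that arithmetic; the helper name `reduction_pct` is made up for illustration.

```python
def reduction_pct(before: float, after: float) -> float:
    """Percentage reduction when going from `before` to `after`."""
    return (before - after) / before * 100.0


# Figures quoted in the findings (params in billions, storage in GB).
configs = {
    "QLoRA + KD": {"params": (8.40, 4.95), "storage": (16.79, 9.64)},
    "Pruning + QLoRA": {"params": (8.40, 4.12), "storage": (16.79, 8.65)},
}

for name, figures in configs.items():
    for kind, (before, after) in figures.items():
        pct = reduction_pct(before, after)
        print(f"{name:16s} {kind:8s} {pct:5.1f}% smaller")
```

This works out to roughly 41% (params) and 43% (storage) for QLoRA + KD, and roughly 51% (params) and 48% (storage) for pruning + QLoRA, which matches the rounded claims.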

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
LoRA | 43.25 | Full FT 39.28 | +3.97 | ACL 60/60 test (EN-DE) | QLoRA + knowledge distillation outperforms full fine-tuning on EN-DE | Table 1
LoRA | 59.60 | Full FT 58.54 | +1.06 | ACL 60/60 test (EN-ZH) | QLoRA + knowledge distillation slightly improves over full fine-tuning on EN-ZH | Table 1

What To Try In 7 Days

Full fine-tune Qwen2-Audio-7B-Instruct on your in-domain data for a few epochs.

Create teacher translations (sequence-level KD) and augment your training data with them.

Apply 4-bit QLoRA (nf4, double_quant) with LoRA rank=64, alpha=128, and rsLoRA to reduce storage quickly; this is cheap to test on a single GPU node.
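The hyperparameters in the last step can be written down as a small config sketch. The keyword names below mirror arguments of `transformers.BitsAndBytesConfig` and `peft.LoraConfig`; settings the summary does not state (compute dtype, target modules, dropout) are deliberately left out rather than guessed. The helper `lora_scaling` is illustrative and shows why rsLoRA matters at rank 64: it scales the adapter update by alpha / sqrt(r) instead of alpha / r.

```python
import math

# 4-bit quantization settings (nf4 + double quantization), as named in the text.
quant_kwargs = {
    "load_in_4bit": True,
    "bnb_4bit_quant_type": "nf4",
    "bnb_4bit_use_double_quant": True,
}

# LoRA adapter settings quoted in the text: rank 64, alpha 128, rsLoRA enabled.
lora_kwargs = {
    "r": 64,
    "lora_alpha": 128,
    "use_rslora": True,
}


def lora_scaling(r: int, alpha: float, use_rslora: bool) -> float:
    """Scaling applied to the low-rank update: alpha/sqrt(r) with rsLoRA, else alpha/r."""
    return alpha / math.sqrt(r) if use_rslora else alpha / r


print(lora_scaling(64, 128, use_rslora=True))   # rank-stabilized scaling
print(lora_scaling(64, 128, use_rslora=False))  # standard LoRA scaling
```

With both libraries installed, these dictionaries could be passed as `BitsAndBytesConfig(**quant_kwargs)` and `LoraConfig(**lora_kwargs)`; any further arguments would need to come from the paper or its code release.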

Optimization Features

Infra Optimization: LoRA
Model Optimization: iterative decoder-only layer pruning (performance-guided); 4-bit quantization (nf4, double_quant) via BitsAndBytes
System Optimization: LoRA; oversample KD data to bias student toward teacher outputs
Training Optimization: full-parameter fine-tuning before compression; sequence-level knowledge distillation (teacher→student); LoRA
Inference Optimization: pruning reduces depth for faster inference (≈20–40% measured). Note: 4-bit quantization reduces storage but can reduce inference speed.
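The sequence-level KD and oversampling items listed above can be sketched as a data-preparation step: the teacher translates each training utterance, and those teacher outputs are added to the training set more than once. This is a minimal sketch, not the paper's code. The function name, the `audio`/`target` field names, and the oversampling factor of 2 are all assumptions; the summary does not state the actual factor.

```python
from typing import Callable, Dict, List

Example = Dict[str, str]


def build_kd_training_set(
    examples: List[Example],
    teacher_translate: Callable[[str], str],
    kd_oversample: int = 2,  # hypothetical factor, not taken from the paper
) -> List[Example]:
    """Combine reference targets with oversampled teacher-generated targets."""
    # Keep the original (audio, reference translation) pairs.
    gold = [{**ex, "source": "reference"} for ex in examples]
    # Sequence-level KD: the teacher's full output becomes the training target.
    distilled = [
        {"audio": ex["audio"], "target": teacher_translate(ex["audio"]), "source": "teacher"}
        for ex in examples
    ]
    # Repeating the distilled pairs biases the student toward teacher outputs.
    return gold + distilled * kd_oversample


if __name__ == "__main__":
    data = [{"audio": "utt1.wav", "target": "Hallo Welt"}]
    # Stub teacher for illustration; a real teacher would be the fine-tuned model.
    mixed = build_kd_training_set(data, lambda path: f"<teacher output for {path}>")
    print(len(mixed), [ex["source"] for ex in mixed])
```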

Risks & Boundaries

Limitations

Experiments use a small in-domain corpus (784 training utterances); results may not generalize to large or very different domains.

Work is limited to the Qwen2-Audio family; behavior on other audio-language models is untested.

When Not To Use

When your priority is the lowest possible latency, since 4-bit quantization overhead can slow inference.

When you lack a fully fine-tuned teacher model for sequence-level KD.

Failure Modes

Over-pruning: removing too many decoder layers (≥16) collapses performance.

Fine-tuning immediately after each pruning step can overfit the small in-domain data and may not outperform a single fine-tuning pass after pruning is complete.
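One way to guard against the over-pruning failure mode is to make each removal conditional on a quality floor. The sketch below is a greedy, performance-guided loop under stated assumptions: the summary does not describe the paper's exact selection rule, so the greedy choice, the `min_score` threshold, and the toy scoring function are all hypothetical. In practice `evaluate` would score the pruned model on a dev set (for example with chrF++ or COMET).

```python
from typing import Callable, List, Sequence, Tuple


def prune_layers_iteratively(
    layers: Sequence[int],
    evaluate: Callable[[List[int]], float],
    max_removed: int,
    min_score: float,
) -> Tuple[List[int], List[int]]:
    """Greedily drop the layer whose removal hurts the score least, one at a time."""
    kept = list(layers)
    removed: List[int] = []
    while len(removed) < max_removed and len(kept) > 1:
        best_idx, best_score = -1, float("-inf")
        # Score every candidate model with exactly one more layer removed.
        for i in range(len(kept)):
            score = evaluate(kept[:i] + kept[i + 1:])
            if score > best_score:
                best_idx, best_score = i, score
        # Stop before quality falls below the floor (guards against over-pruning).
        if best_score < min_score:
            break
        removed.append(kept.pop(best_idx))
    return kept, removed


# Toy stand-in for dev-set evaluation: each layer contributes a fixed amount.
importance = {0: 5.0, 1: 0.5, 2: 4.0, 3: 1.0, 4: 3.0, 5: 2.0}


def toy_score(kept_layers: List[int]) -> float:
    return sum(importance[layer] for layer in kept_layers)


kept, removed = prune_layers_iteratively(range(6), toy_score, max_removed=3, min_score=12.0)
print("kept:", kept, "removed:", removed)
```

Evaluating every candidate at every step is expensive for a real model, so a practical run would likely restrict the candidates or prune several layers per step; that trade-off is outside what the summary specifies.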

Core Entities

Models

Qwen2-Audio-7B-Instruct, Qwen2-Audio-7B, LoRA, BitsAndBytes

Metrics

BLEU, chrF, chrF++, COMET

Datasets

ACL 60/60, CoVoST2