Cut Qwen2-Audio translation model storage by ~40–50% while keeping ~97–100% of teacher quality

Overview

Decision Snapshot: Ready For Pilot

The pipeline uses established techniques (pruning, LoRA, QLoRA, KD) and shows clear storage and quality trade-offs on a constrained in-domain dataset; broader generalization needs extra testing.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 6/6

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 40%

Authors

Yasmin Moslem

Links

Abstract / PDF / Code / Data

Why It Matters For Business

You can cut model storage by roughly half while keeping near-teacher translation quality, which lowers hosting cost and enables deployment on constrained hardware.

Who Should Care

ML Engineer, Product Manager, CTO

Summary TLDR

This paper compresses Qwen2-Audio-7B-Instruct for English→German and English→Chinese speech translation using a pipeline of full fine-tuning, iterative decoder-only layer pruning, 4-bit quantization with QLoRA, and sequence-level knowledge distillation. QLoRA + distillation reduced parameters and storage by >40% and yielded the best scores. A pruning-first pipeline reached ~50% reduction while retaining 97% (Chinese) to 100% (German) of teacher quality on the in-domain ACL 60/60 test split. Code and data links are provided.

Problem Statement

Large audio-language models work well for speech translation but are too big for resource-limited deployment. The paper asks how far model size and storage can be cut while keeping English→German and English→Chinese translation quality usable.

Main Contribution

A practical pipeline combining full fine-tuning, iterative decoder-only layer pruning, 4-bit quantization with QLoRA, and sequence-level knowledge distillation.

Empirical results on Qwen2-Audio-7B-Instruct showing >40% storage/parameter reduction with QLoRA+KD and up to 50% reduction with pruning+QLoRA while retaining most translation quality.

Key Findings

QLoRA (4-bit) + sequence-level knowledge distillation reduced model size by >40% while improving translation quality over the teacher baseline.

Numbers: Params 8.40B→4.95B; Storage 16.79GB→9.64GB; EN-DE BLEU 39.28→43.25 (Table 1)

Practical Use: If you need smaller storage and strong quality, fine-tune Qwen2-Audio, quantize to 4-bit (nf4, double_quant), and apply QLoRA plus teacher-generated KD data.

Evidence Ref: Table 1; Section 3.2

Iterative decoder-only layer pruning + QLoRA achieved ~50% compression while keeping translation quality near the teacher.

Numbers: Params 8.40B→4.12B; Storage 16.79GB→8.65GB; quality retained ≈100% (EN-DE) and ≈97% (EN-ZH)

Practical Use: To maximize compression, prune decoder layers iteratively, then recover with teacher KD and 4-bit QLoRA fine-tuning.

Evidence Ref: Table 2; Section 3.3
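The compression claims in the two findings can be checked directly from the quoted parameter and storage figures. The following is a minimal sketch of that arithmetic; the helper name `reduction_pct` and the dictionary layout are illustrative choices, not something taken from the paper or its code.

```python
def reduction_pct(before: float, after: float) -> float:
    """Percentage reduction going from `before` to `after`."""
    return (before - after) / before * 100.0


# Figures quoted in the findings (billions of parameters, GB on disk).
teacher = {"params_b": 8.40, "storage_gb": 16.79}
students = {
    "QLoRA + KD": {"params_b": 4.95, "storage_gb": 9.64},
    "Pruning + QLoRA": {"params_b": 4.12, "storage_gb": 8.65},
}

for name, student in students.items():
    params_cut = reduction_pct(teacher["params_b"], student["params_b"])
    storage_cut = reduction_pct(teacher["storage_gb"], student["storage_gb"])
    print(f"{name}: params -{params_cut:.1f}%, storage -{storage_cut:.1f}%")
```

Under these figures, QLoRA + KD works out to roughly 41% fewer parameters and 43% less storage, and the pruning-first pipeline to roughly 51% and 48%, which is consistent with the ">40%" and "~50%" wording.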

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
BLEU	43.25	Full FT 39.28	+3.97	ACL 60/60 test (EN-DE)	QLoRA + knowledge distillation outperforms full fine-tuning on EN-DE	Table 1
LoRA	59.60	Full FT 58.54	+1.06	ACL 60/60 test (EN-ZH)	QLoRA + knowledge distillation slightly improves over full fine-tuning on EN-ZH	Table 1

What To Try In 7 Days

Full fine-tune Qwen2-Audio-7B-Instruct on your in-domain data for a few epochs.

Create teacher translations (sequence-level KD) and augment your training data with them.

Apply 4-bit QLoRA (nf4, double_quant) with LoRA rank=64, alpha=128, and rsLoRA to reduce storage quickly; the setup is cheap to test on one GPU node.
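The QLoRA settings named above map onto the Hugging Face `transformers` and `peft` configuration objects roughly as follows. This is a sketch under assumptions, not the authors' training script: the `target_modules` list is a guess (the summary does not say which projections receive adapters), and `build_configs` and `lora_scaling` are hypothetical helper names. The library imports are deferred so the hyperparameter arithmetic can be inspected without the libraries installed.

```python
import math

# Quantization settings named in the text: 4-bit, nf4, double quantization.
QUANT_KWARGS = {
    "load_in_4bit": True,
    "bnb_4bit_quant_type": "nf4",
    "bnb_4bit_use_double_quant": True,
}

# LoRA settings named in the text: rank 64, alpha 128, rsLoRA.
LORA_KWARGS = {
    "r": 64,
    "lora_alpha": 128,
    "use_rslora": True,
    # Assumption: adapters on the attention projections; not stated in the source.
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
}


def lora_scaling(r: int, alpha: float, use_rslora: bool) -> float:
    """Adapter scaling: alpha / r for classic LoRA, alpha / sqrt(r) for rsLoRA."""
    return alpha / math.sqrt(r) if use_rslora else alpha / r


def build_configs():
    """Create the library config objects (requires transformers and peft)."""
    from transformers import BitsAndBytesConfig
    from peft import LoraConfig

    return BitsAndBytesConfig(**QUANT_KWARGS), LoraConfig(**LORA_KWARGS)


print(lora_scaling(64, 128, use_rslora=False))  # classic LoRA scaling
print(lora_scaling(64, 128, use_rslora=True))   # rsLoRA scaling
```

With rank 64 and alpha 128, classic LoRA would scale the adapter update by 2, while rsLoRA scales it by 16, so the rsLoRA flag changes the effective update size considerably at this rank.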

Optimization Features

Infra Optimization

LoRA

Model Optimization

iterative decoder-only layer pruning (performance-guided)

4-bit quantization (nf4, double_quant) via BitsAndBytes
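One plausible reading of performance-guided iterative pruning is a greedy loop: at each step, try removing each remaining decoder layer, keep the removal that hurts a dev-set score least, and stop before quality collapses. The sketch below illustrates that idea on toy data. The function names, the `min_score` guard, and the per-layer importance values are all hypothetical; the paper's actual selection criterion may differ.

```python
from typing import Callable, List, Sequence


def iterative_layer_pruning(
    layers: Sequence[int],
    evaluate: Callable[[List[int]], float],
    n_remove: int,
    min_score: float,
) -> List[int]:
    """Greedily drop layers one at a time, guided by an evaluation score.

    At each step every remaining layer is tried as a removal candidate; the
    removal leaving the highest score is applied. Pruning stops early if even
    the best candidate would fall below `min_score`.
    """
    kept = list(layers)
    for _ in range(n_remove):
        candidates = [
            (evaluate(kept[:i] + kept[i + 1:]), i) for i in range(len(kept))
        ]
        best_score, best_idx = max(candidates)
        if best_score < min_score:
            break  # guard against over-pruning
        del kept[best_idx]
    return kept


# Toy stand-in for a dev-set metric: hypothetical per-layer importance values.
importance = {0: 5.0, 1: 0.5, 2: 0.2, 3: 3.0, 4: 0.1, 5: 4.0}


def toy_evaluate(kept: List[int]) -> float:
    return sum(importance[i] for i in kept)


pruned = iterative_layer_pruning(range(6), toy_evaluate, n_remove=3, min_score=10.0)
print(pruned)  # [0, 3, 5]
```

In a real run, `evaluate` would score the pruned model on held-out translations (for example with BLEU or COMET), which makes each step expensive; the early-stop threshold is one simple way to avoid the collapse seen when too many layers are removed.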

System Optimization

LoRA

oversample KD data to bias student toward teacher outputs
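Oversampling KD data can be sketched as building a training set in which teacher-generated translations appear more often than the gold references. The example below is a minimal illustration: `teacher_translate` is a hypothetical callable standing in for inference with the fine-tuned teacher, and the oversampling factor of 2 is an arbitrary placeholder, since the summary does not state the ratio used.

```python
import random
from typing import Callable, Dict, List

Example = Dict[str, str]


def build_kd_training_set(
    gold: List[Example],
    teacher_translate: Callable[[str], str],
    kd_oversample: int = 2,
    seed: int = 0,
) -> List[Example]:
    """Mix gold references with teacher outputs, repeating the KD copies."""
    kd = [
        {"audio": ex["audio"], "target": teacher_translate(ex["audio"]), "source": "kd"}
        for ex in gold
    ]
    gold_tagged = [{**ex, "source": "gold"} for ex in gold]
    mixed = gold_tagged + kd * kd_oversample  # KD examples appear kd_oversample times
    random.Random(seed).shuffle(mixed)
    return mixed


# Toy data; the file names and the fake teacher are placeholders.
gold_data = [
    {"audio": "utt1.wav", "target": "Hallo Welt"},
    {"audio": "utt2.wav", "target": "Guten Morgen"},
]
train_set = build_kd_training_set(gold_data, lambda path: f"<teacher output for {path}>")
print(len(train_set))  # 6
```

Keeping a `source` tag on each example makes it easy to check the gold-to-KD ratio later or to vary it as an ablation.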

Training Optimization

full-parameter fine-tuning before compression

sequence-level knowledge distillation (teacher→student)

LoRA

Inference Optimization

pruning reduces depth for faster inference (≈20–40% measured)

note: 4-bit quantization reduces storage but can reduce inference speed

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/ymoslem/Model-Compression

https://hf.co/Qwen/Qwen2-Audio-7B-Instruct

Data URLs

https://hf.co/datasets/ymoslem/acl-6060

https://github.com/facebookresearch/covost

Risks & Boundaries

Limitations

Experiments use a small in-domain corpus (784 training utterances); results may not generalize to large or very different domains.

Work is limited to the Qwen2-Audio family; behavior on other audio-language models is untested.

When Not To Use

When your priority is lowest possible latency and quantization overhead would slow inference.

When you lack a fully fine-tuned teacher model for sequence-level KD.

Failure Modes

Over-pruning: removing too many decoder layers (≥16) collapses performance.

Fine-tuning immediately after every pruning step can overfit small in-domain data and may not outperform a single fine-tune after pruning.

Core Entities

Models

Qwen2-Audio-7B-Instruct, Qwen2-Audio-7B, LoRA, BitsAndBytes

Metrics

BLEU, chrF, chrF++, COMET

Datasets

ACL 60/60, CoVoST2
