Overview
The study gives strong practical signals: 4-bit PTQ is deployable in many cases, 3-bit is risky, and 2-bit breaks multimodal tasks. These results hold across multiple quantization methods and benchmarks.
Citations: 11
Evidence Strength: 0.80
Confidence: 0.80
Risk Signals: 8
Trust Signals
Findings with numeric evidence: 6/6
Findings with evidence refs: 6/6
Results with explicit delta: 5/5
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 70%
Production readiness: 60%
Novelty: 30%
Why It Matters For Business
4-bit quantization gives big memory and cost savings with small accuracy loss; ultra-low bits (≤2) are risky for multimodal products and need more work.
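The memory saving can be estimated with simple arithmetic. The sketch below is a back-of-the-envelope estimate of weight storage only. It assumes nominal parameter counts of 8e9 and 70e9, a group size of 128, and one 16-bit scale plus one 16-bit zero-point per group. The per-group overhead is an illustrative assumption, not a figure from the paper, and activations and the KV cache are ignored.

```python
def weight_memory_gib(n_params: float, bits: int,
                      group_size: int = 128, scale_bits: int = 16) -> float:
    """Approximate weight storage in GiB.

    Assumption (not from the paper): each group of `group_size` weights
    stores one scale and one zero-point, each `scale_bits` wide.
    Full-precision weights (16 bits or more) carry no such overhead.
    """
    if bits >= 16:
        bits_per_weight = float(bits)
    else:
        overhead = 2 * scale_bits / group_size  # extra bits per weight
        bits_per_weight = bits + overhead
    return n_params * bits_per_weight / 8 / 2**30


# Nominal parameter counts; the real models differ slightly.
for name, n_params in [("LLaMA3-8B", 8e9), ("LLaMA3-70B", 70e9)]:
    for bits in (16, 4, 3, 2):
        gib = weight_memory_gib(n_params, bits)
        print(f"{name} at {bits}-bit: {gib:.1f} GiB of weights")
```

Under these assumptions a 4-bit model needs roughly a quarter of the FP16 weight memory; actual savings depend on the quantization format and runtime.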
Summary TLDR
This paper runs a wide empirical evaluation of low-bit quantization on LLaMA3 (8B and 70B) and on the LLaVA-Next-8B multimodal LLM (MLLM). It tests 9 post-training quantization (PTQ) methods and 2 LoRA fine-tuning (LoRA-FT) methods at 1–8 bits across language and visual benchmarks. Main takeaways: 4-bit quantization is practical with small loss; 3-bit often causes clearer degradation; 2-bit usually collapses on multimodal tasks; LLaMA3-70B is more robust than LLaMA3-8B; LoRA-FT (QLoRA/IR-QLoRA) fails to recover losses below 4 bits on LLaMA3. Code and some quantized models are released.
Problem Statement
Can current low-bit quantization methods compress LLaMA3 and LLaMA3-based multimodal models without unacceptable accuracy loss? The paper measures PTQ and LoRA-FT methods across bits (1–8) and benchmarks to map where quantization is practical and where it fails.
Main Contribution
Comprehensive evaluation of 9 PTQ and 2 LoRA-FT methods on LLaMA3-8B, LLaMA3-70B, and LLaVA-Next-8B.
Head-to-head results across language (PPL, CommonSenseQA, MMLU) and six multimodal QA benchmarks.
Key Findings
4-bit post-training quantization keeps quality close to full precision on many tasks
3-bit quantization often causes clear degradation but some methods remain usable
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy (4-bit PTQ) | ≈2% drop vs FP16 on evaluated benchmarks | FP16 (16-bit) | ≈-2% | CommonSenseQA / MMLU (evaluated sets) | Paper reports ~2% average drop across PPL and CommonSenseQA | Tables 1-4 |
| 3-bit degradation (method dependent) | 3-bit loss ranges from <5% to >10% depending on method | 4-bit PTQ | varies by method (RTN worse; AWQ/GPTQ/SliM better) | CommonSenseQA | Paper shows RTN >10% drop; AWQ/GPTQ/SliM keep <5% vs 4-bit | Tables 3-4 |
What To Try In 7 Days
Run 4-bit PTQ (AWQ/GPTQ) on a dev copy and compare PPL and task accuracy
Measure GPU memory and latency; try SmoothQuant if memory is tight
Avoid 2-bit in multimodal pipelines; validate 3-bit on a per-task basis
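The comparison in the first step can be scripted once per-token log-probabilities and task predictions have been collected from the full-precision and quantized models. The following is a minimal sketch with hypothetical helper names (`perplexity`, `accuracy`, `relative_change`) and made-up toy numbers; it does not load or run any model.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp(-mean natural-log probability per token)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def accuracy(predictions, labels):
    """Fraction of predictions that match the labels."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and aligned")
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def relative_change(baseline, candidate):
    """Signed relative change of candidate vs baseline (0.05 means +5%)."""
    return (candidate - baseline) / baseline


# Toy values for illustration only; replace with real model outputs.
fp16_logprobs = [-1.9, -2.1, -2.0, -2.2]
int4_logprobs = [-2.0, -2.2, -2.1, -2.3]

ppl_fp16 = perplexity(fp16_logprobs)
ppl_int4 = perplexity(int4_logprobs)
print(f"PPL FP16={ppl_fp16:.2f}  4-bit={ppl_int4:.2f}  "
      f"change={relative_change(ppl_fp16, ppl_int4):+.1%}")
```

Lower perplexity is better, so a positive relative change in PPL is a regression, while a positive change in accuracy is an improvement.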
Risks & Boundaries
Limitations
LoRA-FT used the small Alpaca instruction dataset, which may not reflect results with larger fine-tuning sets
The evaluation fixes the block/group size (128) and the calibration set (128 WikiText2 examples); both choices affect the results
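To see why bit width and group size matter, here is a minimal sketch of group-wise round-to-nearest (RTN) quantization, the simplest baseline named in the results. It assumes asymmetric min/max scaling per group and uses synthetic Gaussian values in place of real LLaMA3 weights. It is not the paper's implementation, and weight reconstruction error is only a rough proxy for task accuracy.

```python
import random


def rtn_quantize_group(weights, bits):
    """Quantize one group to `bits` with min/max scaling, then dequantize."""
    levels = 2**bits - 1
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels if hi > lo else 1.0
    dequantized = []
    for w in weights:
        q = round((w - lo) / scale)      # nearest integer level
        q = max(0, min(levels, q))       # clamp to the representable range
        dequantized.append(lo + q * scale)
    return dequantized


def rtn_quantize(weights, bits, group_size=128):
    """Apply RTN independently to consecutive groups of `group_size` weights."""
    out = []
    for start in range(0, len(weights), group_size):
        out.extend(rtn_quantize_group(weights[start:start + group_size], bits))
    return out


def mean_squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


random.seed(0)
# Synthetic stand-in for a weight row; real weights have different statistics.
weights = [random.gauss(0.0, 0.02) for _ in range(1024)]
errors = {bits: mean_squared_error(weights, rtn_quantize(weights, bits))
          for bits in (8, 4, 3, 2)}
for bits, err in errors.items():
    print(f"{bits}-bit RTN, group size 128: MSE = {err:.3e}")
```

Each bit removed roughly halves the number of levels per group, so the reconstruction error grows quickly below 4 bits. Methods such as AWQ and GPTQ add calibration-driven corrections on top of this basic scheme.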
When Not To Use
Do not use 2-bit PTQ for LLaMA3 backbones inside MLLMs for visual QA
Do not assume LoRA-FT with small instruction data will fix quantization damage below 4 bits
Failure Modes
Complete performance collapse at ultra-low bits (2-bit) for visual tasks
LoRA fine-tuning can worsen quantized LLaMA3 under some settings

