Overview
Production Readiness
0.6
Novelty Score
0.3
Cost Impact Score
0.7
Citation Count
11
Why It Matters For Business
4-bit quantization cuts memory use and serving cost substantially with only a small accuracy loss; ultra-low bit-widths (2 bits and below) are risky for multimodal products and need further work.
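As a rough illustration of the memory side of this claim, the sketch below estimates weight-only storage at several bit-widths. It assumes nominal parameter counts of 8e9 and 70e9 and a per-group overhead of one 16-bit scale plus one 16-bit zero point for every 128 weights. It ignores activations, the KV cache, and any layers kept in higher precision. These are illustrative assumptions, not figures from the paper, and `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(n_params, bits, group_size=128, meta_bits=32):
    """Estimate weight storage in GB (1 GB = 1e9 bytes).

    meta_bits is the assumed per-group overhead
    (16-bit scale + 16-bit zero point); it is not applied at 16 bits or more.
    """
    if bits >= 16:
        return n_params * bits / 8 / 1e9
    bits_per_weight = bits + meta_bits / group_size
    return n_params * bits_per_weight / 8 / 1e9


# Nominal parameter counts (assumption): 8e9 and 70e9.
for name, n_params in [("8B", 8e9), ("70B", 70e9)]:
    for bits in (16, 8, 4, 3, 2):
        gb = weight_memory_gb(n_params, bits)
        print(f"{name} @ {bits:>2}-bit: {gb:6.1f} GB")
```

Under these assumptions, 4-bit weights for 8e9 parameters take about 4.25 GB against 16 GB at 16 bits, which is roughly a quarter of the footprint.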
Summary TLDR
This paper runs a broad empirical evaluation of low-bit quantization on LLaMA3 (8B, 70B) and the LLaVA-Next-8B multimodal LLM (MLLM). It tests 9 post-training quantization (PTQ) methods and 2 LoRA fine-tuning (LoRA-FT) methods at 1–8 bits across language and visual benchmarks. Main takeaways: 4-bit quantization is practical with small loss; 3-bit often degrades more; 2-bit usually collapses on multimodal tasks; LLaMA3-70B is more robust than 8B; LoRA-FT (QLoRA/IR-QLoRA) fails to recover the accuracy lost below 4 bits on LLaMA3. Code and some quantized models are released.
Problem Statement
Can current low-bit quantization methods compress LLaMA3 and LLaMA3-based multimodal models without unacceptable accuracy loss? The paper measures PTQ and LoRA-FT methods across bits (1–8) and benchmarks to map where quantization is practical and where it fails.
Main Contribution
Comprehensive evaluation of 9 PTQ and 2 LoRA-FT methods on LLaMA3-8B, LLaMA3-70B, and LLaVA-Next-8B.
Head-to-head results across language (PPL, CommonSenseQA, MMLU) and six multimodal QA benchmarks.
Practical measurement of GPU memory, quantization time, and inference latency for selected methods.
Key Findings
4-bit post-training quantization keeps quality close to full precision on many tasks
3-bit quantization often causes clear degradation but some methods remain usable
2-bit (and lower) frequently collapses for multimodal QA
LLaMA3-70B is more robust to low-bit quantization than 8B
LoRA fine-tuning on Alpaca fails to recover the accuracy of LLaMA3 quantized below 4 bits and can make it worse
Some PTQ methods are faster and more memory efficient in practice
Results
Accuracy
3-bit degradation (method dependent)
Accuracy
MLLM 2-bit collapse
Memory / latency examples
Who Should Care
What To Try In 7 Days
Run 4-bit PTQ (AWQ/GPTQ) on a dev copy and compare PPL and task accuracy
Measure GPU memory and latency; try SmoothQuant if memory is tight
Avoid 2-bit in multimodal pipelines; validate 3-bit on a per-task basis
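The PPL comparison in the first step can be sketched as follows. This is a minimal example that assumes you have already collected per-token negative log-likelihoods (in nats) from the full-precision and quantized models on the same held-out text. The helper names and the numbers in the lists are placeholders, not measurements from the paper.

```python
import math


def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token, in nats)."""
    if not token_nlls:
        raise ValueError("need at least one token")
    return math.exp(sum(token_nlls) / len(token_nlls))


def relative_increase(baseline_ppl, quantized_ppl):
    """Fractional PPL increase of the quantized model over the baseline."""
    return (quantized_ppl - baseline_ppl) / baseline_ppl


# Placeholder per-token NLLs; in practice both lists come from scoring
# the same evaluation text with each model.
baseline_nlls = [1.8, 2.1, 1.9, 2.0]
quantized_nlls = [1.9, 2.2, 2.0, 2.1]

base = perplexity(baseline_nlls)
quant = perplexity(quantized_nlls)
print(f"baseline PPL={base:.3f}  quantized PPL={quant:.3f}  "
      f"increase={relative_increase(base, quant):.1%}")
```

Tracking the relative increase rather than the raw PPL makes it easier to apply one acceptance threshold across models of different sizes.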
Optimization Features
Infra Optimization
- tests on 8× NVIDIA A800 (80GB)
Model Optimization
- weight-only quantization
- grouped mixed-precision
- binarization/residual quantization
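To make the weight-only idea concrete, here is a minimal sketch of group-wise round-to-nearest (RTN) quantization, the simplest weight-only baseline. It is not GPTQ, AWQ, or any other specific method from the paper. The group size of 128 mirrors the evaluation setting noted under Limitations; the function names and the synthetic Gaussian weights are assumptions made for illustration.

```python
import random


def quantize_group(weights, bits):
    """Asymmetric round-to-nearest quantization of one group of weights."""
    qmax = (1 << bits) - 1
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / qmax if hi > lo else 1.0  # guard against a constant group
    codes = [min(qmax, max(0, round((w - lo) / scale))) for w in weights]
    return codes, scale, lo


def dequantize_group(codes, scale, lo):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale + lo for c in codes]


def fake_quantize(weights, bits, group_size=128):
    """Quantize then dequantize, one group at a time, to expose the error."""
    out = []
    for start in range(0, len(weights), group_size):
        codes, scale, lo = quantize_group(weights[start:start + group_size], bits)
        out.extend(dequantize_group(codes, scale, lo))
    return out


def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# Synthetic weights (assumption): zero-mean Gaussian with a small spread.
rng = random.Random(0)
weights = [rng.gauss(0.0, 0.02) for _ in range(512)]

errors = {bits: mean_abs_error(weights, fake_quantize(weights, bits))
          for bits in (8, 4, 3, 2)}
for bits, err in errors.items():
    print(f"{bits}-bit RTN mean abs error: {err:.6f}")
```

Each group stores its own scale and minimum, so a single outlier only widens the quantization step inside its own group of 128 weights rather than across the whole tensor.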
System Optimization
- GPU memory/time trade-offs reported
- LoRA
Training Optimization
- LoRA
- large-batch fine-tuning used in some ultra-low methods
Inference Optimization
- block-wise quantization kernels (GPTQ/AWQ/OmniQuant)
- activation-to-weight smoothing via per-channel scaling (SmoothQuant)
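The SmoothQuant idea can be sketched in a few lines: each input channel of the activations is divided by a factor s_j and the matching weight row is multiplied by s_j, so the layer output is mathematically unchanged while activation outliers shrink. The sketch below uses s_j = max|X_j|^alpha / max|W_j|^(1 - alpha) with alpha = 0.5. The tiny matrices are made-up values, the helpers are hypothetical, and the code assumes every channel has a nonzero maximum; it does not reproduce the released kernels.

```python
def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)] for row in a]


def smooth(x, w, alpha=0.5):
    """Rescale activations x (tokens x in) and weights w (in x out).

    Assumes no channel is all zeros, so every scale factor is finite.
    """
    n_in = len(w)
    act_max = [max(abs(row[j]) for row in x) for j in range(n_in)]
    w_max = [max(abs(v) for v in w[j]) for j in range(n_in)]
    s = [act_max[j] ** alpha / w_max[j] ** (1 - alpha) for j in range(n_in)]
    x_s = [[row[j] / s[j] for j in range(n_in)] for row in x]
    w_s = [[v * s[j] for v in w[j]] for j in range(n_in)]
    return x_s, w_s, s


# Made-up example: input channel 1 carries large activation outliers.
x = [[0.5, 40.0], [-0.3, -60.0]]
w = [[0.2, -0.1], [0.05, 0.3]]

x_s, w_s, s = smooth(x, w)
y = matmul(x, w)
y_s = matmul(x_s, w_s)

print("scales:", [round(v, 3) for v in s])
print("max |x| before:", max(abs(v) for row in x for v in row))
print("max |x| after: ", round(max(abs(v) for row in x_s for v in row), 3))
print("outputs match: ", all(abs(p - q) < 1e-9
                             for r1, r2 in zip(y, y_s) for p, q in zip(r1, r2)))
```

With alpha = 0.5 the per-channel maxima of the smoothed activations and weights become equal, which is why both become easier to quantize with a shared 8-bit range.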
Reproducibility
Code Urls
- https://github.com/IST-DASLab/gptq
- https://github.com/mit-han-lab/llm-awq
- https://github.com/mit-han-lab/smoothquant
- https://github.com/Cornell-RelaxML/QuIP
- https://github.com/Aaronhuang-778/SliM-LLM
- https://github.com/hahnyuan/PB-LLM
- https://github.com/Aaronhuang-778/BiLLM
- https://github.com/artidoro/qlora
- https://github.com/htqin/IR-QLoRA
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- LoRA-FT used the small Alpaca instruction dataset, which may not reflect larger fine-tuning sets
- The evaluation fixes the block/group size at 128 and the calibration set at 128 WikiText2 examples; results may differ under other settings
- Some latency numbers depend on method-provided kernels and may vary by deployment
When Not To Use
- Do not use 2-bit PTQ for LLaMA3 backbones inside MLLMs for visual QA
- Do not assume LoRA-FT with small instruction data will fix quantization damage below 4 bits
Failure Modes
- Complete performance collapse at ultra-low bits (2-bit) for visual tasks
- LoRA fine-tuning can worsen quantized LLaMA3 under some settings
- Unexpected or repetitive character outputs in 2-bit MLLM runs
Core Entities
Models
- LLaMA3-8B
- LLaMA3-70B
- LLaVA-Next-8B
- LLaMA-7B
- LLaMA2-7B
Metrics
- Perplexity (PPL)
- Accuracy
- Tokens/sec
- GPU memory (GB)
- Quantization time
Datasets
- WikiText2
- C4
- PTB
- PIQA
- ARC-e
- ARC-c
- HellaSwag
- Winogrande
- MMLU
- AI2D
- ChartQA
- DocVQA
- MME
- MMBench (English)
Benchmarks
- MMLU
- CommonSenseQA (PIQA/ARC-e/ARC-c/HellaSwag/Winogrande)
- Multimodal QA (AI2D/ChartQA/DocVQA/MMBench/MME)

