How low-bit quantization changes LLaMA3 and a LLaVA MLLM

April 22, 2024 · 7 min

Overview

Decision Snapshot: Ready For Pilot

The study gives strong practical signals: 4-bit PTQ is deployable in many cases, 3-bit is risky, and 2-bit breaks multimodal tasks. These patterns hold across multiple methods and benchmarks.

Citations: 11

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 30%

Authors

Wei Huang, Xingyu Zheng, Xudong Ma, Haotong Qin, Chengtao Lv, Hong Chen, Jie Luo, Xiaojuan Qi, Xianglong Liu, Michele Magno

Links

Abstract / PDF / Code

Why It Matters For Business

4-bit quantization gives big memory and cost savings with small accuracy loss; ultra-low bits (≤2) are risky for multimodal products and need more work.
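The memory side of this claim can be checked with back-of-the-envelope arithmetic. The sketch below is a rough estimate, not a figure from the paper: it assumes weight-only quantization, nominal parameter counts of 8e9 and 70e9, and a hypothetical per-group overhead of one 16-bit scale plus one 16-bit zero-point for every 128 weights. The helper `weight_memory_gib` is an illustrative name.

```python
def weight_memory_gib(n_params, bits, group_size=0, overhead_bits_per_group=0):
    """Approximate weight storage in GiB at a given bit-width.

    group_size / overhead_bits_per_group model per-group metadata
    (e.g. a scale and zero-point); both are illustrative assumptions.
    """
    effective_bits = bits
    if group_size:
        effective_bits += overhead_bits_per_group / group_size
    return n_params * effective_bits / 8 / 2**30


# Nominal parameter counts (assumed round numbers, not exact model sizes).
MODELS = {"LLaMA3-8B": 8e9, "LLaMA3-70B": 70e9}

for name, n_params in MODELS.items():
    fp16 = weight_memory_gib(n_params, 16)
    # Assumption: 16-bit scale + 16-bit zero-point per group of 128 weights.
    w4 = weight_memory_gib(n_params, 4, group_size=128, overhead_bits_per_group=32)
    print(f"{name}: FP16 ~{fp16:.1f} GiB, 4-bit ~{w4:.1f} GiB "
          f"({1 - w4 / fp16:.0%} smaller)")
```

Under these assumptions the weights shrink by a bit under three quarters. Activations, the KV cache, and any layers kept in higher precision are not counted, so measured savings on a real deployment will differ.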

Who Should Care

Summary TLDR

This paper runs a wide empirical evaluation of low-bit quantization on LLaMA3 (8B, 70B) and LLaVA-Next-8B MLLM. It tests 9 post-training quantization (PTQ) methods and 2 LoRA fine-tuning (LoRA-FT) methods at 1–8 bits across language and visual benchmarks. Main takeaways: 4-bit quantization is practical with small loss; 3-bit often degrades more; 2-bit usually collapses for multimodal tasks; LLaMA3-70B is more robust than 8B; LoRA-FT (QLoRA/IR-QLoRA) fails to recover losses below 4 bits on LLaMA3. Code and some quantized models are released.

Problem Statement

Can current low-bit quantization methods compress LLaMA3 and LLaMA3-based multimodal models without unacceptable accuracy loss? The paper measures PTQ and LoRA-FT methods across bits (1–8) and benchmarks to map where quantization is practical and where it fails.

Main Contribution

Comprehensive evaluation of 9 PTQ and 2 LoRA-FT methods on LLaMA3-8B, LLaMA3-70B, and LLaVA-Next-8B.

Head-to-head results across language (PPL, CommonSenseQA, MMLU) and six multimodal QA benchmarks.

Key Findings

4-bit post-training quantization keeps quality close to full precision on many tasks

Numbers: ≈2% average drop vs. FP16 on evaluated benchmarks

Practical Use: Use 4-bit PTQ (e.g., GPTQ/AWQ/SliM-LLM) for production-like savings with small accuracy loss.

Evidence Ref: Tables 1-4; paper text

3-bit quantization often causes clear degradation but some methods remain usable

Numbers: 3-bit loss ranges from <5% (AWQ/GPTQ/SliM) to >10% (RTN) on CommonSenseQA

Practical Use: If you need 3-bit, prefer advanced PTQ (AWQ/GPTQ/SliM-LLM); expect task-dependent drops and validate per task.

Evidence Ref: Tables 1-4; paper text
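To see why the RTN baseline degrades quickly as bits are removed, it helps to look at the rounding error directly. The following is a minimal sketch of group-wise round-to-nearest (RTN) quantization in pure Python, assuming asymmetric min/max scaling and a group size of 128 (the group size noted under Limitations). It runs on synthetic Gaussian weights, not on LLaMA3, and the function names are illustrative rather than taken from the paper's code.

```python
import random


def rtn_quantize_group(weights, bits):
    """Asymmetric round-to-nearest: map [min, max] onto 2**bits integer codes."""
    qmax = (1 << bits) - 1
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / qmax if hi > lo else 1.0
    codes = [min(qmax, max(0, round((w - lo) / scale))) for w in weights]
    return codes, scale, lo


def rtn_roundtrip(weights, bits, group_size=128):
    """Quantize then dequantize, one group of `group_size` weights at a time."""
    restored = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        codes, scale, lo = rtn_quantize_group(group, bits)
        restored.extend(c * scale + lo for c in codes)
    return restored


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


random.seed(0)
# Synthetic stand-in for a weight row; real LLM weights have outliers.
weights = [random.gauss(0.0, 0.02) for _ in range(1024)]

for bits in (4, 3, 2):
    err = mse(weights, rtn_roundtrip(weights, bits))
    print(f"{bits}-bit RTN reconstruction MSE: {err:.3e}")
```

Each bit removed roughly halves the number of levels available per group, so reconstruction error grows quickly. Methods such as GPTQ and AWQ add calibration-driven corrections on top of this basic rounding step, which this sketch does not attempt to model.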

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | ≈2% drop vs FP16 on evaluated benchmarks | FP16/16-bit | ≈-2% | CommonSenseQA / MMLU (evaluated sets) | Paper reports ~2% average drop across PPL and CommonSenseQA | Tables 1-4
3-bit degradation (method dependent) | 3-bit loss ranges <5% to >10% depending on method | 4-bit PTQ | varies by method (RTN worse; AWQ/GPTQ/SliM better) | CommonSenseQA | Paper shows RTN >10% drop; AWQ/GPTQ/SliM keep <5% vs 4-bit | Tables 3-4

What To Try In 7 Days

Run 4-bit PTQ (AWQ/GPTQ) on a dev copy and compare PPL and task accuracy

Measure GPU memory and latency; try SmoothQuant if memory is tight

Avoid 2-bit in multimodal pipelines; validate 3-bit on a per-task basis
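The comparison steps above can be sketched as a small acceptance check. This is one possible harness, not the paper's evaluation code: `perplexity` uses the standard definition (exponential of the mean per-token negative log-likelihood), while `accuracy_gate` and its 2% relative-drop threshold are hypothetical choices for illustration. The accuracy values are placeholders, not results from the study.

```python
import math


def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token, in nats)."""
    return math.exp(sum(token_nlls) / len(token_nlls))


def relative_change(baseline, candidate):
    """Signed fractional change of candidate relative to baseline."""
    return (candidate - baseline) / baseline


def accuracy_gate(fp16_acc, quant_acc, max_rel_drop=0.02):
    """Return tasks whose relative accuracy drop exceeds max_rel_drop."""
    failing = {}
    for task, base in fp16_acc.items():
        drop = -relative_change(base, quant_acc[task])
        if drop > max_rel_drop:
            failing[task] = drop
    return failing


# Placeholder numbers for illustration only; they are not results from the paper.
fp16_acc = {"task_a": 0.80, "task_b": 0.60}
w4_acc = {"task_a": 0.79, "task_b": 0.57}
print(accuracy_gate(fp16_acc, w4_acc))
```

Gating per task rather than on an average follows the advice to validate 3-bit task by task: an acceptable mean can hide one task that has degraded badly.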

Optimization Features

Infra Optimization
Tests on 8× NVIDIA A800 (80GB)
Model Optimization
Weight-only quantization; grouped mixed-precision; binarization/residual quantization
System Optimization
GPU memory/time trade-offs reported; LoRA
Training Optimization
LoRA; large-batch fine-tuning used in some ultra-low methods
Inference Optimization
Block-wise quantization kernels (GPTQ/AWQ/OmniQuant); activation-weight shifting (SmoothQuant)

Risks & Boundaries

Limitations

LoRA-FT used the small Alpaca instruction dataset, which may not reflect larger fine-tuning sets

The evaluation fixes the block/group size (128) and the calibration set (128 examples from WikiText2), both of which affect results

When Not To Use

Do not use 2-bit PTQ for LLaMA3 backbones inside MLLMs for visual QA

Do not assume LoRA-FT with small instruction data will fix quantization damage below 4 bits

Failure Modes

Complete performance collapse at ultra-low bits (2-bit) for visual tasks

LoRA fine-tuning can worsen quantized LLaMA3 under some settings

Core Entities

Models

LLaMA3-8B, LLaMA3-70B, LLaVA-Next-8B, LLaMA-7B, LLaMA2-7B

Metrics

Perplexity (PPL), Accuracy, Tokens/sec, GPU memory (GB), Quantization time

Datasets

WikiText2, C4, PTB, PIQA, ARC-e, ARC-c, HellaSwag, Winogrande, MMLU, AI2D, ChartQA, DocVQA, MME, MMBench (English)

Benchmarks

MMLU, CommonSenseQA (PIQA/ARC/HellaSwag/Wino), Multimodal QA (AI2D/ChartQA/DocVQA/MMBench/MME)