How low-bit quantization changes LLaMA3 and a LLaVA MLLM

Overview

Decision Snapshot: Ready For Pilot

The study gives strong practical signals: 4-bit PTQ is deployable in many cases, 3-bit is risky, and 2-bit breaks multimodal tasks; results are solid across multiple methods and benchmarks.

Citations: 11

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 30%

Authors

Wei Huang, Xingyu Zheng, Xudong Ma, Haotong Qin, Chengtao Lv, Hong Chen, Jie Luo, Xiaojuan Qi, Xianglong Liu, Michele Magno

Links

Abstract / PDF / Code

Why It Matters For Business

4-bit quantization gives big memory and cost savings with small accuracy loss; ultra-low bits (≤2) are risky for multimodal products and need more work.
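The memory side of this claim can be approximated with simple arithmetic. The sketch below is a rough estimate, not a measurement from the paper. It assumes nominal parameter counts (8e9 and 70e9), that every weight is quantized, and that each group of 128 weights stores 32 bits of metadata (a 16-bit scale plus a 16-bit zero point). The function name `weight_memory_gib` and the `meta_bits` parameter are hypothetical.

```python
def weight_memory_gib(n_params, bits, group_size=128, meta_bits=32):
    """Approximate weight storage in GiB.

    Assumption: group-wise quantization keeps `meta_bits` of metadata
    (scale + zero point) per group of `group_size` weights. FP16 weights
    carry no such overhead.
    """
    overhead = 0.0 if bits >= 16 else meta_bits / group_size
    return n_params * (bits + overhead) / 8 / 2**30


if __name__ == "__main__":
    # Nominal parameter counts; real checkpoints differ slightly.
    for name, n_params in [("LLaMA3-8B", 8e9), ("LLaMA3-70B", 70e9)]:
        for bits in (16, 4, 3, 2):
            gib = weight_memory_gib(n_params, bits)
            print(f"{name} @ {bits:>2}-bit: {gib:7.2f} GiB")
```

Under these assumptions, 4-bit weights take 4.25/16, or about 27%, of the FP16 footprint. Real deployments often keep embeddings and the output head in higher precision and also need memory for activations and the KV cache, so measured savings will be smaller.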

Who Should Care

CTO, ML Engineer, Engineering Lead, Product Manager

Summary TLDR

This paper runs a wide empirical evaluation of low-bit quantization on LLaMA3 (8B, 70B) and LLaVA-Next-8B MLLM. It tests 9 post-training quantization (PTQ) methods and 2 LoRA fine-tuning (LoRA-FT) methods at 1–8 bits across language and visual benchmarks. Main takeaways: 4-bit quantization is practical with small loss; 3-bit often degrades more; 2-bit usually collapses for multimodal tasks; LLaMA3-70B is more robust than 8B; LoRA-FT (QLoRA/IR-QLoRA) fails to recover losses below 4 bits on LLaMA3. Code and some quantized models are released.

Problem Statement

Can current low-bit quantization methods compress LLaMA3 and LLaMA3-based multimodal models without unacceptable accuracy loss? The paper measures PTQ and LoRA-FT methods across bits (1–8) and benchmarks to map where quantization is practical and where it fails.

Main Contribution

Comprehensive evaluation of 9 PTQ and 2 LoRA-FT methods on LLaMA3-8B, LLaMA3-70B, and LLaVA-Next-8B.

Head-to-head results across language (PPL, CommonSenseQA, MMLU) and six multimodal QA benchmarks.

Key Findings

4-bit post-training quantization keeps quality close to full precision on many tasks

Numbers: ≈2% average drop vs. FP16 on evaluated benchmarks

Practical Use: Use 4-bit PTQ (e.g., GPTQ/AWQ/SliM-LLM) for production-like savings with small accuracy loss.

Evidence Ref: Tables 1-4; paper text

3-bit quantization often causes clear degradation but some methods remain usable

Numbers: 3-bit loss ranges from <5% (AWQ/GPTQ/SliM-LLM) to >10% (RTN) on CommonSenseQA

Practical Use: If you need 3-bit, prefer advanced PTQ (AWQ/GPTQ/SliM-LLM); expect task-dependent drops and validate per task.

Evidence Ref: Tables 1-4; paper text
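To make the bit-width trend in these findings concrete, here is a minimal NumPy sketch of group-wise round-to-nearest (RTN) quantization, the simplest baseline named above. It is a generic illustration, not the paper's implementation. The weights are random Gaussian values (an assumption), so the printed errors say nothing about LLaMA3 accuracy. The group size of 128 follows the evaluation setting noted under Limitations. The function name `rtn_quantize` is hypothetical.

```python
import numpy as np


def rtn_quantize(w, bits, group_size=128):
    """Asymmetric round-to-nearest quantize-then-dequantize, per group."""
    rows, cols = w.shape
    assert cols % group_size == 0, "columns must divide evenly into groups"
    groups = w.reshape(rows, cols // group_size, group_size)
    w_min = groups.min(axis=-1, keepdims=True)
    w_max = groups.max(axis=-1, keepdims=True)
    levels = 2**bits - 1                          # highest integer code
    scale = (w_max - w_min) / levels
    scale = np.where(scale == 0, 1.0, scale)      # guard constant groups
    codes = np.clip(np.round((groups - w_min) / scale), 0, levels)
    return (codes * scale + w_min).reshape(rows, cols)


# Stand-in weight matrix (assumption: Gaussian, not real model weights).
rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=(256, 512)).astype(np.float32)

errors = {}
for bits in (8, 4, 3, 2):
    w_hat = rtn_quantize(w, bits)
    # Relative root-mean-square reconstruction error.
    errors[bits] = float(np.sqrt(np.mean((w - w_hat) ** 2) / np.mean(w**2)))
    print(f"{bits}-bit RTN relative RMS error: {errors[bits]:.4f}")
```

Each bit removed roughly doubles the quantization step, so the reconstruction error grows quickly below 4 bits. Methods such as GPTQ and AWQ add calibration-driven steps on top of this basic scheme; those are not shown here.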

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	≈2% drop vs FP16 on evaluated benchmarks	FP16/16-bit	≈-2%	CommonSenseQA / MMLU (evaluated sets)	Paper reports ~2% average drop across PPL and CommonSenseQA	Tables 1-4
3-bit degradation (method dependent)	3-bit loss ranges <5% to >10% depending on method	4-bit PTQ	varies by method (RTN worse; AWQ/GPTQ/SliM better)	CommonSenseQA	Paper shows RTN >10% drop; AWQ/GPTQ/SliM keep <5% vs 4-bit	Tables 3-4

What To Try In 7 Days

Run 4-bit PTQ (AWQ/GPTQ) on a dev copy and compare PPL and task accuracy

Measure GPU memory and latency; try SmoothQuant if memory is tight

Avoid 2-bit in multimodal pipelines; validate 3-bit on a per-task basis
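The comparison in the first step comes down to two small calculations: perplexity from per-token losses, and the relative accuracy drop against the FP16 baseline. The sketch below shows both. Every number in it is a hypothetical placeholder rather than a result from the paper, and the helper names `perplexity` and `relative_drop` are assumptions for illustration.

```python
import math


def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token, natural log)."""
    return math.exp(sum(token_nlls) / len(token_nlls))


def relative_drop(baseline_acc, quantized_acc):
    """Fraction of baseline accuracy lost after quantization."""
    return (baseline_acc - quantized_acc) / baseline_acc


# Hypothetical per-token losses from the same evaluation text.
fp16_nlls = [1.8, 2.1, 1.9, 2.0]
w4_nlls = [1.9, 2.2, 2.0, 2.1]

print(f"FP16 PPL:  {perplexity(fp16_nlls):.2f}")
print(f"4-bit PPL: {perplexity(w4_nlls):.2f}")

# Hypothetical task accuracies for the same model before and after PTQ.
print(f"Relative accuracy drop: {relative_drop(0.70, 0.686):.1%}")
```

Use the same tokenizer, evaluation text, and context length for both models; otherwise the perplexity values are not comparable.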

Optimization Features

Infra Optimization

Tests run on 8× NVIDIA A800 (80GB) GPUs

Model Optimization

weight-only quantization

grouped mixed-precision

binarization/residual quantization

System Optimization

GPU memory/time trade-offs reported

LoRA

Training Optimization

LoRA

large-batch fine-tuning used in some ultra-low methods

Inference Optimization

block-wise quantization kernels (GPTQ/AWQ/OmniQuant)

activation-weight shifting (SmoothQuant)

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/IST-DASLab/gptq

https://github.com/mit-han-lab/llm-awq

https://github.com/mit-han-lab/smoothquant

https://github.com/Cornell-RelaxML/QuIP

https://github.com/Aaronhuang-778/SliM-LLM

https://github.com/hahnyuan/PB-LLM

https://github.com/Aaronhuang-778/BiLLM

https://github.com/artidoro/qlora

https://github.com/htqin/IR-QLoRA

Risks & Boundaries

Limitations

LoRA-FT used the small Alpaca instruction dataset, so the results may not reflect fine-tuning on larger sets

The evaluation fixes the block/group size (128) and the calibration set (128 WikiText2 examples); results may differ under other settings

When Not To Use

Do not use 2-bit PTQ for LLaMA3 backbones inside MLLMs for visual QA

Do not assume LoRA-FT with small instruction data will fix quantization damage below 4 bits

Failure Modes

Complete performance collapse at ultra-low bits (2-bit) for visual tasks

LoRA fine-tuning can worsen quantized LLaMA3 under some settings
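For readers unfamiliar with how LoRA-FT sits on top of a quantized model, the sketch below shows the general structure in plain NumPy: the quantized base weight stays frozen and only two small low-rank matrices are trained. This is a generic illustration of LoRA, not the QLoRA or IR-QLoRA code. The layer sizes, rank, scaling factor, and the crude per-row 2-bit quantizer `fake_quant` are all assumptions chosen for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, rank, alpha = 64, 128, 8, 16   # hypothetical layer sizes


def fake_quant(w, bits):
    """Stand-in per-row min-max quantizer (assumption, not a paper method)."""
    lo = w.min(axis=1, keepdims=True)
    hi = w.max(axis=1, keepdims=True)
    scale = (hi - lo) / (2**bits - 1)
    return np.round((w - lo) / scale) * scale + lo


w = rng.normal(0.0, 0.02, size=(d_out, d_in))
w_q = fake_quant(w, bits=2)                  # frozen, already degraded

# Trainable LoRA factors: A is small random, B starts at zero.
a = rng.normal(0.0, 0.01, size=(rank, d_in))
b = np.zeros((d_out, rank))


def lora_forward(x, w_q, a, b):
    """Frozen quantized path plus scaled low-rank correction."""
    return x @ w_q.T + (alpha / rank) * (x @ a.T) @ b.T


x = rng.normal(size=(4, d_in))
# With B = 0 the adapter contributes nothing, so fine-tuning starts
# exactly from the quantized model's outputs.
print(np.allclose(lora_forward(x, w_q, a, b), x @ w_q.T))
print("trainable parameters:", a.size + b.size, "of", w.size)
```

Because the adapter starts as a no-op, any recovery has to be learned from the fine-tuning data, which is why the size and quality of that data matter for the failure modes listed above.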

Core Entities

Models

LLaMA3-8B, LLaMA3-70B, LLaVA-Next-8B, LLaMA-7B, LLaMA2-7B

Metrics

Perplexity (PPL), Accuracy, Tokens/sec, GPU memory (GB), Quantization time

Datasets

WikiText2, C4, PTB, PIQA, ARC-e, ARC-c, HellaSwag, Winogrande, MMLU, AI2D, ChartQA, DocVQA, MME, MMBench (English)

Benchmarks

MMLU, CommonSenseQA (PIQA/ARC/HellaSwag/Wino), Multimodal QA (AI2D/ChartQA/DocVQA/MMBench/MME)

