How low-bit quantization changes LLaMA3 and a LLaVA MLLM

April 22, 2024 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.3

Cost Impact Score

0.7

Citation Count

11

Authors

Wei Huang, Xingyu Zheng, Xudong Ma, Haotong Qin, Chengtao Lv, Hong Chen, Jie Luo, Xiaojuan Qi, Xianglong Liu, Michele Magno

Links

Abstract / PDF

Why It Matters For Business

4-bit quantization gives big memory and cost savings with small accuracy loss; ultra-low bits (≤2) are risky for multimodal products and need more work.
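The memory side of this claim follows from simple arithmetic: weight storage scales linearly with bits per weight. The sketch below is a rough estimate, not a figure from the paper. It assumes nominal parameter counts of 8e9 and 70e9 and ignores quantization scales, zero-points, activations, and the KV cache. The helper name `weight_memory_gib` is illustrative.

```python
def weight_memory_gib(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GiB.

    Assumption: counts only the packed weights; per-group scales,
    zero-points, activations and KV cache are ignored.
    """
    return num_params * bits_per_weight / 8 / 2**30


# Nominal parameter counts (assumed round numbers, not exact model sizes).
for name, params in [("LLaMA3-8B", 8e9), ("LLaMA3-70B", 70e9)]:
    for bits in (16, 4, 2):
        gib = weight_memory_gib(params, bits)
        print(f"{name} @ {bits}-bit: {gib:.1f} GiB")
```

Under these assumptions, moving from 16-bit to 4-bit weights cuts weight storage by 4x; real deployments will see somewhat less because of the ignored overheads.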

Summary TLDR

This paper runs a wide empirical evaluation of low-bit quantization on LLaMA3 (8B, 70B) and the LLaVA-Next-8B multimodal LLM (MLLM). It tests 9 post-training quantization (PTQ) methods and 2 LoRA fine-tuning (LoRA-FT) methods at 1–8 bits across language and visual benchmarks. Main takeaways: 4-bit quantization is practical with small loss; 3-bit often degrades more; 2-bit usually collapses on multimodal tasks; LLaMA3-70B is more robust than 8B; LoRA-FT (QLoRA/IR-QLoRA) fails to recover losses below 4 bits on LLaMA3. Code and some quantized models are released.

Problem Statement

Can current low-bit quantization methods compress LLaMA3 and LLaMA3-based multimodal models without unacceptable accuracy loss? The paper measures PTQ and LoRA-FT methods across bits (1–8) and benchmarks to map where quantization is practical and where it fails.

Main Contribution

Comprehensive evaluation of 9 PTQ and 2 LoRA-FT methods on LLaMA3-8B, LLaMA3-70B, and LLaVA-Next-8B.

Head-to-head results across language (PPL, CommonSenseQA, MMLU) and six multimodal QA benchmarks.

Practical measurement of GPU memory, quantization time, and inference latency for selected methods.
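Perplexity (PPL), the language-modeling metric named above, is the exponential of the average negative log-likelihood per token. The sketch below shows only that definition. It assumes you already have natural-log probabilities for each evaluated token from your own model; the function name `perplexity` is illustrative and not taken from the paper's code.

```python
import math


def perplexity(token_log_probs):
    """PPL = exp(-mean(log p(token))), using natural-log probabilities."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# Sanity check: a model that assigns probability 1/4 to every token
# has a perplexity of about 4.
uniform_log_probs = [math.log(0.25)] * 10
print(f"PPL = {perplexity(uniform_log_probs):.3f}")
```

Lower is better, and a quantized model is usually compared against the FP16 model on the same token stream.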

Key Findings

4-bit post-training quantization keeps quality close to full precision on many tasks

Numbers: ≈2% average drop vs. FP16 on evaluated benchmarks

3-bit quantization often causes clear degradation but some methods remain usable

Numbers: 3-bit loss ranges from <5% (AWQ/GPTQ/SliM) to >10% (RTN) on CommonSenseQA

2-bit (and lower) frequently collapses for multimodal QA

Numbers: LLaVA-Next-8B at 2-bit often returns N/zero scores on 6 MLLM tasks

LLaMA3-70B is more robust to low-bit quantization than 8B

Numbers: 70B shows much smaller PPL and accuracy degradation under the same PTQ settings

LoRA fine-tuning on Alpaca fails to recover and can worsen LLaMA3 quantized below 4 bits

Numbers: 4-bit QLoRA LLaMA3-8B MMLU avg 56.7 vs. FP16 64.8; LoRA-FT worsens <4-bit performance

Some PTQ methods are faster and more memory efficient in practice

Numbers: SmoothQuant uses 13.5 GB and 7 min on LLaMA2-7B; AWQ kernel reaches 89.8 tokens/s on LLaMA3-8B

Results

Accuracy
Value: ≈2% drop vs. FP16 on evaluated benchmarks
Baseline: FP16/16-bit

3-bit degradation (method dependent)
Value: 3-bit loss ranges from <5% to >10% depending on method
Baseline: 4-bit PTQ

Accuracy
Value: QLoRA: 56.7, IR-QLoRA: 57.2 (avg)
Baseline: FP16 LLaMA3-8B avg 64.8

MLLM 2-bit collapse
Value: Many 2-bit runs produce N/zero task scores
Baseline: 4-bit MLLM scores (non-zero)

Memory / latency examples
Value: SmoothQuant 13.5 GB and 7 min (LLaMA2-7B); AWQ 89.8 tokens/s (LLaMA3-8B)
Baseline: original FP16 quantization not measured here
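The summary reports drops without always saying whether they are absolute points or relative percentages. As a small worked example, the sketch below computes both for the LoRA-FT averages quoted above (56.7 and 57.2 against an FP16 average of 64.8). The helper names are illustrative.

```python
def absolute_drop(baseline: float, quantized: float) -> float:
    """Drop in score points."""
    return baseline - quantized


def relative_drop_pct(baseline: float, quantized: float) -> float:
    """Drop as a percentage of the baseline score."""
    return (baseline - quantized) / baseline * 100.0


fp16_avg = 64.8  # FP16 LLaMA3-8B average quoted in this summary
for method, score in [("QLoRA", 56.7), ("IR-QLoRA", 57.2)]:
    print(
        f"{method}: -{absolute_drop(fp16_avg, score):.1f} points, "
        f"-{relative_drop_pct(fp16_avg, score):.1f}% relative"
    )
```

For QLoRA this works out to 8.1 points, or 12.5% of the FP16 score, which is why the two conventions should not be mixed when setting an acceptance threshold.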

Who Should Care

What To Try In 7 Days

Run 4-bit PTQ (AWQ/GPTQ) on a dev copy and compare PPL and task accuracy

Measure GPU memory and latency; try SmoothQuant if memory is tight

Avoid 2-bit in multimodal pipelines; validate 3-bit on a per-task basis
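For the latency step, a minimal throughput harness could look like the sketch below. It assumes you can wrap your own model call in a function that takes a prompt and returns the number of tokens it generated; `fake_generate` is a hypothetical stand-in so the example runs without a model, and its timing means nothing.

```python
import time


def tokens_per_second(generate_fn, prompts, warmup=1):
    """Average generated tokens per second over `prompts`.

    `generate_fn(prompt)` must return the number of tokens generated.
    The first `warmup` prompts are run once beforehand and not timed.
    """
    for prompt in prompts[:warmup]:
        generate_fn(prompt)

    total_tokens = 0
    start = time.perf_counter()
    for prompt in prompts:
        total_tokens += generate_fn(prompt)
    elapsed = time.perf_counter() - start
    return total_tokens / elapsed if elapsed > 0 else float("inf")


def fake_generate(prompt):
    """Hypothetical stand-in for a real model call."""
    time.sleep(0.01)  # pretend to do some work
    return 32         # pretend 32 tokens were generated


rate = tokens_per_second(fake_generate, ["a", "b", "c"])
print(f"{rate:.0f} tokens/s")
```

Run the same prompts through the FP16 and quantized models on the same hardware, since the summary notes that latency depends on the kernels each method provides.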

Optimization Features

Infra Optimization

  • tests on 8× NVIDIA A800 (80GB)

Model Optimization

  • weight-only quantization
  • grouped mixed-precision
  • binarization/residual quantization

System Optimization

  • GPU memory/time trade-offs reported
  • LoRA

Training Optimization

  • LoRA
  • large-batch fine-tuning used in some ultra-low methods

Inference Optimization

  • block-wise quantization kernels (GPTQ/AWQ/OmniQuant)
  • activation-weight shifting (SmoothQuant)

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • LoRA-FT used the small Alpaca instruction dataset, so results may not carry over to larger fine-tuning sets
  • The evaluation fixes the block/group size (128) and the calibration set (128 WikiText2 examples); other settings may give different results
  • Some latency numbers depend on method-provided kernels and may vary by deployment

When Not To Use

  • Do not use 2-bit PTQ for LLaMA3 backbones inside MLLMs for visual QA
  • Do not assume LoRA-FT with small instruction data will fix quantization damage below 4 bits

Failure Modes

  • Complete performance collapse at ultra-low bits (2-bit) for visual tasks
  • LoRA fine-tuning can worsen quantized LLaMA3 under some settings
  • Unexpected or repetitive character outputs in 2-bit MLLM runs

Core Entities

Models

  • LLaMA3-8B
  • LLaMA3-70B
  • LLaVA-Next-8B
  • LLaMA-7B
  • LLaMA2-7B

Metrics

  • Perplexity (PPL)
  • Accuracy
  • Tokens/sec
  • GPU memory (GB)
  • Quantization time

Datasets

  • WikiText2
  • C4
  • PTB
  • PIQA
  • ARC-e
  • ARC-c
  • HellaSwag
  • Winogrande
  • MMLU
  • AI2D
  • ChartQA
  • DocVQA
  • MME
  • MMBench (English)

Benchmarks

  • MMLU
  • CommonSenseQA (PIQA/ARC/HellaSwag/Winogrande)
  • Multimodal QA (AI2D/ChartQA/DocVQA/MMBench/MME)