Keep a few sensitive weight columns in high precision, quantize the rest to reach ~3 bits with near-4-bit quality and tiny overhead

June 4, 2023 · 8 min

Overview

Production Readiness

0.8

Novelty Score

0.65

Cost Impact Score

0.82

Citation Count

8

Authors

Changhun Lee, Jungyu Jin, Taesu Kim, Hyungjun Kim, Eunhyeok Park

Why It Matters For Business

OWQ reduces model storage and keeps accuracy with only tiny runtime and storage overheads, enabling deployment of very large LLMs on fewer GPUs and cheaper hardware.

Summary TLDR

This paper introduces OWQ, a weight quantization method that finds a small set of "weak" weight columns sensitive to activation outliers, stores them in fp16, and quantizes the rest aggressively (using OPTQ with tuned truncation). OWQ cuts model size to ~3.01–3.1 effective bits while matching or beating OPTQ 3-bit and approaching OPTQ 4-bit quality on language tasks. It also proposes Weak Column Tuning (WCT): fine-tune only those high-precision weak columns for task adaptation, which uses far fewer trainable parameters than LoRA/QLoRA and preserves the low-precision storage format and custom kernel acceleration.

Problem Statement

Weight quantization reduces LLM memory use, but quality drops sharply at very low bit widths because a few activation outliers make some weight columns highly sensitive to quantization error. The problem is how to quantize to very low bit widths while avoiding large output errors and still supporting cheap task adaptation.

Main Contribution

Introduce OWQ: detect "weak" weight columns with a sensitivity score that combines the Hessian and the weight perturbation, keep them in fp16, and quantize the remaining columns with OPTQ and a tuned truncation search.

Show OWQ 3.01–3.1-bit models match or beat OPTQ 3-bit and approach OPTQ 4-bit quality on perplexity and few-shot tasks with ≈0.3% extra storage and small latency overhead.

Propose Weak Column Tuning (WCT): fine-tune only the retained fp16 weak columns. WCT uses far fewer trainable parameters than LoRA/QLoRA and can outperform QLoRA in human and GPT-4 evaluations.

Provide a custom CUDA kernel and show real-device overhead is small (≈2–3% added latency vs OPTQ 3-bit) and quantization of a 66B model completes in under 3 hours on an A100.

Key Findings

Keeping a small set of sensitive columns in fp16 yields large quality gains over uniform 3-bit quantization.

Numbers: OPT-6.7B, WikiText-2: OPTQ 3-bit PPL 12.88 → OWQ 3.01-bit PPL 11.21

A 3.1-bit OWQ model performs similarly to a 4-bit OPTQ model on the evaluated language benchmarks.

Numbers: OPT-6.7B, WikiText-2: OPTQ 4-bit PPL ≈10.86, OWQ 3.1-bit PPL ≈11.14

WCT fine-tuning updates a tiny fraction of parameters and outperforms QLoRA in pairwise GPT-4 judgments.

Numbers: WCT (r=64) judged better than QLoRA: 81 vs 54 (out of 160 comparisons)

OWQ adds negligible storage and small runtime overhead versus OPTQ.

Numbers: additional storage ≈0.3% (3.01-bit case); kernel latency overhead ≈3.2% on LLaMA-7B
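The storage figures above can be sanity-checked with a little arithmetic. The sketch below assumes the effective bit width is the base bit width plus 16 bits for each fp16 weak column, averaged over the input dimension, and it ignores quantization metadata such as scales and column indices. `effective_bits` and `relative_overhead` are hypothetical helper names, and the 4096-wide layer is an illustrative value, not a figure from the paper.

```python
def effective_bits(base_bits, n_cols, n_weak, weak_bits=16):
    """Average bits per weight when n_weak of n_cols columns are kept in fp16.

    Assumption: the low-precision matrix keeps its full shape (weak columns
    are zero-filled), so the fp16 columns count as pure extra storage.
    """
    return base_bits + weak_bits * n_weak / n_cols


def relative_overhead(base_bits, target_bits):
    """Extra storage of the mixed-precision model relative to the base model."""
    return (target_bits - base_bits) / base_bits


if __name__ == "__main__":
    # Illustrative layer: 4 weak columns out of 4096 input channels.
    print(effective_bits(3, 4096, 4))                    # 3.015625
    # Moving from 3 to 3.01 bits, as a percentage of the 3-bit size.
    print(round(100 * relative_overhead(3, 3.01), 2))    # 0.33
```

Under these assumptions, going from 3 to 3.01 bits costs about a third of a percent of extra storage, which lines up with the ≈0.3% figure reported for the 3.01-bit case.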

Results

WikiText-2 perplexity (OPT-6.7B)

Value: OWQ 3.01-bit PPL 11.21 ± .05

Baseline: OPTQ 3-bit PPL 12.88

Relative quality vs 4-bit (OPT-6.7B)

Value: OWQ 3.1-bit PPL 11.14

Baseline: OPTQ 4-bit PPL 10.86

Few-shot avg (OPT family)

Value: OWQ 3.1-bit average score comparable to OPTQ 4-bit across model sizes

Baseline: OPTQ 3-bit, much lower

GPT-4 pairwise preference (WCT vs QLoRA)

Value: WCT (r=64) wins 81 / 160

Baseline: QLoRA wins 54 / 160

Kernel latency overhead (OWQ vs OPTQ 3-bit)

Value: ≈3.21% added latency (LLaMA-7B)

Baseline: OPTQ 3-bit

Quantization time

Value: 66B model quantized in <3 hours on A100 80GB

Who Should Care

What To Try In 7 Days

Run OWQ on a dev copy of your model and compare perplexity to your current 3/4-bit baseline on a small validation set.

Implement weak-column extraction for your transformer key/query layers and measure storage overhead.

Try WCT on a 7B model for one downstream task to test a low-cost fine-tuning path versus LoRA/QLoRA.
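For the perplexity comparison in the first step, a minimal sketch of the metric may help. It assumes you already have per-token natural-log probabilities from each model on the same validation tokens; `perplexity` and `relative_change` are hypothetical helper names, not part of any OWQ release.

```python
import math


def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood (natural log) per token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def relative_change(baseline_ppl, candidate_ppl):
    """Signed relative change; negative means the candidate is better."""
    return (candidate_ppl - baseline_ppl) / baseline_ppl


if __name__ == "__main__":
    # Toy input: every token predicted with probability 0.25.
    print(perplexity([math.log(0.25)] * 4))
    # OPT-6.7B WikiText-2 numbers quoted in this summary.
    print(relative_change(12.88, 11.21))
```

With the OPT-6.7B numbers quoted here (OPTQ 3-bit at 12.88, OWQ 3.01-bit at 11.21), the relative change is roughly a 13% reduction in perplexity.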

Optimization Features

Infra Optimization

  • quantized weights cut storage to ~3 bits; 175B model can fit on single A100 (~63 GB reported for sim

Model Optimization

  • mixed-precision weight quantization (keep weak columns fp16)
  • weak column selection via Hessian×weight-perturbation sensitivity
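The selection rule above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the Hessian is approximated by its diagonal, computed as twice the sum of squared calibration activations per input channel, and the weight perturbation is the error of a plain min-max round-to-nearest quantizer applied per row. All function names are hypothetical.

```python
def quantize_row(row, bits):
    """Uniform min-max round-to-nearest quantization of one weight row."""
    lo, hi = min(row), max(row)
    levels = 2 ** bits - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    return [round((w - lo) / scale) * scale + lo for w in row]


def hessian_diagonal(calib):
    """Assumed proxy: diagonal of 2 * X^T X over calibration inputs."""
    n_cols = len(calib[0])
    return [2.0 * sum(x[j] ** 2 for x in calib) for j in range(n_cols)]


def column_sensitivity(weight, calib, bits=3):
    """Per-column score: Hessian diagonal times squared quantization error."""
    h = hessian_diagonal(calib)
    quant = [quantize_row(row, bits) for row in weight]
    scores = []
    for j in range(len(h)):
        err = sum((w_row[j] - q_row[j]) ** 2
                  for w_row, q_row in zip(weight, quant))
        scores.append(h[j] * err)
    return scores


def select_weak_columns(weight, calib, k, bits=3):
    """Indices of the k most sensitive columns, in ascending order."""
    scores = column_sensitivity(weight, calib, bits)
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


if __name__ == "__main__":
    weight = [[0.1, -0.3, 0.27, 0.9],
              [-0.5, 0.2, 0.11, 0.4]]
    # Input channel 2 carries activation outliers in the calibration data.
    calib = [[1.0, 1.0, 50.0, 1.0],
             [1.0, -1.0, 40.0, 1.0]]
    print(select_weak_columns(weight, calib, k=1))
```

In this toy case the column fed by the outlier channel gets the highest score, even though its raw quantization error is no larger than that of its neighbours.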

System Optimization

  • reduces model parameter memory enabling single-GPU deployment for very large models

Training Optimization

  • Weak Column Tuning (WCT): fine-tune only high-precision weak columns
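To make the idea concrete, here is a minimal sketch of an update that touches only the fp16 weak columns while everything else stays frozen. It uses plain SGD on nested lists for clarity; the optimizer, data layout, and the names `wct_sgd_step` and `trainable_fraction` are assumptions for illustration, not details from the paper.

```python
def wct_sgd_step(weak_block, full_grad, weak_idx, lr=0.01):
    """One plain SGD step applied only to the fp16 weak columns.

    weak_block: rows x k values of the weak columns
    full_grad:  rows x n_cols gradient w.r.t. the full weight matrix
    weak_idx:   the k column indices held in fp16
    """
    return [
        [w - lr * g_row[j] for w, j in zip(w_row, weak_idx)]
        for w_row, g_row in zip(weak_block, full_grad)
    ]


def trainable_fraction(n_cols, n_weak):
    """Share of a layer's weights that receive gradient updates."""
    return n_weak / n_cols


if __name__ == "__main__":
    weak_block = [[1.0], [2.0]]            # column 1 is the only weak column
    full_grad = [[9.0, 2.0, 9.0],
                 [9.0, -2.0, 9.0]]         # gradients of other columns are ignored
    print(wct_sgd_step(weak_block, full_grad, weak_idx=[1], lr=0.5))
    print(trainable_fraction(4096, 4))
```

Because the quantized columns never change, the stored low-precision matrix stays valid after tuning, and only the small fp16 block needs to be saved per task.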

Inference Optimization

  • store low-precision matrix with zero-filled weak columns and fp16 weak columns
  • custom CUDA kernel that decompresses and handles weak columns on-the-fly
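The storage layout described above can be illustrated with a small reference computation. This sketch keeps the dense part as ordinary floats rather than packed low-bit integers, so it shows only how the zero-filled matrix and the fp16 weak columns combine, not how the CUDA kernel decompresses weights. The function names are hypothetical.

```python
def split_weak_columns(weight, weak_idx):
    """Return (dense matrix with weak columns zeroed, rows x k weak block)."""
    weak = set(weak_idx)
    dense = [[0.0 if j in weak else w for j, w in enumerate(row)]
             for row in weight]
    weak_block = [[row[j] for j in weak_idx] for row in weight]
    return dense, weak_block


def mixed_matvec(dense, weak_block, weak_idx, x):
    """y = dense @ x + weak_block @ x[weak_idx], computed row by row."""
    out = []
    for d_row, w_row in zip(dense, weak_block):
        acc = sum(d * xi for d, xi in zip(d_row, x))
        # Zero-filled columns contribute nothing above, so the weak
        # columns can simply be added on top.
        acc += sum(w * x[j] for w, j in zip(w_row, weak_idx))
        out.append(acc)
    return out


if __name__ == "__main__":
    weight = [[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]]
    x = [1.0, 10.0, 100.0]
    dense, weak_block = split_weak_columns(weight, weak_idx=[1])
    print(dense)                                   # [[1.0, 0.0, 3.0], [4.0, 0.0, 6.0]]
    print(mixed_matvec(dense, weak_block, [1], x)) # [321.0, 654.0]
```

Zero-filling keeps the low-precision matrix at its original shape, so the two partial products can be summed without any index remapping in the dense path.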

Reproducibility

Data Urls

  • C4 dataset (used for calibration) — public dataset

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • OWQ still needs a calibration dataset and tuned hyperparameters; results depend on good calibration.
  • Weak columns are chosen per-layer uniformly by budget; layer-wise budget tuning is expensive and not solved here.
  • The pairwise preference evaluation uses GPT-4 as the judge, which can show ordering bias and other judge artifacts; the authors try to correct for ordering bias, but evaluator limitations remain.

When Not To Use

  • If you cannot run a small calibration set or lack access to a GPU for quantization.
  • When you need strictly deterministic, bit-exact behavior across deployments without custom kernels.
  • If your application cannot tolerate any added complexity in the model storage format or tiny latency overhead.

Failure Modes

  • If weak columns are misidentified, truncating columns with large values (key/query layers) can cause large accuracy drops.
  • Group-wise or per-layer sensitivity differences may make a uniform weak-column budget suboptimal and hurt some layers.
  • GPT-4-based preference evaluation may be biased by answer ordering and may not fully reflect end-user quality.

Core Entities

Models

  • OPT
  • LLaMA
  • OPTQ
  • GPTQ
  • LoRA

Metrics

  • Perplexity (↓ lower is better)
  • Few-shot average score (%) (↑ higher is better)
  • Kernel latency overhead (%)
  • Memory overhead (%)

Datasets

  • WikiText-2
  • Penn Treebank (PTB)
  • C4
  • ARC-challenge
  • HellaSwag
  • MMLU
  • Vicuna Benchmark
  • OpenAssistant

Benchmarks

  • Perplexity (WikiText-2, PTB, C4)
  • Few-shot avg (ARC-challenge, HellaSwag, MMLU)
  • GPT-4 pairwise preference (Vicuna questions)