Keep a few sensitive weight columns in high precision, quantize the rest to reach ~3 bits with near-4-bit quality and tiny overhead

June 4, 2023 · 8 min

Overview

Decision Snapshot: Ready For Pilot

OWQ shows reproducible gains on standard perplexity and few-shot benchmarks, includes a kernel for real GPUs, and publishes code; results are robust across OPT/LLaMA sizes but rely on calibration data and specific implementation details.

Citations: 8

Evidence Strength: 0.78

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/6

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 82%

Production readiness: 80%

Novelty: 65%

Authors

Changhun Lee, Jungyu Jin, Taesu Kim, Hyungjun Kim, Eunhyeok Park

Links

Abstract / PDF / Code / Data

Why It Matters For Business

OWQ reduces model storage and keeps accuracy with only tiny runtime and storage overheads, enabling deployment of very large LLMs on fewer GPUs and cheaper hardware.

Summary TLDR

This paper introduces OWQ, a weight quantization method that finds a small set of "weak" weight columns sensitive to activation outliers, stores them in fp16, and quantizes the rest aggressively (using OPTQ with tuned truncation). OWQ cuts model size to ~3.01–3.1 effective bits while matching or beating OPTQ 3-bit and approaching OPTQ 4-bit quality on language tasks. It also proposes Weak Column Tuning (WCT): fine-tune only those high-precision weak columns for task adaptation, which uses far fewer trainable parameters than LoRA/QLoRA and preserves the low-precision storage format and custom kernel acceleration.

Problem Statement

Weight quantization reduces LLM memory, but quality drops sharply at very low bit widths because a few activation outliers make some weight columns highly sensitive to quantization error. The problem is how to quantize to very low bit widths while avoiding large output errors and still supporting cheap task adaptation.

Main Contribution

Introduce OWQ: detect "weak" weight columns (Hessian + weight perturbation), keep them in fp16, quantize remaining columns with OPTQ and a tuned truncation search.

Show OWQ 3.01–3.1-bit models match or beat OPTQ 3-bit and approach OPTQ 4-bit quality on perplexity and few-shot tasks with ≈0.3% extra storage and small latency overhead.
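
The weak-column detection described above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: it assumes the sensitivity of column j is the j-th diagonal entry of a Hessian proxy (here the diagonal of X^T X over calibration activations X) multiplied by the squared quantization error of that column. The function names, the toy data, and the crude 0.5-step rounding grid are all hypothetical.

```python
def hessian_diagonal(activations):
    """Diagonal of X^T X for calibration activations X (rows are samples)."""
    n_cols = len(activations[0])
    return [sum(row[j] ** 2 for row in activations) for j in range(n_cols)]


def column_sensitivity(hessian_diag, weight, quantized_weight):
    """Score column j as hessian_diag[j] * sum_i (W[i][j] - Q[i][j]) ** 2."""
    scores = []
    for j in range(len(hessian_diag)):
        perturbation = sum((w_row[j] - q_row[j]) ** 2
                           for w_row, q_row in zip(weight, quantized_weight))
        scores.append(hessian_diag[j] * perturbation)
    return scores


def select_weak_columns(scores, budget):
    """Return the indices of the `budget` highest-scoring columns, in ascending order."""
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:budget])


# Toy calibration data: input channel 2 carries an activation outlier.
X = [[0.1, -0.2, 8.0, 0.3],
     [0.2, 0.1, -7.5, -0.1]]
W = [[0.50, -0.30, 0.40, 0.10],
     [-0.20, 0.60, -0.35, 0.25]]
# Stand-in for a real quantizer (assumption): round every weight to a 0.5-step grid.
W_rounded = [[round(w * 2) / 2 for w in row] for row in W]

scores = column_sensitivity(hessian_diagonal(X), W, W_rounded)
weak = select_weak_columns(scores, budget=1)
print(weak)  # column indices to keep in fp16
```

Column 2 has a modest weight error, but the outlier activations inflate its Hessian term, so it dominates the ranking. That is the intuition behind keeping such columns in fp16.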

Key Findings

Keeping a small set of sensitive columns in fp16 yields large quality gains over uniform 3-bit quantization.

Numbers: OPT-6.7B WikiText-2: OPTQ 3-bit PPL 12.88 → OWQ 3.01 PPL 11.21

Practical Use: If you need extreme compression, store just a few columns at higher precision to get most of the accuracy back without large storage cost.

Evidence Ref: Table 1

A 3.1-bit OWQ model performs similarly to a 4-bit OPTQ model on the evaluated language benchmarks.

Numbers: OPT-6.7B WikiText-2: OPTQ 4-bit PPL ≈10.86, OWQ 3.1 PPL ≈11.14

Practical Use: You can gain the storage savings of ~3-bit quantization while keeping near-4-bit accuracy by using OWQ.

Evidence Ref: Table 1
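
The fractional bit widths quoted above (3.01, 3.1) can be related to the number of fp16 columns with a little arithmetic. The sketch below is an approximation under stated assumptions: every weight occupies `base_bits` in the low-precision matrix (weak columns are zero-filled there), each weak column is additionally stored at 16 bits, and scale, zero-point, and index metadata are ignored. The column count of 4096 and the weak-column counts are hypothetical values, not figures from the paper.

```python
def effective_bits(base_bits, n_cols, n_weak, high_bits=16):
    """Average bits per weight when n_weak of n_cols columns are also kept at high_bits."""
    return base_bits + high_bits * n_weak / n_cols


def storage_overhead(base_bits, eff_bits):
    """Relative extra storage compared with a uniform base_bits model."""
    return (eff_bits - base_bits) / base_bits


n_cols = 4096  # hypothetical input dimension of one linear layer
for n_weak in (0, 3, 26):
    bits = effective_bits(3, n_cols, n_weak)
    print(n_weak, round(bits, 4), f"{storage_overhead(3, bits):.2%}")
```

Under these assumptions a handful of fp16 columns per layer is enough to move the average from 3 bits to roughly 3.01, which is consistent in scale with the ≈0.3% extra storage reported above.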

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| WikiText-2 perplexity (OPT-6.7B) | OWQ 3.01 PPL 11.21 ± 0.05 | OPTQ 3-bit PPL 12.88 | -1.67 | WikiText-2 | OWQ improves 3-bit OPTQ perplexity on OPT-6.7B | Table 1 |
| Relative quality vs 4-bit (OPT-6.7B) | OWQ 3.1 PPL 11.14 | OPTQ 4-bit PPL 10.86 | +0.28 | WikiText-2 | 3.1-bit OWQ approaches 4-bit OPTQ quality | Table 1 |
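
For readers reproducing these rows, the sketch below shows the standard definition of perplexity (the exponential of the mean negative log-likelihood per token) and how the Delta column follows from value minus baseline. The helper names are hypothetical, and the token log-probabilities in the usage lines are made-up illustration values, not model outputs.

```python
import math


def perplexity(token_log_probs):
    """exp(-mean(log p)) over natural-log token probabilities."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def delta(value, baseline):
    """Value minus baseline, rounded to two decimals as in the table."""
    return round(value - baseline, 2)


# Deltas recomputed from the perplexities listed in the table.
print(delta(11.21, 12.88))  # OWQ 3.01 vs OPTQ 3-bit
print(delta(11.14, 10.86))  # OWQ 3.1 vs OPTQ 4-bit

# Toy check: a model that assigns probability 0.5 to every token has perplexity 2.
print(perplexity([math.log(0.5)] * 4))
```

Lower perplexity is better, so a negative delta against the 3-bit baseline is an improvement and a small positive delta against the 4-bit baseline is the remaining gap.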

What To Try In 7 Days

Run OWQ on a dev copy of your model and compare perplexity to your current 3/4-bit baseline on a small validation set.

Implement weak-column extraction for your transformer key/query layers and measure storage overhead.

Try WCT on a 7B model for one downstream task to test a low-cost fine-tuning path versus LoRA/QLoRA.
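
To make the WCT idea in the last step concrete, here is a minimal sketch of one gradient step that touches only the high-precision weak columns of a single linear layer. It is an illustration under assumptions, not the paper's training code: the output of the frozen quantized part is treated as a precomputed constant (`frozen_out`), the loss is a toy squared error, and all names and numbers are hypothetical.

```python
def weak_output(weak_values, weak_idx, x):
    """Contribution of the weak columns: weak_values @ x[weak_idx]."""
    return [sum(w * x[j] for w, j in zip(row, weak_idx)) for row in weak_values]


def loss_and_grad(frozen_out, weak_values, weak_idx, x, target):
    """Squared-error loss 0.5 * ||y - target||^2 and its gradient with respect to y."""
    y = [f + w for f, w in zip(frozen_out, weak_output(weak_values, weak_idx, x))]
    grad_y = [yi - ti for yi, ti in zip(y, target)]
    return 0.5 * sum(g * g for g in grad_y), grad_y


def wct_sgd_step(weak_values, weak_idx, x, grad_y, lr):
    """In-place SGD on weak-column weights only; dL/dW[i][j] = grad_y[i] * x[j]."""
    for i, row in enumerate(weak_values):
        for k, j in enumerate(weak_idx):
            row[k] -= lr * grad_y[i] * x[j]


x = [1.0, 2.0, -1.0]
frozen_out = [1.0, -0.5]       # quantized part's output for x (hypothetical, never updated)
weak_idx = [1]                 # input channel kept in high precision
weak_values = [[0.2], [-0.1]]  # the only trainable parameters
target = [2.0, 0.0]

loss_before, grad_y = loss_and_grad(frozen_out, weak_values, weak_idx, x, target)
wct_sgd_step(weak_values, weak_idx, x, grad_y, lr=0.1)
loss_after, _ = loss_and_grad(frozen_out, weak_values, weak_idx, x, target)
print(loss_before, loss_after)
```

Because the quantized matrix is never modified, the stored low-precision format stays the same after tuning. Only the small fp16 block changes.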

Optimization Features

Infra Optimization

quantized weights cut storage to ~3 bits; 175B model can fit on single A100 (~63 GB reported for sim

Model Optimization

mixed-precision weight quantization (keep weak columns fp16)

weak column selection via Hessian×weight-perturbation sensitivity

System Optimization

reduces model parameter memory enabling single-GPU deployment for very large models

Training Optimization

Weak Column Tuning (WCT): fine-tune only high-precision weak columns

Inference Optimization

store low-precision matrix with zero-filled weak columns and fp16 weak columns

custom CUDA kernel that decompresses and handles weak columns on-the-fly
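
The storage layout named above (a low-precision matrix with zero-filled weak columns plus a separate fp16 block) can be illustrated with a CPU reference in plain Python. This is only a sketch of the data layout, not the custom CUDA kernel: the dense part is left unquantized here so that the split result can be compared exactly with an ordinary matrix-vector product, and every name is hypothetical.

```python
def split_weak_columns(weight, weak_idx):
    """Return (dense matrix with weak columns zeroed, separate block of weak columns)."""
    weak = set(weak_idx)
    dense = [[0.0 if j in weak else w for j, w in enumerate(row)] for row in weight]
    weak_values = [[row[j] for j in weak_idx] for row in weight]
    return dense, weak_values


def matvec(matrix, x):
    """Plain matrix-vector product."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in matrix]


def matvec_split(dense, weak_values, weak_idx, x):
    """dense @ x plus weak_values @ x[weak_idx]; the zero-filled columns add nothing."""
    y_dense = matvec(dense, x)
    y_weak = matvec(weak_values, [x[j] for j in weak_idx])
    return [a + b for a, b in zip(y_dense, y_weak)]


W = [[0.5, 3.0, -0.5],
     [0.25, -2.0, 0.5]]
weak_idx = [1]  # the large-magnitude column is kept separately
dense, weak_values = split_weak_columns(W, weak_idx)
x = [1.0, 2.0, -1.0]
print(dense)
print(matvec_split(dense, weak_values, weak_idx, x), matvec(W, x))
```

In a real deployment the dense part would be quantized and dequantized on the fly, while the weak block is applied at full precision to the matching input channels.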

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Data URLs

C4 dataset (used for calibration) — public dataset

Risks & Boundaries

Limitations

OWQ still needs a calibration dataset and tuned hyperparameters; results depend on good calibration.

Weak columns are allocated with the same budget in every layer; tuning the budget per layer is expensive and is not solved here.

When Not To Use

If you cannot run a small calibration set or lack access to a GPU for quantization.

When you need strictly deterministic, bit-exact behavior across deployments without custom kernels.

Failure Modes

If weak columns are misidentified, truncating large-value columns (key/query projections) can cause large accuracy drops.

Group-wise or per-layer sensitivity differences may make a uniform weak-column budget suboptimal and hurt some layers.
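
The first failure mode can be illustrated with a toy truncation search. The sketch below assumes a simple scheme that is not taken from the paper's code: a uniform grid over the min-max range, shrunk by a candidate ratio, with the ratio chosen to minimise squared error. It also assumes the values are not all identical, so the range is non-empty. The weights are made up so that one large value plays the role of a column that should have been kept in fp16.

```python
def quantize_uniform(values, bits, lo, hi):
    """Clip to [lo, hi] and round onto a uniform grid of 2**bits levels."""
    scale = (hi - lo) / (2 ** bits - 1)
    out = []
    for v in values:
        clipped = min(max(v, lo), hi)
        out.append(lo + round((clipped - lo) / scale) * scale)
    return out


def squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def search_truncation(values, bits, ratios=(1.0, 0.9, 0.8, 0.7, 0.6, 0.5)):
    """Shrink the min-max range by each ratio; return (best_ratio, best_error)."""
    lo, hi = min(values), max(values)
    best = None
    for r in ratios:
        err = squared_error(values, quantize_uniform(values, bits, lo * r, hi * r))
        if best is None or err < best[1]:
            best = (r, err)
    return best


small = [0.1, -0.2, 0.15, -0.05, 0.3, -0.25]
with_outlier = small + [4.0]  # the large value stays in the low-bit group

ratio_bad, err_bad = search_truncation(with_outlier, bits=3)
ratio_ok, err_ok = search_truncation(small, bits=3)  # large value moved to fp16
print(ratio_bad, err_bad)
print(ratio_ok, err_ok)
```

With the large value left in the low-bit group, the search must either keep a wide range (coarse steps for every small weight) or clip the large value. Removing it lets the same 3-bit grid cover the remaining weights much more finely.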

Core Entities

Models

OPT, LLaMA, OPTQ, GPTQ, LoRA

Metrics

Perplexity (↓ lower is better); Few-shot average score (%) (↑ higher is better); Kernel latency overhead (%); Memory overhead (%)

Datasets

WikiText-2, Penn Treebank (PTB), C4, ARC-challenge, HellaSwag, MMLU, Vicuna Benchmark, OpenAssistant

Benchmarks

Perplexity (WikiText-2, PTB, C4); Few-shot avg (ARC-challenge, HellaSwag, MMLU); GPT-4 pairwise preference (Vicuna questions)