You can upload a harmless LLM but its quantized copy can be silently malicious

Overview

Decision Snapshot: Needs Validation

The attack is practical on 1–7B models and common quantizers; mitigations like quantized testing and simple noise reduce risk, but large-scale adoption and broader defenses need more study.

Citations: 2

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 60%

Production readiness: 40%

Novelty: 80%

Authors

Kazuki Egashira, Mark Vero, Robin Staab, Jingxuan He, Martin Vechev

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Models that look safe in FP32 can behave maliciously after common local quantization; companies must test quantized artifacts before shipping or allowing community uploads.

Who Should Care

CTO, Product Manager, ML Engineer, Engineering Lead, Founder

Summary TLDR

The paper shows a practical attack that makes a model look benign in full precision but behave maliciously after common zero-shot quantization (LLM.int8(), NF4, FP4). The authors craft the attack by (1) fine-tuning a malicious model, (2) computing intervals of full-precision weights that map to the same quantized weights, and (3) using projected gradient descent (PGD) to remove malicious behavior in full precision while preserving those intervals. Experiments on StarCoder, Phi-2, Gemma and aligned models show large behavioral differences when quantized (e.g., secure-code rate drops from 82.6% to 2.8%). A simple Gaussian-noise pre-step can substantially mitigate the attack in some cases.
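Step (2) of the attack, finding the full-precision values that map to the same quantized weight, can be illustrated on a toy quantizer. The sketch below is not the authors' implementation and does not reproduce LLM.int8(), NF4, or FP4. It assumes a simple symmetric absmax round-to-nearest scheme over one block of weights, and the names `absmax_scale`, `quantize`, `same_code_intervals`, and the `margin` parameter are hypothetical.

```python
from typing import List, Tuple

LEVELS = 127  # assumed symmetric int8 code range [-127, 127]


def absmax_scale(block: List[float]) -> float:
    """Scale of a toy absmax quantizer: the largest magnitude maps to LEVELS."""
    return max(abs(w) for w in block) / LEVELS


def quantize(block: List[float], scale: float) -> List[int]:
    """Round-to-nearest integer codes, clipped to the symmetric range."""
    return [max(-LEVELS, min(LEVELS, round(w / scale))) for w in block]


def same_code_intervals(
    block: List[float], margin: float = 0.05
) -> List[Tuple[float, float]]:
    """For each weight, return a (lo, hi) box of values with the same code.

    The largest-magnitude weight is pinned (lo == hi) so the block scale,
    and with it every other code, cannot change.
    """
    scale = absmax_scale(block)
    absmax = scale * LEVELS
    codes = quantize(block, scale)
    pinned = max(range(len(block)), key=lambda i: abs(block[i]))
    half = (0.5 - margin) * scale  # shrink the box to stay clear of rounding ties
    boxes = []
    for i, (w, q) in enumerate(zip(block, codes)):
        if i == pinned:
            boxes.append((w, w))
        else:
            lo = max(q * scale - half, -absmax)  # never exceed the pinned magnitude
            hi = min(q * scale + half, absmax)
            boxes.append((lo, hi))
    return boxes
```

Under these assumptions, any weight vector chosen inside the boxes quantizes to the same codes as the original block, which is the property the attack relies on when it later edits the full-precision weights.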

Problem Statement

Users often evaluate only full-precision LLMs but deploy locally-quantized models (LLM.int8(), NF4, FP4). The paper asks: can an adversary create a full-precision model that appears safe but becomes malicious when a user applies zero-shot quantization? This threat would let attackers distribute stealthy poisoned models via community hubs.

Main Contribution

Define and implement a three-stage attack that makes a model benign in FP32 but malicious after zero-shot quantization.

Large-scale experiments across three threat scenarios (vulnerable code generation, over-refusal, content injection) on popular open models.

Key Findings

An attacked model can be benign in full precision yet produce nearly entirely malicious outputs after zero-shot quantization.

Numbers: StarCoder-3b: FP32 secure code 82.6% → LLM.int8() secure code 2.8% (drop ≈ 79.8 percentage points)

Practical Use: Do not assume full-precision safety checks hold after quantization; always test the quantized model before deployment.

Evidence Ref: Table 1

Content-injection can be triggered only by the quantized model at very high rates.

Numbers: Gemma-2b quantized: keyword occurrence up to 74.7%

Practical Use: If your pipeline relies on filtering or content policies, verify those checks on quantized weights too.

Evidence Ref: Table 3
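The practical advice in these findings amounts to comparing the same metric on the full-precision and quantized copies. A minimal sketch of such a check follows; the function names and the 5-point default threshold are assumptions made for illustration, not values from the paper.

```python
def quantization_gap(fp_score: float, quant_score: float) -> float:
    """Drop, in percentage points, from the full-precision score to the quantized score."""
    return fp_score - quant_score


def passes_gate(fp_score: float, quant_score: float, max_drop_pp: float = 5.0) -> bool:
    """True if the quantized model loses at most max_drop_pp points (assumed threshold)."""
    return quantization_gap(fp_score, quant_score) <= max_drop_pp


# StarCoder-3b secure-code rates reported above: 82.6% (FP32) vs 2.8% (LLM.int8()).
gap = quantization_gap(82.6, 2.8)
print(f"gap = {gap:.1f} percentage points, gate passed: {passes_gate(82.6, 2.8)}")
```

For metrics where higher means worse, such as keyword occurrence, the sign of the comparison would need to be flipped.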

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Code Security (StarCoder-3b)	FP32 attacked: 82.6% secure; LLM.int8(): 2.8% secure	Original FP32 ~70.5% secure	≈ -79.8 percentage points vs attacked FP32	He et al. vulnerability test set (4 CWEs)	Table 1; vulnerable code generation scenario	Table 1
Keyword Occurrence (content injection)	Gemma-2b quantized FP4: 74.7% of responses include 'McDonald's'	Original FP32 ~0–0.13%	≈ +74.6 percentage points	databricks-dolly-15k (1.5k instructions)	Table 3; content injection scenario	Table 3

What To Try In 7 Days

Quantize any third-party model you plan to deploy and run your security/unit tests on the quantized copy.

Compute simple weight-magnitude statistics to flag models with long-tailed distributions as higher risk.

Try small Gaussian noise (σ≈1e-3) before quantizing on a dev model and validate end-to-end behavior on key tasks.
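The last two checks can be prototyped without any ML framework. The sketch below works on a flat list of weights and uses only the Python standard library. Excess kurtosis is one possible choice of long-tail statistic and is an assumption here, as are the function names and the fixed seed; only the noise level σ ≈ 1e-3 comes from the text above.

```python
import random
from typing import List


def add_gaussian_noise(
    weights: List[float], sigma: float = 1e-3, seed: int = 0
) -> List[float]:
    """Return a copy of the weights with i.i.d. N(0, sigma^2) noise added."""
    rng = random.Random(seed)  # fixed seed so a dev run is repeatable
    return [w + rng.gauss(0.0, sigma) for w in weights]


def excess_kurtosis(weights: List[float]) -> float:
    """Population excess kurtosis; larger values indicate heavier tails."""
    n = len(weights)
    mean = sum(weights) / n
    m2 = sum((w - mean) ** 2 for w in weights) / n
    m4 = sum((w - mean) ** 4 for w in weights) / n
    return m4 / (m2 * m2) - 3.0


weights = [0.02, -0.01, 0.03, -0.02, 0.9]  # made-up values with one outlier
noisy = add_gaussian_noise(weights)
print("excess kurtosis:", excess_kurtosis(weights))
print("max perturbation:", max(abs(a - b) for a, b in zip(weights, noisy)))
```

In a real pipeline the noisy weights would then be quantized and the result evaluated end to end, since the text above notes that larger noise levels degrade benchmark utility.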

Optimization Features

Model Optimization

zero-shot weight quantization (LLM.int8(), NF4, FP4)

mixed-precision decomposition (LLM.int8()) discussed

Training Optimization

constrained PGD repair that projects weights into intervals preserving quantized mapping

instruction tuning variants (SafeCoder / reverse SafeCoder) for task injection
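The constrained repair step can be sketched as gradient descent followed by clamping each weight back into its interval. This is a toy illustration in plain Python, not the authors' training code: the quadratic loss, the learning rate, the step count, and the names `project` and `pgd` are all assumptions.

```python
from typing import Callable, List, Tuple

Box = Tuple[float, float]


def project(weights: List[float], boxes: List[Box]) -> List[float]:
    """Clamp each weight into its own [lo, hi] interval."""
    return [min(max(w, lo), hi) for w, (lo, hi) in zip(weights, boxes)]


def pgd(
    weights: List[float],
    boxes: List[Box],
    grad_fn: Callable[[List[float]], List[float]],
    lr: float = 0.1,
    steps: int = 100,
) -> List[float]:
    """Projected gradient descent: take a gradient step, then project into the boxes."""
    w = project(weights, boxes)
    for _ in range(steps):
        g = grad_fn(w)
        w = project([wi - lr * gi for wi, gi in zip(w, g)], boxes)
    return w


# Toy "benign" objective: pull the weights toward a target vector.
target = [0.0, 0.0]


def grad(w: List[float]) -> List[float]:
    """Gradient of sum((w - target)^2)."""
    return [2.0 * (wi - ti) for wi, ti in zip(w, target)]


boxes = [(0.2, 0.3), (-0.1, 0.1)]
repaired = pgd([0.25, 0.05], boxes, grad)
print(repaired)  # first weight stops at its lower bound, second moves close to 0
```

The first weight cannot reach the target because its interval excludes it, which mirrors the failure mode where narrow intervals leave little room for repair.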

Inference Optimization

preservation of scaling parameters to keep quantized mapping identical

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Code URLs

https://github.com/eth-sri/llm-quantization-attack

Data URLs

Code-Alpaca (public)

He et al. vulnerability dataset (public)

Poisoned GPT-4-LLM variants (as used by Shu et al.)

databricks-dolly-15k (public)

Risks & Boundaries

Limitations

Does not target optimization-based quantizers or activation/KV-cache quantization methods.

Experiments limited to models up to 7B; results may differ for 70B+ models.

When Not To Use

If you only deploy vendor-provided quantized models (attacker cannot force user re-quantization).

When optimization-based quantization workflows are enforced upstream and cannot be changed locally.

Failure Modes

If quantization intervals are too narrow, PGD repair may not find a benign full-precision model.

Noise defense at high levels breaks utility (benchmarks degrade at σ=1e-2).

Core Entities

Models

StarCoder-1b, StarCoder-3b, StarCoder-7b, Phi-2, Gemma-2b, Phi-3-mini-4k-instruct

Metrics

Code Security (percentage of non-vulnerable completions), Informative Refusal (%), Keyword Occurrence (%), HumanEval pass@1, Accuracy, TruthfulQA score

Datasets

Code-Alpaca, He et al. vulnerable-code subset, GPT-4-LLM (poisoned versions from Shu et al.), databricks-dolly-15k (evaluation subset), HumanEval, MBPP

Benchmarks

MMLU, TruthfulQA, HumanEval, MBPP, CodeQL static analysis

