You can upload a harmless LLM but its quantized copy can be silently malicious

May 28, 2024 · 8 min

Overview

Decision Snapshot: Needs Validation

The attack is practical on 1–7B models and common zero-shot quantizers. Mitigations such as testing the quantized model and adding small Gaussian noise before quantization reduce risk, but applicability at larger scale and broader defenses need more study.

Citations: 2

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 60%

Production readiness: 40%

Novelty: 80%

Authors

Kazuki Egashira, Mark Vero, Robin Staab, Jingxuan He, Martin Vechev

Why It Matters For Business

Models that look safe in FP32 can behave maliciously after common local quantization; companies must test quantized artifacts before shipping or allowing community uploads.

Summary TLDR

The paper shows a practical attack that makes a model look benign in full precision but behave maliciously after common zero-shot quantization (LLM.int8(), NF4, FP4). The authors craft the attack by (1) fine-tuning a malicious model, (2) computing intervals of full-precision weights that map to the same quantized weights, and (3) using projected gradient descent (PGD) to remove malicious behavior in full precision while preserving those intervals. Experiments on StarCoder, Phi-2, Gemma and aligned models show large behavioral differences when quantized (e.g., secure-code rate drops from 82.6% to 2.8%). A simple Gaussian-noise pre-step can substantially mitigate the attack in some cases.
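
Stage (2) above can be illustrated with a toy quantizer. The sketch below assumes a per-tensor absmax round-to-nearest scheme as a simplified stand-in for LLM.int8(), NF4, and FP4 (the real methods work blockwise, and NF4 and FP4 use non-uniform levels). The names `absmax_quantize` and `preimage_intervals` and the `margin` parameter are illustrative and not taken from the paper's code.

```python
def absmax_quantize(weights, levels=127):
    """Toy per-tensor quantizer: scale by the largest magnitude, then round.

    Assumes at least one weight is nonzero.
    """
    scale = max(abs(w) for w in weights) / levels
    return [round(w / scale) for w in weights], scale


def preimage_intervals(weights, levels=127, margin=1e-6):
    """Return one (lo, hi) box per weight; any value inside keeps its code.

    Weights whose code is +/-levels are pinned (lo == hi) so the largest
    magnitude, and with it the scale, cannot change. `margin` (a fraction
    of one quantization step) keeps boxes away from rounding ties.
    """
    codes, scale = absmax_quantize(weights, levels)
    boxes = []
    for w, q in zip(weights, codes):
        if abs(q) == levels:
            boxes.append((w, w))
        else:
            boxes.append(((q - 0.5 + margin) * scale,
                          (q + 0.5 - margin) * scale))
    return boxes


weights = [0.30, -1.0, 0.26, 0.02]
codes, _ = absmax_quantize(weights)
boxes = preimage_intervals(weights)

# Push every weight to the lower edge of its box: the codes do not change.
moved = [lo for lo, _hi in boxes]
assert absmax_quantize(moved)[0] == codes
```

Any full-precision model whose weights stay inside these boxes quantizes to the same integer codes, which is the property the repair stage relies on.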

Problem Statement

Users often evaluate only the full-precision LLM but deploy a locally quantized copy (LLM.int8(), NF4, FP4). The paper asks: can an adversary create a full-precision model that appears safe but becomes malicious when a user applies zero-shot quantization? Such an attack would let adversaries distribute stealthy poisoned models via community hubs.

Main Contribution

Define and implement a three-stage attack that makes a model benign in FP32 but malicious after zero-shot quantization.

Large-scale experiments across three threat scenarios (vulnerable code generation, over-refusal, content injection) on popular open models.

Key Findings

An attacked model can be benign in full precision yet produce almost exclusively malicious outputs after zero-shot quantization.

Numbers: StarCoder-3b: FP32 secure code 82.6% → LLM.int8() secure code 2.8% (drop ≈ 79.8 percentage points)

Practical Use: Do not assume full-precision safety checks hold after quantization; always test the quantized model before deployment.

Evidence Ref: Table 1

Content injection can be triggered only in the quantized model, and at very high rates.

Numbers: Gemma-2b quantized: keyword occurrence up to 74.7%

Practical Use: If your pipeline relies on filtering or content policies, verify those checks on quantized weights too.

Evidence Ref: Table 3
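
Both Practical Use notes amount to running the same check on the full-precision and the quantized artifact. A minimal sketch, assuming each model is exposed as a plain `generate(prompt) -> str` callable and that a pass/fail checker already exists; `pass_rate`, `assert_quantized_matches`, and `max_drop` are hypothetical names, and the two fake models are stand-ins for real ones.

```python
def pass_rate(generate, prompts, is_acceptable):
    """Fraction of prompts whose generated output passes the checker."""
    results = [is_acceptable(generate(p)) for p in prompts]
    return sum(results) / len(results)


def assert_quantized_matches(fp_generate, q_generate, prompts, is_acceptable,
                             max_drop=0.05):
    """Raise if the quantized pass rate falls too far below full precision."""
    fp_rate = pass_rate(fp_generate, prompts, is_acceptable)
    q_rate = pass_rate(q_generate, prompts, is_acceptable)
    if fp_rate - q_rate > max_drop:
        raise AssertionError(
            f"quantized pass rate {q_rate:.1%} vs full precision {fp_rate:.1%}"
        )
    return fp_rate, q_rate


# Stand-ins for real models: the "quantized" one emits an unsafe call.
def fake_fp_model(prompt):
    return "subprocess.run(args, shell=False)"


def fake_quantized_model(prompt):
    return "subprocess.run(cmd, shell=True)"


def no_shell_true(output):
    return "shell=True" not in output


prompts = ["run a command", "list files"]
try:
    assert_quantized_matches(fake_fp_model, fake_quantized_model,
                             prompts, no_shell_true)
except AssertionError as err:
    print("blocked:", err)
```

In a real pipeline the checker would be the existing security or policy test suite, run once per quantization method that users are likely to apply.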

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| Code Security (StarCoder-3b) | FP32 attacked: 82.6% secure; LLM.int8(): 2.8% secure | Original FP32: ~70.5% secure | ≈ -79.8 percentage points vs attacked FP32 | He et al. vulnerability test set (4 CWEs) | Table 1; vulnerable code generation scenario | Table 1 |
| Keyword Occurrence (content injection) | Gemma-2b quantized FP4: 74.7% of responses include 'McDonald's' | Original FP32: ~0.13% | ≈ +74.6 percentage points | databricks-dolly-15k (1.5k instructions) | Table 3; content injection scenario | Table 3 |
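
The deltas in the table are differences in percentage points, not relative changes. A short check of the arithmetic using the figures quoted above (`delta_pp` is an illustrative helper name):

```python
def delta_pp(before, after):
    """Difference between two percentages, in percentage points."""
    return round(after - before, 1)


# Code security, StarCoder-3b: attacked FP32 -> LLM.int8()
print(delta_pp(82.6, 2.8))    # -79.8
# Keyword occurrence, Gemma-2b: original FP32 -> quantized FP4
print(delta_pp(0.13, 74.7))   # 74.6
```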

What To Try In 7 Days

Quantize any third-party model you plan to deploy and run your security/unit tests on the quantized copy.

Compute simple weight-magnitude statistics to flag models with long-tailed distributions as higher risk.

On a dev model, try adding small Gaussian noise (σ ≈ 1e-3) to the weights before quantizing, then validate end-to-end behavior on key tasks.
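
The noise pre-step in the last item can be prototyped in a few lines. This sketch uses plain Python lists and a uniform rounding grid as a stand-in for a real quantizer; `add_gaussian_noise`, `count_code_flips`, the random test weights, and the grid `step` are assumptions for illustration. Only the σ values 1e-3 and 1e-2 come from the text.

```python
import random


def add_gaussian_noise(weights, sigma=1e-3, seed=0):
    """Return a copy of the weights with i.i.d. N(0, sigma^2) noise added."""
    rng = random.Random(seed)
    return [w + rng.gauss(0.0, sigma) for w in weights]


def count_code_flips(original, noised, step):
    """Count weights that land on a different grid point after the noise."""
    return sum(round(a / step) != round(b / step)
               for a, b in zip(original, noised))


# Hypothetical weights and grid step; replace with a real tensor and quantizer.
rng = random.Random(1)
weights = [rng.uniform(-1.0, 1.0) for _ in range(1000)]
step = 1.0 / 127

for sigma in (1e-4, 1e-3, 1e-2):
    noised = add_gaussian_noise(weights, sigma)
    flips = count_code_flips(weights, noised, step)
    print(f"sigma={sigma:g}: {flips} of {len(weights)} codes changed")
```

Counting flipped codes only shows how strongly a given σ perturbs the quantized model. Whether that perturbation removes injected behavior, and whether utility survives, still has to be measured on the tasks themselves.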

Optimization Features

Model Optimization
zero-shot weight quantization (LLM.int8(), NF4, FP4)
mixed-precision decomposition (LLM.int8()) discussed
Training Optimization
constrained PGD repair that projects weights into intervals preserving the quantized mapping
instruction tuning variants (SafeCoder / reverse SafeCoder) for task injection
Inference Optimization
preservation of scaling parameters to keep quantized mapping identical
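
The constrained PGD repair listed under Training Optimization can be sketched as projected gradient descent over box constraints: take a gradient step, then clamp each weight back into its interval. The quadratic objective below is a hypothetical stand-in for a real training loss, and the boxes are hand-picked rather than derived from a quantizer; `project` and `pgd_repair` are illustrative names.

```python
def project(weights, boxes):
    """Clamp each weight into its (lo, hi) box: the projection step of PGD."""
    return [min(max(w, lo), hi) for w, (lo, hi) in zip(weights, boxes)]


def pgd_repair(weights, boxes, grad_fn, lr=0.1, steps=50):
    """Projected gradient descent: gradient step, then projection, repeated."""
    w = project(weights, boxes)
    for _ in range(steps):
        g = grad_fn(w)
        w = project([wi - lr * gi for wi, gi in zip(w, g)], boxes)
    return w


# Hypothetical objective: pull the weights toward a "benign" target.
target = [0.0, 0.5, -0.2]


def loss(w):
    return sum((wi - ti) ** 2 for wi, ti in zip(w, target))


def grad(w):
    return [2 * (wi - ti) for wi, ti in zip(w, target)]


boxes = [(0.10, 0.20), (0.40, 0.60), (-0.35, -0.25)]
start = [0.15, 0.45, -0.30]
repaired = pgd_repair(start, boxes, grad)

print(loss(start), loss(repaired))  # the loss drops, but not to zero
print(repaired)                     # first and third weights sit on a box edge
```

The first and third targets lie outside their boxes, so those weights stop at the nearest edge. This is the trade-off the attack exploits: the full-precision loss improves only as far as the quantization-preserving intervals allow.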

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Yes
License: Unknown

Data URLs

Code-Alpaca (public)
He et al. vulnerability dataset (public)
Poisoned GPT-4-LLM variants (as used by Shu et al.)
databricks-dolly-15k (public)

Risks & Boundaries

Limitations

Does not target optimization-based quantizers or activation/KV-cache quantization methods.

Experiments limited to models up to 7B; results may differ for 70B+ models.

When Not To Use

If you only deploy vendor-provided quantized models (attacker cannot force user re-quantization).

When optimization-based quantization workflows are enforced upstream and cannot be changed locally.

Failure Modes

If quantization intervals are too narrow, PGD repair may not find a benign full-precision model.

Noise defense at high levels breaks utility (benchmarks degrade at σ=1e-2).
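
For the first failure mode, a rough sense of scale helps. Assuming a uniform absmax round-to-nearest grid with $L$ positive levels (a simplification, since NF4 and FP4 use non-uniform levels), the step size $s$ and the set of weights mapped to code $q$ are:

```latex
s = \frac{\max_i |w_i|}{L},
\qquad
Q(w) = q \iff w \in \left[\left(q - \tfrac{1}{2}\right) s,\; \left(q + \tfrac{1}{2}\right) s\right]
```

Each interval has width $s$, so the room available to the repair step shrinks as $L$ grows: roughly $0.8\%$ of the largest magnitude for $L = 127$, versus roughly $14\%$ for $L = 7$. Under this assumption, finer grids leave less slack for hiding a benign full-precision model.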

Core Entities

Models

StarCoder-1b, StarCoder-3b, StarCoder-7b, Phi-2, Gemma-2b, Phi-3-mini-4k-instruct

Metrics

Code Security (percentage of non-vulnerable completions), Informative Refusal (%), Keyword Occurrence (%), HumanEval pass@1, Accuracy, TruthfulQA score

Datasets

Code-Alpaca, He et al. vulnerable-code subset, GPT-4-LLM (poisoned versions from Shu et al.), databricks-dolly-15k (evaluation subset), HumanEval, MBPP

Benchmarks

MMLU, TruthfulQA, HumanEval, MBPP, CodeQL static analysis