Overview
The attack is practical on 1–7B models with common zero-shot quantizers. Testing the quantized model and adding small noise before quantization reduce the risk, but behavior at larger scales and broader defenses need more study.
Citations: 2
Evidence Strength: 0.80
Confidence: 0.85
Risk Signals: 8
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 4/4
Reproducibility
Status: Code + data available
Open source: Yes
At A Glance
Cost impact: 60%
Production readiness: 40%
Novelty: 80%
Why It Matters For Business
Models that look safe in FP32 can behave maliciously after common local quantization; companies must test quantized artifacts before shipping or allowing community uploads.
Summary TLDR
The paper presents a practical attack in which a model appears benign in full precision but behaves maliciously after common zero-shot quantization (LLM.int8(), NF4, FP4). The attack has three stages: (1) fine-tune a malicious model; (2) compute, for each weight, the interval of full-precision values that map to the same quantized weight; and (3) use projected gradient descent (PGD) to remove the malicious behavior in full precision while keeping every weight inside its interval, so the quantized model stays malicious. Experiments on StarCoder, Phi-2, Gemma, and aligned models show large behavioral gaps between the full-precision and quantized versions (e.g., for StarCoder-3b the secure-code rate falls from 82.6% in FP32 to 2.8% under LLM.int8()). Adding small Gaussian noise to the weights before quantization substantially mitigates the attack in some cases.
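Stages (2) and (3) can be illustrated with a minimal sketch. It assumes a simplified symmetric round-to-nearest quantizer with a single absmax scale, which is only a stand-in for LLM.int8(), NF4, or FP4; the function names, the `margin` parameter, and the learning rate are hypothetical and not taken from the paper's code. The idea is that any full-precision weight may move freely as long as it stays inside the interval that rounds to its original quantized code, and the weight that sets the scale stays fixed.

```python
def quantize_absmax(weights, levels=127):
    """Symmetric round-to-nearest quantization with a single absmax scale.

    Simplified stand-in quantizer; assumes at least one weight is nonzero.
    """
    scale = max(abs(w) for w in weights) / levels
    return [round(w / scale) for w in weights], scale


def preserving_intervals(weights, levels=127, margin=1e-3):
    """Per-weight (low, high) bounds that keep every quantized code unchanged."""
    codes, scale = quantize_absmax(weights, levels)
    absmax = max(abs(w) for w in weights)
    intervals = []
    for w, q in zip(weights, codes):
        if abs(w) == absmax:
            # Freeze the weight(s) defining the scale so the scale cannot shift.
            intervals.append((w, w))
            continue
        # Stay strictly inside the rounding cell of code q, and never exceed
        # the absmax, which would change the scale.
        low = max((q - 0.5 + margin) * scale, -absmax)
        high = min((q + 0.5 - margin) * scale, absmax)
        intervals.append((low, high))
    return intervals


def project(weights, intervals):
    """Clip each weight back into its interval (the projection in PGD)."""
    return [min(max(w, low), high) for w, (low, high) in zip(weights, intervals)]


def pgd_step(weights, grads, intervals, lr=1e-2):
    """One projected gradient step: descend, then project onto the intervals."""
    stepped = [w - lr * g for w, g in zip(weights, grads)]
    return project(stepped, intervals)
```

In an actual attack the gradients would come from a loss that removes the malicious behavior in full precision; here they are just inputs. Whatever the gradients are, re-quantizing the result of `pgd_step` yields the same codes and scale as the starting weights under this toy quantizer.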
Problem Statement
Users often evaluate only full-precision LLMs but deploy locally-quantized models (LLM.int8(), NF4, FP4). The paper asks: can an adversary create a full-precision model that appears safe but becomes malicious when a user applies zero-shot quantization? This threat would let attackers distribute stealthy poisoned models via community hubs.
Main Contribution
Define and implement a three-stage attack that makes a model benign in FP32 but malicious after zero-shot quantization.
Large-scale experiments across three threat scenarios (vulnerable code generation, over-refusal, content injection) on popular open models.
Key Findings
An attacked model can be benign in full precision yet produce almost entirely malicious outputs after zero-shot quantization.
Content injection can be made to trigger in the quantized model at very high rates (e.g., 74.7% of responses from FP4-quantized Gemma-2b contain the injected keyword; Table 3).
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Code Security (StarCoder-3b) | FP32 attacked: 82.6% secure; LLM.int8(): 2.8% secure | Original FP32 ~70.5% secure | ≈ -79.8 percentage points vs attacked FP32 | He et al. vulnerability test set (4 CWEs) | Table 1; vulnerable code generation scenario | Table 1 |
| Keyword Occurrence (content injection) | Gemma-2b quantized FP4: 74.7% responses include 'McDonald's' | Original FP32 ~0–0.13% | +~74.6 percentage points | databricks-15k 1.5k instructions | Table 3; content injection scenario | Table 3 |
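The Delta column can be re-derived from the values in the table. The short sketch below does that arithmetic; the helper name `pp_delta` is hypothetical. Note that the code-security delta is measured against the attacked FP32 model (82.6%), not against the value in the Baseline column (original FP32, about 70.5%).

```python
def pp_delta(after, before):
    """Percentage-point change between two rates expressed in percent."""
    return round(after - before, 2)


# StarCoder-3b code security: LLM.int8() vs attacked FP32 (the reported delta).
code_security_vs_attacked = pp_delta(2.8, 82.6)

# Same quantized result against the approximate original FP32 baseline.
code_security_vs_original = pp_delta(2.8, 70.5)

# Gemma-2b keyword occurrence: FP4 vs the upper end of the original FP32 range.
keyword_occurrence = pp_delta(74.7, 0.13)

print(code_security_vs_attacked, code_security_vs_original, keyword_occurrence)
```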
What To Try In 7 Days
Quantize any third-party model you plan to deploy and run your security/unit tests on the quantized copy.
Compute simple weight-magnitude statistics to flag models with long-tailed distributions as higher risk.
On a dev model, add small Gaussian noise (σ≈1e-3) to the weights before quantizing, then validate end-to-end behavior on key tasks.
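The last two checks can be prototyped on a flat list of weights before wiring them into a real model pipeline. This is a minimal sketch under stated assumptions: excess kurtosis and the absmax-to-standard-deviation ratio are one possible choice of "weight-magnitude statistics" (the source does not name a specific statistic), and the function names are hypothetical. Only the noise level σ≈1e-3 comes from the text above.

```python
import random
import statistics


def add_gaussian_noise(weights, sigma=1e-3, seed=0):
    """Return a copy of the weights with i.i.d. N(0, sigma^2) noise added."""
    rng = random.Random(seed)
    return [w + rng.gauss(0.0, sigma) for w in weights]


def excess_kurtosis(weights):
    """Excess kurtosis; about 0 for Gaussian data, larger for heavy tails."""
    mean = statistics.fmean(weights)
    var = statistics.pvariance(weights, mu=mean)
    fourth = statistics.fmean([(w - mean) ** 4 for w in weights])
    return fourth / var ** 2 - 3.0


def absmax_to_std(weights):
    """Ratio of the largest magnitude to the standard deviation."""
    return max(abs(w) for w in weights) / statistics.pstdev(weights)
```

Any threshold on these statistics would need calibrating against models known to be clean. The noisy weights should then be quantized and run through the same task and security tests as the original, since noise that is too large degrades utility.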
Risks & Boundaries
Limitations
Does not target optimization-based quantizers or activation/KV-cache quantization methods.
Experiments limited to models up to 7B; results may differ for 70B+ models.
When Not To Use
If you only deploy vendor-provided quantized models (attacker cannot force user re-quantization).
When optimization-based quantization workflows are enforced upstream and cannot be changed locally.
Failure Modes
If quantization intervals are too narrow, PGD repair may not find a benign full-precision model.
Noise defense at high levels breaks utility (benchmarks degrade at σ=1e-2).

