Overview
Production Readiness
0.4
Novelty Score
0.8
Cost Impact Score
0.6
Citation Count
2
Why It Matters For Business
Models that look safe in FP32 can behave maliciously after common local quantization; companies must test quantized artifacts before shipping or allowing community uploads.
Summary TLDR
The paper shows a practical attack that makes a model look benign in full precision but behave maliciously after common zero-shot quantization (LLM.int8(), NF4, FP4). The authors craft the attack by (1) fine-tuning a malicious model, (2) computing intervals of full-precision weights that map to the same quantized weights, and (3) using projected gradient descent (PGD) to remove malicious behavior in full precision while keeping every weight inside its interval. Experiments on StarCoder, Phi-2, Gemma and aligned models show large behavioral differences when quantized (e.g., the rate of secure code completions drops from 82.6% to 2.8%). Adding Gaussian noise to the weights before quantization can substantially mitigate the attack in some cases.
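Step (2) can be illustrated with a minimal sketch. It assumes a toy symmetric absmax round-to-nearest quantizer, not the actual LLM.int8(), NF4 or FP4 schemes; the function names and the `margin` parameter are hypothetical. For each weight it returns a box of full-precision values that round to the same integer code, and it pins the largest-magnitude weight so the scale does not change.

```python
def absmax_quantize(weights, levels=127):
    """Toy symmetric quantizer: code = round(w / scale), scale = max|w| / levels."""
    scale = max(abs(w) for w in weights) / levels
    codes = [round(w / scale) for w in weights]
    return codes, scale


def preserving_intervals(weights, levels=127, margin=1e-6):
    """Return one (low, high) box per weight that leaves its code unchanged.

    Assumption: the largest-magnitude weight is frozen so the scale is preserved,
    and every other box is clipped to [-absmax, absmax] for the same reason.
    `margin` (a fraction of the scale) keeps endpoints off the rounding boundary.
    """
    codes, scale = absmax_quantize(weights, levels)
    absmax = max(abs(w) for w in weights)
    pinned = max(range(len(weights)), key=lambda i: abs(weights[i]))
    boxes = []
    for i, (w, q) in enumerate(zip(weights, codes)):
        if i == pinned:
            boxes.append((w, w))  # frozen: it defines the scale
            continue
        low = max((q - 0.5 + margin) * scale, -absmax)
        high = min((q + 0.5 - margin) * scale, absmax)
        boxes.append((low, high))
    return boxes


weights = [0.42, -0.12, 0.031, -1.0, 0.77]
codes, _ = absmax_quantize(weights)
boxes = preserving_intervals(weights)
# Moving every weight to the upper end of its box keeps the codes identical.
moved = [high for (_, high) in boxes]
print(absmax_quantize(moved)[0] == codes)
```

Any full-precision model whose weights stay inside these boxes quantizes to the same codes under this toy scheme, which is the property the attack relies on.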
Problem Statement
Users often evaluate only full-precision LLMs but deploy locally-quantized models (LLM.int8(), NF4, FP4). The paper asks: can an adversary create a full-precision model that appears safe but becomes malicious when a user applies zero-shot quantization? This threat would let attackers distribute stealthy poisoned models via community hubs.
Main Contribution
Define and implement a three-stage attack that makes a model benign in FP32 but malicious after zero-shot quantization.
Large-scale experiments across three threat scenarios (vulnerable code generation, over-refusal, content injection) on popular open models.
Analysis of factors affecting attack success (weight distributions, constraint widths) and a basic Gaussian noise defense that can remove the effect in some settings.
Open-source release of code to reproduce the attack.
Key Findings
An attacked model can be benign in full precision yet produce nearly entirely malicious outputs after zero-shot quantization.
Content injection can be made to appear only in the quantized model, where it occurs at very high rates.
Quantization can flip refusal behavior and cause high over-refusal rates.
Models with long-tailed/larger weight magnitudes yield wider quantization-preserving intervals and are easier to attack.
Adding small Gaussian noise to the weights before quantization can remove the attack in some settings while preserving benchmark utility.
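The noise pre-step in the last finding can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: it uses a toy absmax quantizer rather than LLM.int8(), NF4 or FP4, and `flipped_fraction` is a hypothetical diagnostic that reports how many quantized codes change once noise is added.

```python
import random


def add_gaussian_noise(weights, sigma=1e-3, seed=None):
    """Add independent N(0, sigma^2) noise to every weight."""
    rng = random.Random(seed)
    return [w + rng.gauss(0.0, sigma) for w in weights]


def absmax_codes(weights, levels=127):
    """Toy symmetric quantizer (an assumption, not a real library scheme)."""
    scale = max(abs(w) for w in weights) / levels
    return [round(w / scale) for w in weights]


def flipped_fraction(weights, sigma=1e-3, seed=0):
    """Fraction of weights whose quantized code changes after adding noise."""
    before = absmax_codes(weights)
    after = absmax_codes(add_gaussian_noise(weights, sigma, seed))
    return sum(b != a for b, a in zip(before, after)) / len(weights)


rng = random.Random(42)
weights = [rng.uniform(-1.0, 1.0) for _ in range(1000)]
for sigma in (1e-5, 1e-3, 1e-2):
    print(sigma, flipped_fraction(weights, sigma))
```

Larger noise moves more weights out of their original quantization bins, which also means more change to the model itself; any chosen noise level should be validated against task benchmarks.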
Results
Code Security (StarCoder-3b)
Keyword Occurrence (content injection)
Informative Refusal
Noise defense effect (Phi-2)
Who Should Care
What To Try In 7 Days
Quantize any third-party model you plan to deploy and run your security/unit tests on the quantized copy.
Compute simple weight-magnitude statistics to flag models with long-tailed distributions as higher risk.
Try small Gaussian noise (σ≈1e-3) before quantizing on a dev model and validate end-to-end behavior on key tasks.
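For the weight-statistics check in the list above, a minimal sketch might look like the following. The choice of statistics (excess kurtosis and the ratio of the largest magnitude to the standard deviation) is an assumption made for illustration; the summary does not specify which statistics to compute, and `tail_stats` is a hypothetical helper operating on a flat list of weights.

```python
import math
import random


def tail_stats(weights):
    """Return simple indicators of how long-tailed a weight distribution is."""
    n = len(weights)
    mean = sum(weights) / n
    var = sum((w - mean) ** 2 for w in weights) / n
    if var == 0.0:
        raise ValueError("constant weights have no defined tail statistics")
    std = math.sqrt(var)
    fourth = sum((w - mean) ** 4 for w in weights) / n
    return {
        "std": std,
        # Roughly 0 for Gaussian-like weights, larger for heavier tails.
        "excess_kurtosis": fourth / var ** 2 - 3.0,
        # How far the largest weight sits from the bulk, in standard deviations.
        "max_over_std": max(abs(w) for w in weights) / std,
    }


rng = random.Random(0)
gaussian_like = [rng.gauss(0.0, 0.02) for _ in range(2000)]
with_outliers = gaussian_like + [0.5, -0.5, 0.4]
print(tail_stats(gaussian_like))
print(tail_stats(with_outliers))
```

In practice the same statistics would be computed per layer or per quantization block; any threshold for "higher risk" has to be calibrated on your own models.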
Optimization Features
Model Optimization
- zero-shot weight quantization (LLM.int8(), NF4, FP4)
- mixed-precision decomposition (LLM.int8()) discussed
Training Optimization
- constrained PGD repair that projects weights into intervals preserving quantized mapping
- instruction tuning variants (SafeCoder / reverse SafeCoder) for task injection
Inference Optimization
- preservation of scaling parameters to keep quantized mapping identical
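The constrained PGD repair listed under Training Optimization alternates a gradient step with a projection back into the per-weight intervals. The sketch below shows only that control flow, under assumptions: `pgd_repair`, `project` and the quadratic toy loss are hypothetical stand-ins, and the real repair objective (a fine-tuning loss over a language model) is not reproduced here.

```python
def project(weights, boxes):
    """Clamp each weight into its (low, high) box."""
    return [min(max(w, lo), hi) for w, (lo, hi) in zip(weights, boxes)]


def pgd_repair(weights, boxes, grad_fn, lr=0.1, steps=100):
    """Projected gradient descent: gradient step, then clamp into the boxes."""
    w = project(list(weights), boxes)
    for _ in range(steps):
        grads = grad_fn(w)
        w = [wi - lr * gi for wi, gi in zip(w, grads)]
        w = project(w, boxes)
    return w


# Toy stand-in for the repair loss: squared distance to a "benign" target.
target = [0.0, 1.0]


def grad_fn(w):
    return [2.0 * (wi - ti) for wi, ti in zip(w, target)]


boxes = [(0.2, 0.4), (0.5, 2.0)]
repaired = pgd_repair([0.3, 0.6], boxes, grad_fn)
print(repaired)
```

The first weight stops at the edge of its box because the target lies outside it, while the second reaches the target freely. This mirrors the failure mode where narrow intervals leave too little room for the repair to succeed.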
Reproducibility
Data Urls
- Code-Alpaca (public)
- He et al. vulnerability dataset (public)
- Poisoned GPT-4-LLM variants (as used by Shu et al.)
- databricks-dolly-15k (public)
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- Does not target optimization-based quantizers or activation/KV-cache quantization methods.
- Experiments limited to models up to 7B; results may differ for 70B+ models.
- Noise defense validated only on benchmarks; real-world side effects are unmeasured.
When Not To Use
- If you only deploy vendor-provided quantized models (attacker cannot force user re-quantization).
- When optimization-based quantization workflows are enforced upstream and cannot be changed locally.
Failure Modes
- If quantization intervals are too narrow, PGD repair may not find a benign full-precision model.
- Noise defense at high levels breaks utility (benchmarks degrade at σ=1e-2).
- Attack depends on attacker control of fine-tuning; cannot be applied without compute and model access.
Core Entities
Models
- StarCoder-1b
- StarCoder-3b
- StarCoder-7b
- Phi-2
- Gemma-2b
- Phi-3-mini-4k-instruct
Metrics
- Code Security (percentage of non-vulnerable completions)
- Informative Refusal (%)
- Keyword Occurrence (%)
- HumanEval pass@1
- Accuracy
- TruthfulQA score
Datasets
- Code-Alpaca
- He et al. vulnerable-code subset
- GPT-4-LLM (poisoned versions from Shu et al.)
- databricks-dolly-15k (evaluation subset)
- HumanEval
- MBPP
Benchmarks
- MMLU
- TruthfulQA
- HumanEval
- MBPP
- CodeQL static analysis

