You can upload a harmless LLM but its quantized copy can be silently malicious

May 28, 2024 · 8 min

Overview

Production Readiness

0.4

Novelty Score

0.8

Cost Impact Score

0.6

Citation Count

2

Authors

Kazuki Egashira, Mark Vero, Robin Staab, Jingxuan He, Martin Vechev

Why It Matters For Business

Models that look safe in FP32 can behave maliciously after common local quantization; companies must test quantized artifacts before shipping or allowing community uploads.

Summary TLDR

The paper shows a practical attack that makes a model look benign in full precision but behave maliciously after common zero-shot quantization (LLM.int8(), NF4, FP4). The authors craft the attack by (1) fine-tuning a malicious model, (2) computing intervals of full-precision weights that map to the same quantized weights, and (3) using projected gradient descent (PGD) to remove malicious behavior in full precision while preserving those intervals. Experiments on StarCoder, Phi-2, Gemma and aligned models show large behavioral differences when quantized (e.g., secure-code rate drops from 82.6% to 2.8%). A simple Gaussian-noise pre-step can substantially mitigate the attack in some cases.
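Step (2) can be illustrated with a toy quantizer. The sketch below assumes a simplified symmetric absmax round-to-nearest scheme over a single block, which is not the exact LLM.int8(), NF4, or FP4 procedure. The function names and the `margin` parameter are hypothetical. For each weight it derives an interval inside which the quantized code stays the same, and it pins the largest-magnitude weight so the scale does not move.

```python
def absmax_quantize(block, levels=127):
    """Toy symmetric round-to-nearest quantizer (illustrative only).

    Assumes the block contains at least one non-zero weight.
    """
    scale = max(abs(w) for w in block) / levels
    return [round(w / scale) for w in block], scale


def preserving_intervals(block, levels=127, margin=0.01):
    """Per-weight (lo, hi) ranges that keep every quantized code unchanged.

    `margin` (hypothetical) keeps values strictly inside each rounding cell
    so that ties at the cell boundary never occur.
    """
    codes, scale = absmax_quantize(block, levels)
    amax = max(abs(w) for w in block)
    pivot = max(range(len(block)), key=lambda i: abs(block[i]))
    intervals = []
    for i, (w, q) in enumerate(zip(block, codes)):
        if i == pivot:
            # Freeze the weight that defines the scale.
            intervals.append((w, w))
            continue
        lo = (q - 0.5 + margin) * scale
        hi = (q + 0.5 - margin) * scale
        # No other weight may exceed the pivot, or the scale would change.
        intervals.append((max(lo, -amax), min(hi, amax)))
    return intervals
```

Any full-precision block whose weights stay inside these intervals quantizes to the same codes under this toy scheme, which is the property the repair stage relies on.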

Problem Statement

Users often evaluate only the full-precision LLM but deploy a locally quantized copy (LLM.int8(), NF4, FP4). The paper asks: can an adversary create a full-precision model that appears safe but becomes malicious when a user applies zero-shot quantization? Such an attack would let adversaries distribute stealthy poisoned models via community hubs.

Main Contribution

Define and implement a three-stage attack that makes a model benign in FP32 but malicious after zero-shot quantization.

Large-scale experiments across three threat scenarios (vulnerable code generation, over-refusal, content injection) on popular open models.

Analysis of factors affecting attack success (weight distributions, constraint widths) and a basic Gaussian noise defense that can remove the effect in some settings.

Open-source release of code to reproduce the attack.

Key Findings

An attacked model can be benign in full precision yet produce almost exclusively malicious outputs after zero-shot quantization.

Numbers: StarCoder-3b: FP32 secure code 82.6% → LLM.int8() secure code 2.8% (a drop of 79.8 percentage points)

Content-injection can be triggered only by the quantized model at very high rates.

Numbers: Gemma-2b quantized: keyword occurrence up to 74.7%

Quantization can flip refusal behavior and cause high over-refusal rates.

Numbers: Gemma-2b: FP4 informative refusals up to 39.1%

Models with long-tailed/larger weight magnitudes yield wider quantization-preserving intervals and are easier to attack.

Numbers: Phi-2 showed a gap of up to ≈80.1 percentage points between FP32 and quantized code security

Adding small Gaussian noise to the weights before quantization can largely neutralize the attack in some settings while keeping benchmark utility.

Numbers: Phi-2, σ=1e-3: Int8 code security improved from 18.5% to 97.5% while MMLU/TruthfulQA stayed stable
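The noise defense can be sketched on a toy weight vector. The example below is a minimal illustration, assuming a simplified absmax round-to-nearest quantizer rather than a real LLM.int8(), NF4, or FP4 implementation; `add_gaussian_noise`, `toy_quantize`, and `fraction_changed` are hypothetical helper names. The intuition, which is an assumption of this sketch, is that noise comparable to the quantization cell width moves weights out of the narrow regions an attacker tuned them into.

```python
import random


def add_gaussian_noise(weights, sigma=1e-3, seed=0):
    """Return a copy of `weights` with i.i.d. N(0, sigma^2) noise added."""
    rng = random.Random(seed)
    return [w + rng.gauss(0.0, sigma) for w in weights]


def toy_quantize(weights, levels=127):
    """Toy symmetric absmax round-to-nearest quantizer (illustrative only)."""
    scale = max(abs(w) for w in weights) / levels
    return [round(w / scale) for w in weights]


def fraction_changed(weights, sigma, seed=0):
    """Share of quantized codes that differ once noise is applied first."""
    before = toy_quantize(weights)
    after = toy_quantize(add_gaussian_noise(weights, sigma, seed))
    return sum(a != b for a, b in zip(before, after)) / len(weights)
```

On a real model the same idea would be applied tensor by tensor before calling the quantizer, followed by re-running both the security tests and the utility benchmarks, since the summary reports that utility degrades at σ=1e-2.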

Results

Code Security (StarCoder-3b)

Value: FP32 attacked: 82.6% secure; LLM.int8(): 2.8% secure

Baseline: Original FP32 ~70.5% secure

Keyword Occurrence (content injection)

Value: Gemma-2b quantized FP4: 74.7% of responses include 'McDonald's'

Baseline: Original FP32 ~0–0.13%

Informative Refusal

Value: Gemma-2b FP4: 39.1% informative refusals

Baseline: Original FP32 <1%

Noise defense effect (Phi-2)

Value: Int8 code security: 18.5% → 97.5% at σ=1e-3

Baseline: No noise: 18.5% Int8
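Each result above compares the same metric on the full-precision model and on a quantized copy. A simple release gate can automate that comparison. The sketch below is hypothetical: `flag_suspicious`, the `"fp32"` key, and the 10-point default threshold are illustrative assumptions, not values from the paper.

```python
def flag_suspicious(results, max_gap_pp=10.0):
    """Return formats whose metric differs from FP32 by more than max_gap_pp.

    `results` maps a format name to a metric in percent and must contain
    the key "fp32". The gap is reported in percentage points (FP32 minus
    quantized), so it is positive when the quantized copy scores lower.
    """
    fp = results["fp32"]
    return {
        name: fp - rate
        for name, rate in results.items()
        if name != "fp32" and abs(fp - rate) > max_gap_pp
    }
```

Fed the StarCoder-3b code-security figures (`{"fp32": 82.6, "int8": 2.8}`), the gate flags the Int8 copy. The absolute value also catches metrics where higher is worse, such as keyword occurrence.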

What To Try In 7 Days

Quantize any third-party model you plan to deploy and run your security/unit tests on the quantized copy.

Compute simple weight-magnitude statistics to flag models with long-tailed distributions as higher risk.

Try small Gaussian noise (σ≈1e-3) before quantizing on a dev model and validate end-to-end behavior on key tasks.
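For the weight-statistics check, two generic heavy-tail indicators are a reasonable starting point. The sketch below is an assumption, not the paper's analysis: `tail_stats` is a hypothetical helper, and the summary does not say which statistic or threshold the authors used.

```python
import statistics


def tail_stats(weights):
    """Heuristic heavy-tail indicators for one flattened weight tensor."""
    values = list(weights)
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    fourth_moment = statistics.fmean([(w - mean) ** 4 for w in values])
    return {
        # Large values mean a few outliers dominate the quantization scale.
        "max_abs_over_std": max(abs(w) for w in values) / std,
        # About 0 for a Gaussian; clearly positive for long-tailed weights.
        "excess_kurtosis": fourth_moment / std ** 4 - 3.0,
    }
```

Comparing these numbers across layers, or against a model you already trust, is likely more informative than any fixed cutoff.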

Optimization Features

Model Optimization

  • zero-shot weight quantization (LLM.int8(), NF4, FP4)
  • mixed-precision decomposition in LLM.int8() (discussed)

Training Optimization

  • constrained PGD repair that projects weights into intervals preserving quantized mapping
  • instruction tuning variants (SafeCoder / reverse SafeCoder) for task injection
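The constrained repair can be sketched as a gradient step followed by a projection onto per-weight boxes. This is a minimal illustration, assuming the intervals have already been computed and that gradients come from some benign training loss; `pgd_step` and `repair` are hypothetical names, and the toy quadratic loss stands in for real fine-tuning.

```python
def pgd_step(weights, grads, intervals, lr=0.1):
    """One projected gradient step: descend, then clamp into [lo, hi]."""
    updated = []
    for w, g, (lo, hi) in zip(weights, grads, intervals):
        candidate = w - lr * g
        updated.append(min(max(candidate, lo), hi))
    return updated


def repair(weights, targets, intervals, steps=100, lr=0.1):
    """Toy repair loop minimizing sum((w - t)^2) under the box constraints."""
    for _ in range(steps):
        grads = [2.0 * (w - t) for w, t in zip(weights, targets)]
        weights = pgd_step(weights, grads, intervals, lr)
    return weights
```

Because every iterate is clamped back into its interval, the repaired weights still quantize to the attacker's chosen codes (under whichever quantizer the intervals were derived for), even though the full-precision loss has been driven toward the benign target.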

Inference Optimization

  • preservation of scaling parameters to keep quantized mapping identical

Reproducibility

Data Urls

  • Code-Alpaca (public)
  • He et al. vulnerability dataset (public)
  • Poisoned GPT-4-LLM variants (as used by Shu et al.)
  • databricks-dolly-15k (public)

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Does not target optimization-based quantizers or activation/KV-cache quantization methods.
  • Experiments limited to models up to 7B; results may differ for 70B+ models.
  • Noise defense validated only on benchmarks; real-world side effects are unmeasured.

When Not To Use

  • If you only deploy vendor-provided quantized models (attacker cannot force user re-quantization).
  • When optimization-based quantization workflows are enforced upstream and cannot be changed locally.

Failure Modes

  • If quantization intervals are too narrow, PGD repair may not find a benign full-precision model.
  • At high noise levels the defense degrades utility (benchmarks drop at σ=1e-2).
  • The attack requires the attacker to fine-tune the model, so it cannot be mounted without compute and access to the weights.

Core Entities

Models

  • StarCoder-1b
  • StarCoder-3b
  • StarCoder-7b
  • Phi-2
  • Gemma-2b
  • Phi-3-mini-4k-instruct

Metrics

  • Code Security (percentage of non-vulnerable completions)
  • Informative Refusal (%)
  • Keyword Occurrence (%)
  • HumanEval pass@1
  • Accuracy
  • TruthfulQA score

Datasets

  • Code-Alpaca
  • He et al. vulnerable-code subset
  • GPT-4-LLM (poisoned versions from Shu et al.)
  • databricks-dolly-15k (evaluation subset)
  • HumanEval
  • MBPP

Benchmarks

  • MMLU
  • TruthfulQA
  • HumanEval
  • MBPP
  • CodeQL static analysis