MiniCPM4: an 8B on-device LLM that uses sparse attention, careful data, and quantization to run long-context workloads faster and with far fewer training tokens

June 9, 2025 · 10 min read

Overview

Production Readiness

0.8

Novelty Score

0.6

Cost Impact Score

0.85

Citation Count

0

Authors

MiniCPM Team, Chaojun Xiao, Yuxuan Li, Xu Han, Yuzhuo Bai, Jie Cai, Haotian Chen, Wentong Chen, Xin Cong, Ganqu Cui, Ning Ding, Shengda Fan, Yewei Fang, Zixuan Fu, Wenyu Guan, Yitong Guan, Junshao Guo, Yufeng Han, Bingxiang He, Yuxiang Huang, Baoxi Ji, Cunliang Kong, Qiuzuo Li, Siyuan Li, Wenhao Li, Xin Li, Yanghao Li, Yishan Li, Zhen Li, Dan Liu, Biyuan Lin, Yankai Lin, Xiang Long, Quanyu Lu, Yaxi Lu, Peiyan Luo, Hongya Lyu, Litu Ou, Yinxu Pan, Lushi Pu, Zekai Qu, Qundong Shi, Zijun Song, Jiayuan Su, Zhou Su, Ao Sun, Xianghui Sun, Peijun Tang, Fangzheng Wang, Feng Wang, Shuo Wang, Yudong Wang, Zheng Wang, Yesai Wu, Zhenyu Xiao, Jie Xie, Zihao Xie, Xiaoyue Xu, Yukun Yan, Jiarui Yuan, Jinqian Zhang, Kaihuo Zhang, Lei Zhang, Linyue Zhang, Xueren Zhang, Yudi Zhang, Hengyu Zhao, Weilin Zhao, Weilun Zhao, Yuanqian Zhao, Zhi Zheng, Chuyue Zhou, Ge Zhou, Jie Zhou, Wei Zhou, Yanghao Zhou, Zihan Zhou, Zixuan Zhou, Zhiyuan Liu, Guoyang Zeng, Chao Jia, Dahai Li, Maosong Sun

Why It Matters For Business

MiniCPM4 shows that capable long-context models can run on edge hardware and that higher-quality data can cut pretraining cost. Together these reduce cloud bills, lower latency, and enable private on-device workflows.

Summary TLDR

MiniCPM4 is a compact LLM family (0.5B and 8B parameters) designed for edge devices. Key engineering moves: InfLLM v2 (trainable blockwise sparse attention) to cut long-context compute; the UltraClean and UltraChat v2 data pipelines to get strong performance from 8T pretraining tokens; ModelTunnel v2 and chunk-wise RL rollout to speed up and stabilize training; BitCPM4 ternary QAT for very small deployable models; and CPM.cu plus ArkInfer for fast on-device inference. Results: accuracy comparable to similar open models with far fewer pretraining tokens (about 22% of the 36T used by Qwen3-8B) and large on-device speedups (≈7× decoding on Jetson AGX Orin for very long inputs).

Problem Statement

Deploying capable LLMs on phones and edge GPUs needs far lower compute, memory, and token budgets while keeping long-context and reasoning skills. The paper targets faster prefilling/decoding, fewer training tokens, low-bit deployment, and cross-device inference.

Main Contribution

InfLLM v2: trainable, blockwise sparse attention that accelerates both prefilling and decoding for long contexts.

UltraClean + UltraChat v2: verified, high-quality pretraining and SFT data pipelines to raise capability density and cut token needs.

ModelTunnel v2 and chunk-wise rollout: low-cost hyperparameter search and load-balanced RL for stable long-chain-of-thought tuning.

BitCPM4: quantization-aware training to build ternary (3-level) models using far fewer QAT tokens.

CPM.cu and ArkInfer: inference and cross-platform deployment stacks with speculative sampling (FR-Spec) and prefix-aware PTQ (P-GPTQ).

Released model checkpoints, inference code, and tooling for end-side deployment.
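To make "ternary (3-level)" concrete, here is a minimal sketch of weight ternarization. It assumes an absmean scaling rule (the one popularized by BitNet b1.58); the summary does not say which rule BitCPM4 uses, so `ternarize`, `dequantize`, and the example weights are illustrative only. In QAT the forward pass would use the ternarized weights while gradients update the full-precision copy.

```python
def ternarize(weights, eps=1e-8):
    """Map a 2-D list of float weights to {-1, 0, +1} plus one shared scale.

    Assumption: absmean scaling; BitCPM4's actual recipe may differ.
    """
    magnitudes = [abs(w) for row in weights for w in row]
    scale = sum(magnitudes) / len(magnitudes) + eps
    # Divide by the scale, round to the nearest integer, clamp to [-1, 1].
    ternary = [[max(-1, min(1, round(w / scale))) for w in row] for row in weights]
    return ternary, scale


def dequantize(ternary, scale):
    """Recover approximate float weights from the ternary codes."""
    return [[t * scale for t in row] for row in ternary]


weights = [[0.9, -0.05, 0.4], [-1.2, 0.02, -0.3]]
codes, scale = ternarize(weights)
approx = dequantize(codes, scale)
print(codes)  # [[1, 0, 1], [-1, 0, -1]]
```

Each weight then needs under two bits of storage plus a shared scale, which is why ternary models suit memory-constrained devices.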

Key Findings

MiniCPM4-8B reaches similar benchmark performance to Qwen3-8B while using only ~22% of its pretraining tokens.

Numbers: MiniCPM4: 8T tokens vs. Qwen3-8B: 36T tokens (≈22%)
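The 22% figure follows directly from the two token budgets. A quick check, using only the numbers quoted above (variable names are illustrative):

```python
minicpm4_tokens_t = 8   # trillions of pretraining tokens, MiniCPM4
qwen3_tokens_t = 36     # trillions of pretraining tokens, Qwen3-8B

ratio = minicpm4_tokens_t / qwen3_tokens_t
reduction = qwen3_tokens_t / minicpm4_tokens_t

print(f"{ratio:.1%}")      # 22.2%
print(f"{reduction:.1f}x") # 4.5x
```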

InfLLM v2 uses high attention sparsity and reduces memory/compute for long contexts while keeping accuracy.

Numbers: InfLLM v2 achieves 81% attention sparsity; on a 128K context each token attends to ~6K tokens (≈5% of the context)
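The idea of blockwise sparse attention can be sketched for a single decoding step. This is not the InfLLM v2 kernel: representing each block by the mean of its keys, always keeping the most recent block, and the names `block_sparse_attend`, `block_size`, and `top_k` are all assumptions made for illustration.

```python
import numpy as np


def block_sparse_attend(q, K, V, block_size=4, top_k=2):
    """One decoding step that attends only to the top_k most relevant key blocks."""
    n, d = K.shape
    n_blocks = (n + block_size - 1) // block_size
    # Summarize each block by the mean of its keys (illustrative choice).
    block_reps = np.stack(
        [K[b * block_size:(b + 1) * block_size].mean(axis=0) for b in range(n_blocks)]
    )
    block_scores = block_reps @ q
    # Always keep the most recent block so local context is never dropped.
    block_scores[-1] = np.inf
    chosen = np.sort(np.argsort(block_scores)[-top_k:])
    idx = np.concatenate(
        [np.arange(b * block_size, min((b + 1) * block_size, n)) for b in chosen]
    )
    # Standard softmax attention, restricted to the selected tokens.
    logits = K[idx] @ q / np.sqrt(d)
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return weights @ V[idx], idx


rng = np.random.default_rng(0)
K = rng.normal(size=(16, 8))
V = rng.normal(size=(16, 8))
q = rng.normal(size=8)

out, idx = block_sparse_attend(q, K, V)
density = len(idx) / len(K)  # fraction of the context actually attended
```

Here 2 of 4 blocks are selected, so the density is 0.5. The figures quoted above correspond to roughly 6K / 128K ≈ 0.05 of the context per token.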

On an edge GPU (Jetson AGX Orin) MiniCPM4 decodes long inputs ~7× faster than Qwen3-8B.

Numbers: ≈7× decoding acceleration on Jetson AGX Orin for long sequences

UltraClean filtering improves downstream zero-shot scores by a few percentage points on many benchmarks.

Numbers: English average +3.61 pp; e.g., MMLU +3.40 pp (UltraFineWeb vs. FineWeb)

Prefix-aware GPTQ (P-GPTQ) and its smoothed variant (S-P-GPTQ) narrow quantized-performance gap.

Numbers: S-P-GPTQ average 74.91 vs. FP16 75.58 (benchmarks in Table 7)

Results

Training tokens for comparable performance to Qwen3-8B

Value: MiniCPM4: 8T vs. Qwen3-8B: 36T

Baseline: Qwen3-8B

Decoding speedup on Jetson AGX Orin (long inputs)

Value: ~7× faster

Baseline: Qwen3-8B

Sparse attention sparsity

Value: 81% sparsity (prefill)

Baseline: Dense attention

UltraFineWeb impact on English average

Value: UltraFineWeb-en avg 45.89 (vs. FineWeb 42.28)

Baseline: FineWeb

Quantized PTQ performance (INT4 variants)

Value: S-P-GPTQ average 74.91 vs. FP16 75.58

Baseline: FP16 baseline

Who Should Care

What To Try In 7 Days

Run a fast quality filter like UltraClean (fastText classifier + seed verification) on your corpora to improve data capability density.

Test an off-the-shelf sparse-attention kernel (InfLLM v2 style) on long-doc prefilling to measure memory savings.

Prototype P-GPTQ PTQ on a small model and compare S-P-GPTQ vs GPTQ accuracy for your tasks and prefix lengths.
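For the first step, a classifier-based quality filter reduces to scoring each document and keeping those above a threshold. The sketch below shows only that control flow. `toy_quality_score` is a hypothetical stand-in for a trained classifier (in practice the score would be something like the probability a fastText model assigns to a "high quality" label), and the 0.5 threshold is arbitrary.

```python
from typing import Callable, Iterable, List


def filter_corpus(
    docs: Iterable[str],
    score_fn: Callable[[str], float],
    threshold: float = 0.5,
) -> List[str]:
    """Keep documents whose quality score reaches the threshold."""
    return [doc for doc in docs if score_fn(doc) >= threshold]


def toy_quality_score(text: str) -> float:
    """Hypothetical scorer: fraction of whitespace-separated tokens that are words.

    A real pipeline would replace this with a trained classifier.
    """
    tokens = text.split()
    if not tokens:
        return 0.0
    wordlike = sum(tok.strip(".,;:!?").isalpha() for tok in tokens)
    return wordlike / len(tokens)


docs = [
    "Sparse attention reduces the cost of long contexts.",
    "click here >>> $$$ 404 !!!",
]
kept = filter_corpus(docs, toy_quality_score)
print(kept)  # ['Sparse attention reduces the cost of long contexts.']
```

Keeping the scorer as a plain callable makes it easy to swap in a trained model later and to compare thresholds on a held-out sample.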

Agent Features

Memory

  • KV cache block-level management for sparse attention
  • Context window extension via YaRN

Tool Use

  • Model Context Protocol (MCP) integration
  • Function-calling datasets and tool-use training

Frameworks

  • CPM.cu (CUDA inference)
  • ArkInfer (cross-platform deployment)

Architectures

  • Transformer (µP parameterization)
  • InfLLM v2 (trainable blockwise sparse attention)

Optimization Features

Token Efficiency

  • High-quality data (UltraClean) to reduce pretraining tokens to 8T
  • Data synthesis for reasoning-intensive examples (UltraChat v2)

Model Optimization

  • Trainable sparse attention (InfLLM v2)
  • Ternary weight QAT (BitCPM4)
  • Per-group INT4 PTQ with prefix-aware calibration (P-GPTQ)
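"Per-group INT4" means each small group of weights gets its own scale, so an outlier in one group does not wipe out precision elsewhere. The sketch below uses plain symmetric round-to-nearest quantization. It deliberately omits GPTQ's error compensation and the prefix-aware calibration of P-GPTQ, and the function names and group size are illustrative.

```python
def quantize_group_int4(values):
    """Symmetric INT4: integer codes in [-8, 7] plus one float scale per group."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    codes = [max(-8, min(7, round(v / scale))) for v in values]
    return codes, scale


def quantize_per_group(row, group_size=4):
    """Split a weight row into groups and quantize each one independently."""
    groups = [row[i:i + group_size] for i in range(0, len(row), group_size)]
    return [quantize_group_int4(group) for group in groups]


def dequantize(quantized):
    """Flatten (codes, scale) pairs back into approximate float weights."""
    return [code * scale for codes, scale in quantized for code in codes]


# The second group contains much larger values than the first.
row = [0.7, -0.3, 0.1, 0.0, 14.0, -6.0, 2.0, 4.0]
quantized = quantize_per_group(row, group_size=4)
restored = dequantize(quantized)

print([codes for codes, _ in quantized])  # [[7, -3, 1, 0], [7, -3, 1, 2]]
```

With a single scale for the whole row, the small weights in the first group would all round to zero. Separate scales keep both groups representable.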

System Optimization

  • Static memory management and kernel fusion in CPM.cu
  • Cross-backend executor abstraction in ArkInfer

Training Optimization

  • ModelTunnel v2 predictable scaling and hyperparameter transfer
  • Multi-token prediction objective
  • FP8 mixed-precision
  • Chunk-wise rollout for RL

Inference Optimization

  • Speculative sampling with frequency-ranked drafts (FR-Spec)
  • InfLLM v2 sparse kernels for prefilling/decoding
  • SpecMQuant compatibility for speculative sampling with quantization
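Speculative sampling can be sketched in its simplest greedy form: a cheap draft model proposes several tokens, and the target model keeps the longest prefix it agrees with. The toy models below are hypothetical, and the sketch assumes "frequency-ranked" means the draft only predicts tokens from a high-frequency subset of the vocabulary. A real system verifies all drafted tokens in one batched target pass rather than one call per token.

```python
def speculative_step(prefix, draft_next, target_next, num_draft=4):
    """Return the tokens accepted in one draft-then-verify round (greedy variant)."""
    # 1) Draft model proposes num_draft tokens autoregressively.
    ctx = list(prefix)
    drafted = []
    for _ in range(num_draft):
        tok = draft_next(ctx)
        drafted.append(tok)
        ctx.append(tok)

    # 2) Target model checks each drafted token in order.
    ctx = list(prefix)
    accepted = []
    for tok in drafted:
        expected = target_next(ctx)
        if tok != expected:
            accepted.append(expected)  # target's correction ends the round
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    accepted.append(target_next(ctx))  # all drafts accepted: one bonus token
    return accepted


FREQUENT_VOCAB = set(range(8))  # hypothetical high-frequency token ids


def target_next(ctx):
    """Toy target model: counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def draft_next(ctx):
    """Toy draft model: same rule, but limited to the frequent vocabulary."""
    tok = (ctx[-1] + 1) % 10
    return tok if tok in FREQUENT_VOCAB else 0


print(speculative_step([1], draft_next, target_next))  # [2, 3, 4, 5, 6]
print(speculative_step([5], draft_next, target_next))  # [6, 7, 8]
```

The second call shows the trade-off: when the correct token falls outside the draft vocabulary, the accepted length drops, but the output still matches what the target model alone would produce.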

Reproducibility

Code Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Smaller (0.5B) ternary models struggle on hard math and code benchmarks compared to larger models.
  • Sparse attention yields small accuracy drops (a few percentage points) on some long-context tasks.
  • QAT for extremely low bits needs careful operator support; deployment operators remain a practical hurdle.

When Not To Use

  • When you need the absolute best performance on difficult math/problem-solving tasks and cannot increase model size.
  • If your deployment platform lacks needed sparse/quantized operator support.
  • When you cannot verify or curate training data, since the quality strategies here rely on curated seeds and verification.

Failure Modes

  • Reduced reasoning accuracy for very small ternary models on math/code tasks.
  • Speculative sampling acceptance rate drops if draft model quantization is naive.
  • Chunk-wise RL can destabilize without importance sampling, dual-clip, and KL regularization.
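The three stabilizers named in the last item can be shown together in a per-token objective. This is a generic sketch of dual-clip PPO with a KL penalty, not the paper's exact loss: the values of `eps`, `c`, and `beta`, and the simple log-ratio KL estimate, are assumptions.

```python
import math


def dual_clip_objective(logp_new, logp_old, advantage, eps=0.2, c=3.0):
    """Per-token PPO surrogate with importance sampling and dual clipping."""
    ratio = math.exp(logp_new - logp_old)  # importance sampling weight
    clipped_ratio = max(1 - eps, min(1 + eps, ratio))
    surrogate = min(ratio * advantage, clipped_ratio * advantage)
    if advantage < 0:
        # Dual clip: bound how negative the objective can get when the
        # ratio is very large and the advantage is negative.
        surrogate = max(surrogate, c * advantage)
    return surrogate


def regularized_objective(logp_new, logp_old, logp_ref, advantage, beta=0.01):
    """Surrogate minus a KL penalty toward a reference policy (log-ratio estimate)."""
    kl_estimate = logp_new - logp_ref
    return dual_clip_objective(logp_new, logp_old, advantage) - beta * kl_estimate


# Ratio of 10 with a negative advantage: plain PPO would give -10, dual clip gives -3.
print(dual_clip_objective(math.log(10), 0.0, -1.0))  # -3.0
```

Large ratios arise when rollout chunks were generated by an older policy, which is why the off-policy correction and its bounds matter in this setting.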

Core Entities

Models

  • MiniCPM4-8B
  • MiniCPM4-0.5B
  • MiniCPM4.1 (hybrid reasoning)
  • BitCPM4-0.5B
  • BitCPM4-1B
  • InfLLM v2
  • DeepSeek-R1-Distill-Qwen-1.5B
  • Qwen3-8B
  • Llama3.2
  • Gemma3

Metrics

  • Accuracy
  • Average benchmark score (table averages)
  • Inference throughput / decoding speed
  • Token / training-data budget
  • Acceptance length in speculative sampling

Datasets

  • UltraFineWeb (en/zh)
  • UltraChat v2
  • ScalingBench
  • FineWeb
  • FineWeb-edu
  • DAPO
  • Math/code collections (LeetCode, DAPO, Prime, etc.)
  • SurveyEval

Benchmarks

  • MMLU
  • CMMLU
  • CEval
  • BBH
  • GSM8K
  • MATH500
  • MBPP
  • HumanEval
  • RULER-NIAH (long-context)