MiniCPM4: an 8B on-device LLM that uses sparse attention, careful data, and quantization to run long-context workloads faster and with far fewer training tokens

June 9, 2025 · 10 min

Overview

Decision Snapshot: Ready For Pilot

The paper shows clear end-to-end engineering (architecture, data, training, inference) backed by empirical gains. Expect practical gains, but validate on your own workload.

Citations: 0

Evidence Strength: 0.75

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 85%

Production readiness: 80%

Novelty: 60%

Authors

MiniCPM Team, Chaojun Xiao, Yuxuan Li, Xu Han, Yuzhuo Bai, Jie Cai, Haotian Chen, Wentong Chen, Xin Cong, Ganqu Cui, Ning Ding, Shengda Fan, Yewei Fang, Zixuan Fu, Wenyu Guan, Yitong Guan, Junshao Guo, Yufeng Han, Bingxiang He, Yuxiang Huang, Baoxi Ji, Cunliang Kong, Qiuzuo Li, Siyuan Li, Wenhao Li, Xin Li, Yanghao Li, Yishan Li, Zhen Li, Dan Liu, Biyuan Lin, Yankai Lin, Xiang Long, Quanyu Lu, Yaxi Lu, Peiyan Luo, Hongya Lyu, Litu Ou, Yinxu Pan, Lushi Pu, Zekai Qu, Qundong Shi, Zijun Song, Jiayuan Su, Zhou Su, Ao Sun, Xianghui Sun, Peijun Tang, Fangzheng Wang, Feng Wang, Shuo Wang, Yudong Wang, Zheng Wang, Yesai Wu, Zhenyu Xiao, Jie Xie, Zihao Xie, Xiaoyue Xu, Yukun Yan, Jiarui Yuan, Jinqian Zhang, Kaihuo Zhang, Lei Zhang, Linyue Zhang, Xueren Zhang, Yudi Zhang, Hengyu Zhao, Weilin Zhao, Weilun Zhao, Yuanqian Zhao, Zhi Zheng, Chuyue Zhou, Ge Zhou, Jie Zhou, Wei Zhou, Yanghao Zhou, Zihan Zhou, Zixuan Zhou, Zhiyuan Liu, Guoyang Zeng, Chao Jia, Dahai Li, Maosong Sun

Why It Matters For Business

MiniCPM4 shows that capable long-context models can run on edge hardware and that higher-quality data can cut pretraining cost. Together these reduce cloud bills, lower latency, and enable private on-device workflows.

Who Should Care

Summary TLDR

MiniCPM4 presents a compact LLM family (0.5B and 8B parameters) designed for edge devices. Key engineering moves: InfLLM v2 (trainable blockwise sparse attention) to cut long-context compute, UltraClean + UltraChat v2 data to get strong performance from 8T tokens, ModelTunnel v2 and chunk-wise RL to speed up and stabilize training, BitCPM4 ternary QAT for very small low-bit models, and CPM.cu + ArkInfer for fast on-device inference. Results: accuracy comparable to similar open models while using far fewer pretraining tokens (about 22% of Qwen3-8B's budget) and large speedups on devices (≈7× faster decoding on Jetson AGX Orin for very long inputs).

Problem Statement

Deploying capable LLMs on phones and edge GPUs needs far lower compute, memory, and token budgets while keeping long-context and reasoning skills. The paper targets faster prefilling/decoding, fewer training tokens, low-bit deployment, and cross-device inference.

Main Contribution

InfLLM v2: trainable, blockwise sparse attention that accelerates both prefilling and decoding for long contexts.

UltraClean + UltraChat v2: verified, high-quality pretraining and SFT data pipelines to raise capability density and cut token needs.
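The data-quality idea can be illustrated with a toy classifier-based filter. This is a minimal sketch, not the UltraClean pipeline: it swaps the fastText classifier for a tiny add-one-smoothed unigram log-odds scorer so that it runs with the standard library only, and it does not model the seed verification step. The class name `SeedQualityFilter`, the function `filter_corpus`, the seed documents, and the threshold are all hypothetical.

```python
import math
from collections import Counter
from typing import Iterable, List


def tokenize(text: str) -> List[str]:
    # Whitespace tokenizer; a real pipeline would use a proper tokenizer.
    return text.lower().split()


class SeedQualityFilter:
    """Toy stand-in for a learned quality classifier (hypothetical helper)."""

    def __init__(self, good: Iterable[str], bad: Iterable[str]) -> None:
        self.good = Counter(t for doc in good for t in tokenize(doc))
        self.bad = Counter(t for doc in bad for t in tokenize(doc))
        self.vocab = set(self.good) | set(self.bad)
        self.n_good = sum(self.good.values())
        self.n_bad = sum(self.bad.values())

    def score(self, text: str) -> float:
        """Mean per-token log-odds of 'good' vs 'bad'; above 0 leans good."""
        tokens = [t for t in tokenize(text) if t in self.vocab]
        if not tokens:
            return 0.0  # no evidence either way
        v = len(self.vocab)
        total = 0.0
        for t in tokens:
            p_good = (self.good[t] + 1) / (self.n_good + v)  # add-one smoothing
            p_bad = (self.bad[t] + 1) / (self.n_bad + v)
            total += math.log(p_good / p_bad)
        return total / len(tokens)


def filter_corpus(docs: Iterable[str], clf: SeedQualityFilter,
                  threshold: float = 0.0) -> List[str]:
    # Keep only documents whose score clears the threshold.
    return [d for d in docs if clf.score(d) > threshold]


good_seeds = ["the theorem follows from the lemma by induction",
              "we measure latency and throughput on the device"]
bad_seeds = ["click here to win free prizes now",
             "buy cheap pills click now free"]
corpus = ["the lemma gives a bound on latency",
          "free prizes click here",
          "zzz qqq"]

clf = SeedQualityFilter(good_seeds, bad_seeds)
kept = filter_corpus(corpus, clf)
print(kept)
```

In practice the threshold would be tuned by training a small model on the filtered data and checking downstream scores, which is the role a verification step would play.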

Key Findings

MiniCPM4-8B reaches similar benchmark performance to Qwen3-8B while using only ~22% of its pretraining tokens.

Numbers: MiniCPM4: 8T tokens vs Qwen3-8B: 36T tokens (≈22%)

Practical Use: You can trade a large token budget for higher data quality and better tuning, cutting the pretraining token budget, and with it much of the pretraining cost, by roughly 4.5× (36T → 8T) for comparable accuracy on the evaluated benchmarks.

Evidence Ref: Introduction; Evaluation (Table 8)
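The token-budget figures quoted for this finding can be checked with a few lines of arithmetic. The sketch below uses only the 8T and 36T numbers from the summary; the helper names are illustrative, and token count is only a proxy for cost, since hardware, sequence length, and data-processing overhead also matter.

```python
# Token budgets quoted in this summary, in trillions of tokens.
MINICPM4_TOKENS_T = 8.0
QWEN3_8B_TOKENS_T = 36.0


def token_ratio(ours: float, baseline: float) -> float:
    """Fraction of the baseline token budget that `ours` consumes."""
    return ours / baseline


def reduction_factor(ours: float, baseline: float) -> float:
    """How many times smaller `ours` is than the baseline budget."""
    return baseline / ours


ratio = token_ratio(MINICPM4_TOKENS_T, QWEN3_8B_TOKENS_T)
factor = reduction_factor(MINICPM4_TOKENS_T, QWEN3_8B_TOKENS_T)
print(f"{ratio:.1%} of baseline tokens, {factor:.1f}x fewer")
# prints "22.2% of baseline tokens, 4.5x fewer"
```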

InfLLM v2 uses high attention sparsity and reduces memory/compute for long contexts while keeping accuracy.

Numbers: InfLLM v2 achieves 81% attention sparsity; on a 128K context each token attends to ~6K tokens (≈5% of the context)

Practical Use: For very long documents, switch to blockwise sparse attention to cut runtime and memory; expect similar accuracy with much lower I/O for long sequences.

Evidence Ref: Section 2.1; Long-Context Evaluation (Figure 2, Section 5.3)
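To make the blockwise idea concrete, here is a simplified single-query sketch of block-sparse attention in plain Python. It is not InfLLM v2's kernel: scoring blocks by the dot product with a mean-pooled key, always keeping the most recent block, and the function names are all assumptions chosen for illustration. The point is only that the query attends to a few selected blocks rather than the whole context, as in the ~6K-of-128K figure above.

```python
import math
import random
from typing import List, Sequence, Tuple

Vector = List[float]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def mean_vector(rows: Sequence[Vector]) -> Vector:
    # Column-wise mean: one representative vector per block (an assumption).
    return [sum(col) / len(rows) for col in zip(*rows)]


def select_blocks(q: Vector, keys: Sequence[Vector],
                  block_size: int, top_k: int) -> List[int]:
    """Pick the top_k blocks by query/representative score, plus the last block."""
    n_blocks = math.ceil(len(keys) / block_size)
    scored = []
    for b in range(n_blocks):
        block = keys[b * block_size:(b + 1) * block_size]
        scored.append((dot(q, mean_vector(block)), b))
    chosen = {b for _, b in sorted(scored, reverse=True)[:top_k]}
    chosen.add(n_blocks - 1)  # always keep the most recent (local) block
    return sorted(chosen)


def block_sparse_attention(q: Vector, keys: Sequence[Vector],
                           values: Sequence[Vector], block_size: int,
                           top_k: int) -> Tuple[Vector, float]:
    """Softmax attention over selected blocks only; returns (output, density)."""
    blocks = select_blocks(q, keys, block_size, top_k)
    idx = [i for b in blocks
           for i in range(b * block_size, min((b + 1) * block_size, len(keys)))]
    scale = 1.0 / math.sqrt(len(q))
    logits = [dot(q, keys[i]) * scale for i in idx]
    m = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(x - m) for x in logits]
    z = sum(weights)
    dim = len(values[0])
    out = [sum(w * values[i][d] for w, i in zip(weights, idx)) / z
           for d in range(dim)]
    return out, len(idx) / len(keys)


random.seed(0)
dim, n_tokens = 4, 64
keys = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_tokens)]
values = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_tokens)]
q = [random.gauss(0, 1) for _ in range(dim)]

out, density = block_sparse_attention(q, keys, values, block_size=8, top_k=2)
print(f"attended fraction of context: {density:.3f}")
```

With 8 blocks of 8 tokens and `top_k=2`, the query reads either 2 or 3 blocks, depending on whether the most recent block is already among the top 2. A production kernel would also skip loading the unselected KV blocks, which is where the memory and I/O savings come from.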

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Training tokens for comparable performance to Qwen3-8B | MiniCPM4: 8T vs Qwen3-8B: 36T | Qwen3-8B | 22% of Qwen3 tokens | Overall pretraining | Intro; Section 2 (multiple mentions) | Introduction; Pre-training pipeline (5.1)
Decoding speedup on Jetson AGX Orin (long inputs) | ~7× faster | Qwen3-8B | ≈7× | 128K long-sequence decoding | Efficiency Evaluation (Section 5.4; Figure 1) | Figure 1; 5.4 text

What To Try In 7 Days

Run a fast quality filter like UltraClean (fastText classifier + seed verification) on your corpora to improve data capability density.

Test an off-the-shelf sparse-attention kernel (InfLLM v2 style) on long-doc prefilling to measure memory savings.

Prototype P-GPTQ PTQ on a small model and compare S-P-GPTQ vs GPTQ accuracy for your tasks and prefix lengths.
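Before prototyping a GPTQ-style method, it can help to see what the per-group INT4 grid alone does to a weight row. The sketch below is plain symmetric round-to-nearest quantization, not GPTQ or P-GPTQ: it has no calibration data and no error-compensated rounding. The function names, the group size, and the example weights are assumptions for illustration.

```python
from typing import List, Tuple

Group = Tuple[List[int], float]  # (integer codes, scale)


def quantize_group(weights: List[float], bits: int = 4) -> Group:
    """Symmetric round-to-nearest quantization of one group of weights."""
    qmax = 2 ** (bits - 1) - 1  # 7 for INT4
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def quantize_row(row: List[float], group_size: int, bits: int = 4) -> List[Group]:
    # Each group of `group_size` consecutive weights gets its own scale.
    return [quantize_group(row[i:i + group_size], bits)
            for i in range(0, len(row), group_size)]


def dequantize_row(groups: List[Group]) -> List[float]:
    return [c * scale for codes, scale in groups for c in codes]


def mean_abs_error(a: List[float], b: List[float]) -> float:
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# Small weights next to large ones: a shared scale wipes out the small ones.
row = [0.01, -0.02, 0.03, 0.02, 5.0, -4.0, 3.0, 1.0]

per_group = quantize_row(row, group_size=4)
per_row = quantize_row(row, group_size=8)

per_group_err = mean_abs_error(row, dequantize_row(per_group))
per_row_err = mean_abs_error(row, dequantize_row(per_row))
print(f"per-group error {per_group_err:.4f} vs per-row error {per_row_err:.4f}")
```

Smaller groups track local weight magnitudes more closely at the cost of storing more scales. A real comparison for your tasks would measure task accuracy rather than reconstruction error.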

Agent Features

Memory: KV cache block-level management for sparse attention; Context window extension via YaRN
Tool Use: Model Context Protocol (MCP) integration; Function-calling datasets and tool-use training
Frameworks: CPM.cu (CUDA inference); ArkInfer (cross-platform deployment)
Architectures: Transformer (µP parameterization); InfLLM v2 (trainable blockwise sparse attention)

Optimization Features

Token Efficiency: High-quality data (UltraClean) to reduce pretraining tokens to 8T; Data synthesis for reasoning-intensive examples (UltraChat v2)
Model Optimization: Trainable sparse attention (InfLLM v2); Ternary weight QAT (BitCPM4); Per-group INT4 PTQ with prefix-aware calibration (P-GPTQ)
System Optimization: Static memory management and kernel fusion in CPM.cu; Cross-backend executor abstraction in ArkInfer
Training Optimization: ModelTunnel v2 predictable scaling and hyperparameter transfer; Multi-token prediction objective; FP8 mixed-precision; Chunk-wise rollout for RL
Inference Optimization: Speculative sampling with frequency-ranked drafts (FR-Spec); InfLLM v2 sparse kernels for prefilling/decoding; SpecMQuant compatibility for speculative sampling with quantization
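The ternary weight entry under Model Optimization can be illustrated with a small quantizer. This summary does not specify BitCPM4's quantizer, so the sketch assumes the absmean scheme commonly used for ternary models: scale by the mean absolute weight, round, and clip to {-1, 0, +1}. During quantization-aware training a straight-through estimator would typically pass gradients through the rounding step; that part is not shown. The function names are hypothetical.

```python
from typing import List, Tuple


def ternarize(weights: List[float], eps: float = 1e-8) -> Tuple[List[int], float]:
    """Map weights to codes in {-1, 0, 1} plus one shared scale (absmean, assumed)."""
    scale = sum(abs(w) for w in weights) / len(weights)
    codes = [max(-1, min(1, round(w / (scale + eps)))) for w in weights]
    return codes, scale


def dequantize(codes: List[int], scale: float) -> List[float]:
    # Reconstruct approximate weights from ternary codes.
    return [c * scale for c in codes]


weights = [0.9, -0.05, 0.4, -1.2, 0.0, 0.6]
codes, scale = ternarize(weights)
approx = dequantize(codes, scale)
print(codes, round(scale, 3))
```

Small weights collapse to zero and every surviving weight shares one magnitude, which is a plausible reason very small ternary models lose accuracy on hard math and code tasks.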

Reproducibility

Code Available: Yes
Data Available: No
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Smaller (0.5B) ternary models struggle on hard math and code benchmarks compared to larger models.

Sparse attention yields small accuracy drops (a few percentage points) on some long-context tasks.

When Not To Use

When you need the absolute best performance on difficult math/problem-solving tasks and cannot increase model size.

If your deployment platform lacks needed sparse/quantized operator support.

Failure Modes

Reduced reasoning accuracy for very small ternary models on math/code tasks.

Speculative sampling acceptance rate drops if draft model quantization is naive.
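The acceptance failure mode is easier to reason about with a toy draft-then-verify loop. The sketch below is the greedy variant of speculative decoding, assuming both models are exposed as next-token functions. It is not FR-Spec or SpecMQuant, and the `NextToken` type, `speculative_step`, and the toy models are hypothetical. A degraded draft model shows up directly as a shorter acceptance length.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical interface: a greedy next-token function over a token prefix.
NextToken = Callable[[Sequence[int]], int]


def speculative_step(prefix: List[int], draft: NextToken, target: NextToken,
                     k: int) -> Tuple[List[int], int]:
    """One greedy draft-then-verify step; returns (new_tokens, accepted_drafts)."""
    # 1) The draft model proposes k tokens autoregressively.
    proposal: List[int] = []
    ctx = list(prefix)
    for _ in range(k):
        tok = draft(ctx)
        proposal.append(tok)
        ctx.append(tok)

    # 2) The target model checks each proposal. A real system does this in one
    #    batched forward pass; here it is a plain loop for clarity.
    new_tokens: List[int] = []
    ctx = list(prefix)
    for tok in proposal:
        expected = target(ctx)
        if tok != expected:
            new_tokens.append(expected)  # replace the first mismatch
            return new_tokens, len(new_tokens) - 1
        new_tokens.append(tok)
        ctx.append(tok)

    new_tokens.append(target(ctx))  # all drafts accepted: add a bonus token
    return new_tokens, k


def target_model(ctx: Sequence[int]) -> int:
    return (ctx[-1] + 1) % 10  # toy target: count upward modulo 10


def degraded_draft(ctx: Sequence[int]) -> int:
    # Toy stand-in for a badly quantized draft: wrong whenever the last token is 3.
    return 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10


good_tokens, good_accepted = speculative_step([0], target_model, target_model, k=4)
bad_tokens, bad_accepted = speculative_step([0], degraded_draft, target_model, k=4)
print(good_accepted, bad_accepted)
```

Output quality is unchanged in both cases because the target model has the final say; only throughput suffers when the draft disagrees more often.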

Core Entities

Models

MiniCPM4-8B; MiniCPM4-0.5B; MiniCPM4.1 (hybrid reasoning); BitCPM4-0.5B; BitCPM4-1B; InfLLM v2; DeepSeek-R1-Distill-Qwen-1.5B; Qwen3-8B; Llama3.2; Gemma3

Metrics

Accuracy; Average benchmark score (table averages); Inference throughput / decoding speed; Token / training-data budget; Acceptance length in speculative sampling

Datasets

UltraFineWeb (en/zh); UltraChat v2; ScalingBench; FineWeb; FineWeb-edu; DAPO; Math/code collections (LeetCode, DAPO, Prime, etc.); SurveyEval

Benchmarks

MMLU; CMMLU; CEval; BBH; GSM8K; MATH500; MBPP; HumanEval; RULER-NIAH (long-context)