MindLLM: 1.3B and 3B bilingual LLMs trained from scratch that match larger open models on several benchmarks

Overview

Decision Snapshot: Ready For Pilot

The paper gives concrete recipes (data mix, shuffling, training ops) and multi-benchmark results, but full code and model checkpoints are not yet public; expect to reproduce with moderate engineering effort.

Citations: 3

Evidence Strength: 0.75

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 6/6

Reproducibility

Status: No open assets linked

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 50%

Novelty: 40%

Authors

Yizhe Yang, Huashan Sun, Jiawei Li, Runheng Liu, Yinghao Li, Yuhang Liu, Heyan Huang, Yang Gao

Links

Abstract / PDF / Data

Why It Matters For Business

With 1–3B models you can get performance close to that of larger open models on some English, Chinese, and domain tasks. That cuts training and deployment cost while keeping the ability to adapt to law or finance via targeted fine‑tuning.

Who Should Care

ML Engineer, Product Manager, CTO, Founder

Summary TLDR

MindLLM presents two lightweight bilingual language models (1.3B and 3.1B parameters) trained from scratch on mixed Chinese/English corpora (323B and 500B tokens). Using architecture and training optimizations (RoPE, RMSNorm, DeepNorm, GeGLU, FlashAttention-2, DeepSpeed), the models match or beat some larger open models on standard benchmarks (MMLU, C‑Eval, arithmetic tasks). Key practical findings: (1) balanced bilingual pretraining from scratch is better than monolingual then transfer for these scales, (2) data mix and shuffling matter for stable learning, and (3) small, high‑quality instruction sets help lightweight models more than huge diverse instruction corpora. The paper also shows S

Problem Statement

Large, general LLMs are costly and often overkill for domain tasks. The paper asks whether smaller bilingual models trained from scratch can be cheaper to train and deploy yet still competitive on standard benchmarks and useful in domains like law and finance.

Main Contribution

Design and release of MindLLM-1.3B and MindLLM-3B: bilingual models trained from scratch on mixed Chinese/English data (323B and 500B tokens).

Practical training recipe: data cleaning, deduplication, recommended mix ratios, and training ops (RoPE, RMSNorm, DeepNorm, GeGLU, FlashAttention-2, DeepSpeed).

Key Findings

MindLLM-1.3B outperforms GPT-Neo-1.3B on English MMLU in few-shot evaluation.

Numbers: MMLU 26.6 vs 24.1 (Table 5 / Table 7)

Practical Use: For budget‑limited English tasks, train a small bilingual model on mixed data rather than relying on older monolingual 1.3B checkpoints.

Evidence Ref: Table 5, Table 7

MindLLM-3B outperforms much larger models on arithmetic tasks.

Numbers: Arithmetic 44.75 (MindLLM-3B) vs 28.26 (MPT-7B) and 37.82 (MOSS-16B) (Table 9)

Practical Use: A targeted pretraining mix (more math/code) can make a 3B model strong on arithmetic; use the data mix to focus capacity instead of scaling up.

Evidence Ref: Table 9

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
MMLU (few-shot)	MindLLM-1.3B 26.6	GPT-Neo-1.3B 24.1	+2.5	MMLU 5-shot	MindLLM-1.3B beats GPT-Neo-1.3B on MMLU 5-shot	Table 5, Table 7
Arithmetic	MindLLM-3B 44.75	MPT-7B 28.26	+16.49	Arithmetic 5-shot	MindLLM-3B strong on arithmetic tasks	Table 9

What To Try In 7 Days

Train or fine‑tune a 1–3B model on a small, balanced bilingual corpus rather than pretraining on one language and then transferring to the other.

Build a 50k entropy‑filtered instruction set (score each example with the pretrained model's cross‑entropy) and test instruction tuning on it before scaling to millions of examples.

For domain classification, try rationale (chain‑of‑thought, CoT) distillation from a larger model to boost small‑model accuracy quickly.
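The entropy‑filtering step in the second item can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes each instruction example already carries per‑token log‑probabilities from the pretrained model, and the field name `token_logprobs`, the band limits `low`/`high`, and the "closest to the band centre first" ordering are all hypothetical choices made for the example.

```python
from typing import Dict, List, Sequence


def mean_cross_entropy(token_logprobs: Sequence[float]) -> float:
    """Average negative log-likelihood (nats per token) of one example."""
    if not token_logprobs:
        raise ValueError("example has no tokens")
    return -sum(token_logprobs) / len(token_logprobs)


def entropy_filter(
    examples: List[Dict],
    low: float,
    high: float,
    max_examples: int = 50_000,
) -> List[Dict]:
    """Keep examples whose mean cross-entropy lies in [low, high].

    The band limits are assumptions; tune them on a held-out set.
    """
    scored = []
    for ex in examples:
        ce = mean_cross_entropy(ex["token_logprobs"])
        if low <= ce <= high:
            scored.append((ce, ex))
    # Hypothetical ordering: prefer examples nearest the centre of the band.
    centre = (low + high) / 2
    scored.sort(key=lambda pair: abs(pair[0] - centre))
    return [ex for _, ex in scored[:max_examples]]


# Toy data: log-probs would normally come from scoring with the pretrained model.
toy = [
    {"id": "a", "token_logprobs": [-0.1, -0.2]},  # very low entropy
    {"id": "b", "token_logprobs": [-1.0, -2.0]},  # inside the band
    {"id": "c", "token_logprobs": [-5.0, -7.0]},  # very high entropy
]
kept = entropy_filter(toy, low=1.0, high=2.0)
print([ex["id"] for ex in kept])  # prints ['b']
```

Starting from a small filtered set keeps each tuning run cheap, so the band limits can be compared empirically before committing to a larger corpus.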

Agent Features

Tool Use

DeepSpeed, FlashAttention-2

Frameworks

Hugging Face Transformers, DeepSpeed, lm-evaluation-harness, OpenCompass

Architectures

decoder-only Transformer, RoPE (rotary positional embeddings), GeGLU feed-forward
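To make the rotary positional embedding entry concrete, here is a minimal pure‑Python sketch of RoPE applied to a single vector. It is not taken from the MindLLM code: the interleaved pairing of dimensions and the base of 10000 are common conventions assumed here, and real implementations operate on batched tensors.

```python
import math
from typing import List


def rope(x: List[float], position: int, base: float = 10000.0) -> List[float]:
    """Rotate consecutive pairs (x[i], x[i+1]) by a position-dependent angle."""
    d = len(x)
    if d % 2 != 0:
        raise ValueError("RoPE needs an even number of dimensions")
    out = []
    for i in range(0, d, 2):
        # i is twice the pair index, so the frequency is base^(-2k/d).
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        out.append(x[i] * c - x[i + 1] * s)
        out.append(x[i] * s + x[i + 1] * c)
    return out


def dot(a: List[float], b: List[float]) -> float:
    return sum(u * v for u, v in zip(a, b))


q = [0.5, -1.0, 2.0, 0.25]
k = [1.5, 0.5, -0.75, 1.0]

# The attention score depends only on the relative offset (2 in both cases).
score_near = dot(rope(q, 5), rope(k, 3))
score_far = dot(rope(q, 12), rope(k, 10))
print(abs(score_near - score_far) < 1e-9)  # prints True
```

The relative‑offset property shown in the last lines is the reason RoPE is applied to queries and keys rather than added to the token embeddings.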

Optimization Features

Token Efficiency

SentencePiece BPE tokenizer (vocab 125,700)

Infra Optimization

DeepSpeed with ZeRO stage 1 for memory/training trade-offs
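A configuration along these lines could look like the sketch below, which builds a DeepSpeed config dictionary with ZeRO stage 1, bfloat16, AdamW, and a warm‑up‑then‑decay scheduler. The structure uses standard DeepSpeed config keys, but every numeric value (batch size, learning rate, betas, weight decay, step counts) is a placeholder assumed for illustration and is not taken from the paper.

```python
import json

# All numeric values below are hypothetical placeholders, not MindLLM settings.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 8,
    "bf16": {"enabled": True},
    # Stage 1 partitions optimizer states across data-parallel ranks.
    "zero_optimization": {"stage": 1},
    "optimizer": {
        "type": "AdamW",
        "params": {"lr": 3e-4, "betas": [0.9, 0.95], "weight_decay": 0.1},
    },
    # WarmupDecayLR warms up to warmup_max_lr, then decays linearly
    # over the remaining steps.
    "scheduler": {
        "type": "WarmupDecayLR",
        "params": {
            "warmup_min_lr": 0.0,
            "warmup_max_lr": 3e-4,
            "warmup_num_steps": 2000,
            "total_num_steps": 100000,
        },
    },
}

config_json = json.dumps(ds_config, indent=2)
print(config_json)
```

The printed JSON can be saved to a file and passed to the DeepSpeed launcher or to `deepspeed.initialize` as its config. Stage 1 is the lightest ZeRO setting; higher stages save more memory at the cost of extra communication.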

Model Optimization

RoPE positional encoding, RMSNorm, DeepNorm, GeGLU activation, FlashAttention-2
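Two of these components, RMSNorm and GeGLU, are small enough to sketch in pure Python. The code below follows the standard published definitions rather than the MindLLM source: it works on single vectors, omits biases and the feed‑forward output projection, and the helper names are chosen for this example only.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]  # one row per output unit


def rms_norm(x: Vector, gain: Vector, eps: float = 1e-6) -> Vector:
    """Scale x by the reciprocal of its root-mean-square, then by a learned gain."""
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [g * v * scale for g, v in zip(gain, x)]


def gelu(v: float) -> float:
    """Exact GELU using the Gaussian error function."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


def matvec(w: Matrix, x: Vector) -> Vector:
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def geglu(x: Vector, w_gate: Matrix, w_value: Matrix) -> Vector:
    """GeGLU: GELU(W_gate x) multiplied elementwise by (W_value x)."""
    gate = [gelu(v) for v in matvec(w_gate, x)]
    value = matvec(w_value, x)
    return [g * v for g, v in zip(gate, value)]


normed = rms_norm([3.0, 4.0], gain=[1.0, 1.0])
hidden = geglu(normed, w_gate=[[1.0, 0.0], [0.0, 1.0]], w_value=[[0.5, 0.5], [1.0, -1.0]])
print(normed, hidden)
```

Unlike LayerNorm, RMSNorm does not subtract the mean, which removes one reduction per call. GeGLU uses two input projections instead of one, so the hidden width is usually reduced to keep the parameter count comparable.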

System Optimization

data shuffling vs blocked curricula recommendations, entropy-based instruction filtering to reduce tuning data
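The difference between a blocked (per‑source) curriculum and a shuffled stream can be illustrated with a toy example. The source names and document counts below are made up, and the "longest run from one source" statistic is only a simple proxy for how long the model goes without seeing other data; it is not a metric from the paper.

```python
import random
from typing import Dict, List, Tuple

Doc = Tuple[str, str]  # (source name, document text)


def blocked_order(sources: Dict[str, List[str]]) -> List[Doc]:
    """Feed every document of one source before moving to the next source."""
    return [(name, doc) for name, docs in sources.items() for doc in docs]


def shuffled_order(sources: Dict[str, List[str]], seed: int = 0) -> List[Doc]:
    """Mix all sources into a single uniformly shuffled stream."""
    mixed = blocked_order(sources)
    random.Random(seed).shuffle(mixed)
    return mixed


def longest_single_source_run(order: List[Doc]) -> int:
    """Length of the longest stretch drawn from a single source."""
    longest = current = 0
    previous = None
    for name, _ in order:
        current = current + 1 if name == previous else 1
        previous = name
        longest = max(longest, current)
    return longest


# Hypothetical toy corpus with two sources of equal size.
corpus = {
    "english": [f"en-{i}" for i in range(50)],
    "chinese": [f"zh-{i}" for i in range(50)],
}
print(longest_single_source_run(blocked_order(corpus)))   # prints 50
print(longest_single_source_run(shuffled_order(corpus)))  # much shorter run
```

In the blocked order the model sees one source for a long uninterrupted stretch, which is the setting associated with forgetting; shuffling keeps every source present throughout training.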

Training Optimization

bfloat16, AdamW optimizer, warm-up then linear decay schedule, ZeRO stage 1 (DeepSpeed)
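The warm‑up‑then‑linear‑decay schedule can be written as a small function of the step index. This sketch assumes a linear warm‑up (the summary only says "warm‑up") and decay to a floor `min_lr`; the parameter names and example values are hypothetical.

```python
def lr_at_step(
    step: int,
    peak_lr: float,
    warmup_steps: int,
    total_steps: int,
    min_lr: float = 0.0,
) -> float:
    """Learning rate for a 0-indexed step: linear warm-up, then linear decay."""
    if warmup_steps <= 0 or total_steps <= warmup_steps:
        raise ValueError("need 0 < warmup_steps < total_steps")
    if step < warmup_steps:
        # Assumed linear ramp that reaches peak_lr on the last warm-up step.
        return peak_lr * (step + 1) / warmup_steps
    if step >= total_steps:
        return min_lr
    remaining = (total_steps - step) / (total_steps - warmup_steps)
    return min_lr + (peak_lr - min_lr) * remaining


# Hypothetical values: peak 1.0, 10 warm-up steps, 100 steps in total.
for s in (0, 9, 55, 100):
    print(s, lr_at_step(s, peak_lr=1.0, warmup_steps=10, total_steps=100))
```

With these example values the rate ramps from 0.1 to 1.0 over the first ten steps, falls to 0.5 at step 55, and reaches the floor at step 100.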

Inference Optimization

FlashAttention-2 to reduce memory and ops

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Partial

License: Unknown

Data URLs

Pile (public), WuDao (open portion), CBook (public link)

Risks & Boundaries

Limitations

Limited model capacity: 1–3B models trade off general reasoning and breadth for targeted strengths.

Some capabilities (complex reasoning, general mathematical problem solving) still favor models larger than 7B on average.

When Not To Use

If you need top-tier general reasoning or broad emergent abilities that appear at larger scales (>7B).

When you require guaranteed best performance on long-form code synthesis or advanced multi-step program execution.

Failure Modes

Blocked (per-source) curricula during pretraining cause forgetting, seen as performance drops later in training.

Instruction tuning on huge, diverse datasets can reduce few-shot imitation ability and sometimes lower few-shot accuracy.

Core Entities

Models

MindLLM-1.3B, MindLLM-3B, GPT-Neo-1.3B, GPT-Neo-2.7B, GPT-J-6B, MPT-7B, Falcon-7B, Bloom-3B, Bloom-7B, MOSS-Base-16B, Open-LLaMA-3B, Open-LLaMA-7B, Baichuan2-7B

Metrics

Accuracy, perplexity, arithmetic score, Elo rating

Datasets

Pile, WuDao, CBook, Wanjuan (exam), Tulu (instruction), MingLi (Chinese instruction), MOSS instruction set, EastMoney finance corpus

Benchmarks

MMLU, AGIEval (English), C-Eval, CMMLU, Arithmetic, GSM8K, MATH, Flores-101

