MindLLM: 1.3B and 3B bilingual LLMs trained from scratch that match larger open models on several benchmarks

October 24, 2023 · 9 min

Overview

Decision Snapshot: Ready For Pilot

The paper gives concrete recipes (data mix, shuffling, training ops) and multi-benchmark results, but full code and model checkpoints are not yet public; expect to reproduce with moderate engineering effort.

Citations: 3

Evidence Strength: 0.75

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 6/6

Reproducibility

Status: No open assets linked

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 50%

Novelty: 40%

Authors

Yizhe Yang, Huashan Sun, Jiawei Li, Runheng Liu, Yinghao Li, Yuhang Liu, Heyan Huang, Yang Gao


Why It Matters For Business

On some English, Chinese, and domain tasks, 1–3B models can get close to the performance of larger open models. That cuts training and deployment cost while keeping the option to adapt to law or finance through targeted fine‑tuning.

Who Should Care

Summary TLDR

MindLLM presents two lightweight bilingual language models (1.3B and 3.1B parameters) trained from scratch on mixed Chinese/English corpora (323B and 500B tokens). Using architecture and training optimizations (RoPE, RMSNorm, DeepNorm, GeGLU, FlashAttention-2, DeepSpeed), the models match or beat some larger open models on standard benchmarks (MMLU, C‑Eval, arithmetic tasks). Key practical findings: (1) balanced bilingual pretraining from scratch is better than monolingual then transfer for these scales, (2) data mix and shuffling matter for stable learning, and (3) small, high‑quality instruction sets help lightweight models more than huge diverse instruction corpora. The paper also shows S

Problem Statement

Large, general LLMs are costly and often overkill for domain tasks. The paper asks whether smaller bilingual models trained from scratch can be cheaper to train and deploy yet still competitive on standard benchmarks and useful in domains like law and finance.

Main Contribution

Design and release of MindLLM-1.3B and MindLLM-3B: bilingual models trained from scratch on mixed Chinese/English data (323B and 500B tokens).

Practical training recipe: data cleaning, deduplication, recommended mix ratios, and training ops (RoPE, RMSNorm, DeepNorm, GeGLU, FlashAttention-2, DeepSpeed).

Key Findings

MindLLM-1.3B outperforms GPT-Neo-1.3B on English MMLU in few-shot evaluation.

Numbers: MMLU 26.6 vs 24.1 (Table 5 / Table 7)

Practical Use: For budget‑limited English tasks, train a bilingual small model with mixed data rather than rely on older monolingual 1.3B checkpoints.

Evidence Ref: Table 5, Table 7

MindLLM-3B outscores much larger models on arithmetic.

Numbers: Arithmetic 44.75 (MindLLM-3B) vs 28.26 (MPT-7B) and 37.82 (MOSS-16B) (Table 9)

Practical Use: Targeted pretraining mix (more math/code) can make a 3B model strong on arithmetic; use data mix to focus capacity instead of scaling up.

Evidence Ref: Table 9

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
MMLU (few-shot) | MindLLM-1.3B 26.6 | GPT-Neo-1.3B 24.1 | +2.5 | MMLU 5-shot | MindLLM-1.3B beats GPT-Neo-1.3B on MMLU 5-shot | Table 5, Table 7
Arithmetic | MindLLM-3B 44.75 | MPT-7B 28.26 | +16.49 | Arithmetic 5-shot | MindLLM-3B strong on arithmetic tasks | Table 9
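
The deltas can be re-derived from the reported scores. The snippet below is a small check, not part of the paper; the dictionary layout and the `deltas` helper are illustrative names, and the numbers are copied from the Key Findings and Results entries above.

```python
# Scores as reported in this summary (Table 5 / Table 7 and Table 9 of the paper).
scores = {
    "MMLU 5-shot": {
        "model": ("MindLLM-1.3B", 26.6),
        "baselines": {"GPT-Neo-1.3B": 24.1},
    },
    "Arithmetic": {
        "model": ("MindLLM-3B", 44.75),
        "baselines": {"MPT-7B": 28.26, "MOSS-16B": 37.82},
    },
}


def deltas(entry):
    """Return model-minus-baseline gaps in points, rounded to 2 decimals."""
    _, model_score = entry["model"]
    return {
        name: round(model_score - base, 2)
        for name, base in entry["baselines"].items()
    }


for bench, entry in scores.items():
    print(bench, deltas(entry))
```

The same subtraction gives the gap to MOSS-16B, which the Results rows do not list as its own delta.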

What To Try In 7 Days

Train or fine‑tune a 1–3B model on a small, balanced bilingual corpus rather than pretraining on one language and transferring to the other afterwards.

Build a 50k entropy‑filtered instruction set (score each example by the pretrained model's cross‑entropy) and test instruction tuning before scaling to millions of examples.

For domain classification, try rationale (chain-of-thought, CoT) distillation from a larger model to boost small‑model accuracy quickly.
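
The entropy-filtering step in the list above could look roughly like the sketch below. It assumes you already have, for each instruction example, the probabilities the pretrained model assigns to the reference tokens. This summary does not state which cross-entropy range the paper keeps, so `low`, `high`, and the function names are hypothetical placeholders to tune on your own data.

```python
import math


def mean_cross_entropy(token_probs):
    """Average negative log-probability of the reference tokens."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def entropy_filter(examples, low, high, max_examples):
    """Keep examples whose score lies in [low, high], capped at max_examples.

    `examples` is a list of (example_id, token_probs) pairs.
    `low` and `high` are hypothetical thresholds, not values from the paper.
    """
    scored = [(ex_id, mean_cross_entropy(probs)) for ex_id, probs in examples]
    kept = [(ex_id, score) for ex_id, score in scored if low <= score <= high]
    kept.sort(key=lambda pair: pair[1])  # lowest cross-entropy first
    return kept[:max_examples]


# Toy data: three examples with made-up token probabilities.
toy = [
    ("easy", [0.9, 0.95, 0.9]),
    ("medium", [0.4, 0.5, 0.3]),
    ("hard", [0.01, 0.02, 0.05]),
]
selected = entropy_filter(toy, low=0.5, high=2.0, max_examples=50_000)
print([ex_id for ex_id, _ in selected])
```

With these toy numbers only the middle example falls inside the band. In practice the scores would come from a forward pass of the pretrained model over each instruction-response pair.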

Agent Features

Tool Use: DeepSpeed, FlashAttention-2
Frameworks: Hugging Face Transformers, DeepSpeed, lm-evaluation-harness, OpenCompass
Architectures: decoder-only Transformer, RoPE (rotary positional embeddings), GeGLU feed-forward
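
As a reminder of what RoPE does, here is a minimal pure-Python sketch. It uses the interleaved-pair layout and base 10000 from the original rotary embedding formulation; both are assumptions here, since this summary does not say which layout or base MindLLM uses. The function names are illustrative.

```python
import math


def rope(vec, position, base=10000.0):
    """Rotate each pair (vec[i], vec[i+1]) by angle position * base**(-i/d)."""
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("RoPE needs an even head dimension")
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.25]

# The attention score depends only on the relative offset (here 2 in both cases).
print(dot(rope(q, 5), rope(k, 3)))
print(dot(rope(q, 12), rope(k, 10)))
```

The two printed scores agree up to floating-point error, which is the property that makes rotary embeddings encode relative position.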

Optimization Features

Token Efficiency: SentencePiece BPE tokenizer (vocab 125,700)
Infra Optimization: DeepSpeed with ZeRO stage 1 for memory/training trade-offs
Model Optimization: RoPE positional encoding, RMSNorm, DeepNorm, GeGLU activation, FlashAttention-2
System Optimization: data shuffling vs blocked curricula recommendations, entropy-based instruction filtering to reduce tuning data
Training Optimization: bfloat16, AdamW optimizer, warm-up then linear decay schedule, ZeRO stage 1 (DeepSpeed)
Inference Optimization: FlashAttention-2 to reduce memory and ops
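
A DeepSpeed configuration matching the listed training settings (bfloat16, AdamW, warm-up then linear decay, ZeRO stage 1) might look like the sketch below. Only those four choices come from this summary. Every numeric value (batch sizes, learning rate, betas, weight decay, step counts) is a hypothetical placeholder, not a figure from the paper. FlashAttention-2 is enabled in the model code rather than in this config.

```python
import json

ds_config = {
    "train_micro_batch_size_per_gpu": 8,  # hypothetical
    "gradient_accumulation_steps": 4,     # hypothetical
    "bf16": {"enabled": True},            # bfloat16 training
    "zero_optimization": {"stage": 1},    # ZeRO stage 1: shard optimizer state
    "optimizer": {
        "type": "AdamW",
        # lr, betas, and weight_decay are placeholders
        "params": {"lr": 3e-4, "betas": [0.9, 0.95], "weight_decay": 0.1},
    },
    "scheduler": {
        # WarmupDecayLR warms up, then decays linearly over the remaining steps
        "type": "WarmupDecayLR",
        "params": {
            "warmup_min_lr": 0.0,
            "warmup_max_lr": 3e-4,      # placeholder, matches optimizer lr
            "warmup_num_steps": 2000,   # placeholder
            "total_num_steps": 100000,  # placeholder
            "warmup_type": "linear",
        },
    },
}

print(json.dumps(ds_config, indent=2))
```

The dictionary can be written to a JSON file or passed as the `config` argument to `deepspeed.initialize`.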

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Partial
License: Unknown

Data URLs

Pile (public), WuDao (open portion), CBook (public link)

Risks & Boundaries

Limitations

Limited model capacity: 1–3B models trade off general reasoning and breadth for targeted strengths.

Some capabilities (complex reasoning, full math) still favor larger >7B models on average.

When Not To Use

If you need top-tier general reasoning or broad emergent abilities that appear at larger scales (>7B).

When you require guaranteed best performance on long-form code synthesis or advanced multi-step program execution.

Failure Modes

Blocked (per-source) curricula during pretraining cause forgetting, which shows up as later performance drops.

Instruction tuning on huge, diverse datasets can reduce few-shot imitation ability and sometimes lower few-shot accuracy.
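
The difference between a blocked curriculum and a shuffled mix can be illustrated with a toy data stream. This is a sketch under simple assumptions: the source names and sizes are made up, and a single global shuffle stands in for whatever mixing scheme the paper actually uses.

```python
import random


def blocked_order(sources):
    """Concatenate sources one after another (per-source curriculum)."""
    return [(name, doc) for name, docs in sources.items() for doc in docs]


def shuffled_order(sources, seed=0):
    """Mix all sources into one stream with a seeded global shuffle."""
    stream = blocked_order(sources)
    random.Random(seed).shuffle(stream)
    return stream


def longest_single_source_run(stream):
    """Length of the longest stretch drawn from a single source."""
    best = run = 0
    prev = None
    for name, _ in stream:
        run = run + 1 if name == prev else 1
        best = max(best, run)
        prev = name
    return best


# Hypothetical sources with 1000 documents each.
sources = {
    "english": list(range(1000)),
    "chinese": list(range(1000)),
    "code": list(range(1000)),
}

print(longest_single_source_run(blocked_order(sources)))
print(longest_single_source_run(shuffled_order(sources)))
```

In the blocked stream the model sees one source for 1000 consecutive documents, which is the regime where earlier sources can be forgotten. The shuffled stream keeps every source present throughout training.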

Core Entities

Models

MindLLM-1.3B, MindLLM-3B, GPT-Neo-1.3B, GPT-Neo-2.7B, GPT-J-6B, MPT-7B, Falcon-7B, Bloom-3B, Bloom-7B, MOSS-Base-16B, Open-LLaMA-3B, Open-LLaMA-7B, Baichuan2-7B

Metrics

Accuracy, perplexity, arithmetic score, Elo rating

Datasets

Pile, WuDao, CBook, Wanjuan (exam), Tulu (instruction), MingLi (Chinese instruction), MOSS instruction set, EastMoney finance corpus

Benchmarks

MMLU, AGIEval (English), C-Eval, CMMLU, Arithmetic, GSM8K, MATH, Flores-101