Overview
Production Readiness
0.5
Novelty Score
0.4
Cost Impact Score
0.7
Citation Count
3
Why It Matters For Business
You can get near state-of-the-art performance on some English, Chinese, and domain tasks with 1–3B models, cutting training and deployment cost while keeping the ability to adapt to law or finance via targeted fine‑tuning.
Summary TLDR
MindLLM presents two lightweight bilingual language models (1.3B and 3.1B parameters) trained from scratch on mixed Chinese/English corpora (323B and 500B tokens). Using architecture and training optimizations (RoPE, RMSNorm, DeepNorm, GeGLU, FlashAttention-2, DeepSpeed), the models match or beat some larger open models on standard benchmarks (MMLU, C‑Eval, arithmetic tasks). Key practical findings: (1) at these scales, balanced bilingual pretraining from scratch works better than monolingual pretraining followed by transfer, (2) data mix and shuffling matter for stable learning, and (3) small, high‑quality instruction sets help lightweight models more than huge, diverse instruction corpora. The paper also shows S
Problem Statement
Large, general LLMs are costly and often overkill for domain tasks. The paper asks whether smaller bilingual models trained from scratch can be cheaper to train and deploy yet still competitive on standard benchmarks and useful in domains like law and finance.
Main Contribution
Design and release of MindLLM-1.3B and MindLLM-3B: bilingual models trained from scratch on mixed Chinese/English data (323B and 500B tokens).
Practical training recipe: data cleaning, deduplication, recommended mix ratios, and training ops (RoPE, RMSNorm, DeepNorm, GeGLU, FlashAttention-2, DeepSpeed).
Systematic evaluation across standard benchmarks (MMLU, AGIEval, C-Eval, CMMLU) and capability tests (math, reasoning, bilingual alignment).
Instruction‑tuning analysis showing small, high‑quality instruction subsets beat very large, diverse instruction corpora for these lightweight models.
Domain experiments showing strong results in law and finance after supervised fine-tuning (SFT) and chain-of-thought (CoT) style distillation.
Key Findings
MindLLM-1.3B outperforms GPT-Neo-1.3B on English MMLU in few-shot evaluation.
MindLLM-3B matches or beats some much larger models on arithmetic tasks.
Instruction tuning with a small, entropy‑filtered subset (50k) improves lightweight models more than tuning on million‑sample diverse corpora.
Shuffled (uniformly mixed) pretraining yields more stable loss and better long‑term performance than Blocked (per-source grouped) training.
Training bilingual models from scratch on a balanced mix beats monolingual pretraining followed by transfer for models ≤7B.
Chain‑of‑thought (CoT) style distillation notably boosts small models in finance classification.
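The Shuffled and Blocked orderings named in the findings above can be illustrated with a minimal sketch. The source names and document counts are hypothetical, and this is not the paper's data pipeline; it only shows how the two orderings differ in how long the model sees a single source in a row.

```python
import random

def blocked_order(sources):
    """Concatenate sources one after another (per-source grouped)."""
    order = []
    for name, docs in sources.items():
        order.extend((name, doc) for doc in docs)
    return order

def shuffled_order(sources, seed=0):
    """Uniformly mix documents from all sources into one stream."""
    order = blocked_order(sources)
    random.Random(seed).shuffle(order)
    return order

def longest_run(order):
    """Length of the longest stretch drawn from a single source."""
    best = run = 0
    prev = None
    for name, _ in order:
        run = run + 1 if name == prev else 1
        best = max(best, run)
        prev = name
    return best

# Hypothetical toy corpus: two sources with 50 documents each.
sources = {
    "english": [f"en-{i}" for i in range(50)],
    "chinese": [f"zh-{i}" for i in range(50)],
}
blocked = blocked_order(sources)
shuffled = shuffled_order(sources)
```

In the blocked stream the longest single-source run equals the size of a whole source, while the shuffled stream interleaves sources throughout. That long single-source stretch is the condition under which the summary reports forgetting.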
Results
MMLU (few-shot)
Arithmetic
C-Eval (few-shot)
Instruction tuning benefit (small curated set)
SFT
Law domain Elo
Who Should Care
What To Try In 7 Days
Train or fine‑tune a 1–3B model on a small, balanced bilingual corpus rather than pretraining on one language and then transferring to the other.
Build a 50k entropy‑filtered instruction set (score each example by the pretrained model's cross‑entropy) and test instruction tuning on it before scaling to millions of examples.
For domain classification, try distilling rationales (CoT) from a larger model to boost small‑model accuracy quickly.
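A minimal sketch of the entropy-filtering step, assuming you already have per-token log-probabilities for each instruction example from the pretrained model. The function names, the `[low, high]` band, and the choice to keep the lowest-scoring examples within the band are illustrative assumptions, not the paper's exact selection rule.

```python
def mean_cross_entropy(token_logprobs):
    """Average negative log-likelihood per token, in nats."""
    if not token_logprobs:
        raise ValueError("example has no tokens")
    return -sum(token_logprobs) / len(token_logprobs)

def filter_by_entropy(examples, low, high, limit):
    """Keep up to `limit` examples whose score lies in [low, high].

    `examples` is a list of (text, token_logprobs) pairs. The band
    and the sort order are assumptions made for illustration.
    """
    scored = [(mean_cross_entropy(lp), text) for text, lp in examples]
    kept = [(score, text) for score, text in scored if low <= score <= high]
    kept.sort(key=lambda pair: pair[0])
    return [text for _, text in kept[:limit]]

# Hypothetical toy data: log-probabilities are made up.
examples = [
    ("easy", [-0.1, -0.2]),
    ("medium", [-1.0, -1.4]),
    ("hard", [-3.0, -5.0]),
]
selected = filter_by_entropy(examples, low=0.5, high=2.0, limit=50_000)
```

In practice the log-probabilities would come from a forward pass of the base model over each example, and the band would be tuned on a held-out few-shot benchmark.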
Agent Features
Tool Use
- DeepSpeed
- FlashAttention-2
Frameworks
- Hugging Face Transformers
- DeepSpeed
- lm-evaluation-harness
- OpenCompass
Architectures
- decoder-only Transformer
- RoPE (rotary positional embeddings)
- GeGLU feed-forward
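A minimal sketch of rotary positional embeddings on a single vector, using the interleaved-pair convention and the commonly used base of 10000. Both choices are assumptions here; the summary does not state which RoPE variant or base MindLLM uses.

```python
import math

def rope(vec, position, base=10000.0):
    """Rotate consecutive pairs of `vec` by position-dependent angles."""
    d = len(vec)
    if d % 2:
        raise ValueError("dimension must be even")
    out = []
    for i in range(0, d, 2):
        # Pair k = i / 2 uses frequency base ** (-2k / d) = base ** (-i / d).
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

# Hypothetical query and key vectors.
q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]
score_near = dot(rope(q, 3), rope(k, 1))
score_far = dot(rope(q, 7), rope(k, 5))
```

Both scores use the same offset of 2 between query and key positions, so they agree up to floating-point error. This relative-position property is the reason RoPE is applied to queries and keys before the attention dot product.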
Optimization Features
Token Efficiency
- SentencePiece BPE tokenizer (vocab 125,700)
Infra Optimization
- DeepSpeed with ZeRO stage 1 for memory/training trade-offs
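A minimal sketch of a DeepSpeed configuration that enables ZeRO stage 1 with bfloat16, written as a Python dictionary that can be dumped to JSON. The batch-size and accumulation values are hypothetical placeholders, not the paper's settings.

```python
import json

ds_config = {
    # Hypothetical values; the summary does not report them.
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 8,
    # bfloat16 mixed precision, as listed under training optimizations.
    "bf16": {"enabled": True},
    # ZeRO stage 1 partitions optimizer states across data-parallel ranks.
    "zero_optimization": {"stage": 1},
}

config_json = json.dumps(ds_config, indent=2)
```

Stage 1 leaves gradients and parameters replicated, so it saves less memory than stages 2 and 3 but adds less communication, which tends to suit models small enough to fit on one device.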
Model Optimization
- RoPE positional encoding
- RMSNorm
- DeepNorm
- GeGLU activation
- FlashAttention-2
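Two of the listed components, RMSNorm and GeGLU, are small enough to sketch in plain Python. This uses the standard definitions (RMSNorm divides by the root mean square and applies a learned gain; GeGLU multiplies a GELU-activated projection by a second linear projection). The weight shapes and the `eps` value are illustrative assumptions, and DeepNorm and FlashAttention-2 are not covered.

```python
import math

def rms_norm(x, gain, eps=1e-6):
    """Scale x by the inverse of its root mean square, then by gain."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for v, g in zip(x, gain)]

def gelu(v):
    """Exact GELU via the Gaussian CDF."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))

def matvec(matrix, x):
    """Multiply a row-major matrix by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in matrix]

def geglu(x, w_gate, w_value):
    """GeGLU: GELU(W_gate x) multiplied elementwise by (W_value x)."""
    gate = [gelu(v) for v in matvec(w_gate, x)]
    value = matvec(w_value, x)
    return [g * v for g, v in zip(gate, value)]

# Hypothetical inputs and weights.
normed = rms_norm([3.0, 4.0], gain=[1.0, 1.0])
hidden = geglu([1.0, 0.0], w_gate=[[0.0, 0.0]], w_value=[[1.0, 1.0]])
```

Unlike LayerNorm, RMSNorm does not subtract the mean, which removes one reduction per token.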
System Optimization
- data shuffling vs blocked curricula recommendations
- entropy-based instruction filtering to reduce tuning data
Training Optimization
- bfloat16
- AdamW optimizer
- warm-up then linear decay schedule
- ZeRO stage 1 (DeepSpeed)
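The warm-up then linear decay schedule can be sketched as a pure function of the step. Linear warm-up, the `final_lr` floor, and all numeric values are assumptions; the summary only states that warm-up is followed by linear decay.

```python
def lr_at(step, peak_lr, warmup_steps, total_steps, final_lr=0.0):
    """Learning rate at `step`: linear warm-up, then linear decay."""
    if total_steps <= warmup_steps:
        raise ValueError("total_steps must exceed warmup_steps")
    if step < warmup_steps:
        # Ramp from peak_lr / warmup_steps up to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    if step >= total_steps:
        return final_lr
    # Fraction of the decay phase completed so far.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr + (final_lr - peak_lr) * progress

# Hypothetical schedule: 10 warm-up steps, 110 steps in total.
schedule = [lr_at(s, 1e-3, 10, 110) for s in range(111)]
```

The rate peaks at the end of warm-up and reaches `final_lr` at `total_steps`.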
Inference Optimization
- FlashAttention-2 to reduce memory and ops
Reproducibility
Data Urls
- Pile (public)
- WuDao (open portion)
- CBook (public link)
Open Source Status
- partial
Risks & Boundaries
Limitations
- Limited model capacity: 1–3B models trade off general reasoning and breadth for targeted strengths.
- Some capabilities (complex reasoning, full math) still favor larger >7B models on average.
- MindLLM-3B was still being trained when the results were reported; some numbers may change as training continues.
- No full public release of training code or final cleaned corpora reported in paper.
When Not To Use
- If you need top-tier general reasoning or broad emergent abilities that appear at larger scales (>7B).
- When you require guaranteed best performance on long-form code synthesis or advanced multi-step program execution.
- If strict reproducibility is needed and you cannot access the exact cleaned pretraining corpus or training recipe.
Failure Modes
- Forgetting when using Blocked (per-source) curricula during pretraining, seen as later performance drops.
- Instruction tuning on huge, diverse datasets can reduce few-shot imitation ability and sometimes lower few-shot accuracy.
- Capacity competition: bilingual pretraining consumes capacity and may reduce complex reasoning performance.
Core Entities
Models
- MindLLM-1.3B
- MindLLM-3B
- GPT-Neo-1.3B
- GPT-Neo-2.7B
- GPT-J-6B
- MPT-7B
- Falcon-7B
- Bloom-3B
- Bloom-7B
- MOSS-Base-16B
- Open-LLaMA-3B
- Open-LLaMA-7B
- Baichuan2-7B
Metrics
- Accuracy
- perplexity
- arithmetic score
- Elo rating
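Two of these metrics have compact standard definitions, sketched below: perplexity as the exponential of the mean per-token negative log-likelihood, and the classic Elo update after one pairwise comparison. The K-factor of 32 and the starting ratings are assumptions; the summary does not say how the law-domain Elo scores were parameterized.

```python
import math

def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood per token (nats)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def expected_score(rating_a, rating_b):
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def update_elo(rating_a, rating_b, score_a, k=32.0):
    """Return new ratings after one game; score_a is 1, 0.5, or 0."""
    exp_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - exp_a)
    return rating_a + delta, rating_b - delta

# Hypothetical comparison: two models start level and model A wins.
new_a, new_b = update_elo(1000.0, 1000.0, score_a=1.0)
```

Because the update is zero-sum, the total rating across the two models stays constant, so ratings are only meaningful relative to the other models in the same pool.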
Datasets
- Pile
- WuDao
- CBook
- Wanjuan (exam)
- Tulu (instruction)
- MingLi (Chinese instruction)
- MOSS instruction set
- EastMoney finance corpus
Benchmarks
- MMLU
- AGIEval (English)
- C-Eval
- CMMLU
- Arithmetic
- GSM8K
- MATH
- Flores-101

