MindLLM: 1.3B and 3B bilingual LLMs trained from scratch that match larger open models on several benchmarks

October 24, 2023 · 9 min

Overview

Production Readiness

0.5

Novelty Score

0.4

Cost Impact Score

0.7

Citation Count

3

Authors

Yizhe Yang, Huashan Sun, Jiawei Li, Runheng Liu, Yinghao Li, Yuhang Liu, Heyan Huang, Yang Gao

Links

Abstract / PDF

Why It Matters For Business

You can get performance close to that of larger open models on some English, Chinese and domain tasks with 1–3B models, which cuts training and deployment cost while keeping the ability to adapt to law or finance through targeted fine‑tuning.

Summary TLDR

MindLLM presents two lightweight bilingual language models (1.3B and 3.1B parameters) trained from scratch on mixed Chinese/English corpora (323B and 500B tokens, respectively). Using architecture and training optimizations (RoPE, RMSNorm, DeepNorm, GeGLU, FlashAttention-2, DeepSpeed), the models match or beat some larger open models on standard benchmarks (MMLU, C‑Eval, arithmetic tasks). Key practical findings: (1) at these scales, balanced bilingual pretraining from scratch works better than monolingual pretraining followed by transfer, (2) data mix and shuffling matter for stable learning, and (3) small, high‑quality instruction sets help lightweight models more than huge, diverse instruction corpora. The paper also shows S

Problem Statement

Large, general LLMs are costly and often overkill for domain tasks. The paper asks whether smaller bilingual models trained from scratch can be cheaper to train and deploy yet still competitive on standard benchmarks and useful in domains like law and finance.

Main Contribution

Design and release of MindLLM-1.3B and MindLLM-3B: bilingual models trained from scratch on mixed Chinese/English data (323B and 500B tokens).

Practical training recipe: data cleaning, deduplication, recommended mix ratios, and training ops (RoPE, RMSNorm, DeepNorm, GeGLU, FlashAttention-2, DeepSpeed).

Systematic evaluation across standard benchmarks (MMLU, AGIEval, C-Eval, CMMLU) and capability tests (math, reasoning, bilingual alignment).

Instruction‑tuning analysis showing small, high‑quality instruction subsets beat very large, diverse instruction corpora for these lightweight models.

Domain experiments showing strong results in law and finance after supervised fine-tuning and chain-of-thought (COT) style distillation.

Key Findings

MindLLM-1.3B outperforms GPT-Neo-1.3B on English MMLU in few-shot evaluation.

Numbers: MMLU 26.6 vs 24.1 (Table 5 / Table 7)

MindLLM-3B achieves high arithmetic performance compared to much larger models.

Numbers: Arithmetic 44.75 (MindLLM-3B) vs 28.26 (MPT-7B) and 37.82 (MOSS-16B) (Table 9)

Instruction tuning with a small, entropy‑filtered subset (50k) improves lightweight models more than tuning on million‑sample diverse corpora.

Numbers: +1.5% MMLU and up to +1% C-Eval vs tuning on large sets (Section 5.2.1, Tables 13–14)

Shuffled (uniformly mixed) pretraining yields more stable loss and better long‑term performance than Blocked (per-source grouped) training.

Numbers: Shuffled shows steady loss decline and sustained 5‑shot/C‑Eval gains; Blocked shows later drops and forgetting (Section

Training bilingual models from scratch on a balanced mix beats monolingual pretraining followed by transfer for models of 7B parameters or fewer.

Numbers: MindLLM bilingual‑from‑scratch (323B total tokens for 1.3B) shows better bilingual alignment and avoids transfer issues;

Chain‑of‑thought style distillation (COT) notably boosts small models in finance classification.

Numbers: MindLLM-1.3B accuracy: 47.79% with COT vs 19.98% with SFT (+27.8 pp) on the finance test (Table 26)
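As a rough illustration of the difference between plain SFT targets and rationale (COT) distillation targets for a classification task, the sketch below builds both kinds of training example. The prompt template, the field names, and the helper functions are hypothetical and are not taken from the paper; the rationale is assumed to come from a larger teacher model.

```python
# Hypothetical prompt template; the paper's actual prompt is not reproduced here.
PROMPT_TEMPLATE = "Classify the sentiment of this financial text:\n{text}\n"


def build_sft_example(text: str, label: str) -> dict:
    """Plain supervised target: the student learns to emit only the label."""
    return {"prompt": PROMPT_TEMPLATE.format(text=text), "target": label}


def build_cot_example(text: str, teacher_rationale: str, label: str) -> dict:
    """Distillation target: the student learns to emit the teacher's
    rationale first and the label last."""
    target = f"{teacher_rationale.strip()}\nAnswer: {label}"
    return {"prompt": PROMPT_TEMPLATE.format(text=text), "target": target}


def extract_label(generated: str) -> str:
    """Read the label back from a COT-style generation (text after the
    last 'Answer:' marker)."""
    return generated.rsplit("Answer:", 1)[-1].strip()
```

Both builders share the same prompt, so the only variable between the two training runs is whether the target contains the rationale.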

Results

MMLU (few-shot)

Value: MindLLM-1.3B 26.6

Baseline: GPT-Neo-1.3B 24.1

Arithmetic

Value: MindLLM-3B 44.75

Baseline: MPT-7B 28.26

C-Eval (few-shot)

Value: MindLLM-1.3B 26.1

Baseline: Open-LLaMA-7B 25.9

Instruction tuning benefit (small curated set)

Value: MMLU +1.5% (max), C-Eval +1% (max)

Baseline: tuning on large diverse corpora

Finance classification (COT distillation vs SFT)

Value: MindLLM-1.3B with COT 47.79%

Baseline: MindLLM-1.3B with SFT 19.98%

Law domain Elo

Value: MindLLM-Law (3B) Elo 1623

Baseline: ChatGLM2 Elo 2329; Lawyer-Llama Elo 2153
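For readers unfamiliar with how Elo ratings such as those above are produced from pairwise comparisons, here is a minimal sketch of the standard Elo update. The K-factor of 32 and the 1500 starting rating in the usage note are common defaults assumed for illustration; the paper's exact settings are not reproduced here.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float,
               k: float = 32.0) -> tuple:
    """Return updated (rating_a, rating_b) after one comparison.

    score_a is 1.0 if A's answer was preferred, 0.5 for a tie,
    and 0.0 if B's answer was preferred.
    """
    exp_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - exp_a)
    # The update is zero-sum: what A gains, B loses.
    return rating_a + delta, rating_b - delta
```

Starting two models at 1500 and calling `update_elo` once per judged answer pair yields a ranking in which a gap of 400 points corresponds to roughly 10:1 odds of the higher-rated model being preferred.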

Who Should Care

What To Try In 7 Days

Train or fine‑tune a 1–3B model on a small, balanced bilingual corpus rather than pretraining on one language and then transferring to the other.

Build a 50k entropy‑filtered instruction set (score each example by its cross‑entropy under the pretrained model) and test instruction tuning on it before scaling to millions of examples.

For domain classification, try rationale (COT) distillation from a larger model to boost small‑model accuracy quickly.
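The entropy‑filtered instruction set mentioned in the steps above could be prototyped along these lines. This is a sketch under stated assumptions: `score_fn` stands in for a call that returns an example's mean per-token cross‑entropy under the pretrained model, and the selection rule (keep examples inside a score band, closest to the band centre first) is one plausible reading, not the paper's confirmed procedure. The band limits are hypothetical.

```python
from typing import Callable, List, Sequence


def mean_cross_entropy(token_logprobs: Sequence[float]) -> float:
    """Mean negative log-likelihood per token (in nats), given the
    log-probabilities a model assigned to each target token."""
    return -sum(token_logprobs) / len(token_logprobs)


def select_by_entropy(examples: Sequence[str],
                      score_fn: Callable[[str], float],
                      low: float, high: float,
                      max_examples: int) -> List[str]:
    """Keep examples whose score lies in [low, high], preferring those
    nearest the centre of the band, up to max_examples."""
    centre = (low + high) / 2.0
    in_band = [(abs(score_fn(ex) - centre), ex)
               for ex in examples
               if low <= score_fn(ex) <= high]
    in_band.sort(key=lambda pair: pair[0])
    return [ex for _, ex in in_band[:max_examples]]
```

With a real model, `score_fn` would tokenize the example, run a forward pass, and pass the target-token log-probabilities to `mean_cross_entropy`; setting `max_examples=50_000` matches the subset size discussed here.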

Agent Features

Tool Use

  • DeepSpeed
  • FlashAttention-2

Frameworks

  • Hugging Face Transformers
  • DeepSpeed
  • lm-evaluation-harness
  • OpenCompass

Architectures

  • decoder-only Transformer
  • RoPE (rotary positional embeddings)
  • GeGLU feed-forward
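To make the rotary positional embedding entry concrete, the following standard-library sketch rotates consecutive pairs of a query or key vector by a position-dependent angle. It uses the interleaved pairing and the base of 10000 from the original RoPE formulation; whether MindLLM uses this exact pairing or base is an assumption.

```python
import math
from typing import List


def rope(x: List[float], position: int, base: float = 10000.0) -> List[float]:
    """Apply rotary position embedding to one vector of even length.

    Each pair (x[i], x[i+1]) is rotated by position * base**(-i/d),
    so low-index pairs rotate quickly and high-index pairs slowly.
    """
    d = len(x)
    if d % 2 != 0:
        raise ValueError("RoPE needs an even vector length")
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[i], x[i + 1]
        out.extend([a * c - b * s, a * s + b * c])
    return out


def dot(u: List[float], v: List[float]) -> float:
    """Plain dot product, used to form an attention score."""
    return sum(p * q for p, q in zip(u, v))
```

The useful property is that `dot(rope(q, m), rope(k, n))` depends only on the offset `m - n`, which is why attention scores computed this way encode relative position.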

Optimization Features

Token Efficiency

  • SentencePiece BPE tokenizer (vocab 125,700)

Infra Optimization

  • DeepSpeed with ZeRO stage 1 for memory/training trade-offs

Model Optimization

  • RoPE positional encoding
  • RMSNorm
  • DeepNorm
  • GeGLU activation
  • FlashAttention-2
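Two of the components listed above, RMSNorm and the GeGLU feed-forward block, are compact enough to sketch in plain Python. This follows the usual textbook definitions (RMSNorm divides by the root mean square and applies a learned gain; GeGLU multiplies a GELU-activated gate by a linear projection). Weight shapes, the epsilon value, and the absence of bias terms are assumptions, not details confirmed by the source.

```python
import math
from typing import List

Matrix = List[List[float]]


def rms_norm(x: List[float], gain: List[float], eps: float = 1e-6) -> List[float]:
    """Scale x by the inverse of its root mean square, then by a gain.
    Unlike LayerNorm, no mean is subtracted."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for v, g in zip(x, gain)]


def gelu(v: float) -> float:
    """Exact GELU: v times the standard normal CDF at v."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


def matvec(w: Matrix, x: List[float]) -> List[float]:
    """Multiply a row-major matrix by a vector."""
    return [sum(a * b for a, b in zip(row, x)) for row in w]


def geglu_ffn(x: List[float], w_gate: Matrix, w_up: Matrix,
              w_down: Matrix) -> List[float]:
    """GeGLU feed-forward: down( GELU(gate(x)) * up(x) )."""
    gate = [gelu(v) for v in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)
```

A real implementation would use batched tensor operations; the list-based version is only meant to show the arithmetic.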

System Optimization

  • shuffled (uniformly mixed) rather than blocked (per-source) data ordering during pretraining
  • entropy-based instruction filtering to reduce tuning data
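The difference between blocked and shuffled data ordering can be shown with a toy example. In the sketch below, the source names and documents are made up, and real pipelines usually shuffle at the level of tokenized chunks or shards rather than an in-memory list.

```python
import random
from typing import Dict, List


def blocked_order(sources: Dict[str, List[str]]) -> List[str]:
    """Blocked ordering: train on each source to completion before
    moving to the next one."""
    return [doc for docs in sources.values() for doc in docs]


def shuffled_order(sources: Dict[str, List[str]], seed: int = 0) -> List[str]:
    """Shuffled ordering: pool every source and mix uniformly, so each
    batch sees roughly the global data mix."""
    docs = blocked_order(sources)
    random.Random(seed).shuffle(docs)
    return docs


def source_switches(order: List[str], doc_source: Dict[str, str]) -> int:
    """Count how often consecutive documents come from different sources.
    Blocked ordering gives the minimum possible value."""
    return sum(1 for prev, cur in zip(order, order[1:])
               if doc_source[prev] != doc_source[cur])
```

Under blocked ordering the model sees one distribution at a time, which is the setting the digest associates with forgetting; under shuffled ordering every stretch of training reflects the overall mix.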

Training Optimization

  • bfloat16
  • AdamW optimizer
  • warm-up then linear decay schedule
  • ZeRO stage 1 (DeepSpeed)
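The training settings listed above map fairly directly onto a DeepSpeed configuration. The sketch below is not the authors' configuration: every numeric value (batch size, learning rate, betas, weight decay, step counts) is a placeholder chosen for illustration. The keys themselves are standard DeepSpeed config fields, and `lr_at` is a hypothetical helper that shows the warm-up-then-linear-decay shape in plain Python.

```python
import json

TOTAL_STEPS = 100_000   # placeholder, not from the paper
WARMUP_STEPS = 2_000    # placeholder, not from the paper
MAX_LR = 3e-4           # placeholder, not from the paper

ds_config = {
    "train_micro_batch_size_per_gpu": 8,   # placeholder
    "gradient_accumulation_steps": 4,      # placeholder
    "bf16": {"enabled": True},             # bfloat16 training
    "zero_optimization": {"stage": 1},     # shard optimizer states only
    "optimizer": {
        "type": "AdamW",
        "params": {"lr": MAX_LR, "betas": [0.9, 0.95], "weight_decay": 0.1},
    },
    "scheduler": {
        "type": "WarmupDecayLR",           # warm-up, then linear decay
        "params": {
            "warmup_min_lr": 0.0,
            "warmup_max_lr": MAX_LR,
            "warmup_num_steps": WARMUP_STEPS,
            "total_num_steps": TOTAL_STEPS,
        },
    },
}


def lr_at(step: int, max_lr: float, warmup_steps: int, total_steps: int) -> float:
    """Linear warm-up to max_lr, then linear decay to zero at total_steps."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return max_lr * remaining / (total_steps - warmup_steps)


config_json = json.dumps(ds_config, indent=2)  # ready to write to a file
```

The JSON string can be saved and passed to a DeepSpeed launch; ZeRO stage 1 partitions only optimizer states across data-parallel ranks, which keeps communication overhead low for models of this size.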

Inference Optimization

  • FlashAttention-2 to reduce attention memory use and memory traffic

Reproducibility

Data Urls

  • Pile (public)
  • WuDao (open portion)
  • CBook (public link)

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Limited model capacity: 1–3B models trade off general reasoning and breadth for targeted strengths.
  • Some capabilities (complex reasoning, harder math) still favor models larger than 7B on average.
  • MindLLM-3B was still being trained when results were reported; some numbers may change as training continues.
  • The paper does not report a full public release of the training code or the final cleaned corpora.

When Not To Use

  • If you need top-tier general reasoning or broad emergent abilities that appear at larger scales (>7B).
  • When you require guaranteed best performance on long-form code synthesis or advanced multi-step program execution.
  • If strict reproducibility is needed and you cannot access the exact cleaned pretraining corpus or training recipe.

Failure Modes

  • Forgetting when using Blocked (per-source) curricula during pretraining, seen as later performance drops.
  • Instruction tuning on huge, diverse datasets can reduce few-shot imitation ability and sometimes lower few-shot accuracy.
  • Capacity competition: bilingual pretraining consumes capacity and may reduce complex reasoning performance.

Core Entities

Models

  • MindLLM-1.3B
  • MindLLM-3B
  • GPT-Neo-1.3B
  • GPT-Neo-2.7B
  • GPT-J-6B
  • MPT-7B
  • Falcon-7B
  • Bloom-3B
  • Bloom-7B
  • MOSS-Base-16B
  • Open-LLaMA-3B
  • Open-LLaMA-7B
  • Baichuan2-7B

Metrics

  • Accuracy
  • perplexity
  • arithmetic score
  • Elo rating

Datasets

  • Pile
  • WuDao
  • CBook
  • Wanjuan (exam)
  • Tulu (instruction)
  • MingLi (Chinese instruction)
  • MOSS instruction set
  • EastMoney finance corpus

Benchmarks

  • MMLU
  • AGIEval (English)
  • C-Eval
  • CMMLU
  • Arithmetic
  • GSM8K
  • MATH
  • Flores-101