Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.7
Citation Count
82
Why It Matters For Business
The paper gives practical scaling recipes and hyperparameter fits so teams can plan compute, model size, and data investments more predictably; it shows a 67B open model can match or beat larger baselines on code/math when paired with curated bilingual data and alignment.
Summary TLDR
This paper presents DeepSeek LLM, trained from scratch on a bilingual 2 trillion token corpus and built using scaling‑law guided choices. Main technical points: fitted power laws for optimal batch size and learning rate as functions of compute; model scale represented as non‑embedding FLOPs/token (M) to better predict the optimal model/data split; dataset deduplication, filtering and remixing; and a two‑stage alignment pipeline (SFT then DPO). Results: DeepSeek-67B outperforms LLaMA-2-70B on many code and math benchmarks and reaches near‑GPT-3.5 open‑ended chat scores (MT-Bench 8.35 → 8.76 after DPO). Safety evaluations and human annotation show strong refusal/safety behavior (Do-Not-Answer 97.8).
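The DPO stage mentioned above optimizes a preference objective over pairs of chosen and rejected responses. The following is a minimal sketch of the standard per-pair DPO loss, not the paper's training code; the function name, argument names, and the default `beta` are assumptions made for illustration.

```python
import math

def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one preference pair.

    Inputs are summed token log-probabilities of the chosen and rejected
    responses under the policy being trained and under a frozen reference
    model. beta (an assumed default) scales the implicit reward.
    """
    # Implicit reward margin: how much more the policy prefers the chosen
    # response than the reference model does.
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    x = beta * margin
    # -log(sigmoid(x)) written in a numerically stable form.
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))

# Hypothetical log-probabilities, for illustration only.
loss = dpo_loss(-12.0, -15.0, -13.0, -14.0)
```

The loss falls below log(2) once the policy separates the chosen and rejected responses more strongly than the reference model does.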
Problem Statement
Open-source LLM builders lack clear, empirically reliable scaling rules and practical hyperparameter recipes that generalize across datasets and compute budgets. This makes it hard to decide how to split compute between model size and tokens, how to pick batch size and learning rate when scaling, and how data quality affects those choices. The paper seeks practical scaling laws and then uses them to train and align 7B and 67B open models on a 2T bilingual corpus.
Main Contribution
Empirical scaling rules for hyperparameters: fitted power laws that predict near‑optimal batch size (increases with compute) and learning rate (decreases with compute).
A new model‑scale representation: non‑embedding FLOPs/token (M) used instead of raw parameter counts to predict optimal model/data tradeoffs more accurately.
Evidence that pretraining data quality shifts optimal compute allocation: higher quality data favors allocating more compute to model size.
Released training recipe and infrastructure notes: tokenizer, multi‑step LR scheduler, GQA for 67B, ZeRO‑1, flash attention, bf16 training with fp32 gradient accumulation.
Built and evaluated DeepSeek models (7B and 67B), applied SFT and DPO alignment, and ran broad automatic and human evaluations including held‑out tests and safety checks.
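The non‑embedding FLOPs/token representation can be sketched as follows. The expression used here is the common approximation for a standard transformer with a 4x FFN (weight-matrix cost plus attention-over-sequence cost, forward and backward); treat it as an assumption to check against the paper, and note that the function names and the example configuration are hypothetical.

```python
def non_embedding_flops_per_token(n_layer: int, d_model: int, seq_len: int) -> int:
    """Approximate training FLOPs per token, excluding vocabulary/embedding terms.

    72 * n_layer * d_model**2      : attention and FFN weight matrices
    12 * n_layer * d_model * seq_len : attention score and value computation
    """
    return 72 * n_layer * d_model**2 + 12 * n_layer * d_model * seq_len

def compute_budget(flops_per_token: int, n_tokens: int) -> int:
    """Total training compute C = M * D."""
    return flops_per_token * n_tokens

# Hypothetical configuration, for illustration only.
M = non_embedding_flops_per_token(n_layer=30, d_model=4096, seq_len=4096)
C = compute_budget(M, n_tokens=2_000_000_000_000)
```

Unlike a raw parameter count, M grows with sequence length, so two configurations with the same number of parameters can have different compute costs per token.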
Key Findings
Optimal batch size grows and optimal learning rate falls with compute; fitted power‑law relations give near‑optimal hyperparameters across budgets.
Model scale represented as non‑embedding FLOPs/token (M) predicts generalization and optimal model/data split better than parameter counts.
Data quality shifts optimal compute allocation: with higher‑quality data, more of the budget should go to model scale (a larger model‑scaling exponent a).
DeepSeek-67B matches or exceeds LLaMA‑2‑70B on many code, math, and reasoning benchmarks.
DPO alignment improves open‑ended chat scores with little harm to benchmark accuracy.
Human safety evaluation and out‑of‑distribution held‑out tests indicate robust behavior, but gaps remain relative to top proprietary models.
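A power‑law relation between compute and a near‑optimal hyperparameter can be fitted with a straight line in log‑log space. The sketch below assumes you already have the best batch size or learning rate found at a few small compute budgets; the data points are synthetic and the helper names are hypothetical, not the paper's fitted values.

```python
import numpy as np

def fit_power_law(compute, optimum):
    """Fit optimum = k * compute**p by least squares in log-log space.

    Returns the pair (k, p).
    """
    slope, intercept = np.polyfit(np.log(compute), np.log(optimum), deg=1)
    return float(np.exp(intercept)), float(slope)

def predict(k: float, p: float, compute: float) -> float:
    """Extrapolate the fitted relation to a new compute budget."""
    return k * compute**p

# Synthetic measurements (made up for illustration, not from the paper):
# batch size grows with compute, learning rate shrinks.
compute = np.array([1e17, 1e18, 1e19, 1e20])
best_batch = 2.0 * compute**0.3
best_lr = 5.0 * compute**-0.1

k_b, p_b = fit_power_law(compute, best_batch)
k_lr, p_lr = fit_power_law(compute, best_lr)
batch_at_target = predict(k_b, p_b, 1e22)
```

A positive fitted exponent for batch size and a negative one for learning rate would match the trend reported in the findings above. Real measurements are noisy, so it is worth checking the fit residuals before extrapolating far beyond the measured budgets.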
Results
HumanEval (0-shot / base)
GSM8K (8‑shot / base)
MBPP (3‑shot)
MT‑Bench average (chat)
Do‑Not‑Answer safety score
Held‑out LeetCode (pass@1)
Who Should Care
What To Try In 7 Days
Run small iso‑FLOP experiments to fit batch size and LR following the paper's power‑law fits and avoid full grid search.
Compute non‑embedding FLOPs/token (M) for your model configs and use it to compare model/data tradeoffs.
Aggressively deduplicate corpora across dumps and measure the removal rate to improve data quality cheaply (the paper reports that deduplicating across 91 dumps removes 89.8% of documents).
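Measuring a cross‑dump removal rate can start with something as simple as exact‑match hashing. The sketch below is an assumed, simplified stand‑in (whitespace- and case-normalized SHA‑256), not the paper's deduplication pipeline; the function names and the toy documents are hypothetical.

```python
import hashlib

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return " ".join(text.lower().split())

def dedup_across_dumps(dumps):
    """Keep the first occurrence of each document across all dumps.

    dumps is an iterable of iterables of document strings.
    Returns (kept_documents, removal_rate).
    """
    seen = set()
    kept = []
    total = 0
    for dump in dumps:
        for doc in dump:
            total += 1
            digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
            if digest not in seen:
                seen.add(digest)
                kept.append(doc)
    removal_rate = 1 - len(kept) / total if total else 0.0
    return kept, removal_rate

# Toy dumps, for illustration only.
dumps = [["Hello  world", "foo"], ["hello world", "bar", "foo"]]
kept, removal_rate = dedup_across_dumps(dumps)
```

Sharing one `seen` set across all dumps is what makes this cross‑dump rather than per‑dump deduplication. Near‑duplicate detection (for example MinHash) would catch more but needs additional tooling.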
Optimization Features
Token Efficiency
- Byte‑level BPE with 100k conventional tokens; pre‑tokenization keeps tokens from different character categories (e.g., CJK symbols) from merging
- Numbers split into individual digits to control vocabulary
Infra Optimization
- HAI‑LLM training framework integrating multiple parallelism strategies
- Asynchronous, frequent checkpointing (every 5 minutes) to limit lost work
Model Optimization
- Grouped‑Query Attention (GQA) for 67B to reduce inference cost
- Increased depth rather than a wider FFN for the 67B model
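One way to see why GQA reduces inference cost is to estimate the key/value cache size, which scales with the number of KV heads rather than the number of query heads. The sketch below is a back‑of‑the‑envelope calculation; the function name and every configuration value are assumptions for illustration, not the published 67B configuration.

```python
def kv_cache_bytes(n_layer: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int = 1,
                   bytes_per_value: int = 2) -> int:
    """Bytes needed to cache keys and values for one forward pass.

    The leading factor 2 accounts for storing both K and V.
    bytes_per_value=2 assumes a 16-bit cache (e.g., bf16).
    """
    return 2 * n_layer * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_value

# Hypothetical configuration: 64 query heads, compared with 8 shared KV heads.
common = dict(n_layer=80, head_dim=128, seq_len=4096)
mha_bytes = kv_cache_bytes(n_kv_heads=64, **common)  # one KV head per query head
gqa_bytes = kv_cache_bytes(n_kv_heads=8, **common)   # query heads share KV heads
reduction = mha_bytes / gqa_bytes
```

Under these assumed values the cache shrinks by the ratio of query heads to KV heads, which leaves more memory for longer contexts or larger batches.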
System Optimization
- Data/tensor/sequence parallelism and 1F1B pipeline
- Overlap of communication and computation
Training Optimization
- Multi‑step learning rate scheduler (warmup, then three constant stages covering 80%, 10%, and 10% of training, with a step decay at each boundary)
- ZeRO‑1 optimizer state partitioning
- bf16 training with fp32 gradient accumulation
- Fused LayerNorm/GEMM/Adam updates to speed up training
- First training stage can be reused to support continual training
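The multi‑step schedule in the list above can be sketched as a piecewise‑constant function of training progress. The stage boundaries follow the 80%/10%/10% split stated here; the warmup length and the decay factors are assumptions for illustration, and steps are used as a proxy for tokens under a constant batch size.

```python
def multi_step_lr(step: int, total_steps: int, max_lr: float,
                  warmup_steps: int = 2000,
                  boundaries=(0.8, 0.9),
                  factors=(1.0, 0.316, 0.1)) -> float:
    """Learning rate at a given step.

    Linear warmup to max_lr, then a constant rate per stage. Stage i
    applies max_lr * factors[i]; boundaries are fractions of total_steps.
    warmup_steps and factors are assumed values, not confirmed settings.
    """
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = step / total_steps
    # Count how many stage boundaries have been crossed so far.
    stage = sum(progress >= b for b in boundaries)
    return max_lr * factors[stage]

# Sample the schedule at a few points of a hypothetical 100k-step run.
schedule = [multi_step_lr(s, 100_000, 3e-4) for s in (0, 50_000, 85_000, 95_000)]
```

Because the first stage runs at a constant rate, a checkpoint taken before the first decay can serve as the starting point for a longer run, which is the property that makes continual training convenient with this kind of schedule.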
Inference Optimization
- vLLM for generative tasks
- Flash attention to improve hardware utilization
Reproducibility
Open Source Status
- partial
Risks & Boundaries
Limitations
- Training data is not released, so reproducibility is limited until code and data are published.
- Results rely on a large proprietary 2T bilingual corpus; gains may not transfer to different languages or smaller datasets.
- Held‑out tests show gap vs top proprietary models on hard coding contest problems (LeetCode).
- Some benchmark improvements may be format‑dependent (e.g., adding multiple‑choice data can overfit MC tasks).
When Not To Use
- When you need immediate, fully open reproducibility (data and code not yet released).
- If your deployment budget cannot support heavy pretraining or large token budgets.
- When your target language is outside Chinese/English — the model is optimized for bilingual data.
Failure Modes
- Overfitting to multiple‑choice style data if such data is included in pretraining or SFT.
- Repetitive outputs become more frequent when excessive math/code SFT data is used on weaker models.
- Underperformance on fine‑grained or language‑specific tasks outside training distribution.
Core Entities
Models
- DeepSeek LLM 7B
- DeepSeek LLM 67B
- DeepSeek LLM 67B Chat
- DeepSeek LLM 67B Chat DPO
- LLaMA‑2 7B
- LLaMA‑2 70B
- GPT‑3.5
- GPT‑4
Metrics
- Accuracy
- pass@1
- HumanEval pass@1
- bits‑per‑byte (BPB)
- MT‑Bench average score
- Do‑Not‑Answer score
- safety pass counts
Datasets
- DeepSeek 2T bilingual corpus (proprietary)
- Common Crawl
- OpenWebText2
- Pile
- GSM8K
- HumanEval
- MBPP
- MMLU
- C‑Eval
- AlignBench
- MT‑Bench
- Do‑Not‑Answer
Benchmarks
- HumanEval
- MBPP
- GSM8K
- MATH
- MMLU
- BBH
- AGIEval
- MT‑Bench
- AlignBench
- LeetCode Weekly Contests (held‑out)
- Do‑Not‑Answer
Context Entities
Models
- CodeLlama
- StarCoder
- Codex
- ChatGLM3
- Baichuan2
- Qwen
Metrics
- Do‑Not‑Answer
- AlignBench score
Datasets
- RedPajama (cited)
- OpenWebText2 (cited)
Benchmarks
- AlignBench (Chinese open‑ended)
- MT‑Bench (English open‑ended)

