Overview
The paper offers practical, tested scaling and hyperparameter recipes and real model results; expect good guidance for planning compute and data but limited reproducibility until code/data artifacts are published.
Citations: 82
Evidence Strength: 0.70
Confidence: 0.78
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 6/6
Findings with evidence refs: 6/6
Results with explicit delta: 6/6
Reproducibility
Status: No open assets linked
Open source: Partial
At A Glance
Cost impact: 70%
Production readiness: 70%
Novelty: 60%
Why It Matters For Business
The paper gives practical scaling recipes and hyperparameter fits so teams can plan compute, model size, and data investments more predictably; it shows a 67B open model can match or beat larger baselines on code/math when paired with curated bilingual data and alignment.
Summary TLDR
This paper presents DeepSeek LLM, trained from scratch on a bilingual 2 trillion token corpus and built using scaling‑law guided choices. Main technical points: fitted power laws that give the optimal batch size and learning rate as functions of compute; model scale represented as non‑embedding FLOPs/token (M) to better predict the optimal model/data split; dataset deduplication, filtering and remixing; and a two‑stage alignment pipeline (SFT then DPO). Results: DeepSeek-67B outperforms LLaMA-2-70B on many code and math benchmarks and reaches near‑GPT-3.5 open‑ended chat scores (MT-Bench 8.35 → 8.76 after DPO). Safety evaluations and human annotation show strong refusal/safety behavior (Do-Not-Answer 97.8).
Problem Statement
Open-source LLM builders lack clear, empirically reliable scaling rules and practical hyperparameter recipes that generalize across datasets and compute budgets. This makes it hard to decide how to split compute between model size and tokens, how to pick batch size and learning rate when scaling, and how data quality affects those choices. The paper seeks practical scaling laws and then uses them to train and align 7B and 67B open models on a 2T bilingual corpus.
Main Contribution
Empirical scaling rules for hyperparameters: fitted power laws that predict near‑optimal batch size (increases with compute) and learning rate (decreases with compute).
A new model‑scale representation: non‑embedding FLOPs/token (M) used instead of raw parameter counts to predict optimal model/data tradeoffs more accurately.
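To make the second contribution concrete, here is a minimal sketch of computing M for a transformer config. It assumes the approximation M = 72 · n_layer · d_model² + 12 · n_layer · d_model · l_seq, with total compute C = M · D for D training tokens. This summary does not state the formula, so treat it as an assumption to check against the paper. The function names and the example config are hypothetical.

```python
def non_embedding_flops_per_token(n_layer: int, d_model: int, seq_len: int) -> int:
    """Approximate training FLOPs per token, excluding embedding/vocabulary terms.

    Assumed form (verify against the paper):
      72 * n_layer * d_model^2          -> dense projections and MLP
      12 * n_layer * d_model * seq_len  -> attention over the context window
    """
    dense = 72 * n_layer * d_model ** 2
    attention = 12 * n_layer * d_model * seq_len
    return dense + attention


def tokens_for_budget(compute_budget: float, flops_per_token: int) -> float:
    """Training tokens D affordable under budget C, from C = M * D."""
    return compute_budget / flops_per_token


# Hypothetical config, chosen only for illustration.
m = non_embedding_flops_per_token(n_layer=30, d_model=4096, seq_len=4096)
d = tokens_for_budget(compute_budget=1e23, flops_per_token=m)
print(f"M = {m:.3e} FLOPs/token, D = {d:.3e} tokens")
```

Unlike a raw parameter count, M grows with sequence length, so two configs with equal parameters but different context windows get different scale values.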
Key Findings
Optimal batch size grows and optimal learning rate falls with compute; fitted power‑law relations give near‑optimal hyperparameters across budgets.
Model scale represented as non‑embedding FLOPs/token (M) predicts generalization and optimal model/data split better than parameter counts.
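The first finding can be sketched as a log-log linear fit. The snippet below fits optimum ≈ a · C^b with NumPy and extrapolates to a larger budget. The data points and coefficients are synthetic placeholders, not the paper's measurements or its fitted values; `fit_power_law` and `predict` are hypothetical helper names.

```python
import numpy as np


def fit_power_law(compute, optimum):
    """Fit optimum ~= a * compute**b by least squares in log-log space.

    Returns (a, b). In log space the model is linear:
    log(optimum) = log(a) + b * log(compute).
    """
    slope, intercept = np.polyfit(np.log(compute), np.log(optimum), deg=1)
    return float(np.exp(intercept)), float(slope)


def predict(a, b, compute):
    """Evaluate the fitted power law at a new compute budget."""
    return a * compute ** b


# Synthetic (compute, best hyperparameter) pairs standing in for the
# results of small-scale sweeps. These numbers are made up.
compute = np.array([1e17, 1e18, 1e19, 1e20])
best_batch = 0.5 * compute ** 0.3    # grows with compute
best_lr = 0.2 * compute ** -0.1      # shrinks with compute

a_batch, b_batch = fit_power_law(compute, best_batch)
a_lr, b_lr = fit_power_law(compute, best_lr)

# Extrapolate to a budget larger than any sweep point.
target = 1e22
print("batch size:", predict(a_batch, b_batch, target))
print("learning rate:", predict(a_lr, b_lr, target))
```

A positive exponent for batch size and a negative one for learning rate is the qualitative pattern the paper reports; real sweep data will be noisy, so inspect the residuals before trusting an extrapolation.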
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| HumanEval (0-shot / base) | DeepSeek 67B base 42.7% | LLaMA‑2 70B 28.7% | +14.0 pp | HumanEval | Table 5 (base model comparison) | Sec.5.1, Table 5 |
| GSM8K (8‑shot / base) | DeepSeek 67B 63.4% | LLaMA‑2 70B 58.4% | +5.0 pp | GSM8K | Table 5 (base comparisons) | Sec.5.1, Table 5 |
What To Try In 7 Days
Run small iso‑FLOP experiments to fit batch size and LR following the paper's power‑law fits and avoid full grid search.
Compute non‑embedding FLOPs/token (M) for your model configs and use it to compare model/data tradeoffs.
Aggressively deduplicate corpora across dumps and track the removal rate; it is a cheap way to improve data quality (the paper reports 89.8% removal over 91 dumps).
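For the deduplication item, a minimal sketch of measuring the removal rate when deduplicating across several dumps at once. It uses exact-match hashing after lowercasing and whitespace normalization, which is a simplification chosen for brevity. This summary does not describe the paper's actual deduplication method, and production pipelines often use near-duplicate detection instead. All names here are hypothetical.

```python
import hashlib


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash the same."""
    return " ".join(text.lower().split())


def dedup_across_dumps(dumps):
    """Keep the first occurrence of each document across all dumps.

    `dumps` is an iterable of iterables of document strings.
    Returns (kept_documents, removal_rate), where removal_rate is the
    fraction of input documents dropped as duplicates.
    """
    seen = set()
    kept = []
    total = 0
    for dump in dumps:
        for doc in dump:
            total += 1
            digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
            if digest not in seen:
                seen.add(digest)
                kept.append(doc)
    removal_rate = 1 - len(kept) / total if total else 0.0
    return kept, removal_rate


# Toy example with three "dumps" that share content.
dumps = [
    ["Hello world", "foo"],
    ["hello   WORLD", "bar"],
    ["foo", "Foo"],
]
kept, rate = dedup_across_dumps(dumps)
print(kept, f"removal rate = {rate:.1%}")
```

Sharing one `seen` set across dumps is the point of the exercise: deduplicating each dump in isolation would miss documents that recur between crawls.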
Risks & Boundaries
Limitations
No data release is provided; reproducibility is limited until code and data are published.
Results rely on a large proprietary 2T bilingual corpus; gains may not transfer to different languages or smaller datasets.
When Not To Use
When you need immediate, fully open reproducibility (data and code not yet released).
If your deployment budget cannot support heavy pretraining or large token budgets.
Failure Modes
Overfitting to multiple‑choice style data if such data is included in pretraining or SFT.
Repetitive output increases when weaker models receive excessive math/code SFT data.

