DeepSeek: scaling recipes and a 2T‑token bilingual pretraining run that yields 7B and 67B models competitive on code, math, and chat

January 5, 2024 · 11 min

Overview

Decision Snapshot: Needs Validation

The paper offers practical, tested scaling and hyperparameter recipes and real model results; expect good guidance for planning compute and data but limited reproducibility until code/data artifacts are published.

Citations: 82

Evidence Strength: 0.70

Confidence: 0.78

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 6/6

Reproducibility

Status: No open assets linked

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 70%

Novelty: 60%

Authors

DeepSeek-AI: Xiao Bi, Deli Chen, Guanting Chen, Shanhuang Chen, Damai Dai, Chengqi Deng, Honghui Ding, Kai Dong, Qiushi Du, Zhe Fu, Huazuo Gao, Kaige Gao, Wenjun Gao, Ruiqi Ge, Kang Guan, Daya Guo, Jianzhong Guo, Guangbo Hao, Zhewen Hao, Ying He, Wenjie Hu, Panpan Huang, Erhang Li, Guowei Li, Jiashi Li, Yao Li, Y. K. Li, Wenfeng Liang, Fangyun Lin, A. X. Liu, Bo Liu, Wen Liu, Xiaodong Liu, Xin Liu, Yiyuan Liu, Haoyu Lu, Shanghao Lu, Fuli Luo, Shirong Ma, Xiaotao Nie, Tian Pei, Yishi Piao, Junjie Qiu, Hui Qu, Tongzheng Ren, Zehui Ren, Chong Ruan, Zhangli Sha, Zhihong Shao, Junxiao Song, Xuecheng Su, Jingxiang Sun, Yaofeng Sun, Minghui Tang, Bingxuan Wang, Peiyi Wang, Shiyu Wang, Yaohui Wang, Yongji Wang, Tong Wu, Y. Wu, Xin Xie, Zhenda Xie, Ziwei Xie, Yiliang Xiong, Hanwei Xu, R. X. Xu, Yanhong Xu, Dejian Yang, Yuxiang You, Shuiping Yu, Xingkai Yu, B. Zhang, Haowei Zhang, Lecong Zhang, Liyue Zhang, Mingchuan Zhang, Minghua Zhang, Wentao Zhang, Yichao Zhang, Chenggang Zhao, Yao Zhao, Shangyan Zhou, Shunfeng Zhou, Qihao Zhu, Yuheng Zou

Why It Matters For Business

The paper gives practical scaling recipes and hyperparameter fits so teams can plan compute, model size, and data investments more predictably; it shows a 67B open model can match or beat larger baselines on code/math when paired with curated bilingual data and alignment.

Summary TLDR

This paper presents DeepSeek LLM, trained from scratch on a bilingual 2 trillion token corpus and built using scaling‑law guided choices. Main technical points: power laws fitted for optimal batch size and learning rate as functions of compute budget; model scale represented as non‑embedding FLOPs/token (M) to better predict the optimal model/data split; dataset deduplication, filtering, and remixing; and a two‑stage alignment pipeline (SFT, then DPO). Results: DeepSeek-67B outperforms LLaMA-2-70B on many code and math benchmarks and reaches near‑GPT-3.5 open‑ended chat scores (MT-Bench 8.35 → 8.76 after DPO). Safety evaluations and human annotation show strong refusal/safety behavior (Do-Not-Answer 97.8).

Problem Statement

Open-source LLM builders lack clear, empirically reliable scaling rules and practical hyperparameter recipes that generalize across datasets and compute budgets. This makes it hard to decide how to split compute between model size and tokens, how to pick batch size and learning rate when scaling, and how data quality affects those choices. The paper seeks practical scaling laws and then uses them to train and align 7B and 67B open models on a 2T bilingual corpus.

Main Contribution

Empirical scaling rules for hyperparameters: fitted power laws that predict near‑optimal batch size (increases with compute) and learning rate (decreases with compute).

A new model‑scale representation: non‑embedding FLOPs/token (M) used instead of raw parameter counts to predict optimal model/data tradeoffs more accurately.

Key Findings

Optimal batch size grows and optimal learning rate falls with compute; fitted power‑law relations give near‑optimal hyperparameters across budgets.

Numbers: near‑optimal region defined as ≤0.25% above the minimum loss; fits cover compute budgets from 1e17 to 2e19 FLOPs

Practical Use: When scaling, increase batch size and lower the learning rate following the fitted formulas; this yields near‑optimal training without heavy grid search.

Evidence Ref: Sec. 3.1, Fig. 3
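A minimal sketch of how such a power‑law fit could be reproduced on your own sweep results. The function names and the synthetic sweep values below are illustrative assumptions, not the paper's code or coefficients; the only idea taken from the summary above is that the optimal batch size and learning rate are modeled as a·C^b in the compute budget C.

```python
import math

def fit_power_law(compute, values):
    """Least-squares fit of values ~= a * compute**b, done in log-log space."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(v) for v in values]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                      # exponent = slope in log-log space
    a = math.exp(mean_y - b * mean_x)  # coefficient = exp(intercept)
    return a, b

def predict(a, b, compute):
    """Evaluate the fitted power law at a new compute budget."""
    return a * compute ** b

# Synthetic placeholder data (NOT from the paper): pretend each small sweep
# found these best settings at three compute budgets, measured in FLOPs.
sweep_compute = [1e17, 1e18, 1e19]
best_batch = [0.5 * c ** 0.3 for c in sweep_compute]   # grows with compute
best_lr = [0.4 * c ** -0.1 for c in sweep_compute]     # shrinks with compute

batch_a, batch_b = fit_power_law(sweep_compute, best_batch)
lr_a, lr_b = fit_power_law(sweep_compute, best_lr)

# Extrapolate both hyperparameters to a larger, hypothetical budget.
target_compute = 1e21
print("batch size:", predict(batch_a, batch_b, target_compute))
print("learning rate:", predict(lr_a, lr_b, target_compute))
```

In practice the inputs would be the best settings found by small grid searches at each budget, and the fit should only be trusted near the range it was fitted on.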

Model scale represented as non‑embedding FLOPs/token (M) predicts generalization and optimal model/data split better than parameter counts.

Numbers: Table 3 shows that the parameter‑based approximations 6N1 (non‑embedding parameters) and 6N2 (all parameters) differ from M by up to ~50% at small scales

Practical Use: Use FLOPs/token (M) instead of raw parameter counts when planning how to split compute between model size and token budget; forecasts are more accurate.

Evidence Ref: Sec. 3.2, Table 3, App. A.2
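The comparison can be sketched in a few lines. The formulas below are recalled from the paper's Sec. 3.2 and should be checked against it before use: 6N1 = 72·n_layer·d_model², 6N2 = 6N1 + 6·n_vocab·d_model, and M = 72·n_layer·d_model² + 12·n_layer·d_model·l_seq. The function name and the example configuration are illustrative assumptions.

```python
def model_scale_estimates(n_layer, d_model, seq_len, vocab_size):
    """Return three per-token compute estimates for a dense transformer.

    Formulas are assumed from the paper's Sec. 3.2 (verify before relying on them):
      6N1: non-embedding parameters only
      6N2: non-embedding plus vocabulary parameters
      M:   non-embedding FLOPs/token, including the attention term
    """
    six_n1 = 72 * n_layer * d_model ** 2
    six_n2 = six_n1 + 6 * vocab_size * d_model
    m = six_n1 + 12 * n_layer * d_model * seq_len
    return {"6N1": six_n1, "6N2": six_n2, "M": m}

# Hypothetical small configuration, chosen only for illustration.
est = model_scale_estimates(n_layer=8, d_model=512, seq_len=4096, vocab_size=102400)
for name in ("6N1", "6N2"):
    print(f"{name}/M = {est[name] / est['M']:.2f}")
```

For small models the attention term and the vocabulary term are both large relative to 72·n_layer·d_model², which is why the parameter‑based estimates drift furthest from M at small scales.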

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
HumanEval (0-shot / base) | DeepSeek 67B base 42.7% | LLaMA‑2 70B 28.7% | +14.0 pp | HumanEval | Table 5 (base model comparison) | Sec. 5.1, Table 5
GSM8K (8‑shot / base) | DeepSeek 67B 63.4% | LLaMA‑2 70B 58.4% | +5.0 pp | GSM8K | Table 5 (base comparisons) | Sec. 5.1, Table 5

What To Try In 7 Days

Run small iso‑FLOP experiments to fit batch size and LR following the paper's power‑law fits and avoid full grid search.

Compute non‑embedding FLOPs/token (M) for your model configs and use it to compare model/data tradeoffs.

Aggressively deduplicate corpora across dumps and measure dedup removal rate to improve data quality cheaply (they report 89.8% removal over 91 dumps).
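For the deduplication item, a rough sketch of measuring the removal rate across several dumps. It uses exact‑match hashing on whitespace‑ and case‑normalized text, which is a simplification chosen for illustration; this summary does not describe the paper's actual deduplication method, and the function name and toy data are hypothetical.

```python
import hashlib

def dedup_across_dumps(dumps):
    """Keep the first occurrence of each document across all dumps.

    Documents are compared by a SHA-256 hash of their normalized text
    (whitespace collapsed, lowercased). Returns (kept_docs, removal_rate).
    """
    seen = set()
    kept = []
    total = 0
    for dump in dumps:
        for doc in dump:
            total += 1
            normalized = " ".join(doc.split()).lower()
            digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
            if digest not in seen:
                seen.add(digest)
                kept.append(doc)
    removal_rate = (total - len(kept)) / total if total else 0.0
    return kept, removal_rate

# Toy data standing in for two crawl dumps (illustrative only).
toy_dumps = [
    ["Hello world", "foo"],
    ["hello   World", "bar"],
]
kept_docs, rate = dedup_across_dumps(toy_dumps)
print(f"kept {len(kept_docs)} documents, removal rate {rate:.1%}")
```

Deduplicating across dumps rather than within each one is the point of the exercise: the shared `seen` set is what lets repeats in later dumps be caught.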

Optimization Features

Token Efficiency
Byte‑level BPE with 100k conventional tokens; pre‑tokenization that keeps different character categories (such as CJK symbols) from merging; numbers split into digits to control vocabulary
Infra Optimization
HAI‑LLM training framework integrating multiple parallelism strategies; frequent asynchronous checkpointing (every 5 minutes) to limit lost work
Model Optimization
Grouped‑Query Attention (GQA) for 67B to reduce inference cost; increased depth instead of a wider FFN at 67B
System Optimization
Data/tensor/sequence parallelism and 1F1B pipeline; overlap of communication and computation
Training Optimization
Multi‑step learning rate scheduler (warmup, then 80%/10%/10% stages); ZeRO‑1 optimizer state partitioning; bf16 weights with fp32 gradient accumulation; fused LayerNorm/GEMM/Adam updates to speed training; first training stage reusable for continual training
Inference Optimization
vLLM for generative tasks; flash attention to improve hardware utilization
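The multi‑step learning rate schedule listed under Training Optimization can be sketched as below. The 80%/10%/10% stage split comes from the list above; the 2000‑step linear warmup and the stage multipliers (1.0, 0.316, 0.1) are assumptions recalled from the paper and should be verified there. The function name and signature are hypothetical.

```python
def multi_step_lr(step, total_steps, max_lr,
                  warmup_steps=2000,
                  stage_fractions=(0.8, 0.1, 0.1),
                  stage_scales=(1.0, 0.316, 0.1)):
    """Piecewise-constant schedule with linear warmup.

    stage_fractions: share of training spent in each stage (from the summary).
    stage_scales:    multiplier on max_lr per stage (assumed values).
    warmup_steps:    linear warmup length (assumed value).
    """
    if step < warmup_steps:
        # Linear warmup from max_lr / warmup_steps up to max_lr.
        return max_lr * (step + 1) / warmup_steps
    progress = step / total_steps
    boundary = 0.0
    for fraction, scale in zip(stage_fractions, stage_scales):
        boundary += fraction
        if progress < boundary:
            return max_lr * scale
    # Past the end of the schedule: stay at the final stage's rate.
    return max_lr * stage_scales[-1]

# Example: sample the schedule at a few points of a 100k-step run.
for s in (0, 1999, 50_000, 85_000, 95_000):
    print(s, multi_step_lr(s, total_steps=100_000, max_lr=3e-4))
```

Because the first stage runs at a constant rate, a checkpoint taken before the first step‑down can be resumed with a longer total budget, which is the reuse for continual training noted in the list.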

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Data release not provided; reproducibility limited until code/data are published.

Results rely on a large proprietary 2T bilingual corpus; gains may not transfer to different languages or smaller datasets.

When Not To Use

When you need immediate, fully open reproducibility (data and code not yet released).

If your deployment budget cannot support heavy pretraining or large token budgets.

Failure Modes

Overfitting to multiple‑choice style data if such data is included in pretraining or SFT.

Repetition behavior rising when excessive math/code SFT is used on weaker models.

Core Entities

Models

DeepSeek LLM 7B, DeepSeek LLM 67B, DeepSeek LLM 67B Chat, DeepSeek LLM 67B Chat DPO, LLaMA‑2 7B, LLaMA‑2 70B, GPT‑3.5, GPT‑4

Metrics

Accuracy, pass@1, HumanEval pass@1, bits‑per‑byte (BPB), MT‑Bench average score, Do‑Not‑Answer score, safety pass counts

Datasets

DeepSeek 2T bilingual corpus (proprietary), Common Crawl, OpenWebText2, Pile, GSM8K, HumanEval, MBPP, MMLU, C‑Eval, AlignBench, MT‑Bench, Do‑Not‑Answer

Benchmarks

HumanEval, MBPP, GSM8K, MATH, MMLU, BBH, AGIEval, MT‑Bench, AlignBench, LeetCode Weekly Contests (held‑out), Do‑Not‑Answer

Context Entities

Models

CodeLlama, StarCoder, Codex, ChatGLM3, Baichuan2, Qwen

Metrics

Do‑Not‑Answer, AlignBench score

Datasets

RedPajama (cited), OpenWebText2 (cited)

Benchmarks

AlignBench (Chinese open‑ended), MT‑Bench (English open‑ended)