Overview
Production Readiness
0.6
Novelty Score
0.4
Cost Impact Score
0.7
Citation Count
4
Why It Matters For Business
A small, curated instruction dataset can cheaply improve Chinese LLM reasoning; you can boost model utility without retraining on massive new corpora.
Summary TLDR
Panda LLM adapts LLaMA models to Chinese by pretraining on public Chinese corpora and then instruction-tuning on a small, diverse Chinese instruction set (COIG). Instruction data accounting for just 4.2% of training samples raised reasoning accuracy on the evaluated benchmarks (LogiQA-v2, C3) by 4–13 points. The team releases model diffs and code (not full LLaMA weights) and describes a two-stage training pipeline with training recipes.
Problem Statement
Open-source Chinese instruction-following models are scarce, and it is unclear how the dataset mix and instruction tuning affect performance. The authors aim to show which training-data choices move open Chinese LLMs forward and to publish models and code so others can reproduce and extend the results.
Main Contribution
A two-stage recipe: pretrain/fine-tune on public Chinese corpora, then instruction-tune on COIG.
A comparative evaluation of open-source Chinese LLMs on reasoning benchmarks (LogiQA-v2, C3).
Release of model parameter differences (deltas) and training code to enable reuse under LLaMA license constraints.
Key Findings
Instruction-tuning on COIG raised reasoning scores across benchmarks.
A small fraction of instruction data produced the largest gains.
Mixing instruction and non-instruction data indiscriminately can hurt instruction-following performance.
Results
Accuracy
Who Should Care
What To Try In 7 Days
Run instruction-tuning on a compact, domain-diverse instruction set (like COIG) and up-sample it.
Use the released model deltas and conversion script to adapt LLaMA weights if you hold them.
Evaluate on representative reasoning QA sets (LogiQA-v2, C3) to validate gains quickly.
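The delta step above can be illustrated with a minimal sketch. It assumes the released diffs store (tuned minus base) for every parameter, which the summary does not confirm, and it uses plain Python lists in place of real tensors. The function name `apply_delta` is hypothetical and is not the project's conversion script.

```python
from typing import Dict, List

# Toy stand-in for a checkpoint: parameter name -> flat list of values.
Weights = Dict[str, List[float]]


def apply_delta(base: Weights, delta: Weights) -> Weights:
    """Recover tuned weights as base + delta, parameter by parameter.

    Assumption: the delta was computed as (tuned - base).
    """
    if base.keys() != delta.keys():
        raise ValueError("base and delta must contain the same parameter names")
    merged: Weights = {}
    for name, base_values in base.items():
        delta_values = delta[name]
        if len(base_values) != len(delta_values):
            raise ValueError(f"size mismatch for parameter {name!r}")
        merged[name] = [b + d for b, d in zip(base_values, delta_values)]
    return merged


# Hypothetical toy values, chosen only to show the arithmetic.
base = {"layer.weight": [1.0, 2.0], "layer.bias": [0.5]}
delta = {"layer.weight": [0.25, -0.5], "layer.bias": [0.0]}
merged = apply_delta(base, delta)
print(merged)
```

The name and size checks matter in practice: applying a delta to the wrong base checkpoint would otherwise produce silently corrupted weights.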
Optimization Features
Infra Optimization
- Trained on AWS nodes with 16 A100-80G GPUs
- Batch accumulation used to reach large effective batch sizes
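As a rough illustration of how accumulation scales the batch, the sketch below multiplies the per-GPU micro-batch by the GPU count and the number of accumulation steps. The GPU count of 16 echoes the hardware note above; the micro-batch size and accumulation steps are placeholder values, not the authors' settings.

```python
def effective_batch_size(per_gpu_batch: int, num_gpus: int, accumulation_steps: int) -> int:
    """Number of samples contributing to one optimizer step."""
    if min(per_gpu_batch, num_gpus, accumulation_steps) < 1:
        raise ValueError("all arguments must be positive integers")
    return per_gpu_batch * num_gpus * accumulation_steps


# Placeholder micro-batch and accumulation values for illustration only.
batch = effective_batch_size(per_gpu_batch=2, num_gpus=16, accumulation_steps=8)
print(batch)  # 2 * 16 * 8 = 256
```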
System Optimization
- DeepSpeed ZeRO stage 1 for memory efficiency
- bfloat16 precision and gradient checkpointing to reduce memory
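A configuration along these lines might look like the sketch below, written as a Python dictionary that could be serialized to a DeepSpeed JSON config. The keys are standard DeepSpeed config fields; the batch and accumulation values are placeholders, and nothing here is taken from the project's actual config files.

```python
import json

# Sketch of a DeepSpeed config for ZeRO stage 1 with bfloat16.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,  # placeholder value
    "gradient_accumulation_steps": 8,     # placeholder value
    "zero_optimization": {"stage": 1},    # shard optimizer states across GPUs
    "bf16": {"enabled": True},            # train in bfloat16
}

# Gradient checkpointing is usually switched on from the model side rather
# than in this file, e.g. model.gradient_checkpointing_enable() when using
# a Hugging Face Transformers model (an assumption about the stack).
config_text = json.dumps(ds_config, indent=2)
print(config_text)
```

ZeRO stage 1 shards only the optimizer states, so it saves less memory than stages 2 or 3 but adds comparatively little communication overhead.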
Training Optimization
- Two-stage training: a large pretraining mix, then a separate instruction-tuning stage
- Up-sampling COIG to emphasize instruction examples
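One simple way to up-sample a small instruction set is to repeat it a fixed number of times, as sketched below. This is an illustration under assumptions, not the authors' sampling scheme: the helper names, the repetition factor, and the dataset sizes are all hypothetical.

```python
from typing import List, Sequence


def upsample(examples: Sequence[str], factor: int) -> List[str]:
    """Repeat every example `factor` times (up-sampling by repetition)."""
    if factor < 1:
        raise ValueError("factor must be at least 1")
    return list(examples) * factor


def instruction_share(num_general: int, num_instruction: int, factor: int) -> float:
    """Fraction of a combined sample pool that is instruction data after up-sampling."""
    upsampled = num_instruction * factor
    return upsampled / (num_general + upsampled)


# Hypothetical sizes: 96 general samples, 4 instruction samples, repeated 3 times.
repeated = upsample(["instr_a", "instr_b"], factor=3)
share = instruction_share(num_general=96, num_instruction=4, factor=3)
print(len(repeated), round(share, 3))
```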
Reproducibility
Data Urls
- https://github.com/dandelionsllm/pandallm/tree/main/conf/llama/zh
- COIG (cited as Zhang et al., 2023) and public Chinese corpora listed in paper
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Full model weights not released due to LLaMA license; only parameter diffs provided.
- Evaluation focuses on reasoning QA benchmarks only (LogiQA-v2, C3).
- Mixing instruction and non-instruction data can degrade results if not sequenced carefully.
When Not To Use
- When you need fully open, standalone model weights (they are not provided).
- When your target tasks are far from the evaluated reasoning benchmarks.
- If you cannot access original LLaMA base weights to apply deltas.
Failure Modes
- Combining instruction and non-instruction data without staging can reduce instruction-following ability.
- Relying only on pretraining without instruction data yields poor instruction-following performance.
- Performance claims are limited to the evaluated datasets; other tasks may not improve similarly.
Core Entities
Models
- Panda-7B
- Panda-13B (planned)
- Panda-33B (planned)
- Panda-65B (planned)
- LLaMA-7B
- LLaMA-13B
- LLaMA-33B
- LLaMA-65B
Metrics
- Accuracy
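For multiple-choice benchmarks, accuracy is the fraction of questions answered correctly, and a gain in "points" is the difference between two accuracies expressed in percentage points. The sketch below shows both computations on made-up predictions; the helper names and sample answers are hypothetical.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of predictions that exactly match the gold answers."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)


def gain_in_points(before: float, after: float) -> float:
    """Accuracy difference in percentage points."""
    return (after - before) * 100


# Made-up answers for four multiple-choice questions.
gold = ["A", "B", "B", "D"]
before = accuracy(["A", "C", "C", "A"], gold)  # 1 of 4 correct
after = accuracy(["A", "C", "B", "D"], gold)   # 3 of 4 correct
print(before, after, gain_in_points(before, after))
```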
Datasets
- COIG
- Chinese-Wiki-2019
- Chinese-News-2016
- Chinese-Baike-2018
- Chinese-Webtext-2019
- Translation-2019
- NLP Chinese Corpus (mixture)
Benchmarks
- LogiQA-v2
- C3 (C3-d Dialogue, C3-m Mixed)

