Panda LLM: small, diverse Chinese instruction data (4.2%) sharply boosts LLaMA-based model reasoning

May 4, 2023 · 5 min

Overview

Decision Snapshot: Needs Validation

The paper demonstrates a clear, reproducible boost from instruction-tuning on a small, diverse Chinese instruction dataset, but evaluation is limited to a few reasoning benchmarks and full model weights are not published.

Citations: 4

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 2/3

Findings with evidence refs: 3/3

Results with explicit delta: 3/3

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 40%

Authors

Fangkai Jiao, Bosheng Ding, Tianze Luo, Zhanfeng Mo

Links

Abstract / PDF / Code / Data

Why It Matters For Business

A small, curated instruction dataset can cheaply improve Chinese LLM reasoning; you can boost model utility without retraining on massive new corpora.

Who Should Care

Summary TLDR

Panda LLM adapts LLaMA models to Chinese by pretraining on public Chinese corpora and then instruction-tuning on a small, diverse Chinese instruction set (COIG). Although COIG makes up just 4.2% of training samples, instruction-tuning on it raised accuracy on the evaluated reasoning benchmarks (LogiQA-v2, C3) by roughly 4 to 13 points. The team releases model diffs and code (not full LLaMA weights) and describes a two-stage training pipeline with training recipes.

Problem Statement

Open-source Chinese instruction-following models are scarce and it is unclear how dataset mix and instruction tuning affect performance. The authors aim to show which training-data choices move open Chinese LLMs forward and to publish models/code to help others reproduce and extend results.

Main Contribution

A two-stage recipe: pretrain/fine-tune on public Chinese corpora, then instruction-tune on COIG.

A comparative evaluation of open-source Chinese LLMs on reasoning benchmarks (LogiQA-v2, C3).

Key Findings

Instruction-tuning on COIG raised reasoning scores across benchmarks.

Numbers: LogiQA-v2: 27.41 → 31.93 (+4.52); C3-d: 43.02 → 47.30 (+4.28); C3-m: 43.66 → 57.04 (+13.38)

Practical Use: After heavy pretraining, allocate a small, diverse instruction dataset and run targeted instruction-tuning to get immediate reasoning gains.

Evidence Ref: Table 4; Section 3.2

A small fraction of instruction data produced the largest gains.

Numbers: COIG is 4.2% of training samples; C3-m gain = +13.38 points

Practical Use: You can up-sample and prioritize a compact, diverse instruction set rather than only scaling generic pretraining data.

Evidence Ref: Table 3; Section 3.2
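
The up-sampling idea in the second finding can be sketched as follows. This is a minimal illustration, not the authors' data loader: the helper names `upsample_mix` and `instruct_share`, the repeat factor, and the toy corpus sizes are all hypothetical, chosen so that instruction data starts at 4.2% of the mix.

```python
import random


def upsample_mix(pretrain, instruct, factor, seed=0):
    """Return a shuffled training list in which every instruction example
    appears `factor` times. Hypothetical helper, not from the Panda repo."""
    if factor < 1:
        raise ValueError("factor must be >= 1")
    mixed = list(pretrain) + list(instruct) * factor
    random.Random(seed).shuffle(mixed)
    return mixed


def instruct_share(n_pretrain, n_instruct, factor=1):
    """Fraction of the mix that is instruction data after up-sampling."""
    boosted = n_instruct * factor
    return boosted / (n_pretrain + boosted)


# Toy sizes (assumed): 958 generic samples and 42 instruction samples,
# so instruction data is 4.2% of the mix before up-sampling.
pretrain = [("pretrain", i) for i in range(958)]
instruct = [("instruct", i) for i in range(42)]

mix = upsample_mix(pretrain, instruct, factor=5)
print(f"share before: {instruct_share(958, 42):.3f}")
print(f"share after:  {instruct_share(958, 42, factor=5):.3f}")
```

Repeating examples is the simplest form of up-sampling; weighted sampling per batch would achieve a similar share without duplicating data in memory.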

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | Panda-7B 27.41%; Panda-Instruct-7B-9k 31.93% | Linly-Chinese-LLaMA-7b-hf 25.91% | +4.52 points (Panda-Instruct-7B-9k vs. Panda-7B) | LogiQA-v2 | Table 4 reports these accuracies for baseline and instruct-tuned models. | Table 4; Section 3.2
Accuracy | Panda-7B 43.02%; Panda-Instruct-7B-9k 47.30% | belle-llama-ext-7b 29.52% | +4.28 points (Panda-Instruct-7B-9k vs. Panda-7B) | C3-d | Table 4 shows instruction-tuned model improves C3-d accuracy. | Table 4; Section 3.2
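
The deltas quoted in this summary are percentage-point differences between Panda-Instruct-7B-9k and Panda-7B, not differences against the third-party baselines. A quick check using only the accuracies quoted here (the dictionary layout and function names are just for illustration):

```python
# Accuracies (%) quoted in this summary: (Panda-7B, Panda-Instruct-7B-9k).
reported = {
    "LogiQA-v2": (27.41, 31.93),
    "C3-d": (43.02, 47.30),
    "C3-m": (43.66, 57.04),
}


def point_gain(before, after):
    """Absolute gain in percentage points, rounded to two decimals."""
    return round(after - before, 2)


def relative_gain(before, after):
    """Gain relative to the starting accuracy, in percent."""
    return round(100.0 * (after - before) / before, 1)


for name, (before, after) in reported.items():
    print(f"{name}: +{point_gain(before, after):.2f} points "
          f"({relative_gain(before, after):.1f}% relative)")
```

Keeping point gains and relative gains separate avoids a common misreading, where "+13.38" is taken as a 13.38% relative improvement.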

What To Try In 7 Days

Run instruction-tuning on a compact, domain-diverse instruction set (like COIG) and up-sample it.

Use the released model deltas and conversion script to adapt LLaMA weights if you hold them.

Evaluate on representative reasoning QA sets (LogiQA-v2, C3) to validate gains quickly.
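
Because only parameter diffs are released, recovering usable weights means adding each diff tensor to the matching LLaMA tensor. The repository's own conversion script should be preferred for real checkpoints; the sketch below only illustrates the arithmetic on plain Python lists. The function name `apply_delta`, the tensor names, and the values are hypothetical, and it assumes the diff is defined as tuned minus base.

```python
def apply_delta(base, delta):
    """Add a released diff to base weights, tensor by tensor.

    Both arguments map tensor names to flat lists of floats.
    Assumes delta = tuned - base (an assumption, not confirmed here).
    """
    if base.keys() != delta.keys():
        mismatched = sorted(base.keys() ^ delta.keys())
        raise KeyError(f"tensor names differ: {mismatched}")
    merged = {}
    for name, weights in base.items():
        diff = delta[name]
        if len(weights) != len(diff):
            raise ValueError(f"shape mismatch for {name}")
        merged[name] = [w + d for w, d in zip(weights, diff)]
    return merged


# Toy stand-ins for real checkpoints (names and values are made up).
base = {"layer0.weight": [0.5, -1.0], "layer0.bias": [0.25]}
delta = {"layer0.weight": [0.25, 0.5], "layer0.bias": [-0.25]}

merged = apply_delta(base, delta)
print(merged)
```

Checking names and shapes before adding matters in practice: a diff built against a different LLaMA size or tokenizer would otherwise fail late or produce silently wrong weights.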

Optimization Features

Infra Optimization
Trained on AWS nodes with 16 A100-80G GPUs
Batch accumulation used to reach large effective batch sizes
System Optimization
DeepSpeed ZeRO-1 for memory efficiency
bfloat16 precision and gradient checkpointing to reduce memory
Training Optimization
Two-stage: large pretraining mix, then separate instruction-tuning
Up-sampling COIG to emphasize instruction examples
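
A rough sketch of how these settings might be expressed as a DeepSpeed configuration, together with the effective batch size they imply. The config keys are standard DeepSpeed options, but the micro-batch size and accumulation steps are assumed values for illustration, not the paper's recipe, and 16 is treated here as the total GPU count.

```python
import json

NUM_GPUS = 16            # from the summary; assumed to be the total GPU count
MICRO_BATCH_PER_GPU = 2  # assumed value, not from the paper
GRAD_ACCUM_STEPS = 8     # assumed value, not from the paper


def effective_batch_size(num_gpus, micro_batch, accum_steps):
    """Samples contributing to one optimizer step under data parallelism."""
    return num_gpus * micro_batch * accum_steps


ds_config = {
    "train_micro_batch_size_per_gpu": MICRO_BATCH_PER_GPU,
    "gradient_accumulation_steps": GRAD_ACCUM_STEPS,
    "bf16": {"enabled": True},          # bfloat16 training
    "zero_optimization": {"stage": 1},  # ZeRO-1: shard optimizer states
}

print(json.dumps(ds_config, indent=2))
print("effective batch size:",
      effective_batch_size(NUM_GPUS, MICRO_BATCH_PER_GPU, GRAD_ACCUM_STEPS))
```

Gradient checkpointing is usually switched on in the model code rather than in this file, so it is left out of the dictionary.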

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Data URLs

https://github.com/dandelionsllm/pandallm/tree/main/conf/llama/zh

COIG (cited as Zhang et al., 2023) and public Chinese corpora listed in paper

Risks & Boundaries

Limitations

Full model weights not released due to LLaMA license; only parameter diffs provided.

Evaluation focuses on reasoning QA benchmarks only (LogiQA-v2, C3).

When Not To Use

When you need fully open, standalone model weights (they are not provided).

When your target tasks are far from the evaluated reasoning benchmarks.

Failure Modes

Combining instruction and non-instruction data without staging can reduce instruction-following ability.

Relying only on pretraining without instruction data yields poor instruction-following performance.

Core Entities

Models

Panda-7B, Panda-13B (planned), Panda-33B (planned), Panda-65B (planned), LLaMA-7B, LLaMA-13B, LLaMA-33B, LLaMA-65B

Metrics

Accuracy

Datasets

COIG, Chinese-Wiki-2019, Chinese-News-2016, Chinese-Baike-2018, Chinese-Webtext-2019, Translation-2019, NLP Chinese Corpus (mixture)

Benchmarks

LogiQA-v2, C3 (C3-d Dialogue, C3-m Mixed)
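
Accuracy on these benchmarks is the share of questions where the predicted answer matches the gold answer. A minimal scorer, assuming a multiple-choice setup in which predictions and labels are already option indices (the function name and the toy data are illustrative, not taken from the paper's evaluation code):

```python
def accuracy(predictions, labels):
    """Percentage of positions where the prediction equals the label."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have equal length")
    if not labels:
        raise ValueError("cannot score an empty set")
    correct = sum(p == g for p, g in zip(predictions, labels))
    return 100.0 * correct / len(labels)


# Toy example: four-option questions, indices 0..3 (made-up data).
gold = [0, 2, 1, 3, 3, 0, 1, 2]
pred = [0, 2, 3, 3, 1, 0, 1, 0]

print(f"accuracy: {accuracy(pred, gold):.2f}%")
```

Scoring the same held-out questions before and after instruction-tuning gives a point gain directly comparable to the deltas reported in this summary.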