Panda LLM: small, diverse Chinese instruction data (4.2%) sharply boosts LLaMA-based model reasoning

Overview

Decision Snapshot: Needs Validation

The paper demonstrates a clear, reproducible boost from instruction-tuning on a small, diverse Chinese instruction dataset, but evaluation is limited to a few reasoning benchmarks and full model weights are not published.

Citations: 4

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 2/3

Findings with evidence refs: 3/3

Results with explicit delta: 3/3

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 40%

Authors

Fangkai Jiao, Bosheng Ding, Tianze Luo, Zhanfeng Mo

Why It Matters For Business

A small, curated instruction dataset can cheaply improve Chinese LLM reasoning; you can boost model utility without retraining on massive new corpora.

Who Should Care

CTO, ML Engineer, Data Scientist, Product Manager, Founder

Summary TLDR

Panda LLM adapts LLaMA models to Chinese by pretraining on public Chinese corpora and then instruction-tuning on a small, diverse Chinese instruction set (COIG). Although COIG accounts for just 4.2% of training samples, instruction-tuning on it raised reasoning accuracy on the evaluated benchmarks (LogiQA-v2, C3) by 4–13 points. The team releases model diffs and code (not full LLaMA weights) and describes a two-stage training pipeline with training recipes.

Problem Statement

Open-source Chinese instruction-following models are scarce, and it is unclear how the dataset mix and instruction tuning affect performance. The authors aim to show which training-data choices move open Chinese LLMs forward and to publish models and code so that others can reproduce and extend the results.

Main Contribution

A two-stage recipe: pretrain/fine-tune on public Chinese corpora, then instruction-tune on COIG.

A comparative evaluation of open-source Chinese LLMs on reasoning benchmarks (LogiQA-v2, C3).

Key Findings

Instruction-tuning on COIG raised reasoning scores across benchmarks.

Numbers: LogiQA-v2: 27.41 → 31.93 (+4.52); C3-d: 43.02 → 47.30 (+4.28); C3-m: 43.66 → 57.04 (+13.38)

Practical Use: After heavy pretraining, allocate a small, diverse instruction dataset and run targeted instruction-tuning to get immediate reasoning gains.

Evidence Ref: Table 4; Section 3.2
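The deltas quoted for this finding are plain differences in accuracy points between the model before and after instruction-tuning. A minimal sketch that recomputes them from the reported accuracies; the dictionary layout and the function name `point_gains` are illustrative assumptions, not part of the paper or its code.

```python
# Accuracies (%) quoted for this finding: (before, after) instruction-tuning.
reported = {
    "LogiQA-v2": (27.41, 31.93),
    "C3-d": (43.02, 47.30),
    "C3-m": (43.66, 57.04),
}


def point_gains(scores):
    """Return the absolute gain in accuracy points for each benchmark."""
    return {
        name: round(after - before, 2)
        for name, (before, after) in scores.items()
    }


gains = point_gains(reported)
for name, gain in gains.items():
    print(f"{name}: +{gain:.2f} points")
```

These are absolute point gains, not relative improvements, which is how the summary reports them.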

A small fraction of instruction data produced the largest gains.

Numbers: COIG is 4.2% of training samples; C3-m gain = +13.38 points

Practical Use: You can up-sample and prioritize a compact, diverse instruction set rather than only scaling generic pretraining data.

Evidence Ref: Table 3; Section 3.2

Results

Metric	Value	Baseline	Delta (Panda-Instruct vs. Panda-7B, points)	Split / Dataset	Evidence	Evidence Ref
Accuracy	Panda-7B 27.41%; Panda-Instruct-7B-9k 31.93%	Linly-Chinese-LLaMA-7b-hf 25.91%	+4.52	LogiQA-v2	Table 4 reports these accuracies for baseline and instruct-tuned models.	Table 4; Section 3.2
Accuracy	Panda-7B 43.02%; Panda-Instruct-7B-9k 47.30%	belle-llama-ext-7b 29.52%	+4.28	C3-d	Table 4 shows instruction-tuned model improves C3-d accuracy.	Table 4; Section 3.2

What To Try In 7 Days

Run instruction-tuning on a compact, domain-diverse instruction set (like COIG) and up-sample it.

Use the released model deltas and conversion script to adapt LLaMA weights if you hold them.

Evaluate on representative reasoning QA sets (LogiQA-v2, C3) to validate gains quickly.
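For the second step, the general idea behind a weight delta can be sketched as follows. This assumes the released diffs are element-wise differences between the tuned and original LLaMA parameters; the actual file format and the repository's conversion script are not described here, so `apply_delta` and the toy values are purely hypothetical.

```python
def apply_delta(base_weights, delta_weights):
    """Add each delta tensor to the matching base tensor, element-wise.

    Both arguments map parameter names to flat lists of floats. This is a
    hypothetical stand-in for real checkpoints, not the repository's script.
    """
    if base_weights.keys() != delta_weights.keys():
        raise ValueError("base and delta must contain the same parameter names")
    merged = {}
    for name, base in base_weights.items():
        delta = delta_weights[name]
        if len(base) != len(delta):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [b + d for b, d in zip(base, delta)]
    return merged


# Toy example with made-up values (assumption: tuned = base + delta).
base = {"layer.weight": [1.0, 2.0], "layer.bias": [0.5]}
delta = {"layer.weight": [0.25, -0.5], "layer.bias": [0.25]}
merged = apply_delta(base, delta)
print(merged)
```

In practice, use the conversion script shipped with the released code; the sketch only illustrates why the original LLaMA weights are required to reconstruct a usable model.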

Optimization Features

Infra Optimization

Trained on AWS nodes with 16 A100-80G GPUs

Batch accumulation used to reach large effective batch sizes
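Accumulation raises the effective batch size by summing gradients over several micro-batches before each optimizer step. A minimal sketch of the arithmetic, assuming 16 is the total GPU count; the per-GPU batch size and number of accumulation steps are made-up values, since the summary does not state them.

```python
def effective_batch_size(per_gpu_batch, num_gpus, accumulation_steps):
    """Number of samples that contribute to a single optimizer step."""
    return per_gpu_batch * num_gpus * accumulation_steps


num_gpus = 16           # GPU count from the summary (assumed to be the total)
per_gpu_batch = 2       # assumption: micro-batch that fits on one GPU
accumulation_steps = 8  # assumption: micro-batches accumulated per step

total = effective_batch_size(per_gpu_batch, num_gpus, accumulation_steps)
print(f"effective batch size: {total}")
```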

System Optimization

DeepSpeed ZeRO-1 for memory efficiency

bfloat16 precision and gradient checkpointing to reduce memory
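A DeepSpeed setup along these lines is usually expressed as a JSON config. The sketch below builds one as a Python dictionary using standard DeepSpeed config keys; the batch values are assumptions, and this is not the authors' actual configuration file.

```python
import json


def build_deepspeed_config(micro_batch_per_gpu, accumulation_steps):
    """Return a minimal DeepSpeed-style config: ZeRO stage 1 with bfloat16."""
    return {
        "train_micro_batch_size_per_gpu": micro_batch_per_gpu,
        "gradient_accumulation_steps": accumulation_steps,
        "zero_optimization": {"stage": 1},  # shard optimizer states only
        "bf16": {"enabled": True},          # bfloat16 mixed precision
    }


# Batch values are illustrative assumptions, not taken from the paper.
config = build_deepspeed_config(micro_batch_per_gpu=2, accumulation_steps=8)
print(json.dumps(config, indent=2))
```

Gradient checkpointing is typically switched on in the model code rather than in this file, for example through a model-level option in the training framework.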

Training Optimization

Two-stage: large pretraining mix then separate instruction-tuning

Up-sampling COIG to emphasize instruction examples
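Up-sampling can be as simple as repeating the instruction examples before shuffling them into the training mix. The summary does not say how the authors implemented it, so the sketch below is one plausible approach under that assumption; `build_mixture`, the up-sampling factor, and the toy data are all hypothetical.

```python
import random


def build_mixture(general, instruction, upsample_factor, seed=0):
    """Repeat instruction examples `upsample_factor` times, then shuffle."""
    if upsample_factor < 1:
        raise ValueError("upsample_factor must be at least 1")
    mixed = list(general) + list(instruction) * upsample_factor
    random.Random(seed).shuffle(mixed)  # seeded for a reproducible order
    return mixed


# Toy corpus: instruction examples start as a small share of the samples.
general = [f"general-{i}" for i in range(96)]
instruction = [f"instruct-{i}" for i in range(4)]

mixed = build_mixture(general, instruction, upsample_factor=3)
instruct_count = sum(item.startswith("instruct") for item in mixed)
print(f"instruction share: {instruct_count / len(mixed):.1%}")
```

Repetition raises how often the model sees instruction examples without adding new data, at the cost of possible overfitting if the factor is set too high.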

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/dandelionsllm/pandallm/

Data URLs

https://github.com/dandelionsllm/pandallm/tree/main/conf/llama/zh

COIG (cited as Zhang et al., 2023) and the public Chinese corpora listed in the paper

Risks & Boundaries

Limitations

Full model weights not released due to LLaMA license; only parameter diffs provided.

Evaluation focuses on reasoning QA benchmarks only (LogiQA-v2, C3).

When Not To Use

When you need fully open, standalone model weights (they are not provided).

When your target tasks are far from the evaluated reasoning benchmarks.

Failure Modes

Combining instruction and non-instruction data without staging can reduce instruction-following ability.

Relying only on pretraining without instruction data yields poor instruction-following performance.

Core Entities

Models

Panda-7B, Panda-13B (planned), Panda-33B (planned), Panda-65B (planned), LLaMA-7B, LLaMA-13B, LLaMA-33B, LLaMA-65B

Metrics

Accuracy

Datasets

COIG, Chinese-Wiki-2019, Chinese-News-2016, Chinese-Baike-2018, Chinese-Webtext-2019, Translation-2019, NLP Chinese Corpus (mixture)

Benchmarks

LogiQA-v2, C3 (C3-d Dialogue, C3-m Mixed)
