Panda LLM: small, diverse Chinese instruction data (4.2%) sharply boosts LLaMA-based model reasoning

May 4, 2023 · 5 min

Overview

Production Readiness

0.6

Novelty Score

0.4

Cost Impact Score

0.7

Citation Count

4

Authors

Fangkai Jiao, Bosheng Ding, Tianze Luo, Zhanfeng Mo

Why It Matters For Business

A small, curated instruction dataset can cheaply improve Chinese LLM reasoning; you can boost model utility without retraining on massive new corpora.

Summary TLDR

Panda LLM adapts LLaMA models to Chinese in two stages: pretraining on public Chinese corpora, then instruction-tuning on a small, diverse Chinese instruction set (COIG). Although COIG makes up only 4.2% of training samples, instruction-tuning raised accuracy on the evaluated reasoning benchmarks (LogiQA-v2, C3) by about 4 to 13 points. The team releases model deltas and code (not full LLaMA weights) and describes its training recipes.

Problem Statement

Open-source Chinese instruction-following models are scarce, and it is unclear how the dataset mix and instruction tuning affect performance. The authors aim to show which training-data choices move open Chinese LLMs forward and to publish models and code so others can reproduce and extend the results.

Main Contribution

A two-stage recipe: pretrain/fine-tune on public Chinese corpora, then instruction-tune on COIG.

A comparative evaluation of open-source Chinese LLMs on reasoning benchmarks (LogiQA-v2, C3).

Release of model parameter differences (deltas) and training code to enable reuse under LLaMA license constraints.
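The delta release can be pictured as element-wise arithmetic: the published file holds the difference between tuned and base parameters, so anyone who already holds the base weights can add the two back together. The following is a minimal sketch, assuming a delta is simply "tuned minus base" and that weights are plain name-to-list mappings; `make_delta` and `apply_delta` are hypothetical helpers, not the project's actual conversion script.

```python
from typing import Dict, List

# Hypothetical weight container: parameter name -> flat list of values.
Weights = Dict[str, List[float]]


def make_delta(tuned: Weights, base: Weights) -> Weights:
    """Return tuned - base per parameter (what a delta release would contain)."""
    return {
        name: [t - b for t, b in zip(tuned[name], base[name])]
        for name in tuned
    }


def apply_delta(base: Weights, delta: Weights) -> Weights:
    """Recover tuned weights as base + delta; fail loudly on any mismatch."""
    if base.keys() != delta.keys():
        raise KeyError("base and delta must cover the same parameter names")
    restored: Weights = {}
    for name, values in base.items():
        if len(values) != len(delta[name]):
            raise ValueError(f"size mismatch for parameter {name!r}")
        restored[name] = [b + d for b, d in zip(values, delta[name])]
    return restored
```

In practice the same idea would run over tensors in a checkpoint rather than Python lists; the key check is that every parameter in the delta has a same-shaped counterpart in the base model.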

Key Findings

Instruction-tuning on COIG raised reasoning scores across benchmarks.

Numbers: LogiQA 27.41 → 31.93 (+4.52); C3-d 43.02 → 47.30 (+4.28); C3-m 43.66 → 57.04 (+13.38)

A small fraction of instruction data produced the largest gains.

Numbers: COIG is 4.2% of training samples; C3-m gain = +13.38 points

Mixing instruction and non-instruction data indiscriminately can hurt instruction-following performance.

Results

Accuracy (LogiQA-v2)

Value: Panda-7B 27.41%; Panda-Instruct-7B-9k 31.93%

Baseline: Linly-Chinese-LLaMA-7b-hf 25.91%

Accuracy (C3-d)

Value: Panda-7B 43.02%; Panda-Instruct-7B-9k 47.30%

Baseline: belle-llama-ext-7b 29.52%

Accuracy (C3-m)

Value: Panda-7B 43.66%; Panda-Instruct-7B-9k 57.04%

Baseline: belle-llama-ext-7b 28.87%
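The reported gains can be rechecked with simple subtraction. The sketch below recomputes the improvement from instruction-tuning and the margin over each listed baseline, using only the accuracies quoted above; the benchmark assignment follows the matching numbers in Key Findings, and the dictionary keys and the `gains` helper are illustrative names.

```python
# Accuracies (%) copied from the Results above.
# Baselines: Linly-Chinese-LLaMA-7b-hf for LogiQA-v2, belle-llama-ext-7b for C3.
RESULTS = {
    "LogiQA-v2": {"panda_7b": 27.41, "panda_instruct": 31.93, "baseline": 25.91},
    "C3-d": {"panda_7b": 43.02, "panda_instruct": 47.30, "baseline": 29.52},
    "C3-m": {"panda_7b": 43.66, "panda_instruct": 57.04, "baseline": 28.87},
}


def gains(results):
    """Return absolute accuracy differences in points, rounded to 2 decimals."""
    out = {}
    for bench, acc in results.items():
        out[bench] = {
            # Panda-Instruct-7B-9k minus Panda-7B
            "from_instruction_tuning": round(acc["panda_instruct"] - acc["panda_7b"], 2),
            # Panda-Instruct-7B-9k minus the listed baseline model
            "over_baseline": round(acc["panda_instruct"] - acc["baseline"], 2),
        }
    return out


if __name__ == "__main__":
    for bench, g in gains(RESULTS).items():
        print(bench, g)
```

These are absolute differences in accuracy points, not relative improvements.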

Who Should Care

What To Try In 7 Days

Run instruction-tuning on a compact, domain-diverse instruction set (like COIG) and up-sample it.

Use the released model deltas and conversion script to adapt LLaMA weights if you hold them.

Evaluate on representative reasoning QA sets (LogiQA-v2, C3) to validate gains quickly.
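For the evaluation step, both benchmarks are multiple-choice, so the metric reduces to the share of questions where the chosen option matches the gold label. A minimal sketch, assuming each example is a dict with `question`, `options`, and `label` fields and that `predict` is any callable returning an option index; these field names and the callable are placeholders, not the datasets' real schemas or the authors' evaluation harness.

```python
from typing import Callable, Dict, List


def accuracy(
    examples: List[Dict],
    predict: Callable[[str, List[str]], int],
) -> float:
    """Return multiple-choice accuracy in percent.

    `predict(question, options)` must return the index of the chosen option.
    """
    if not examples:
        raise ValueError("no examples to evaluate")
    correct = sum(
        1
        for ex in examples
        if predict(ex["question"], ex["options"]) == ex["label"]
    )
    return 100.0 * correct / len(examples)
```

Running the same function on the base and the instruction-tuned model, with identical prompts and examples, gives a like-for-like comparison of the kind reported above.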

Optimization Features

Infra Optimization

  • Trained on AWS nodes with 16 A100-80G GPUs
  • Batch accumulation used to reach large effective batch sizes
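Batch accumulation trades steps for memory: gradients from several small batches are summed before each optimizer update, so the effective batch size is the product of per-GPU batch size, GPU count, and accumulation steps. The sketch below shows that arithmetic. Treating the 16 GPUs listed above as the total device count is an assumption, and the per-GPU batch size and target batch size in the usage comment are illustrative values, not the authors' settings.

```python
def effective_batch_size(per_gpu_batch: int, num_gpus: int, accumulation_steps: int) -> int:
    """Samples contributing to one optimizer update."""
    return per_gpu_batch * num_gpus * accumulation_steps


def accumulation_steps_for(target_batch: int, per_gpu_batch: int, num_gpus: int) -> int:
    """Accumulation steps needed to reach `target_batch` exactly."""
    per_step = per_gpu_batch * num_gpus
    if target_batch % per_step != 0:
        raise ValueError("target batch must be a multiple of per_gpu_batch * num_gpus")
    return target_batch // per_step


# Illustrative usage: 16 GPUs, 4 samples per GPU, target of 512 samples per update.
steps = accumulation_steps_for(target_batch=512, per_gpu_batch=4, num_gpus=16)
```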

System Optimization

  • DeepSpeed ZeRO-1 (ZeRO stage 1) for memory efficiency
  • bfloat16 precision and gradient checkpointing to reduce memory
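These settings map onto a small DeepSpeed configuration. The sketch below builds one as a Python dict and serializes it to JSON; the keys are standard DeepSpeed config fields, but the batch-size values are placeholders and this is not the authors' actual config file. Gradient checkpointing is usually enabled on the model itself (for example via Hugging Face's `gradient_checkpointing_enable()`), so it does not appear in the dict.

```python
import json

# Illustrative DeepSpeed config; numeric values are placeholders.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,   # placeholder per-GPU batch size
    "gradient_accumulation_steps": 8,      # placeholder accumulation steps
    "zero_optimization": {"stage": 1},     # ZeRO stage 1: shard optimizer states
    "bf16": {"enabled": True},             # bfloat16 mixed precision
}


def ds_config_json() -> str:
    """Serialize the config as it would be written to a JSON config file."""
    return json.dumps(ds_config, indent=2)
```

ZeRO stage 1 partitions only optimizer states across devices; higher stages also partition gradients and parameters, at the cost of more communication.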

Training Optimization

  • Two-stage training: a large pretraining mix first, then a separate instruction-tuning stage
  • Up-sampling COIG to emphasize instruction examples
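Up-sampling can be sketched as repeating the smaller dataset an integer number of times before shuffling, which raises its share of the samples seen in training. How the authors applied it is not detailed here, so the code below is only an illustration of the general technique; `build_mix` and `instruct_share` are hypothetical helpers and the counts in the test-style usage are made up.

```python
import random
from typing import List, Sequence


def build_mix(
    other: Sequence[str],
    instruct: Sequence[str],
    upsample: int,
    seed: int = 0,
) -> List[str]:
    """Repeat `instruct` `upsample` times, add `other`, and shuffle deterministically."""
    if upsample < 1:
        raise ValueError("upsample must be >= 1")
    mix = list(other) + list(instruct) * upsample
    random.Random(seed).shuffle(mix)
    return mix


def instruct_share(n_other: int, n_instruct: int, upsample: int) -> float:
    """Fraction of the final mix that comes from the instruction set."""
    repeated = n_instruct * upsample
    return repeated / (n_other + repeated)
```

For example, an instruction set that starts at 42 of 1,000 samples (4.2%) would make up 126 of 1,084 samples after 3x up-sampling, under these made-up counts.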

Reproducibility

Data Urls

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Full model weights not released due to LLaMA license; only parameter diffs provided.
  • Evaluation focuses on reasoning QA benchmarks only (LogiQA-v2, C3).
  • Mixing instruction and non-instruction data can degrade results if not sequenced carefully.

When Not To Use

  • When you need fully open, standalone model weights (they are not provided).
  • When your target tasks are far from the evaluated reasoning benchmarks.
  • If you cannot access original LLaMA base weights to apply deltas.

Failure Modes

  • Combining instruction and non-instruction data without staging can reduce instruction-following ability.
  • Relying only on pretraining without instruction data yields poor instruction-following performance.
  • Performance claims limited to evaluated datasets; other tasks may not improve similarly.

Core Entities

Models

  • Panda-7B
  • Panda-13B (planned)
  • Panda-33B (planned)
  • Panda-65B (planned)
  • LLaMA-7B
  • LLaMA-13B
  • LLaMA-33B
  • LLaMA-65B

Metrics

  • Accuracy

Datasets

  • COIG
  • Chinese-Wiki-2019
  • Chinese-News-2016
  • Chinese-Baike-2018
  • Chinese-Webtext-2019
  • Translation-2019
  • NLP Chinese Corpus (mixture)

Benchmarks

  • LogiQA-v2
  • C3 (C3-d Dialogue, C3-m Mixed)