Imitating ChatGPT copies style, not capabilities

Overview

Decision Snapshot: Needs Validation

The study uses multiple base models, both human and automated evaluations, and public datasets, so evidence is moderately strong; results generalize to similar imitation approaches but cannot prove outcomes for all collection strategies or advanced imitation methods.

Citations: 50

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 2/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 40%

Novelty: 50%

Authors

Arnav Gudibande, Eric Wallace, Charlie Snell, Xinyang Geng, Hao Liu, Pieter Abbeel, Sergey Levine, Dawn Song

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Imitation can cheaply copy a proprietary model's tone and safety behavior, but it does not replicate the model's core reasoning or factual knowledge, so relying on imitation to match competitors is risky.

Who Should Care

CTO, Product Manager, ML Engineer, Founder

Summary TLDR

Fine-tuning weaker open models on outputs from a strong proprietary model (ChatGPT) makes them mimic ChatGPT's style and instruction-following, but it does not meaningfully close capability gaps on factual, coding, or reasoning benchmarks. Task-specific imitation (training on queries that resemble the target task) helps; broad imitation from web-collected ChatGPT dialogues generally does not, and can even hurt accuracy. Investing in stronger base models is more effective than collecting more imitation data.

Problem Statement

Can cheaply fine-tuning open-source LMs on outputs from a stronger proprietary LM (model imitation) produce comparable performance? The paper tests whether copying ChatGPT outputs closes the capability gap across tasks and whether human-style evaluations reflect real capability gains.

Main Contribution

A systematic study of imitation across base model sizes (1.5B–13B parameters), data sources, and data scales (0.3M–150M tokens).

Evidence that broad imitation (ShareGPT-Mix) produces ChatGPT-like style but little to no benchmark gain, and sometimes degrades performance.

Key Findings

Human raters often rate imitation outputs as equal to or better than ChatGPT's.

Numbers: ≈70% of imitation outputs rated equal to or better than ChatGPT

Practical Use: Do not trust simple crowd preference tests alone; style and confident phrasing can fool raters into overrating capability.

Evidence Ref: Figure 1 (left)

Broad imitation on ShareGPT-Mix does not improve, and can lower, benchmark accuracy.

Numbers: 7B NQ exact match drops from 17 (baseline) to 10 after ShareGPT-Mix

Practical Use: Avoid broad imitation if your goal is factual QA or reasoning; it can reduce task accuracy versus the base model.

Evidence Ref: Table 1

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Natural Questions (NQ) exact match	7B baseline 17 / 7B+ShareGPT-Mix 10 / 7B+NQ-synth 22 / 13B baseline 20 / 13B+NQ-synth 27 / ChatGPT 31	7B baseline 17	7B+NQ-synth +5 vs base; 13B+NQ-synth +7 vs 13B base	Natural Questions	Table 1 shows NQ exact match scores for base and finetuned models.	Table 1
Human preference (blind pairwise)	Imitation outputs ≈70% rated equal or better than ChatGPT on evaluated prompts	ChatGPT	—	255 held-out prompts, 71 workers	Figure 1 (left) and human evaluation description in Section 4	Figure 1
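The deltas in the NQ row can be recomputed from the reported scores. The short Python sketch below does that arithmetic using only the values quoted in the table; the dictionary layout and the function name `nq_deltas` are illustrative and not taken from the paper's code.

```python
# NQ exact-match scores as quoted in the Results table (Table 1 of the paper).
NQ_SCORES = {
    "7B": {"base": 17, "ShareGPT-Mix": 10, "NQ-synth": 22},
    "13B": {"base": 20, "NQ-synth": 27},
}
CHATGPT_NQ = 31


def nq_deltas(scores, teacher):
    """Return (size, variant, delta vs. base, remaining gap to teacher) rows."""
    rows = []
    for size, variants in scores.items():
        base = variants["base"]
        for name, value in variants.items():
            if name == "base":
                continue
            rows.append((size, name, value - base, teacher - value))
    return rows


for size, name, delta, gap in nq_deltas(NQ_SCORES, CHATGPT_NQ):
    print(f"{size}+{name}: {delta:+d} vs base, {gap} points below ChatGPT")
```

On these numbers, broad imitation moves the 7B model 7 points below its own baseline, while task-specific imitation still leaves the 13B model 4 points short of ChatGPT.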

What To Try In 7 Days

Run a small task-specific imitation experiment for a single critical task and compare benchmark accuracy to base model.

Compare blind human preference vs automatic accuracy to detect style-driven illusions of quality.

Measure toxicity and refusal behavior after imitation to see if safety posture transfers.
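The first two checks above could be wired together roughly as follows. This is a minimal sketch, not the paper's evaluation code: the answer normalization follows the common SQuAD-style convention (an assumption), the rating labels and the `pref_threshold` cutoff are hypothetical, and the sample data is invented purely to exercise the functions.

```python
import re
import string


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(predictions, references):
    """Percentage of predictions that match any accepted reference answer."""
    hits = sum(
        any(normalize(pred) == normalize(ref) for ref in refs)
        for pred, refs in zip(predictions, references)
    )
    return 100.0 * hits / len(predictions)


def equal_or_better_rate(ratings):
    """Share of blind pairwise ratings where imitation was not rated worse.

    Each rating is "imitation", "tie", or "teacher" (hypothetical labels).
    """
    return 100.0 * sum(r in ("imitation", "tie") for r in ratings) / len(ratings)


def style_illusion(base_em, imitation_em, preference_rate, pref_threshold=50.0):
    """Flag the case where raters like the model but accuracy did not improve.

    pref_threshold is an illustrative cutoff, not a value from the paper.
    """
    return preference_rate >= pref_threshold and imitation_em <= base_em


# Toy data, invented for illustration only.
references = [["Paris"], ["1969"], ["Mount Everest", "Everest"]]
base_preds = ["Paris", "1969", "K2"]
imit_preds = ["The city of Paris", "1969.", "K2"]
ratings = ["imitation", "tie", "teacher", "tie"]

base_em = exact_match(base_preds, references)
imit_em = exact_match(imit_preds, references)
pref = equal_or_better_rate(ratings)
print(f"base EM={base_em:.1f}, imitation EM={imit_em:.1f}, preference={pref:.1f}%")
print("style-driven illusion:", style_illusion(base_em, imit_em, pref))
```

Keeping the accuracy metric and the preference metric in the same report makes a divergence between them visible at a glance, which is the pattern the paper warns about.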

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/young-geng/EasyLM

Data URLs

https://huggingface.co/young-geng/koala-eval

Risks & Boundaries

Limitations

Unknown overlap between ChatGPT pretraining data and evaluated benchmarks could inflate ChatGPT performance.

Only supervised finetuning on teacher outputs was tested; methods like RLHF using the teacher were not evaluated.

When Not To Use

When you need real factual accuracy, reasoning, or coding improvements across many tasks.

When your base model is substantially weaker than the target model.

Failure Modes

Style mimicry: outputs sound authoritative but contain factual errors.

Distribution shift: conversational imitation can hurt performance on benchmark tasks.

Core Entities

Models

GPT-2 1.5B, LLaMA 7B, LLaMA 13B, ChatGPT, GPT-4 (evaluator)

Metrics

Accuracy, exact match, human preference, GPT-4 preference, nontoxicity score, unigram intersection

Datasets

ShareGPT-Mix, NQ-synthetic, ShareGPT, HC3, Discord ChatGPT Bots

Benchmarks

MMLU, Natural Questions (NQ), HumanEval, RealToxicityPrompts

Context Entities

Models

Vicuna, Alpaca, Koala, GPT4All

Datasets

SuperNaturalInstructions
