Imitating ChatGPT copies style, not capabilities

May 25, 2023 · 8 min

Overview

Decision Snapshot: Needs Validation

The study uses multiple base models, both human and automated evaluations, and public datasets, so the evidence is moderately strong. The results likely generalize to similar imitation approaches but cannot establish outcomes for every data-collection strategy or for more advanced imitation methods.

Citations: 50

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 2/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 40%

Novelty: 50%

Authors

Arnav Gudibande, Eric Wallace, Charlie Snell, Xinyang Geng, Hao Liu, Pieter Abbeel, Sergey Levine, Dawn Song

Why It Matters For Business

Imitation can cheaply copy a proprietary model's tone and safety behavior, but it does not replicate the model's core reasoning or factual knowledge, so relying on imitation to match competitors is risky.

Who Should Care

Summary TLDR

Fine-tuning weaker open models on outputs from a strong proprietary model (ChatGPT) makes them mimic ChatGPT's style and instruction-following but does not meaningfully close capability gaps on factual, coding, or reasoning benchmarks. Task-specific imitation (training on queries that resemble the target task) helps; broad imitation from web-collected ChatGPT dialogues generally does not and can even hurt accuracy. Investing in stronger base models is more effective than collecting more imitation data.

Problem Statement

Can cheaply fine-tuning open-source LMs on outputs from a stronger proprietary LM (model imitation) produce comparable performance? The paper tests whether copying ChatGPT outputs closes the capability gap across tasks and whether human-style evaluations reflect real capability gains.

Main Contribution

A systematic study of imitation across base model sizes (1.5B–13B parameters), data sources, and data scales (0.3M–150M tokens).

Evidence that broad imitation (ShareGPT-Mix) produces ChatGPT-like style but little to no benchmark gain, and sometimes degrades performance.

Key Findings

Human raters often rate imitation outputs as equal to or better than ChatGPT's.

Numbers: ≈70% of imitation outputs rated equal or better vs ChatGPT

Practical Use: Do not trust simple crowd preference tests alone; style and confident phrasing can fool raters into overrating capability.

Evidence Ref: Figure 1 (left)

Broad imitation on ShareGPT-Mix does not improve, and can lower, benchmark accuracy.

Numbers: 7B NQ exact match drops from 17 (baseline) to 10 after ShareGPT-Mix

Practical Use: Avoid broad imitation if your goal is factual QA or reasoning; it can reduce task accuracy versus the base model.

Evidence Ref: Table 1

Results

Metric: Natural Questions (NQ) exact match
Value: 7B baseline 17 / 7B+ShareGPT-Mix 10 / 7B+NQ-synth 22 / 13B baseline 20 / 13B+NQ-synth 27 / ChatGPT 31
Baseline: 7B baseline 17
Delta: 7B+NQ-synth +5 vs 7B base; 13B+NQ-synth +7 vs 13B base
Split / Dataset: Natural Questions
Evidence: Table 1 shows NQ exact match scores for base and finetuned models.
Evidence Ref: Table 1

Metric: Human preference (blind pairwise)
Value: Imitation outputs ≈70% rated equal or better than ChatGPT on evaluated prompts
Baseline: ChatGPT
Delta: not given
Split / Dataset: 255 held-out prompts, 71 workers
Evidence: Figure 1 (left) and human evaluation description in Section 4
Evidence Ref: Figure 1
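
The NQ deltas follow directly from the Table 1 scores quoted in this section. As a quick sanity check, here is a minimal sketch of that arithmetic; the dictionary layout and the `delta_vs_base` helper are illustrative names, not something taken from the paper or its code.

```python
# NQ exact-match scores as quoted in this summary (Table 1 of the paper).
nq_scores = {
    "7B": {"base": 17, "ShareGPT-Mix": 10, "NQ-synth": 22},
    "13B": {"base": 20, "NQ-synth": 27},
}


def delta_vs_base(scores):
    """Return each fine-tuned variant's score minus its own base model's score."""
    return {
        size: {
            name: value - variants["base"]
            for name, value in variants.items()
            if name != "base"
        }
        for size, variants in scores.items()
    }


deltas = delta_vs_base(nq_scores)
for size, row in deltas.items():
    for name, d in row.items():
        # The "+d" format spec prints an explicit sign for gains and losses.
        print(f"{size}+{name}: {d:+d} vs {size} base")
```

Computed this way, broad imitation on the 7B model is a 7-point loss against its own base, while task-specific NQ-synth data gives gains of 5 and 7 points for the 7B and 13B models.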

What To Try In 7 Days

Run a small task-specific imitation experiment for a single critical task and compare benchmark accuracy to base model.

Compare blind human preference vs automatic accuracy to detect style-driven illusions of quality.

Measure toxicity and refusal behavior after imitation to see if safety posture transfers.
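
The first two checks above can be sketched in a few lines of Python. This is a sketch under stated assumptions, not the paper's evaluation code: the answer normalization in `exact_match` is a common open-domain QA convention and may differ from the authors' scoring script, and `equal_or_better_rate`, `style_illusion`, the rating labels, the `min_pref` threshold, and the toy data are all hypothetical.

```python
import re
import string


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(predictions, references):
    """Percent of predictions equal to any reference answer after normalization."""
    hits = sum(
        normalize(pred) in {normalize(ref) for ref in refs}
        for pred, refs in zip(predictions, references)
    )
    return 100.0 * hits / len(predictions)


def equal_or_better_rate(labels):
    """Share of blind pairwise ratings where the imitation model ties or wins.

    Labels are assumed to be "imitation", "tie", or "reference".
    """
    favorable = sum(label in ("imitation", "tie") for label in labels)
    return favorable / len(labels)


def style_illusion(pref_rate, em_model, em_base, min_pref=0.5):
    """Flag the case where raters are satisfied but accuracy did not beat the base model."""
    return pref_rate >= min_pref and em_model <= em_base


# Toy data, invented for illustration only.
refs = [["Paris"], ["1969"], ["Marie Curie"], ["Nile", "the Nile"]]
base_preds = ["paris", "1969", "Curie", "The Nile."]
imit_preds = ["The capital is Paris", "1969", "Albert Einstein", "Amazon"]
ratings = ["imitation", "tie", "tie", "reference"]

em_base = exact_match(base_preds, refs)
em_imit = exact_match(imit_preds, refs)
pref = equal_or_better_rate(ratings)
print(em_base, em_imit, pref, style_illusion(pref, em_imit, em_base))
```

Reporting the preference rate and the benchmark score side by side keeps one from standing in for the other: a model can score well with raters while scoring below its own base model on exact match.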

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Unknown overlap between ChatGPT pretraining data and evaluated benchmarks could inflate ChatGPT performance.

Only supervised finetuning on teacher outputs was tested; methods like RLHF using the teacher were not evaluated.

When Not To Use

When you need real factual accuracy, reasoning, or coding improvements across many tasks.

When your base model is substantially weaker than the target model.

Failure Modes

Style mimicry: outputs sound authoritative but contain factual errors.

Distribution shift: conversational imitation can hurt performance on benchmark tasks.

Core Entities

Models

GPT-2 1.5B, LLaMA 7B, LLaMA 13B, ChatGPT, GPT-4 (evaluator)

Metrics

Accuracy, exact match, human preference, GPT-4 preference, nontoxicity score, unigram intersection

Datasets

ShareGPT-Mix, NQ-synthetic, ShareGPT, HC3, Discord ChatGPT Bots

Benchmarks

MMLU, Natural Questions (NQ), HumanEval, RealToxicityPrompts

Context Entities

Models

Vicuna, Alpaca, Koala, GPT4All

Datasets

SuperNaturalInstructions