LLM-created training data hides biases and artifacts that can degrade models and amplify majority views

January 26, 2024 | 8 min

Overview

Decision Snapshot: Needs Validation

The paper compiles broad, multi-dataset evidence that synthetic data often differs from human data and can harm downstream models; experiments are varied but not exhaustive, so treat findings as well-supported but preliminary.

Citations: 2

Evidence Strength: 0.75

Confidence: 0.90

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 70%

Production readiness: 40%

Novelty: 55%

Authors

Debarati Das, Karin De Langis, Anna Martin-Boyle, Jaehyung Kim, Minhwa Lee, Zae Myung Kim, Shirley Anugrah Hayati, Risako Owan, Bin Hu, Ritik Parkar, Ryan Koo, Jonginn Park, Aahan Tyagi, Libby Ferland, Sanjali Roy, Vincent Liu, Dongyeop Kang

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Synthetic LLM data can cut labeling costs but risks amplifying majority views, injecting errors, and reducing downstream accuracy (~10% in some tests). Validate and human-check synthetic data before production use.

Summary TLDR

The authors collect and stress-test five types of LLM-generated text (task labels, preferences, instructions, simulations, free-form text). They find consistent artifacts: LLMs over-represent majority labels, favor local lexical cues in preference judgments, produce synthetic instructions with high error rates (~50% in some datasets), flip roles and digress in agent simulations, and write free-form text that is more formal and more metaphorical than human text. Training downstream models on such data can lower accuracy (≈10% lower on human test sets for preference-trained models) and amplify biases. The paper releases data and code and gives practical mitigation advice.

Problem Statement

LLM-generated text is increasingly used to label, evaluate, and augment datasets. This paper asks whether state-of-the-art synthetic data matches human data across five data types, what artifacts it contains, and whether training on such data degrades downstream model performance.

Main Contribution

Collected and organized a broad suite of LLM-generated data covering five types: task labels, preferences, instructions, simulations, and free-form text.

Systematically stress-tested first-order properties (distributional and stylistic differences) and second-order effects (retraining downstream models) across existing benchmarks.

Key Findings

Models trained on LLM-generated preferences perform worse on human preference tests.

Numbers: ≈10% lower accuracy on human test sets (Table 8)

Practical Use: Do not substitute synthetic preferences for human preference labels without validating on held-out human-labeled test sets; expect ~10% accuracy loss.

Evidence Ref: §6.2, Table 8
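The validation step recommended above can be sketched as a simple accuracy comparison on human-labeled preference pairs. This is a minimal illustration, not the paper's evaluation code: the function names and the toy labels are hypothetical, and the predictions stand in for the outputs of two separately trained preference models.

```python
from typing import Sequence


def accuracy(predicted: Sequence[str], human: Sequence[str]) -> float:
    """Fraction of pairs where the model picks the response humans preferred."""
    if len(predicted) != len(human):
        raise ValueError("predicted and human must have the same length")
    if not human:
        raise ValueError("need at least one human-labeled pair")
    return sum(p == h for p, h in zip(predicted, human)) / len(human)


def human_test_gap(
    human_trained_preds: Sequence[str],
    synthetic_trained_preds: Sequence[str],
    human_labels: Sequence[str],
) -> float:
    """Accuracy of the human-trained model minus the synthetic-trained model."""
    return accuracy(human_trained_preds, human_labels) - accuracy(
        synthetic_trained_preds, human_labels
    )


# Toy, made-up data: "A"/"B" marks which response in each pair is preferred.
human_labels = ["A", "B", "A", "A", "B", "B", "A", "B", "A", "A"]
human_trained = ["A", "B", "A", "A", "B", "B", "A", "B", "B", "A"]
synthetic_trained = ["A", "B", "A", "B", "B", "A", "A", "B", "B", "A"]

gap = human_test_gap(human_trained, synthetic_trained, human_labels)
print(f"accuracy gap on human test set: {gap:.2f}")
```

A positive gap means the model trained on synthetic preferences agrees with human annotators less often than the human-trained baseline, which is the direction the paper reports.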

LLM task labels over-represent majority views and under-represent minority opinions.

Numbers: Majority-label skew increases after fine-tuning (label distribution plots)

Practical Use: When using LLM-generated labels, check per-class distributions and sample minority labels for human verification to avoid amplifying majority bias.

Evidence Ref: §5.1, §5.2, Figure 8
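One way to run the per-class distribution check is to compare each label's share in a human-labeled sample against its share in the synthetic labels. The sketch below assumes labels are plain strings; the helper names and the toy labels are illustrative and do not come from the paper.

```python
from collections import Counter
from typing import Dict, Sequence


def class_shares(labels: Sequence[str]) -> Dict[str, float]:
    """Share of each class in a list of labels."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    return {label: count / len(labels) for label, count in counts.items()}


def per_class_drift(
    human_labels: Sequence[str], synthetic_labels: Sequence[str]
) -> Dict[str, float]:
    """Synthetic share minus human share for every class seen in either set."""
    human = class_shares(human_labels)
    synthetic = class_shares(synthetic_labels)
    classes = sorted(set(human) | set(synthetic))
    return {c: synthetic.get(c, 0.0) - human.get(c, 0.0) for c in classes}


# Toy, made-up labels for the same ten items.
human = ["acceptable"] * 6 + ["unacceptable"] * 3 + ["unsure"]
synthetic = ["acceptable"] * 8 + ["unacceptable"] * 2

drift = per_class_drift(human, synthetic)
for label, change in drift.items():
    print(f"{label}: {change:+.2f}")
```

Classes with a negative drift are under-represented in the synthetic labels and are natural candidates for human verification.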

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | ≈10% lower when trained on LLM preferences (human→LLM mismatch) | Model trained on human preferences | ≈-10% accuracy on human test set | P2C, COBBLER (Table 8) | Models trained on LLM preference data have ~10% lower accuracy on human test sets | §6.2, Table 8
Instruction-tuned model performance (ROUGE F1 / Cos Sim) | DOLLY: 0.139 / 0.339 vs CLEANED ALPACA: 0.121 / 0.304 on FLAN | Human-generated DOLLY | CLEANED ALPACA lower by 0.018 ROUGE F1 | FLAN 2021 (Table 10) | Human-generated DOLLY scored higher than some synthetic datasets on FLAN | §7.2, Table 10
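The instruction-tuning delta follows from subtracting the DOLLY scores from the CLEANED ALPACA scores. The short sketch below recomputes it from the four reported values and also derives the cosine-similarity and relative differences, which are not stated in the results and are computed here only for illustration.

```python
# Scores as reported in the results above (FLAN 2021, Table 10).
scores = {
    "DOLLY": {"rouge_f1": 0.139, "cos_sim": 0.339},
    "CLEANED ALPACA": {"rouge_f1": 0.121, "cos_sim": 0.304},
}


def delta(metric, baseline="DOLLY", candidate="CLEANED ALPACA"):
    """Return (absolute, relative) difference of candidate versus baseline."""
    base = scores[baseline][metric]
    cand = scores[candidate][metric]
    absolute = cand - base
    return absolute, absolute / base


for metric in ("rouge_f1", "cos_sim"):
    absolute, relative = delta(metric)
    print(f"{metric}: {absolute:+.3f} absolute, {relative:+.1%} relative")
```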

What To Try In 7 Days

Sample-check: compare a small human-labeled holdout against synthetic labels and report per-class drift.

Train a quick reward model on synthetic preferences and test on human preferences to measure real-world gap.

Scan instruction datasets for obvious corruptions (programmatic flips, missing inputs) and filter out the flagged samples; the paper reports error rates of ~50% in some datasets.
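A first pass at the instruction scan can be done with a few cheap heuristics. The sketch below assumes records with `instruction`, `input`, and `output` fields; the flag names, the cue-word list, and the sample records are all hypothetical, and the checks are not the paper's error taxonomy. It covers missing inputs and empty fields only, so flipped or factually wrong outputs would still need human review.

```python
import re
from typing import Dict, List

# Assumed cue words suggesting the instruction refers to an accompanying input.
INPUT_CUES = re.compile(r"\b(following|given|below|above)\b", re.IGNORECASE)


def corruption_flags(record: Dict[str, str]) -> List[str]:
    """Return heuristic flags for one instruction record (empty list if clean)."""
    instruction = (record.get("instruction") or "").strip()
    input_text = (record.get("input") or "").strip()
    output = (record.get("output") or "").strip()

    flags = []
    if not instruction:
        flags.append("empty_instruction")
    if not output:
        flags.append("empty_output")
    if instruction and not input_text and INPUT_CUES.search(instruction):
        flags.append("missing_input")
    if instruction and output and instruction.lower() == output.lower():
        flags.append("output_echoes_instruction")
    return flags


# Toy, made-up records.
samples = [
    {"instruction": "Summarize the following article.", "input": "", "output": "A summary."},
    {"instruction": "Translate 'hello' to French.", "input": "", "output": "bonjour"},
    {"instruction": "List three colors.", "input": "", "output": ""},
]

flagged = [i for i, record in enumerate(samples) if corruption_flags(record)]
print(f"flagged {len(flagged)} of {len(samples)} samples: {flagged}")
```

Heuristics like these over-flag and under-flag, so the flagged set is best treated as a queue for human spot checks before anything is deleted.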

Agent Features

Tool Use: Multi-agent simulation (CAMEL, SPP)
Frameworks: CAMEL, SOLO PERFORMANCE PROMPTING
Architectures: Transformer LLMs (GPT-family, LLaMA variants)
Collaboration: Simulated role-based conversation frameworks (CAMEL)

Risks & Boundaries

Limitations

Study uses publicly available, heterogeneous datasets and does not exhaustively sweep models, prompts, or hyperparameters.

Human validation and qualitative labels are subject to annotator subjectivity.

When Not To Use

Do not replace human preference labels with synthetic ones without evaluating on held-out human-labeled test sets.

Avoid sole reliance on synthetic instruction data for closed, fact-based QA without aggressive filtering.

Failure Modes

Amplification of majority bias in task labels leading to minority erasure.

Locality bias in preference judgments (overweighting lexical cues and entailment).

Core Entities

Models

GPT-3.5-Turbo, ChatGPT, GPT-4, Vicuna, Koala, Baize, LLaMA2, RoBERTa

Metrics

Accuracy, F1, ROUGE F1, cosine similarity, Pearson r, log-odds vs probability, label distribution

Datasets

SOCIAL CHEMISTRY, SENTIMENT (Díaz et al.), SBIC, GHC (Gab Hate Corpus), P2C, COBBLER, SELF-INSTRUCT, UNNATURAL INSTRUCTIONS, CLEANED-ALPACA, GPT-4-LLM, DOLLY, SUPERNATURAL INSTRUCTIONS, FLAN 2021, HC3, SCARECROW, DEEPFAKE, WORKERS, CAMEL, SPP (SOLO PERFORMANCE PROMPTING)

Benchmarks

FLAN 2021, MNLI (used for entailment), DynaSent / P2C, COBBLER

Context Entities

Models

Vicuna, Koala, Baize, LLaMA2, GPT-3.5, GPT-4

Datasets

HC3, SCARECROW, DEEPFAKE, WORKERS