LLM-created training data hides biases and artifacts that can degrade models and amplify majority views

Overview

Decision Snapshot: Needs Validation

The paper compiles broad, multi-dataset evidence that synthetic data often differs from human data and can harm downstream models; experiments are varied but not exhaustive, so treat findings as well-supported but preliminary.

Citations: 2

Evidence Strength: 0.75

Confidence: 0.90

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 70%

Production readiness: 40%

Novelty: 55%

Authors

Debarati Das, Karin De Langis, Anna Martin-Boyle, Jaehyung Kim, Minhwa Lee, Zae Myung Kim, Shirley Anugrah Hayati, Risako Owan, Bin Hu, Ritik Parkar, Ryan Koo, Jonginn Park, Aahan Tyagi, Libby Ferland, Sanjali Roy, Vincent Liu, Dongyeop Kang

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Synthetic LLM data can cut labeling costs but risks amplifying majority views, injecting errors, and reducing downstream accuracy (~10% in some tests). Validate and human-check synthetic data before production use.

Who Should Care

CTO, Product Manager, ML Engineer, Data Scientist, Engineering Lead, Founder

Summary TLDR

The authors collect and stress-test five types of LLM-generated text (task labels, preferences, instructions, simulations, free-form text). They find consistent artifacts: LLMs over-represent majority labels and rely on local lexical cues in preference judgments; synthetic instructions have high error rates (~50% in some datasets); simulated agents flip roles and digress; and LLM free-form text is more formal and more metaphorical than human text. Training downstream models on such synthetic data can lower accuracy (≈10% lower on human test sets for preference-trained models) and amplify biases. The paper releases data and code and gives practical mitigation advice.

Problem Statement

LLM-generated text is increasingly used to label, evaluate, and augment datasets. This paper asks whether state-of-the-art synthetic data matches human data across five data types, what artifacts it contains, and whether training on such data degrades downstream model performance.

Main Contribution

Collected and organized a broad suite of LLM-generated data covering five types: task labels, preferences, instructions, simulations, and free-form text.

Systematically stress-tested first-order properties (distributional and stylistic differences) and second-order effects (retraining downstream models) across existing benchmarks.

Key Findings

Models trained on LLM-generated preferences perform worse on human preference tests.

Numbers: ≈10% lower accuracy on human test sets (Table 8)

Practical Use: Do not substitute synthetic preferences for human preference labels without validating on held-out human test sets; expect roughly 10% lower accuracy.

Evidence Ref: §6.2, Table 8
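The validation step above can be sketched as a simple accuracy-gap check. This is a minimal illustration, not the paper's evaluation code: the function names, the 0/1 encoding of pairwise preferences, and the toy predictions are all assumptions made for the example.

```python
from typing import Sequence


def pairwise_accuracy(predicted: Sequence[int], human: Sequence[int]) -> float:
    """Fraction of comparison pairs where the predicted winner matches the human choice."""
    if len(predicted) != len(human):
        raise ValueError("predicted and human must have the same length")
    if not human:
        raise ValueError("need at least one labeled pair")
    matches = sum(p == h for p, h in zip(predicted, human))
    return matches / len(human)


def accuracy_gap(
    human_trained_preds: Sequence[int],
    synthetic_trained_preds: Sequence[int],
    human_labels: Sequence[int],
) -> float:
    """Accuracy of the synthetic-trained model minus the human-trained one.

    A negative value means the synthetic-trained model is worse on human labels.
    """
    return pairwise_accuracy(synthetic_trained_preds, human_labels) - pairwise_accuracy(
        human_trained_preds, human_labels
    )


# Toy data, made up for illustration: 0 = first response preferred, 1 = second.
human_labels = [0, 1, 1, 0, 1, 0, 0, 1, 1, 0]
human_trained_preds = [0, 1, 1, 0, 1, 0, 0, 1, 0, 0]
synthetic_trained_preds = [0, 1, 0, 0, 1, 0, 1, 1, 0, 0]

gap = accuracy_gap(human_trained_preds, synthetic_trained_preds, human_labels)
print(f"accuracy gap on human test pairs: {gap:+.2f}")
```

In practice the two prediction lists would come from models trained on human and on synthetic preferences respectively, both scored on the same held-out human-labeled pairs.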

LLM task labels over-represent majority views and under-represent minority opinions.

Numbers: Role-flipped/majority skew increases after fine-tuning (label distribution plots)

Practical Use: When using LLM-generated labels, check per-class distributions and sample minority labels for human verification to avoid amplifying majority bias.

Evidence Ref: §5.1, §5.2, Figure 8
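One way to run the per-class check described above is to compare class shares between a human-labeled sample and the synthetic labels, then pull items from under-represented classes for human review. The sketch below is illustrative only: the helper names, the label strings, and the toy counts are assumptions, not taken from the paper.

```python
import random
from collections import Counter


def class_shares(labels):
    """Map each class to its share of the label list."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: n / total for cls, n in counts.items()}


def per_class_drift(human_labels, synthetic_labels):
    """Synthetic share minus human share for every class seen in either set."""
    human = class_shares(human_labels)
    synth = class_shares(synthetic_labels)
    classes = sorted(set(human) | set(synth))
    return {c: synth.get(c, 0.0) - human.get(c, 0.0) for c in classes}


def sample_minority_for_review(items, synthetic_labels, drift, k, seed=0):
    """Pick up to k items whose synthetic label is in an under-represented class."""
    under = {c for c, d in drift.items() if d < 0}
    pool = [item for item, lab in zip(items, synthetic_labels) if lab in under]
    rng = random.Random(seed)
    return rng.sample(pool, min(k, len(pool)))


# Toy data, made up for illustration.
human = ["ok"] * 6 + ["offensive"] * 4
synthetic = ["ok"] * 9 + ["offensive"] * 1

drift = per_class_drift(human, synthetic)
review = sample_minority_for_review(list(range(10)), synthetic, drift, k=3)
print(drift, review)
```

A negative drift value marks a class that the synthetic labels under-represent relative to the human sample; those are the classes worth sending to annotators first.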

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	≈10% lower when trained on LLM preferences (human→LLM mismatch)	Model trained on human preferences	≈-10% accuracy on human test set	P2C, COBBLER (Table 8)	Models trained on LLM preference data have ~10% lower accuracy on human test sets	§6.2 Table 8
Instruction-tuned model performance (ROUGE F1 / Cos Sim)	DOLLY: 0.139 / 0.339 vs CLEANED ALPACA: 0.121 / 0.304 on FLAN	Human-generated DOLLY	CLEANED ALPACA lower by 0.018 ROUGE F1 and 0.035 Cos Sim	FLAN 2021 (Table 10)	Human-generated DOLLY scored higher than some synthetic datasets on FLAN	§7.2 Table 10

What To Try In 7 Days

Sample-check: compare a small human-labeled holdout against synthetic labels and report per-class drift.

Train a quick reward model on synthetic preferences and test on human preferences to measure real-world gap.

Scan instruction datasets for obvious corruptions (programmatic flips, missing inputs) and remove the flagged samples; error rates reach ~50% in some datasets, so expect to discard a large share.
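The instruction-scan step could start from a few cheap heuristics like the ones below. This is a rough sketch under stated assumptions: records are assumed to be dictionaries with Alpaca-style `instruction`, `input`, and `output` fields, and the cue phrases and flag names are hypothetical choices for illustration. It covers empty fields and missing inputs only, not every corruption type the paper describes.

```python
import re

# Phrases suggesting the instruction depends on an accompanying input.
# This cue list is a heuristic assumption, not taken from the paper.
INPUT_CUES = re.compile(r"\b(the following|given (the|this)|below|above)\b", re.IGNORECASE)


def corruption_flags(record):
    """Return a list of heuristic flags for one instruction record."""
    instruction = (record.get("instruction") or "").strip()
    input_text = (record.get("input") or "").strip()
    output = (record.get("output") or "").strip()
    flags = []
    if not instruction:
        flags.append("empty_instruction")
    if not output:
        flags.append("empty_output")
    # Instruction refers to some input text, but none was provided.
    if instruction and not input_text and INPUT_CUES.search(instruction):
        flags.append("missing_input")
    return flags


def split_clean_and_flagged(records):
    """Separate records with no flags from records with at least one flag."""
    clean, flagged = [], []
    for record in records:
        flags = corruption_flags(record)
        if flags:
            flagged.append((record, flags))
        else:
            clean.append(record)
    return clean, flagged


# Toy records, made up for illustration.
records = [
    {"instruction": "Summarize the following passage.", "input": "", "output": "A summary."},
    {"instruction": "Name three primary colors.", "input": "", "output": "Red, yellow, blue."},
    {"instruction": "Translate to French.", "input": "Good morning", "output": ""},
]

clean, flagged = split_clean_and_flagged(records)
print(f"flagged {len(flagged)} of {len(records)} records")
```

Flagged records are candidates for removal or human review rather than confirmed errors, since heuristics like these will produce some false positives.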

Agent Features

Tool Use

Multi-agent simulation (CAMEL, SPP)

Frameworks

CAMEL, SOLO PERFORMANCE PROMPTING

Architectures

Transformer LLMs (GPT-family, LLaMA variants)

Collaboration

Simulated role-based conversation frameworks (CAMEL)

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Code URLs

https://minnesotanlp.github.io/artifact https://huggingface.co/datasets/minnesotanlp/LLM-Artifacts

Data URLs

https://huggingface.co/datasets/minnesotanlp/LLM-Artifacts https://minnesotanlp.github.io/artifact

Risks & Boundaries

Limitations

Study uses publicly available, heterogeneous datasets and does not exhaustively sweep models, prompts, or hyperparameters.

Human validation and qualitative labels are subject to annotator subjectivity.

When Not To Use

Do not replace human preference labels with synthetic ones without evaluating on held-out human test sets.

Avoid sole reliance on synthetic instruction data for closed, fact-based QA without aggressive filtering.

Failure Modes

Amplification of majority bias in task labels leading to minority erasure.

Locality bias in preference judgments (overweighting local lexical cues and entailment).

Core Entities

Models

GPT-3.5-Turbo, ChatGPT, GPT-4, Vicuna, Koala, Baize, Llama 2, RoBERTa

Metrics

Accuracy, F1, ROUGE F1, cosine similarity, Pearson r, log-odds vs probability, label distribution

Datasets

SOCIAL CHEMISTRY, SENTIMENT (Díaz et al.), SBIC, GHC (Gab Hate Corpus), P2C, COBBLER, SELF-INSTRUCT, UNNATURAL INSTRUCTIONS, CLEANED-ALPACA, GPT-4-LLM, DOLLY, SUPERNATURAL INSTRUCTIONS, FLAN 2021, HC3, SCARECROW, DEEPFAKE, WORKERS, CAMEL, SPP (SOLO PERFORMANCE PROMPTING)

Benchmarks

FLAN 2021, MNLI (used for entailment), DynaSent / P2C, COBBLER

Context Entities

Models

Vicuna, Koala, Baize, LLaMA2, GPT-3.5, GPT-4

Datasets

HC3, SCARECROW, DEEPFAKE, WORKERS

You May Also Want to Read

Use LLMs to synthesize context examples and cut expert annotation by ~40–60% for biomedical entity linking

ProUtt: LLM-driven synthesis of preference-labelled intent reasoning to predict users' next utterance

Use multiple LLMs together to auto-generate preference datasets and improve model responses

Train detectors by teaching models with high-quality fake answers

TarGEN: generate balanced, diverse labeled NLP datasets from task descriptions (no seed examples)