LLM-created training data hides biases and artifacts that can degrade models and amplify majority views

January 26, 2024 · 8 min

Overview

Production Readiness

0.4

Novelty Score

0.55

Cost Impact Score

0.7

Citation Count

2

Authors

Debarati Das, Karin De Langis, Anna Martin-Boyle, Jaehyung Kim, Minhwa Lee, Zae Myung Kim, Shirley Anugrah Hayati, Risako Owan, Bin Hu, Ritik Parkar, Ryan Koo, Jonginn Park, Aahan Tyagi, Libby Ferland, Sanjali Roy, Vincent Liu, Dongyeop Kang

Links

Abstract / PDF

Why It Matters For Business

Synthetic LLM data can cut labeling costs but risks amplifying majority views, injecting errors, and reducing downstream accuracy (~10% in some tests). Validate and human-check synthetic data before production use.

Summary TLDR

The authors collect and stress-test five types of LLM-generated text (task labels, preferences, instructions, simulations, free-form text). They find consistent artifacts: LLM task labels over-represent majority labels, preference judgments lean on local lexical cues, synthetic instructions have high error rates (~50% in some datasets), agent simulations role-flip and digress, and free-form text is more formal and more metaphorical than human text. Training downstream models on such artificial data can lower accuracy (preference-trained models score ≈10% lower on human test sets) and amplify biases. The paper releases data and code and gives practical mitigation advice.

Problem Statement

LLM-generated text is increasingly used to label, evaluate, and augment datasets. This paper asks whether state-of-the-art synthetic data matches human data across five data types, what artifacts it contains, and whether training on such data degrades downstream model performance.

Main Contribution

Collected and organized a broad suite of LLM-generated data covering five types: task labels, preferences, instructions, simulations, and free-form text.

Systematically stress-tested first-order properties (distributional and stylistic differences) and second-order effects (retraining downstream models) across existing benchmarks.

Documented recurring artifacts, showed cases where synthetic data degrades or skews model behavior, and proposed practical mitigation steps and data-documentation calls to action.

Key Findings

Models trained on LLM-generated preferences perform worse on human preference tests.

Numbers: ≈10% lower accuracy on human test sets (Table 8)

LLM task labels over-represent majority views and under-represent minority opinions.

Numbers: Role-flipped/majority skew increases after fine-tuning (label distribution plots)

Some synthetic instruction datasets have very high error rates.

Numbers: Error rates close to 50% reported for UNNATURAL INSTRUCTIONS and SELF-INSTRUCT

Simulated multi-agent conversations show role-flipping and digression that harm task accuracy.

Numbers: Role-flipped conversations average 20.1 messages vs dataset average 14.4; digression reduces average accuracy from 72.4% to 53.0%

Free-form LLM text tends to be more formal and uses more metaphor than human text.

Numbers: WORKERS formality: human 63.6% vs machine 91.1% (Table 14)

Results

Accuracy

Value: ≈10% lower when trained on LLM preferences (human→LLM mismatch)

Baseline: Model trained on human preferences

Instruction-tuned model performance (ROUGE F1 / cosine similarity)

Value: DOLLY: 0.139 / 0.339 vs CLEANED ALPACA: 0.121 / 0.304 on FLAN

Baseline: Human-generated DOLLY

Role-flipping incidence & conversation length

Value: Role-flipped conversations average 20.1 messages vs dataset average 14.4

Baseline: All simulated conversations

Task accuracy under digression

Value: Accuracy 72.4% (no digression) vs 53.0% (predicted digression)

Baseline: Conversations without digression

Who Should Care

What To Try In 7 Days

Sample-check: compare a small human-labeled holdout against synthetic labels and report per-class drift.

Train a quick reward model on synthetic preferences and test on human preferences to measure real-world gap.

Scan instruction datasets for obvious corruptions (programmatic flips, missing inputs) and remove the flagged samples; error rates near 50% are reported for some synthetic instruction datasets.
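The sample-check step above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's procedure: the function names `label_distribution` and `per_class_drift`, the label names, and the toy data are all assumptions made for the example. Drift is defined here as the synthetic share of a class minus its human share, so positive values mean the class is over-represented in the synthetic labels.

```python
from collections import Counter


def label_distribution(labels):
    """Return {label: share of samples} for a non-empty list of labels."""
    if not labels:
        raise ValueError("need at least one label")
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


def per_class_drift(human_labels, synthetic_labels):
    """Synthetic share minus human share per class (positive = over-represented)."""
    human = label_distribution(human_labels)
    synthetic = label_distribution(synthetic_labels)
    classes = sorted(set(human) | set(synthetic))
    return {c: synthetic.get(c, 0.0) - human.get(c, 0.0) for c in classes}


# Toy labels, invented for illustration only (not from the paper).
human = ["ok", "ok", "offensive", "offensive", "unsure"]
synthetic = ["ok", "ok", "ok", "ok", "offensive"]

drift = per_class_drift(human, synthetic)
for label, delta in drift.items():
    print(f"{label}: {delta:+.2f}")
```

In this toy case the majority class gains share while the minority classes lose it, which is the pattern the report warns about. On real data, a class that disappears entirely from the synthetic labels is worth flagging separately.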

Agent Features

Tool Use

  • Multi-agent simulation (CAMEL, SPP)

Frameworks

  • CAMEL
  • SOLO PERFORMANCE PROMPTING

Architectures

  • Transformer LLMs (GPT-family, LLaMA variants)

Collaboration

  • Simulated role-based conversation frameworks (CAMEL)

Reproducibility

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Study uses publicly available, heterogeneous datasets and does not exhaustively sweep models, prompts, or hyperparameters.
  • Human validation and qualitative labels are subject to annotator subjectivity.
  • Does not systematically explore the latest prompting or chain-of-thought methods.
  • Findings describe sampled datasets and may not generalize to all LLMs or generation settings.

When Not To Use

  • Do not replace human preference labels with synthetic ones without evaluation on human-held tests.
  • Avoid sole reliance on synthetic instruction data for closed, fact-based QA without aggressive filtering.
  • Avoid using long un-vetted simulated conversations directly for mission-critical agent training.
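Before training anything, a cheap first check for the preference-label point above is raw agreement between synthetic and human choices on a human-held test set. The sketch below assumes each preference is recorded as "A" or "B" for the preferred response in a pair; `preference_agreement` and the toy data are hypothetical. It is a simpler proxy than training a reward model and does not reproduce the paper's evaluation.

```python
def preference_agreement(human_prefs, synthetic_prefs):
    """Fraction of pairs where the synthetic choice matches the human choice."""
    if len(human_prefs) != len(synthetic_prefs):
        raise ValueError("preference lists must be the same length")
    if not human_prefs:
        raise ValueError("need at least one pair")
    matches = sum(h == s for h, s in zip(human_prefs, synthetic_prefs))
    return matches / len(human_prefs)


# Toy data, invented for illustration: which response was preferred per pair.
human_prefs = ["A", "B", "A", "A", "B", "B", "A", "B"]
synthetic_prefs = ["A", "B", "B", "A", "B", "A", "A", "B"]

agreement = preference_agreement(human_prefs, synthetic_prefs)
print(f"agreement with human preferences: {agreement:.2f}")
```

Low agreement on the held-out human set is a signal to keep human labels in the loop. High agreement alone does not rule out the locality bias listed under Failure Modes.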

Failure Modes

  • Amplification of majority bias in task labels leading to minority erasure.
  • Locality bias in preference judgments (overweighting local lexical cues and entailment).
  • Instruction-tuned models hallucinate more if trained on error-prone synthetic instructions.
  • Agent simulations suffer role-flips and digression that reduce task accuracy.
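One way to reduce exposure to error-prone synthetic instructions is a cheap structural filter run before any training. The sketch below assumes Alpaca-style records with `instruction`, `input`, and `output` fields. The field names, the `INPUT_PHRASES` heuristics, and the issue names are all hypothetical choices for illustration. It catches only surface problems such as missing inputs or empty outputs, not factual errors.

```python
# Hypothetical phrases suggesting the instruction expects an accompanying input.
INPUT_PHRASES = ("the following", "given text", "the passage below", "the input")


def find_structural_issues(record):
    """Return a list of issue names for one instruction record (empty = passes)."""
    instruction = (record.get("instruction") or "").strip()
    context = (record.get("input") or "").strip()
    output = (record.get("output") or "").strip()

    issues = []
    if not instruction:
        issues.append("empty_instruction")
    if not output:
        issues.append("empty_output")
    if output and output.lower() == instruction.lower():
        issues.append("output_repeats_instruction")
    mentions_input = any(p in instruction.lower() for p in INPUT_PHRASES)
    if mentions_input and not context:
        issues.append("missing_input")
    return issues


# Toy records, invented for illustration only.
records = [
    {"instruction": "Summarize the following article.", "input": "",
     "output": "The article is about cats."},
    {"instruction": "Name three primary colors.", "input": "",
     "output": "Red, yellow, and blue."},
    {"instruction": "Translate to French: hello", "input": "", "output": ""},
]

flagged = {}
for index, record in enumerate(records):
    issues = find_structural_issues(record)
    if issues:
        flagged[index] = issues

print(flagged)
```

Flagged records are candidates for review or removal, not confirmed errors. A phrase-based check like this will produce false positives and should be tuned against a manually inspected sample.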

Core Entities

Models

  • GPT-3.5-Turbo
  • ChatGPT
  • GPT-4
  • Vicuna
  • Koala
  • Baize
  • LLaMA2
  • RoBERTa

Metrics

  • Accuracy
  • F1
  • ROUGE F1
  • cosine similarity
  • Pearson r
  • log-odds vs probability
  • label distribution

Datasets

  • SOCIAL CHEMISTRY
  • SENTIMENT (Díaz et al.)
  • SBIC
  • GHC (Gab Hate Corpus)
  • P2C
  • COBBLER
  • SELF-INSTRUCT
  • UNNATURAL INSTRUCTIONS
  • CLEANED-ALPACA
  • GPT-4-LLM
  • DOLLY
  • SUPERNATURAL INSTRUCTIONS
  • FLAN 2021
  • HC3
  • SCARECROW
  • DEEPFAKE
  • WORKERS
  • CAMEL
  • SPP (SOLO PERFORMANCE PROMPTING)

Benchmarks

  • FLAN 2021
  • MNLI (used for entailment)
  • DynaSent / P2C
  • COBBLER

Context Entities

Models

  • Vicuna
  • Koala
  • Baize
  • LLaMA2
  • GPT-3.5
  • GPT-4

Datasets

  • HC3
  • SCARECROW
  • DEEPFAKE
  • WORKERS