QDAIF: use LMs as both generator and judge to evolve diverse, high‑quality text

October 19, 2023 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.5

Citation Count

7

Authors

Herbie Bradley, Andrew Dai, Hannah Teufel, Jenny Zhang, Koen Oostermeijer, Marco Bellagente, Jeff Clune, Kenneth Stanley, Grégory Schott, Joel Lehman

Links

Abstract / PDF

Why It Matters For Business

QDAIF automates generation plus subjective evaluation so teams can produce many distinct, human‑preferred text options without custom heuristics or expensive human labeling; useful for creative briefs, A/B content pools, and synthetic data generation.

Summary TLDR

The paper introduces QDAIF, a practical recipe that combines a quality‑diversity search algorithm (MAP‑Elites) with large language models used as both mutation operators (LMX) and automatic evaluators (AI feedback). Across creative writing tasks (opinions, short stories, poetry), QDAIF fills more niches and returns more human-preferred variety than baseline prompting or quality‑only search. Human studies show roughly 73% agreement between AI and a single annotator and higher alignment where annotators agree. Main limits: reward‑hacking at very high model scores and the need to specify which diversity axes to search.

Problem Statement

Quality‑diversity (QD) search finds many high‑quality options, but traditional QD needs explicit numeric measures of quality and diversity. That blocks QD from subjective domains like creative writing. Can off‑the‑shelf LMs be used to generate variation and to judge both quality and qualitative diversity so MAP‑Elites can run in human‑style domains?

Main Contribution

QDAIF: extend MAP‑Elites to call LMs for mutation (LMX / LMX‑rewrite) and for evaluating both quality and diversity in natural language.

Show that QDAIF beats baselines on creative writing tasks (Opinions, Stories, Poetry), as measured by QD score and human evaluation.

Practical design choices: use LM logits as continuous measures, non‑uniform binning to counter LM calibration, and instruction‑guided rewrite mutation for poetry with GPT‑4.
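The first two design choices can be illustrated with a minimal sketch. It assumes the evaluator exposes log-probabilities for a "yes" and a "no" answer token, and that its outputs cluster near 0 and 1, so bins are narrower at the extremes. The function names and the bin edges are hypothetical, not taken from the paper.

```python
import bisect
import math
from typing import Sequence


def yes_probability(logprob_yes: float, logprob_no: float) -> float:
    """Normalize two answer log-probabilities into P(yes) in [0, 1]."""
    # Subtract the max before exponentiating for numerical stability.
    m = max(logprob_yes, logprob_no)
    p_yes = math.exp(logprob_yes - m)
    p_no = math.exp(logprob_no - m)
    return p_yes / (p_yes + p_no)


# Hypothetical non-uniform edges: narrow bins near 0 and 1, wide in the middle.
BIN_EDGES = [0.005, 0.02, 0.05, 0.2, 0.5, 0.8, 0.95, 0.98, 0.995]


def bin_index(measure: float, edges: Sequence[float] = BIN_EDGES) -> int:
    """Map a measure in [0, 1] to a bin index in 0..len(edges)."""
    return bisect.bisect_right(edges, measure)


if __name__ == "__main__":
    p = yes_probability(math.log(0.9), math.log(0.1))
    print(p, bin_index(p))
```

With uniform edges, an evaluator that rarely outputs mid-range probabilities would leave the middle bins empty; adapting the edges per axis is one way to spread solutions across the archive.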

Key Findings

QDAIF sets scored higher in human-assessed quality‑diversity than most baselines.

Numbers: Human QD score, QDAIF 0.772 vs Random-Search 0.606 (Table 1)

For poetry, QDAIF + GPT‑4 produced much larger QD scores than naive baselines.

Numbers: Poetry QD score, QDAIF 130 (CI 118–145) vs Random‑Poems 76 (CI 67–85)

AI feedback labels align with humans most of the time, but high LM confidence can mislead.

Numbers: AI–human agreement ~73% overall; up to 95% when annotators agree and label is non‑neutral (Appendix A.1)
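The three agreement figures correspond to three different denominators. A minimal sketch of how such rates could be computed, assuming one AI label and two human labels per sample; the function names, label strings, and data layout are illustrative and not the authors' evaluation code.

```python
from typing import Dict, Iterable, Optional, Sequence, Tuple


def _rate(pairs: Iterable[Tuple[str, str]]) -> Optional[float]:
    """Fraction of (ai, human) pairs that match; None if there are no pairs."""
    pairs = list(pairs)
    if not pairs:
        return None
    return sum(a == h for a, h in pairs) / len(pairs)


def agreement_rates(
    ai: Sequence[str],
    ann_a: Sequence[str],
    ann_b: Sequence[str],
    neutral: str = "neutral",
) -> Dict[str, Optional[float]]:
    if not (len(ai) == len(ann_a) == len(ann_b)):
        raise ValueError("label sequences must have equal length")
    # AI vs a single annotator, over all samples.
    overall = _rate(zip(ai, ann_a))
    # Restrict to samples where both annotators gave the same label.
    agreed = [(x, a) for x, a, b in zip(ai, ann_a, ann_b) if a == b]
    # Further restrict to samples whose agreed label is not neutral.
    non_neutral = [(x, a) for x, a in agreed if a != neutral]
    return {
        "overall": overall,
        "annotators_agree": _rate(agreed),
        "agree_non_neutral": _rate(non_neutral),
    }
```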

Results

Human QD score (avg across three domains)

Value: QDAIF 0.772; Fixed-Few-Shot 0.767; Random-Search 0.606

Baseline: Fixed-Few-Shot, Random-Search

Poetry QD score (25 bins)

Value: QDAIF 130 (CI 118–145)

Baseline: Random-Poems 76 (CI 67–85); Fixed Seed Rewrite 99 (CI 72–117)

AI–Human label agreement

Value: ≈73% overall; 82% when two annotators agree; 95% on non‑neutral agreed samples

Code domain diversity (sorting algorithms)

Value: QDAIF found 53% non‑bubble sorts; Random‑Code 5% non‑bubble

Baseline: Random‑Code

Who Should Care

What To Try In 7 Days

Prototype QDAIF: run MAP‑Elites with your LM as mutator and evaluator on a small prompt (e.g., 20 bins for tone).

Use few‑shot LMX mutation: seed 3 exemplars and iterate; compare to simple few‑shot sampling.

Validate AI feedback against a small human panel (10–20 samples) to detect reward‑hacking patterns.
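A prototype of the first step could look like the sketch below. It is a simplified one-dimensional MAP‑Elites loop with uniform bins, not the authors' implementation. `toy_mutate` and `toy_evaluate` are hypothetical stand-ins that make the loop runnable offline; in a real prototype they would be replaced by calls to your LM for few-shot mutation and for quality/tone feedback.

```python
import random
from typing import Callable, Dict, List, Tuple

Archive = Dict[int, Tuple[float, str]]  # bin index -> (fitness, text)


def map_elites(
    seeds: List[str],
    mutate: Callable[[List[str], random.Random], str],
    evaluate: Callable[[str], Tuple[float, float]],
    n_bins: int = 20,
    iterations: int = 200,
    rng: random.Random = None,
) -> Archive:
    rng = rng or random.Random(0)
    archive: Archive = {}

    def try_insert(text: str) -> None:
        fitness, measure = evaluate(text)  # both assumed to lie in [0, 1]
        idx = min(int(measure * n_bins), n_bins - 1)
        # Keep the candidate only if its bin is empty or it beats the elite.
        if idx not in archive or fitness > archive[idx][0]:
            archive[idx] = (fitness, text)

    for seed in seeds:
        try_insert(seed)
    for _ in range(iterations):
        elites = [text for _, text in archive.values()]
        parents = rng.sample(elites, min(3, len(elites)))  # few-shot exemplars
        try_insert(mutate(parents, rng))
    return archive


# Toy stand-ins (assumptions): swap these for real LM calls.
POSITIVE = ["great", "lovely", "bright"]
NEGATIVE = ["awful", "gloomy", "dull"]


def toy_mutate(parents: List[str], rng: random.Random) -> str:
    words = rng.choice(parents).split()
    words.append(rng.choice(POSITIVE + NEGATIVE))
    return " ".join(words[-8:])


def toy_evaluate(text: str) -> Tuple[float, float]:
    words = text.split()
    fitness = min(len(words) / 8, 1.0)  # stand-in for "quality"
    measure = sum(w in POSITIVE for w in words) / len(words)  # stand-in "tone"
    return fitness, measure


if __name__ == "__main__":
    result = map_elites(["great day", "awful day"], toy_mutate, toy_evaluate)
    for idx in sorted(result):
        print(idx, result[idx])
```

Comparing the filled bins against plain few-shot sampling with the same evaluation budget gives a quick read on whether the archive adds diversity for your task.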

Agent Features

Memory

  • archive (MAP‑Elites bins and bin depth) acts as long‑term memory

Planning

  • iterative evolutionary loop (MAP‑Elites) selecting and mutating elites

Tool Use

  • LMs for generation and evaluation
  • few‑shot prompt pools (LMX)
  • instruction rewrite operator (LMX‑rewrite)
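As a rough illustration of the few-shot prompt pool idea, the sketch below assembles a prompt from sampled exemplars and leaves an open slot for the LM to continue. The header text, the `###` separator, and the function name are assumptions for illustration; the paper's actual prompt templates may differ.

```python
import random
from typing import List, Optional


def build_lmx_prompt(
    pool: List[str],
    header: str,
    n_shots: int = 3,
    rng: Optional[random.Random] = None,
) -> str:
    """Concatenate sampled exemplars so the LM continues with a new variation."""
    if not pool:
        raise ValueError("prompt pool must contain at least one exemplar")
    rng = rng or random.Random()
    shots = rng.sample(pool, min(n_shots, len(pool)))
    blocks = [f"{header}\n{text.strip()}\n###" for text in shots]
    blocks.append(header)  # open slot: the model's completion is the offspring
    return "\n".join(blocks)


if __name__ == "__main__":
    pool = ["Story one.", "Story two.", "Story three.", "Story four."]
    print(build_lmx_prompt(pool, "Here is a short story:", rng=random.Random(0)))
```

Drawing the pool from current archive elites rather than a fixed seed set is what lets the prompts drift toward better and more varied exemplars over iterations.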

Frameworks

  • OpenELM
  • LMX
  • MAP‑Elites

Is Agentic

true

Architectures

  • LM mutation + MAP‑Elites (evolutionary archive)
  • LM evaluator (finetuned adapter 70B; GPT‑4 for poetry)

Collaboration

  • human-in-the-loop for validation (human evaluation experiments)

Optimization Features

Token Efficiency

  • few‑shot prompts use 3 exemplars by default to limit context cost

Model Optimization

  • adapter finetuning for AI feedback model (70B adapter)

System Optimization

  • archive depth to collect finetuning samples; non‑uniform binning to reduce wasted evaluations

Training Optimization

  • mix of task clusters for instruction tuning (FLAN, P3 style datasets)

Inference Optimization

  • reuse of prompt pools and archive elites to bias few‑shot prompts

Reproducibility

Code Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Reward‑hacking: very high AI scores (fitness ≈ 1) sometimes do not match human quality (Figure 5).
  • Requires explicit diversity axes; QDAIF will not discover axes you never defined.
  • Compute and API cost when using large evaluators (GPT‑4) at scale.
  • Evaluator calibration matters; non‑uniform binning and prompt design needed per axis.

When Not To Use

  • Safety‑critical domains that need provable factual correctness (medical, legal).
  • When you cannot specify or validate diversity axes and lack human evaluators for checks.
  • If API cost or latency from large LMs is prohibitive.

Failure Modes

  • Search exploits evaluator quirks to maximize AI score with low human value (reward hacking).
  • Archive fills with low‑quality or off‑topic elites if initialization is poor (zero‑shot init issues).
  • LM calibration mismatch causes poor bin coverage unless binning is adapted.
  • Single‑model evaluation blind spots; correlated model errors reduce robustness.

Core Entities

Models

  • GPT-4
  • GPT-3.5-Turbo
  • luminousbase (13B, 30B, 70B)

Metrics

  • QD score (sum of best fitness per bin)
  • Human QD score (sum of human quality per identified diversity category)
  • Human quality Likert rating
  • AI–Human agreement
  • Coverage (fraction of bins filled)
  • Best solution quality
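The archive-level metrics above are simple to compute once each filled bin stores its best fitness. A minimal sketch, assuming the archive is a mapping from bin index to best fitness (the function names are illustrative):

```python
from typing import Dict, Optional


def qd_score(archive: Dict[int, float]) -> float:
    """Sum of the best fitness in each filled bin."""
    return sum(archive.values())


def coverage(archive: Dict[int, float], n_bins: int) -> float:
    """Fraction of bins that contain at least one solution."""
    return len(archive) / n_bins


def best_quality(archive: Dict[int, float]) -> Optional[float]:
    """Highest fitness anywhere in the archive, or None if it is empty."""
    return max(archive.values()) if archive else None


if __name__ == "__main__":
    archive = {0: 0.5, 3: 0.9, 7: 1.0}
    print(qd_score(archive), coverage(archive, n_bins=10), best_quality(archive))
```

Because QD score sums over bins, its ceiling depends on the setup: with 25 bins and quality rated 1–10, the maximum possible score is 250.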

Datasets

  • HumanEval (used for code domain, task #88)
  • Human evaluation annotations collected by authors (blind study, Appendix A.1)

Benchmarks

  • Custom QD tasks: Opinions, Stories (Genre / Ending / 2D), Poetry (genre×tone)
  • Poetry QD score benchmark (25 bins, quality 1–10)