Large blinded study: LLM ideas judged more novel than expert ideas but slightly less feasible

September 6, 2024 · 9 min

Overview

Decision Snapshot: Needs Validation

The evidence is a controlled human study showing consistent novelty gains for AI ideas, but the results are limited to prompting-based NLP topics, depend on expert review, and are bounded by weak LLM evaluators and low idea diversity.

Citations: 41

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 12

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 3/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 40%

Production readiness: 40%

Novelty: 70%

Authors

Chenglei Si, Diyi Yang, Tatsunori Hashimoto

Links

Abstract / PDF / Code / Data

Why It Matters For Business

LLM ideation can quickly surface novel research directions, but outputs need human vetting for feasibility and implementation; blindly trusting LLM ideas or LLM-only evaluation risks wasted effort.

Who Should Care

Summary TLDR

The authors ran a controlled, blind human study comparing research ideas from a retrieval-augmented LLM ideation agent with ideas written by 49 NLP experts and reviewed by 79 experts. Across ~300 reviews, AI-generated ideas scored higher on novelty (statistically significant, p<0.05) while showing a small, non-significant drop in feasibility. The paper also shows practical limits: LLMs repeat many candidate ideas despite massive overgeneration, and current LLM-based evaluators fall short of human reviewers. The authors release agent code and review scores.

Problem Statement

Can current LLMs produce novel, expert-level research ideas comparable to human researchers? The paper measures ideation quality with a large, controlled blind study and inspects both agent design and evaluation limits.

Main Contribution

A large, controlled blind evaluation comparing 49 human-written ideas with AI-generated ideas, scored by 79 expert reviewers (N=298 reviews).

A practical LLM ideation pipeline (RAG retrieval, massive overgeneration, deduplication, pairwise LLM ranking) and public release of code and review scores.

Key Findings

AI-generated ideas were rated more novel than ideas written by human experts.

Numbers: Novelty: Human 4.84 vs AI 5.64 (1–10 scale); p<0.01 (Test 1)

Practical Use: Use LLM ideation to surface fresh directions quickly, but validate feasibility before committing resources.

Evidence Ref: Table 7; Test 1 novelty scores

Feasibility of AI ideas trended lower, but the difference was not statistically significant in this study.

Numbers: Feasibility: Human 6.61 vs AI 6.34 (1–10 scale); difference not statistically significant

Practical Use: Expect some extra engineering effort to make AI-suggested ideas executable; don't treat high novelty as proof of easy execution.

Evidence Ref: Table 7; feasibility scores

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| Novelty (mean score) | Human 4.84; AI 5.64; AI+HumanRerank 5.81 (1–10) | Human ideas | AI +0.8; AI+Rerank +1.0 vs Human | All reviews aggregated; Test 1 | Table 7 Test 1; two-tailed Welch t-tests with Bonferroni | Table 7 |
| Feasibility (mean score) | Human 6.61; AI 6.34; AI+HumanRerank 6.44 (1–10) | Human ideas | AI −0.27 vs Human (not significant) | All reviews aggregated; Test 1 | Table 7 feasibility row | Table 7 |
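The Results rows cite two-tailed Welch t-tests with Bonferroni correction. As a rough illustration of what that computation involves, here is a minimal sketch using only the Python standard library. The function names `welch_t` and `bonferroni` are hypothetical and are not taken from the paper's released code. The sketch returns the t statistic and the Welch-Satterthwaite degrees of freedom; turning those into a p-value needs a t-distribution CDF, which is left out here.

```python
import math
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return (t, df) for Welch's unequal-variance t-test of mean(a) - mean(b).

    Hypothetical helper for illustration; each sample needs at least two values.
    """
    na, nb = len(sample_a), len(sample_b)
    # Per-group squared standard errors (sample variance divided by group size).
    va = variance(sample_a) / na
    vb = variance(sample_b) / nb
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df


def bonferroni(p_values):
    """Multiply each p-value by the number of comparisons, capped at 1.0."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


if __name__ == "__main__":
    # Toy review scores, invented for this example only (not study data).
    ai_scores = [6, 7, 5, 6, 8]
    human_scores = [4, 5, 5, 6, 4]
    print(welch_t(ai_scores, human_scores))
    print(bonferroni([0.01, 0.04, 0.5]))
```

If SciPy is available, `scipy.stats.ttest_ind(a, b, equal_var=False)` performs the same Welch test and also returns the two-tailed p-value.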

What To Try In 7 Days

Run an ideation pilot: use RAG + overgeneration but limit to 200–500 candidates, then human-rerank top picks.

Add a human-in-the-loop reranking step instead of relying on LLM ranking alone.

Measure idea duplication: embed candidates and remove near-duplicates to save downstream review time.
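The duplication check in the last step can be sketched as follows. This is a minimal, dependency-free sketch that assumes the candidate ideas have already been embedded as plain lists of floats, for example with the Sentence-Transformers all-MiniLM-L6-v2 model listed under Tool Use. The 0.8 cosine threshold comes from this summary; the greedy keep-first strategy and the function names `deduplicate` and `duplicate_rate` are assumptions for illustration and may differ from the authors' released code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def deduplicate(embeddings, threshold=0.8):
    """Greedy near-duplicate removal (assumed strategy).

    Keeps a candidate only if its similarity to every already-kept
    candidate is below `threshold`. Returns kept indices in input order.
    """
    kept = []
    for i, emb in enumerate(embeddings):
        if all(cosine(emb, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return kept


def duplicate_rate(embeddings, threshold=0.8):
    """Fraction of candidates discarded as near-duplicates."""
    if not embeddings:
        return 0.0
    return 1.0 - len(deduplicate(embeddings, threshold)) / len(embeddings)


if __name__ == "__main__":
    # Toy 2-D vectors standing in for real sentence embeddings.
    toy = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.1, 1.0]]
    print(deduplicate(toy), duplicate_rate(toy))
```

The greedy loop makes a number of comparisons that grows quadratically with the candidate count, which is workable for a pilot of a few hundred candidates. Tracking `duplicate_rate` as the pool grows shows when further generation stops producing new ideas.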

Agent Features

Memory
No persistent long-term memory; prompts include generated titles to avoid repeats
Planning
Overgenerate many seed ideas (4,000 per topic); deduplicate, then expand top seeds into full proposals
Tool Use
Semantic Scholar API for paper retrieval; Sentence-Transformers all-MiniLM-L6-v2 for deduplication
Frameworks
RAG + inference-time scaling (overgeneration + rerank)
Is Agentic

Yes

Architectures
Retrieval-Augmented Generation (RAG); pairwise LLM ranking (Swiss-system tournament)
Collaboration
Human reranking of top AI ideas (author manually selected top ideas); style anonymizer to normalize writing style
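The pairwise ranking listed under Architectures uses a Swiss-system tournament. The sketch below shows one simple way such a tournament could work: in each round, candidates with similar accumulated scores are paired, and the winner of each comparison gains a point. The `better` callback is a hypothetical stand-in for the LLM judge, and the round count, tie-breaking, and handling of odd pool sizes are all assumptions rather than details confirmed by this summary.

```python
import random


def swiss_rank(items, better, rounds=5, seed=0):
    """Rank items with a simplified Swiss-style tournament.

    better(a, b) returns True if a should beat b (hypothetical judge,
    for example a wrapper around a pairwise LLM comparison).
    Returns a list of (item, score) pairs sorted by descending score.
    """
    rng = random.Random(seed)
    scores = {i: 0 for i in range(len(items))}
    for _ in range(rounds):
        # Sort by current score, breaking ties randomly, then pair neighbours
        # so that each item meets an opponent with a similar record.
        order = sorted(scores, key=lambda i: (-scores[i], rng.random()))
        for a, b in zip(order[0::2], order[1::2]):
            winner = a if better(items[a], items[b]) else b
            scores[winner] += 1
        # With an odd number of items, the last one sits out this round.
    ranked = sorted(scores, key=lambda i: -scores[i])
    return [(items[i], scores[i]) for i in ranked]


if __name__ == "__main__":
    # Toy judge: larger numbers always win, standing in for an LLM comparison.
    print(swiss_rank(list(range(8)), lambda a, b: a > b, rounds=3))
```

Unlike a strict Swiss system, this sketch does not prevent rematches. Ranking by accumulated wins rather than by raw per-idea scores matches the stated aim of reducing reliance on raw scores.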

Optimization Features

Token Efficiency
Generate short seed ideas to explore more candidates under token limits
Inference Optimization
Inference-time scaling by massive sampling (4,000 seeds); pairwise ranking tournament to reduce reliance on raw scores

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Study focuses only on prompting-based NLP topics; results may not generalize to other fields.

Human idea writers often provided median-level ideas under time limits, so the human baseline may not represent each expert's best work.

When Not To Use

When you need fully implemented, verified experiments rather than just ideas.

When idea diversity is critical and a narrow set of approaches would be harmful.

Failure Modes

High duplicate rate when scaling generation, wasting compute and reviewer time.

Vague or under-specified implementation details in AI ideas leading to low execution success.

Core Entities

Models

Claude-3.5-Sonnet, Claude-3-Opus, GPT-4o, Claude-3.5 pairwise ranker

Metrics

Novelty (1–10), Excitement (1–10), Feasibility (1–10), Expected effectiveness (1–10), Overall score (1–10), Accuracy

Datasets

ICLR 2024 scraped submissions (1.2K LLM-related) used for ranker proxy; human-collected expert ideas (N=49 per condition); reviewer scores dataset (N=298 reviews)

Benchmarks

Balanced top/bottom idea ranking (used to evaluate LLM evaluators)

Context Entities

Models

GPT-3.5 (comparative mentions); LLaMA variants mentioned in examples

Metrics

Accuracy; idea deduplication cosine similarity threshold (0.8)

Datasets

FLORES-200, GSM8K, HumanEval (mentioned in example proposals)

Benchmarks

Reviewer agreement baselines: NeurIPS'21 (66%), ICLR'24 (71.9%)