Large blinded study: LLM-generated ideas judged more novel than expert-written ideas but slightly less feasible

Overview

Decision Snapshot: Needs Validation

Evidence comes from a controlled human study showing consistent novelty gains for AI ideas, but results are limited to prompting-based NLP topics, depend on expert review, and are constrained by weak LLM-based evaluators and limited idea diversity.

Citations: 41

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 12

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 3/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 40%

Production readiness: 40%

Novelty: 70%

Authors

Chenglei Si, Diyi Yang, Tatsunori Hashimoto

Links

Abstract / PDF / Code / Data

Why It Matters For Business

LLM ideation can quickly surface novel research directions, but outputs need human vetting for feasibility and implementation; blindly trusting LLM ideas or LLM-only evaluation risks wasted effort.

Who Should Care

Product Manager, ML Engineer, CTO, Engineering Lead, Founder

Summary TLDR

The authors ran a controlled, blind human study comparing research ideas from a retrieval-augmented LLM ideation agent with ideas written by 49 NLP experts; 79 expert reviewers scored the ideas. Across 298 reviews, AI-generated ideas scored higher on novelty (statistically significant, p<0.05) while showing a small, non-significant drop in feasibility. The paper also shows practical limits: LLMs repeat many candidate ideas despite massive overgeneration, and current LLM-based evaluators fall short of human reviewers. The authors release agent code and review scores.

Problem Statement

Can current LLMs produce novel, expert-level research ideas comparable to human researchers? The paper measures ideation quality with a large, controlled blind study and inspects both agent design and evaluation limits.

Main Contribution

A large, controlled blind evaluation comparing 49 human-written ideas with AI-generated ideas, all scored by 79 expert reviewers (N=298 reviews).

A practical LLM ideation pipeline (RAG retrieval, massive overgeneration, deduplication, pairwise LLM ranking) and public release of code and review scores.

Key Findings

AI-generated ideas were rated more novel than ideas written by human experts.

Numbers: Novelty scores of Human 4.84 vs AI 5.64 (1–10 scale); p<0.01 (Test 1)

Practical Use: Use LLM ideation to surface fresh directions quickly, but validate feasibility before committing resources.

Evidence Ref: Table 7; Test 1 novelty scores

Feasibility of AI ideas trended lower, but the difference was not statistically significant in this study.

Numbers: Feasibility scores of Human 6.61 vs AI 6.34 (1–10 scale); difference not statistically significant

Practical Use: Expect some extra engineering effort to make AI-suggested ideas executable; don't treat high novelty as proof of easy execution.

Evidence Ref: Table 7; feasibility scores

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Novelty (mean score)	Human 4.84; AI 5.64; AI+HumanRerank 5.81 (1–10)	Human ideas	AI +0.8; AI+Rerank +1.0 vs Human	All reviews aggregated; Test 1	Table 7 Test 1; two-tailed Welch t-tests with Bonferroni	Table 7
Feasibility (mean score)	Human 6.61; AI 6.34; AI+HumanRerank 6.44 (1–10)	Human ideas	AI −0.27 vs Human (not significant)	All reviews aggregated; Test 1	Table 7 feasibility row	Table 7
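
The evidence column cites two-tailed Welch t-tests with Bonferroni correction. As a rough illustration of what that involves, the sketch below computes the Welch statistic, its approximate degrees of freedom, and a Bonferroni adjustment using only the standard library. The function names and the score lists are assumptions made for this example; the toy scores are invented and are not the study's data.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Return (t, df) for Welch's unequal-variance t-test on two samples."""
    na, nb = len(a), len(b)
    va, vb = variance(a) / na, variance(b) / nb  # squared standard errors
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df

def bonferroni(p_values):
    """Multiply each p-value by the number of comparisons, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

# Toy 1-10 scores, invented for illustration only (not the study's data).
ai_scores = [6, 5, 7, 6, 5, 6]
human_scores = [5, 4, 5, 6, 4, 5]
t_stat, dof = welch_t(ai_scores, human_scores)
```

Turning `t_stat` and `dof` into a two-tailed p-value requires the t-distribution CDF, which a statistics library would normally supply; the adjusted p-values from `bonferroni` are then compared against the chosen significance level.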

What To Try In 7 Days

Run an ideation pilot: use RAG + overgeneration but limit to 200–500 candidates, then human-rerank top picks.

Add a human-in-the-loop reranking step instead of relying on LLM ranking alone.

Measure idea duplication: embed candidates and remove near-duplicates to save downstream review time.
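
The duplication check above can be sketched as a greedy cosine-similarity filter. The entity lists on this page mention Sentence-Transformers all-MiniLM-L6-v2 and a 0.8 similarity threshold; the greedy keep-first strategy, the function names, and the toy 2-D vectors below are assumptions for illustration, and real embeddings would come from an embedding model rather than hand-written lists.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def dedupe(embeddings, threshold=0.8):
    """Greedy pass: keep an idea only if its similarity to every
    already-kept idea is below the threshold. Returns kept indices."""
    kept = []
    for i, emb in enumerate(embeddings):
        if all(cosine(emb, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return kept

# Toy 2-D vectors standing in for real sentence embeddings (assumption).
toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
kept = dedupe(toy)
duplicate_rate = 1 - len(kept) / len(toy)
```

Tracking `duplicate_rate` as the candidate pool grows gives an early signal of when further generation mostly produces repeats.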

Agent Features

Memory

No persistent long-term memory; prompts include generated titles to avoid repeats

Planning

Overgenerate many seed ideas (4,000 per topic)

Deduplicate then expand top seeds into full proposals

Tool Use

Semantic Scholar API for paper retrieval

Sentence-Transformers all-MiniLM-L6-v2 for deduplication

Frameworks

RAG + inference-time scaling (overgeneration + rerank)

Is Agentic

Yes

Architectures

Retrieval-Augmented Generation (RAG)

Pairwise LLM ranking (Swiss-system tournament)
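
A Swiss-system tournament pairs candidates with similar running scores each round and awards a point to the winner of each comparison. The sketch below is a minimal illustration of that idea, not the released agent's code: `judge` is a hypothetical callable standing in for an LLM pairwise comparison, the default round count is arbitrary, and ideas are assumed to be unique strings.

```python
def swiss_rank(ideas, judge, rounds=5):
    """Rank ideas with a Swiss-system tournament.

    judge(a, b) is a hypothetical callable that returns True when idea `a`
    is preferred over idea `b` (for example, a wrapped LLM comparison).
    """
    scores = {idea: 0 for idea in ideas}
    for _ in range(rounds):
        # Pair ideas with similar accumulated scores: sort, then take neighbours.
        order = sorted(ideas, key=lambda idea: -scores[idea])
        for a, b in zip(order[0::2], order[1::2]):
            winner = a if judge(a, b) else b
            scores[winner] += 1
    return sorted(ideas, key=lambda idea: -scores[idea])

# Toy judge backed by a fixed quality table (illustrative assumption).
quality = {"a": 1, "b": 4, "c": 2, "d": 3}
ranking = swiss_rank(["a", "b", "c", "d"], lambda x, y: quality[x] > quality[y], rounds=3)
```

With an odd number of ideas, the lowest-ranked idea in a round simply sits out and earns no point. Because only relative wins accumulate, the final ordering does not depend on any absolute score the judge might produce.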

Collaboration

Human reranking of top AI ideas (author manually selected top ideas)

Style anonymizer to normalize writing style

Optimization Features

Token Efficiency

Generate short seed ideas to explore more candidates under token limits

Inference Optimization

Inference-time scaling by massive sampling (4,000 seeds)

Pairwise ranking tournament to reduce reliance on raw scores

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/NoviScl/AI-Researcher

Data URLs

https://github.com/NoviScl/AI-Researcher (human review scores and agent outputs)

Risks & Boundaries

Limitations

Study focuses only on prompting-based NLP topics; results may not generalize to other fields.

Human idea writers often provided median-level ideas under time limits, which may not represent each expert's best work.

When Not To Use

When you need fully implemented, verified experiments rather than just ideas.

When idea diversity is critical and a narrow set of approaches would be harmful.

Failure Modes

High duplicate rate when scaling generation, wasting compute and reviewer time.

Vague or under-specified implementation details in AI ideas leading to low execution success.

Core Entities

Models

Claude-3.5-Sonnet

Claude-3-Opus

GPT-4o

Claude-3.5 pairwise ranker

Metrics

Novelty (1–10)

Excitement (1–10)

Feasibility (1–10)

Expected effectiveness (1–10)

Overall score (1–10)

Accuracy

Datasets

ICLR 2024 scraped submissions (1.2K LLM-related) used for ranker proxy

Human-collected expert ideas (N=49 per condition)

Reviewer scores dataset (N=298 reviews)

Benchmarks

Balanced top/bottom idea ranking (used to evaluate LLM evaluators)

Context Entities

Models

GPT-3.5 (comparative mentions)

LLaMA variants mentioned in examples

Metrics

Accuracy

Idea deduplication cosine similarity threshold (0.8)

Datasets

FLORES-200, GSM8K, HumanEval (mentioned in example proposals)

Benchmarks

Reviewer agreement baselines: NeurIPS'21 (66%), ICLR'24 (71.9%)

