Large blinded study: LLM-generated ideas judged more novel than experts' ideas but slightly less feasible

September 6, 2024 · 9 min

Overview

Production Readiness

0.4

Novelty Score

0.7

Cost Impact Score

0.4

Citation Count

41

Authors

Chenglei Si, Diyi Yang, Tatsunori Hashimoto

Links

Abstract / PDF

Why It Matters For Business

LLM ideation can quickly surface novel research directions, but outputs need human vetting for feasibility and implementation; blindly trusting LLM ideas or LLM-only evaluation risks wasted effort.

Summary TLDR

The authors ran a controlled, blind human study comparing research ideas from a retrieval-augmented LLM ideation agent with ideas written by 49 NLP experts; 79 expert reviewers scored both sets. Across 298 reviews, AI-generated ideas scored higher on novelty (statistically significant, p<0.05) while showing a small, non-significant drop in feasibility. The paper also shows practical limits: LLMs repeat many candidate ideas despite massive overgeneration, and current LLM-based evaluators fall short of human reviewers. The authors release agent code and review scores.

Problem Statement

Can current LLMs produce novel, expert-level research ideas comparable to human researchers? The paper measures ideation quality with a large, controlled blind study and inspects both agent design and evaluation limits.

Main Contribution

A large, controlled blind evaluation comparing 49 human-written ideas with AI-generated ideas, scored by 79 expert reviewers (N=298 reviews).

A practical LLM ideation pipeline (RAG retrieval, massive overgeneration, deduplication, pairwise LLM ranking) and public release of code and review scores.

A clear empirical finding: AI ideas were rated significantly more novel than human ideas while being slightly lower on feasibility.

Detailed analysis of method limits: lack of idea diversity at scale and unreliable LLM-as-judge performance.

Key Findings

AI-generated ideas were rated more novel than ideas written by human experts.

Numbers: mean novelty, Human 4.84 vs AI 5.64 (1–10 scale); p<0.01 (Test 1)

Feasibility of AI ideas trended lower, but the difference was not statistically significant in this study.

Numbers: mean feasibility, Human 6.61 vs AI 6.34 (1–10 scale); difference not statistically significant
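One way to check whether a gap between two sets of reviewer scores is distinguishable from noise is a permutation test on the difference in means. The sketch below is an illustration only: this summary does not say which test the authors ran, the function name `permutation_p_value` is hypothetical, and the scores are made-up toy values, not data from the study.

```python
import random
from statistics import mean

def permutation_p_value(group_a, group_b, n_permutations=2000, seed=0):
    """Two-sided permutation test on the absolute difference in means."""
    rng = random.Random(seed)  # seeded so repeated runs give the same p-value
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_permutations):
        # Shuffle the pooled scores and re-split them into two groups.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed - 1e-12:  # small tolerance for float rounding
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_permutations + 1)

# Toy scores (invented for illustration, NOT the study's review data).
p_separated = permutation_p_value([1, 1, 1, 1, 1], [9, 9, 9, 9, 9])
p_identical = permutation_p_value([5, 6, 7], [5, 6, 7])
print(p_separated, p_identical)
```

A small p-value means a random relabeling of the scores rarely produces a gap as large as the observed one; identical groups give a p-value of 1.0 because every shuffle matches the observed gap of zero.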

LLM evaluators underperform human reviewers as judges of idea quality.

Numbers: Best LLM evaluator (Claude-3.5 pairwise) 53.3% vs human inter-reviewer 56.1% (balanced accuracy)

Massive overgeneration yields few unique ideas—LLMs lack collective diversity.

Numbers: 4,000 generated seeds per topic → ~200 non-duplicate ideas (~5%)

Results

Novelty (mean score)

Value: Human 4.84; AI 5.64; AI+HumanRerank 5.81 (1–10)

Baseline: Human ideas

Feasibility (mean score)

Value: Human 6.61; AI 6.34; AI+HumanRerank 6.44 (1–10)

Baseline: Human ideas

Accuracy

Value: Claude-3.5 pairwise 53.3%

Baseline: Human inter-reviewer consistency 56.1%
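To make the accuracy figures concrete, the sketch below scores a judge against human-derived labels on a balanced top/bottom set. It is a simplified illustration under assumptions: the function name `agreement_accuracy`, the label strings, and the toy labels are all hypothetical, and the paper's exact evaluation protocol may differ.

```python
def agreement_accuracy(human_labels, judge_labels):
    """Fraction of items where the judge's label matches the human-derived label."""
    if not human_labels or len(human_labels) != len(judge_labels):
        raise ValueError("label lists must be non-empty and the same length")
    matches = sum(h == j for h, j in zip(human_labels, judge_labels))
    return matches / len(human_labels)

# Toy balanced set: three "top" and three "bottom" ideas (invented labels).
human = ["top", "top", "bottom", "bottom", "top", "bottom"]
judge = ["top", "bottom", "bottom", "top", "top", "bottom"]

accuracy = agreement_accuracy(human, judge)
print(f"{accuracy:.1%}")
```

On a balanced two-class set, random guessing is expected to score about 50%, which is why 53.3% for the best LLM evaluator and 56.1% for human inter-reviewer consistency both sit close to chance.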

Idea deduplication

Value: ≈4,000 seeds → ≈200 non-duplicate ideas (~5%)

Who Should Care

What To Try In 7 Days

Run an ideation pilot: use RAG + overgeneration but limit to 200–500 candidates, then human-rerank top picks.

Add a human-in-the-loop reranking step instead of relying on LLM ranking alone.

Measure idea duplication: embed candidates and remove near-duplicates to save downstream review time.
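The duplication check can be sketched as a greedy filter over idea embeddings. This is a minimal sketch, not the authors' code: it assumes embeddings are already computed (the study lists Sentence-Transformers all-MiniLM-L6-v2 for this), reuses the 0.8 cosine-similarity threshold listed under Context Entities, and runs on made-up 2-D toy vectors so it needs no dependencies. The function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def deduplicate(embeddings, threshold=0.8):
    """Greedy pass: keep an item only if its similarity to every
    already-kept item is below the threshold. Returns kept indices."""
    kept = []
    for i, vec in enumerate(embeddings):
        if all(cosine(vec, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return kept

# Toy 2-D vectors standing in for real sentence embeddings (invented values).
toy = [
    [1.0, 0.0],
    [0.99, 0.05],  # near-duplicate of item 0, so it is dropped
    [0.0, 1.0],
    [0.7, 0.7],    # similarity ~0.71 to items 0 and 2, so it is kept
]

kept = deduplicate(toy)
duplicate_rate = 1 - len(kept) / len(toy)
print(kept, duplicate_rate)
```

Tracking `duplicate_rate` as the candidate count grows gives a quick signal for when further generation stops paying off. The greedy pass compares each candidate against all kept items, which is quadratic in the worst case but manageable for a few thousand short seed ideas.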

Agent Features

Memory

  • No persistent long-term memory; prompts include generated titles to avoid repeats

Planning

  • Overgenerate many seed ideas (4,000 per topic)
  • Deduplicate then expand top seeds into full proposals

Tool Use

  • Semantic Scholar API for paper retrieval
  • Sentence-Transformers all-MiniLM-L6-v2 for deduplication

Frameworks

  • RAG + inference-time scaling (overgeneration + rerank)

Is Agentic

true

Architectures

  • Retrieval-Augmented Generation (RAG)
  • Pairwise LLM ranking (Swiss-system tournament)
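A Swiss-system tournament pairs candidates with similar running scores in each round and awards a point to the winner of each comparison. The sketch below assumes a hypothetical `prefer(a, b)` callback standing in for the LLM pairwise judge; the round count, the neighbour-pairing rule, and the handling of an odd candidate are illustrative choices, not details taken from the paper.

```python
def swiss_rank(items, prefer, rounds=5):
    """Rank items with a simple Swiss-system tournament.

    prefer(a, b) must return True when a is judged better than b.
    Returns (item, score) tuples, highest score first.
    """
    scores = {i: 0 for i in range(len(items))}
    for _ in range(rounds):
        # Sort by accumulated score so neighbours have similar records.
        order = sorted(scores, key=lambda i: -scores[i])
        for k in range(0, len(order) - 1, 2):
            a, b = order[k], order[k + 1]
            winner = a if prefer(items[a], items[b]) else b
            scores[winner] += 1
        # With an odd number of items, the last one sits the round out.
    ranked = sorted(scores, key=lambda i: -scores[i])
    return [(items[i], scores[i]) for i in ranked]

# Toy run: integers stand in for ideas, and "larger is better" stands in
# for the pairwise judge (a deterministic placeholder, not an LLM call).
ranked = swiss_rank([3, 1, 4, 2], lambda a, b: a > b, rounds=3)
print(ranked)
```

Each round costs roughly half as many comparisons as there are candidates, far fewer than a full round-robin. The trade-off is that with few rounds the top and bottom are separated reliably while the ordering in the middle stays approximate.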

Collaboration

  • Human reranking of top AI ideas (author manually selected top ideas)
  • Style anonymizer to normalize writing style

Optimization Features

Token Efficiency

  • Generate short seed ideas to explore more candidates under token limits

Inference Optimization

  • Inference-time scaling by massive sampling (4,000 seeds)
  • Pairwise ranking tournament to reduce reliance on raw scores

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Study focuses only on prompting-based NLP topics; results may not generalize to other fields.
  • Human idea writers often provided median-level ideas under time limits, so the human baseline may not represent each expert's best work.
  • LLM ideation shows low collective diversity: many generated seeds are near-duplicates.
  • LLM-based evaluators remain unreliable compared to human reviewers and can introduce bias.
  • Agent uses proprietary LLMs for ranking/generation (Claude variants), which affects reproducibility and cost.

When Not To Use

  • When you need fully implemented, verified experiments rather than just ideas.
  • When idea diversity is critical and a narrow set of approaches would be harmful.
  • When you must rely on automated LLM evaluation without human oversight.

Failure Modes

  • High duplicate rate when scaling generation, wasting compute and reviewer time.
  • Vague or under-specified implementation details in AI ideas leading to low execution success.
  • LLM rankers promoting spurious but superficially attractive ideas.
  • Overreliance on AI evaluators that trade variance for systematic bias.

Core Entities

Models

  • Claude-3.5-Sonnet
  • Claude-3-Opus
  • GPT-4o
  • Claude-3.5 pairwise ranker

Metrics

  • Novelty (1–10)
  • Excitement (1–10)
  • Feasibility (1–10)
  • Expected effectiveness (1–10)
  • Overall score (1–10)
  • Accuracy

Datasets

  • ICLR 2024 scraped submissions (1.2K LLM-related) used for ranker proxy
  • Human-collected expert ideas (N=49 per condition)
  • Reviewer scores dataset (N=298 reviews)

Benchmarks

  • Balanced top/bottom idea ranking (used to evaluate LLM evaluators)

Context Entities

Models

  • GPT-3.5 (comparative mentions)
  • LLaMA variants mentioned in examples

Metrics

  • Accuracy
  • Idea deduplication cosine similarity threshold (0.8)

Datasets

  • FLORES-200, GSM8K, HumanEval (mentioned in example proposals)

Benchmarks

  • Reviewer agreement baselines: NeurIPS'21 (66%), ICLR'24 (71.9%)