Overview
The paper reports thorough empirical results on roughly 8k papers and compares three clearly defined context strategies, but extraction precision remains too low for production-grade leaderboard population.
Citations: 1
Evidence Strength: 0.75
Confidence: 0.86
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 3/3
Findings with evidence refs: 3/3
Results with explicit delta: 0/5
Reproducibility
Status: Code + data available
Open source: Partial
License: Data derived from PwC (CC BY-SA); Mistral reported as Apache-2.0
At A Glance
Cost impact: 40%
Production readiness: 50%
Novelty: 60%
Why It Matters For Business
Feeding models only the relevant paper sections speeds up extraction, reduces hallucinations, and improves accuracy for leaderboard curation, which lowers manual review and infrastructure costs.
Summary TLDR
The paper studies how the choice of paper sections fed to an LLM affects automatic extraction of (Task, Dataset, Metric, Score) tuples for leaderboards. The authors build an 8k-paper corpus from community annotations, finetune 7B LLMs (Mistral and Llama-2) with FLAN-style instructions using QLoRA, and compare three context choices: DocTAET (title+abstract+experiments+tables), DocREC (results+experiments+conclusion), and DocFULL (full paper). Key practical findings: DocTAET is best for detecting whether a paper has a leaderboard and for structured-summary metrics; DocREC gives the best fine-grained element extraction; DocFULL performs poorly and increases hallucinations. Results are measured by
Problem Statement
Automatic creation of AI leaderboards requires extracting (Task, Dataset, Metric, Score) tuples from papers. Existing natural language inference (NLI) approaches require fixed taxonomies and struggle to adapt to new tasks, datasets, and metrics. This paper asks: which parts of a paper (the context) should an instruction-finetuned LLM see to maximize accuracy and reduce hallucination when generating leaderboards?
Main Contribution
A new corpus of ~8k papers with (Task, Dataset, Metric, Score) annotations reconstructed from community (PwC) exports and arXiv sources.
A controlled comparison of three context-selection strategies: DocTAET (targeted sections), DocREC (results/experiments/conclusion), and DocFULL (entire paper).
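The three context strategies can be illustrated with a minimal sketch. It assumes a paper has already been split into named sections; the keyword lists, the `build_context` helper, and the toy `paper` are hypothetical and are not the paper's actual selection rules (in particular, treating tables as a section name is a simplification).

```python
# Hypothetical heading keywords per strategy (an assumption, not the paper's rules).
STRATEGIES = {
    "DocTAET": ("title", "abstract", "experiment", "table"),
    "DocREC": ("result", "experiment", "conclusion"),
}


def build_context(sections, strategy):
    """Return the text an LLM would see under the given context strategy.

    `sections` maps section headings to their text, in document order.
    """
    if strategy == "DocFULL":
        # Full paper: keep every section.
        selected = list(sections.items())
    else:
        keywords = STRATEGIES[strategy]
        # Keep a section if its heading contains any of the strategy's keywords.
        selected = [
            (name, text)
            for name, text in sections.items()
            if any(k in name.lower() for k in keywords)
        ]
    return "\n\n".join(f"{name}\n{text}" for name, text in selected)


# Toy paper used only for illustration.
paper = {
    "Title": "A Toy Paper",
    "Abstract": "We study X.",
    "Introduction": "Background on X.",
    "Experiments": "We train on dataset D.",
    "Results": "Our model reaches 91.2 F1.",
    "Conclusion": "X works.",
}

for name in ("DocTAET", "DocREC", "DocFULL"):
    print(name, len(build_context(paper, name)), "characters")
```

Under these assumptions the targeted strategies produce strictly shorter inputs than DocFULL, which is the property the comparison depends on.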
Key Findings
Targeted short context (DocTAET) yields the best paper-level leaderboard detection and structured-summary scores.
A results-focused context (DocREC) gives the best fine-grained extraction of individual elements (task, dataset, metric, score).
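Fine-grained element extraction can be scored per tuple element. The sketch below uses set-based exact matching after lowercasing; this matching scheme, the `element_prf` helper, and the toy tuples are assumptions for illustration, since this summary does not state how the paper matches elements.

```python
def element_prf(gold, predicted, index):
    """Exact-match precision, recall, and F1 for one tuple element.

    `gold` and `predicted` are lists of (Task, Dataset, Metric, Score) tuples.
    `index` selects the element: 0=Task, 1=Dataset, 2=Metric, 3=Score.
    """
    gold_vals = {t[index].strip().lower() for t in gold}
    pred_vals = {t[index].strip().lower() for t in predicted}
    true_pos = len(gold_vals & pred_vals)
    precision = true_pos / len(pred_vals) if pred_vals else 0.0
    recall = true_pos / len(gold_vals) if gold_vals else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy annotations: the second prediction names the wrong metric.
gold = [
    ("Question Answering", "SQuAD", "F1", "91.2"),
    ("Question Answering", "SQuAD", "EM", "84.0"),
]
pred = [
    ("Question Answering", "SQuAD", "F1", "91.2"),
    ("Question Answering", "SQuAD", "Accuracy", "84.0"),
]

for i, label in enumerate(("Task", "Dataset", "Metric", "Score")):
    print(label, element_prf(gold, pred, i))
```

Scoring each element separately makes it visible when a context strategy gets tasks and datasets right but mislabels metrics or scores.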
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Structured summary quality (ROUGE-1, few-shot) | Mistral-7B DocTAET: 57.24 | — | — | Test few-shot (DocTAET) | Table 2: Mistral-7B ROUGE-1 few-shot = 57.24 | Table 2 |
| General Accuracy | Mistral-7B DocTAET: ~89% (few-shot), ~95% (zero-shot) | — | — | Test few-shot / zero-shot | Table 2: General Accuracy values reported for Mistral-7B DocTAET | Section 4.1, Table 2 |
What To Try In 7 Days
Reproduce one pipeline: extract DocTAET sections, finetune a 7B model with QLoRA on a small sample, measure accuracy vs. full-text baseline.
For high-precision item extraction, run experiments feeding only DocREC sections and compare F1/Precision.
Add 'no-leaderboard' training examples (unanswerable) so the model returns 'unanswerable' instead of inventing tuples.
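The 'unanswerable' suggestion above can be sketched as a small data-preparation step. The prompt wording, the `make_example` helper, and the JSON output format are hypothetical; the paper's FLAN-style instruction templates are not reproduced here.

```python
import json

# Hypothetical instruction template (an assumption, not the paper's prompt).
PROMPT = (
    "Extract all (Task, Dataset, Metric, Score) tuples from the paper text. "
    "If the paper reports no leaderboard results, answer 'unanswerable'.\n\n"
    "Paper:\n{context}\n\nAnswer:"
)

FIELDS = ("Task", "Dataset", "Metric", "Score")


def make_example(context, tuples):
    """Build one instruction-tuning example from a context and its gold tuples."""
    if tuples:
        target = json.dumps([dict(zip(FIELDS, t)) for t in tuples])
    else:
        # Papers without a leaderboard get an explicit refusal target.
        target = "unanswerable"
    return {"prompt": PROMPT.format(context=context), "completion": target}


positive = make_example(
    "Our model reaches 91.2 F1 on SQuAD.",
    [("Question Answering", "SQuAD", "F1", "91.2")],
)
negative = make_example("This survey reviews prior work.", [])
print(positive["completion"])
print(negative["completion"])
```

Mixing both kinds of examples in the training set gives the model an explicit alternative to inventing tuples when the context contains no results.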
Risks & Boundaries
Limitations
Corpus is a snapshot rebuilt from PwC annotations (downloaded Dec 09, 2023) and may miss recent papers.
Experiments use only 7B variants (Mistral and Llama-2); results may change for larger models.
When Not To Use
When you need near-perfect, audited numeric extraction for downstream decisions without human review.
If you cannot reconstruct or reliably extract the targeted sections (DocTAET/DocREC) from paper sources.
Failure Modes
Hallucinated tuples when context is too long or unfocused (DocFULL).
Missed numeric scores or wrong model-to-metric pairing.
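One inexpensive guard against hallucinated tuples is to check that each extracted score literally appears in the context the model saw. The sketch below is an assumption-laden illustration (the `ungrounded_scores` helper and the regular expression are hypothetical); it catches invented numbers but not a real number paired with the wrong model or metric.

```python
import re


def ungrounded_scores(context, tuples):
    """Return the tuples whose Score does not appear as a number in `context`."""
    # Collect every integer or decimal number that occurs in the context.
    numbers = set(re.findall(r"\d+(?:\.\d+)?", context))
    return [t for t in tuples if t[3].strip().rstrip("%") not in numbers]


context = "Our model reaches 91.2 F1 and 84.0 EM on SQuAD."
extracted = [
    ("Question Answering", "SQuAD", "F1", "91.2"),
    ("Question Answering", "SQuAD", "EM", "84.0%"),
    ("Question Answering", "SQuAD", "Accuracy", "95.5"),
]

flagged = ungrounded_scores(context, extracted)
print(flagged)
```

Flagged tuples can be routed to human review instead of being written to the leaderboard directly.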

