Overview
Production Readiness
0.5
Novelty Score
0.6
Cost Impact Score
0.4
Citation Count
1
Why It Matters For Business
Feeding models only the relevant paper sections speeds up extraction, reduces hallucinations, and improves accuracy for leaderboard curation, which in turn lowers manual review and infrastructure costs.
Summary TLDR
The paper studies how the choice of paper sections fed to an LLM affects automatic extraction of (Task, Dataset, Metric, Score) tuples for leaderboards. The authors build an 8k-paper corpus from community annotations, finetune 7B LLMs (Mistral and Llama-2) with FLAN-style instructions using QLoRA, and compare three context choices: DocTAET (title+abstract+experiments+tables), DocREC (results+experiments+conclusion), and DocFULL (full paper). Key practical findings: DocTAET is best for detecting whether a paper has a leaderboard and for structured-summary metrics; DocREC gives the best fine-grained element extraction; DocFULL performs poorly and increases hallucinations. Results are measured with ROUGE, F1, precision, and accuracy.
Problem Statement
Automatic creation of AI leaderboards means extracting (Task, Dataset, Metric, Score) tuples from papers. Existing NLI approaches require fixed taxonomies and struggle to adapt. This paper asks: which parts of a paper (context) should an instruction‑finetuned LLM see to maximize accuracy and reduce hallucination when generating leaderboards?
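To make the extraction target concrete, here is a minimal sketch of one way to represent a (Task, Dataset, Metric, Score) tuple and round-trip it through JSON. The class name `LeaderboardTuple`, the helper names, and the JSON layout are assumptions for illustration; the summary does not specify the paper's output format.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class LeaderboardTuple:
    """One leaderboard entry (hypothetical container, not from the paper)."""
    task: str
    dataset: str
    metric: str
    score: str  # kept as text because reported scores may carry units or "%"


def serialize(tuples):
    """Turn a list of tuples into a JSON string a model could be trained to emit."""
    return json.dumps([asdict(t) for t in tuples])


def deserialize(text):
    """Parse the JSON string back into LeaderboardTuple objects."""
    return [LeaderboardTuple(**item) for item in json.loads(text)]


# Placeholder values, not results from any paper.
entry = LeaderboardTuple("Example Task", "Example Dataset", "Accuracy", "0.0")
restored = deserialize(serialize([entry]))
```

Keeping the score as a string avoids silently reformatting the number the paper reported.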
Main Contribution
A new corpus of ~8k papers with (Task, Dataset, Metric, Score) annotations, reconstructed from community-curated Papers with Code (PwC) exports and arXiv sources.
A controlled comparison of three context-selection strategies: DocTAET (targeted sections), DocREC (results/experiments/conclusion), and DocFULL (entire paper).
Empirical finetuning of two 7B LLMs (Mistral-7B, Llama-2 7B) with FLAN instructions using QLoRA and a comprehensive evaluation (ROUGE, F1, precision, accuracy).
Key Findings
Targeted short context (DocTAET) yields the best paper-level leaderboard detection and structured-summary scores.
A results-focused context (DocREC) gives the best fine-grained extraction of individual elements (task, dataset, metric, score).
Using the full paper as context (DocFULL) dramatically lowers extraction quality and increases errors.
Results
Structured summary quality (ROUGE-1, few-shot)
Accuracy
Element extraction overall F1 (partial match, few-shot)
Element extraction Precision (partial match, few-shot)
Full paper context extraction (Overall F1, few-shot)
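The partial-match precision and F1 named above could be computed roughly as follows. This is a sketch under stated assumptions: the summary does not say how the paper defines a partial match, so the `difflib` similarity ratio and the 0.8 threshold are placeholders, and the function names are hypothetical.

```python
from difflib import SequenceMatcher


def is_partial_match(pred, gold, threshold=0.8):
    """Assumed criterion: case-insensitive string similarity at or above threshold."""
    ratio = SequenceMatcher(None, pred.lower().strip(), gold.lower().strip()).ratio()
    return ratio >= threshold


def precision_recall_f1(predicted, gold, threshold=0.8):
    """Greedy one-to-one matching of predicted elements against gold elements."""
    unmatched_gold = list(gold)
    true_positives = 0
    for pred in predicted:
        for i, candidate in enumerate(unmatched_gold):
            if is_partial_match(pred, candidate, threshold):
                true_positives += 1
                del unmatched_gold[i]  # each gold item can be matched only once
                break
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

An exact-match variant falls out of the same function by setting `threshold=1.0`.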
Who Should Care
What To Try In 7 Days
Reproduce one pipeline: extract DocTAET sections, finetune a 7B model with QLoRA on a small sample, measure accuracy vs. full-text baseline.
For high-precision item extraction, run experiments feeding only DocREC sections and compare F1/Precision.
Add 'no-leaderboard' training examples (unanswerable) so the model returns 'unanswerable' instead of inventing tuples.
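The 'no-leaderboard' idea above can be sketched as a small helper that builds prompt/target pairs, emitting the literal target 'unanswerable' when a paper has no tuples. The prompt wording, the `make_example` helper, and the tuple text format are assumptions for illustration; they are not the FLAN templates used in the paper.

```python
UNANSWERABLE = "unanswerable"

# Hypothetical instruction template, written for this sketch only.
PROMPT_TEMPLATE = (
    "Extract all (Task, Dataset, Metric, Score) tuples reported in the text below. "
    "If the text reports no leaderboard results, answer '{sentinel}'.\n\n"
    "Text: {context}\n\nAnswer:"
)


def make_example(context, tuples):
    """Build one training example; `tuples` is a list of 4-item tuples of strings."""
    prompt = PROMPT_TEMPLATE.format(sentinel=UNANSWERABLE, context=context)
    if tuples:
        target = "; ".join(f"({t}, {d}, {m}, {s})" for t, d, m, s in tuples)
    else:
        # Papers without a leaderboard get the sentinel instead of an empty string.
        target = UNANSWERABLE
    return {"prompt": prompt, "target": target}


negative = make_example("This survey reviews prior work.", [])
positive = make_example(
    "Placeholder results text.",
    [("Example Task", "Example Dataset", "Accuracy", "0.0")],
)
```

Mixing examples like `negative` into the training set gives the model an explicit alternative to inventing tuples.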
Agent Features
Tool Use
- LoRA
Frameworks
- FLAN instruction tuning
Architectures
- instruction-finetuned LLM
Optimization Features
Token Efficiency
- Shorter targeted context (DocTAET/DocREC) to reduce input length and distractors
Model Optimization
- 7B model selection for efficiency
System Optimization
- Batch size and gradient accumulation tuned to GPU limits
Training Optimization
- Instruction finetuning with FLAN templates
- QLoRA (LoRA adapters on a quantized base model)
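The batch size and gradient accumulation trade-off listed above comes down to simple arithmetic: a small per-step batch that fits in GPU memory is accumulated over several steps to reach a larger effective batch. The sketch below uses illustrative numbers; the summary does not report the values used in the paper.

```python
import math


def accumulation_steps(target_batch_size, micro_batch_size):
    """Steps to accumulate so micro-batches add up to at least the target batch."""
    return math.ceil(target_batch_size / micro_batch_size)


def effective_batch_size(micro_batch_size, steps, num_gpus=1):
    """Number of examples contributing to each optimizer update."""
    return micro_batch_size * steps * num_gpus


# Illustrative values only: a micro-batch of 2 that fits on one GPU,
# accumulated to reach a target batch of 32.
steps = accumulation_steps(target_batch_size=32, micro_batch_size=2)
batch = effective_batch_size(micro_batch_size=2, steps=steps)
```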
Reproducibility
License
- Data derived from PwC (CC BY-SA); Mistral-7B is reported as Apache-2.0
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Corpus is a snapshot rebuilt from PwC annotations (downloaded Dec 09, 2023) and may miss recent papers.
- Experiments use only 7B variants (Mistral and Llama-2); results may change for larger models.
- Element extraction F1/precision remain modest, especially for numeric score extraction.
- All experiments run on a single GPU (NVIDIA 3090); scaling costs for larger deployments are not measured.
When Not To Use
- When you need near-perfect, audited numeric extraction for downstream decisions without human review.
- If you cannot reconstruct or reliably extract the targeted sections (DocTAET/DocREC) from paper sources.
Failure Modes
- Hallucinated tuples when context is too long or unfocused (DocFULL).
- Missed numeric scores or wrong model-to-metric pairing.
- Low recall on datasets and metrics even when task extraction is correct.
Core Entities
Models
- Mistral-7B
- Llama-2 7B
- FLAN-T5 (instruction collection)
- LoRA
Metrics
- ROUGE-1
- ROUGE-2
- ROUGE-L
- F1
- Precision
- Accuracy
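For readers unfamiliar with ROUGE-1, it scores unigram overlap between a generated summary and a reference. The sketch below is a simplified version that assumes lowercase whitespace tokenization and no stemming; standard ROUGE packages apply their own tokenization, so their numbers can differ.

```python
from collections import Counter


def rouge1_f1(candidate, reference):
    """Simplified ROUGE-1 F-measure over whitespace-separated, lowercased tokens."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clipped overlap: each token counts at most as often as it appears in both.
    overlap = sum((cand_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


score = rouge1_f1("the cat", "the cat sat")
```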
Datasets
- DocREC/DocTAET/DocFULL corpus (derived from PwC annotations, snapshot Dec 09 2023)
- SQuAD v2 (instruction templates)
- DROP (instruction templates)
Benchmarks
- ROUGE-1
- ROUGE-2
- ROUGE-L
- ROUGE-Lsum
- F1
- Precision
- Accuracy
Context Entities
Datasets
- DocTAET (title, abstract, experiments, tables)
- DocREC (results, experiments, conclusion)
- DocFULL (full paper)
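The three context definitions above can be sketched as a selection over a paper that has already been split into named sections. The dictionary layout, the section keys, and the `build_context` helper are assumptions for illustration; the summary does not describe how the paper parses sections from arXiv sources.

```python
# Section lists follow the definitions of DocTAET and DocREC given above.
CONTEXT_SECTIONS = {
    "DocTAET": ["title", "abstract", "experiments", "tables"],
    "DocREC": ["results", "experiments", "conclusion"],
}


def build_context(paper_sections, strategy):
    """Join the sections selected by `strategy` into one context string.

    `paper_sections` is a hypothetical dict mapping section name to text,
    in document order. Sections missing from the paper are skipped.
    """
    if strategy == "DocFULL":
        names = list(paper_sections)  # every section, in document order
    else:
        names = CONTEXT_SECTIONS[strategy]
    parts = [paper_sections[name] for name in names if name in paper_sections]
    return "\n\n".join(parts)


paper = {
    "title": "T", "abstract": "A", "introduction": "I",
    "experiments": "E", "results": "R", "conclusion": "C", "tables": "Tab",
}
taet_context = build_context(paper, "DocTAET")
```

Comparing `len(build_context(paper, s))` across the three strategies gives a quick view of how much input each one sends to the model.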

