Picking the right paper sections (not the whole paper) improves LLM leaderboard extraction and cuts hallucinations

Overview

Decision Snapshot: Ready For Pilot

The paper gives thorough empirical numbers on 8k papers and compares three clear context strategies, but extraction precision is still low for production-grade leaderboard population.

Citations: 1

Evidence Strength: 0.75

Confidence: 0.86

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 0/5

Reproducibility

Status: Code + data available

Open source: Partial

License: Data derived from PwC (CC BY-SA); Mistral reported Apache-2.0

At A Glance

Cost impact: 40%

Production readiness: 50%

Novelty: 60%

Authors

Salomon Kabongo, Jennifer D'Souza, Sören Auer

Links

Abstract / PDF

Why It Matters For Business

Feeding models only the right paper sections speeds up extraction, reduces hallucinations, and improves accuracy for leaderboard curation, which lowers manual review and infrastructure cost.

Who Should Care

Product Manager, ML Engineer, Data Scientist, Founder

Summary TLDR

The paper studies how the choice of paper sections fed to an LLM affects automatic extraction of (Task, Dataset, Metric, Score) tuples for leaderboards. The authors build an 8k-paper corpus from community annotations, finetune 7B LLMs (Mistral and Llama-2) with FLAN-style instructions using QLoRA, and compare three context choices: DocTAET (title+abstract+experiments+tables), DocREC (results+experiments+conclusion), and DocFULL (full paper). Key practical findings: DocTAET is best for detecting whether a paper has a leaderboard and for structured summary metrics; DocREC gives the best fine-grained element extraction; DocFULL performs poorly and increases hallucinations. Results are measured by

Problem Statement

Automatic creation of AI leaderboards means extracting (Task, Dataset, Metric, Score) tuples from papers. Existing NLI approaches require fixed taxonomies and struggle to adapt. This paper asks: which parts of a paper (context) should an instruction‑finetuned LLM see to maximize accuracy and reduce hallucination when generating leaderboards?

Main Contribution

A new corpus of ~8k papers with (Task, Dataset, Metric, Score) annotations reconstructed from community (PwC) exports and arXiv sources.

A controlled comparison of three context-selection strategies: DocTAET (targeted sections), DocREC (results/experiments/conclusion), and DocFULL (entire paper).
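The three context strategies can be illustrated with a minimal sketch. It assumes the paper has already been split into a mapping from section heading to text; the keyword lists and the `select_context` helper are hypothetical names for illustration, not the authors' implementation, and real papers will need more robust heading matching.

```python
from typing import Dict, List

# Hypothetical heading keywords per strategy; the real corpus may label
# sections differently.
STRATEGIES: Dict[str, List[str]] = {
    "DocTAET": ["title", "abstract", "experiment", "table"],
    "DocREC": ["result", "experiment", "conclusion"],
}


def select_context(sections: Dict[str, str], strategy: str) -> str:
    """Concatenate the sections whose heading matches the chosen strategy."""
    if strategy == "DocFULL":
        kept = list(sections.items())  # full paper: keep everything
    else:
        keywords = STRATEGIES[strategy]
        kept = [
            (heading, text)
            for heading, text in sections.items()
            if any(k in heading.lower() for k in keywords)
        ]
    return "\n\n".join(f"{heading}\n{text}" for heading, text in kept)


# Toy paper used only to show the behaviour of the helper.
paper = {
    "Title": "A Toy Paper",
    "Abstract": "We study X.",
    "Introduction": "Background.",
    "Experiments": "We train on Y.",
    "Results": "We reach 90.1 F1.",
    "Conclusion": "X works.",
}

taet = select_context(paper, "DocTAET")
rec = select_context(paper, "DocREC")
full = select_context(paper, "DocFULL")
```

In this toy example the DocTAET context drops the introduction, results, and conclusion, while DocREC drops the title, abstract, and introduction; DocFULL keeps every section.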

Key Findings

Targeted short context (DocTAET) yields the best paper-level leaderboard detection and structured-summary scores.

Numbers: Mistral-7B DocTAET: General Accuracy ≈ 89% (few-shot), 95% (zero-shot); ROUGE-1 ≈ 57.2 (few-shot).

Practical Use: When you only need to tell whether a paper reports leaderboards or produce a high‑level structured summary, feed the model the concise DocTAET sections (title, abstract, experiments, tables).

Evidence Ref: Table 2; Section 4.1

A results-focused context (DocREC) gives the best fine-grained extraction of individual elements (task, dataset, metric, score).

Numbers: Mistral-7B (DocREC/Ours) partial-match Overall F1 ≈ 25.65 (few-shot); partial-match Precision ≈ 36.14.

Practical Use: If you need to extract exact Task/Dataset/Metric/Score entries for a leaderboard, prefer DocREC (results+experiments+conclusion) over full text or title-only inputs.

Evidence Ref: Table 3 and Table 4; Section 4.1
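To make the exact-match versus partial-match distinction concrete, here is a small sketch of element-level precision, recall, and F1. The paper's precise matching rule is not stated in this summary, so the `partial` function below is an assumption: it treats two strings as a partial match when their `difflib` similarity ratio reaches a threshold of 0.8. All function names are illustrative.

```python
from difflib import SequenceMatcher
from typing import Callable, List, Tuple


def exact(a: str, b: str) -> bool:
    """Case-insensitive exact string match."""
    return a.strip().lower() == b.strip().lower()


def partial(a: str, b: str, threshold: float = 0.8) -> bool:
    """Assumed partial-match rule: similarity ratio at or above a threshold."""
    ratio = SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()
    return ratio >= threshold


def prf(
    pred: List[str], gold: List[str], match: Callable[[str, str], bool]
) -> Tuple[float, float, float]:
    """Precision, recall, F1 with each gold item matched at most once."""
    remaining = list(gold)
    true_positives = 0
    for p in pred:
        for g in remaining:
            if match(p, g):
                remaining.remove(g)  # consume the gold item
                true_positives += 1
                break
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy dataset names, not taken from the paper's corpus.
gold_datasets = ["ImageNet", "CIFAR-10"]
pred_datasets = ["imagenet", "CIFAR10", "MNIST"]

exact_scores = prf(pred_datasets, gold_datasets, exact)
partial_scores = prf(pred_datasets, gold_datasets, partial)
```

With these toy inputs, exact matching credits only `imagenet`, while the assumed partial rule also credits `CIFAR10` against `CIFAR-10`, so partial-match F1 is higher than exact-match F1.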

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Structured summary quality (ROUGE-1, few-shot)	Mistral-7B DocTAET: 57.24	—	—	Test few-shot (DocTAET)	Table 2: Mistral-7B ROUGE-1 few-shot = 57.24	Table 2
Accuracy	Mistral-7B DocTAET: ~89% (few-shot), ~95% (zero-shot)	—	—	Test few-shot / zero-shot	Table 2: General Accuracy values reported for Mistral-7B DocTAET	Section 4.1, Table 2

What To Try In 7 Days

Reproduce one pipeline: extract DocTAET sections, finetune a 7B model with QLoRA on a small sample, measure accuracy vs. full-text baseline.

For high-precision item extraction, run experiments feeding only DocREC sections and compare F1/Precision.

Add 'no-leaderboard' training examples (unanswerable) so the model returns 'unanswerable' instead of inventing tuples.
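The third suggestion can be sketched as a small training-example builder. The instruction wording, the `make_example` helper, and the prompt/completion field names are assumptions for illustration; they are not copied from the paper's FLAN templates. The only point shown is that papers without leaderboards get the literal target `unanswerable` rather than an empty or invented tuple list.

```python
import json
from typing import Dict, List, Optional

# Hypothetical instruction text; the paper's templates may be worded differently.
INSTRUCTION = (
    "Extract all (Task, Dataset, Metric, Score) tuples reported in the text. "
    "If none are reported, answer 'unanswerable'."
)


def make_example(
    context: str, tuples: Optional[List[Dict[str, str]]]
) -> Dict[str, str]:
    """Build one prompt/completion pair for instruction finetuning."""
    # Papers with no annotated tuples become explicit negative examples.
    target = json.dumps(tuples) if tuples else "unanswerable"
    prompt = f"{INSTRUCTION}\n\nContext:\n{context}\n\nAnswer:"
    return {"prompt": prompt, "completion": target}


positive = make_example(
    "We reach 90.1 F1 on SQuAD.",
    [{"task": "Question Answering", "dataset": "SQuAD", "metric": "F1", "score": "90.1"}],
)
negative = make_example("This position paper reports no experiments.", None)
```

Mixing a share of such negative examples into the training set is what lets the finetuned model abstain; the right ratio is an empirical question this sketch does not answer.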

Agent Features

Tool Use

LoRA

Frameworks

FLAN instruction tuning

Architectures

instruction-finetuned LLM

Optimization Features

Token Efficiency

Shorter targeted context (DocTAET/DocREC) to reduce input length and distractors

Model Optimization

7B model selection for efficiency

System Optimization

Batch size and gradient accumulation tuned to GPU limits

Training Optimization

Instruction finetuning with FLAN templates

LoRA

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Data derived from PwC (CC BY-SA); Mistral reported Apache-2.0

Risks & Boundaries

Limitations

Corpus is a snapshot rebuilt from PwC annotations (downloaded Dec 09, 2023) and may miss recent papers.

Experiments use only 7B variants (Mistral and Llama-2); results may change for larger models.

When Not To Use

When you need near-perfect, audited numeric extraction for downstream decisions without human review.

If you cannot reconstruct or reliably extract the targeted sections (DocTAET/DocREC) from paper sources.

Failure Modes

Hallucinated tuples when context is too long or unfocused (DocFULL).

Missed numeric scores or wrong model-to-metric pairing.
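One inexpensive guard against the first failure mode is a grounding check: flag any extracted tuple whose score never appears in the context the model was given. This is a sketch of a post-processing step that is not described in the paper; the helper names and the regular expression are assumptions, and the check cannot detect a real score paired with the wrong model or metric.

```python
import re
from typing import Dict, List


def score_is_grounded(score: str, context: str) -> bool:
    """True if the score occurs in the context as a standalone number."""
    number = score.strip().rstrip("%")
    if not number:
        return False
    # Reject matches embedded in a longer number, e.g. "90" inside "90.1".
    pattern = rf"(?<![\d.]){re.escape(number)}(?!\.?\d)"
    return re.search(pattern, context) is not None


def flag_ungrounded(tuples: List[Dict[str, str]], context: str) -> List[Dict[str, str]]:
    """Return the tuples whose score cannot be found in the context."""
    return [t for t in tuples if not score_is_grounded(t.get("score", ""), context)]


# Toy context and model output used only for illustration.
context = "Our model reaches 90.1 F1 on SQuAD."
extracted = [
    {"dataset": "SQuAD", "metric": "F1", "score": "90.1"},
    {"dataset": "SQuAD", "metric": "EM", "score": "95.3"},
    {"dataset": "SQuAD", "metric": "F1", "score": "90"},
]

suspicious = flag_ungrounded(extracted, context)
```

Here the `95.3` and `90` entries are flagged because neither number appears on its own in the context; flagged tuples are candidates for human review rather than automatic rejection.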

Core Entities

Models

Mistral-7B, Llama-2 7B, FLAN-T5 (instruction collection), LoRA

Metrics

ROUGE-1, ROUGE-2, ROUGE-L, F1, Precision, Accuracy

Datasets

DocREC/DocTAET/DocFULL corpus (derived from PwC annotations, snapshot Dec 09 2023), SQuAD v2 (instruction templates), DROP (instruction templates)

Benchmarks

ROUGE-1, ROUGE-2, ROUGE-L, ROUGE-Lsum, F1, Precision, Accuracy

Context Entities

Datasets

DocTAET (title, abstract, experiments, tables), DocREC (results, experiments, conclusion), DocFULL (full paper)
