Picking the right paper sections (not the whole paper) improves LLM leaderboard extraction and cuts hallucinations

June 6, 2024 · 7 min

Overview

Decision Snapshot: Ready For Pilot

The paper reports thorough empirical results on roughly 8k papers and compares three clearly defined context strategies, but extraction precision is still too low for production-grade leaderboard population.

Citations: 1

Evidence Strength: 0.75

Confidence: 0.86

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 0/5

Reproducibility

Status: Code + data available

Open source: Partial

License: Data derived from PwC (CC BY-SA); Mistral reported Apache-2.0

At A Glance

Cost impact: 40%

Production readiness: 50%

Novelty: 60%

Authors

Salomon Kabongo, Jennifer D'Souza, Sören Auer

Why It Matters For Business

Feeding models only the relevant paper sections speeds up extraction, reduces hallucinations, and improves accuracy for leaderboard curation, which lowers manual review and infrastructure costs.

Summary TLDR

The paper studies how the choice of paper sections fed to an LLM affects automatic extraction of (Task, Dataset, Metric, Score) tuples for leaderboards. The authors build an 8k-paper corpus from community annotations, finetune 7B LLMs (Mistral and Llama-2) with FLAN-style instructions using QLoRA, and compare three context choices: DocTAET (title+abstract+experiments+tables), DocREC (results+experiments+conclusion), and DocFULL (full paper). Key practical findings: DocTAET is best for detecting whether a paper has a leaderboard and for structured summary metrics; DocREC gives the best fine-grained element extraction; DocFULL performs poorly and increases hallucinations. Results are measured by

Problem Statement

Automatic creation of AI leaderboards means extracting (Task, Dataset, Metric, Score) tuples from papers. Existing NLI approaches require fixed taxonomies and struggle to adapt. This paper asks: which parts of a paper (context) should an instruction‑finetuned LLM see to maximize accuracy and reduce hallucination when generating leaderboards?

Main Contribution

A new corpus of ~8k papers with (Task, Dataset, Metric, Score) annotations, reconstructed from community exports of Papers with Code (PwC) and arXiv sources.

A controlled comparison of three context-selection strategies: DocTAET (targeted sections), DocREC (results/experiments/conclusion), and DocFULL (entire paper).
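The three context strategies can be illustrated with a minimal sketch. It assumes each paper has already been parsed into a dictionary of named sections; the section keys, the `build_context` helper, and the toy paper are hypothetical and are not taken from the authors' code.

```python
from typing import Dict, List

# Section groupings as described in this summary. The section keys
# ("title", "abstract", ...) are hypothetical names for a pre-parsed paper.
STRATEGIES: Dict[str, List[str]] = {
    "DocTAET": ["title", "abstract", "experiments", "tables"],
    "DocREC": ["results", "experiments", "conclusion"],
}


def build_context(paper: Dict[str, str], strategy: str) -> str:
    """Concatenate the sections selected by a context strategy."""
    if strategy == "DocFULL":
        # Full paper: every section, in the order the parser produced them.
        keys = list(paper.keys())
    else:
        keys = STRATEGIES[strategy]
    # Skip sections the parser could not find instead of failing.
    parts = [paper[k].strip() for k in keys if paper.get(k, "").strip()]
    return "\n\n".join(parts)


# Toy paper used only for illustration.
paper = {
    "title": "A Toy Paper",
    "abstract": "We study X.",
    "introduction": "Long background text.",
    "experiments": "We evaluate on dataset D.",
    "results": "Our model reaches 91.2 F1.",
    "tables": "Model | F1\nOurs | 91.2",
    "conclusion": "X works.",
}

for name in ("DocTAET", "DocREC", "DocFULL"):
    # Print the word count of each context to show the size difference.
    print(name, len(build_context(paper, name).split()))
```

The targeted strategies drop sections such as the introduction, which is where the reduction in input length and distractors comes from.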

Key Findings

Targeted short context (DocTAET) yields the best paper-level leaderboard detection and structured-summary scores.

Numbers: Mistral-7B DocTAET: General Accuracy ≈ 89% (few-shot), 95% (zero-shot); ROUGE-1 ≈ 57.2 (few-shot).

Practical Use: When you only need to tell whether a paper reports leaderboards or produce a high‑level structured summary, feed the model the concise DocTAET sections (title, abstract, experiments, tables).

Evidence Ref: Table 2; Section 4.1

A results-focused context (DocREC) gives the best fine-grained extraction of individual elements (task, dataset, metric, score).

Numbers: Mistral-7B (DocREC/Ours) partial-match Overall F1 ≈ 25.65 (few-shot); partial-match Precision ≈ 36.14.

Practical Use: If you need to extract exact Task/Dataset/Metric/Score entries for a leaderboard, prefer DocREC (results+experiments+conclusion) over full text or title-only inputs.

Evidence Ref: Table 3 and Table 4; Section 4.1
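To make the difference between exact-match and partial-match scoring concrete, here is a minimal sketch for a single element type (dataset names). The paper's actual matching criteria are not described in this summary, so the fuzzy rule used here (a `difflib.SequenceMatcher` ratio with a 0.8 threshold) and the helper names are assumptions for illustration only.

```python
from difflib import SequenceMatcher
from typing import Iterable, Tuple


def matches(pred: str, gold: str, partial: bool, threshold: float = 0.8) -> bool:
    """Exact match compares normalized strings; partial match is fuzzy."""
    p, g = pred.strip().lower(), gold.strip().lower()
    if not partial:
        return p == g
    # Assumed partial-match rule: character-level similarity above a threshold.
    return SequenceMatcher(None, p, g).ratio() >= threshold


def prf(preds: Iterable[str], golds: Iterable[str], partial: bool) -> Tuple[float, float, float]:
    """Precision, recall, and F1, matching each gold item at most once."""
    preds, golds = list(preds), list(golds)
    unmatched = list(golds)
    tp = 0
    for p in preds:
        for g in unmatched:
            if matches(p, g, partial):
                unmatched.remove(g)  # safe: we break right after removing
                tp += 1
                break
    precision = tp / len(preds) if preds else 0.0
    recall = tp / len(golds) if golds else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy data: one exact hit, one near miss, one spurious prediction.
golds = ["ImageNet", "CIFAR-10"]
preds = ["ImageNet", "CIFAR10", "MNIST"]

print("exact:  ", prf(preds, golds, partial=False))
print("partial:", prf(preds, golds, partial=True))
```

In this toy case the near miss "CIFAR10" only counts under partial matching, which is why partial-match scores are always at least as high as exact-match scores for the same predictions.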

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| Structured summary quality (ROUGE-1, few-shot) | Mistral-7B DocTAET: 57.24 | - | - | Test few-shot (DocTAET) | Table 2: Mistral-7B ROUGE-1 few-shot = 57.24 | Table 2 |
| Accuracy | Mistral-7B DocTAET: ~89% (few-shot), ~95% (zero-shot) | - | - | Test few-shot / zero-shot | Table 2: General Accuracy values reported for Mistral-7B DocTAET | Section 4.1, Table 2 |
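For readers unfamiliar with the ROUGE-1 figures above, the following is a simplified sketch of ROUGE-1 F1 as clipped unigram overlap. It assumes lowercase whitespace tokenization with no stemming, so it will not reproduce the paper's numbers exactly; the function name and example strings are hypothetical.

```python
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: clipped unigram overlap, no stemming."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count per token (clipping).
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Toy structured summaries that differ in one token out of six.
generated = "task qa dataset squad metric f1"
reference = "task qa dataset squad metric em"
print(round(rouge1_f1(generated, reference), 4))
```

Because the score rewards token overlap, a generated summary can score well on ROUGE-1 while still getting an individual element wrong, which is why the element-level F1 and precision are reported separately.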

What To Try In 7 Days

Reproduce one pipeline: extract DocTAET sections, finetune a 7B model with QLoRA on a small sample, measure accuracy vs. full-text baseline.

For high-precision item extraction, run experiments feeding only DocREC sections and compare F1/Precision.

Add 'no-leaderboard' training examples (unanswerable) so the model returns 'unanswerable' instead of inventing tuples.
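The last step, mixing in 'unanswerable' examples, could look like the sketch below. The prompt wording, field names, and `make_example` helper are hypothetical; the paper's actual FLAN-style templates are not reproduced here.

```python
import json
from typing import Dict, List, Optional

# Hypothetical prompt wording, not one of the paper's templates.
PROMPT = (
    "Extract all (Task, Dataset, Metric, Score) tuples reported in the text. "
    "If none are reported, answer 'unanswerable'.\n\nText:\n{context}"
)


def make_example(context: str, tuples: Optional[List[Dict[str, str]]]) -> Dict[str, str]:
    """Build one instruction-tuning pair; empty annotations map to 'unanswerable'."""
    target = json.dumps(tuples) if tuples else "unanswerable"
    return {"input": PROMPT.format(context=context), "output": target}


# One paper with a leaderboard entry and one without (toy data).
with_board = make_example(
    "Our model reaches 91.2 F1 on SQuAD.",
    [{"task": "Question Answering", "dataset": "SQuAD", "metric": "F1", "score": "91.2"}],
)
without_board = make_example("We survey prior work on parsing.", None)

print(with_board["output"])
print(without_board["output"])
```

Keeping the negative target as a fixed string makes it easy to measure paper-level detection accuracy separately from tuple extraction quality.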

Agent Features

Tool Use: LoRA
Frameworks: FLAN instruction tuning
Architectures: instruction-finetuned LLM

Optimization Features

Token Efficiency: Shorter targeted context (DocTAET/DocREC) to reduce input length and distractors
Model Optimization: 7B model selection for efficiency
System Optimization: Batch size and gradient accumulation tuned to GPU limits
Training Optimization: Instruction finetuning with FLAN templates, LoRA

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Data derived from PwC (CC BY-SA); Mistral reported Apache-2.0

Risks & Boundaries

Limitations

Corpus is a snapshot rebuilt from PwC annotations (downloaded Dec 09, 2023) and may miss recent papers.

Experiments use only 7B variants (Mistral and Llama-2); results may change for larger models.

When Not To Use

When you need near-perfect, audited numeric extraction for downstream decisions without human review.

If you cannot reconstruct or reliably extract the targeted sections (DocTAET/DocREC) from paper sources.

Failure Modes

Hallucinated tuples when context is too long or unfocused (DocFULL).

Missed numeric scores or wrong model-to-metric pairing.
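One cheap guard against the first failure mode is to check that every extracted score literally appears in the context the model saw. The sketch below assumes tuples are dictionaries with a `score` field; the helper names are hypothetical, and the check does not detect wrong model-to-metric pairing or differently formatted numbers (for example 91.20 versus 91.2).

```python
import re
from typing import Dict, List, Set


def numbers_in(text: str) -> Set[str]:
    """Return all numeric literals (e.g. '91.2', '57') found in the text."""
    return set(re.findall(r"\d+(?:\.\d+)?", text))


def flag_ungrounded(tuples: List[Dict[str, str]], context: str) -> List[Dict[str, str]]:
    """Return tuples whose score does not appear verbatim in the context."""
    seen = numbers_in(context)
    flagged = []
    for t in tuples:
        score_numbers = numbers_in(t.get("score", ""))
        # Flag tuples with no numeric score or with a number absent from the context.
        if not score_numbers or not score_numbers <= seen:
            flagged.append(t)
    return flagged


context = "Our model reaches 91.2 F1 on SQuAD."
extracted = [
    {"task": "QA", "dataset": "SQuAD", "metric": "F1", "score": "91.2"},
    {"task": "QA", "dataset": "SQuAD", "metric": "EM", "score": "93.4"},
]

for t in flag_ungrounded(extracted, context):
    print("needs review:", t)
```

Flagged tuples can be routed to human review rather than written to the leaderboard directly.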

Core Entities

Models

Mistral-7B, Llama-2 7B, FLAN-T5 (instruction collection), LoRA

Metrics

ROUGE-1, ROUGE-2, ROUGE-L, F1, Precision, Accuracy

Datasets

DocREC/DocTAET/DocFULL corpus (derived from PwC annotations, snapshot Dec 09 2023), SQuAD v2 (instruction templates), DROP (instruction templates)

Benchmarks

ROUGE-1, ROUGE-2, ROUGE-L, ROUGE-Lsum, F1, Precision, Accuracy

Context Entities

Datasets

DocTAET (title, abstract, experiments, tables), DocREC (results, experiments, conclusion), DocFULL (full paper)