Picking the right paper sections (not the whole paper) improves LLM leaderboard extraction and cuts hallucinations

June 6, 2024 · 7 min

Overview

Production Readiness

0.5

Novelty Score

0.6

Cost Impact Score

0.4

Citation Count

1

Authors

Salomon Kabongo, Jennifer D'Souza, Sören Auer

Links

Abstract / PDF

Why It Matters For Business

Feeding models only the right paper sections speeds up extraction, reduces hallucinations, and improves accuracy for leaderboard curation, which lowers manual review and infrastructure costs.

Summary TLDR

The paper studies how the choice of paper sections fed to an LLM affects automatic extraction of (Task, Dataset, Metric, Score) tuples for leaderboards. The authors build an 8k-paper corpus from community annotations, finetune 7B LLMs (Mistral and Llama-2) with FLAN-style instructions using QLoRA, and compare three context choices: DocTAET (title+abstract+experiments+tables), DocREC (results+experiments+conclusion), and DocFULL (full paper). Key practical findings: DocTAET is best for detecting whether a paper has a leaderboard and for structured summary metrics; DocREC gives the best fine-grained element extraction; DocFULL performs poorly and increases hallucinations. Results are measured by ROUGE, F1, precision, and accuracy.

Problem Statement

Automatic creation of AI leaderboards means extracting (Task, Dataset, Metric, Score) tuples from papers. Existing natural language inference (NLI) approaches require fixed taxonomies and struggle to adapt. This paper asks: which parts of a paper (context) should an instruction-finetuned LLM see to maximize accuracy and reduce hallucination when generating leaderboards?

Main Contribution

A new corpus of ~8k papers with (Task, Dataset, Metric, Score) annotations, reconstructed from community (Papers with Code, PwC) exports and arXiv sources.

A controlled comparison of three context-selection strategies: DocTAET (targeted sections), DocREC (results/experiments/conclusion), and DocFULL (entire paper).
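The three strategies differ only in which sections reach the model. A minimal sketch of that selection step, assuming the paper has already been parsed into a dict of named sections (the parsing itself, the section keys, and the `build_context` helper are illustrative assumptions, not the authors' code):

```python
from typing import Dict, List

# Section names per strategy follow the summary above.
# How sections are detected and named in a parsed paper is an assumption.
STRATEGIES: Dict[str, List[str]] = {
    "DocTAET": ["title", "abstract", "experiments", "tables"],
    "DocREC": ["results", "experiments", "conclusion"],
}


def build_context(sections: Dict[str, str], strategy: str) -> str:
    """Concatenate the sections a strategy selects; DocFULL keeps everything."""
    if strategy == "DocFULL":
        wanted = list(sections)  # keep the parsed paper's own order
    elif strategy in STRATEGIES:
        wanted = STRATEGIES[strategy]
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    # Skip sections the parser did not find instead of failing.
    parts = [sections[name].strip() for name in wanted if name in sections]
    return "\n\n".join(parts)


# Toy paper used only for illustration.
paper = {
    "title": "A Toy Paper",
    "abstract": "We study X.",
    "introduction": "Background on X.",
    "experiments": "We train on dataset D.",
    "results": "Model M reaches 90.1 accuracy on D.",
    "tables": "Table 1: M | D | accuracy | 90.1",
    "conclusion": "M works well.",
}

for name in ("DocTAET", "DocREC", "DocFULL"):
    context = build_context(paper, name)
    print(name, len(context.split()), "words")
```

Comparing the word counts per strategy on real papers gives a quick sense of how much input length the targeted contexts save relative to DocFULL.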

Empirical finetuning of two 7B LLMs (Mistral-7B, Llama-2 7B) with FLAN instructions using QLoRA and a comprehensive evaluation (ROUGE, F1, precision, accuracy).

Key Findings

Targeted short context (DocTAET) yields the best paper-level leaderboard detection and structured-summary scores.

Numbers: Mistral-7B DocTAET: General Accuracy ≈ 89% (few-shot), ≈ 95% (zero-shot); ROUGE-1 ≈ 57.2 (few-shot).

A results-focused context (DocREC) gives the best fine-grained extraction of individual elements (task, dataset, metric, score).

Numbers: Mistral-7B (DocREC/Ours) partial-match Overall F1 ≈ 25.65 (few-shot); partial-match Precision ≈ 36.14.

Using the full paper as context (DocFULL) dramatically lowers extraction quality and increases errors.

Numbers: Mistral-7B DocFULL exact-match Overall F1 ≈ 0.63 to 0.92 (few-shot); many element scores near single-digit percentages.

Results

Structured summary quality (ROUGE-1, few-shot)

Value: Mistral-7B DocTAET: 57.24

Accuracy

Value: Mistral-7B DocTAET: ~89% (few-shot), ~95% (zero-shot)

Element extraction overall F1 (partial match, few-shot)

Value: Mistral-7B DocREC (Ours): 25.65

Element extraction Precision (partial match, few-shot)

Value: Mistral-7B DocREC (Ours): 36.14

Full paper context extraction (Overall F1, few-shot)

Value: Mistral-7B DocFULL exact-match Overall F1 ≈ 0.63
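The results distinguish exact-match from partial-match scoring. The sketch below shows one way such a distinction could be computed over predicted and gold tuples. The paper's precise partial-match definition is not given in this summary, so the fuzzy string comparison via `difflib` and the 0.8 threshold are assumptions made purely for illustration, as is the `prf` helper itself.

```python
from difflib import SequenceMatcher
from typing import List, Tuple

TDMS = Tuple[str, str, str, str]  # (task, dataset, metric, score)


def _same(a: str, b: str, threshold: float) -> bool:
    """Exact (case-insensitive) match at threshold 1.0, fuzzy match below it."""
    a, b = a.strip().lower(), b.strip().lower()
    if threshold >= 1.0:
        return a == b
    return SequenceMatcher(None, a, b).ratio() >= threshold


def prf(pred: List[TDMS], gold: List[TDMS], threshold: float = 1.0):
    """Precision, recall, F1 where each gold tuple can be matched at most once."""
    unmatched = list(gold)
    true_positives = 0
    for p in pred:
        for g in unmatched:
            if all(_same(x, y, threshold) for x, y in zip(p, g)):
                unmatched.remove(g)  # safe: we break right after removing
                true_positives += 1
                break
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy data: the prediction differs from gold only in casing and dataset naming.
gold = [("Image Classification", "ImageNet", "Top-1 Accuracy", "88.5")]
pred = [("Image classification", "ImageNet-1k", "Top-1 Accuracy", "88.5")]

print("exact:  ", prf(pred, gold, threshold=1.0))
print("partial:", prf(pred, gold, threshold=0.8))
```

Under exact matching the "ImageNet-1k" variant counts as a miss, while the fuzzy threshold accepts it, which is the kind of gap that separates the exact-match and partial-match numbers reported above.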

Who Should Care

What To Try In 7 Days

Reproduce one pipeline: extract DocTAET sections, finetune a 7B model with QLoRA on a small sample, measure accuracy vs. full-text baseline.

For high-precision item extraction, run experiments feeding only DocREC sections and compare F1/Precision.

Add 'no-leaderboard' training examples (unanswerable) so the model returns 'unanswerable' instead of inventing tuples.
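A minimal sketch of how answerable and 'unanswerable' training examples could sit side by side in one finetuning set. The instruction wording, the JSON target format, and the `make_example` helper are hypothetical; the paper's actual FLAN templates are not reproduced here.

```python
import json
from typing import Dict, List

# Hypothetical instruction text, not one of the paper's FLAN templates.
INSTRUCTION = (
    "Extract all (Task, Dataset, Metric, Score) tuples reported in the text. "
    "If the text reports no leaderboard results, answer 'unanswerable'."
)


def make_example(context: str, tuples: List[Dict[str, str]]) -> Dict[str, str]:
    """Build one prompt/response pair; empty annotations map to 'unanswerable'."""
    target = json.dumps(tuples, sort_keys=True) if tuples else "unanswerable"
    prompt = f"{INSTRUCTION}\n\nText: {context}\n\nAnswer:"
    return {"prompt": prompt, "response": target}


# One paper with a leaderboard entry, one without (toy data).
with_board = make_example(
    "Model M reaches 90.1 accuracy on D.",
    [{"task": "Classification", "dataset": "D", "metric": "Accuracy", "score": "90.1"}],
)
without_board = make_example("We survey prior work on X.", [])

print(with_board["response"])
print(without_board["response"])
```

Keeping both kinds of example in the same format lets the model learn that declining to answer is a valid output rather than an error case.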

Agent Features

Tool Use

  • LoRA

Frameworks

  • FLAN instruction tuning

Architectures

  • instruction-finetuned LLM

Optimization Features

Token Efficiency

  • Shorter targeted context (DocTAET/DocREC) to reduce input length and distractors

Model Optimization

  • 7B model selection for efficiency

System Optimization

  • Batch size and gradient accumulation tuned to GPU limits
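Gradient accumulation trades memory for steps: the effective batch size is the per-device batch size times the number of accumulation steps times the number of devices. The numbers below are hypothetical and are not the paper's settings; they only illustrate the arithmetic for a single-GPU setup.

```python
def effective_batch_size(per_device_batch: int, grad_accum_steps: int,
                         num_devices: int = 1) -> int:
    """Examples contributing to each optimizer update."""
    return per_device_batch * grad_accum_steps * num_devices


def accumulation_steps_for(target_batch: int, per_device_batch: int,
                           num_devices: int = 1) -> int:
    """Smallest number of accumulation steps that reaches the target batch."""
    per_step = per_device_batch * num_devices
    return -(-target_batch // per_step)  # ceiling division


# Hypothetical: the GPU fits 2 sequences at a time and we want a batch of 32.
steps = accumulation_steps_for(target_batch=32, per_device_batch=2)
print(steps, effective_batch_size(per_device_batch=2, grad_accum_steps=steps))
```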

Training Optimization

  • Instruction finetuning with FLAN templates
  • QLoRA (LoRA adapters on a quantized base model)

Reproducibility

License

  • Data derived from PwC (CC BY-SA); Mistral-7B is reported as Apache-2.0

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Corpus is a snapshot rebuilt from PwC annotations (downloaded Dec 09, 2023) and may miss recent papers.
  • Experiments use only 7B variants (Mistral and Llama-2); results may change for larger models.
  • Element extraction F1/precision remain modest, especially for numeric score extraction.
  • All experiments run on a single GPU (NVIDIA 3090); scaling costs for larger deployments are not measured.

When Not To Use

  • When you need near-perfect, audited numeric extraction for downstream decisions without human review.
  • If you cannot reconstruct or reliably extract the targeted sections (DocTAET/DocREC) from paper sources.

Failure Modes

  • Hallucinated tuples when context is too long or unfocused (DocFULL).
  • Missed numeric scores or wrong model-to-metric pairing.
  • Low recall on datasets and metrics even when task extraction is correct.

Core Entities

Models

  • Mistral-7B
  • Llama-2 7B
  • FLAN-T5 (instruction collection)
  • LoRA

Metrics

  • ROUGE-1
  • ROUGE-2
  • ROUGE-L
  • F1
  • Precision
  • Accuracy

Datasets

  • DocREC/DocTAET/DocFULL corpus (derived from PwC annotations, snapshot Dec 09 2023)
  • SQuAD v2 (instruction templates)
  • DROP (instruction templates)

Benchmarks

  • ROUGE-1
  • ROUGE-2
  • ROUGE-L
  • ROUGE-Lsum
  • F1
  • Precision
  • Accuracy

Context Entities

Datasets

  • DocTAET (title, abstract, experiments, tables)
  • DocREC (results, experiments, conclusion)
  • DocFULL (full paper)