PNCExtract: a full-paper benchmark and LLM prompts to pull polymer nanocomposite samples

Overview

Decision Snapshot: Needs Validation

The paper delivers a new dataset and concrete zero-shot baselines showing LLMs can help data curation, but current extraction quality and missing modalities limit turnkey production use.

Citations: 3

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 2/6

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 45%

Production readiness: 30%

Novelty: 50%

Authors

Ghazal Khalighinejad, Defne Circi, L. C. Brinson, Bhuwan Dhingra

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Automating the extraction of polymer nanocomposite compositions speeds up dataset creation for materials discovery. Zero-shot LLMs still miss many entries, however, so expect a hybrid workflow: LLM-assisted triage followed by human validation.

Who Should Care

Data Scientist, ML Engineer, Product Manager, CTO, Founder

Summary TLDR

The authors build PNCExtract, a dataset of 193 full-length polymer nanocomposite (PNC) papers with 1,052 labeled samples and six main attributes. They test zero-shot prompting of LLMs (E2E vs NER+RE), add a list-style self-consistency method, and show document condensation with a dense retriever helps. GPT-4 (E2E) gives the best zero-shot results (partial F1 ≈ 54–55%), but many true samples are still missed. Key limits: text-only models, scattered attributes in figures/tables, and variable chemical names.

Problem Statement

Extracting full sample lists from PNC research papers is hard because each sample is an N-ary object (matrix, filler, composition) whose attributes are scattered across text, figures, and tables; labeled data are scarce and full-document context is long, which breaks conventional encoder-only pipelines.

Main Contribution

PNCExtract dataset: 193 full papers, 1,052 samples, six selected attributes per sample.

Dual evaluation: strict exact-match metric and a partial F1 metric that rewards partial matches.
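To make the difference between the two metrics concrete, here is a minimal sketch of one plausible reading. It is not the paper's scoring code: the greedy one-to-one matching, the attribute names (`matrix`, `filler`, `composition`), and the helper names `_overlap`, `_f1`, and `score` are all assumptions made for illustration. A matched pair earns full strict credit only when every attribute agrees, while partial credit is the fraction of attributes that agree.

```python
from typing import Dict, List, Tuple

# Hypothetical sample representation: attribute name -> extracted string.
Sample = Dict[str, str]


def _overlap(pred: Sample, gold: Sample) -> float:
    """Fraction of gold attributes that the prediction reproduces exactly."""
    if not gold:
        return 0.0
    hits = sum(1 for key, value in gold.items() if pred.get(key) == value)
    return hits / len(gold)


def _f1(credit: float, n_pred: int, n_gold: int) -> float:
    """F1 from accumulated credit, guarding against empty inputs."""
    precision = credit / n_pred if n_pred else 0.0
    recall = credit / n_gold if n_gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def score(preds: List[Sample], golds: List[Sample]) -> Tuple[float, float]:
    """Return (strict_f1, partial_f1) under greedy one-to-one matching.

    Assumption: pairs are matched in order of decreasing attribute overlap.
    """
    pairs = sorted(
        ((_overlap(p, g), i, j)
         for i, p in enumerate(preds)
         for j, g in enumerate(golds)),
        key=lambda item: -item[0],
    )
    used_pred, used_gold = set(), set()
    strict_credit = 0.0
    partial_credit = 0.0
    for overlap, i, j in pairs:
        if overlap == 0.0 or i in used_pred or j in used_gold:
            continue
        used_pred.add(i)
        used_gold.add(j)
        partial_credit += overlap       # partial: reward agreeing attributes
        if overlap == 1.0:
            strict_credit += 1.0        # strict: all attributes must agree
    n_pred, n_gold = len(preds), len(golds)
    return _f1(strict_credit, n_pred, n_gold), _f1(partial_credit, n_pred, n_gold)
```

Under this sketch, a prediction that gets the matrix and filler right but the composition wrong contributes nothing to the strict score and two thirds of a match to the partial score, which is why the partial figures reported for this benchmark sit well above the strict ones.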

Key Findings

Dataset size and scope

Numbers: 193 papers; 1,052 ground-truth samples

Practical Use: You can use PNCExtract to test document-level extraction tools on realistic full-paper PNC data.

Evidence Ref: Section 2.2, Table 3

Best zero-shot LLM performance (partial metric)

Numbers: GPT-4 (E2E, condensed) partial F1 = 54.8%; with self-consistency, partial F1 = 54.9%

Practical Use: GPT-4 can extract many attributes without training, but expect only ~55% partial F1 on this task in zero-shot.

Evidence Ref: Table 4

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
GPT-4 Turbo (condensed, E2E) partial F1	54.8%	—	—	PNCExtract (condensed papers)	Table 4 reports partial F1 54.8 for GPT-4 on condensed papers	Table 4
GPT-4 Turbo + self-consistency strict F1	38.8%	—	—	PNCExtract (condensed papers)	Table 4 shows strict F1 rises to 38.8 with SC	Table 4

What To Try In 7 Days

Run GPT-4 with the paper's E2E JSON prompt on a small corpus to inspect extracted samples.

Apply dense retrieval (GTR-large) to condense papers and compare extraction quality before/after.

Use self-consistency (≈8 runs, α=3) to filter high-confidence samples and prioritize manual checking.
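The self-consistency step above can be sketched as a vote over repeated runs. This is a minimal illustration, not the authors' implementation: it assumes α is the minimum number of runs in which a sample must appear to be kept, that each run has already been parsed into a list of attribute dictionaries, and that samples are compared after simple lowercasing and whitespace trimming. The names `_key` and `self_consistency_filter` are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical sample representation: attribute name -> extracted string.
Sample = Dict[str, str]


def _key(sample: Sample) -> Tuple[Tuple[str, str], ...]:
    """Order-independent, case-insensitive identity for a sample."""
    return tuple(sorted((k, v.strip().lower()) for k, v in sample.items()))


def self_consistency_filter(runs: List[List[Sample]], alpha: int = 3) -> List[Sample]:
    """Keep samples that appear in at least `alpha` of the runs.

    Each run contributes at most one vote per distinct sample, so a run
    that repeats the same sample does not inflate its count.
    """
    votes: Counter = Counter()
    first_seen: Dict[Tuple[Tuple[str, str], ...], Sample] = {}
    for run in runs:
        seen_in_run = set()
        for sample in run:
            key = _key(sample)
            if key in seen_in_run:
                continue
            seen_in_run.add(key)
            votes[key] += 1
            first_seen.setdefault(key, sample)
    # Preserve the order in which samples were first encountered.
    return [first_seen[key] for key, count in votes.items() if count >= alpha]
```

Samples that clear the threshold can go to a fast-track review queue, while those that fall below it are the natural candidates for closer manual checking.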

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/ghazalkhalighinejad/PNCExtract

Data URLs

NanoMine repository (Zhao et al., 2018)

Risks & Boundaries

Limitations

Study is text-only; figures and tables are not parsed by models.

NanoMine contains annotation inconsistencies and some corrections were manual.

When Not To Use

When you need fully correct, production-ready sample records without human review.

When key sample details appear only in figures or tables.

Failure Modes

Missed samples scattered across paper sections and figures.

Wrong composition values or units due to inconsistent formatting.

Core Entities

Models

GPT-4 Turbo, LLaMA2-7b-chat, LongChat-7B-16K, Vicuna-7B-v1.5, Vicuna-7B-v1.5-16K

Metrics

Partial-F1, Strict-F1, Precision, Recall, F1

Datasets

PNCExtract, NanoMine, SciREX

Benchmarks

PNCExtract, SciREX
