Use small local LLMs to separate true SDG contributions from incidental keyword mentions

Overview

Decision Snapshot: Needs Validation

Promising prototype: uses accessible local LLMs but evidence is limited to abstracts and a single SDG; expect extra engineering and validation before production use.

Citations: 1

Evidence Strength: 0.60

Confidence: 0.82

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 1/4

Findings with evidence refs: 4/4

Results with explicit delta: 0/2

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 60%

Production readiness: 40%

Novelty: 50%

Authors

William A. Ingram, Bipasha Banerjee, Edward A. Fox

Links

Abstract / PDF

Why It Matters For Business

Universities and research managers can avoid inflated SDG counts from keyword hits and make funding, ranking, and reporting decisions based on substantively relevant work.

Who Should Care

CTO, Product Manager, ML Engineer, Data Scientist, Founder

Summary TLDR

Keyword searches return many papers that mention SDG terms without real contribution. This study retrieves 20,000 Scopus abstracts per SDG and uses small, locally hosted LLMs as evaluation agents to re-classify abstracts as 'Relevant' or 'Non-Relevant' to an SDG target. On SDG 1, three models differed strongly in selectivity (Phi-3.5: 52% relevant; Mistral-7B: 70%; Llama-3.2: 15%). The authors propose ensembles of complementary models to balance inclusiveness and precision. Main limits: prompt sensitivity, abstracts-only data, and focus on SDG 1.

Problem Statement

Keyword-based SDG searches give many false positives because they match words, not substantive contributions. Institutions need a practical method to measure research that actually advances SDG targets rather than just mentioning them.

Main Contribution

Introduce an LLM-driven evaluation agent that classifies abstracts as substantive or superficial for SDG targets.

Apply the agent to a large Scopus collection (20,000 abstracts per SDG using Elsevier SDG queries).
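A minimal sketch of such an evaluation agent follows. The prompt wording, the `generate` callable, and the `Unparsed` fallback are assumptions made for illustration and are not taken from the paper; `generate` stands in for whatever local inference call is available (prompt in, text out).

```python
import re
from typing import Callable

# Hypothetical zero-shot prompt; the study's actual wording is not given here.
PROMPT = (
    "You are evaluating a research abstract for {sdg}.\n"
    "Answer with exactly one label: Relevant or Non-Relevant.\n\n"
    "Abstract:\n{abstract}\n\nLabel:"
)


def parse_label(text: str) -> str:
    """Map raw model output to 'Relevant', 'Non-Relevant', or 'Unparsed'."""
    # Check the negative label first, because 'Non-Relevant' contains 'Relevant'.
    if re.search(r"\b(non|not|ir)[\s-]*relevant\b", text, re.IGNORECASE):
        return "Non-Relevant"
    if re.search(r"\brelevant\b", text, re.IGNORECASE):
        return "Relevant"
    return "Unparsed"


def classify_abstract(generate: Callable[[str], str], abstract: str, sdg: str) -> str:
    """Build the prompt, call the model, and normalise its answer."""
    prompt = PROMPT.format(sdg=sdg, abstract=abstract.strip())
    return parse_label(generate(prompt))


def fake_generate(prompt: str) -> str:
    """Stub standing in for a local LLM call, so the sketch runs as-is."""
    return "Label: Non-Relevant. Poverty is mentioned only in passing."


label = classify_abstract(fake_generate, "An invented abstract.", "SDG 1")
```

Keeping an explicit `Unparsed` outcome makes it possible to count how often a model ignores the requested output format, instead of silently folding those cases into one of the two labels.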

Key Findings

Small local LLMs can distinguish substantive SDG contributions from superficial mentions in abstracts.

Practical Use: Re-score keyword-retrieved abstracts with a small LLM to improve the precision of institutional SDG metrics.

Evidence Ref: Abstract, Method, Conclusion

Model selectivity varied strongly on SDG 1: Phi-3.5-mini labeled 52% relevant, Mistral-7B labeled 70% relevant, Llama-3.2 labeled 15% relevant.

Numbers: Phi-3.5: 52% relevant; Mistral-7B: 70% relevant; Llama-3.2: 15% relevant

Practical Use: Expect very different recall/precision trade-offs across models; pick or combine models based on your tolerance for false positives.

Evidence Ref: III. RESULTS; Fig. 2
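One way to combine models with different selectivity is a simple vote threshold. The sketch below is an assumption, not the ensemble the authors propose in detail: the model keys, the toy labels, and the `min_votes` rule are all invented for illustration.

```python
from collections import Counter

# Toy labels for four abstracts; invented data, not results from the study.
labels_by_model = {
    "phi": ["Relevant", "Relevant", "Non-Relevant", "Non-Relevant"],
    "mistral": ["Relevant", "Relevant", "Relevant", "Non-Relevant"],
    "llama": ["Relevant", "Non-Relevant", "Non-Relevant", "Non-Relevant"],
}


def relevant_rate(labels):
    """Fraction of abstracts a single model labeled 'Relevant'."""
    return sum(label == "Relevant" for label in labels) / len(labels)


def ensemble_label(votes, min_votes=2):
    """Call an abstract 'Relevant' only if at least min_votes models agree."""
    return "Relevant" if Counter(votes)["Relevant"] >= min_votes else "Non-Relevant"


rates = {name: relevant_rate(labels) for name, labels in labels_by_model.items()}

# zip(*...) regroups the per-model lists into one tuple of votes per abstract.
per_abstract_votes = list(zip(*labels_by_model.values()))
majority = [ensemble_label(votes, min_votes=2) for votes in per_abstract_votes]
unanimous = [ensemble_label(votes, min_votes=3) for votes in per_abstract_votes]
```

Raising `min_votes` trades inclusiveness for precision: a unanimous rule keeps only abstracts that even the strictest model accepts, while `min_votes=1` keeps anything any model accepts.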

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Percent of abstracts labeled Relevant (SDG 1)	Phi-3.5: 52%; Mistral-7B: 70%; Llama-3.2: 15%	Keyword-based retrieval (implicit baseline: all retrieved abstracts)	n/a	SDG 1 abstracts from Scopus	III. RESULTS; Fig. 2	Fig. 2
Inter-model agreement patterns	Low overlap on 'Relevant' labels; higher alignment on 'Non-Relevant'	n/a	n/a	SDG 1	III. RESULTS; Fig. 3 Venn diagrams	Fig. 3
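The agreement pattern in the second row can be quantified from the sets of abstract IDs each model labels Relevant. The sketch below uses Jaccard overlap as one possible measure; this choice, the model keys, and the toy ID sets are assumptions and do not reproduce the paper's Venn analysis.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by size of the union (1.0 if both empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def pairwise_agreement(relevant_ids: dict) -> dict:
    """Jaccard overlap of 'Relevant' sets for every pair of models."""
    return {
        (m1, m2): jaccard(relevant_ids[m1], relevant_ids[m2])
        for m1, m2 in combinations(sorted(relevant_ids), 2)
    }


# Toy abstract IDs each model labeled 'Relevant'; invented for illustration.
relevant_ids = {
    "phi": {1, 2, 3},
    "mistral": {2, 3, 4, 5},
    "llama": {3},
}

overlaps = pairwise_agreement(relevant_ids)

# Abstracts every model accepts: the centre region of a three-way Venn diagram.
core = set.intersection(*relevant_ids.values())
```

The same functions can be applied to the Non-Relevant sets to check whether models align more closely on rejections than on acceptances.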

What To Try In 7 Days

Run a local small LLM over a keyword-retrieved set and compare model 'Relevant' rates.

Create a prompt listing SDG target criteria and two short example abstracts (relevant / non-relevant).

Inspect 100 high- and low-confidence classifications and adjust prompt wording or thresholds.
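The prompt described in the second step could be assembled as below. This is a sketch under stated assumptions: `build_prompt`, the placeholder criteria, and both example abstracts are hypothetical and should be replaced with the real SDG target text and hand-picked examples.

```python
def build_prompt(sdg_name, criteria, examples, abstract):
    """Assemble a few-shot prompt: task line, criteria, labeled examples, query."""
    lines = [
        f"Task: decide whether the abstract makes a substantive contribution to {sdg_name}.",
        "Answer with exactly one label: Relevant or Non-Relevant.",
        "Criteria:",
    ]
    lines += [f"- {criterion}" for criterion in criteria]
    for text, label in examples:
        lines += ["", f"Abstract: {text}", f"Label: {label}"]
    # End on an open 'Label:' so the model completes it with its decision.
    lines += ["", f"Abstract: {abstract}", "Label:"]
    return "\n".join(lines)


# Placeholder criteria and invented example abstracts (not from the study).
criteria = [
    "The work studies or evaluates an intervention tied to the target.",
    "The target is central to the research question, not a passing mention.",
]
examples = [
    ("We evaluate a cash-transfer programme and measure household income.", "Relevant"),
    ("We propose a faster sorting algorithm; poverty is named once as motivation.", "Non-Relevant"),
]

prompt = build_prompt("SDG 1", criteria, examples, "An invented abstract to classify.")
```

Because outcomes are sensitive to phrasing, it helps to keep the criteria and examples as data and vary them one at a time while comparing each model's Relevant rate.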

Agent Features

Memory

short-term context window (prompt + abstract)

Tool Use

prompt-driven classification

Frameworks

single-model evaluation agent per SDG; proposed multi-agent ensemble

Is Agentic

Yes

Architectures

instruction-tuned decoder-only LLMs

Collaboration

ensemble / multi-agent conversation (proposed)

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Unknown

License: Unknown

Risks & Boundaries

Limitations

Prompt sensitivity: phrasing can change outcomes and reduce generalizability

Abstracts only: the evaluation did not use full text, so some relevance signals may be missing

When Not To Use

When full-text context is required for accurate relevance judgment

For formal institutional reporting before cross-validation and human review

Failure Modes

Classifying superficial mentions as substantive (false positives)

Overly strict models that miss indirect but real contributions (false negatives)

Core Entities

Models

Phi-3.5-mini-instruct

Mistral-7B-Instruct-v0.3

Llama-3.2-3B-Instruct

Metrics

Percent labeled Relevant (per model)

Inter-model agreement (Venn overlaps)

Datasets

Scopus abstracts via Elsevier SDG mapping (20,000 abstracts per SDG retrieval sets)

Context Entities

Datasets

Elsevier SDG Research Mapping queries
