Use small local LLMs to separate true SDG contributions from incidental keyword mentions

November 26, 2024 | 6 min

Overview

Decision Snapshot: Needs Validation

Promising prototype: uses accessible local LLMs but evidence is limited to abstracts and a single SDG; expect extra engineering and validation before production use.

Citations: 1

Evidence Strength: 0.60

Confidence: 0.82

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 1/4

Findings with evidence refs: 4/4

Results with explicit delta: 0/2

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 60%

Production readiness: 40%

Novelty: 50%

Authors

William A. Ingram, Bipasha Banerjee, Edward A. Fox

Why It Matters For Business

Universities and research managers can avoid inflated SDG counts from keyword hits and make funding, ranking, and reporting decisions based on substantively relevant work.

Summary TLDR

Keyword searches return many papers that mention SDG terms without making a real contribution. This study retrieves 20,000 Scopus abstracts per SDG and uses small, locally hosted LLMs as evaluation agents to re-classify each abstract as 'Relevant' or 'Non-Relevant' to an SDG target. On SDG 1, the three models differed strongly in selectivity (Phi-3.5: 52% relevant; Mistral-7B: 70%; Llama-3.2: 15%). The authors propose ensembles of complementary models to balance inclusiveness and precision. Main limitations: prompt sensitivity, abstracts-only data, and results reported for SDG 1 only.

Problem Statement

Keyword-based SDG searches give many false positives because they match words, not substantive contributions. Institutions need a practical method to measure research that actually advances SDG targets rather than just mentioning them.

Main Contribution

Introduce an LLM-driven evaluation agent that classifies abstracts as substantive or superficial for SDG targets.

Apply the agent to a large Scopus collection (20,000 abstracts per SDG using Elsevier SDG queries).
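
The evaluation agent described above can be sketched as a prompt, a model call, and a label parser. This is a minimal illustration, not the authors' code: the prompt wording, the function names, and the `generate` callable (a stand-in for whatever local LLM runtime is used) are all assumptions.

```python
import re
from typing import Callable

def build_prompt(sdg_target: str, abstract: str) -> str:
    """Assemble a single-turn classification prompt (wording is illustrative)."""
    return (
        f"SDG target: {sdg_target}\n"
        "Decide whether the abstract makes a substantive contribution to this "
        "target or only mentions related keywords.\n"
        "Answer with exactly one label: Relevant or Non-Relevant.\n\n"
        f"Abstract: {abstract}\n"
        "Label:"
    )

def parse_label(raw: str) -> str:
    """Map free-form model output onto one of the two labels."""
    text = re.sub(r"[^a-z]", "", raw.lower())
    # Check negative forms first: "irrelevant" and "nonrelevant" contain "relevant".
    if any(neg in text for neg in ("nonrelevant", "notrelevant", "irrelevant")):
        return "Non-Relevant"
    if "relevant" in text:
        return "Relevant"
    return "Unparsed"  # flag for manual inspection instead of guessing

def classify(abstract: str, sdg_target: str, generate: Callable[[str], str]) -> str:
    """Classify one abstract; `generate` is a hypothetical prompt -> text function."""
    return parse_label(generate(build_prompt(sdg_target, abstract)))
```

Passing the model as a callable keeps the sketch independent of any particular runtime, so the same loop can be repeated with each of the three models.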

Key Findings

Small local LLMs can distinguish substantive SDG contributions from superficial mentions in abstracts.

Practical Use: Re-score keyword-retrieved abstracts with a small LLM to improve the precision of institutional SDG metrics.

Evidence Ref: Abstract, Method, Conclusion

Model selectivity varied strongly on SDG 1: Phi-3.5-mini labeled 52% relevant, Mistral-7B labeled 70% relevant, Llama-3.2 labeled 15% relevant.

Numbers: Phi-3.5: 52% relevant; Mistral-7B: 70% relevant; Llama-3.2: 15% relevant

Practical Use: Expect very different recall/precision trade-offs across models; pick or combine models based on your tolerance for false positives.

Evidence Ref: III. RESULTS; Fig. 2
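
One way to combine models according to tolerance for false positives is a simple vote threshold. The summary does not say how the proposed ensemble would combine labels, so the voting rule, the function name, and the model keys below are assumptions for illustration only.

```python
from typing import Dict

def ensemble_label(votes: Dict[str, str], min_relevant: int) -> str:
    """Return 'Relevant' if at least `min_relevant` models voted 'Relevant'.

    min_relevant=1 behaves like a union (inclusive, more false positives);
    min_relevant=len(votes) behaves like an intersection (strict, more misses).
    """
    if not 1 <= min_relevant <= len(votes):
        raise ValueError("min_relevant must be between 1 and the number of models")
    relevant_votes = sum(1 for label in votes.values() if label == "Relevant")
    return "Relevant" if relevant_votes >= min_relevant else "Non-Relevant"

# Hypothetical per-model labels for one abstract.
votes = {"phi": "Relevant", "mistral": "Relevant", "llama": "Non-Relevant"}
inclusive = ensemble_label(votes, min_relevant=1)
majority = ensemble_label(votes, min_relevant=2)
strict = ensemble_label(votes, min_relevant=3)
```

With these example votes, the inclusive and majority settings accept the abstract and the strict setting rejects it.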

Results

Metric: Percent of abstracts labeled Relevant (SDG 1)
Value: Phi-3.5: 52% | Mistral-7B: 70% | Llama-3.2: 15%
Baseline: Keyword-based retrieval (implicit baseline: all retrieved abstracts)
Delta: not reported
Split / Dataset: SDG 1 abstracts from Scopus
Evidence: III. RESULTS; Fig. 2
Evidence Ref: Fig. 2

Metric: Inter-model agreement patterns
Value: Low overlap on 'Relevant' labels; higher alignment on 'Non-Relevant'
Baseline: not reported
Delta: not reported
Split / Dataset: SDG 1
Evidence: III. RESULTS; Fig. 3 Venn diagrams
Evidence Ref: Fig. 3
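
Both metrics can be computed from per-model label lists. The paper reports agreement through Venn diagrams; the sketch below substitutes pairwise Jaccard overlap as a compact numeric stand-in, which is an assumption, and the toy labels are invented for illustration.

```python
from itertools import combinations
from typing import Dict, List, Tuple

def relevant_rate(labels: List[str]) -> float:
    """Fraction of abstracts a model labeled 'Relevant'."""
    return sum(label == "Relevant" for label in labels) / len(labels)

def pairwise_overlap(
    labels_by_model: Dict[str, List[str]], label: str
) -> Dict[Tuple[str, str], float]:
    """Jaccard overlap, per model pair, of the abstracts assigned `label`."""
    overlaps = {}
    for a, b in combinations(sorted(labels_by_model), 2):
        set_a = {i for i, l in enumerate(labels_by_model[a]) if l == label}
        set_b = {i for i, l in enumerate(labels_by_model[b]) if l == label}
        union = set_a | set_b
        # If neither model used the label, report 0.0 rather than dividing by zero.
        overlaps[(a, b)] = len(set_a & set_b) / len(union) if union else 0.0
    return overlaps

# Toy labels for four abstracts, aligned by position (not data from the paper).
toy = {
    "model_a": ["Relevant", "Relevant", "Non-Relevant", "Non-Relevant"],
    "model_b": ["Relevant", "Non-Relevant", "Non-Relevant", "Non-Relevant"],
}
rate_a = relevant_rate(toy["model_a"])
rel_overlap = pairwise_overlap(toy, "Relevant")
non_overlap = pairwise_overlap(toy, "Non-Relevant")
```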

What To Try In 7 Days

Run a local small LLM over a keyword-retrieved set and compare model 'Relevant' rates.

Create a prompt listing SDG target criteria and two short example abstracts (relevant / non-relevant).

Inspect 100 high- and low-confidence classifications and adjust prompt wording or thresholds.
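
The prompt and review steps above might look like the following sketch. The prompt layout is illustrative, and because the summary does not say how confidence is measured, agreement between models is used here as a proxy; that choice, along with all function and key names, is an assumption.

```python
import random
from typing import Dict, List, Sequence

def build_few_shot_prompt(
    criteria: Sequence[str],
    relevant_example: str,
    non_relevant_example: str,
    abstract: str,
) -> str:
    """Prompt with target criteria plus one example abstract for each label."""
    criteria_block = "\n".join(f"- {c}" for c in criteria)
    return (
        "Classify the abstract as Relevant or Non-Relevant to the SDG target.\n"
        f"Criteria for Relevant:\n{criteria_block}\n\n"
        f"Example abstract: {relevant_example}\nLabel: Relevant\n\n"
        f"Example abstract: {non_relevant_example}\nLabel: Non-Relevant\n\n"
        f"Abstract: {abstract}\nLabel:"
    )

def review_sample(
    votes_by_abstract: Dict[str, List[str]], n_per_group: int, seed: int = 0
) -> Dict[str, List[str]]:
    """Sample abstract IDs for manual review from unanimous and split votes.

    Assumption: unanimous votes stand in for 'high confidence' and split
    votes for 'low confidence'.
    """
    unanimous = sorted(k for k, v in votes_by_abstract.items() if len(set(v)) == 1)
    split = sorted(k for k, v in votes_by_abstract.items() if len(set(v)) > 1)
    rng = random.Random(seed)  # fixed seed so the review set is repeatable
    return {
        "high_confidence": rng.sample(unanimous, min(n_per_group, len(unanimous))),
        "low_confidence": rng.sample(split, min(n_per_group, len(split))),
    }
```

Calling `review_sample(votes, n_per_group=100)` would give up to 100 IDs per group for the inspection step.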

Agent Features

Memory: short-term context window (prompt + abstract)
Tool Use: prompt-driven classification
Frameworks: single-model evaluation agent per SDG; proposed multi-agent ensemble
Is Agentic: Yes
Architectures: instruction-tuned decoder-only LLMs
Collaboration: ensemble / multi-agent conversation (proposed)

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Unknown
License: Unknown

Risks & Boundaries

Limitations

Prompt sensitivity: phrasing can change outcomes and reduce generalizability

Evaluation used abstracts, not full text, so relevance signals that appear only in the body of a paper may be missed

When Not To Use

When full-text context is required for accurate relevance judgment

For formal institutional reporting before cross-validation and human review

Failure Modes

Classifying superficial mentions as substantive (false positives)

Overly strict models that miss indirect but real contributions (false negatives)
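
Both failure modes can be quantified once a small human-reviewed sample exists. The sketch below assumes such a sample is available (the summary reports none) and computes precision and recall for the 'Relevant' label; the function name and toy labels are hypothetical.

```python
from typing import List, Tuple

def precision_recall(predicted: List[str], human: List[str]) -> Tuple[float, float]:
    """Precision and recall of 'Relevant' predictions against human labels."""
    if len(predicted) != len(human):
        raise ValueError("predicted and human must be the same length")
    pairs = list(zip(predicted, human))
    tp = sum(p == "Relevant" and h == "Relevant" for p, h in pairs)
    fp = sum(p == "Relevant" and h != "Relevant" for p, h in pairs)  # superficial mentions accepted
    fn = sum(p != "Relevant" and h == "Relevant" for p, h in pairs)  # real contributions missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy example: one true positive, one false positive, one false negative.
predicted = ["Relevant", "Relevant", "Non-Relevant", "Non-Relevant"]
human = ["Relevant", "Non-Relevant", "Relevant", "Non-Relevant"]
precision, recall = precision_recall(predicted, human)
```

Low precision points to the first failure mode and low recall to the second.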

Core Entities

Models

Phi-3.5-mini-instruct
Mistral-7B-Instruct-v0.3
Llama-3.2-3B-Instruct

Metrics

Percent labeled Relevant (per model)
Inter-model agreement (Venn overlaps)

Datasets

Scopus abstracts via Elsevier SDG mapping (20,000 abstracts per SDG retrieval sets)

Context Entities

Datasets

Elsevier SDG Research Mapping queries