Use small local LLMs to separate true SDG contributions from incidental keyword mentions

November 26, 2024 · 6 min

Overview

Production Readiness

0.4

Novelty Score

0.5

Cost Impact Score

0.6

Citation Count

1

Authors

William A. Ingram, Bipasha Banerjee, Edward A. Fox

Why It Matters For Business

Universities and research managers can avoid inflated SDG counts from keyword hits and make funding, ranking, and reporting decisions based on substantively relevant work.

Summary TLDR

Keyword searches return many papers that mention SDG terms without making a real contribution. This study retrieves 20,000 Scopus abstracts per SDG and uses small, locally hosted LLMs as evaluation agents to re-classify each abstract as 'Relevant' or 'Non-Relevant' to an SDG target. On SDG 1, the three models differed strongly in selectivity (Phi-3.5: 52% relevant; Mistral-7B: 70%; Llama-3.2: 15%). The authors propose ensembles of complementary models to balance inclusiveness and precision. Main limitations: prompt sensitivity, abstracts-only data, and a focus on SDG 1.

Problem Statement

Keyword-based SDG searches give many false positives because they match words, not substantive contributions. Institutions need a practical method to measure research that actually advances SDG targets rather than just mentioning them.

Main Contribution

Introduce an LLM-driven evaluation agent that classifies abstracts as substantive or superficial for SDG targets.

Apply the agent to a large Scopus collection (20,000 abstracts per SDG using Elsevier SDG queries).

Compare three small, locally hostable LLMs (Phi-3.5-mini, Mistral-7B-v0.3, Llama-3.2-3B) and report differing classification tendencies.

Recommend ensemble / multi-agent approaches to combine inclusive and strict classifiers for better precision and recall balance.
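The ensemble idea can be illustrated with a minimal sketch. The summary does not say how the authors would combine models, so the three voting rules below ("any", "majority", "all") and the function name `combine_labels` are assumptions chosen for illustration, not the paper's method.

```python
from typing import Dict


def combine_labels(votes: Dict[str, bool], rule: str = "majority") -> bool:
    """Combine per-model votes (True = Relevant, False = Non-Relevant).

    The rules are illustrative assumptions, not taken from the paper.
    """
    if not votes:
        raise ValueError("need at least one vote")
    n_relevant = sum(votes.values())
    if rule == "any":       # inclusive: one Relevant vote is enough
        return n_relevant >= 1
    if rule == "all":       # strict: every model must say Relevant
        return n_relevant == len(votes)
    if rule == "majority":  # more than half of the models say Relevant
        return n_relevant * 2 > len(votes)
    raise ValueError(f"unknown rule: {rule}")


# Hypothetical votes for a single abstract (made-up values).
votes = {"phi-3.5-mini": True, "mistral-7b": True, "llama-3.2-3b": False}
for rule in ("any", "majority", "all"):
    print(rule, combine_labels(votes, rule))
```

The "any" rule leans toward recall and the "all" rule toward precision, which is one simple way to pair an inclusive classifier with a strict one.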

Key Findings

Small local LLMs can distinguish substantive SDG contributions from superficial mentions in abstracts.

Model selectivity varied strongly on SDG 1: Phi-3.5-mini labeled 52% relevant, Mistral-7B labeled 70% relevant, Llama-3.2 labeled 15% relevant.

Numbers: Phi-3.5: 52% relevant; Mistral-7B: 70% relevant; Llama-3.2: 15% relevant

Inter-model agreement is low for 'Relevant' labels but higher for 'Non-Relevant' labels.

System-level limits include prompt sensitivity, use of abstracts instead of full text, and primary focus on SDG 1.

Results

Percent of abstracts labeled Relevant (SDG 1)

Value: Phi-3.5: 52% | Mistral-7B: 70% | Llama-3.2: 15%

Baseline: Keyword-based retrieval (implicit baseline: all retrieved abstracts)

Inter-model agreement patterns

Value: Low overlap on 'Relevant' labels; higher alignment on 'Non-Relevant'
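One way to quantify inter-model agreement on your own data is a pairwise overlap score computed separately for each label. The sketch below uses Jaccard similarity, which is an assumption on our part: the study reports Venn overlaps, not this specific measure. The model names and ID sets are made-up toy data.

```python
from itertools import combinations
from typing import Dict, Set, Tuple


def jaccard(a: Set[int], b: Set[int]) -> float:
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def pairwise_agreement(
    relevant: Dict[str, Set[int]], all_ids: Set[int]
) -> Dict[Tuple[str, str], Tuple[float, float]]:
    """For each model pair, return (Relevant overlap, Non-Relevant overlap).

    `relevant` maps a model name to the abstract IDs it labeled Relevant;
    everything else in `all_ids` counts as Non-Relevant for that model.
    """
    scores = {}
    for m1, m2 in combinations(sorted(relevant), 2):
        rel = jaccard(relevant[m1], relevant[m2])
        non = jaccard(all_ids - relevant[m1], all_ids - relevant[m2])
        scores[(m1, m2)] = (rel, non)
    return scores


# Toy data: ten abstract IDs and three hypothetical label sets.
all_ids = set(range(10))
relevant = {
    "phi": {0, 1, 2, 3, 4},
    "mistral": {0, 1, 2, 3, 4, 5, 6},
    "llama": {0, 1},
}
result = pairwise_agreement(relevant, all_ids)
for pair, (rel, non) in result.items():
    print(pair, f"relevant={rel:.2f}", f"non_relevant={non:.2f}")
```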

Who Should Care

What To Try In 7 Days

Run a local small LLM over a keyword-retrieved set and compare model 'Relevant' rates.

Create a prompt listing SDG target criteria and two short example abstracts (relevant / non-relevant).

Inspect 100 high- and low-confidence classifications and adjust prompt wording or thresholds.
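The steps above can be sketched as a small prompt-and-parse harness. Everything here is a hypothetical illustration: the prompt wording, the helper names, the example abstracts, and `fake_generate`, which stands in for a call to whichever local model you run. The paper's actual prompt is not reproduced in this summary.

```python
import re
from typing import List, Tuple


def build_prompt(target: str, criteria: List[str],
                 examples: List[Tuple[str, str]], abstract: str) -> str:
    """Assemble a few-shot classification prompt (wording is an assumption)."""
    lines = [f"Does the abstract substantively contribute to: {target}?",
             "Criteria:"]
    lines += [f"- {c}" for c in criteria]
    for text, label in examples:
        lines += ["", f"Abstract: {text}", f"Label: {label}"]
    lines += ["", f"Abstract: {abstract}", "Label:"]
    return "\n".join(lines)


def parse_label(response: str) -> str:
    """Map free-text model output to a label; check negatives first."""
    text = response.strip().lower()
    if re.search(r"\bnon[- ]?relevant\b|\bnot relevant\b|\birrelevant\b", text):
        return "Non-Relevant"
    if re.search(r"\brelevant\b", text):
        return "Relevant"
    return "Unparsed"  # route these to manual review


def relevant_rate(labels: List[str]) -> float:
    """Share of abstracts labeled Relevant."""
    return sum(l == "Relevant" for l in labels) / len(labels) if labels else 0.0


def fake_generate(prompt: str) -> str:
    """Stand-in for a local LLM call; looks only at the final abstract."""
    last = prompt.rsplit("Abstract:", 1)[-1].lower()
    return "Relevant" if "cash transfer" in last else "Non-Relevant"


# Made-up criteria, examples, and abstracts for illustration only.
criteria = ["Studies poverty reduction as a primary aim"]
examples = [
    ("We evaluate a cash transfer program's effect on poverty.", "Relevant"),
    ("Our alloy shows poor fatigue resistance.", "Non-Relevant"),
]
abstracts = [
    "A randomized trial of cash transfer schemes in rural districts.",
    "Poverty of the stimulus arguments in language acquisition.",
]
labels = [
    parse_label(fake_generate(build_prompt("SDG 1", criteria, examples, a)))
    for a in abstracts
]
print(labels, relevant_rate(labels))
```

Swapping `fake_generate` for a real model call and running the same abstracts through each model gives per-model 'Relevant' rates to compare; abstracts that come back as "Unparsed" are natural candidates for the manual inspection step.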

Agent Features

Memory

  • short-term context window (prompt + abstract)

Tool Use

  • prompt-driven classification

Frameworks

  • single-model evaluation agent per SDG; proposed multi-agent ensemble

Is Agentic

true

Architectures

  • instruction-tuned decoder-only LLMs

Collaboration

  • ensemble / multi-agent conversation (proposed)

Reproducibility

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Prompt sensitivity: phrasing can change outcomes and reduce generalizability
  • Evaluation used abstracts not full text, so some relevance signals may be missing
  • Study focuses mainly on SDG 1; cross-SDG behavior is not demonstrated

When Not To Use

  • When full-text context is required for accurate relevance judgment
  • For formal institutional reporting before cross-validation and human review
  • When consistent, auditable criteria are legally or procedurally required

Failure Modes

  • Classifying superficial mentions as substantive (false positives)
  • Overly strict models that miss indirect but real contributions (false negatives)
  • Different model thresholds producing inconsistent institutional counts

Core Entities

Models

  • Phi-3.5-mini-instruct
  • Mistral-7B-Instruct-v0.3
  • Llama-3.2-3B-Instruct

Metrics

  • Percent labeled Relevant (per model)
  • Inter-model agreement (Venn overlaps)

Datasets

  • Scopus abstracts via Elsevier SDG mapping (retrieval sets of 20,000 abstracts per SDG)

Context Entities

Datasets

  • Elsevier SDG Research Mapping queries