Overview
Production Readiness
0.4
Novelty Score
0.5
Cost Impact Score
0.6
Citation Count
1
Why It Matters For Business
Universities and research managers can avoid inflated SDG counts from keyword hits and make funding, ranking, and reporting decisions based on substantively relevant work.
Summary TLDR
Keyword searches return many papers that mention SDG terms without making a real contribution. This study retrieves 20,000 Scopus abstracts per SDG and uses small, locally hosted LLMs as evaluation agents to re-classify each abstract as 'Relevant' or 'Non-Relevant' to an SDG target. On SDG 1, the three models differed strongly in selectivity (Phi-3.5: 52% relevant; Mistral-7B: 70%; Llama-3.2: 15%). The authors propose ensembles of complementary models to balance inclusiveness and precision. Main limitations: prompt sensitivity, abstracts-only data, and a focus on SDG 1.
Problem Statement
Keyword-based SDG searches give many false positives because they match words, not substantive contributions. Institutions need a practical method to measure research that actually advances SDG targets rather than just mentioning them.
Main Contribution
Introduce an LLM-driven evaluation agent that classifies abstracts as substantive or superficial for SDG targets.
Apply the agent to a large Scopus collection (20,000 abstracts per SDG using Elsevier SDG queries).
Compare three small, locally hostable LLMs (Phi-3.5-mini, Mistral-7B-v0.3, Llama-3.2-3B) and report differing classification tendencies.
Recommend ensemble / multi-agent approaches that combine inclusive and strict classifiers for a better balance of precision and recall.
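The ensemble idea in the last item can be sketched as a simple voting rule. This is a minimal illustration only: the summary does not say how the authors would combine models, so the three rules (`any`, `majority`, `all`) and the function name `ensemble_label` are assumptions, not the paper's method.

```python
RELEVANT = "Relevant"
NON_RELEVANT = "Non-Relevant"


def ensemble_label(votes, rule="majority"):
    """Combine per-model labels for one abstract into a single label.

    votes: dict mapping a model name to "Relevant" or "Non-Relevant".
    rule:  "any" (inclusive), "majority", or "all" (strict).
           These rules are hypothetical; the source does not specify one.
    """
    if not votes:
        raise ValueError("at least one model vote is required")
    relevant_votes = sum(1 for label in votes.values() if label == RELEVANT)
    # Number of 'Relevant' votes needed under each rule.
    thresholds = {
        "any": 1,
        "majority": len(votes) // 2 + 1,
        "all": len(votes),
    }
    if rule not in thresholds:
        raise ValueError(f"unknown rule: {rule!r}")
    return RELEVANT if relevant_votes >= thresholds[rule] else NON_RELEVANT


# Toy votes for a single abstract (illustrative, not data from the study).
votes = {
    "Phi-3.5-mini": RELEVANT,
    "Mistral-7B": RELEVANT,
    "Llama-3.2": NON_RELEVANT,
}
for rule in ("any", "majority", "all"):
    print(rule, "->", ensemble_label(votes, rule))
```

An `any` rule behaves like the most inclusive model, while `all` behaves like the strictest one; `majority` sits between the two, which is the trade-off the recommendation is aiming at.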
Key Findings
Small local LLMs can distinguish substantive SDG contributions from superficial mentions in abstracts.
Model selectivity varied strongly on SDG 1: Phi-3.5-mini labeled 52% relevant, Mistral-7B labeled 70% relevant, Llama-3.2 labeled 15% relevant.
Inter-model agreement is low for 'Relevant' labels but higher for 'Non-Relevant' labels.
System-level limitations include prompt sensitivity, the use of abstracts instead of full text, and a primary focus on SDG 1.
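The selectivity and agreement findings above could be reproduced on one's own labels roughly as follows. This sketch assumes each model's output is stored as a dict from abstract ID to label. It uses the Jaccard overlap as the agreement measure, which is a stand-in chosen here and not necessarily what the authors computed.

```python
from itertools import combinations


def relevant_rate(labels):
    """Fraction of abstracts a model labeled 'Relevant'."""
    return sum(1 for v in labels.values() if v == "Relevant") / len(labels)


def jaccard_for_label(labels_a, labels_b, label):
    """Jaccard overlap of the abstracts two models gave the same label."""
    set_a = {k for k, v in labels_a.items() if v == label}
    set_b = {k for k, v in labels_b.items() if v == label}
    union = set_a | set_b
    # If neither model used the label, treat them as fully agreeing.
    return len(set_a & set_b) / len(union) if union else 1.0


# Toy labels for four abstracts (illustrative, not data from the study).
model_outputs = {
    "inclusive_model": {1: "Relevant", 2: "Relevant", 3: "Non-Relevant", 4: "Non-Relevant"},
    "strict_model": {1: "Relevant", 2: "Non-Relevant", 3: "Non-Relevant", 4: "Non-Relevant"},
}

for name, labels in model_outputs.items():
    print(name, "relevant rate:", relevant_rate(labels))

for (name_a, a), (name_b, b) in combinations(model_outputs.items(), 2):
    for label in ("Relevant", "Non-Relevant"):
        print(name_a, "vs", name_b, label, jaccard_for_label(a, b, label))
```

In this toy case the overlap on 'Relevant' is 0.5 and the overlap on 'Non-Relevant' is about 0.67, the same direction as the pattern reported for the real models.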
Results
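Percent of abstracts labeled Relevant (SDG 1): Phi-3.5-mini 52%, Mistral-7B 70%, Llama-3.2 15%
Inter-model agreement patterns: low for 'Relevant' labels, higher for 'Non-Relevant' labels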
Who Should Care
What To Try In 7 Days
Run a local small LLM over a keyword-retrieved set and compare model 'Relevant' rates.
Create a prompt listing SDG target criteria and two short example abstracts (relevant / non-relevant).
Inspect 100 high- and low-confidence classifications and adjust prompt wording or thresholds.
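As a starting point for the prompt step above, here is a minimal sketch of a few-shot prompt builder and a reply parser. The wording of the prompt, the criteria, the example abstracts, and the function names `build_prompt` and `parse_label` are all hypothetical; the summary does not reproduce the authors' actual prompt. No model call is shown, since that depends on the local runtime in use.

```python
import re


def build_prompt(target, criteria, examples, abstract):
    """Assemble a classification prompt: criteria, labeled examples, then the query."""
    lines = [
        f"Decide whether the abstract makes a substantive contribution to: {target}",
        "Answer with exactly one label: Relevant or Non-Relevant.",
        "Criteria:",
    ]
    lines += [f"- {c}" for c in criteria]
    for text, label in examples:
        lines += ["", f"Abstract: {text}", f"Label: {label}"]
    lines += ["", f"Abstract: {abstract}", "Label:"]
    return "\n".join(lines)


def parse_label(reply):
    """Map a free-text model reply to a label, or None if no label is found.

    'Non-Relevant' is tried first because it contains the word 'Relevant'.
    Replies such as 'not relevant' are NOT handled and need manual review.
    """
    match = re.search(r"\b(non[-\s]?relevant|relevant)\b", reply, flags=re.IGNORECASE)
    if match is None:
        return None
    return "Non-Relevant" if match.group(1).lower().startswith("non") else "Relevant"


# Placeholder criteria and examples; replace with real SDG target text.
prompt = build_prompt(
    target="SDG 1 (No Poverty)",
    criteria=["Studies a mechanism or intervention that affects poverty"],
    examples=[
        ("We evaluate a cash-transfer program for low-income households.", "Relevant"),
        ("We mention poverty only as background for a materials study.", "Non-Relevant"),
    ],
    abstract="<abstract text to classify>",
)
print(prompt)
```

Replies that parse to `None` are natural candidates for the manual inspection step, since they signal that the model did not follow the requested output format.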
Agent Features
Memory
- short-term context window (prompt + abstract)
Tool Use
- prompt-driven classification
Frameworks
- single-model evaluation agent per SDG; proposed multi-agent ensemble
Is Agentic
true
Architectures
- instruction-tuned decoder-only LLMs
Collaboration
- ensemble / multi-agent conversation (proposed)
Reproducibility
Open Source Status
- unknown
Risks & Boundaries
Limitations
- Prompt sensitivity: phrasing can change outcomes and reduce generalizability
- Evaluation used abstracts, not full text, so some relevance signals may be missing
- Study focuses mainly on SDG 1; cross-SDG behavior is not demonstrated
When Not To Use
- When full-text context is required for accurate relevance judgment
- For formal institutional reporting before cross-validation and human review
- When consistent, auditable criteria are legally or procedurally required
Failure Modes
- Classifying superficial mentions as substantive (false positives)
- Overly strict models that miss indirect but real contributions (false negatives)
- Different model thresholds producing inconsistent institutional counts
Core Entities
Models
- Phi-3.5-mini-instruct
- Mistral-7B-Instruct-v0.3
- Llama-3.2-3B-Instruct
Metrics
- Percent labeled Relevant (per model)
- Inter-model agreement (Venn overlaps)
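The Venn-overlap metric can be tabulated without plotting by counting how many abstracts fall into each exclusive region. The sketch below assumes the 'Relevant' abstract IDs for each model are available as sets; the function name `venn_regions` and the toy IDs are illustrative only.

```python
from collections import Counter


def venn_regions(relevant_sets):
    """Count abstracts in each exclusive Venn region.

    relevant_sets: dict mapping a model name to the set of abstract IDs
                   that model labeled 'Relevant'.
    Returns a Counter keyed by the frozenset of models that agree.
    """
    all_ids = set().union(*relevant_sets.values())
    regions = Counter()
    for abstract_id in all_ids:
        members = frozenset(
            name for name, ids in relevant_sets.items() if abstract_id in ids
        )
        regions[members] += 1
    return regions


# Toy ID sets (illustrative, not data from the study).
relevant_sets = {
    "Phi-3.5-mini": {1, 2, 3},
    "Mistral-7B": {1, 2, 3, 4, 5},
    "Llama-3.2": {1},
}
for members, count in venn_regions(relevant_sets).items():
    print(sorted(members), count)
```

The region keyed by all three model names is the core set every model accepts; regions keyed by a single model show where that model is more inclusive than the others.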
Datasets
- Scopus abstracts via Elsevier SDG mapping (retrieval sets of 20,000 abstracts per SDG)
Context Entities
Datasets
- Elsevier SDG Research Mapping queries

