Overview
Promising prototype: uses accessible local LLMs but evidence is limited to abstracts and a single SDG; expect extra engineering and validation before production use.
Citations: 1
Evidence Strength: 0.60
Confidence: 0.82
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 1/4
Findings with evidence refs: 4/4
Results with explicit delta: 0/2
Reproducibility
Status: No open assets linked
Open source: Unknown
At A Glance
Cost impact: 60%
Production readiness: 40%
Novelty: 50%
Why It Matters For Business
Universities and research managers can avoid inflated SDG counts from keyword hits and make funding, ranking, and reporting decisions based on substantively relevant work.
Summary TLDR
Keyword searches return many papers that mention SDG terms without real contribution. This study retrieves 20,000 Scopus abstracts per SDG and uses small, locally hosted LLMs as evaluation agents to re-classify abstracts as 'Relevant' or 'Non-Relevant' to an SDG target. On SDG 1, three models differed strongly in selectivity (Phi-3.5: 52% relevant; Mistral-7B: 70%; Llama-3.2: 15%). The authors propose ensembles of complementary models to balance inclusiveness and precision. Main limits: prompt sensitivity, abstracts-only data, and focus on SDG 1.
Problem Statement
Keyword-based SDG searches give many false positives because they match words, not substantive contributions. Institutions need a practical method to measure research that actually advances SDG targets rather than just mentioning them.
Main Contribution
Introduce an LLM-driven evaluation agent that classifies abstracts as substantive or superficial for SDG targets.
Apply the agent to a large Scopus collection (20,000 abstracts per SDG using Elsevier SDG queries).
Key Findings
Small local LLMs can distinguish substantive SDG contributions from superficial mentions in abstracts.
Model selectivity varied strongly on SDG 1: Phi-3.5-mini labeled 52% relevant, Mistral-7B labeled 70% relevant, Llama-3.2 labeled 15% relevant.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Percent of abstracts labeled Relevant (SDG 1) | Phi-3.5: 52%; Mistral-7B: 70%; Llama-3.2: 15% | Keyword-based retrieval (implicit baseline: all retrieved abstracts) | — | SDG 1 abstracts from Scopus | III. RESULTS; Fig. 2 | Fig. 2 |
| Inter-model agreement patterns | Low overlap on 'Relevant' labels; higher alignment on 'Non-Relevant' | — | — | SDG 1 | III. RESULTS; Fig. 3 Venn diagrams | Fig. 3 |
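The agreement pattern in the table can be quantified once each model's labels are stored per abstract. The following is a minimal sketch, not the authors' code: it computes Jaccard overlap between the models' Relevant sets and applies a simple vote-threshold ensemble. The function names, the `min_votes` rule, and the toy labels are assumptions made for illustration; the source proposes ensembles but does not specify how they combine votes.

```python
from itertools import combinations

RELEVANT = "Relevant"
NON_RELEVANT = "Non-Relevant"


def relevant_ids(model_labels):
    """Return the set of abstract ids a model labeled Relevant."""
    return {doc_id for doc_id, label in model_labels.items() if label == RELEVANT}


def jaccard(a, b):
    """Jaccard overlap of two sets; defined as 1.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def pairwise_relevant_overlap(labels_by_model):
    """Jaccard overlap of the Relevant sets for every pair of models."""
    sets = {name: relevant_ids(labels) for name, labels in labels_by_model.items()}
    return {
        (m1, m2): jaccard(sets[m1], sets[m2])
        for m1, m2 in combinations(sorted(sets), 2)
    }


def vote_ensemble(labels_by_model, min_votes=2):
    """Label an abstract Relevant when at least min_votes models say so.

    Only abstracts labeled by every model are considered (an assumption).
    """
    shared = set.intersection(*(set(labels) for labels in labels_by_model.values()))
    result = {}
    for doc_id in sorted(shared):
        votes = sum(labels[doc_id] == RELEVANT for labels in labels_by_model.values())
        result[doc_id] = RELEVANT if votes >= min_votes else NON_RELEVANT
    return result


# Toy labels for illustration only; these are not data from the study.
toy_labels = {
    "phi": {"a1": RELEVANT, "a2": RELEVANT, "a3": NON_RELEVANT, "a4": NON_RELEVANT},
    "mistral": {"a1": RELEVANT, "a2": RELEVANT, "a3": RELEVANT, "a4": NON_RELEVANT},
    "llama": {"a1": RELEVANT, "a2": NON_RELEVANT, "a3": NON_RELEVANT, "a4": NON_RELEVANT},
}
print(pairwise_relevant_overlap(toy_labels))
print(vote_ensemble(toy_labels, min_votes=2))
```

Raising `min_votes` makes the ensemble stricter (closer to the most selective model), while `min_votes=1` makes it as inclusive as the union of all models, which is one way to trade inclusiveness against precision.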
What To Try In 7 Days
Run a local small LLM over a keyword-retrieved set and compare model 'Relevant' rates.
Create a prompt listing SDG target criteria and two short example abstracts (relevant / non-relevant).
Inspect 100 high- and low-confidence classifications and adjust prompt wording or thresholds.
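The steps above can be sketched as a small harness. This is an illustrative outline under stated assumptions, not the study's pipeline: `generate` is a hypothetical callable that wraps whichever local model you run and returns its text reply, and the prompt wording, the `Unparsed` fallback label, and all function names are invented for this example.

```python
def build_prompt(target_criteria, examples, abstract):
    """Assemble a few-shot prompt: criteria, labeled examples, then the abstract."""
    lines = [
        "Decide whether the abstract makes a substantive contribution "
        "to the SDG target described below.",
        "Target criteria:",
        target_criteria,
        "",
    ]
    for text, label in examples:
        lines += [f"Example abstract: {text}", f"Label: {label}", ""]
    lines += [
        f"Abstract to classify: {abstract}",
        "Answer with one label: Relevant or Non-Relevant.",
    ]
    return "\n".join(lines)


def parse_label(reply):
    """Map a free-text model reply to Relevant, Non-Relevant, or Unparsed."""
    text = reply.strip().lower()
    # Check negative forms first, since they all contain the word "relevant".
    negatives = ("non-relevant", "nonrelevant", "not relevant", "irrelevant")
    if any(neg in text for neg in negatives):
        return "Non-Relevant"
    if "relevant" in text:
        return "Relevant"
    return "Unparsed"


def classify_all(abstracts, generate, target_criteria, examples):
    """Run the hypothetical generate callable over every abstract."""
    return [
        parse_label(generate(build_prompt(target_criteria, examples, abstract)))
        for abstract in abstracts
    ]


def relevant_rate(labels):
    """Share of labels equal to Relevant; 0.0 for an empty list."""
    labels = list(labels)
    if not labels:
        return 0.0
    return sum(label == "Relevant" for label in labels) / len(labels)
```

Running `classify_all` once per model and comparing `relevant_rate` across models gives the kind of selectivity comparison reported for SDG 1. Tracking the share of `Unparsed` replies is a cheap way to spot prompt wording that a given model does not follow.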
Agent Features
Is Agentic: Yes
Risks & Boundaries
Limitations
Prompt sensitivity: phrasing can change outcomes and reduce generalizability
Evaluation used abstracts, not full text, so some relevance signals may be missing
When Not To Use
When full-text context is required for accurate relevance judgment
For formal institutional reporting before cross-validation and human review
Failure Modes
Classifying superficial mentions as substantive (false positives)
Overly strict models that miss indirect but real contributions (false negatives)
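Both failure modes can be measured once a human-reviewed sample exists. The sketch below is an assumption-laden illustration rather than anything reported in the source: it compares model labels against hypothetical human labels, counts false positives and false negatives, and derives precision and recall for the Relevant class.

```python
def confusion_counts(model_labels, human_labels):
    """Count tp/fp/fn/tn for the Relevant class on abstracts with a human label."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for doc_id, human in human_labels.items():
        if doc_id not in model_labels:
            continue  # skip abstracts the model never labeled
        model_says = model_labels[doc_id] == "Relevant"
        human_says = human == "Relevant"
        if model_says and human_says:
            counts["tp"] += 1
        elif model_says:
            counts["fp"] += 1  # superficial mention classified as substantive
        elif human_says:
            counts["fn"] += 1  # real contribution missed by a strict model
        else:
            counts["tn"] += 1
    return counts


def precision_recall(counts):
    """Precision and recall for Relevant; 0.0 when a denominator is zero."""
    predicted = counts["tp"] + counts["fp"]
    actual = counts["tp"] + counts["fn"]
    precision = counts["tp"] / predicted if predicted else 0.0
    recall = counts["tp"] / actual if actual else 0.0
    return precision, recall
```

A permissive model would be expected to show lower precision (more false positives) and a strict one lower recall (more false negatives), so reporting both per model makes the trade-off explicit before any labels feed into institutional reporting.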

