A hierarchical multi-agent ESG analyst plus a 3-level benchmark built from 310 corporate sustainability reports

January 13, 2026 (8 min read)

Overview

Decision Snapshot: Needs Validation

The system is practical for enterprise ESG workflows because it combines retrieval, domain tools, and iterative planning. However, it depends on web access, closed-source model components, and LLM judges whose biases must be managed.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 2/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 50%

Authors

Yilei Zhao, Wentao Zhang, Lei Xiao, Yandan Zheng, Mengpu Liu, Wei Yang Bryan Lim

Why It Matters For Business

Automating professional-grade ESG audits requires retrieval, web research, and domain tools; a specialized agent produces more verifiable and visualization-rich reports than off-the-shelf LLMs, improving auditability and decision support.

Summary TLDR

This paper introduces ESGAgent, a hierarchical multi-agent system with domain tools (retriever, web browser, Python interpreter, plotter, report generator) for professional sustainability audits. The authors also release a three-level ESG Benchmark built from 310 Dow Jones company sustainability reports (2010–2024). ESGAgent beats strong closed-source LLM baselines on factoid and compositional tasks (84.15% overall accuracy on Levels 1–2) and produces more chart- and citation-rich Level-3 reports; ablations show external web search and retrieval are key to its gains. The benchmark evaluates correctness, faithfulness, analysis depth, and presentation quality using a multi-judge LLM ensemble.

Problem Statement

ESG analysis needs deep, cross-document, multimodal reasoning and quantitative checks. The underlying data is fragmented, and standard LLMs and static benchmarks capture neither the multi-step workflows experts follow nor the report-level rigor that professional sustainability auditing requires.

Main Contribution

ESGAgent: a hierarchical multi-agent system with a domain toolset for professional ESG analysis (retriever, deep researcher/browser, python interpreter, plotter, reporter).

ESG Benchmark: a curated three-level benchmark from 310 DJIA sustainability reports (2010–2024) that spans atomic QA to open-ended analytical report generation.

Key Findings

ESGAgent achieves 84.15% overall accuracy on Level 1–2 tasks, outperforming Gemini-3-flash (80.89%).

Numbers: Total Acc 84.15% vs 80.89% (Table 3)

Practical Use: If you need higher factoid/compositional accuracy on ESG QA, integrate a domain-aware multi-agent pipeline plus retrieval and web search rather than relying on a single general LLM.

Evidence Ref: Table 3

Removing the deep research (web search) module drops Level-2 accuracy from 77.19% to 65.79%.

Numbers: Level-2 acc: 77.19% -> 65.79% without deep research (Table 3)

Practical Use: For compositional ESG tasks that need current or external evidence, include an external search tool; internal corpora alone are often insufficient.

Evidence Ref: Table 3 (ablation rows)

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | 90.15% | - | - | Level 1 tasks (119/132 correct) | Table 3 reports Level-1 acc 90.15%, which is consistent with 119 correct out of 132 | Table 3 |
| Accuracy | 77.19% | - | - | Level 2 tasks (88 correct) | Table 3 reports Level-2 acc 77.19% (88 correct) | Table 3 |
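The Level 1, Level 2, and overall accuracies reported above can be cross-checked with simple arithmetic. The sketch below is only a sanity check under stated assumptions: the Level-1 total of 132 comes from the evidence note, and the Level-2 total of 114 is inferred from 88 correct at 77.19% rather than stated anywhere in this summary. The helper name `accuracy_pct` is illustrative.

```python
def accuracy_pct(correct: int, total: int) -> float:
    """Return accuracy as a percentage rounded to two decimals."""
    return round(100.0 * correct / total, 2)

# Assumed counts: 119/132 follows the Table 3 evidence note;
# the Level-2 total of 114 is inferred from 88 correct at 77.19%.
LEVEL1 = (119, 132)
LEVEL2 = (88, 114)

level1_acc = accuracy_pct(*LEVEL1)
level2_acc = accuracy_pct(*LEVEL2)
overall_acc = accuracy_pct(LEVEL1[0] + LEVEL2[0], LEVEL1[1] + LEVEL2[1])

# Ablation drop in percentage points when deep research is removed.
ablation_drop = round(77.19 - 65.79, 2)

print(level1_acc, level2_acc, overall_acc, ablation_drop)
```

Under these assumed counts the pooled figure is 207 correct out of 246, which reproduces the reported 84.15% overall accuracy, and the ablation drop is 11.4 percentage points.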

What To Try In 7 Days

Index your target company's sustainability reports into a vector DB and run a small RAG pipeline to surface evidence for 5 factual ESG questions.

Prototype a simple planner that splits an audit task into: extract disclosures, compute one KPI (e.g., WACI), and render a chart.

Run an ablation test: compare outputs with and without external web search to see the real-time evidence gap.
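The planner prototype suggested above can start as a plain three-step pipeline. The following is a minimal sketch, not the paper's implementation: the names `Holding`, `extract_disclosures`, `waci`, `render_chart`, and `run_audit` are hypothetical, the figures are invented for illustration, and the chart step is a text bar so the example needs no plotting library. WACI is computed here with the common definition, the portfolio-weighted sum of emissions per unit of revenue.

```python
from dataclasses import dataclass

@dataclass
class Holding:
    name: str
    weight: float           # portfolio weight; weights should sum to 1
    emissions_tco2e: float  # Scope 1+2 emissions in tonnes CO2e
    revenue_musd: float     # revenue in millions of USD

def extract_disclosures() -> list[Holding]:
    # Placeholder for a retrieval step; these figures are made up.
    return [
        Holding("CompanyA", 0.6, 1000.0, 500.0),
        Holding("CompanyB", 0.4, 300.0, 100.0),
    ]

def waci(holdings: list[Holding]) -> float:
    """Weighted average carbon intensity in tCO2e per $M revenue."""
    total_weight = sum(h.weight for h in holdings)
    if abs(total_weight - 1.0) > 1e-6:
        raise ValueError("portfolio weights must sum to 1")
    return sum(h.weight * h.emissions_tco2e / h.revenue_musd for h in holdings)

def render_chart(holdings: list[Holding]) -> str:
    # Text stand-in for a plot: one bar per holding, scaled by intensity.
    lines = []
    for h in holdings:
        intensity = h.emissions_tco2e / h.revenue_musd
        lines.append(f"{h.name:<10} {'#' * round(intensity * 5)} {intensity:.2f}")
    return "\n".join(lines)

def run_audit() -> dict:
    # Step 1: extract, step 2: compute the KPI, step 3: render.
    holdings = extract_disclosures()
    return {"waci": waci(holdings), "chart": render_chart(holdings)}

result = run_audit()
print(result["waci"])
print(result["chart"])
```

Keeping each step a separate function makes it easy to swap the placeholder extraction for a real retrieval call later without touching the KPI logic.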

Agent Features

Memory
RAG with LightRAG index; knowledge graph augmentation; centralized general memory for orchestration; tool-level local memory
Planning
top-level planner decomposes user queries into sub-tasks; iterative re-execution and refinement loop for unsatisfied sub-tasks
Tool Use
retriever over local vector DB; web deep researcher/browser; Python interpreter for calculations; plotter for visualizations; report tool for final assembly
Frameworks
LightRAG; Knowledge Graph reasoning
Is Agentic

Yes

Architectures
hierarchical multi-agent
Collaboration
specialized sub-agents for domain tasks; coordinator/orchestrator that synthesizes observations
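The planning and collaboration features listed above follow a familiar orchestration pattern: decompose the query, route each sub-task to a tool, retry sub-tasks whose result is unsatisfactory, and collect observations in shared memory. The sketch below illustrates that pattern only. Every name in it (`plan`, `satisfied`, `orchestrate`, the stub tools) is hypothetical, and nothing here is taken from ESGAgent's code.

```python
from typing import Callable

# Stub tools standing in for a retriever and an interpreter (hypothetical).
def retriever(task: str, attempt: int) -> str:
    # Pretend the first retrieval attempt finds nothing useful.
    return "" if attempt == 0 else f"evidence for '{task}'"

def python_tool(task: str, attempt: int) -> str:
    return f"computed '{task}'"

TOOLS: dict[str, Callable[[str, int], str]] = {
    "retrieve": retriever,
    "compute": python_tool,
}

def plan(query: str) -> list[tuple[str, str]]:
    """Toy planner: a fixed decomposition into (tool, sub-task) pairs."""
    return [
        ("retrieve", f"disclosures for {query}"),
        ("compute", f"KPI for {query}"),
    ]

def satisfied(observation: str) -> bool:
    # Toy acceptance check; a real system might ask an LLM to judge this.
    return bool(observation)

def orchestrate(query: str, max_retries: int = 2) -> dict[str, str]:
    memory: dict[str, str] = {}  # centralized memory of observations
    for tool_name, sub_task in plan(query):
        for attempt in range(max_retries + 1):
            observation = TOOLS[tool_name](sub_task, attempt)
            if satisfied(observation):
                memory[sub_task] = observation
                break
        else:
            # Every attempt failed; record that instead of guessing.
            memory[sub_task] = "UNRESOLVED"
    return memory

memory = orchestrate("Scope 1 emissions")
print(memory)
```

Marking a sub-task as unresolved, rather than passing an empty observation downstream, gives the final report step an explicit signal that evidence is missing.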

Optimization Features

Token Efficiency
task-dependent token budgets (Level 1 ~5k tokens, Level 2 ~25k, Level 3 ~100k)
System Optimization
hierarchical decomposition to reduce end-to-end cost
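Task-dependent budgets like those above are straightforward to express as configuration plus a small meter. This summary does not say how the authors enforce them, so the sketch below is an assumption throughout: `TOKEN_BUDGETS`, `TokenMeter`, and `BudgetExceeded` are hypothetical names, and the limits are the approximate figures quoted above.

```python
# Approximate per-level budgets quoted in the summary; enforcement is assumed.
TOKEN_BUDGETS = {1: 5_000, 2: 25_000, 3: 100_000}

class BudgetExceeded(RuntimeError):
    """Raised when a task would spend more tokens than its level allows."""

class TokenMeter:
    def __init__(self, level: int):
        self.limit = TOKEN_BUDGETS[level]
        self.used = 0

    def charge(self, tokens: int) -> int:
        """Record usage and return the remaining budget."""
        if self.used + tokens > self.limit:
            raise BudgetExceeded(
                f"need {tokens} tokens, only {self.limit - self.used} left"
            )
        self.used += tokens
        return self.limit - self.used

meter = TokenMeter(level=1)
remaining = meter.charge(3_000)
print(remaining)
```

A hard failure on overspend is one design choice; a softer variant could truncate context or switch to a cheaper model instead.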

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Benchmark is built from 310 DJIA reports, so domain coverage skews to large-cap firms and may not reflect SMEs or other sectors.

Evaluation relies on LLM judges with varying strictness, which introduces evaluator bias and variance.
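One common way to dampen differences in judge strictness is to standardize each judge's scores before averaging them. The sketch below shows that idea with invented scores. It is an assumption about how such bias could be managed, not the aggregation the paper uses, and the function names `zscores` and `ensemble` are illustrative.

```python
from statistics import mean, pstdev

def zscores(scores: list[float]) -> list[float]:
    """Standardize one judge's scores to zero mean and unit variance."""
    mu, sigma = mean(scores), pstdev(scores)
    if sigma == 0:
        # A judge that gives identical scores carries no ranking signal.
        return [0.0 for _ in scores]
    return [(s - mu) / sigma for s in scores]

def ensemble(scores_by_judge: dict[str, list[float]]) -> list[float]:
    """Average the per-judge standardized scores for each report."""
    normalized = [zscores(s) for s in scores_by_judge.values()]
    return [mean(column) for column in zip(*normalized)]

# Invented scores for three reports from a strict and a lenient judge.
scores = {
    "strict_judge": [2.0, 3.0, 4.0],
    "lenient_judge": [7.0, 8.0, 9.0],
}
combined = ensemble(scores)
print(combined)
```

In this toy case the two judges differ by a constant five points yet rank the reports identically, so after standardization they contribute the same signal and the lenient judge no longer dominates the average.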

When Not To Use

When web access or external search is unavailable; Level-2/3 performance depends on external evidence.

For non-ESG domains without retooling; the toolset and benchmarks are tailored to sustainability reporting.

Failure Modes

Hallucinated claims with plausible but unsupported citations if retrieval is weak.

Citation reuse patterns that artificially inflate correctness metrics (authors note reuse can bias scores).
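The citation-reuse effect can be illustrated with a toy calculation. The sketch below is hypothetical: the function `citation_correctness`, the data, and the deduplication rule (a source counts as correct only if every citation of it is correct) are assumptions made for illustration, not the benchmark's scoring method.

```python
def citation_correctness(
    citations: list[tuple[str, bool]], dedupe: bool = False
) -> float:
    """Fraction of correct citations, given (source_id, is_correct) pairs."""
    if not citations:
        return 0.0
    if dedupe:
        per_source: dict[str, bool] = {}
        for source_id, ok in citations:
            # A source stays correct only if all of its citations are correct.
            per_source[source_id] = per_source.get(source_id, True) and ok
        values = list(per_source.values())
    else:
        values = [ok for _, ok in citations]
    return sum(values) / len(values)

# Invented example: one verified source cited four times, one bad source once.
citations = [("r1", True)] * 4 + [("r2", False)]

raw_score = citation_correctness(citations)
deduped_score = citation_correctness(citations, dedupe=True)
print(raw_score, deduped_score)
```

Here the raw score is 0.8 while the per-source score is 0.5, which shows how repeating one verified source can lift a per-citation metric.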

Core Entities

Models

ESGAgent; Gemini-3-flash; GPT-5.2; GPT-5.1; GPT-5; GPT-4.1; Deepseek-r1; Grok deep research; Perplexity deep research

Metrics

Accuracy; Citation Correctness; Citation Faithfulness; Analysis Effectiveness (richness, completeness, depth); Presentation Quality (coherence, professionalism, chart expressiveness); Report statistics (#pages, #words, #charts, #refs, #citations)

Datasets

ESG Benchmark: 310 DJIA sustainability reports (2010-2024)

Benchmarks

ESG Benchmark (3-level: Level 1 atomic QA, Level 2 compositional, Level 3 analytical reports)