A hierarchical multi-agent ESG analyst plus a 3-level benchmark built from 310 corporate sustainability reports

January 13, 2026 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.5

Cost Impact Score

0.6

Citation Count

0

Authors

Yilei Zhao, Wentao Zhang, Lei Xiao, Yandan Zheng, Mengpu Liu, Wei Yang Bryan Lim

Links

Abstract / PDF

Why It Matters For Business

Automating professional-grade ESG audits requires retrieval, web research, and domain tools; a specialized agent produces more verifiable and visualization-rich reports than off-the-shelf LLMs, improving auditability and decision support.

Summary TLDR

This paper introduces ESGAgent, a hierarchical multi-agent system with domain tools (retriever, web browser, Python interpreter, plotter, report generator) for professional sustainability audits. The authors also release a three-level ESG Benchmark built from 310 Dow Jones company sustainability reports (2010–2024). ESGAgent beats strong closed-source LLM baselines on factoid and compositional tasks (84.15% overall accuracy on Levels 1–2) and produces more chart- and citation-rich Level-3 reports; ablations show external web search and retrieval are key to its gains. The benchmark evaluates correctness, faithfulness, analysis depth, and presentation quality using a multi-judge LLM ensemble.

Problem Statement

ESG analysis needs deep cross-document, multimodal reasoning and quantitative checks, but the underlying data is fragmented. Standard LLMs and static benchmarks do not capture the expert multi-step workflows and report-level rigor that professional sustainability auditing requires.

Main Contribution

ESGAgent: a hierarchical multi-agent system with a domain toolset for professional ESG analysis (retriever, deep researcher/browser, python interpreter, plotter, reporter).

ESG Benchmark: a curated three-level benchmark from 310 DJIA sustainability reports (2010–2024) that spans atomic QA to open-ended analytical report generation.

Empirical evaluation showing that ESGAgent outperforms multiple closed-source LLM baselines on Levels 1–3, plus ablations that quantify the value of retrieval and web research.

Key Findings

ESGAgent achieves 84.15% overall accuracy on Level 1–2 tasks, outperforming Gemini-3-flash (80.89%).

Numbers: Total Acc 84.15% vs 80.89% (Table 3)

Removing the deep research (web search) module drops Level-2 accuracy from 77.19% to 65.79%.

Numbers: Level-2 acc 77.19% -> 65.79% without deep research (Table 3)

ESGAgent produces more actionable reports, as measured by charts and citations: on average 3.5 charts and 38.3 citation insertions per report.

Numbers: 3.5 charts; 38.3 citation insertions (Table 5)

On Level-3 factual consistency, ESGAgent scores 0.930 on citation correctness and 0.805 on citation faithfulness, contributing to the strongest multi-dimensional average among the systems evaluated.

Numbers: Corr. 0.930; Faith. 0.805; Avg. score 8.096 (Table 4)
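
To make the ablation finding above concrete, the following minimal sketch computes the absolute and relative drop in Level-2 accuracy when deep research is removed. Only the two accuracy figures come from the summary; the function name `accuracy_drop` is a hypothetical helper introduced for illustration.

```python
def accuracy_drop(full: float, ablated: float) -> tuple[float, float]:
    """Return (absolute drop in percentage points, relative drop in percent)."""
    absolute = full - ablated
    relative = 100.0 * absolute / full
    return absolute, relative


# Level-2 accuracy with and without deep research, as reported above (Table 3).
abs_drop, rel_drop = accuracy_drop(77.19, 65.79)
print(f"{abs_drop:.2f} percentage points ({rel_drop:.1f}% relative)")
```

Reporting both numbers avoids a common ambiguity: a change of 11.40 percentage points is not the same as an 11.40% relative change.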

Results

Accuracy (Level 1)

Value: 90.15%

Accuracy (Level 2)

Value: 77.19%

Accuracy (overall, Levels 1–2)

Value: 84.15%

Baseline: Gemini-3-flash 80.89%

Level-3 factual consistency (citation correctness / faithfulness)

Value: 0.930 / 0.805

Report richness (#charts, #citations) average

Value: 3.5 charts; 38.3 citation insertions

Baseline: Gemini-3-pro-DR 2.83 charts; 24.3 citations

Who Should Care

What To Try In 7 Days

Index your target company's sustainability reports into a vector DB and run a small RAG pipeline to surface evidence for 5 factual ESG questions.

Prototype a simple planner that splits an audit task into: extract disclosures, compute one KPI (e.g., WACI), and render a chart.

Run an ablation test: compare outputs with and without external web search to see the real-time evidence gap.
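
The planner prototype described above could start as small as the sketch below: one function per step (extract, compute, render). This is an illustration under stated assumptions, not the paper's implementation. All figures and company names are made up, the `Holding` type and function names are hypothetical, and the chart is rendered as text to keep the example dependency-free. WACI (weighted average carbon intensity) is a portfolio-level metric: the sum over holdings of portfolio weight times emissions per unit of revenue.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Holding:
    name: str
    weight: float           # portfolio weight; weights should sum to 1
    emissions_tco2e: float  # Scope 1 + 2 emissions, tonnes CO2e
    revenue_musd: float     # revenue in millions of USD


def extract_disclosures() -> List[Holding]:
    # Step 1 (hypothetical): in practice these values would come from retrieval
    # over indexed sustainability reports. The numbers here are invented.
    return [
        Holding("Company A", 0.6, 1200.0, 400.0),
        Holding("Company B", 0.4, 500.0, 100.0),
    ]


def carbon_intensity(h: Holding) -> float:
    return h.emissions_tco2e / h.revenue_musd


def waci(holdings: List[Holding]) -> float:
    # Step 2: WACI = sum_i weight_i * (emissions_i / revenue_i)
    total_weight = sum(h.weight for h in holdings)
    if abs(total_weight - 1.0) > 1e-9:
        raise ValueError("portfolio weights must sum to 1")
    return sum(h.weight * carbon_intensity(h) for h in holdings)


def render_text_chart(holdings: List[Holding], width: int = 20) -> str:
    # Step 3: a text bar chart of per-holding carbon intensity.
    peak = max(carbon_intensity(h) for h in holdings)
    lines = []
    for h in holdings:
        bar = "#" * round(width * carbon_intensity(h) / peak)
        lines.append(f"{h.name:<10} {bar} {carbon_intensity(h):.2f}")
    return "\n".join(lines)


holdings = extract_disclosures()
waci_value = waci(holdings)
chart = render_text_chart(holdings)
print(f"WACI: {waci_value:.2f} tCO2e per $M revenue")
print(chart)
```

Keeping each step a plain function makes it easy to swap the hard-coded extraction for a real retriever later without touching the KPI or chart code.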

Agent Features

Memory

  • RAG with LightRAG index
  • knowledge graph augmentation
  • centralized general memory for orchestration
  • tool-level local memory
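
As a rough illustration of the retrieval idea behind the memory layer, the sketch below ranks passages against a question using bag-of-words cosine similarity. It deliberately does not use LightRAG or a knowledge graph; it is a stand-in technique chosen to stay self-contained. The passages and function names are hypothetical.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def tokenize(text: str) -> List[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, passages: List[str], k: int = 2) -> List[Tuple[float, str]]:
    """Return the k passages most similar to the query, highest score first."""
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(p))), p) for p in passages]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Invented example passages, standing in for chunks of a sustainability report.
passages = [
    "Scope 1 emissions fell to 1.2 million tonnes CO2e in 2023.",
    "The board has 40 percent women directors.",
    "Water withdrawal decreased by 8 percent year over year.",
]
top = retrieve("What were Scope 1 emissions in 2023?", passages, k=1)
print(top[0][1])
```

A production retriever would replace the term-count vectors with learned embeddings and an index, but the rank-then-take-top-k shape stays the same.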

Planning

  • top-level planner decomposes user queries into sub-tasks
  • iterative re-execution and refinement loop for unsatisfied sub-tasks
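
The decompose-then-refine pattern listed above can be sketched as a small loop. This is a minimal sketch under assumptions: the sub-task names, the fixed decomposition, the satisfaction check, and the retry cap are all hypothetical, and dependencies between sub-tasks are ignored for brevity. The paper's planner is not reproduced here.

```python
from typing import Callable, Dict, List, Optional, Tuple


def decompose(query: str) -> List[str]:
    # Hypothetical fixed decomposition; a real planner would likely use an LLM.
    return ["extract_disclosures", "compute_kpi", "render_chart"]


def run_with_refinement(
    subtasks: List[str],
    execute: Callable[[str, int], Optional[str]],
    is_satisfied: Callable[[Optional[str]], bool],
    max_rounds: int = 3,
) -> Tuple[Dict[str, str], List[str]]:
    """Run sub-tasks, re-executing unsatisfied ones for up to max_rounds rounds."""
    results: Dict[str, str] = {}
    pending = list(subtasks)
    for attempt in range(1, max_rounds + 1):
        unsatisfied = []
        for task in pending:
            output = execute(task, attempt)
            if is_satisfied(output):
                results[task] = output
            else:
                unsatisfied.append(task)
        pending = unsatisfied
        if not pending:
            break
    return results, pending


def toy_execute(task: str, attempt: int) -> Optional[str]:
    # Simulate one sub-task failing on its first attempt.
    if task == "compute_kpi" and attempt == 1:
        return None
    return f"{task}: done on attempt {attempt}"


results, unresolved = run_with_refinement(
    decompose("Audit emissions disclosures"),
    toy_execute,
    is_satisfied=lambda out: out is not None,
)
print(results)
print("unresolved:", unresolved)
```

Capping the number of rounds matters in practice: without it, a sub-task that can never be satisfied would loop indefinitely and consume the token budget.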

Tool Use

  • retriever over local vector DB
  • web deep researcher/browser
  • python interpreter for calculations
  • plotter for visualizations
  • report tool for final assembly

Frameworks

  • LightRAG
  • Knowledge Graph reasoning

Is Agentic

true

Architectures

  • hierarchical multi-agent

Collaboration

  • specialized sub-agents for domain tasks
  • coordinator/orchestrator that synthesizes observations

Optimization Features

Token Efficiency

  • task-dependent token budgets (Level 1 ~5k tokens, Level 2 ~25k, Level 3 ~100k)
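
For rough capacity planning, the per-level budgets above can be turned into a back-of-envelope estimate. The sketch below treats the approximate budgets as exact upper bounds, which is an assumption; the batch composition and the names `TOKEN_BUDGETS` and `estimate_tokens` are hypothetical.

```python
from typing import Dict

# Approximate per-task token budgets by benchmark level, taken from the list above.
TOKEN_BUDGETS: Dict[int, int] = {1: 5_000, 2: 25_000, 3: 100_000}


def estimate_tokens(task_counts: Dict[int, int]) -> int:
    """Upper-bound token estimate for a batch, given task counts per level."""
    return sum(TOKEN_BUDGETS[level] * count for level, count in task_counts.items())


# Hypothetical batch: ten Level-1 questions, five Level-2 questions, one Level-3 report.
batch = {1: 10, 2: 5, 3: 1}
total = estimate_tokens(batch)
print(f"estimated tokens: {total:,}")
```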

System Optimization

  • hierarchical decomposition to reduce end-to-end cost

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Benchmark is built from 310 DJIA reports, so domain coverage skews to large-cap firms and may not reflect SMEs or other sectors.
  • Evaluation relies on LLM judges with varying strictness, which introduces evaluator bias and variance.
  • Agent depends on web access and closed-source model previews (e.g., gemini-3-flash-preview), limiting reproducible deployment without similar model access.

When Not To Use

  • When web access or external search is unavailable; Level-2/3 performance depends on external evidence.
  • For non-ESG domains without retooling; the toolset and benchmarks are tailored to sustainability reporting.

Failure Modes

  • Hallucinated claims with plausible but unsupported citations if retrieval is weak.
  • Citation reuse patterns that artificially inflate correctness metrics (authors note reuse can bias scores).
  • Judge bias: a single LLM evaluator can be lenient or strict; the ensemble mitigates this but does not eliminate variance.
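
One simple way to monitor the judge-variance risk noted above is to report the spread across judges alongside the ensemble score. The sketch below averages per-judge scores and computes their standard deviation. The aggregation rule (a plain mean), the judge names, and the scores are all assumptions for illustration; the summary does not state how the paper's ensemble combines judges.

```python
from statistics import mean, pstdev
from typing import Dict, Tuple


def ensemble_score(scores: Dict[str, float]) -> Tuple[float, float]:
    """Return (mean score, population standard deviation) across judges."""
    values = list(scores.values())
    return mean(values), pstdev(values)


# Invented scores from three hypothetical judges on a 10-point scale.
judge_scores = {"judge_a": 8.5, "judge_b": 7.0, "judge_c": 8.0}
avg, spread = ensemble_score(judge_scores)
print(f"ensemble mean {avg:.2f}, spread {spread:.2f}")
```

A large spread on a given report is a signal to inspect that report manually rather than trust the averaged score.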

Core Entities

Models

  • ESGAgent
  • Gemini-3-flash
  • GPT-5.2
  • GPT-5.1
  • GPT-5
  • GPT-4.1
  • Deepseek-r1
  • Grok deep research
  • Perplexity deep research

Metrics

  • Accuracy
  • Citation Correctness
  • Citation Faithfulness
  • Analysis Effectiveness (richness, completeness, depth)
  • Presentation Quality (coherence, professionalism, chart expressiveness)
  • Report statistics (#pages, #words, #charts, #refs, #citations)

Datasets

  • ESG Benchmark: 310 DJIA sustainability reports (2010–2024)

Benchmarks

  • ESG Benchmark (3-level: Level 1 atomic QA, Level 2 compositional, Level 3 analytical reports)