Overview
The system is practical for enterprise ESG workflows because it combines retrieval, domain tools, and iterative planning. However, it depends on web access, closed-source model components, and LLM judges whose biases must be managed.
Citations: 0
Evidence Strength: 0.70
Confidence: 0.85
Risk Signals: 8
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 2/5
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 60%
Production readiness: 60%
Novelty: 50%
Why It Matters For Business
Automating professional-grade ESG audits requires retrieval, web research, and domain tools; a specialized agent produces more verifiable and visualization-rich reports than off-the-shelf LLMs, improving auditability and decision support.
Summary TLDR
This paper introduces ESGAgent, a hierarchical multi-agent system with domain tools (retriever, web browser, Python interpreter, plotter, report generator) for professional sustainability audits. The authors also release a three-level ESG Benchmark built from 310 sustainability reports (2010–2024) published by Dow Jones Industrial Average (DJIA) companies. ESGAgent beats strong closed-source LLM baselines on factoid and compositional tasks (84.15% overall accuracy on Levels 1–2) and produces Level-3 reports with more charts and citations. Ablations show that external web search and retrieval are key to its gains. The benchmark evaluates correctness, faithfulness, analysis depth, and presentation quality using a multi-judge LLM ensemble.
Problem Statement
ESG analysis needs deep, cross-document, multimodal reasoning and quantitative checks, but data is fragmented and standard LLMs or static benchmarks do not capture expert multi-step workflows and report-level rigor needed for professional sustainability auditing.
Main Contribution
ESGAgent: a hierarchical multi-agent system with a domain toolset for professional ESG analysis (retriever, deep researcher/browser, Python interpreter, plotter, reporter).
ESG Benchmark: a curated three-level benchmark from 310 DJIA sustainability reports (2010–2024) that spans atomic QA to open-ended analytical report generation.
Key Findings
ESGAgent achieves 84.15% overall accuracy on Level 1–2 tasks, outperforming Gemini-3-flash (80.89%).
Removing the deep research (web search) module drops Level-2 accuracy from 77.19% to 65.79%.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | 90.15% | n/a | n/a | Level 1 tasks (119/132 correct) | Table 3 reports Level-1 accuracy of 90.15% (119 correct out of 132) | Table 3 |
| Accuracy | 77.19% | n/a | n/a | Level 2 tasks (88 correct) | Table 3 reports Level-2 accuracy of 77.19% (88 correct) | Table 3 |
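The reported percentages can be cross-checked with simple arithmetic. The sketch below is only a consistency check, not the authors' evaluation code. The Level-1 counts come from the Table 3 evidence note; the Level-2 total of 114 is an assumption, chosen because it is the only task count for which 88 correct answers round to 77.19%.

```python
def accuracy_pct(correct: int, total: int) -> float:
    """Accuracy as a percentage, rounded to two decimals."""
    return round(100 * correct / total, 2)

# Level-1 counts are taken from the Table 3 evidence note (119 of 132).
LEVEL1 = (119, 132)
# ASSUMPTION: the Level-2 total (114) is inferred, not stated in the summary.
LEVEL2 = (88, 114)

level1_acc = accuracy_pct(*LEVEL1)
level2_acc = accuracy_pct(*LEVEL2)
overall_acc = accuracy_pct(LEVEL1[0] + LEVEL2[0], LEVEL1[1] + LEVEL2[1])

# Deltas in percentage points, computed from the reported figures.
delta_vs_gemini = round(84.15 - 80.89, 2)      # ESGAgent vs. Gemini-3-flash, overall
delta_no_web_search = round(65.79 - 77.19, 2)  # Level 2 without deep research

print(level1_acc, level2_acc, overall_acc)
print(delta_vs_gemini, delta_no_web_search)
```

Under these counts the pooled Level 1–2 accuracy matches the reported 84.15%, which suggests the overall figure is a pooled (micro) average rather than a mean of the two level accuracies.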
What To Try In 7 Days
Index your target company's sustainability reports into a vector DB and run a small RAG pipeline to surface evidence for 5 factual ESG questions.
Prototype a simple planner that splits an audit task into: extract disclosures, compute one KPI (e.g., WACI), and render a chart.
Run an ablation test: compare outputs with and without external web search to see the real-time evidence gap.
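The three-step prototype suggested above (extract disclosures, compute one KPI, render a chart) could start as small as the following sketch. It is a fixed pipeline rather than an LLM-driven planner, and it is not ESGAgent's implementation. All names (`Holding`, `run_audit`, the sample companies) and all numbers are hypothetical. WACI is computed with the standard definition: the portfolio-weighted sum of each holding's emissions divided by its revenue.

```python
from dataclasses import dataclass

@dataclass
class Holding:
    name: str
    weight: float           # portfolio weight; weights should sum to 1
    emissions_tco2e: float  # Scope 1+2 emissions in tonnes CO2e
    revenue_musd: float     # revenue in millions of USD

def extract_disclosures(raw_rows):
    """Step 1: turn raw disclosure rows into typed records."""
    return [Holding(*row) for row in raw_rows]

def compute_waci(holdings):
    """Step 2: WACI = sum(weight_i * emissions_i / revenue_i)."""
    total_weight = sum(h.weight for h in holdings)
    if abs(total_weight - 1.0) > 1e-6:
        raise ValueError("portfolio weights must sum to 1")
    return sum(h.weight * h.emissions_tco2e / h.revenue_musd for h in holdings)

def render_chart(holdings, width=40):
    """Step 3: plain-text bar chart of per-company carbon intensity."""
    intensities = {h.name: h.emissions_tco2e / h.revenue_musd for h in holdings}
    peak = max(intensities.values())
    lines = []
    for name, value in intensities.items():
        bar = "#" * max(1, round(width * value / peak))
        lines.append(f"{name:<10} {bar} {value:.1f}")
    return "\n".join(lines)

def run_audit(raw_rows):
    """Run the three steps in order and collect their outputs."""
    holdings = extract_disclosures(raw_rows)
    return {"waci": compute_waci(holdings), "chart": render_chart(holdings)}

# Made-up illustrative data, not taken from any real report.
ROWS = [
    ("AlphaCo", 0.5, 1000.0, 100.0),
    ("BetaCo", 0.3, 600.0, 200.0),
    ("GammaCo", 0.2, 50.0, 50.0),
]
result = run_audit(ROWS)
print(f"WACI: {result['waci']:.2f} tCO2e per million USD revenue")
print(result["chart"])
```

Keeping each step a plain function makes it easy to swap in a real retriever for step 1 or a plotting library for step 3 later, and to rerun the same pipeline with and without web-sourced rows for the ablation comparison.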
Agent Features
Is Agentic: Yes
Risks & Boundaries
Limitations
Benchmark is built from 310 DJIA reports, so domain coverage skews to large-cap firms and may not reflect SMEs or other sectors.
Evaluation relies on LLM judges with varying strictness, which introduces evaluator bias and variance.
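One common way to reduce the effect of judges with different strictness is to standardize each judge's scores before averaging. The sketch below illustrates that idea only; it is an assumption, not the aggregation method described in the paper, and the judge names and scores are made up.

```python
from statistics import mean, pstdev

def standardize(scores):
    """Convert one judge's raw scores to z-scores (zero mean, unit variance)."""
    mu = mean(scores)
    sigma = pstdev(scores)
    if sigma == 0:
        # A judge that gives every report the same score carries no ranking signal.
        return [0.0 for _ in scores]
    return [(s - mu) / sigma for s in scores]

def ensemble_scores(judge_scores):
    """Average per-judge z-scores for each report.

    judge_scores maps a judge name to a list of raw scores,
    with all lists in the same report order.
    """
    lengths = {len(scores) for scores in judge_scores.values()}
    if len(lengths) != 1:
        raise ValueError("every judge must score the same set of reports")
    normalized = [standardize(scores) for scores in judge_scores.values()]
    return [mean(column) for column in zip(*normalized)]

# Hypothetical judges: one strict, one lenient, same ranking of three reports.
RAW = {
    "strict_judge": [2.0, 3.0, 4.0],
    "lenient_judge": [7.0, 8.0, 9.0],
}
combined = ensemble_scores(RAW)
print(combined)
```

After standardization the strict and lenient judges contribute equally to the ranking, so a systematically harsher judge no longer drags down absolute scores. This addresses strictness offsets only; it does not remove variance or shared biases across judges.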
When Not To Use
When web access or external search is unavailable; Level-2/3 performance depends on external evidence.
For non-ESG domains without retooling; the toolset and benchmarks are tailored to sustainability reporting.
Failure Modes
Hallucinated claims with plausible but unsupported citations if retrieval is weak.
Citation reuse patterns that artificially inflate correctness metrics (authors note reuse can bias scores).
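A lightweight guard against the first failure mode is to check that every number in a generated claim also appears in the passages it cites. The sketch below is a hypothetical heuristic, not part of ESGAgent; the function name, passage IDs, and sample text are made up, and a string match on numbers is far weaker than a real faithfulness check.

```python
import re

NUMBER = re.compile(r"\d+(?:\.\d+)?")

def unsupported_numbers(claim, passages, cited_ids):
    """Return numbers in the claim that do not appear in any cited passage.

    passages maps a passage ID to its text; unknown IDs contribute no text,
    so a claim citing only missing passages has all its numbers flagged.
    """
    cited_text = " ".join(passages.get(cid, "") for cid in cited_ids)
    available = set(NUMBER.findall(cited_text))
    return [n for n in NUMBER.findall(claim) if n not in available]

# Made-up passage and claim for illustration.
PASSAGES = {"p1": "Scope 1 emissions fell to 120 kt in 2023."}
CLAIM = "Scope 1 emissions were 120 kt in 2023, down 15%."

flagged = unsupported_numbers(CLAIM, PASSAGES, ["p1"])
flagged_missing = unsupported_numbers(CLAIM, PASSAGES, ["p9"])
print(flagged)
print(flagged_missing)
```

Here the "15" is flagged because the cited passage never states a percentage change, which is the kind of plausible but unsupported detail the failure mode describes.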

