Overview
Production Readiness
0.6
Novelty Score
0.5
Cost Impact Score
0.6
Citation Count
0
Why It Matters For Business
Automating professional-grade ESG audits requires retrieval, web research, and domain tools; a specialized agent produces more verifiable and visualization-rich reports than off-the-shelf LLMs, improving auditability and decision support.
Summary TLDR
This paper introduces ESGAgent, a hierarchical multi-agent system with domain tools (retriever, web browser, Python interpreter, plotter, report generator) for professional sustainability audits. The authors also release a three-level ESG Benchmark built from 310 Dow Jones company sustainability reports (2010–2024). ESGAgent beats strong closed-source LLM baselines on factoid and compositional tasks (84.15% overall accuracy on Levels 1–2) and produces more chart- and citation-rich Level-3 reports; ablations show external web search and retrieval are key to its gains. The benchmark evaluates correctness, faithfulness, analysis depth, and presentation quality using a multi-judge LLM ensemble.
Problem Statement
ESG analysis requires deep, cross-document, multimodal reasoning and quantitative checks. The underlying data is fragmented, and neither standard LLMs nor static benchmarks capture the multi-step expert workflows and report-level rigor that professional sustainability auditing demands.
Main Contribution
ESGAgent: a hierarchical multi-agent system with a domain toolset for professional ESG analysis (retriever, deep researcher/browser, python interpreter, plotter, reporter).
ESG Benchmark: a curated three-level benchmark from 310 DJIA sustainability reports (2010–2024) that spans atomic QA to open-ended analytical report generation.
Empirical evaluation showing ESGAgent outperforms multiple closed-source LLM baselines on Levels 1–3 and ablations that quantify the value of retrieval and web research.
Key Findings
ESGAgent achieves 84.15% overall accuracy on Level 1–2 tasks, outperforming Gemini-3-flash (80.89%).
Removing the deep research (web search) module drops Level-2 accuracy from 77.19% to 65.79%.
ESGAgent produces more actionable reports, as measured by chart and citation counts: an average of 3.5 charts and 38.3 citation insertions per report.
On Level 3, ESGAgent scores 0.93 on factual consistency (citation correctness) and 0.805 on citation faithfulness, giving it the strongest multi-dimensional average among the systems evaluated.
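The gaps behind the first two findings can be read as percentage-point differences. The short computation below uses only the accuracy figures quoted above; the variable names are illustrative and not taken from the paper.

```python
# Accuracy figures quoted in the key findings (percent).
esgagent_overall = 84.15   # Levels 1-2, ESGAgent
gemini_overall = 80.89     # Levels 1-2, Gemini-3-flash
level2_full = 77.19        # Level 2, full system
level2_no_web = 65.79      # Level 2, deep research module removed

# Absolute gaps in percentage points.
gap_vs_baseline = round(esgagent_overall - gemini_overall, 2)
drop_without_web = round(level2_full - level2_no_web, 2)

# Relative drop, as a share of the full system's Level-2 accuracy.
relative_drop = round(100 * (level2_full - level2_no_web) / level2_full, 1)

print(f"Gap vs. baseline: {gap_vs_baseline} points")
print(f"Drop without web search: {drop_without_web} points ({relative_drop}% relative)")
```

The ablation gap on Level 2 is several times larger than the overall lead over the baseline, which is consistent with the summary's claim that external web search is a key contributor.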
Results
Accuracy
Level-3 factual consistency (citation correctness / faithfulness)
Report richness (#charts, #citations) average
Who Should Care
What To Try In 7 Days
Index your target company's sustainability reports into a vector DB and run a small RAG pipeline to surface evidence for 5 factual ESG questions.
Prototype a simple planner that splits an audit task into: extract disclosures, compute one KPI (e.g., WACI), and render a chart.
Run an ablation test: compare outputs with and without external web search to see the real-time evidence gap.
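The planner prototype in the second step could start as three plain functions chained together. The sketch below is a minimal illustration, not the paper's implementation: the `Holding` fields, the stubbed `extract_disclosures`, and the sample figures are all hypothetical, and the chart is a text bar chart so the example needs only the standard library. WACI is computed with the usual definition, the sum over holdings of portfolio weight times emissions divided by revenue.

```python
from dataclasses import dataclass


@dataclass
class Holding:
    name: str
    weight: float           # portfolio weight; weights are assumed to sum to 1
    emissions_tco2e: float  # Scope 1+2 emissions in tonnes CO2e
    revenue_musd: float     # revenue in millions of USD


def extract_disclosures():
    """Step 1 (stub): fictional figures standing in for retrieved disclosures."""
    return [
        Holding("AlphaCo", 0.5, 1000.0, 100.0),
        Holding("BetaCo", 0.3, 600.0, 200.0),
        Holding("GammaCo", 0.2, 50.0, 50.0),
    ]


def compute_waci(holdings):
    """Step 2: weighted average carbon intensity (tCO2e per $M revenue)."""
    return sum(h.weight * h.emissions_tco2e / h.revenue_musd for h in holdings)


def render_chart(holdings, width=40):
    """Step 3: text bar chart of per-company carbon intensity."""
    intensities = {h.name: h.emissions_tco2e / h.revenue_musd for h in holdings}
    peak = max(intensities.values())
    lines = []
    for name, value in intensities.items():
        bar = "#" * max(1, round(width * value / peak))
        lines.append(f"{name:<8} {bar} {value:.1f}")
    return "\n".join(lines)


def run_audit():
    """Fixed three-step plan: extract, compute, render."""
    holdings = extract_disclosures()
    return {"waci": compute_waci(holdings), "chart": render_chart(holdings)}


report = run_audit()
print(f"WACI: {report['waci']:.2f} tCO2e per $M revenue")
print(report["chart"])
```

Replacing the stub with the retrieval step from the first item turns this into an end-to-end prototype, and running it with and without web-sourced figures gives a starting point for the ablation in the third item.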
Agent Features
Memory
- RAG with LightRAG index
- knowledge graph augmentation
- centralized general memory for orchestration
- tool-level local memory
Planning
- top-level planner decomposes user queries into sub-tasks
- iterative re-execution and refinement loop for unsatisfied sub-tasks
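The two planning behaviors above can be sketched as a decompose-then-retry loop. This is an assumption-laden illustration: `plan`, `run_with_refinement`, and the `execute` and `is_satisfied` callbacks are hypothetical names, and the string split stands in for whatever LLM-based decomposition the actual system uses.

```python
def plan(query):
    """Hypothetical planner: splitting on ' and ' stands in for LLM decomposition."""
    return [part.strip() for part in query.split(" and ") if part.strip()]


def run_with_refinement(query, execute, is_satisfied, max_rounds=3):
    """Run each sub-task, re-executing unsatisfied ones for up to max_rounds."""
    results = {}
    pending = plan(query)
    for round_no in range(1, max_rounds + 1):
        still_pending = []
        for task in pending:
            output = execute(task, round_no)
            if is_satisfied(task, output):
                results[task] = output
            else:
                still_pending.append(task)  # retried in the next round
        pending = still_pending
        if not pending:
            break
    return results, pending


def fake_execute(task, attempt):
    """Toy executor: 'compute' tasks fail on the first attempt only."""
    if "compute" in task and attempt == 1:
        return ""
    return f"done: {task}"


results, unresolved = run_with_refinement(
    "extract emissions and compute intensity",
    fake_execute,
    lambda task, output: bool(output),
)
print(results)
print(unresolved)
```

Capping the number of rounds keeps cost bounded; sub-tasks that never pass the check are returned separately so a caller can report them rather than silently dropping them.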
Tool Use
- retriever over local vector DB
- web deep researcher/browser
- python interpreter for calculations
- plotter for visualizations
- report tool for final assembly
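To make the retriever tool concrete, here is a toy stand-in for a vector-DB lookup. It is only a sketch: bag-of-words counts replace learned embeddings, `ToyRetriever` is a hypothetical class, and the passages are fictional. It does not use or imitate the LightRAG API.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy 'embedding': lowercase token counts instead of a learned vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyRetriever:
    def __init__(self, passages):
        # Index each passage once, alongside its vector.
        self.index = [(p, embed(p)) for p in passages]

    def search(self, query, k=2):
        """Return the k passages most similar to the query."""
        q = embed(query)
        ranked = sorted(self.index, key=lambda item: cosine(q, item[1]), reverse=True)
        return [passage for passage, _ in ranked[:k]]


# Fictional passages, for illustration only.
retriever = ToyRetriever([
    "Scope 1 emissions fell to 120 kt CO2e in 2023.",
    "The board has 40 percent women directors.",
    "Water withdrawal was 3.2 million cubic metres.",
])
hits = retriever.search("What were scope 1 emissions?", k=1)
print(hits)
```

In a real pipeline the `embed` function would be swapped for an embedding model and the list scan for an approximate nearest-neighbor index; the interface (query in, ranked passages out) stays the same.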
Frameworks
- LightRAG
- Knowledge Graph reasoning
Is Agentic
true
Architectures
- hierarchical multi-agent
Collaboration
- specialized sub-agents for domain tasks
- coordinator/orchestrator that synthesizes observations
Optimization Features
Token Efficiency
- task-dependent token budgets (Level 1 ~5k tokens, Level 2 ~25k, Level 3 ~100k)
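One way such task-dependent budgets might be enforced is a per-level lookup checked before each model call. The sketch below treats the approximate figures above as hard caps and estimates tokens at roughly four characters each; both choices, and all function names, are assumptions made for illustration.

```python
# Approximate per-level budgets from the summary, treated here as hard caps.
TOKEN_BUDGETS = {1: 5_000, 2: 25_000, 3: 100_000}


def estimate_tokens(text):
    """Crude estimate (assumption): about four characters per token."""
    return max(1, len(text) // 4)


def remaining_budget(level, used_tokens):
    """Tokens left for a task at the given benchmark level."""
    return TOKEN_BUDGETS[level] - used_tokens


def fits(level, used_tokens, next_text):
    """True if sending next_text would stay within the level's budget."""
    return estimate_tokens(next_text) <= remaining_budget(level, used_tokens)


print(fits(1, 4_900, "x" * 400))   # 100 estimated tokens, 100 remaining
print(fits(1, 4_900, "x" * 404))   # 101 estimated tokens, 100 remaining
```

A production system would use the model's own tokenizer rather than a character heuristic.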
System Optimization
- hierarchical decomposition to reduce end-to-end cost
Reproducibility
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Benchmark is built from 310 DJIA reports, so domain coverage skews to large-cap firms and may not reflect SMEs or other sectors.
- Evaluation relies on LLM judges with varying strictness, which introduces evaluator bias and variance.
- The agent depends on web access and closed-source preview models (e.g., gemini-3-flash-preview), which limits reproducible deployment without comparable model access.
When Not To Use
- When web access or external search is unavailable; Level-2/3 performance depends on external evidence.
- For non-ESG domains without retooling; the toolset and benchmarks are tailored to sustainability reporting.
Failure Modes
- Hallucinated claims with plausible but unsupported citations if retrieval is weak.
- Citation reuse patterns that artificially inflate correctness metrics (authors note reuse can bias scores).
- Judge bias: single LLM evaluators can be lenient or strict; ensemble mitigates but does not eliminate variance.
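The judge-bias point can be illustrated by averaging scores across several judges and reporting the spread alongside the mean. This is a generic sketch, not the paper's aggregation rule: the judge names and scores are fictional, and simple averaging is an assumption.

```python
from statistics import mean, pstdev


def ensemble_score(scores_by_judge):
    """Aggregate per-judge scores; the spread shows how much judges disagree."""
    values = list(scores_by_judge.values())
    return {
        "mean": mean(values),
        "spread": pstdev(values),          # population standard deviation
        "range": max(values) - min(values),
    }


# Fictional scores for one report from a lenient, a neutral, and a strict judge.
summary = ensemble_score({"judge_a": 0.9, "judge_b": 0.8, "judge_c": 0.7})
print(summary)
```

Averaging pulls the final score toward the middle, but a large spread or range is still a signal that the report's rating depends heavily on which judge is asked.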
Core Entities
Models
- ESGAgent
- Gemini-3-flash
- GPT-5.2
- GPT-5.1
- GPT-5
- GPT-4.1
- DeepSeek-R1
- Grok deep research
- Perplexity deep research
Metrics
- Accuracy
- Citation Correctness
- Citation Faithfulness
- Analysis Effectiveness (richness, completeness, depth)
- Presentation Quality (coherence, professionalism, chart expressiveness)
- Report statistics (#pages, #words, #charts, #refs, #citations)
Datasets
- ESG Benchmark: 310 DJIA sustainability reports (2010–2024)
Benchmarks
- ESG Benchmark (3-level: Level 1 atomic QA, Level 2 compositional, Level 3 analytical reports)

