Overview
The system assembles known components (hybrid retrieval, chain-of-thought prompts, Text2SQL) into a practical pipeline and reports quantitative gains over a standard RAG baseline; human review is still needed for high-stakes outputs.
Citations: 0
Evidence Strength: 0.75
Confidence: 0.85
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 3/3
Findings with evidence refs: 3/3
Results with explicit delta: 3/5
Reproducibility
Status: No open assets linked
Open source: Partial
At A Glance
Cost impact: 60%
Production readiness: 60%
Novelty: 50%
Why It Matters For Business
CarbonChat automates extraction and structured analysis of long sustainability reports and policy texts, cutting manual effort and providing traceable, SQL-queryable answers for decision-makers.
Summary TLDR
This paper presents CarbonChat, a practical LLM-based system that reads long sustainability reports and policy documents, extracts structured data, runs hybrid retrieval, and answers corporate carbon-emissions questions. Its key pieces are diversified document chunking, a Self-Prompting Retrieval-Augmented Generation (RAG) pipeline, and a schema-aware Text2SQL module with security checks. Experiments with a Qwen-Max backbone show improved text metrics (ROUGE and BERTScore) over standard RAG and strong Text2SQL results (execution accuracy, EX, 89.2%; exact match, EM, 79.9%). The system attaches timestamps, source paths, and a hallucination tag to answers to improve traceability, but it still requires human review for high-stakes use.
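The traceability idea in the summary can be sketched as a small data structure. This is a minimal illustration and not CarbonChat's actual schema: the class names, the field names (`source_path`, `retrieved_at`), and the rule that an answer with no supporting passage is flagged are all assumptions made for the example.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class TracedPassage:
    """A retrieved passage plus the provenance needed to audit an answer."""
    text: str
    source_path: str   # hypothetical field: document the passage came from
    retrieved_at: str  # ISO 8601 timestamp of retrieval


@dataclass
class TracedAnswer:
    """An LLM answer bundled with the passages that support it."""
    answer: str
    passages: list[TracedPassage] = field(default_factory=list)

    @property
    def possible_hallucination(self) -> bool:
        # Assumed rule for this sketch: no supporting passage means "flag it".
        return len(self.passages) == 0

    def citation_lines(self) -> list[str]:
        """Human-readable provenance, one line per supporting passage."""
        return [
            f"[{i}] {p.source_path} (retrieved {p.retrieved_at})"
            for i, p in enumerate(self.passages, start=1)
        ]


def make_passage(text: str, source_path: str, now: datetime | None = None) -> TracedPassage:
    """Stamp a passage with its source path and a UTC retrieval time."""
    now = now or datetime.now(timezone.utc)
    return TracedPassage(text, source_path, now.isoformat(timespec="seconds"))
```

Keeping provenance on the passage rather than on the answer means the same record can be reused when a passage supports several answers.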
Problem Statement
Enterprises and analysts face long, complex sustainability reports and fragmented policy texts. Off-the-shelf LLMs lack up-to-date domain data and struggle with long documents, structured tables, and SQL-style queries, producing hallucinations and costly manual analysis. CarbonChat aims to automate extraction, structured questioning, and traceable answers aligned with the GHG Protocol.
Main Contribution
Diversified index module for document chunking and structured extraction (document tree, rule-based, semantic, paragraph/sliding window, table/image/formula handling).
Self-Prompting RAG architecture combining intent recognition, structured chain-of-thought prompts, hybrid BM25+embedding retrieval, re-ranking, and key-sentence extraction.
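To make the hybrid retrieval step concrete, here is a minimal, dependency-free sketch. It is not the paper's implementation: cosine similarity over term counts stands in for a real embedding model, reciprocal rank fusion is an assumed way to combine the two rankings, and the function names and parameter defaults (`k1`, `b`, `rrf_k`) are illustrative choices.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer; a real system would use a proper one."""
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of every document for the query."""
    tokenized = [tokenize(d) for d in docs]
    avg_len = sum(len(t) for t in tokenized) / len(tokenized)
    doc_freq = Counter(term for t in tokenized for term in set(t))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            df = doc_freq[term]
            idf = math.log(1 + (len(docs) - df + 0.5) / (df + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def cosine_scores(query, docs):
    """Stand-in for dense retrieval: cosine similarity of term-count vectors."""
    def cosine(a, b):
        dot = sum(a[t] * b[t] for t in a)
        norm_a = math.sqrt(sum(v * v for v in a.values()))
        norm_b = math.sqrt(sum(v * v for v in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    q = Counter(tokenize(query))
    return [cosine(q, Counter(tokenize(d))) for d in docs]


def rank(scores):
    """Document indices ordered from best to worst score (ties keep input order)."""
    return sorted(range(len(scores)), key=lambda i: -scores[i])


def hybrid_search(query, docs, top_k=3, rrf_k=60):
    """Fuse the sparse and dense rankings with reciprocal rank fusion."""
    if not docs:
        return []
    fused = [0.0] * len(docs)
    for scores in (bm25_scores(query, docs), cosine_scores(query, docs)):
        for position, doc_index in enumerate(rank(scores), start=1):
            fused[doc_index] += 1.0 / (rrf_k + position)
    return rank(fused)[:top_k]
```

Rank-based fusion avoids having to put BM25 scores and similarity scores on a common scale; a re-ranking model, as described in the contribution above, would then reorder the fused top results.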
Key Findings
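Self-Prompting RAG improves text-generation metrics vs standard RAG with Qwen-Max (ROUGE-1 0.592 vs 0.529; BERTScore F1 0.906 vs 0.831).
Text2SQL module achieves high accuracy on table queries (execution accuracy, EX, 89.2%; exact match, EM, 79.9%).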
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| ROUGE-1 (Qwen-Max, Self-Prompting RAG) | 0.592 | Standard RAG 0.529 | +0.063 | Text content evaluation (Table 2) | Table 2 shows Qwen-Max Self-Prompting RAG ROUGE-1 0.592 | Table 2 |
| BERTScore F1 (Qwen-Max, Self-Prompting RAG) | 0.906 | Standard RAG 0.831 | +0.075 | Text content evaluation (Table 2) | Table 2 reports BERTScore F1 0.906 for Qwen-Max Self-Prompting RAG | Table 2 |
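
The Delta column is the absolute difference between the reported value and the baseline. The short check below recomputes it and also derives a relative gain, which the table does not report and is added here only for illustration; the helper name `deltas` is hypothetical.

```python
def deltas(value, baseline):
    """Return (absolute delta, relative gain in percent) for a metric."""
    absolute = round(value - baseline, 3)
    relative_pct = round(100 * (value - baseline) / baseline, 1)
    return absolute, relative_pct


# Values copied from the Results table above.
rouge1 = deltas(0.592, 0.529)     # (0.063, 11.9)
bertscore = deltas(0.906, 0.831)  # (0.075, 9.0)
```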
What To Try In 7 Days
Run hybrid retrieval (BM25 + embeddings) over a few company reports to compare precision vs BM25-only.
Prototype a Text2SQL path for one internal database table with schema-aware prompts and whitelist checks.
Add source-path and timestamp fields to retrieved passages so every LLM answer can be traced back to a document and version.
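For the Text2SQL prototype, a whitelist check in front of the database might look like the sketch below. It is an assumption-heavy illustration, not the paper's security module: the table name `emissions`, the keyword blocklist, and the regex-based table detection are all invented for the example, and a production system would use a real SQL parser and a read-only database role.

```python
import re
import sqlite3

ALLOWED_TABLES = {"emissions"}  # hypothetical whitelist for this sketch
FORBIDDEN = re.compile(
    r"\b(insert|update|delete|drop|alter|create|attach|pragma)\b", re.I
)
TABLE_REF = re.compile(r"\b(?:from|join)\s+([A-Za-z_][A-Za-z0-9_]*)", re.I)


def validate_sql(sql):
    """Return (ok, reason) for a candidate LLM-generated query.

    Deliberately conservative: it may reject valid queries (for example a
    string literal containing ';'), which is the safer failure direction.
    """
    statement = sql.strip().rstrip(";").strip()
    if ";" in statement:
        return False, "multiple statements"
    if not statement.lower().startswith("select"):
        return False, "only SELECT is allowed"
    if FORBIDDEN.search(statement):
        return False, "forbidden keyword"
    tables = {name.lower() for name in TABLE_REF.findall(statement)}
    if not tables:
        return False, "no table referenced"
    unknown = tables - ALLOWED_TABLES
    if unknown:
        return False, f"table not whitelisted: {sorted(unknown)}"
    return True, "ok"


def run_checked(conn, sql):
    """Execute the query only if it passes validation."""
    ok, reason = validate_sql(sql)
    if not ok:
        raise ValueError(reason)
    return conn.execute(sql).fetchall()
```

A check like this only guards against unsafe statements. It does not catch SQL that is valid but answers the wrong business question, so sampled results still need human review.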
Risks & Boundaries
Limitations
Evaluation uses a Qwen-Max backbone and a static vector DB; effectiveness with other LLMs or live databases is not fully shown.
The paper focuses on a Chinese policy corpus and public reports; generalization to other jurisdictions is untested.
When Not To Use
Do not use as sole evidence for legal or financial audits without expert verification.
Avoid relying on it for real-time regulatory compliance decisions until live-data handling and governance are validated.
Failure Modes
LLM-generated SQL that is syntactically valid but semantically wrong for business logic.
Table extraction errors from complex PDFs (merged cells, broken headers) causing incorrect answers.

