Overview
Production Readiness
0.6
Novelty Score
0.5
Cost Impact Score
0.6
Citation Count
0
Why It Matters For Business
CarbonChat automates extraction and structured analysis of long sustainability reports and policy texts, cutting manual effort and providing traceable, SQL-queryable answers for decision-makers.
Summary TLDR
This paper presents CarbonChat, a practical LLM-based system that reads long sustainability reports and policy documents, extracts structured data, runs hybrid retrieval, and answers corporate carbon-emissions questions. Key pieces: diversified document chunking, a Self-Prompting Retrieval-Augmented-Generation (RAG) pipeline, and a schema-aware Text2SQL module with security checks. Experiments (Qwen-Max backbone) show improved text metrics (ROUGE/L and BERTScore) and high Text2SQL execution accuracy (EX 89.2%, EM 79.9%). The system adds timestamps, source paths and a hallucination tag to improve traceability but still requires human review for high-stakes use.
Problem Statement
Enterprises and analysts face long, complex sustainability reports and fragmented policy texts. Off-the-shelf LLMs lack up-to-date domain data and struggle with long documents, structured tables, and SQL-style queries, producing hallucinations and costly manual analysis. CarbonChat aims to automate extraction, structured questioning, and traceable answers aligned with the GHG Protocol.
Main Contribution
Diversified index module for document chunking and structured extraction (document tree, rule-based, semantic, paragraph/sliding window, table/image/formula handling).
Self-Prompting RAG architecture combining intent recognition, structured chain-of-thought prompts, hybrid BM25+embedding retrieval, re-ranking, and key-sentence extraction.
Schema-aware Text2SQL pipeline: LLM-guided SQL generation, schema matching, security validation (whitelists, syntax checks), and automated SQL repair.
14-dimension GHG Protocol-based analysis framework for report summarization, compliance scoring (0–100), and customized Q&A with source tracing.
Practical features for verifiability: timestamps, version control, multi-layer chunking, and hallucination tagging linked to source sentences.
Key Findings
Self-Prompting RAG improves text-generation metrics vs standard RAG.
Text2SQL module produces high execution accuracy on table queries.
Ablations show core modules strongly affect SQL accuracy.
Results
ROUGE-1 (Qwen-Max, Self-Prompting RAG)
BERTScore F1 (Qwen-Max, Self-Prompting RAG)
Accuracy
Text2SQL Exact Match (EM)
Ablation — remove time/table localization
Who Should Care
What To Try In 7 Days
Run hybrid retrieval (BM25 + embeddings) over a few company reports to compare precision vs BM25-only.
Prototype a Text2SQL path for one internal database table with schema-aware prompts and whitelist checks.
Add source-path and timestamp fields to retrieved passages so every LLM answer can be traced back to a document and version.
Optimization Features
System Optimization
- prompt engineering
- progressive context trimming when prompt > 3000 tokens
Reproducibility
Open Source Status
- partial
Risks & Boundaries
Limitations
- Evaluation uses Qwen-Max backbone and static vector DB; effectiveness with other LLMs or live DBs is not fully shown.
- Paper focuses on Chinese policy corpus and public reports; cross-jurisdiction generalization is untested.
- Hallucination tagging reduces but does not eliminate incorrect or unverifiable claims; human verification still required.
- Planned code/data release is pending, limiting immediate reproducibility.
When Not To Use
- Do not use as sole evidence for legal or financial audits without expert verification.
- Avoid relying on it for real-time regulatory compliance decisions until live-data and governance are validated.
- Not suitable when company databases lack consistent schemas or have heavily redacted/confidential fields.
Failure Modes
- LLM-generated SQL that is syntactically valid but semantically wrong for business logic.
- Table extraction errors from complex PDFs (merged cells, broken headers) causing incorrect answers.
- Outdated policy or report versions producing stale guidance if timestamps are missing or misapplied.
- Hallucinated assertions not covered by source passages despite tagging heuristics.
Core Entities
Models
- Qwen-Max
- ChatGPT-4o-2024-05-13
- GLM-4
- Spark 4.0 Ultra
- Baidu ERNIE-4.0-Turbo
- Llama-3.1-70B-Instruct
- BGE-M3-Embedding
- BGE-reranker-large
Metrics
- ROUGE-1
- ROUGE-2
- ROUGE-L
- BERTScore Precision
- BERTScore Recall
- BERTScore F1
- EX
- EM
Datasets
- 1000 Chinese policy/regulatory docs (2018-2024)
- 1180 QA pairs (test set)
- 100 corporate environmental reports
- 2,133 table-aware QA annotated pairs
Benchmarks
- ROUGE
- BERTScore
- Accuracy
- Exact Match (EM)

