CarbonChat: LLM system for corporate carbon-emissions analysis using hybrid RAG and Text2SQL

January 3, 2025 · 7 min

Overview

Decision Snapshot: Needs Validation

The system assembles known components (hybrid retrieval, chain-of-thought (CoT) prompts, Text2SQL) into a practical pipeline and reports quantitative gains; human review is still needed for high-stakes outputs.

Citations: 0

Evidence Strength: 0.75

Confidence: 0.85

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 3/5

Reproducibility

Status: No open assets linked

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 50%

Authors

Zhixuan Cao, Ming Han, Jingtao Wang, Meng Jia

Links

Abstract / PDF

Why It Matters For Business

CarbonChat automates extraction and structured analysis of long sustainability reports and policy texts, cutting manual effort and providing traceable, SQL-queryable answers for decision-makers.

Who Should Care

Summary TLDR

This paper presents CarbonChat, a practical LLM-based system that reads long sustainability reports and policy documents, extracts structured data, runs hybrid retrieval, and answers corporate carbon-emissions questions. Key pieces: diversified document chunking, a Self-Prompting Retrieval-Augmented Generation (RAG) pipeline, and a schema-aware Text2SQL module with security checks. Experiments with a Qwen-Max backbone show improved text metrics (ROUGE and BERTScore) and strong Text2SQL results (execution accuracy EX 89.2%, exact match EM 79.9%). The system adds timestamps, source paths, and a hallucination tag to improve traceability, but still requires human review for high-stakes use.
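A minimal sketch of what a traceable answer record could look like, assuming each retrieved passage arrives as a dict with a `path` key. The class names, field names, and the rule for setting the hallucination flag (flag the answer when no passage supports it) are illustrative assumptions, not the paper's schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class SourceRef:
    path: str          # where the passage came from (hypothetical path format)
    retrieved_at: str  # ISO-8601 timestamp recorded at retrieval time


@dataclass
class TracedAnswer:
    text: str
    sources: list[SourceRef] = field(default_factory=list)
    hallucination_flag: bool = False  # assumption: True when no source backs the answer


def build_answer(text: str, passages: list[dict]) -> TracedAnswer:
    """Attach source metadata and flag answers that have no supporting passage."""
    now = datetime.now(timezone.utc).isoformat()
    sources = [SourceRef(path=p["path"], retrieved_at=now) for p in passages]
    return TracedAnswer(text=text, sources=sources, hallucination_flag=not sources)
```

Keeping the flag on the answer object, rather than in free text, lets a reviewer filter unsupported answers before they reach a report.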

Problem Statement

Enterprises and analysts face long, complex sustainability reports and fragmented policy texts. Off-the-shelf LLMs lack up-to-date domain data and struggle with long documents, structured tables, and SQL-style queries, which leads to hallucinations and costly manual analysis. CarbonChat aims to automate extraction, structured questioning, and traceable answers aligned with the GHG Protocol.

Main Contribution

Diversified index module for document chunking and structured extraction (document tree, rule-based, semantic, paragraph/sliding window, table/image/formula handling).

Self-Prompting RAG architecture combining intent recognition, structured chain-of-thought prompts, hybrid BM25+embedding retrieval, re-ranking, and key-sentence extraction.
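The hybrid BM25 + embedding retrieval step named above can be sketched as follows. This is an illustration under stated assumptions: the summary does not say how the two scores are fused, so the weighted sum of min-max normalized scores (and the `alpha` weight) is a guess, and `cosine_scores` uses bag-of-words cosine as a stand-in for a real embedding model such as BGE-M3. Re-ranking and key-sentence extraction are omitted.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    return text.lower().split()


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Okapi BM25 score of each document for the query (docs must be non-empty)."""
    tokenized = [tokenize(d) for d in docs]
    avg_len = sum(len(t) for t in tokenized) / len(tokenized)
    df = Counter(term for toks in tokenized for term in set(toks))
    n = len(docs)
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def cosine_scores(query: str, docs: list[str]) -> list[float]:
    """Placeholder dense score: bag-of-words cosine standing in for embeddings."""
    def cos(a: Counter, b: Counter) -> float:
        dot = sum(a[t] * b[t] for t in a)
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    q = Counter(tokenize(query))
    return [cos(q, Counter(tokenize(d))) for d in docs]


def min_max(xs: list[float]) -> list[float]:
    lo, hi = min(xs), max(xs)
    return [0.0] * len(xs) if hi == lo else [(x - lo) / (hi - lo) for x in xs]


def hybrid_rank(query: str, docs: list[str], alpha: float = 0.5) -> list[int]:
    """Return document indices, best first, by fused sparse + dense score."""
    sparse = min_max(bm25_scores(query, docs))
    dense = min_max(cosine_scores(query, docs))
    fused = [alpha * s + (1 - alpha) * d for s, d in zip(sparse, dense)]
    return sorted(range(len(docs)), key=lambda i: fused[i], reverse=True)
```

Normalizing before fusion matters because raw BM25 scores are unbounded while cosine similarity is not; without it one signal can dominate regardless of `alpha`.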

Key Findings

Self-Prompting RAG improves text-generation metrics vs standard RAG.

Numbers: Qwen-Max ROUGE-1 0.592 vs 0.529; BERTScore F1 0.906 vs 0.831

Practical Use: Use structured CoT prompts with hybrid retrieval to get measurably better summaries and answers from LLMs on carbon reporting.

Evidence Ref: Table 2

Text2SQL module produces high execution accuracy on table queries.

Numbers: Text2SQL EX 89.2%, EM 79.9%

Practical Use: A schema-aware Text2SQL step converts natural-language questions into correct SQL for company databases in most cases (EX 89.2%); critical queries should still be validated.

Evidence Ref: Table 3
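To make the two Text2SQL metrics concrete, here is a hedged sketch of how EX (execution accuracy) and EM (exact match) are commonly computed. It assumes a SQLite database and a simplified EM that compares whitespace- and case-normalized SQL strings; the paper's exact evaluation protocol is not given in this summary, and the function names are illustrative.

```python
import sqlite3


def normalize(sql: str) -> str:
    """Lowercase, collapse whitespace, and drop a trailing semicolon."""
    return " ".join(sql.lower().rstrip().rstrip(";").split())


def exact_match(pred: str, gold: str) -> bool:
    """Simplified EM: the normalized SQL strings are identical."""
    return normalize(pred) == normalize(gold)


def execution_match(conn: sqlite3.Connection, pred: str, gold: str) -> bool:
    """EX: predicted and gold queries return the same rows (order ignored)."""
    try:
        pred_rows = conn.execute(pred).fetchall()
    except sqlite3.Error:
        return False  # a query that fails to run counts as wrong
    gold_rows = conn.execute(gold).fetchall()
    return sorted(pred_rows, key=repr) == sorted(gold_rows, key=repr)


def score(conn: sqlite3.Connection, pairs: list[tuple[str, str]]) -> tuple[float, float]:
    """Return (EX, EM) as fractions over (predicted, gold) SQL pairs."""
    n = len(pairs)
    ex = sum(execution_match(conn, p, g) for p, g in pairs) / n
    em = sum(exact_match(p, g) for p, g in pairs) / n
    return ex, em
```

Under these definitions EX is usually at least as high as EM, since differently written queries can return identical rows; that is consistent with the reported 89.2% versus 79.9%.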

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
ROUGE-1 (Qwen-Max, Self-Prompting RAG) | 0.592 | Standard RAG 0.529 | +0.063 | Text content evaluation (Table 2) | Table 2 shows Qwen-Max Self-Prompting RAG ROUGE-1 0.592 | Table 2
BERTScore F1 (Qwen-Max, Self-Prompting RAG) | 0.906 | Standard RAG 0.831 | +0.075 | Text content evaluation (Table 2) | Table 2 reports BERTScore F1 0.906 for Qwen-Max Self-Prompting RAG | Table 2

What To Try In 7 Days

Run hybrid retrieval (BM25 + embeddings) over a few company reports to compare precision vs BM25-only.

Prototype a Text2SQL path for one internal database table with schema-aware prompts and whitelist checks.

Add source-path and timestamp fields to retrieved passages so every LLM answer can be traced back to a document and version.
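For the Text2SQL prototype above, one way to enforce a whitelist is sketched below. It assumes the target database is SQLite and uses the `sqlite3` authorizer callback so the engine itself rejects anything other than reads on allowed tables; this is a swapped-in technique for illustration, not the security check described in the paper. `ALLOWED_TABLES`, `guard`, and `run_guarded` are hypothetical names.

```python
import sqlite3

ALLOWED_TABLES = {"emissions"}  # hypothetical allow-list


def authorizer(action, arg1, arg2, db_name, source):
    """Permit SELECT statements, function calls, and reads on allowed tables only."""
    if action in (sqlite3.SQLITE_SELECT, sqlite3.SQLITE_FUNCTION):
        return sqlite3.SQLITE_OK
    if action == sqlite3.SQLITE_READ and arg1 in ALLOWED_TABLES:
        return sqlite3.SQLITE_OK
    return sqlite3.SQLITE_DENY  # writes, DDL, and other tables are refused


def guard(conn: sqlite3.Connection) -> None:
    """Install the authorizer; do this after any setup writes on the connection."""
    conn.set_authorizer(authorizer)


def run_guarded(conn: sqlite3.Connection, sql: str):
    """Run LLM-generated SQL; return rows, or None if the statement is rejected."""
    try:
        return conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None
```

Delegating the check to the database engine avoids parsing SQL with regular expressions, which is easy to get wrong. Other databases would need their own mechanism, such as a read-only role with table-level grants.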

Optimization Features

System Optimization
prompt engineering
progressive context trimming when prompt > 3000 tokens
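Progressive context trimming could look like the sketch below. The 3000-token threshold comes from the summary; everything else is an assumption: passages are taken to be sorted best-first, the lowest-ranked passage is dropped on each pass, and `count_tokens` counts whitespace-separated words as a placeholder for the model's real tokenizer.

```python
MAX_PROMPT_TOKENS = 3000  # threshold named in the summary


def count_tokens(text: str) -> int:
    """Placeholder tokenizer: whitespace-separated words stand in for model tokens."""
    return len(text.split())


def build_prompt(question: str, passages: list[str]) -> str:
    context = "\n\n".join(passages)
    return f"Context:\n{context}\n\nQuestion: {question}"


def trim_context(question: str, passages: list[str], limit: int = MAX_PROMPT_TOKENS) -> list[str]:
    """Drop the lowest-ranked passage until the assembled prompt fits the limit.

    `passages` is assumed to be sorted best-first, so trimming removes
    the least relevant context first.
    """
    kept = list(passages)
    while kept and count_tokens(build_prompt(question, kept)) > limit:
        kept.pop()
    return kept
```

Measuring the whole assembled prompt, rather than the passages alone, keeps the instruction text and the question inside the budget.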

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Evaluation uses Qwen-Max backbone and static vector DB; effectiveness with other LLMs or live DBs is not fully shown.

Paper focuses on Chinese policy corpus and public reports; cross-jurisdiction generalization is untested.

When Not To Use

Do not use as sole evidence for legal or financial audits without expert verification.

Avoid relying on it for real-time regulatory compliance decisions until live data handling and governance are validated.

Failure Modes

LLM-generated SQL that is syntactically valid but semantically wrong for business logic.

Table extraction errors from complex PDFs (merged cells, broken headers) causing incorrect answers.

Core Entities

Models

Qwen-Max, ChatGPT-4o-2024-05-13, GLM-4, Spark 4.0 Ultra, Baidu ERNIE-4.0-Turbo, Llama-3.1-70B-Instruct, BGE-M3-Embedding, BGE-reranker-large

Metrics

ROUGE-1, ROUGE-2, ROUGE-L, BERTScore Precision, BERTScore Recall, BERTScore F1, EX, EM

Datasets

1000 Chinese policy/regulatory docs (2018-2024); 1180 QA pairs (test set); 100 corporate environmental reports; 2,133 table-aware QA annotated pairs

Benchmarks

ROUGE, BERTScore, Accuracy, Exact Match (EM)