CarbonChat: LLM system for corporate carbon-emissions analysis using hybrid RAG and Text2SQL

Overview

Decision Snapshot: Needs Validation

The system assembles known components (hybrid retrieval, chain-of-thought prompts, Text2SQL) into a practical pipeline and reports quantitative gains over standard RAG; human review is still needed for high-stakes outputs.

Citations: 0

Evidence Strength: 0.75

Confidence: 0.85

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 3/5

Reproducibility

Status: No open assets linked

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 50%

Authors

Zhixuan Cao, Ming Han, Jingtao Wang, Meng Jia

Why It Matters For Business

CarbonChat automates extraction and structured analysis of long sustainability reports and policy texts, cutting manual effort and providing traceable, SQL-queryable answers for decision-makers.

Who Should Care

CTO, Product Manager, ML Engineer, Data Scientist, Founder

Summary TLDR

This paper presents CarbonChat, a practical LLM-based system that reads long sustainability reports and policy documents, extracts structured data, runs hybrid retrieval, and answers corporate carbon-emissions questions. Key pieces: diversified document chunking, a Self-Prompting Retrieval-Augmented Generation (RAG) pipeline, and a schema-aware Text2SQL module with security checks. Experiments with a Qwen-Max backbone show improved text metrics (ROUGE and BERTScore) and strong Text2SQL results (execution accuracy, EX, 89.2%; exact match, EM, 79.9%). The system adds timestamps, source paths, and a hallucination tag to improve traceability but still requires human review for high-stakes use.
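The traceability fields mentioned above (timestamp, source path, hallucination tag) could be carried alongside each answer roughly as follows. This is a minimal sketch, not the paper's implementation: the class names `Passage` and `TracedAnswer`, the function `trace_answer`, and the number-matching heuristic used to set the tag are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass, field
from datetime import datetime, timezone

NUMBER = re.compile(r"\d+(?:\.\d+)?")

@dataclass
class Passage:
    """A retrieved chunk plus the metadata needed to trace it (hypothetical schema)."""
    text: str
    source_path: str
    retrieved_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

@dataclass
class TracedAnswer:
    text: str
    sources: list                  # (source_path, retrieved_at) pairs
    possible_hallucination: bool   # True if a number in the answer lacks support

def trace_answer(answer, passages):
    """Attach sources and flag numbers that appear in no retrieved passage.

    Assumption: a figure absent from every passage is suspicious. This is a
    crude heuristic, not the tagging rule used by CarbonChat.
    """
    evidence_numbers = set()
    for p in passages:
        evidence_numbers.update(NUMBER.findall(p.text))
    unsupported = [n for n in NUMBER.findall(answer) if n not in evidence_numbers]
    sources = [(p.source_path, p.retrieved_at) for p in passages]
    return TracedAnswer(answer, sources, bool(unsupported))
```

A reviewer can then filter answers with `possible_hallucination` set and open the listed source paths directly, which keeps the human-review step focused on the riskiest outputs.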

Problem Statement

Enterprises and analysts face long, complex sustainability reports and fragmented policy texts. Off-the-shelf LLMs lack up-to-date domain data and struggle with long documents, structured tables, and SQL-style queries, producing hallucinations and costly manual analysis. CarbonChat aims to automate extraction, structured questioning, and traceable answers aligned with the GHG Protocol.

Main Contribution

Diversified index module for document chunking and structured extraction (document tree, rule-based, semantic, paragraph/sliding window, table/image/formula handling).

Self-Prompting RAG architecture combining intent recognition, structured chain-of-thought prompts, hybrid BM25+embedding retrieval, re-ranking, and key-sentence extraction.
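To make the hybrid BM25+embedding retrieval step concrete, here is a small standard-library sketch. Several parts are assumptions: the summary does not say how the two rankings are fused, so reciprocal rank fusion is used as one common choice; character-trigram cosine similarity stands in for a real embedding model such as BGE-M3; and the function names are illustrative only. Re-ranking and key-sentence extraction are not shown.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of every document for the query."""
    toks = [tokenize(d) for d in docs]
    avgdl = sum(len(t) for t in toks) / len(toks)
    df = Counter(term for t in toks for term in set(t))
    scores = []
    for t in toks:
        tf = Counter(t)
        s = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (len(docs) - df[term] + 0.5) / (df[term] + 0.5))
            s += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(t) / avgdl)
            )
        scores.append(s)
    return scores

def trigram_cosine_scores(query, docs):
    """Stand-in for dense retrieval: cosine similarity over character trigrams."""
    def vec(text):
        s = text.lower()
        return Counter(s[i:i + 3] for i in range(len(s) - 2))
    def norm(v):
        return math.sqrt(sum(x * x for x in v.values()))
    q = vec(query)
    out = []
    for d in docs:
        v = vec(d)
        denom = norm(q) * norm(v)
        dot = sum(q[g] * v[g] for g in q)
        out.append(dot / denom if denom else 0.0)
    return out

def rrf_fuse(score_lists, k=60):
    """Reciprocal rank fusion: sum 1 / (k + rank) across the rankings."""
    fused = [0.0] * len(score_lists[0])
    for scores in score_lists:
        order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        for rank, i in enumerate(order, start=1):
            fused[i] += 1.0 / (k + rank)
    return sorted(range(len(fused)), key=lambda i: fused[i], reverse=True)

def hybrid_search(query, docs, top_k=3):
    """Return the top_k documents after fusing lexical and similarity rankings."""
    ranking = rrf_fuse([bm25_scores(query, docs), trigram_cosine_scores(query, docs)])
    return [docs[i] for i in ranking[:top_k]]
```

Rank-based fusion avoids having to calibrate BM25 scores against cosine similarities, which live on different scales; a weighted score sum is an alternative if both are normalized first.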

Key Findings

Self-Prompting RAG improves text-generation metrics vs standard RAG.

Numbers: Qwen-Max ROUGE-1 0.592 vs 0.529; BERTScore F1 0.906 vs 0.831

Practical Use: Combine structured chain-of-thought (CoT) prompts with hybrid retrieval to get summaries and answers on carbon reporting that score measurably higher on ROUGE and BERTScore.

Evidence Ref: Table 2

Text2SQL module produces high execution accuracy on table queries.

Numbers: Text2SQL EX 89.2%, EM 79.9%

Practical Use: A schema-aware Text2SQL step can convert natural-language questions into correct SQL for company databases in most cases; still validate critical queries.

Evidence Ref: Table 3
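The gap between EX and EM is easier to read with the two metrics side by side. The sketch below uses common definitions, which are an assumption here because the summary does not spell out the paper's scoring rules: EM compares normalized SQL strings, while EX compares the rows returned when both queries run. The `emissions` table and its rows are hypothetical.

```python
import sqlite3

def normalize(sql):
    """Lowercase, collapse whitespace, and drop a trailing semicolon."""
    return " ".join(sql.lower().rstrip("; \n").split())

def exact_match(pred, gold):
    """EM: the predicted SQL text equals the gold SQL text after normalization."""
    return normalize(pred) == normalize(gold)

def execution_match(conn, pred, gold):
    """EX: both queries return the same rows (order-insensitive)."""
    try:
        pred_rows = conn.execute(pred).fetchall()
    except sqlite3.Error:
        return False  # a query that fails to run cannot match
    gold_rows = conn.execute(gold).fetchall()
    return sorted(pred_rows, key=repr) == sorted(gold_rows, key=repr)

def demo_db():
    """Build a tiny in-memory database with made-up figures."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE emissions (company TEXT, year INTEGER, scope1 REAL)")
    conn.executemany(
        "INSERT INTO emissions VALUES (?, ?, ?)",
        [("Acme", 2023, 120.5), ("Acme", 2022, 130.0), ("Birch", 2023, 80.0)],
    )
    return conn

GOLD = "SELECT scope1 FROM emissions WHERE company = 'Acme' AND year = 2023"
PRED = "SELECT scope1 FROM emissions WHERE year = 2023 AND company = 'Acme'"
```

Here `PRED` reorders the `WHERE` conditions, so it fails exact match but passes execution match. Differences of this kind are one plausible reason an EX score sits above an EM score.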

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
ROUGE-1 (Qwen-Max, Self-Prompting RAG)	0.592	Standard RAG 0.529	+0.063	Text content evaluation (Table 2)	Table 2 shows Qwen-Max Self-Prompting RAG ROUGE-1 0.592	Table 2
BERTScore F1 (Qwen-Max, Self-Prompting RAG)	0.906	Standard RAG 0.831	+0.075	Text content evaluation (Table 2)	Table 2 reports BERTScore F1 0.906 for Qwen-Max Self-Prompting RAG	Table 2

What To Try In 7 Days

Run hybrid retrieval (BM25 + embeddings) over a few company reports to compare precision vs BM25-only.

Prototype a Text2SQL path for one internal database table with schema-aware prompts and whitelist checks.

Add source-path and timestamp fields to retrieved passages so every LLM answer can be traced back to a document and version.
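For the Text2SQL prototype in the list above, a whitelist check can sit between the model and the database. The following is a rough sketch under stated assumptions: the table name in `ALLOWED_TABLES` is hypothetical, the check is regex-based rather than a real SQL parser, and it is deliberately conservative (for example, it rejects any query containing a semicolon inside a string literal). It is not the security check described in the paper.

```python
import re

ALLOWED_TABLES = {"emissions"}  # hypothetical whitelist for one internal table

FORBIDDEN = re.compile(
    r"\b(insert|update|delete|drop|alter|create|attach|pragma)\b", re.IGNORECASE
)
TABLE_REF = re.compile(r"\b(?:from|join)\s+([A-Za-z_][A-Za-z0-9_]*)", re.IGNORECASE)

def is_safe_select(sql, allowed_tables=ALLOWED_TABLES):
    """Accept only a single read-only SELECT over whitelisted tables."""
    stmt = sql.strip().rstrip(";").strip()
    if ";" in stmt:                          # stacked statements
        return False
    if not re.match(r"select\b", stmt, re.IGNORECASE):
        return False
    if FORBIDDEN.search(stmt):               # write or schema keywords anywhere
        return False
    tables = TABLE_REF.findall(stmt)
    return bool(tables) and all(t.lower() in allowed_tables for t in tables)
```

Running accepted queries through a read-only database connection adds a second layer, since a keyword filter alone can miss dialect-specific statements.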

Optimization Features

System Optimization

Prompt engineering: progressive context trimming when prompt > 3000 tokens
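One way to read "progressive context trimming" is to drop the least relevant passages one at a time until the prompt fits the budget. The sketch below follows that reading, which is an assumption: the summary gives only the 3000-token threshold, not the trimming rule. Whitespace word counts stand in for a real tokenizer, and the function names are illustrative.

```python
def count_tokens(text):
    """Crude proxy: whitespace-separated words, not real model tokens."""
    return len(text.split())

def trim_context(question, passages, budget=3000):
    """Drop passages from the tail until question + context fits the budget.

    Assumes `passages` is ordered most relevant first, so the tail holds the
    least useful context.
    """
    kept = list(passages)

    def prompt_length():
        return count_tokens(question) + sum(count_tokens(p) for p in kept)

    while kept and prompt_length() > budget:
        kept.pop()  # remove the lowest-ranked remaining passage
    return kept
```

Swapping `count_tokens` for the backbone model's own tokenizer would make the budget match what the model actually sees.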

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Partial

License: Unknown

Risks & Boundaries

Limitations

Evaluation uses a Qwen-Max backbone and a static vector DB; effectiveness with other LLMs or live DBs is not fully shown.

The paper focuses on a Chinese policy corpus and public reports; cross-jurisdiction generalization is untested.

When Not To Use

Do not use as sole evidence for legal or financial audits without expert verification.

Avoid relying on it for real-time regulatory compliance decisions until live-data handling and governance are validated.

Failure Modes

LLM-generated SQL that is syntactically valid but semantically wrong for business logic.

Table extraction errors from complex PDFs (merged cells, broken headers) causing incorrect answers.

Core Entities

Models

Qwen-Max, ChatGPT-4o-2024-05-13, GLM-4, Spark 4.0 Ultra, Baidu ERNIE-4.0-Turbo, Llama-3.1-70B-Instruct, BGE-M3-Embedding, BGE-reranker-large

Metrics

ROUGE-1, ROUGE-2, ROUGE-L, BERTScore Precision, BERTScore Recall, BERTScore F1, EX, EM

Datasets

1000 Chinese policy/regulatory docs (2018-2024); 1180 QA pairs (test set); 100 corporate environmental reports; 2,133 table-aware QA annotated pairs

Benchmarks

ROUGE, BERTScore, Accuracy, Exact Match (EM)
