CarbonChat: LLM system for corporate carbon-emissions analysis using hybrid RAG and Text2SQL

January 3, 20257 min

Overview

Production Readiness

0.6

Novelty Score

0.5

Cost Impact Score

0.6

Citation Count

0

Authors

Zhixuan Cao, Ming Han, Jingtao Wang, Meng Jia

Links

Abstract / PDF

Why It Matters For Business

CarbonChat automates extraction and structured analysis of long sustainability reports and policy texts, cutting manual effort and providing traceable, SQL-queryable answers for decision-makers.

Summary TLDR

This paper presents CarbonChat, a practical LLM-based system that reads long sustainability reports and policy documents, extracts structured data, runs hybrid retrieval, and answers corporate carbon-emissions questions. Key pieces: diversified document chunking, a Self-Prompting Retrieval-Augmented-Generation (RAG) pipeline, and a schema-aware Text2SQL module with security checks. Experiments (Qwen-Max backbone) show improved text metrics (ROUGE/L and BERTScore) and high Text2SQL execution accuracy (EX 89.2%, EM 79.9%). The system adds timestamps, source paths and a hallucination tag to improve traceability but still requires human review for high-stakes use.

Problem Statement

Enterprises and analysts face long, complex sustainability reports and fragmented policy texts. Off-the-shelf LLMs lack up-to-date domain data and struggle with long documents, structured tables, and SQL-style queries, producing hallucinations and costly manual analysis. CarbonChat aims to automate extraction, structured questioning, and traceable answers aligned with the GHG Protocol.

Main Contribution

Diversified index module for document chunking and structured extraction (document tree, rule-based, semantic, paragraph/sliding window, table/image/formula handling).

Self-Prompting RAG architecture combining intent recognition, structured chain-of-thought prompts, hybrid BM25+embedding retrieval, re-ranking, and key-sentence extraction.

Schema-aware Text2SQL pipeline: LLM-guided SQL generation, schema matching, security validation (whitelists, syntax checks), and automated SQL repair.

14-dimension GHG Protocol-based analysis framework for report summarization, compliance scoring (0–100), and customized Q&A with source tracing.

Practical features for verifiability: timestamps, version control, multi-layer chunking, and hallucination tagging linked to source sentences.

Key Findings

Self-Prompting RAG improves text-generation metrics vs standard RAG.

NumbersQwen-Max ROUGE-1 0.592 vs 0.529; BERTScore F1 0.906 vs 0.831

Text2SQL module produces high execution accuracy on table queries.

NumbersText2SQL EX 89.2%, EM 79.9%

Ablations show core modules strongly affect SQL accuracy.

NumbersRemoving time/table loc drops EX by 21.7% and EM by 19.6%; removing few-shot COT drops EX by 15.8% and EM by 17.3%

Results

ROUGE-1 (Qwen-Max, Self-Prompting RAG)

Value0.592

BaselineStandard RAG 0.529

BERTScore F1 (Qwen-Max, Self-Prompting RAG)

Value0.906

BaselineStandard RAG 0.831

Accuracy

Value89.2%

Text2SQL Exact Match (EM)

Value79.9%

Ablation — remove time/table localization

ValueEX 67.5%, EM 60.3%

BaselineEX 89.2%, EM 79.9%

Who Should Care

What To Try In 7 Days

Run hybrid retrieval (BM25 + embeddings) over a few company reports to compare precision vs BM25-only.

Prototype a Text2SQL path for one internal database table with schema-aware prompts and whitelist checks.

Add source-path and timestamp fields to retrieved passages so every LLM answer can be traced back to a document and version.

Optimization Features

System Optimization

  • prompt engineering
  • progressive context trimming when prompt > 3000 tokens

Reproducibility

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Evaluation uses Qwen-Max backbone and static vector DB; effectiveness with other LLMs or live DBs is not fully shown.
  • Paper focuses on Chinese policy corpus and public reports; cross-jurisdiction generalization is untested.
  • Hallucination tagging reduces but does not eliminate incorrect or unverifiable claims; human verification still required.
  • Planned code/data release is pending, limiting immediate reproducibility.

When Not To Use

  • Do not use as sole evidence for legal or financial audits without expert verification.
  • Avoid relying on it for real-time regulatory compliance decisions until live-data and governance are validated.
  • Not suitable when company databases lack consistent schemas or have heavily redacted/confidential fields.

Failure Modes

  • LLM-generated SQL that is syntactically valid but semantically wrong for business logic.
  • Table extraction errors from complex PDFs (merged cells, broken headers) causing incorrect answers.
  • Outdated policy or report versions producing stale guidance if timestamps are missing or misapplied.
  • Hallucinated assertions not covered by source passages despite tagging heuristics.

Core Entities

Models

  • Qwen-Max
  • ChatGPT-4o-2024-05-13
  • GLM-4
  • Spark 4.0 Ultra
  • Baidu ERNIE-4.0-Turbo
  • Llama-3.1-70B-Instruct
  • BGE-M3-Embedding
  • BGE-reranker-large

Metrics

  • ROUGE-1
  • ROUGE-2
  • ROUGE-L
  • BERTScore Precision
  • BERTScore Recall
  • BERTScore F1
  • EX
  • EM

Datasets

  • 1000 Chinese policy/regulatory docs (2018-2024)
  • 1180 QA pairs (test set)
  • 100 corporate environmental reports
  • 2,133 table-aware QA annotated pairs

Benchmarks

  • ROUGE
  • BERTScore
  • Accuracy
  • Exact Match (EM)