KemenkeuGPT: a LangChain+RAG LLM for Indonesian finance that raised accuracy from 35% to 61%

July 31, 20248 min

Overview

Production Readiness

0.4

Novelty Score

0.4

Cost Impact Score

0.5

Citation Count

0

Authors

Gilang Fajar Febrian, Grazziela Figueredo

Links

Abstract / PDF

Why It Matters For Business

KemenkeuGPT can speed retrieval of government finance rules and data, reducing manual search time and helping staff make faster, evidence-based decisions while still needing human verification and more data coverage.

Summary TLDR

This paper builds KemenkeuGPT, an application that combines LangChain, Retrieval-Augmented Generation (RAG), prompt engineering and fine-tuning to answer questions about Indonesian government finance and regulations. The team collected public data (2003–2023), ~180k aggregated transactions, and ~3k curated Q&A pairs, used GPT-3.5 Turbo as the base model, and iteratively improved the system with stakeholder feedback. Human-evaluated accuracy rose from 35% → 42% (added docs) → 60% (prompt engineering) → 61% (fine-tuning). RAGAS benchmarking shows correctness 0.44, faithfulness 0.73, precision 0.40, recall 0.60. The system is useful for speeding information access but still needs more data, aUI

Problem Statement

Government financial data and rules are large, changing, and dispersed. Manual searches are slow. Off-the-shelf LLMs give general answers and can hallucinate. The problem: make an LLM that returns accurate, sourced answers for Indonesian finance and can be iteratively improved with human feedback.

Main Contribution

Built KemenkeuGPT: a LangChain + RAG application focused on Indonesian government finance and regulations.

Assembled a dataset from 2003–2023 including 180,000 aggregated transactions and ~3,000 curated Q&A pairs for training and fine-tuning.

Iterative pipeline: compare base LLMs, add documents to RAG, apply prompt engineering, fine-tune GPT-3.5 Turbo, and collect human feedback.

Measured performance with human evaluation, LLM-based checks, and the RAGAS benchmark (faithfulness, correctness, precision, recall).

Delivered a multi-platform UI and a feedback loop for continuous improvement.

Key Findings

Iterative improvements increased human-evaluated accuracy from 35% to 61%.

Numbers35% → 42% → 60% → 61% (Table 8, human eval)

LLM-based correctness rose from 48% (base) to 64% (KemenkeuGPT).

Numbers48% → 64% (LLM-based eval, Fig.6)

RAGAS benchmark: KemenkeuGPT scored 0.44 correctness, 0.73 faithfulness, 0.40 precision, 0.60 recall.

Numberscorrectness 0.44; faithfulness 0.73; precision 0.40; recall 0.60 (Table 10)

Adding documents increased response time from ~1.53s to ~10.93s during prompt-engineered stage.

NumbersResponse time: 1.529s (base) → 2.269s (RAG) → 10.927s (prompt engine) → 10.526s (fine-tuned) (Table 9)

GPT-3.5 Turbo (fine-tuned) outperformed six other tested LLMs on this task.

NumbersGPT-3.5 Turbo final accuracy 61%; other base models 4%–35% (Table 4, Table 7)

Results

Accuracy

ValueBase 35% → RAG 42% → Prompt eng. 60% → Fine-tuned 61%

BaselineBase model accuracy 35%

LLM-based correctness

ValueBase 48% → KemenkeuGPT 64%

BaselineBase model GPT-3.5 Turbo

RAGAS scores (KemenkeuGPT)

ValueCorrectness 0.44; Faithfulness 0.73; Precision 0.40; Recall 0.60

BaselineOther models 0.19–0.42 correctness; faithfulness 0.20–0.71

Response time

Value1.53s (base) → 2.27s (RAG) → 10.93s (prompt engineering) → 10.53s (fine-tuned)

BaselineBase 1.53s

Who Should Care

What To Try In 7 Days

Run a small LangChain+Pinecone RAG demo with 100–500 domain docs.

Collect ~200 representative Q&A pairs and use them for instruction tuning or few-shot prompts.

Add a simple feedback button and route responses back for expert review within your team.

Optimization Features

Infra Optimization

  • Experimented with Google Cloud Storage and scalable vector DB (Pinecone)

Model Optimization

  • Fine-tuning GPT-3.5 Turbo on curated Q&A

System Optimization

  • Use of vector DBs (Pinecone/Chroma) and chunked retrieval

Training Optimization

  • SFT

Inference Optimization

  • Prompt engineering to format outputs and enforce language matching

Reproducibility

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Model trained on partial and aggregate government data; detailed transactions unavailable due to data access limits.
  • System cannot auto-update data; manual uploads required.
  • Accuracy remains below 70% on evaluated items; not yet reliable for fully automated decisions.
  • Response latency increases with a larger retrieval corpus and complex prompt pipelines.
  • User interface is basic and limited to web/mobile views.

When Not To Use

  • For fully automated, unreviewed policy decisions or legal rulings.
  • When real-time transaction-level details are required (data not available).
  • As the single source of truth without human expert verification.

Failure Modes

  • Hallucinated answers despite high faithfulness score on some queries.
  • Outdated responses if dataset is not regularly refreshed.
  • Missing or incomplete retrieval context leading to incorrect answers.
  • Slower responses as the RAG corpus grows or prompts become heavier.

Core Entities

Models

  • OpenAI GPT-3.5 Turbo
  • Llama-2-7b-chat-hf
  • Titan Text G1 Express
  • Titan Text G1 Lite
  • Mistral 7B
  • Mixtral 8X7B Instruct
  • Jurassic-2 Ultra
  • ft:gpt-3.5-turbo-0125:personal::9RAg3mNo (fine-tuned)

Metrics

  • Accuracy
  • correctness
  • faithfulness
  • precision
  • recall
  • response time (s)

Datasets

  • Ministry of Finance of Republic of Indonesia documents (2003–2023)
  • Statistics Indonesia publications
  • IMF data
  • Ministry proprietary aggregated transactions (2014–2023, ~180k records)
  • Scraped Q&A from Ministry website (~1,688 valid pairs)
  • Directorate General of Customs and Excise Q&A (~1,299 pairs)

Benchmarks

  • RAGAS (faithfulness, correctness, precision, recall)
  • Human evaluation (Ministry staff)