KemenkeuGPT: a LangChain+RAG LLM for Indonesian finance that raised accuracy from 35% to 61%

July 31, 20248 min

Overview

Decision SnapshotNeeds Validation

The system demonstrates a practical pipeline (RAG + prompt engineering + fine-tuning) and measurable gains on evaluated Q&A, but accuracy and data coverage remain moderate and require ongoing human-in-the-loop processes.

Citations0

Evidence Strength0.70

Confidence0.88

Risk Signals12

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: No open assets linked

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 40%

Novelty: 40%

Authors

Gilang Fajar Febrian, Grazziela Figueredo

Links

Abstract / PDF

Why It Matters For Business

KemenkeuGPT can speed retrieval of government finance rules and data, reducing manual search time and helping staff make faster, evidence-based decisions while still needing human verification and more data coverage.

Who Should Care

Summary TLDR

This paper builds KemenkeuGPT, an application that combines LangChain, Retrieval-Augmented Generation (RAG), prompt engineering and fine-tuning to answer questions about Indonesian government finance and regulations. The team collected public data (2003–2023), ~180k aggregated transactions, and ~3k curated Q&A pairs, used GPT-3.5 Turbo as the base model, and iteratively improved the system with stakeholder feedback. Human-evaluated accuracy rose from 35% → 42% (added docs) → 60% (prompt engineering) → 61% (fine-tuning). RAGAS benchmarking shows correctness 0.44, faithfulness 0.73, precision 0.40, recall 0.60. The system is useful for speeding information access but still needs more data, aUI

Problem Statement

Government financial data and rules are large, changing, and dispersed. Manual searches are slow. Off-the-shelf LLMs give general answers and can hallucinate. The problem: make an LLM that returns accurate, sourced answers for Indonesian finance and can be iteratively improved with human feedback.

Main Contribution

Built KemenkeuGPT: a LangChain + RAG application focused on Indonesian government finance and regulations.

Assembled a dataset from 2003–2023 including 180,000 aggregated transactions and ~3,000 curated Q&A pairs for training and fine-tuning.

Key Findings

Iterative improvements increased human-evaluated accuracy from 35% to 61%.

Numbers35%42%60%61% (Table 8, human eval)

Practical UseAdd domain docs, refine prompts, and fine-tune: expect sizable accuracy gains but not perfect results.

Evidence RefSection 5, Table 8

LLM-based correctness rose from 48% (base) to 64% (KemenkeuGPT).

Numbers48%64% (LLM-based eval, Fig.6)

Practical UseFine-tuning plus RAG increases factual match to ground truth on evaluated items; still requires human checks.

Evidence RefSection 5, Figure 6

Results

MetricValueBaselineDeltaSplit / DatasetEvidenceEvidence Ref
AccuracyBase 35% → RAG 42% → Prompt eng. 60% → Fine-tuned 61%Base model accuracy 35%+26 percentage points (35%61%)Human-annotated Q&A set (survey + scraped Q&A)Section 4.3 Table 5–7; Section 5 Table 8Table 8
LLM-based correctnessBase 48% → KemenkeuGPT 64%Base model GPT-3.5 Turbo+16 percentage pointsLLM string-evaluator (LangChain)Section 5, Figure 6Figure 6

What To Try In 7 Days

Run a small LangChain+Pinecone RAG demo with 100–500 domain docs.

Collect ~200 representative Q&A pairs and use them for instruction tuning or few-shot prompts.

Add a simple feedback button and route responses back for expert review within your team.

Optimization Features

Infra Optimization
Experimented with Google Cloud Storage and scalable vector DB (Pinecone)
Model Optimization
Fine-tuning GPT-3.5 Turbo on curated Q&A
System Optimization
Use of vector DBs (Pinecone/Chroma) and chunked retrieval
Training Optimization
SFT
Inference Optimization
Prompt engineering to format outputs and enforce language matching

Reproducibility

Code AvailableNo
Data AvailableNo
Open Source StatusPartial
LicenseUnknown

Risks & Boundaries

Limitations

Model trained on partial and aggregate government data; detailed transactions unavailable due to data access limits.

System cannot auto-update data; manual uploads required.

When Not To Use

For fully automated, unreviewed policy decisions or legal rulings.

When real-time transaction-level details are required (data not available).

Failure Modes

Hallucinated answers despite high faithfulness score on some queries.

Outdated responses if dataset is not regularly refreshed.

Core Entities

Models

OpenAI GPT-3.5 TurboLlama-2-7b-chat-hfTitan Text G1 ExpressTitan Text G1 LiteMistral 7BMixtral 8X7B InstructJurassic-2 Ultraft:gpt-3.5-turbo-0125:personal::9RAg3mNo (fine-tuned)

Metrics

Accuracycorrectnessfaithfulnessprecisionrecallresponse time (s)

Datasets

Ministry of Finance of Republic of Indonesia documents (2003–2023)Statistics Indonesia publicationsIMF dataMinistry proprietary aggregated transactions (2014–2023, ~180k records)Scraped Q&A from Ministry website (~1,688 valid pairs)Directorate General of Customs and Excise Q&A (~1,299 pairs)

Benchmarks

RAGAS (faithfulness, correctness, precision, recall)Human evaluation (Ministry staff)