KemenkeuGPT: a LangChain+RAG LLM for Indonesian finance that raised accuracy from 35% to 61%

Overview

Decision SnapshotNeeds Validation

The system demonstrates a practical pipeline (RAG + prompt engineering + fine-tuning) and measurable gains on evaluated Q&A, but accuracy and data coverage remain moderate and require ongoing human-in-the-loop processes.

Citations0

Evidence Strength0.70

Confidence0.88

Risk Signals12

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: No open assets linked

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 40%

Novelty: 40%

Authors

Gilang Fajar Febrian, Grazziela Figueredo

Links

Abstract / PDF

Why It Matters For Business

KemenkeuGPT can speed retrieval of government finance rules and data, reducing manual search time and helping staff make faster, evidence-based decisions while still needing human verification and more data coverage.

Who Should Care

CTO Product Manager ML Engineer Engineering Lead Data Scientist CEO

Summary TLDR

This paper builds KemenkeuGPT, an application that combines LangChain, Retrieval-Augmented Generation (RAG), prompt engineering and fine-tuning to answer questions about Indonesian government finance and regulations. The team collected public data (2003–2023), ~180k aggregated transactions, and ~3k curated Q&A pairs, used GPT-3.5 Turbo as the base model, and iteratively improved the system with stakeholder feedback. Human-evaluated accuracy rose from 35% → 42% (added docs) → 60% (prompt engineering) → 61% (fine-tuning). RAGAS benchmarking shows correctness 0.44, faithfulness 0.73, precision 0.40, recall 0.60. The system is useful for speeding information access but still needs more data, aUI

Problem Statement

Government financial data and rules are large, changing, and dispersed. Manual searches are slow. Off-the-shelf LLMs give general answers and can hallucinate. The problem: make an LLM that returns accurate, sourced answers for Indonesian finance and can be iteratively improved with human feedback.

Main Contribution

Built KemenkeuGPT: a LangChain + RAG application focused on Indonesian government finance and regulations.

Assembled a dataset from 2003–2023 including 180,000 aggregated transactions and ~3,000 curated Q&A pairs for training and fine-tuning.

Key Findings

Iterative improvements increased human-evaluated accuracy from 35% to 61%.

Numbers35% → 42% → 60% → 61% (Table 8, human eval)

Practical UseAdd domain docs, refine prompts, and fine-tune: expect sizable accuracy gains but not perfect results.

Evidence RefSection 5, Table 8

LLM-based correctness rose from 48% (base) to 64% (KemenkeuGPT).

Numbers48% → 64% (LLM-based eval, Fig.6)

Practical UseFine-tuning plus RAG increases factual match to ground truth on evaluated items; still requires human checks.

Evidence RefSection 5, Figure 6

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	Base 35% → RAG 42% → Prompt eng. 60% → Fine-tuned 61%	Base model accuracy 35%	+26 percentage points (35%→61%)	Human-annotated Q&A set (survey + scraped Q&A)	Section 4.3 Table 5–7; Section 5 Table 8	Table 8
LLM-based correctness	Base 48% → KemenkeuGPT 64%	Base model GPT-3.5 Turbo	+16 percentage points	LLM string-evaluator (LangChain)	Section 5, Figure 6	Figure 6

What To Try In 7 Days

Run a small LangChain+Pinecone RAG demo with 100–500 domain docs.

Collect ~200 representative Q&A pairs and use them for instruction tuning or few-shot prompts.

Add a simple feedback button and route responses back for expert review within your team.

Optimization Features

Infra Optimization

Experimented with Google Cloud Storage and scalable vector DB (Pinecone)

Model Optimization

Fine-tuning GPT-3.5 Turbo on curated Q&A

System Optimization

Use of vector DBs (Pinecone/Chroma) and chunked retrieval

Training Optimization

SFT

Inference Optimization

Prompt engineering to format outputs and enforce language matching

Reproducibility

Code AvailableNo

Data AvailableNo

Open Source StatusPartial

LicenseUnknown

Risks & Boundaries

Limitations

Model trained on partial and aggregate government data; detailed transactions unavailable due to data access limits.

System cannot auto-update data; manual uploads required.

When Not To Use

For fully automated, unreviewed policy decisions or legal rulings.

When real-time transaction-level details are required (data not available).

Failure Modes

Hallucinated answers despite high faithfulness score on some queries.

Outdated responses if dataset is not regularly refreshed.

Core Entities

Models

OpenAI GPT-3.5 TurboLlama-2-7b-chat-hfTitan Text G1 ExpressTitan Text G1 LiteMistral 7BMixtral 8X7B InstructJurassic-2 Ultraft:gpt-3.5-turbo-0125:personal::9RAg3mNo (fine-tuned)

Metrics

Accuracycorrectnessfaithfulnessprecisionrecallresponse time (s)

Datasets

Ministry of Finance of Republic of Indonesia documents (2003–2023)Statistics Indonesia publicationsIMF dataMinistry proprietary aggregated transactions (2014–2023, ~180k records)Scraped Q&A from Ministry website (~1,688 valid pairs)Directorate General of Customs and Excise Q&A (~1,299 pairs)

Benchmarks

RAGAS (faithfulness, correctness, precision, recall)Human evaluation (Ministry staff)

Overview

Trust Signals

Reproducibility

At A Glance

Authors

Links

Why It Matters For Business

Who Should Care

Summary TLDR

Problem Statement

Main Contribution

Key Findings

Iterative improvements increased human-evaluated accuracy from 35% to 61%.

LLM-based correctness rose from 48% (base) to 64% (KemenkeuGPT).

Results

What To Try In 7 Days

Optimization Features

Reproducibility

Risks & Boundaries

Limitations

When Not To Use

Failure Modes

Core Entities

Models

Metrics

Datasets

Benchmarks

You May Also Want to Read

Bi'an: a bilingual RAG hallucination benchmark plus small fine-tuned judge models

Key finding

MultiHal: a multilingual, Wikidata-grounded benchmark that uses KG paths to evaluate and reduce LLM hallucinations

Key finding

DiaHalu: 1,103 multi-turn dialogues to test hallucination in chat-style LLMs

Key finding

An open leaderboard that measures LLM hallucinations across 15 tasks and 20 models

Key finding

LLMs (GPT-3.5, GPT-4, PaLM-2) do not reliably judge factuality on the FRANK benchmark

Key finding