Overview
The system demonstrates a practical pipeline (RAG + prompt engineering + fine-tuning) and measurable gains on evaluated Q&A, but accuracy and data coverage remain moderate and require ongoing human-in-the-loop processes.
Citations: 0
Evidence Strength: 0.70
Confidence: 0.88
Risk Signals: 12
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 4/4
Reproducibility
Status: No open assets linked
Open source: Partial
At A Glance
Cost impact: 50%
Production readiness: 40%
Novelty: 40%
Why It Matters For Business
KemenkeuGPT can speed retrieval of government finance rules and data, reducing manual search time and helping staff make faster, evidence-based decisions. Its answers still need human verification, and its data coverage needs to grow.
Summary TLDR
This paper builds KemenkeuGPT, an application that combines LangChain, Retrieval-Augmented Generation (RAG), prompt engineering and fine-tuning to answer questions about Indonesian government finance and regulations. The team collected public data (2003–2023), ~180k aggregated transactions, and ~3k curated Q&A pairs, used GPT-3.5 Turbo as the base model, and iteratively improved the system with stakeholder feedback. Human-evaluated accuracy rose from 35% → 42% (added docs) → 60% (prompt engineering) → 61% (fine-tuning). RAGAS benchmarking shows correctness 0.44, faithfulness 0.73, precision 0.40, recall 0.60. The system is useful for speeding information access but still needs more data, aUI
Problem Statement
Government financial data and rules are large, changing, and dispersed. Manual searches are slow. Off-the-shelf LLMs give general answers and can hallucinate. The problem: make an LLM that returns accurate, sourced answers for Indonesian finance and can be iteratively improved with human feedback.
Main Contribution
Built KemenkeuGPT: a LangChain + RAG application focused on Indonesian government finance and regulations.
Assembled a dataset from 2003–2023 including 180,000 aggregated transactions and ~3,000 curated Q&A pairs for training and fine-tuning.
Key Findings
Iterative improvements increased human-evaluated accuracy from 35% to 61%.
LLM-based correctness rose from 48% (base) to 64% (KemenkeuGPT).
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | Base 35% → RAG 42% → Prompt eng. 60% → Fine-tuned 61% | Base model accuracy 35% | +26 percentage points (35%→61%) | Human-annotated Q&A set (survey + scraped Q&A) | Section 4.3 Table 5–7; Section 5 Table 8 | Table 8 |
| LLM-based correctness | Base 48% → KemenkeuGPT 64% | Base model GPT-3.5 Turbo | +16 percentage points | LLM string-evaluator (LangChain) | Section 5, Figure 6 | Figure 6 |
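
The cumulative +26 point delta in the table does not show how unevenly the gain is spread across stages. The short sketch below recomputes the per-stage gains from the reported accuracy figures only. The stage labels and the helper name `stage_deltas` are illustrative choices, not taken from the paper.

```python
# Human-evaluated accuracy (percent) at each stage, as reported in the table above.
# Stage labels are paraphrased for illustration.
stages = [
    ("Base GPT-3.5 Turbo", 35),
    ("+ added documents (RAG)", 42),
    ("+ prompt engineering", 60),
    ("+ fine-tuning", 61),
]


def stage_deltas(stages):
    """Return (stage name, gain in percentage points over the previous stage)."""
    return [
        (name, acc - prev_acc)
        for (_, prev_acc), (name, acc) in zip(stages, stages[1:])
    ]


deltas = stage_deltas(stages)
total_gain = stages[-1][1] - stages[0][1]

for name, gain in deltas:
    print(f"{name}: +{gain} pp")
print(f"Total: +{total_gain} pp")
```

On these figures, prompt engineering accounts for 18 of the 26 points, added documents for 7, and fine-tuning for 1.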
What To Try In 7 Days
Run a small LangChain+Pinecone RAG demo with 100–500 domain docs.
Collect ~200 representative Q&A pairs and use them for instruction tuning or few-shot prompts.
Add a simple feedback button and route responses back for expert review within your team.
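
The three steps above can be sketched end to end without any external services. This is a minimal illustration, not the paper's implementation: it swaps the LangChain+Pinecone vector store for a standard-library bag-of-words cosine retriever, and every document, Q&A pair, and function name in it (`retrieve`, `build_prompt`, `feedback_record`) is a hypothetical placeholder.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, docs, k=2):
    """Return the k docs most similar to the question (stand-in for a vector store)."""
    q_vec = Counter(tokenize(question))
    return sorted(
        docs,
        key=lambda d: cosine(q_vec, Counter(tokenize(d["text"]))),
        reverse=True,
    )[:k]


def build_prompt(question, passages, examples):
    """Assemble a few-shot prompt that asks for answers grounded in cited passages."""
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
    context = "\n".join(f"[{p['id']}] {p['text']}" for p in passages)
    return (
        "Answer using only the context and cite passage ids. "
        "If the context is insufficient, say so.\n\n"
        f"{shots}\n\nContext:\n{context}\n\nQ: {question}\nA:"
    )


def feedback_record(question, answer, helpful):
    """Create a record for the feedback button; unhelpful answers go to expert review."""
    return {
        "question": question,
        "answer": answer,
        "helpful": helpful,
        "needs_expert_review": not helpful,
    }


# Hypothetical placeholder corpus and few-shot example (not real regulations).
docs = [
    {"id": "doc-1", "text": "Placeholder passage about budget revision procedures for ministries."},
    {"id": "doc-2", "text": "Placeholder passage about value added tax reporting deadlines."},
    {"id": "doc-3", "text": "Placeholder passage about procurement thresholds."},
]
examples = [("Placeholder question?", "Placeholder answer [doc-0].")]

question = "What are the reporting deadlines for value added tax?"
top = retrieve(question, docs)
prompt = build_prompt(question, top, examples)
print(prompt)
```

In a real trial, the `prompt` string would be sent to the chosen model, and the records from `feedback_record` would be queued for the team's domain experts.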
Risks & Boundaries
Limitations
The model was trained on partial, aggregated government data; transaction-level detail was unavailable because of data access limits.
System cannot auto-update data; manual uploads required.
When Not To Use
For fully automated, unreviewed policy decisions or legal rulings.
When real-time transaction-level details are required (data not available).
Failure Modes
Hallucinated answers on some queries, despite a relatively high overall faithfulness score (0.73).
Outdated responses if dataset is not regularly refreshed.
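
One way to guard against the outdated-response failure mode, given that data uploads are manual, is to flag answers that cite sources past a freshness cutoff and send them for human review. The sketch below is an assumption-laden illustration: the 90-day threshold, the `last_updated` field, the source ids, and both function names are hypothetical and do not come from the paper.

```python
from datetime import date, timedelta


def stale_sources(sources, today, max_age_days=90):
    """Return ids of sources last updated more than max_age_days before today.

    The 90-day default is an arbitrary illustrative threshold.
    """
    cutoff = today - timedelta(days=max_age_days)
    return [s["id"] for s in sources if s["last_updated"] < cutoff]


def needs_human_review(cited_ids, stale_ids):
    """An answer needs review if it cites at least one stale source."""
    return any(cid in stale_ids for cid in cited_ids)


# Hypothetical source metadata recorded at manual upload time.
sources = [
    {"id": "reg-a", "last_updated": date(2023, 12, 15)},
    {"id": "reg-b", "last_updated": date(2023, 6, 1)},
]

stale = stale_sources(sources, today=date(2024, 1, 1))
print(stale)
print(needs_human_review(["reg-a"], stale))
print(needs_human_review(["reg-a", "reg-b"], stale))
```

Passing `today` explicitly keeps the check deterministic and easy to test; a deployment would typically pass `date.today()`.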

