Overview
Production Readiness
0.4
Novelty Score
0.4
Cost Impact Score
0.5
Citation Count
0
Why It Matters For Business
KemenkeuGPT can speed retrieval of government finance rules and data, reducing manual search time and helping staff make faster, evidence-based decisions while still needing human verification and more data coverage.
Summary TLDR
This paper builds KemenkeuGPT, an application that combines LangChain, Retrieval-Augmented Generation (RAG), prompt engineering and fine-tuning to answer questions about Indonesian government finance and regulations. The team collected public data (2003–2023), ~180k aggregated transactions, and ~3k curated Q&A pairs, used GPT-3.5 Turbo as the base model, and iteratively improved the system with stakeholder feedback. Human-evaluated accuracy rose from 35% → 42% (added docs) → 60% (prompt engineering) → 61% (fine-tuning). RAGAS benchmarking shows correctness 0.44, faithfulness 0.73, precision 0.40, recall 0.60. The system is useful for speeding information access but still needs more data, aUI
Problem Statement
Government financial data and rules are large, changing, and dispersed. Manual searches are slow. Off-the-shelf LLMs give general answers and can hallucinate. The problem: make an LLM that returns accurate, sourced answers for Indonesian finance and can be iteratively improved with human feedback.
Main Contribution
Built KemenkeuGPT: a LangChain + RAG application focused on Indonesian government finance and regulations.
Assembled a dataset from 2003–2023 including 180,000 aggregated transactions and ~3,000 curated Q&A pairs for training and fine-tuning.
Iterative pipeline: compare base LLMs, add documents to RAG, apply prompt engineering, fine-tune GPT-3.5 Turbo, and collect human feedback.
Measured performance with human evaluation, LLM-based checks, and the RAGAS benchmark (faithfulness, correctness, precision, recall).
Delivered a multi-platform UI and a feedback loop for continuous improvement.
Key Findings
Iterative improvements increased human-evaluated accuracy from 35% to 61%.
LLM-based correctness rose from 48% (base) to 64% (KemenkeuGPT).
RAGAS benchmark: KemenkeuGPT scored 0.44 correctness, 0.73 faithfulness, 0.40 precision, 0.60 recall.
Adding documents increased response time from ~1.53s to ~10.93s during prompt-engineered stage.
GPT-3.5 Turbo (fine-tuned) outperformed six other tested LLMs on this task.
Results
Accuracy
LLM-based correctness
RAGAS scores (KemenkeuGPT)
Response time
Who Should Care
What To Try In 7 Days
Run a small LangChain+Pinecone RAG demo with 100–500 domain docs.
Collect ~200 representative Q&A pairs and use them for instruction tuning or few-shot prompts.
Add a simple feedback button and route responses back for expert review within your team.
Optimization Features
Infra Optimization
- Experimented with Google Cloud Storage and scalable vector DB (Pinecone)
Model Optimization
- Fine-tuning GPT-3.5 Turbo on curated Q&A
System Optimization
- Use of vector DBs (Pinecone/Chroma) and chunked retrieval
Training Optimization
- SFT
Inference Optimization
- Prompt engineering to format outputs and enforce language matching
Reproducibility
Open Source Status
- partial
Risks & Boundaries
Limitations
- Model trained on partial and aggregate government data; detailed transactions unavailable due to data access limits.
- System cannot auto-update data; manual uploads required.
- Accuracy remains below 70% on evaluated items; not yet reliable for fully automated decisions.
- Response latency increases with a larger retrieval corpus and complex prompt pipelines.
- User interface is basic and limited to web/mobile views.
When Not To Use
- For fully automated, unreviewed policy decisions or legal rulings.
- When real-time transaction-level details are required (data not available).
- As the single source of truth without human expert verification.
Failure Modes
- Hallucinated answers despite high faithfulness score on some queries.
- Outdated responses if dataset is not regularly refreshed.
- Missing or incomplete retrieval context leading to incorrect answers.
- Slower responses as the RAG corpus grows or prompts become heavier.
Core Entities
Models
- OpenAI GPT-3.5 Turbo
- Llama-2-7b-chat-hf
- Titan Text G1 Express
- Titan Text G1 Lite
- Mistral 7B
- Mixtral 8X7B Instruct
- Jurassic-2 Ultra
- ft:gpt-3.5-turbo-0125:personal::9RAg3mNo (fine-tuned)
Metrics
- Accuracy
- correctness
- faithfulness
- precision
- recall
- response time (s)
Datasets
- Ministry of Finance of Republic of Indonesia documents (2003–2023)
- Statistics Indonesia publications
- IMF data
- Ministry proprietary aggregated transactions (2014–2023, ~180k records)
- Scraped Q&A from Ministry website (~1,688 valid pairs)
- Directorate General of Customs and Excise Q&A (~1,299 pairs)
Benchmarks
- RAGAS (faithfulness, correctness, precision, recall)
- Human evaluation (Ministry staff)

