Overview
Production Readiness
0.6
Novelty Score
0.5
Cost Impact Score
0.4
Citation Count
12
Why It Matters For Business
LLMs can shorten prototyping: GPT-4 approaches expert pipelines on some credit tasks, and a tuned open model (CALM) can match closed models. Fairness checks remain mandatory before any customer-facing use.
Summary TLDR
The authors build an open benchmark (9 tabular datasets, ~14K examples), an instruction-tuning collection (~45K examples), and CALM, a Llama2-chat-based credit-and-risk LLM fine-tuned with LoRA. GPT-4 already approaches expert-system accuracy on some credit tasks. Fine-tuning on task-specific instruction data (CALM) raises the balanced metric MCC (Matthews correlation coefficient) on several datasets, but models can learn and amplify dataset biases, measured by Disparate Impact (DI), Equal Opportunity Difference (EOD), and Average Odds Difference (AOD). All code and data are released.
Problem Statement
Credit scoring systems are usually task-specific and don’t transfer well across related financial tasks. The paper asks whether large language models can (1) generalize across credit and risk tasks, (2) improve by instruction tuning on domain examples, and (3) avoid introducing or amplifying fairness harms.
Main Contribution
Curated a focused benchmark for credit and risk tasks: 9 open datasets, ~14K examples across credit scoring, fraud detection, financial distress and claims.
Built a 45K-sample instruction-tuning corpus and released it plus code and benchmarks.
Trained CALM (Llama2-chat fine-tuned via LoRA) and showed that fine-tuning improves balanced metrics on several tasks but can also inherit bias from the training data.
Key Findings
GPT-4 can reach near-expert accuracy on some credit tasks.
Instruction tuning a 7B LLM (CALM) raises balanced performance on several trained datasets.
LLMs can exhibit measurable fairness gaps that differ by model and dataset.
Open-source base LLMs often predict the majority class without domain tuning.
Results
Benchmark size
Instruction tuning size
GPT-4 vs SOTA (example)
CALM balanced performance (example)
Fairness gaps (example)
Who Should Care
What To Try In 7 Days
Run GPT-4 on a few held-out rows to benchmark one-shot performance.
Create table-based and description-based prompts from your data and measure 'Miss' and MCC.
Compute DI/EOD/AOD with AI Fairness 360 on your data and model outputs before deploying anything.
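The prompt and 'Miss' steps above can be sketched as follows. This is a minimal illustration, not the paper's templates: the prompt wording, the `good`/`bad` label set, the feature names, and the helper names (`table_prompt`, `description_prompt`, `parse_answer`, `miss_rate`) are all assumptions. 'Miss' is taken here to mean the share of responses from which no single valid label can be extracted.

```python
import re

INSTRUCTION = "Assess the credit risk of this applicant. Answer 'good' or 'bad'."

def table_prompt(row: dict) -> str:
    """Table-style prompt: one header line, one value line (hypothetical template)."""
    header = " | ".join(row.keys())
    values = " | ".join(str(v) for v in row.values())
    return f"{INSTRUCTION}\n{header}\n{values}\nAnswer:"

def description_prompt(row: dict) -> str:
    """Description-style prompt: features written out as a sentence (hypothetical template)."""
    facts = ", ".join(f"{k} is {v}" for k, v in row.items())
    return f"{INSTRUCTION}\nApplicant profile: {facts}.\nAnswer:"

def parse_answer(text: str, labels=("good", "bad")):
    """Return the label if exactly one valid label appears in the response, else None."""
    found = [l for l in labels if re.search(rf"\b{re.escape(l)}\b", text.lower())]
    return found[0] if len(found) == 1 else None

def miss_rate(responses) -> float:
    """Fraction of responses with no usable answer (assumed reading of 'Miss')."""
    return sum(parse_answer(r) is None for r in responses) / len(responses)

# Hypothetical row and model responses, for illustration only.
row = {"age": 35, "duration_months": 24, "credit_amount": 5000}
responses = ["Good", "I cannot determine this.", "bad.", "good or bad"]
print(table_prompt(row))
print(description_prompt(row))
print(miss_rate(responses))
```

Treating a response that names both labels as a miss is a design choice; a stricter or looser parser will shift the reported rate, so the parsing rule should be fixed before comparing models.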
Optimization Features
Infra Optimization
- fine-tuned on 4×A100 40GB GPUs
Model Optimization
- LoRA
Training Optimization
- instruction tuning (45k)
- resampling minority class to 2:1 balance
- AdamW optimizer
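The resampling step can be sketched as below. The summary does not say how the 2:1 balance is reached, so this assumes one reading: the minority class is oversampled with replacement until the majority-to-minority ratio is 2:1. The function name, the `label` key, and the seed are illustrative assumptions, not the authors' code.

```python
import random

def oversample_minority(rows, label_key="label", ratio=2.0, seed=0):
    """Oversample the minority class until majority:minority is about `ratio`:1.

    Assumes a binary label; rows are dicts. Minority rows are duplicated by
    sampling with replacement, and majority rows are left untouched.
    """
    rng = random.Random(seed)
    by_label = {}
    for r in rows:
        by_label.setdefault(r[label_key], []).append(r)
    if len(by_label) != 2:
        raise ValueError("expected exactly two classes")
    minority, majority = sorted(by_label.values(), key=len)
    target = int(len(majority) / ratio)        # desired minority count
    extra = max(0, target - len(minority))     # never drop existing rows
    resampled = minority + [rng.choice(minority) for _ in range(extra)]
    out = majority + resampled
    rng.shuffle(out)
    return out

# Hypothetical imbalanced data: 100 negatives, 10 positives.
data = [{"label": 0}] * 100 + [{"label": 1}] * 10
balanced = oversample_minority(data)
```

Resampling should be applied to the training split only; evaluating on resampled data would distort MCC and the fairness metrics.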
Reproducibility
Code Urls
Data Urls
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- Fine-tuning and evaluation used ~7B models due to compute limits; larger models may behave differently.
- Datasets include anonymized tables that can be harder to learn and reduce transfer.
- Paper does not provide interpretable labels or a dedicated explainability dataset.
When Not To Use
- High-stakes automated lending without a full fairness and regulatory audit.
- Where local explainability laws prohibit opaque models.
- On heavily anonymized/encoded features without additional domain engineering.
Failure Modes
- Predicting the majority class on imbalanced data (MCC ≈ 0).
- Returning irrelevant answers ('Miss') when prompts are unclear.
- Learning and amplifying dataset bias (e.g., foreigner/gender effects).
- Poor transfer on datasets with anonymized features (e.g., PortoSeguro).
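The first failure mode is easy to miss when only accuracy is reported. The sketch below uses made-up labels (95 negatives, 5 positives) to show that a model which always predicts the majority class scores high accuracy but an MCC of 0. Returning 0 when the denominator vanishes is a common convention, assumed here.

```python
import math

def confusion(y_true, y_pred):
    """Binary confusion counts, with 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    return tp, fp, tn, fn

def mcc(y_true, y_pred) -> float:
    """Matthews correlation coefficient; 0.0 when the denominator is zero."""
    tp, fp, tn, fn = confusion(y_true, y_pred)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

def accuracy(y_true, y_pred) -> float:
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Illustrative imbalanced labels, not taken from any benchmark dataset.
y_true = [0] * 95 + [1] * 5
always_majority = [0] * 100

print(accuracy(y_true, always_majority))  # 0.95
print(mcc(y_true, always_majority))       # 0.0
```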
Core Entities
Models
- CALM
- Llama2-chat
- GPT-4
- ChatGPT
- Llama2
- Llama1
- Vicuna
- Bloomz
- Chatglm2
- FinMA
Metrics
- Accuracy
- F1
- MCC
- Disparate Impact (DI)
- Equal Opportunity Difference (EOD)
- Average Odds Difference (AOD)
- Miss
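The three fairness metrics can be written out directly from their standard definitions, as in the sketch below. It assumes binary labels with 1 as the favorable outcome and a boolean flag marking the privileged group; the toy data is invented for illustration. It is a plain-Python illustration of the definitions, not the AI Fairness 360 API.

```python
def _rate(num, den):
    """Safe ratio; NaN when the group or class is empty."""
    return num / den if den else float("nan")

def group_rates(y_true, y_pred, privileged, want_privileged):
    """Selection rate, TPR and FPR for one group (favorable label = 1)."""
    idx = [i for i, p in enumerate(privileged) if p == want_privileged]
    pos = [i for i in idx if y_true[i] == 1]
    neg = [i for i in idx if y_true[i] == 0]
    sel = _rate(sum(y_pred[i] == 1 for i in idx), len(idx))
    tpr = _rate(sum(y_pred[i] == 1 for i in pos), len(pos))
    fpr = _rate(sum(y_pred[i] == 1 for i in neg), len(neg))
    return sel, tpr, fpr

def fairness_gaps(y_true, y_pred, privileged):
    """DI, EOD and AOD, each computed as unprivileged relative to privileged."""
    sel_u, tpr_u, fpr_u = group_rates(y_true, y_pred, privileged, False)
    sel_p, tpr_p, fpr_p = group_rates(y_true, y_pred, privileged, True)
    return {
        "DI": _rate(sel_u, sel_p),                            # ideal: 1
        "EOD": tpr_u - tpr_p,                                 # ideal: 0
        "AOD": 0.5 * ((fpr_u - fpr_p) + (tpr_u - tpr_p)),     # ideal: 0
    }

# Toy example: first four rows privileged, last four unprivileged.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
privileged = [True] * 4 + [False] * 4
gaps = fairness_gaps(y_true, y_pred, privileged)
```

In this toy case the unprivileged group is selected a third as often as the privileged group (DI ≈ 0.33), and both EOD and AOD equal -0.5, so the model favors the privileged group on every measure.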
Datasets
- German
- Australia
- Lending Club
- Credit Card Fraud
- ccFraud
- Polish
- Taiwan Economic Journal
- PortoSeguro
- Travel Insurance
- CustomsDeclaration
Benchmarks
- Credit and Risk Assessment Benchmark (9 datasets, ~14K)
- CALM instruction tuning corpus (45K)

