Open benchmark and a tuned LLM (CALM) show GPT-4-level credit scoring but expose measurable bias

October 1, 2023 · 7 min

Overview

Decision Snapshot: Needs Validation

The paper provides a reusable benchmark, released assets, and measurable results, but experiments use 7B-size models and the bias analysis shows real risks that need mitigation before production.

Citations: 12

Evidence Strength: 0.60

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 3/5

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 40%

Production readiness: 60%

Novelty: 50%

Authors

Duanyu Feng, Yongfu Dai, Jimin Huang, Yifang Zhang, Qianqian Xie, Weiguang Han, Zhengyu Chen, Alejandro Lopez-Lira, Hao Wang

Links

Abstract / PDF / Code / Data

Why It Matters For Business

LLMs can cut prototype time: GPT-4 comes close to expert pipelines on some credit tasks, and a tuned open model (CALM) can match closed models, but fairness checks are mandatory before any customer-facing use.

Who Should Care

Summary TLDR

The authors build an open benchmark (9 tabular datasets, ~14k examples), an instruction-tuning collection (~45k examples), and CALM, a Llama2-chat-based credit-and-risk LLM fine-tuned with LoRA. GPT-4 already approaches expert-system accuracy on some credit tasks. Fine-tuning with task-specific instruction data (CALM) raises balanced metrics (MCC) on several datasets, but models can learn and amplify dataset biases (measured by DI, EOD, AOD). All code and data are released.

Problem Statement

Credit scoring systems are usually task-specific and don’t transfer well across related financial tasks. The paper asks whether large language models can (1) generalize across credit and risk tasks, (2) improve by instruction tuning on domain examples, and (3) avoid introducing or amplifying fairness harms.

Main Contribution

Curated a focused benchmark for credit and risk tasks: 9 open datasets, ~14K examples across credit scoring, fraud detection, financial distress and claims.

Built a 45K-sample instruction-tuning corpus and released it plus code and benchmarks.

Key Findings

GPT-4 can reach near-expert accuracy on some credit tasks.

Numbers: Lending Club Acc 0.762 vs SOTA 0.777; Travel Insurance F1 0.897 vs SOTA 0.912

Practical Use: For fast prototyping or cross-task checks, try GPT-4 with one-shot prompts before building task-specific systems.

Evidence Ref: Table 3; Section 6.2.1

Instruction tuning a 7B LLM (CALM) raises balanced performance on several trained datasets.

Numbers: CALM shows higher MCC on trained sets (Credit Card Fraud, ccFraud, Taiwan) and F1 up to 0.971 on Credit Card Fraud

Practical Use: Investing in ~45k targeted instruction examples and LoRA fine-tuning can turn a base open LLM into a competitive domain model for balanced predictions.

Evidence Ref: Table 3; Section 6.2.2
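To make the idea of a "targeted instruction example" concrete, the sketch below turns one tabular record into a prompt/response pair in two styles: a table-based form and a description-based form. The instruction wording, the field names, the `good`/`bad` labels, and all function names are hypothetical illustrations, not the templates used by the CALM authors.

```python
from typing import Mapping

# Hypothetical instruction text; the paper's actual templates may differ.
INSTRUCTION = (
    "Assess the creditworthiness of the applicant. "
    "Answer with exactly one word: 'good' or 'bad'."
)


def table_prompt(row: Mapping[str, object]) -> str:
    """Render a record as a header line plus a value line (table-based form)."""
    header = " | ".join(row.keys())
    values = " | ".join(str(v) for v in row.values())
    return f"{INSTRUCTION}\n{header}\n{values}\nAnswer:"


def description_prompt(row: Mapping[str, object]) -> str:
    """Render a record as one natural-language sentence (description-based form)."""
    facts = ", ".join(f"{name} is {value}" for name, value in row.items())
    return f"{INSTRUCTION}\nThe applicant's {facts}.\nAnswer:"


def instruction_example(row: Mapping[str, object], label: str, style: str = "table") -> dict:
    """Pair a prompt with its gold label, as one instruction-tuning sample."""
    prompt = table_prompt(row) if style == "table" else description_prompt(row)
    return {"prompt": prompt, "response": label}


# Made-up record, for illustration only.
row = {"age": 35, "loan amount": 5000, "employment years": 4}
example = instruction_example(row, "good", style="description")
print(example["prompt"])
```

Keeping both renderings behind one function makes it easy to compare the two prompt styles on the same held-out rows.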

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Benchmark size | 9 datasets, ~14,000 samples | | | | Section 3; Table 1 | Table 1
Instruction tuning size | 45,000 samples | | | | Section 3; 3.2 | Section 3.2

What To Try In 7 Days

Run GPT-4 on a few held-out rows to benchmark one-shot performance.

Create table-based and description-based prompts from your data and measure 'Miss' and MCC.

Compute DI/EOD/AOD with AI Fairness 360 on your data and model outputs before deploying anything.
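For a quick sanity check before wiring up AI Fairness 360, the three bias metrics can be computed by hand. The sketch below uses the usual definitions: DI is the ratio of favorable-prediction rates (unprivileged over privileged), EOD is the difference in true positive rates, and AOD is the average of the FPR and TPR differences. It assumes binary labels with 1 as the favorable outcome and a binary group attribute; the function names and toy data are illustrative only.

```python
def _rate(flags):
    """Mean of a list of 0/1 values; 0.0 for an empty list."""
    return sum(flags) / len(flags) if flags else 0.0


def group_rates(y_true, y_pred, group, value):
    """Selection rate, TPR and FPR for the rows where group == value."""
    rows = [(t, p) for t, p, g in zip(y_true, y_pred, group) if g == value]
    selection = _rate([p for _, p in rows])
    tpr = _rate([p for t, p in rows if t == 1])
    fpr = _rate([p for t, p in rows if t == 0])
    return selection, tpr, fpr


def fairness_metrics(y_true, y_pred, group, privileged=1, unprivileged=0):
    """DI, EOD and AOD, each computed as unprivileged relative to privileged."""
    sel_p, tpr_p, fpr_p = group_rates(y_true, y_pred, group, privileged)
    sel_u, tpr_u, fpr_u = group_rates(y_true, y_pred, group, unprivileged)
    return {
        "DI": sel_u / sel_p if sel_p else float("nan"),
        "EOD": tpr_u - tpr_p,
        "AOD": 0.5 * ((fpr_u - fpr_p) + (tpr_u - tpr_p)),
    }


# Toy data: four privileged rows (group 1) followed by four unprivileged rows (group 0).
group = [1, 1, 1, 1, 0, 0, 0, 0]
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
metrics = fairness_metrics(y_true, y_pred, group)
print(metrics)
```

A DI of 1 and EOD/AOD of 0 indicate parity under these definitions. The toy predictions above favor the privileged group, so DI falls below 1 and both differences are negative.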

Optimization Features

Infra Optimization
fine-tuned on 4×A100 40GB GPUs
Model Optimization
LoRA
Training Optimization
instruction tuning (45k), resampling minority class to 2:1 balance, AdamW optimizer
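The summary does not say how the 2:1 resampling was carried out. One plausible reading, sketched below, is random oversampling of the minority class with replacement until the majority-to-minority ratio is at most 2:1. This is an assumption for illustration (binary labels, hypothetical function name), not the authors' confirmed procedure.

```python
import math
import random
from collections import Counter


def oversample_minority(rows, labels, ratio=2.0, seed=0):
    """Duplicate random minority rows until majority:minority <= ratio (binary labels)."""
    counts = Counter(labels)
    (_, majority_n), (minority_label, minority_n) = counts.most_common(2)
    target_n = math.ceil(majority_n / ratio)  # minority size needed for the ratio
    minority_idx = [i for i, y in enumerate(labels) if y == minority_label]
    rng = random.Random(seed)
    extra = [rng.choice(minority_idx) for _ in range(max(0, target_n - minority_n))]
    keep = list(range(len(labels))) + extra
    rng.shuffle(keep)  # avoid leaving all duplicates at the end
    return [rows[i] for i in keep], [labels[i] for i in keep]


# Toy data: 10 majority rows and 2 minority rows.
rows = [{"id": i} for i in range(12)]
labels = [0] * 10 + [1] * 2
new_rows, new_labels = oversample_minority(rows, labels)
print(Counter(new_labels))
```

Resampling of this kind should be applied to the training split only, so that evaluation metrics still reflect the original class balance.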

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Yes
License: Unknown

Risks & Boundaries

Limitations

Fine-tuning and evaluation used ~7B models due to compute limits; larger models may behave differently.

Datasets include anonymized tables that can be harder to learn and reduce transfer.

When Not To Use

High-stakes automated lending without a full fairness and regulatory audit.

When strict local explainability laws prevent opaque models.

Failure Modes

Predicting the majority class on imbalanced data (MCC ≈ 0).

Returning irrelevant answers ('Miss') when prompts are unclear.
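Both failure modes are easy to detect with a few lines of code. The sketch below computes MCC from the confusion matrix and shows that a majority-class predictor scores 0 despite high accuracy. It also counts a response as a 'Miss' when exactly one expected label cannot be found in the model output. The `good`/`bad` labels and the parsing rule are assumptions for illustration; the summary does not specify how the paper extracts answers or scores misses.

```python
import math


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Common convention: report 0 when the denominator vanishes.
    return (tp * tn - fp * fn) / denom if denom else 0.0


def parse_answer(text, labels=("good", "bad")):
    """Return the single label found in the output, or None for a 'Miss'."""
    lowered = text.lower()
    found = [label for label in labels if label in lowered]
    return found[0] if len(found) == 1 else None


# Failure mode 1: always predicting the majority class on imbalanced data.
y_true = [0] * 9 + [1]
y_pred = [0] * 10
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
majority_mcc = mcc(y_true, y_pred)

# Failure mode 2: outputs that contain no usable answer.
outputs = ["good", "I cannot tell", "Bad.", "good or bad?"]
parsed = [parse_answer(o) for o in outputs]
miss_rate = parsed.count(None) / len(parsed)
print(accuracy, majority_mcc, miss_rate)
```

Reporting the miss rate alongside MCC keeps unanswered rows from being silently folded into one class.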

Core Entities

Models

CALM, Llama2-chat, GPT-4, ChatGPT, Llama2, Llama1, Vicuna, Bloomz, Chatglm2, FinMA

Metrics

Accuracy, F1, MCC, Disparate Impact (DI), Equal Opportunity Difference (EOD), Average Odds Difference (AOD), Miss

Datasets

German, Australia, Lending Club, Credit Card Fraud, ccFraud, Polish, Taiwan Economic Journal, PortoSeguro, Travel Insurance, CustomsDeclaration

Benchmarks

Credit and Risk Assessment Benchmark (9 datasets, ~14K); CALM instruction tuning corpus (45K)