Open benchmark and a tuned LLM (CALM) show GPT-4-level credit scoring but expose measurable bias

Overview

Decision Snapshot: Needs Validation

The paper provides a reusable benchmark, released assets, and measurable results, but experiments use 7B-size models and the bias analysis shows real risks that need mitigation before production.

Citations: 12

Evidence Strength: 0.60

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 3/5

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 40%

Production readiness: 60%

Novelty: 50%

Authors

Duanyu Feng, Yongfu Dai, Jimin Huang, Yifang Zhang, Qianqian Xie, Weiguang Han, Zhengyu Chen, Alejandro Lopez-Lira, Hao Wang

Links

Abstract / PDF / Code / Data

Why It Matters For Business

LLMs can cut prototype time: GPT-4 comes close to expert pipelines on some credit tasks, and a tuned open model (CALM) can match closed models on several datasets. Fairness checks are mandatory before any customer-facing use.

Who Should Care

CTO, Product Manager, ML Engineer, Data Scientist, Founder

Summary TLDR

The authors build an open benchmark (9 tabular datasets, ~14k examples), an instruction-tuning collection (~45k examples), and CALM, a Llama2-chat-based credit-and-risk LLM fine-tuned with LoRA. GPT-4 already approaches expert-system accuracy on some credit tasks. Fine-tuning on task-specific instruction data (CALM) raises balanced metrics (MCC) on several datasets, but models can learn and amplify dataset biases, as measured by Disparate Impact (DI), Equal Opportunity Difference (EOD), and Average Odds Difference (AOD). All code and data are released.

Problem Statement

Credit scoring systems are usually task-specific and don’t transfer well across related financial tasks. The paper asks whether large language models can (1) generalize across credit and risk tasks, (2) improve by instruction tuning on domain examples, and (3) avoid introducing or amplifying fairness harms.

Main Contribution

Curated a focused benchmark for credit and risk tasks: 9 open datasets, ~14K examples across credit scoring, fraud detection, financial distress and claims.

Built a 45K-sample instruction-tuning corpus and released it plus code and benchmarks.

Key Findings

GPT-4 can reach near-expert accuracy on some credit tasks.

Numbers: Lending Club Acc 0.762 vs SOTA 0.777; Travel Insurance F1 0.897 vs SOTA 0.912

Practical Use: For fast prototyping or cross-task checks, try GPT-4 with one-shot prompts before building task-specific systems.

Evidence Ref: Table 3; Section 6.2.1
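A one-shot prompt check of this kind can be sketched in a few lines. The example below is a minimal illustration, not the paper's prompt templates: the instruction wording, the label vocabulary ("good" / "bad"), and all function names are hypothetical. It renders a row in a table-based and a description-based form, wraps it with one labeled example, and treats any reply without exactly one recognizable label as a 'Miss'.

```python
from typing import Dict, Optional, Sequence

# Hypothetical label vocabulary; adapt to the task at hand.
LABELS = ("good", "bad")


def table_prompt(row: Dict[str, object]) -> str:
    """Render a row as a compact header line plus a value line."""
    header = " | ".join(row.keys())
    values = " | ".join(str(v) for v in row.values())
    return f"{header}\n{values}"


def description_prompt(row: Dict[str, object]) -> str:
    """Render a row as a natural-language description."""
    parts = [f"the {name} is {value}" for name, value in row.items()]
    return "For this applicant, " + ", ".join(parts) + "."


def one_shot_prompt(example_text: str, example_label: str, query_text: str) -> str:
    """Combine an instruction, one labeled example, and the query row."""
    instruction = (
        "Assess the credit risk of the applicant. "
        f"Answer with exactly one word: '{LABELS[0]}' or '{LABELS[1]}'."
    )
    return (
        f"{instruction}\n\n"
        f"Example:\n{example_text}\nAnswer: {example_label}\n\n"
        f"Query:\n{query_text}\nAnswer:"
    )


def parse_answer(reply: str) -> Optional[str]:
    """Return the single label found in the reply, or None for a 'Miss'."""
    words = reply.lower().replace(".", " ").replace(",", " ").split()
    found = [label for label in LABELS if label in words]
    return found[0] if len(found) == 1 else None


def miss_rate(replies: Sequence[str]) -> float:
    """Fraction of replies that contain no usable label."""
    parsed = [parse_answer(r) for r in replies]
    return sum(p is None for p in parsed) / len(parsed)
```

The prompt string would be sent to whichever model is being evaluated; the model call itself is left out because it depends on the provider. Replies that parse to None can be counted separately from wrong answers so that prompt problems are not confused with model errors.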

Instruction tuning a 7B LLM (CALM) raises balanced performance on several trained datasets.

Numbers: CALM shows higher MCC on trained sets (Credit Card Fraud, ccFraud, Taiwan) and F1 up to 0.971 on Credit Card Fraud

Practical Use: Investing in ~45k targeted instruction examples and LoRA fine-tuning can turn a base open LLM into a competitive domain model for balanced predictions.

Evidence Ref: Table 3; Section 6.2.2

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Benchmark size	9 datasets, ~14,000 samples	—	—	—	Section 3; Table 1	Table 1
Instruction tuning size	45,000 samples	—	—	—	Section 3; 3.2	Section 3.2

What To Try In 7 Days

Run GPT-4 on a few held-out rows to benchmark one-shot performance.

Create table-based and description-based prompts from your data and measure 'Miss' and MCC.

Compute DI/EOD/AOD with AI Fairness 360 on your data and model outputs before deploying anything.
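To make the three fairness metrics concrete before reaching for a toolkit, here is a plain-Python sketch using their commonly used definitions. It does not call AI Fairness 360; the function names, the 0/1 encoding of the protected attribute, and the toy data are all assumptions for illustration. Label 1 is treated as the favorable outcome.

```python
from typing import Dict, Sequence, Tuple


def _rate(numerator: int, denominator: int) -> float:
    """Safe ratio that fails loudly when a group or class is empty."""
    if denominator == 0:
        raise ValueError("empty group or class; metric undefined")
    return numerator / denominator


def group_rates(y_true: Sequence[int], y_pred: Sequence[int],
                group: Sequence[int], value: int) -> Tuple[float, float, float]:
    """Selection rate, TPR and FPR for rows where group == value."""
    rows = [(t, p) for t, p, g in zip(y_true, y_pred, group) if g == value]
    selected = sum(p == 1 for _, p in rows)
    tp = sum(t == 1 and p == 1 for t, p in rows)
    fp = sum(t == 0 and p == 1 for t, p in rows)
    pos = sum(t == 1 for t, _ in rows)
    neg = sum(t == 0 for t, _ in rows)
    return _rate(selected, len(rows)), _rate(tp, pos), _rate(fp, neg)


def fairness_report(y_true: Sequence[int], y_pred: Sequence[int],
                    group: Sequence[int], privileged: int = 1,
                    unprivileged: int = 0) -> Dict[str, float]:
    """DI, EOD and AOD, each comparing unprivileged to privileged."""
    sel_p, tpr_p, fpr_p = group_rates(y_true, y_pred, group, privileged)
    sel_u, tpr_u, fpr_u = group_rates(y_true, y_pred, group, unprivileged)
    if sel_p == 0:
        raise ValueError("privileged group never selected; DI undefined")
    return {
        "DI": sel_u / sel_p,                               # ratio of selection rates
        "EOD": tpr_u - tpr_p,                              # gap in true positive rates
        "AOD": 0.5 * ((fpr_u - fpr_p) + (tpr_u - tpr_p)),  # mean of FPR and TPR gaps
    }


# Toy data: four privileged rows (group 1) followed by four unprivileged rows (group 0).
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
group = [1, 1, 1, 1, 0, 0, 0, 0]
report = fairness_report(y_true, y_pred, group)
```

Under these definitions a model with no measured disparity gives DI = 1 and EOD = AOD = 0. In the toy data the unprivileged group is selected a third as often as the privileged group and both EOD and AOD are negative, which is the kind of pattern to investigate before deployment.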

Optimization Features

Infra Optimization

fine-tuned on 4×A100 40GB GPUs

Model Optimization

LoRA

Training Optimization

instruction tuning (45k)

resampling minority class to 2:1 balance

AdamW optimizer
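The resampling step can be illustrated with a small sketch. The summary only states that the minority class is resampled to a 2:1 balance, so two things here are assumptions: that 2:1 means majority:minority, and that resampling means random duplication of minority rows with replacement. The function name and parameters are hypothetical.

```python
import math
import random
from typing import List, Sequence, Tuple


def oversample_minority(rows: Sequence[object], labels: Sequence[int],
                        minority_label: int = 1, ratio: float = 2.0,
                        seed: int = 0) -> Tuple[List[object], List[int]]:
    """Duplicate random minority rows until majority:minority is at most `ratio`.

    Assumption: `ratio` is majority count divided by minority count.
    """
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == minority_label]
    majority = [i for i, y in enumerate(labels) if y != minority_label]
    if not minority:
        raise ValueError("no minority examples to resample")
    target = math.ceil(len(majority) / ratio)   # minority rows needed
    extra = max(0, target - len(minority))      # duplicates to draw
    picked = minority + [rng.choice(minority) for _ in range(extra)]
    indices = majority + picked
    rng.shuffle(indices)                        # avoid class-ordered batches
    return [rows[i] for i in indices], [labels[i] for i in indices]


# Toy data: 100 majority rows (label 0) and 5 minority rows (label 1).
rows = list(range(105))
labels = [0] * 100 + [1] * 5
new_rows, new_labels = oversample_minority(rows, labels)
```

Resampling of this kind should be applied to the training split only, so that evaluation still reflects the original class distribution.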

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Code URLs

https://github.com/colfeng/CALM

Data URLs

https://github.com/colfeng/CALM

Risks & Boundaries

Limitations

Fine-tuning and evaluation used ~7B models due to compute limits; larger models may behave differently.

Datasets include anonymized tables that can be harder to learn and reduce transfer.

When Not To Use

High-stakes automated lending without a full fairness and regulatory audit.

When strict local explainability laws prevent opaque models.

Failure Modes

Predicting the majority class on imbalanced data (MCC ≈ 0).

Returning irrelevant answers ('Miss') when prompts are unclear.
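The first failure mode is easy to reproduce on toy data. The sketch below uses the standard Matthews correlation coefficient formula and the common convention of returning 0 when the denominator is zero; the data and the two prediction vectors are made up for illustration and are not results from the paper.

```python
import math
from typing import Sequence, Tuple


def confusion(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (tp, tn, fp, fn) for binary 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    return tp, tn, fp, fn


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    tp, tn, fp, fn = confusion(y_true, y_pred)
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Matthews correlation coefficient; 0.0 when the denominator is zero."""
    tp, tn, fp, fn = confusion(y_true, y_pred)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


# Toy imbalanced data: 5 positives (e.g. fraud) and 95 negatives.
y_true = [1] * 5 + [0] * 95

# A model that always predicts the majority class.
always_negative = [0] * 100

# A model that finds 4 of 5 positives at the cost of 5 false alarms.
informative = [1, 1, 1, 1, 0] + [1] * 5 + [0] * 90

acc_majority, mcc_majority = accuracy(y_true, always_negative), mcc(y_true, always_negative)
acc_informative, mcc_informative = accuracy(y_true, informative), mcc(y_true, informative)
```

On this data the always-negative model has the higher accuracy (0.95 against 0.94) but an MCC of 0, while the informative model scores well above 0. That gap is why accuracy alone can hide the majority-class failure on imbalanced credit and fraud data.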

Core Entities

Models

CALM, Llama2-chat, GPT-4, ChatGPT, Llama2, Llama1, Vicuna, Bloomz, Chatglm2, FinMA

Metrics

Accuracy, F1, MCC, Disparate Impact (DI), Equal Opportunity Difference (EOD), Average Odds Difference (AOD), Miss

Datasets

German, Australia, Lending Club, Credit Card Fraud, ccFraud, Polish, Taiwan Economic Journal, PortoSeguro, Travel Insurance, CustomsDeclaration

Benchmarks

Credit and Risk Assessment Benchmark (9 datasets, ~14K), CALM instruction tuning corpus (45K)

