Open benchmark and a tuned LLM (CALM) show GPT-4-level credit scoring but expose measurable bias

October 1, 2023 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.5

Cost Impact Score

0.4

Citation Count

12

Authors

Duanyu Feng, Yongfu Dai, Jimin Huang, Yifang Zhang, Qianqian Xie, Weiguang Han, Zhengyu Chen, Alejandro Lopez-Lira, Hao Wang

Why It Matters For Business

LLMs can shorten prototyping: GPT-4 comes close to expert pipelines on some credit tasks, and a tuned open model (CALM) can match closed models. Fairness checks are mandatory before any customer-facing use.

Summary TLDR

The authors build an open benchmark (9 tabular datasets, ~14k examples), an instruction-tuning collection (~45k examples), and CALM, a Llama2-chat-based credit-and-risk LLM fine-tuned with LoRA. GPT-4 already approaches expert-system accuracy on some credit tasks. Fine-tuning on task-specific instruction data (CALM) raises the balanced metric (MCC) on several datasets, but models can learn and amplify dataset biases, as measured by Disparate Impact (DI), Equal Opportunity Difference (EOD), and Average Odds Difference (AOD). All code and data are released.

Problem Statement

Credit scoring systems are usually task-specific and don’t transfer well across related financial tasks. The paper asks whether large language models can (1) generalize across credit and risk tasks, (2) improve by instruction tuning on domain examples, and (3) avoid introducing or amplifying fairness harms.

Main Contribution

Curated a focused benchmark for credit and risk tasks: 9 open datasets, ~14K examples across credit scoring, fraud detection, financial distress and claims.

Built a 45K-sample instruction-tuning corpus and released it together with the code and benchmark.

Trained CALM (Llama2-chat fine-tuned via LoRA) and showed that fine-tuning improves balanced metrics on several tasks but can also inherit dataset bias.

Key Findings

GPT-4 can reach near-expert accuracy on some credit tasks.

Numbers: Lending Club Acc 0.762 vs SOTA 0.777; Travel Insurance F1 0.897 vs SOTA 0.912

Instruction tuning a 7B LLM (CALM) raises balanced performance on several trained datasets.

Numbers: CALM shows higher MCC on trained sets (Credit Card Fraud, ccFraud, Taiwan) and F1 up to 0.971 on Credit Card Fraud

LLMs can exhibit measurable fairness gaps that differ by model and dataset.

Numbers: GPT-4 AOD = -0.273 (ccFraud, gender); GPT-4 EOD = 0.289 (German, foreigner)

Open-source base LLMs often predict the majority class without domain tuning.

Numbers: Many open LLMs show MCC ≈ 0 and trivial F1s on imbalanced sets
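
The majority-class finding above is easy to reproduce on toy data. The following is a minimal sketch, not the paper's evaluation code: the label vectors are invented for illustration, and returning 0 when the denominator vanishes is a common convention assumed here. It shows why accuracy can look strong while MCC stays at zero.

```python
import math

def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels coded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def accuracy(y_true, y_pred):
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return (tp + tn) / len(y_true)

def mcc(y_true, y_pred):
    """Matthews correlation coefficient; 0.0 when the denominator is zero (assumed convention)."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

# Invented imbalanced toy set: 95 negatives, 5 positives.
y_true = [0] * 95 + [1] * 5
majority_pred = [0] * 100  # a model that always answers the majority class

print("accuracy:", accuracy(y_true, majority_pred))
print("MCC:", mcc(y_true, majority_pred))
```

Because the always-negative model never predicts a positive, one factor of the MCC denominator is zero, so the score falls back to 0 even though 95% of the rows are labeled correctly.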

Results

Benchmark size

Value: 9 datasets, ~14,000 samples

Instruction tuning size

Value: 45,000 samples

GPT-4 vs SOTA (example)

Value: Acc 0.762 (GPT-4) vs 0.777 (SOTA) on Lending Club

Baseline: SOTA expert system

CALM balanced performance (example)

Value: MCC improved on trained sets; F1 0.971 on Credit Card Fraud

Baseline: base Llama2-chat

Fairness gaps (example)

Value: GPT-4 AOD = -0.273 (ccFraud, gender); GPT-4 EOD = 0.289 (German, foreigner)

Baseline: ideal 0.0

Who Should Care

What To Try In 7 Days

Run GPT-4 on a few held-out rows to benchmark one-shot performance.

Create table-based and description-based prompts from your data and measure 'Miss' and MCC.

Compute DI/EOD/AOD with AI Fairness 360 on your data and model outputs before deploying anything.
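
Before wiring up AI Fairness 360, it can help to see what the three fairness metrics measure. The sketch below is a plain-Python illustration under the usual definitions (unprivileged group minus, or divided by, privileged group); it does not call AI Fairness 360. The group values "a" and "b", the toy labels, and the choice of 1 as the favorable outcome are all assumptions made for the example.

```python
def _rate(num, den):
    """Safe division; NaN when the denominator is zero."""
    return num / den if den else float("nan")

def group_rates(y_true, y_pred, group, value, favorable=1):
    """Selection rate, TPR and FPR for rows whose group attribute equals `value`."""
    rows = [(t, p) for t, p, g in zip(y_true, y_pred, group) if g == value]
    pos = [p for t, p in rows if t == favorable]   # truly favorable rows
    neg = [p for t, p in rows if t != favorable]   # truly unfavorable rows
    selection = _rate(sum(p == favorable for _, p in rows), len(rows))
    tpr = _rate(sum(p == favorable for p in pos), len(pos))
    fpr = _rate(sum(p == favorable for p in neg), len(neg))
    return selection, tpr, fpr

def fairness_gaps(y_true, y_pred, group, privileged, unprivileged, favorable=1):
    """DI (ideal 1.0), EOD and AOD (ideal 0.0), unprivileged relative to privileged."""
    sel_p, tpr_p, fpr_p = group_rates(y_true, y_pred, group, privileged, favorable)
    sel_u, tpr_u, fpr_u = group_rates(y_true, y_pred, group, unprivileged, favorable)
    return {
        "DI": _rate(sel_u, sel_p),
        "EOD": tpr_u - tpr_p,
        "AOD": 0.5 * ((fpr_u - fpr_p) + (tpr_u - tpr_p)),
    }

# Invented toy data: four rows per group, "a" treated as privileged.
group  = ["a", "a", "a", "a", "b", "b", "b", "b"]
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]

gaps = fairness_gaps(y_true, y_pred, group, privileged="a", unprivileged="b")
print(gaps)
```

Which label counts as favorable depends on the task (for fraud detection it is usually "not fraud"), so that parameter should be set deliberately rather than left at the default.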

Optimization Features

Infra Optimization

  • fine-tuned on 4×A100 40GB GPUs

Model Optimization

  • LoRA

Training Optimization

  • instruction tuning (45k)
  • resampling minority class to 2:1 balance
  • AdamW optimizer
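
The resampling step listed above could look roughly like the sketch below. The summary does not say how the 2:1 balance is reached, so this assumes random oversampling of the minority class with replacement until the majority-to-minority ratio is at most 2:1. The function name, the `label` key, and the toy rows are hypothetical.

```python
import math
import random
from collections import Counter

def oversample_minority(rows, label_key="label", ratio=2.0, seed=0):
    """Duplicate random minority rows until majority:minority <= ratio (assumed reading of 2:1)."""
    counts = Counter(r[label_key] for r in rows)
    if len(counts) != 2:
        raise ValueError("expected exactly two classes")
    (maj_label, n_maj), (min_label, n_min) = counts.most_common(2)
    target_min = math.ceil(n_maj / ratio)       # minority size needed for the ratio
    extra = max(0, target_min - n_min)          # number of duplicates to draw
    rng = random.Random(seed)                   # seeded for reproducibility
    minority = [r for r in rows if r[label_key] == min_label]
    return rows + [rng.choice(minority) for _ in range(extra)]

# Invented toy set: 90 majority rows, 10 minority rows.
rows = [{"label": 0}] * 90 + [{"label": 1}] * 10
balanced = oversample_minority(rows)
balanced_counts = Counter(r["label"] for r in balanced)
print(balanced_counts)
```

Resampling of this kind should be applied to the training split only, so that evaluation still reflects the original class imbalance.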

Reproducibility

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Fine-tuning and evaluation used ~7B models due to compute limits; larger models may behave differently.
  • Datasets include anonymized tables that can be harder to learn and reduce transfer.
  • Paper does not provide interpretable labels or a dedicated explainability dataset.

When Not To Use

  • High-stakes automated lending without a full fairness and regulatory audit.
  • When strict local explainability laws prohibit opaque models.
  • On heavily anonymized/encoded features without additional domain engineering.

Failure Modes

  • Predicting the majority class on imbalanced data (MCC ≈ 0).
  • Returning irrelevant answers ('Miss') when prompts are unclear.
  • Learning and amplifying dataset bias (e.g., foreigner/gender effects).
  • Poor transfer on datasets with anonymized features (e.g., PortoSeguro).
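
The 'Miss' failure mode can be tracked with a small answer parser. This is a minimal sketch under assumptions: the label words "good" and "bad" and the sample outputs are hypothetical, and treating an output that names zero labels or more than one label as a Miss is one possible rule, not necessarily the paper's.

```python
import re

def parse_answer(text, labels=("good", "bad")):
    """Return the single label mentioned in the model output, or None if zero or several match."""
    lowered = text.lower()
    found = {lab for lab in labels
             if re.search(rf"\b{re.escape(lab.lower())}\b", lowered)}
    return found.pop() if len(found) == 1 else None

def miss_rate(outputs, labels=("good", "bad")):
    """Share of outputs that could not be mapped to exactly one label."""
    parsed = [parse_answer(o, labels) for o in outputs]
    return sum(p is None for p in parsed) / len(parsed) if parsed else 0.0

# Invented model outputs for illustration.
outputs = [
    "Good",
    "The answer is bad.",
    "I cannot assess credit risk.",   # no label -> Miss
    "good or bad",                    # ambiguous -> Miss
]
print([parse_answer(o) for o in outputs])
print("Miss rate:", miss_rate(outputs))
```

Reporting the Miss rate next to MCC keeps unparseable answers visible instead of silently counting them as one of the classes.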

Core Entities

Models

  • CALM
  • Llama2-chat
  • GPT-4
  • ChatGPT
  • Llama2
  • Llama1
  • Vicuna
  • Bloomz
  • ChatGLM2
  • FinMA

Metrics

  • Accuracy
  • F1
  • MCC
  • Disparate Impact (DI)
  • Equal Opportunity Difference (EOD)
  • Average Odds Difference (AOD)
  • Miss

Datasets

  • German
  • Australia
  • Lending Club
  • Credit Card Fraud
  • ccFraud
  • Polish
  • Taiwan Economic Journal
  • PortoSeguro
  • Travel Insurance
  • CustomsDeclaration

Benchmarks

  • Credit and Risk Assessment Benchmark (9 datasets, ~14K)
  • CALM instruction tuning corpus (45K)