A 50B-parameter LLM trained on a ~700B-token corpus, specialized for financial NLP

March 30, 2023 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.45

Cost Impact Score

0.8

Citation Count

299

Authors

Shijie Wu, Ozan Irsoy, Steven Lu, Vadim Dabravolski, Mark Dredze, Sebastian Gehrmann, Prabhanjan Kambadur, David Rosenberg, Gideon Mann

Links

Abstract / PDF

Why It Matters For Business

A mid-size LLM trained on a large curated finance corpus delivers large gains on finance tasks while staying competitive on general tasks, so firms can get domain accuracy without running the largest models.

Summary TLDR

BloombergGPT is a 50.6B-parameter, decoder-only language model trained on a mixed corpus of roughly 709 billion tokens (363B financial + 345B public). The authors show that mixing large curated finance data (FinPile) with standard public corpora produces substantial gains on financial tasks (public and internal), while preserving or improving performance on general benchmarks. Training used industry-scale infra (512 A100 GPUs), ZeRO stage-3 sharding, BF16 mixed precision, and a Unigram tokenizer. The model and FinPile are not released.

Problem Statement

Financial NLP needs models that understand finance-specific language and numeric/structured data. General LLMs perform well broadly but lag on finance tasks. The paper asks whether training a single mid-sized LLM on a mix of large, curated financial data plus general data yields strong in-domain performance without losing general abilities.

Main Contribution

Build and document FinPile: a 363B-token curated financial corpus and mix it with 345B public tokens

Train BloombergGPT: a 50.6B decoder-only model (70 layers, 40 heads) on 569B tokens (one partial epoch)

Show large gains on public and internal financial benchmarks while retaining competitive general-task performance

Share practical training notes (Training Chronicles) on instabilities, optimizers, and large-scale infra

Key Findings

Mixed training (curated finance + public data) yields strong finance performance without losing general abilities

Numbers: Training corpus: 363B financial + 345B public ≈ 709B tokens; trained on 569B tokens

BloombergGPT outperforms similar-sized open models on public financial benchmarks

Numbers: Public financial tasks average: BloombergGPT 62.51 vs GPT-NeoX 51.90 (Table 8)

BloombergGPT substantially beats peers on internal finance sentiment tasks

Numbers: Internal sentiment avg: BloombergGPT 62.47 vs GPT-NeoX 29.23; win rate 1.0 (Table 10)

Model size and compute choices were Chinchilla-style: smaller model, more tokens

Numbers: Model: 50.6B params; training tokens used: 569B; avg TFLOPs 102 (Table 4)

BloombergGPT keeps competitive general NLP skills relative to similar-size models

Numbers: BIG-bench Hard all tasks avg: BloombergGPT 41.97 vs GPT-NeoX 40.25; Reading Comprehension avg 61.22 (Tables 13, 16)
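
The token and parameter counts quoted above can be turned into a few ratios with simple arithmetic. The sketch below is illustrative only: the constant and function names are made up for this example, and the `6 * N * D` compute estimate is a commonly used rule of thumb for dense transformers, assumed here rather than taken from the paper.

```python
# Figures quoted in this summary, in billions of tokens / parameters.
FINANCIAL_TOKENS_B = 363
PUBLIC_TOKENS_B = 345
CORPUS_TOKENS_B = 709      # total as quoted; the two rounded parts sum to 708
TRAINED_TOKENS_B = 569
PARAMS_B = 50.6


def training_budget(params_b, trained_tokens_b, corpus_tokens_b):
    """Return tokens per parameter, fraction of one epoch, and approximate FLOPs."""
    tokens_per_param = trained_tokens_b / params_b
    epoch_fraction = trained_tokens_b / corpus_tokens_b
    # Assumption: compute C is approximated as 6 * N * D (N params, D tokens).
    approx_flops = 6 * (params_b * 1e9) * (trained_tokens_b * 1e9)
    return tokens_per_param, epoch_fraction, approx_flops


tokens_per_param, epoch_fraction, approx_flops = training_budget(
    PARAMS_B, TRAINED_TOKENS_B, CORPUS_TOKENS_B
)
finance_share = FINANCIAL_TOKENS_B / (FINANCIAL_TOKENS_B + PUBLIC_TOKENS_B)

print(f"tokens per parameter:   {tokens_per_param:.1f}")
print(f"fraction of one epoch:  {epoch_fraction:.1%}")
print(f"finance share of corpus: {finance_share:.1%}")
print(f"approx. training FLOPs: {approx_flops:.2e}")
```

Under these inputs the model saw roughly 11 tokens per parameter and about 80% of the mixed corpus, with finance data making up slightly more than half of it.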

Results

Public financial tasks average

Value: 62.51 (BloombergGPT)

Baseline: 51.90 (GPT-NeoX)

Internal sentiment tasks average (weighted F1)

Value: 62.47 (BloombergGPT)

Baseline: 29.23 (GPT-NeoX)

ConvFinQA exact match (numerical QA)

Value: 43.41 (BloombergGPT)

Baseline: 30.06 (GPT-NeoX)

BIG-bench Hard average (all tasks)

Value: 41.97 (BloombergGPT)

Baseline: 40.25 (GPT-NeoX)

Reading Comprehension average

Value: 61.22 (BloombergGPT)

Baseline: 42.81 (GPT-NeoX)

Model size and training compute

Value: 50.6B parameters; trained on 569B tokens; avg 102 TFLOPs
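
Two of the metrics above, weighted F1 and exact match, are easy to misread, so here is a minimal sketch of how they are typically computed. This is a generic illustration, not the paper's evaluation code: the function names, the toy labels, and the whitespace/case normalization in `exact_match` are assumptions made for this example.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights proportional to each class's support."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = n - tp
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        score += (n / total) * f1
    return score


def exact_match(gold_answers, predictions):
    """Share of predictions equal to the gold answer after simple normalization."""
    def norm(s):
        return " ".join(s.strip().lower().split())
    hits = sum(norm(g) == norm(p) for g, p in zip(gold_answers, predictions))
    return hits / len(gold_answers)


# Toy data, invented for illustration.
gold = ["pos", "pos", "neg", "neu"]
pred = ["pos", "neg", "neg", "neu"]
print(weighted_f1(gold, pred))
print(exact_match(["12.5", "Yes"], [" 12.5", "no"]))
```

Weighted F1 rewards getting the frequent classes right, which matters for imbalanced sentiment data; exact match gives no partial credit, which is why numerical QA scores tend to sit lower.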

Who Should Care

What To Try In 7 Days

Run a small few-shot evaluation comparing a general LLM to a domain-adapted model on your finance task

Curate a modest in-domain dataset (news, filings, press releases) and fine-tune an existing LLM for sentiment/NER

Prototype natural-language-to-query flows (like BQL examples) using a few-shot prompt and validate outputs
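
The first suggestion above, a small few-shot comparison, could start from a harness like the sketch below. Everything in it is hypothetical: the prompt wording, the example headlines, and `keyword_stub`, which stands in for a real model call. In practice each model under test would be wrapped in a function that takes a prompt string and returns the completion.

```python
from typing import Callable, Sequence, Tuple

Example = Tuple[str, str]  # (headline, label)


def build_few_shot_prompt(shots: Sequence[Example], query: str) -> str:
    """Format labeled examples followed by the unlabeled query."""
    lines = ["Classify the sentiment as positive, negative, or neutral.", ""]
    for text, label in shots:
        lines += [f"Headline: {text}", f"Sentiment: {label}", ""]
    lines += [f"Headline: {query}", "Sentiment:"]
    return "\n".join(lines)


def accuracy(model: Callable[[str], str], shots: Sequence[Example],
             test_set: Sequence[Example]) -> float:
    """Fraction of test items where the model's answer matches the gold label."""
    correct = 0
    for text, gold in test_set:
        answer = model(build_few_shot_prompt(shots, text)).strip().lower()
        correct += answer == gold
    return correct / len(test_set)


def keyword_stub(prompt: str) -> str:
    """Hypothetical stand-in for a model; looks only at the final headline."""
    headline = prompt.rsplit("Headline: ", 1)[1].lower()
    if "beats" in headline or "raises" in headline:
        return "positive"
    if "misses" in headline or "cuts" in headline:
        return "negative"
    return "neutral"


SHOTS = [("Acme beats quarterly estimates", "positive"),
         ("Acme cuts full-year guidance", "negative")]
TEST_SET = [("Globex raises dividend", "positive"),
            ("Initech misses revenue target", "negative"),
            ("Umbrella names new CFO", "neutral")]

print(accuracy(keyword_stub, SHOTS, TEST_SET))
```

Running the same `accuracy` call with a general model and a domain-adapted model, on the same shots and test set, gives a like-for-like comparison.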

Optimization Features

Token Efficiency

  • Unigram tokenizer with a 131,072-token vocabulary (multi-word tokens allowed)
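
To show what a Unigram tokenizer does at encoding time, the sketch below picks the highest-probability segmentation of a string with a Viterbi search. The vocabulary and probabilities are invented for this example, and it covers decoding only; real Unigram tokenizers also learn and prune the vocabulary, which is not shown here.

```python
import math


def unigram_segment(text, log_probs):
    """Return the highest-probability segmentation of text under a unigram model."""
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i] = best log-probability of text[:i]
    back = [0] * (n + 1)          # back[i] = start index of the last token in text[:i]
    best[0] = 0.0
    max_len = max(len(tok) for tok in log_probs)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in log_probs and best[start] + log_probs[piece] > best[end]:
                best[end] = best[start] + log_probs[piece]
                back[end] = start
    if best[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    tokens = []
    i = n
    while i > 0:
        tokens.append(text[back[i]:i])
        i = back[i]
    return tokens[::-1]


# Toy vocabulary with made-up probabilities; "New York" is a multi-word token.
TOY_VOCAB = {tok: math.log(p) for tok, p in {
    "New York": 0.05, "New": 0.03, "York": 0.01,
    " ": 0.2, "Stock": 0.02, " Stock": 0.02,
}.items()}

print(unigram_segment("New York Stock", TOY_VOCAB))
```

Because "New York" exists as a single vocabulary entry with reasonable probability, the search prefers it over splitting into "New", " ", and "York", which is the token-efficiency benefit of allowing multi-word tokens.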

Infra Optimization

  • AWS SageMaker on 64 p4d.24xlarge instances (512 A100 GPUs)
  • Amazon FSx for Lustre for fast IO

Model Optimization

  • Positional Bias Check
  • LayerNorm variants (embedding LN added)
  • Query-key layer scaling

System Optimization

  • Fused masked-causal-softmax kernels for attention
  • BF16 mixed precision for forward/backward, FP32 for parameter updates and some softmax steps
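
One reason to keep parameter updates in FP32 while computing in BF16 is that BF16 has very few mantissa bits, so small updates can vanish when added to a BF16 weight. The sketch below emulates BF16 rounding in pure Python to show the effect. It is an illustration under simplifying assumptions: `to_bf16` is a helper written for this example (it ignores NaN and infinity), and a Python float stands in for the FP32 master copy.

```python
import struct


def to_bf16(x: float) -> float:
    """Round x to the nearest bfloat16 value (ties to even), returned as a float."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Add the rounding bias, then clear the low 16 bits of the float32 pattern.
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


UPDATE = 1e-3
STEPS = 100

# Case 1: the weight itself is stored in BF16, so each small update is rounded away.
w_bf16_only = 1.0
for _ in range(STEPS):
    w_bf16_only = to_bf16(w_bf16_only + UPDATE)

# Case 2: accumulate in a higher-precision master copy, cast to BF16 only for compute.
w_master = 1.0
for _ in range(STEPS):
    w_master += UPDATE
w_compute = to_bf16(w_master)

print(w_bf16_only)  # the weight never moves
print(w_compute)    # the accumulated update survives the cast
```

Near 1.0, adjacent BF16 values are 2^-7 apart, so a 0.001 step rounds back to the starting value every time; the higher-precision copy keeps the running sum.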

Training Optimization

  • ZeRO stage-3 sharding across 128 GPUs
  • Activation checkpointing to reduce memory
  • MiCS hierarchical communication patterns
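
A back-of-the-envelope estimate shows why sharding the training state matters at this scale. The sketch below assumes 16 bytes of state per parameter (2 for BF16 weights, 2 for BF16 gradients, 12 for FP32 master weights and two Adam-style moment buffers); that figure is a standard accounting for mixed-precision Adam-style training and is an assumption here, not a number from this summary. Activations and temporary buffers are excluded.

```python
PARAMS = 50.6e9          # model size quoted above
SHARD_GROUP = 128        # GPUs that jointly hold one sharded copy
TOTAL_GPUS = 512         # 64 instances x 8 A100s

# Assumption: bytes of training state per parameter (weights + grads + optimizer).
BYTES_PER_PARAM = 2 + 2 + 12


def state_per_gpu_gb(params, bytes_per_param, shard_group):
    """Training-state memory per GPU when fully sharded across shard_group GPUs."""
    return params * bytes_per_param / shard_group / 1e9


total_state_gb = PARAMS * BYTES_PER_PARAM / 1e9
per_gpu_gb = state_per_gpu_gb(PARAMS, BYTES_PER_PARAM, SHARD_GROUP)
replicas = TOTAL_GPUS // SHARD_GROUP

print(f"unsharded training state: {total_state_gb:.1f} GB")
print(f"per GPU with sharding:    {per_gpu_gb:.2f} GB")
print(f"model replicas:           {replicas}")
```

Under these assumptions the unsharded state is far larger than a single accelerator's memory, while the sharded slice leaves room for activations; 512 GPUs with 128-way sharding implies four data-parallel replicas.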

Inference Optimization

  • ALiBi positional biases allow inference contexts longer than the training window
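
ALiBi replaces positional embeddings with a per-head linear penalty on attention scores that depends only on the distance between query and key, which is why it extends to sequences longer than those seen in training. The sketch below follows the slope scheme from ALiBi's published reference implementation; whether BloombergGPT used exactly this scheme for its 40 heads is an assumption, and the function names are chosen for this example.

```python
import math


def alibi_slopes(n_heads):
    """Per-head slopes: a geometric sequence, with interleaving for non-powers of two."""
    def power_of_two_slopes(n):
        return [2.0 ** (-8.0 * (i + 1) / n) for i in range(n)]

    if (n_heads & (n_heads - 1)) == 0:
        return power_of_two_slopes(n_heads)
    closest = 2 ** math.floor(math.log2(n_heads))
    extra = power_of_two_slopes(2 * closest)[0::2][: n_heads - closest]
    return power_of_two_slopes(closest) + extra


def alibi_bias(slope, seq_len):
    """bias[i][j] = -slope * (i - j) for j <= i; future positions get -inf."""
    return [
        [-slope * (i - j) if j <= i else -math.inf for j in range(seq_len)]
        for i in range(seq_len)
    ]


slopes = alibi_slopes(40)
short = alibi_bias(slopes[0], 3)
longer = alibi_bias(slopes[0], 5)

print(len(slopes), slopes[0])
# The bias for a longer sequence contains the shorter one as its top-left block.
print([row[:3] for row in longer[:3]] == short)
```

The bias matrix is added to the attention logits before the softmax. Since no entry depends on absolute position, nothing needs to be retrained or resized when the context grows.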

Reproducibility

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Model weights and FinPile are not released; replication is not possible from paper alone
  • FinPile contains private and purchased content; cannot be inspected or reused
  • Training stopped after ~80% of one epoch (569B of ~709B tokens), so longer training effects unknown
  • Evaluation relies on internal datasets that may partially overlap with training data, risking optimistic estimates

When Not To Use

  • When you require an open-source model and full reproducibility
  • For use-cases demanding knowledge updated after July 2022 (training cutoff)
  • When compute/budget cannot support large-scale fine-tuning or serving a 50B model

Failure Modes

  • Hallucination or wrong numeric reasoning on unseen company/report facts
  • Outdated facts for events after training cutoff (2022-07-31)
  • Data leakage risks if private FinPile content is unintentionally exposed
  • NER span sensitivity and prompt engineering fragility in few-shot setups

Core Entities

Models

  • BloombergGPT
  • BLOOM 176B
  • GPT-NeoX-20B
  • OPT 66B
  • GPT-3
  • PaLM 540B

Metrics

  • F1
  • Exact Match
  • Accuracy
  • Bits per byte
  • Perplexity
  • TFLOPs
  • Win rate

Datasets

  • FinPile
  • The Pile
  • C4
  • Wikipedia (2022-07-01)
  • ConvFinQA
  • FLUE
  • FPB
  • FiQA SA

Benchmarks

  • BIG-bench Hard
  • MMLU
  • Reading Comprehension
  • Linguistic Tasks
  • FLUE
  • ConvFinQA

Context Entities

Models

  • FLAN-T5-XXL
  • Minerva
  • BioGPT
  • Galactica

Datasets

  • EDGAR (Filings)
  • Bloomberg News
  • OpenWebText2

Benchmarks

  • HELM
  • BBH (subset)