A 50B-parameter LLM trained on ~700B tokens, specialized for financial NLP

March 30, 2023 | 8 min

Overview

Decision Snapshot: Needs Validation

The model shows strong, repeated improvements on finance tasks and robust general-task results. However, the model and core data are not released, training is expensive, and some internal eval data may overlap with training, so external replication is limited.

Citations: 299

Evidence Strength: 0.85

Confidence: 0.90

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/6

Reproducibility

Status: No open assets linked

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 60%

Novelty: 45%

Authors

Shijie Wu, Ozan Irsoy, Steven Lu, Vadim Dabravolski, Mark Dredze, Sebastian Gehrmann, Prabhanjan Kambadur, David Rosenberg, Gideon Mann

Why It Matters For Business

A mid-size LLM trained on a large curated finance corpus alongside general data gains roughly 10 points on average on public finance benchmarks while staying useful on general tasks, so firms can get domain accuracy without running the largest models.

Summary TLDR

BloombergGPT is a 50.6B-parameter, decoder-only language model trained on a mixed corpus of roughly 709 billion tokens (363B financial + 345B public). The authors show that mixing large curated finance data (FinPile) with standard public corpora produces substantial gains on financial tasks (public and internal), while preserving or improving performance on general benchmarks. Training used industry-scale infra (512 A100 GPUs), ZeRO stage-3 sharding, BF16 mixed precision, and a Unigram tokenizer. The model and FinPile are not released.

Problem Statement

Financial NLP needs models that understand finance-specific language and numeric/structured data. General LLMs perform well broadly but lag on finance tasks. The paper asks whether training a single mid-sized LLM on a mix of large, curated financial data plus general data yields strong in-domain performance without losing general abilities.

Main Contribution

Build and document FinPile, a 363B-token curated financial corpus, and mix it with 345B public tokens

Train BloombergGPT, a 50.6B-parameter decoder-only model (70 layers, 40 heads), on 569B tokens (a partial epoch, about 80% of the mixed corpus)

Key Findings

Mixed training (curated finance + public data) yields strong finance performance without losing general abilities

Numbers: Training corpus: 363B financial + 345B public ≈ 709B tokens; trained on 569B tokens

Practical Use: If you need finance accuracy, include large, curated in-domain data alongside general corpora when training or fine-tuning a single model.

Evidence Ref: §2, Table 1; §4
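
The token budget above can be checked with a few lines of arithmetic. This is a minimal sketch that uses only the rounded figures quoted in this summary; the function and variable names are illustrative, not from the paper. Note that the rounded components sum to 708B, so the quoted ≈709B presumably reflects unrounded counts.

```python
# Token counts in billions, as quoted in this summary (rounded figures).
FINANCIAL_TOKENS_B = 363
PUBLIC_TOKENS_B = 345
TRAINED_TOKENS_B = 569


def corpus_stats(financial_b: float, public_b: float, trained_b: float) -> dict:
    """Return the corpus total, the finance share, and the fraction of one epoch seen."""
    total_b = financial_b + public_b
    return {
        "total_b": total_b,
        "finance_share": financial_b / total_b,
        "epoch_fraction": trained_b / total_b,
    }


stats = corpus_stats(FINANCIAL_TOKENS_B, PUBLIC_TOKENS_B, TRAINED_TOKENS_B)
print(f"total tokens: {stats['total_b']}B")
print(f"finance share: {stats['finance_share']:.1%}")
print(f"epoch fraction: {stats['epoch_fraction']:.1%}")
```

With these inputs the finance share comes out slightly above half of the corpus, and the model sees roughly four fifths of the mixed data, which is consistent with the "partial epoch" description.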

BloombergGPT outperforms similar-sized open models on public financial benchmarks

Numbers: Public financial tasks average: BloombergGPT 62.51 vs GPT-NeoX 51.90 (Table 8)

Practical Use: Use a finance-specialized LLM to gain ~10+ points average on finance classification/QA over general LLMs.

Evidence Ref: Table 8

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | 62.51 (BloombergGPT) | 51.90 (GPT-NeoX) | +10.61 | Table 8 aggregated | Table 8 reports All Tasks (avg) | Table 8
Internal sentiment tasks average (weighted F1) | 62.47 (BloombergGPT) | 29.23 (GPT-NeoX) | +33.24 | Table 10 aggregated internal datasets | Table 10 shows large gains across internal sentiment datasets | Table 10
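
For readers who want to reproduce a weighted F1 number on their own sentiment data, the sketch below computes the support-weighted average of per-class F1 scores in plain Python. It is an illustration of the metric as commonly defined, not the paper's evaluation code, and the toy labels are made up for the example.

```python
from collections import Counter


def weighted_f1(y_true: list, y_pred: list) -> float:
    """Support-weighted average of per-class F1 scores."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    support = Counter(y_true)  # number of gold examples per class
    total = len(y_true)
    score = 0.0
    for label, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        score += f1 * n_true / total  # weight each class by its support
    return score


# Hypothetical toy labels, for illustration only.
gold = ["pos", "pos", "neg", "neu"]
pred = ["pos", "neg", "neg", "neu"]
print(round(weighted_f1(gold, pred), 4))
```

Weighting by support keeps a rare class from dominating the average, which matters for sentiment sets where the neutral class is usually the largest.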

What To Try In 7 Days

Run a small few-shot evaluation comparing a general LLM to a domain-adapted model on your finance task

Curate a modest in-domain dataset (news, filings, press releases) and fine-tune an existing LLM for sentiment/NER

Prototype natural-language-to-query flows (like BQL examples) using a few-shot prompt and validate outputs
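
A first few-shot comparison can be as small as the sketch below. Everything here is an assumption for illustration: the prompt format, the toy headlines, and the two stub "models", which are plain functions standing in for calls to a general LLM and a domain-adapted one. Swap in your own model clients and labeled data.

```python
from typing import Callable, Sequence

Example = tuple[str, str]  # (text, gold label)


def build_prompt(shots: Sequence[Example], query: str) -> str:
    """Format the labeled examples followed by the unlabeled query."""
    blocks = [f"Text: {text}\nSentiment: {label}" for text, label in shots]
    blocks.append(f"Text: {query}\nSentiment:")
    return "\n\n".join(blocks)


def evaluate(model: Callable[[str], str], shots: Sequence[Example],
             test_set: Sequence[Example]) -> float:
    """Return exact-match accuracy of the model's predicted label."""
    if not test_set:
        raise ValueError("test_set must not be empty")
    correct = 0
    for text, gold in test_set:
        prediction = model(build_prompt(shots, text)).strip().lower()
        correct += prediction == gold
    return correct / len(test_set)


def baseline_stub(prompt: str) -> str:
    """Placeholder for a general LLM: always answers 'neutral'."""
    return "neutral"


def domain_stub(prompt: str) -> str:
    """Placeholder for a domain-adapted LLM: keyword rules on the final query."""
    query = prompt.rsplit("Text: ", 1)[1].lower()
    if "beats" in query or "raises" in query:
        return "positive"
    if "misses" in query or "cuts" in query:
        return "negative"
    return "neutral"


# Hypothetical few-shot examples and test items.
shots = [("Company X raises full-year guidance", "positive"),
         ("Company Y cuts its dividend", "negative")]
test_set = [("Company Z beats earnings estimates", "positive"),
            ("Company W misses revenue target", "negative"),
            ("Company V holds annual meeting", "neutral")]

for name, model in [("baseline", baseline_stub), ("domain", domain_stub)]:
    print(name, f"{evaluate(model, shots, test_set):.2f}")
```

Keeping the prompt builder and the scorer separate from the model call makes it easy to hold the shots and test set fixed while changing only the model, which is the comparison that matters here.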

Optimization Features

Token Efficiency: Unigram tokenizer with a 131,072-token vocabulary (multi-word tokens allowed)
Infra Optimization: AWS SageMaker on 64 p4d.24xlarge instances (512 A100 GPUs); Amazon FSx for Lustre for fast I/O
Model Optimization: Positional Bias Check; LayerNorm variants (embedding LN added); Query-key layer scaling
System Optimization: Fused masked-causal-softmax kernels for attention; BF16 mixed precision for forward/backward, FP32 for parameter updates and some softmax steps
Training Optimization: ZeRO stage-3 sharding across 128 GPUs; Activation checkpointing to reduce memory; MiCS hierarchical communication patterns
Inference Optimization: ALiBi allows longer inference contexts than the training window
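
The ALiBi point can be made concrete. ALiBi replaces learned position embeddings with a per-head linear penalty on attention scores that grows with query-key distance, so the bias for any position pair is defined regardless of sequence length. The sketch below is a simplified illustration in plain Python, not the paper's implementation: it only handles power-of-two head counts, whereas a 40-head model such as BloombergGPT would need the non-power-of-two extension of the slope schedule, which is not shown.

```python
import math


def alibi_slopes(num_heads: int) -> list[float]:
    """Per-head slopes 2^(-8/n), 2^(-16/n), ... for a power-of-two head count n."""
    if num_heads < 1 or num_heads & (num_heads - 1):
        raise ValueError("this sketch only handles power-of-two head counts")
    start = 2.0 ** (-8.0 / num_heads)
    return [start ** (h + 1) for h in range(num_heads)]


def alibi_bias(slope: float, seq_len: int) -> list[list[float]]:
    """Causal bias matrix: slope * (j - i) for keys j <= i, -inf for future keys."""
    return [
        [slope * (j - i) if j <= i else -math.inf for j in range(seq_len)]
        for i in range(seq_len)
    ]


slopes = alibi_slopes(8)
short = alibi_bias(slopes[0], 3)
longer = alibi_bias(slopes[0], 5)

# The bias depends only on distance, so extending the sequence leaves
# the entries for earlier positions unchanged.
print(short[2])
print(longer[2][:3])
```

Because the penalty is a fixed function of distance rather than a learned table indexed by position, nothing in the model has to be resized to attend over more tokens at inference time than were seen in training.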

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Model weights and FinPile are not released; replication is not possible from the paper alone

FinPile contains private and purchased content; cannot be inspected or reused

When Not To Use

When you require an open-source model and full reproducibility

For use cases demanding knowledge updated after July 2022 (training cutoff)

Failure Modes

Hallucination or wrong numeric reasoning on unseen company/report facts

Outdated facts for events after training cutoff (2022-07-31)

Core Entities

Models

BloombergGPT, BLOOM 176B, GPT-NeoX-20B, OPT 66B, GPT-3, PaLM 540B

Metrics

F1, Exact Match, Accuracy, Bits per byte, Perplexity, TFLOPs, Win rate

Datasets

FinPile, The Pile, C4, Wikipedia (2022-07-01), ConvFinQA, FLUE, FPB, FiQA SA

Benchmarks

BIG-bench Hard, MMLU, Reading Comprehension, Linguistic Tasks, FLUE, ConvFinQA

Context Entities

Models

FLAN-T5-XXL, Minerva, BioGPT, Galactica

Datasets

EDGAR (Filings), Bloomberg News, OpenWebText2

Benchmarks

HELM, BBH (subset)