A 50B-parameter LLM built on a ~700B-token mixed corpus, specialized for financial NLP

Overview

Decision Snapshot: Needs Validation

The model shows strong, repeated improvements on finance tasks and robust general-task results. However, the model and core data are not released, training is expensive, and some internal eval data may overlap with training, so external replication is limited.

Citations: 299

Evidence Strength: 0.85

Confidence: 0.90

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/6

Reproducibility

Status: No open assets linked

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 60%

Novelty: 45%

Authors

Shijie Wu, Ozan Irsoy, Steven Lu, Vadim Dabravolski, Mark Dredze, Sebastian Gehrmann, Prabhanjan Kambadur, David Rosenberg, Gideon Mann


Why It Matters For Business

A mid-size LLM trained on a large curated finance corpus gains roughly 10 points on average over comparable open models on public finance benchmarks while staying competitive on general tasks, so firms can get domain accuracy without running the largest models.

Who Should Care

Product Manager, ML Engineer, Founder, CTO, Data Scientist

Summary TLDR

BloombergGPT is a 50.6B-parameter, decoder-only language model. Its training corpus mixes about 363B tokens of curated financial text (FinPile) with about 345B tokens of public data, roughly 709B tokens in total, of which 569B were consumed during training. The authors show that this mix produces substantial gains on financial tasks (public and internal) while preserving or improving performance on general benchmarks. Training ran on 512 A100 GPUs with ZeRO stage-3 sharding, BF16 mixed precision, and a Unigram tokenizer. Neither the model nor FinPile has been released.

Problem Statement

Financial NLP needs models that understand finance-specific language and numeric/structured data. General LLMs perform well broadly but lag on finance tasks. The paper asks whether training a single mid-sized LLM on a mix of large, curated financial data plus general data yields strong in-domain performance without losing general abilities.

Main Contribution

Build and document FinPile, a 363B-token curated financial corpus, and mix it with 345B tokens of public data

Train BloombergGPT, a 50.6B-parameter decoder-only model (70 layers, 40 attention heads), on 569B tokens (less than one full pass over the mixed corpus)

Key Findings

Mixed training (curated finance + public data) yields strong finance performance without losing general abilities

Numbers: Training corpus of 363B financial + 345B public tokens (≈ 709B total, component figures rounded); 569B tokens consumed in training

Practical Use: If you need finance accuracy, include large, curated in-domain data alongside general corpora when training or fine-tuning a single model.

Evidence Ref: §2, Table 1; §4
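The token counts above imply a mix ratio and an epoch fraction that are easy to check by hand. The sketch below is illustrative arithmetic only: it uses the rounded figures quoted in this summary (which sum to 708B, slightly under the quoted total), and the function and variable names are hypothetical.

```python
# Token counts in billions, as quoted in this summary (rounded values).
FINANCIAL_B = 363
PUBLIC_B = 345
TRAINED_B = 569


def corpus_stats(financial_b, public_b, trained_b):
    """Return the total size, the mix shares, and the fraction of one epoch trained."""
    total_b = financial_b + public_b
    return {
        "total_b": total_b,
        "financial_share": financial_b / total_b,
        "public_share": public_b / total_b,
        "epoch_fraction": trained_b / total_b,
    }


stats = corpus_stats(FINANCIAL_B, PUBLIC_B, TRAINED_B)
for key, value in stats.items():
    print(f"{key}: {value:.4f}" if isinstance(value, float) else f"{key}: {value}")
```

With these inputs the financial share comes out just above one half and the epoch fraction at about 0.8, consistent with the "one partial epoch" description.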

BloombergGPT outperforms similar-sized open models on public financial benchmarks

Numbers: Public financial tasks average: BloombergGPT 62.51 vs GPT-NeoX 51.90 (Table 8)

Practical Use: Use a finance-specialized LLM to gain ~10+ points average on finance classification/QA over general LLMs.

Evidence Ref: Table 8

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Public financial tasks average (All Tasks avg)	62.51 (BloombergGPT)	51.90 (GPT-NeoX)	+10.61	Table 8 aggregated	Table 8 reports All Tasks (avg)	Table 8
Internal sentiment tasks average (weighted F1)	62.47 (BloombergGPT)	29.23 (GPT-NeoX)	+33.24	Table 10 aggregated internal datasets	Table 10 shows large gains across internal sentiment datasets	Table 10
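The Delta column can be reproduced directly from the Value and Baseline columns. The following sketch recomputes the absolute deltas and adds a relative gain for context; the relative figures are derived here and are not reported in the source, and the helper name is hypothetical.

```python
# (label, BloombergGPT score, GPT-NeoX baseline) taken from the table above.
RESULTS = [
    ("Public financial tasks (avg)", 62.51, 51.90),
    ("Internal sentiment tasks (avg weighted F1)", 62.47, 29.23),
]


def score_deltas(rows):
    """Return (label, absolute delta in points, relative gain in percent) per row."""
    out = []
    for label, model, baseline in rows:
        absolute = round(model - baseline, 2)
        relative = round(100 * (model - baseline) / baseline, 1)
        out.append((label, absolute, relative))
    return out


deltas = score_deltas(RESULTS)
for label, absolute, relative in deltas:
    print(f"{label}: +{absolute} points ({relative}% over baseline)")
```

The relative view makes the contrast clearer: the gain on internal sentiment data is far larger than on public benchmarks, which is also where external validation is hardest.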

What To Try In 7 Days

Run a small few-shot evaluation comparing a general LLM to a domain-adapted model on your finance task

Curate a modest in-domain dataset (news, filings, press releases) and fine-tune an existing LLM for sentiment/NER

Prototype natural-language-to-query flows (like BQL examples) using a few-shot prompt and validate outputs
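The first item above, comparing a general model to a domain-adapted one, needs little more than a labeled sample and a shared metric. Below is a minimal sketch using support-weighted F1 (the metric family this summary cites for sentiment tasks). The two predictor functions are hypothetical keyword stubs standing in for real model calls, and the four examples are made up for illustration; scores from them say nothing about any real model.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted average of per-class F1 over the classes in y_true."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label in support:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += f1 * support[label] / total
    return score


def compare_models(examples, predict_fns):
    """Score every predictor on the same (text, gold_label) examples."""
    gold = [label for _, label in examples]
    return {
        name: weighted_f1(gold, [fn(text) for text, _ in examples])
        for name, fn in predict_fns.items()
    }


# Hypothetical stand-ins: replace with calls to the models under test.
def general_stub(text):
    return "positive" if "beats" in text else "neutral"


def domain_stub(text):
    if "beats" in text or "raises" in text:
        return "positive"
    if "cuts" in text or "misses" in text:
        return "negative"
    return "neutral"


EXAMPLES = [  # illustrative, invented headlines
    ("Company beats earnings estimates", "positive"),
    ("Firm raises full-year guidance", "positive"),
    ("Retailer cuts dividend", "negative"),
    ("Board meeting scheduled for Tuesday", "neutral"),
]

scores = compare_models(EXAMPLES, {"general": general_stub, "domain": domain_stub})
print(scores)
```

Keeping the metric and the example set fixed across models is the point of the design: any difference in score then comes from the predictor alone.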

Optimization Features

Token Efficiency

Unigram tokenizer with a 131,072-token vocabulary (multi-word tokens allowed)
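To illustrate how a Unigram tokenizer chooses tokens, the sketch below runs a Viterbi search for the most probable segmentation under a unigram language model. It is a toy, not the authors' tokenizer: the vocabulary and log-probabilities are invented, and the multi-word entry "net income" is included only to show how a token spanning whitespace can win over its parts.

```python
import math


def unigram_segment(text, log_probs):
    """Return the highest-probability segmentation of text (Viterbi search)."""
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i]: best log-prob of any split of text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last token in that split
    best[0] = 0.0
    max_len = max(len(tok) for tok in log_probs)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in log_probs:
                candidate = best[start] + log_probs[piece]
                if candidate > best[end]:
                    best[end] = candidate
                    back[end] = start
    if best[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    tokens = []
    end = n
    while end > 0:  # walk the back-pointers from the end of the string
        tokens.append(text[back[end]:end])
        end = back[end]
    return tokens[::-1]


# Hypothetical toy vocabulary with made-up log-probabilities.
TOY_VOCAB = {
    "net income": -6.0,
    "net": -4.0,
    "income": -4.5,
    "in": -5.0,
    "come": -5.0,
    " ": -2.0,
}

print(unigram_segment("net income", TOY_VOCAB))
```

Because the single multi-word token scores higher than the three-token split in this toy vocabulary, the phrase is emitted as one token, which is how multi-word tokens can shorten sequences.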

Infra Optimization

AWS SageMaker on 64 p4d.24xlarge instances (512 A100 GPUs)

Amazon FSx for Lustre for fast IO

Model Optimization

Positional Bias Check

LayerNorm variants (embedding LN added)

Query-key layer scaling

System Optimization

Fused masked-causal-softmax kernels for attention

BF16 mixed precision for forward/backward, FP32 for parameter updates and some softmax steps
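One common motivation for keeping parameter updates in FP32 while computing in BF16 is that BF16 carries only about 8 bits of mantissa precision, so small updates can round away entirely. The sketch below emulates BF16 rounding in pure Python to show the effect. It is a general illustration, not the authors' stated rationale or code; the helper names are hypothetical and NaN handling is omitted.

```python
import struct


def to_fp32(x):
    """Round a Python float to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_bf16(x):
    """Round to the nearest bfloat16 value (round-to-nearest-even), as a Python float.

    bfloat16 keeps the top 16 bits of the float32 bit pattern. NaN is not handled.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = ((bits + rounding_bias) >> 16) << 16
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]


def accumulate(weight, update, steps, cast):
    """Apply the same small update repeatedly, storing the weight via cast."""
    for _ in range(steps):
        weight = cast(weight + update)
    return weight


bf16_weight = accumulate(1.0, 0.001, 1000, to_bf16)
fp32_weight = accumulate(1.0, 0.001, 1000, to_fp32)
print("stored in bf16:", bf16_weight)
print("stored in fp32:", fp32_weight)
```

Near 1.0 the spacing between adjacent BF16 values is 2^-7, so an update of 0.001 is rounded away at every step, while the FP32 copy accumulates all thousand updates.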

Training Optimization

ZeRO stage-3 sharding across 128 GPUs per model copy (four copies across the 512 GPUs)

Activation checkpointing to reduce memory

MiCS hierarchical communication patterns
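A back-of-envelope estimate shows why sharding matters at this scale. The sketch below assumes the usual mixed-precision Adam accounting of about 16 bytes of model state per parameter (2 for low-precision weights, 2 for gradients, 12 for FP32 weights and two optimizer moments); that constant is an assumption, not a figure from the source, and the estimate ignores activations, buffers, and fragmentation.

```python
N_PARAMS = 50.6e9          # parameter count quoted in this summary
INSTANCES = 64             # p4d.24xlarge instances
TOTAL_GPUS = 512           # A100 GPUs across those instances
SHARD_GROUP = 128          # GPUs that jointly hold one sharded model copy
BYTES_PER_PARAM = 16       # ASSUMPTION: mixed-precision Adam model states


def zero3_state_gb_per_gpu(n_params, shard_group_size, bytes_per_param):
    """Model-state memory per GPU when states are split evenly across the group."""
    return n_params * bytes_per_param / shard_group_size / 1e9


gpus_per_instance = TOTAL_GPUS // INSTANCES
replicas = TOTAL_GPUS // SHARD_GROUP
unsharded_gb = N_PARAMS * BYTES_PER_PARAM / 1e9
per_gpu_gb = zero3_state_gb_per_gpu(N_PARAMS, SHARD_GROUP, BYTES_PER_PARAM)

print(f"GPUs per instance: {gpus_per_instance}")
print(f"model replicas: {replicas}")
print(f"unsharded model states: {unsharded_gb:.1f} GB")
print(f"per-GPU model states with sharding: {per_gpu_gb:.3f} GB")
```

Under these assumptions the full model states come to roughly 810 GB, far beyond a single accelerator, while each GPU in a 128-way shard group holds only a few gigabytes of them.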

Inference Optimization

ALiBi allows inference contexts longer than the training window
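ALiBi replaces learned position embeddings with a per-head linear penalty on query-key distance, which is why it stays defined at lengths never seen in training. The sketch below computes the slopes and bias for a power-of-two head count, following the geometric sequence described in the ALiBi paper. It is illustrative only: BloombergGPT uses 40 heads, which needs a different slope scheme that is not reproduced here, and the function names are hypothetical.

```python
def alibi_slopes(n_heads):
    """Per-head slopes 2^(-8/n), 2^(-16/n), ... for a power-of-two head count."""
    if n_heads < 1 or n_heads & (n_heads - 1):
        raise ValueError("this sketch only covers power-of-two head counts")
    return [2 ** (-8 * (i + 1) / n_heads) for i in range(n_heads)]


def alibi_bias(n_heads, seq_len):
    """bias[h][q][k] added to attention scores before softmax.

    For k <= q the bias is slope_h * (k - q), which is zero or negative.
    Future positions (k > q) are set to -inf as the causal mask.
    """
    bias = []
    for slope in alibi_slopes(n_heads):
        head = [
            [slope * (k - q) if k <= q else float("-inf") for k in range(seq_len)]
            for q in range(seq_len)
        ]
        bias.append(head)
    return bias


print(alibi_slopes(8))
print(alibi_bias(2, 3)[0])
```

Because the bias depends only on the distance k - q, calling `alibi_bias` with a larger `seq_len` at inference simply extends the same linear penalty to longer distances.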

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Partial

License: Unknown

Risks & Boundaries

Limitations

Model weights and FinPile are not released; replication is not possible from paper alone

FinPile contains private and purchased content; cannot be inspected or reused

When Not To Use

When you require an open-source model and full reproducibility

For use-cases demanding knowledge updated after July 2022 (training cutoff)

Failure Modes

Hallucination or wrong numeric reasoning on unseen company/report facts

Outdated facts for events after training cutoff (2022-07-31)

Core Entities

Models

BloombergGPT, BLOOM 176B, GPT-NeoX-20B, OPT 66B, GPT-3, PaLM 540B

Metrics

F1, Exact Match, Accuracy, Bits per byte, Perplexity, TFLOPs, Win rate

Datasets

FinPile, The Pile, C4, Wikipedia (2022-07-01), ConvFinQA, FLUE, FPB, FiQA SA

Benchmarks

BIG-bench Hard, MMLU, Reading Comprehension, Linguistic Tasks, FLUE, ConvFinQA

Context Entities

Models

FLAN-T5-XXL, Minerva, BioGPT, Galactica

Datasets

EDGAR (Filings), Bloomberg News, OpenWebText2

Benchmarks

HELM, BBH (subset)

