Domain-specific LLMs Papers — Parsed & Scored for Practitioners

A 50B-parameter LLM trained on ~700B tokens, specialized for financial NLP

0.60

0.45

0.80

299

A mid-size LLM trained with a large curated finance corpus yields big real-world gains on finance tasks while staying useful on general tasks, so firms can get domain accuracy without running huge models.

Key finding

Mixed training (curated finance + public data) yields strong finance performance without losing general abilities

Numbers: Training corpus: 363B financial + 345B public ≈ 709B tokens; trained on 569B tokens
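The corpus split above can be turned into shares with a few lines of arithmetic. This is a minimal sketch using only the rounded figures quoted in the entry; the function name `corpus_mix` is illustrative and does not come from the paper. Note that the rounded inputs sum to 708B, slightly below the quoted ≈709B, presumably because of rounding in the source figures.

```python
# Token counts in billions, copied from the "Numbers" line above (rounded).
FINANCIAL_B = 363
PUBLIC_B = 345
TRAINED_B = 569


def corpus_mix(financial_b: float, public_b: float, trained_b: float) -> dict:
    """Return each source's share of the corpus and the fraction consumed in training."""
    total_b = financial_b + public_b
    return {
        "total_b": total_b,
        "financial_share": financial_b / total_b,
        "public_share": public_b / total_b,
        "fraction_seen": trained_b / total_b,
    }


mix = corpus_mix(FINANCIAL_B, PUBLIC_B, TRAINED_B)
for key, value in mix.items():
    print(f"{key}: {value:.3f}")
```

On these figures the mix is close to half finance and half public data, and training consumed roughly four fifths of the available tokens.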

MEDITRON: open-source 7B and 70B medical LLMs trained on a 48B-token curated medical corpus

0.30

0.60

0.50

117

MEDITRON offers a strong, open-source medical LLM that rivals much larger closed models on standard benchmarks, enabling in-house finetuning, auditing, and deployment experiments while avoiding vendor lock-in—though it is not yet production-ready for clinical use.

Key finding

MEDITRON obtains consistent accuracy gains on medical benchmarks over open baselines.

Numbers: Avg accuracy +6% vs best public baseline in class; +3% vs finetuned Llama-2 (reported)

PIXIU: open financial LLM + 136K instruction examples and FLARE benchmark

0.60

0.50

43

Open domain-tuned models and labeled instruction data lower the bar to build finance-specific AI: cheaper customization, reproducible evaluation, and better performance on common text tasks; numeric QA and trading signals still need extra work.

Key finding

They built FIT with 136,609 instruction‑tuning examples across 5 tasks and 9 datasets.

Numbers: 136,609 samples; 5 tasks; 9 datasets

ChemLLM: a 7B chemistry-tuned LLM with ChemData (7M Q&A) and ChemBench (4.1k MCQs), matching GPT-4 on core chemical tasks

0.60

0.70

40

A domain-tuned 7B model can match or beat much larger closed models on key chemistry tasks, enabling lower-cost deployment of chemistry assistants and search tools for R&D teams.

Key finding

ChemData is a large-scale chemistry instruction dataset of about 7M Q&A pairs.

Numbers: 7M instruction Q&A (authors' dataset summary)

Clinical Camel: an open medical LLM fine-tuned with dialogue synthesis and single‑GPU QLoRA

0.20

0.60

0.40

35

An open, high-performing medical LLM reduces vendor lock-in, enables internal validation, and can be reproduced with modest compute, letting institutions experiment safely before any clinical adoption.

Key finding

Clinical Camel-70B beats GPT-3.5 on several medical QA benchmarks in five-shot tests.

Numbers: USMLE 64.3% vs GPT-3.5 58.5%; PubMedQA 77.9% vs 60.2%

A domain-tuned LLaMA-65B (InvestLM) for finance that boosts financial NLP and matches many commercial LLMs in expert judgment.

0.60

0.40

0.60

24

A small, high-quality instruction set can turn an open foundation model into a capable finance assistant, offering a lower-cost, open alternative to closed commercial finance LLMs while enabling on-premise control and inspection.

Key finding

Instruction-tuning LLaMA-65B with ~1,300 curated finance instructions improves most finance tasks.

Numbers: 8 of 9 tasks: InvestLM > LLaMA-65B (Table 3); FinSent 0.71→0.79

Survey of financial LLMs: techniques, benchmarks, and practical gaps

0.50

0.40

0.60

14

FinLLMs help automate common finance language tasks but are uneven: use task-finetuned PLMs for classification/NER to cut cost; reserve large LLMs for complex QA or exploratory uses with human checks.

Key finding

For sentiment analysis, mixed-domain PLMs achieved top scores, while instruction-finetuned LLMs matched but cost more.

Numbers: FLANG-ELECTRA F1=92%; FinMA-30B/GPT-4 F1≈87% (5-shot)

A 7B cancer-specialized LLM that matches or beats larger models on phenotype extraction and diagnosis generation

0.60

0.45

0.75

11

CancerLLM shows that a domain-tuned 7B model can reach or exceed larger models on cancer tasks while using far less GPU memory, lowering operational cost for hospitals and clinics.

Key finding

CancerLLM achieves state-of-the-art average F1 on diagnosis generation among evaluated models.

Numbers: Diagnosis average F1 = 86.81% (Table 1)

Typhoon: a 7B Thai-focused LLM that matches GPT-3.5 on many Thai tasks and tokenizes Thai 2.62× more efficiently

0.60

0.50

0.70

7

Typhoon gives companies a ready open-source Thai LLM that saves token costs (≈2.6×) and outperforms other open Thai models on exams and many Thai tasks, reducing engineering time versus building a Thai model from scratch.

Key finding

Typhoon is the best open-source Thai LLM on evaluated Thai benchmarks.

Numbers: ThaiExam average 0.442 vs next best SeaLLM 0.366
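To see what a 2.62× tokenization gain could mean for a token-priced workload, here is a rough sketch. It assumes that "2.62× more efficiently" means the same Thai text needs 2.62× fewer tokens than with the baseline tokenizer the authors compare against; the monthly volume and the price are hypothetical placeholders, not figures from the paper.

```python
TOKEN_RATIO = 2.62  # from the entry above; interpretation as "fewer tokens" is an assumption


def tokens_after_switch(baseline_tokens: float, ratio: float = TOKEN_RATIO) -> float:
    """Estimate the token count for the same text under the more efficient tokenizer."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return baseline_tokens / ratio


def fraction_saved(ratio: float = TOKEN_RATIO) -> float:
    """Fraction of tokens (and token-priced cost) saved at the given ratio."""
    return 1 - 1 / ratio


def cost(tokens: float, price_per_million: float) -> float:
    """Cost of processing `tokens` at a flat per-million-token price."""
    return tokens / 1_000_000 * price_per_million


baseline_tokens = 500_000_000  # hypothetical monthly Thai token volume
price_per_million = 1.0        # hypothetical price in arbitrary currency units

new_tokens = tokens_after_switch(baseline_tokens)
print(f"tokens: {baseline_tokens:,.0f} -> {new_tokens:,.0f}")
print(f"cost:   {cost(baseline_tokens, price_per_million):.2f} -> {cost(new_tokens, price_per_million):.2f}")
print(f"saved:  {fraction_saved():.1%}")
```

Under that assumption a 2.62× ratio corresponds to roughly 62% fewer tokens for the same text.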

A practical pipeline and datasets to adapt general LLMs into telecom-specialized models and benchmarks

0.50

0.40

0.60

7

Fine-tuning mid-size LLMs on telecom-specific text and tasks gives big practical gains in document understanding, math modeling and code tasks at much lower cost than training from scratch.

Key finding

Domain adaptation via instruction tuning and alignment improved telecom math equation recovery.

Numbers: MathBERT average score: Llama3-8B-TI-TA 49.45 vs GPT-4 49.38; cases scoring ≥90%: 9.52% vs GPT-4 3.77%

A 2B Chinese‑centric LLM trained from scratch on 800B Chinese tokens, plus an open Chinese corpus and a hard-case Chinese benchmark.

0.50

0.60

5

If your product targets Chinese users, pretraining with a large Chinese-majority corpus plus Chinese-heavy SFT yields better cultural knowledge and instruction following than adapting an English-first model.

Key finding

They pretrain on a 1.2547 trillion token corpus with a Chinese majority.

Numbers: 1,254.68B total tokens; 840.48B Chinese, 314.88B English, 99.3B code

Use a fine-tuned language model plus spatiotemporal patching to predict 2D unsteady fluid flow faster and with lower error than prior ML sur

0.60

0.50

5

FLUID-LLM can cut multi-step prediction error for 2D CFD tasks and adapt from short context histories, helping engineering teams get fast, accurate surrogates without full solver runs.

Key finding

Scaling the LLM reduced long-horizon error on the Cylinder dataset.

Numbers: RMSE at 150 steps: FLUID-OPT125m=0.102 → FLUID-OPT2.7b=0.059 (≈42% reduction)

Build a modular Chinese financial LLM from instruction data and four task-specific LoRA experts

0.60

0.50

0.60

4

You can get domain gains cheaply by training small LoRA adapters and plugins instead of re-training big models; this yields better finance answers, more reliable calculations, and modular deployment.

Key finding

Task-specific LoRA adapters raise average FinNLP performance by a few to several points versus the base model.

Numbers: Average improvement of +2 to +9 points on six FinCUGE tasks (Table 3)
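The idea of keeping one frozen base model and switching small task-specific LoRA adapters can be sketched in plain Python. This is an illustration of the general LoRA mechanism (a low-rank update scaled by alpha / rank, with B initialised to zero), not the authors' implementation; the class names and the four task names are hypothetical.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


class LoRAAdapter:
    """Low-rank update delta_W = (alpha / rank) * B @ A; B starts at zero."""

    def __init__(self, d_out, d_in, rank, alpha, seed=0):
        rng = random.Random(seed)
        self.scale = alpha / rank
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]

    def delta(self, x):
        """Apply the low-rank update to x without materialising B @ A."""
        return [self.scale * v for v in matvec(self.B, matvec(self.A, x))]


class AdaptedLinear:
    """Frozen base weight plus a dictionary of swappable task adapters."""

    def __init__(self, weight, adapters):
        self.weight = weight      # frozen base weights, never updated
        self.adapters = adapters  # task name -> LoRAAdapter

    def forward(self, x, task=None):
        y = matvec(self.weight, x)
        if task is not None:
            y = [base + upd for base, upd in zip(y, self.adapters[task].delta(x))]
        return y


# Hypothetical task names standing in for the four experts.
tasks = ["consulting", "nlp", "computing", "retrieval"]
weight = [[1.0, 0.0], [0.0, 1.0]]
adapters = {t: LoRAAdapter(2, 2, rank=1, alpha=2.0, seed=i) for i, t in enumerate(tasks)}
layer = AdaptedLinear(weight, adapters)

x = [1.0, 2.0]
print(layer.forward(x))               # base model only
print(layer.forward(x, "computing"))  # same output until B is trained away from zero
```

Only A and B would be trained per task, which is why adapters stay small relative to the base model and can be swapped at serving time.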

BiMediX — a bilingual English/Arabic medical Mixture-of-Experts LLM plus a 1.3M bilingual medical instruction set

0.30

0.60

0.80

4

BiMediX shows you can deliver bilingual medical accuracy with much lower serving cost: similar or better accuracy than large 70B models while running 8x faster, making research deployments and low-latency prototypes cheaper.

Key finding

BiMediX beats Med42 and Meditron on English medical benchmarks.

Numbers: avg +2.5% vs Med42; +4.1% vs Meditron (English benchmarks)

PharmaGPT: 13B–70B domain LLMs that outperform general models on pharmacy and chemistry tests

0.60

4

Focused domain models give near–GPT-4 quality on bio-pharma tasks with fewer resources, enabling faster, cheaper deployment for search, translation, tutoring, and R&D assistants; validate before clinical use.

Key finding

PharmaGPT 0.7 scores 66–76 on NAPLEX sections, outperforming earlier PharmaGPT versions and GPT-3.5-turbo.

Numbers: NAPLEX I/II/III = 66 / 68 / 76 (PharmaGPT 0.7) [Table 4]

Chemistry foundation models power structure-focused multimodal RAG inside hierarchical multi-agent workflows

0.60

3

Structure-aware embeddings let search and agents find chemical analogs and spectra faster, cutting researcher time for design and analysis and enabling automated, multimodal retrieval inside lab-facing agent workflows.

Key finding

MoLFormer embeddings retrieve structurally close small-molecule analogs even when fingerprint metrics disagree.

Numbers: 2.5M small-molecule collection; cosine similarity up to 1.00 for identical hits
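Embedding-based analog retrieval of this kind reduces to a nearest-neighbour search under cosine similarity. The sketch below uses made-up 3-dimensional vectors as stand-ins for molecule embeddings; the names `cosine_similarity`, `top_k`, and the `mol_*` keys are illustrative, and a real collection of millions of molecules would normally use an approximate nearest-neighbour index rather than a linear scan.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


def top_k(query, index, k=3):
    """Return the k entries of `index` most similar to `query`, best first."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in index.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]


# Toy vectors standing in for learned molecule embeddings (made up).
index = {
    "mol_a": [0.9, 0.1, 0.0],
    "mol_b": [0.8, 0.2, 0.1],
    "mol_c": [0.0, 0.1, 0.9],
}

hits = top_k(index["mol_a"], index, k=2)
for name, score in hits:
    print(f"{name}: {score:.2f}")
```

Querying with a vector that is already in the index returns that entry with similarity 1.00 (up to floating-point error), which matches the "identical hits" case in the numbers above.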

Train LLMs to read 12‑lead ECGs and draft clinical reports using lightweight multimodal alignment

0.40

0.60

3

MEIT can automate first-draft ECG reports and speed clinician workflows; it uses small extra compute (LoRA + small ECG encoder) and public datasets so teams can prototype quickly.

Key finding

Instruction-tuned LLMs substantially outperform small pretrained language models on report-generation metrics.

Numbers: Example: LLaMA-3-Instruct BLEU-4 0.61 vs GPT2-Large 0.476 on MIMIC-IV-ECG (Table 1)

Fietje: open, compact Dutch LLM (2.8B) trained on 28B Dutch tokens with full reproducibility

0.60

0.50

0.60

2

Open, compact Dutch LLMs let teams run fast, inexpensive inference and reproduce experiments; modern multilingual small models often beat older larger Dutch models, so try recent small multilingual options before costly full retraining.

Key finding

Fietje was built by continued pretraining on 28 billion Dutch tokens.

Numbers: 28B Dutch tokens

Two Dutch-tuned Llama 2 models, translated instruction datasets, and a Dutch leaderboard to jumpstart Dutch LLM work

0.45

0.25

0.40

2

If you need Dutch-capable LLMs quickly, this work gives deployable models, translated instruction datasets and quantised weights so teams can iterate without building corpora from scratch.

Key finding

Two Dutch-tuned Llama 2 13B models were released: a text-completion model and a chat model.

Numbers: Finetune compute: 120 GPU hours (text), ~55 GPU hours (chat)

Juru: a 7B model specialized on 1.9B Brazilian legal tokens that improves legal exam accuracy but harms general knowledge

0.40

0.60

2

You can cheaply improve an LLM for a legal product by continued pretraining on a modest, high-quality legal corpus, but expect trade-offs: general-purpose capabilities can degrade.

Key finding

Specialization improves legal-exam accuracy vs base model.

Numbers: Mean accuracy +4.7 pp (44.5% → 49.2%) on 8 legal exams
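The gain above is an absolute difference between two accuracies, which is easy to confuse with a relative improvement. A short sketch, using only the two accuracies quoted in the entry, makes the distinction explicit; the function name is illustrative.

```python
def accuracy_gain(before_pct: float, after_pct: float) -> tuple:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute_pp = after_pct - before_pct
    relative_pct = absolute_pp / before_pct * 100
    return absolute_pp, relative_pct


pp, rel = accuracy_gain(44.5, 49.2)
print(f"+{pp:.1f} pp absolute, +{rel:.1f}% relative")
```

The same result reads as 4.7 percentage points in absolute terms or about 10.6% relative to the base model, so it helps to state which one is meant when comparing entries.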

ArcGPT — a 7B LLM and AMBLE benchmark built for real archival tasks

0.40

0.50

0.40

2

ArcGPT and AMBLE let archives and data teams automate labeling and access decisions using a model trained on archive language; expect faster triage but verify with a predictive model for critical classification.

Key finding

ArcGPT achieves strong classification performance on archival label tasks.

Numbers: F1 = 84.40 (retention), 84.00 (open-access), 94.40 (confidentiality)

Eir-8B: an 8B-parameter Thai medical LLM that improves medical QA, translation, and 18 clinical tasks

1.00

0.60

1

Eir-8B shows tangible gains on Thai medical QA, translation, and 18 clinical tasks, so hospitals and health-tech teams can build higher-quality Thai clinical assistants while keeping data on-premises.

Key finding

Eir-8B-prob achieves a higher average medical benchmark score than Typhoon-v1.5x-8B-instruct.

Numbers: Avg MMLU: Eir-8B+Prob 80.2 vs Typhoon 69.1 (Δ ≈ +11.1)

SNFinLLM: Chinese financial LLM with domain pretraining, instruction tuning, DPO alignment, and calculator integration

0.50

0.45

0.60

1

Domain pre-training plus instruction tuning yields measurable accuracy gains on finance QA and exam tasks; adding a calculator reduces numeric errors—useful for advisory, research automation, and computation-heavy workflows.

Key finding

Domain continuous pre-training raises benchmark accuracy.

Numbers: FEval +4.64 pp (59.30 → 63.94)
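Calculator integration generally means the model emits an expression and a deterministic tool computes the result. The sketch below shows one way this could look; it is not SNFinLLM's mechanism. The `[calc: ...]` marker format and the function names are hypothetical, and only basic arithmetic is allowed.

```python
import ast
import operator
import re

# Whitelisted operators; anything else is rejected.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_UNARY_OPS = {ast.USub: operator.neg, ast.UAdd: operator.pos}


def safe_eval(expression: str) -> float:
    """Evaluate a plain arithmetic expression without calling eval()."""

    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if (
            isinstance(node, ast.Constant)
            and isinstance(node.value, (int, float))
            and not isinstance(node.value, bool)
        ):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](_eval(node.operand))
        raise ValueError(f"unsupported expression: {expression!r}")

    return _eval(ast.parse(expression.strip(), mode="eval"))


# Hypothetical marker a model could be trained to emit instead of doing arithmetic itself.
CALC_PATTERN = re.compile(r"\[calc:\s*([^\]]+)\]")


def fill_calculations(text: str) -> str:
    """Replace every [calc: ...] marker with the computed value."""
    return CALC_PATTERN.sub(lambda m: f"{safe_eval(m.group(1)):g}", text)


draft = "Net margin is [calc: 12.5 / 250 * 100]% on revenue of 250."
print(fill_calculations(draft))
```

Parsing with `ast` and whitelisting operators keeps the tool from executing arbitrary code that might appear in model output.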

A COBOL- and mainframe-specialized LLM plus a MainframeBench to evaluate modernization tasks

0.60

0.70

1

A model specialized on COBOL and mainframe docs can cut developer triage and summarization work by producing more accurate summaries and answers about legacy code on evaluated tasks.

Key finding

XMainframe-Instruct 10.5B achieves the highest multiple-choice accuracy on MainframeBench among the evaluated models.

Numbers: 77.89% accuracy (XMainframe 10.5B) vs 73.9% (GPT-4) and 53.29% (DeepSeek-Coder-Instruct 33B) on the MCQ split (Table 2).