Pretraining Papers — Parsed & Scored for Practitioners

MEDITRON: open-source 7B and 70B medical LLMs trained on a 48B-token curated medical corpus

0.30

0.60

0.50

117

MEDITRON offers a strong, open-source medical LLM that rivals much larger closed models on standard benchmarks, enabling in-house finetuning, auditing, and deployment experiments while avoiding vendor lock-in—though it is not yet production-ready for clinical use.

Key finding

MEDITRON obtains consistent accuracy gains on medical benchmarks over open baselines.

Numbers: Avg accuracy +6% vs best public baseline in class; +3% vs finetuned Llama-2 (reported)

Practical survey: how to keep LLMs up-to-date via continual pretraining, instruction tuning, and alignment

0.60

0.40

0.70

23

Continual learning lets LLMs stay current with facts, tools and user values without full retraining, saving time and money while reducing model downtime.

Key finding

Continual learning for LLMs is multi-stage: continual pretraining, instruction tuning, and alignment.

ChiMed‑GPT: a 13B Chinese medical LLM trained with pretraining, SFT and RLHF for safer, better medical answers

0.60

0.45

0.50

10

ChiMed‑GPT is a practical open-source Chinese medical LLM that gives clearer patient-facing answers, handles longer clinical text (4,096 tokens), and lowers risky biased replies — useful for telemedicine, triage bots, and medical content generation.

Key finding

Open-ended QA (BLEU-1): ChiMed‑GPT scored higher than GPT-4 on the tested dataset.

Numbers: BLEU-1 33.14 (ChiMed‑GPT) vs 24.29 (GPT-4)
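For context on the metric, BLEU-1 is clipped unigram precision scaled by a brevity penalty. The sketch below assumes whitespace tokenization and a single reference; the tokenization the paper used for Chinese text is not given here and will differ. `bleu1` is a hypothetical helper, not the paper's evaluation code, and it returns a value in [0, 1] rather than on the 0-100 scale used above.

```python
import math
from collections import Counter


def bleu1(candidate: str, reference: str) -> float:
    """BLEU-1 for one candidate/reference pair (whitespace tokens, an assumption)."""
    cand = candidate.split()
    ref = reference.split()
    if not cand:
        return 0.0
    ref_counts = Counter(ref)
    # Clip each candidate token's count by its count in the reference.
    clipped = sum(min(n, ref_counts[tok]) for tok, n in Counter(cand).items())
    precision = clipped / len(cand)
    # The brevity penalty lowers the score of candidates shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * precision
```

Because BLEU-1 only counts word overlap, a higher score indicates closer wording to the reference answers, not necessarily better medical correctness.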

SeaLLMs: language models tuned and tokenized for Southeast Asian languages

0.70

0.80

7

SeaLLMs let companies offer cheaper, smaller models that serve Southeast Asian languages better than general English-centric models, improving UX and reducing API costs for these markets.

Key finding

Vocabulary expansion sharply reduced token cost for non‑Latin SEA scripts.

Numbers: Thai token ratio improved from 9.09 to 1.87 (SeaLLMs, Table 1)
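To make the token-ratio idea concrete, here is a minimal sketch. It assumes the ratio is the token count of a text in the target language divided by the token count of the same content in English. The two tokenizers are toy stand-ins (one token per UTF-8 byte versus one token per code point), not the actual Llama or SeaLLMs tokenizers, and all function names are hypothetical.

```python
from typing import Callable, List

Tokenizer = Callable[[str], List[str]]


def byte_fallback_tokens(text: str) -> List[str]:
    """Toy tokenizer with no vocabulary for the script: one token per UTF-8 byte."""
    return [f"<0x{b:02X}>" for b in text.encode("utf-8")]


def char_tokens(text: str) -> List[str]:
    """Toy tokenizer with an expanded vocabulary: one token per code point."""
    return list(text)


def token_ratio(tokenize: Tokenizer, target_text: str, baseline_tokens: int) -> float:
    """Tokens needed for target_text relative to a baseline token count."""
    if baseline_tokens <= 0:
        raise ValueError("baseline_tokens must be positive")
    return len(tokenize(target_text)) / baseline_tokens
```

Thai code points occupy three bytes each in UTF-8, so a byte-fallback tokenizer spends roughly three times as many tokens as one with dedicated Thai vocabulary entries. That is the kind of gap vocabulary expansion closes, and it translates directly into context length and per-token cost.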

Many pre-trained transformers already contain a large "free" sparse subnetwork you can remove with little cost

0.65

0.60

0.70

6

You can cut 30–50% of many pre-trained transformers with one cheap pruning pass and little accuracy loss, saving memory and inference costs without costly retraining.

Key finding

Pre-trained transformers have an 'essential sparsity' range where many weights can be removed with minimal loss.

Numbers: About 30–50% weights removable with <=1% downstream drop (evaluated tasks)
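As an illustration of a single cheap pruning pass, the sketch below applies one-shot magnitude pruning to a flat list of weights. Treating the pass as global magnitude pruning is an assumption for illustration; `magnitude_prune` is a hypothetical helper, and a real model would be pruned tensor by tensor with a framework utility.

```python
from typing import List


def magnitude_prune(weights: List[float], sparsity: float) -> List[float]:
    """Zero out the `sparsity` fraction of weights with the smallest magnitude."""
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be in [0, 1]")
    n_prune = int(len(weights) * sparsity)
    # Indices sorted by ascending |w|; the first n_prune are removed.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(order[:n_prune])
    return [0.0 if i in pruned else w for i, w in enumerate(weights)]
```

No gradient steps are involved, which is why the pass is cheap. Whether the zeros turn into real memory or latency savings depends on sparse storage and kernel support in the serving stack.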

You can adapt LLaMA to other languages cheaply: vocab changes often unnecessary

0.60

0.50

0.80

6

You can cheaply adapt an English-trained LLM to other languages: keep the original tokenizer, do modest further pretraining, and invest in instruction tuning to get usable responses without massive compute.

Key finding

Extending the tokenizer vocabulary can hurt transfer at small-to-moderate retraining scales.

Numbers: 0.5B vs 30B tokens; LLM-Eval AVG 1.562 (LLaMA 0.5B pretrain) vs 1.244 (Chinese LLaMA) (Table 1)

A 2B Chinese‑centric LLM trained from scratch on 800B Chinese tokens, plus an open Chinese corpus and a hard-case Chinese benchmark

0.50

0.60

5

If your product targets Chinese users, pretraining with a large Chinese-majority corpus plus Chinese-heavy SFT yields better cultural knowledge and instruction following than adapting an English-first model.

Key finding

The model is pretrained on a 1.2547-trillion-token corpus in which Chinese is the majority language.

Numbers: 1,254.68B total tokens; 840.48B Chinese, 314.88B English, 99.3B code
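The composition figures can be turned into shares with a few lines of arithmetic. This is a sketch using only the numbers listed above; the variable and function names are hypothetical.

```python
# Token counts in billions, as listed in the card above.
corpus_b = {"chinese": 840.48, "english": 314.88, "code": 99.3}
total_b = 1254.68


def shares(parts: dict, total: float) -> dict:
    """Percentage share of each component, rounded to one decimal place."""
    return {name: round(100 * tokens / total, 1) for name, tokens in parts.items()}


mix = shares(corpus_b, total_b)  # chinese 67.0, english 25.1, code 7.9
```

Chinese therefore makes up about two thirds of the mix. The three listed components sum to 1,254.66B, marginally below the stated total, which is consistent with rounding in the reported figures.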

A linear-attention LLM that matches or beats Transformers while running faster and using less memory

0.60

0.70

0.80

5

TransNormerLLM can lower compute and memory needs for long-context LLM training and serving while keeping or improving accuracy, letting teams run larger contexts or reduce hardware costs without sacrificing model quality.

Key finding

TransNormerLLM yields lower perplexity than Transformer baselines at small and medium scales.

Numbers: 385M model: PPL 4.77 vs Transformer 5.16; 1B model: PPL 3.729 vs Transformer 4.765
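The speed and memory argument for linear attention comes from replacing the growing attention matrix with a fixed-size running state. The sketch below shows generic, unnormalized causal linear attention and checks it against an explicit quadratic computation. It is not TransNormerLLM's actual layer, which includes further components not reproduced here; both function names are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def causal_linear_attention(q: Matrix, k: Matrix, v: Matrix) -> Matrix:
    """out_t = q_t @ (sum over s <= t of k_s^T v_s), using a d_k x d_v running state."""
    d_k, d_v = len(k[0]), len(v[0])
    state = [[0.0] * d_v for _ in range(d_k)]
    out = []
    for q_t, k_t, v_t in zip(q, k, v):
        # Fold the current key/value outer product into the state.
        for i in range(d_k):
            for j in range(d_v):
                state[i][j] += k_t[i] * v_t[j]
        out.append([sum(q_t[i] * state[i][j] for i in range(d_k)) for j in range(d_v)])
    return out


def causal_quadratic_reference(q: Matrix, k: Matrix, v: Matrix) -> Matrix:
    """Same result via explicit (q_t . k_s) scores over all s <= t, with no softmax."""
    out = []
    for t, q_t in enumerate(q):
        row = [0.0] * len(v[0])
        for s in range(t + 1):
            score = sum(a * b for a, b in zip(q_t, k[s]))
            for j, v_sj in enumerate(v[s]):
                row[j] += score * v_sj
        out.append(row)
    return out
```

The recurrent form does constant work per token and stores only the state, so cost grows linearly with sequence length, whereas the reference form revisits every earlier position at each step.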

Open-source multimodal financial LLMs trained on 52B tokens with instruction and chart/table tuning

0.70

0.60

4

Models trained on large finance-specific corpora plus multimodal tuning make practical tasks—report parsing, numeric QA, and chart/table extraction—work better out of the box for analysts and automation.

Key finding

Large finance-focused continual pretraining improves zero/few-shot task performance.

Numbers: FinLLaMA zero-shot TSA sentiment 81 vs LLaMA3-8B 75 (Table 5)

PharmaGPT: 13B–70B domain LLMs that outperform general models on pharmacy and chemistry tests

0.60

4

Focused domain models give near-GPT-4 quality on bio-pharma tasks with fewer resources, enabling faster, cheaper deployment for search, translation, tutoring, and R&D assistants; validate before clinical use.

Key finding

PharmaGPT 0.7 scores 66–76 on NAPLEX sections, outperforming earlier PharmaGPT versions and GPT-3.5-turbo.

Numbers: NAPLEX I/II/III = 66 / 68 / 76 (PharmaGPT 0.7) [Table 4]

XGen-7B: an open 7B LLM trained up to 8K context (1.5T tokens) with instruction-tuned releases

0.70

0.50

0.70

4

XGen-7B gives teams a practical, open 7B model that handles long documents (up to 8K tokens) and competitive instruction-following, lowering cost versus much larger closed models while keeping good accuracy.

Key finding

Stage-wise training yields an 8K-capable model that uses long context.

Numbers: 800B@2K + 400B@4K + 300B@8K = 1.5T tokens
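The stage-wise schedule can be written down as a small config. This sketch assumes "2K", "4K" and "8K" mean 2,048, 4,096 and 8,192 tokens per sequence; the structure and names are hypothetical and not taken from the XGen training code.

```python
# Token budgets in billions per stage, from the card above.
stages = [
    {"seq_len": 2048, "tokens_b": 800},
    {"seq_len": 4096, "tokens_b": 400},
    {"seq_len": 8192, "tokens_b": 300},
]


def total_tokens_b(schedule: list) -> int:
    """Total token budget across all stages, in billions."""
    return sum(stage["tokens_b"] for stage in schedule)


def full_sequences(stage: dict) -> int:
    """Number of full-length sequences a stage's budget covers."""
    return stage["tokens_b"] * 10**9 // stage["seq_len"]
```

Most of the budget is spent at the shortest length, where each step is cheapest, and only the final 300B tokens are seen at the full 8K context.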

Small amounts of code in pre-training measurably boost general LLM abilities across many tasks

0.60

0.50

0.70

3

Adding a modest fraction of code to pretraining reliably boosts reasoning and generation, while small high-quality code sets provide big returns, so invest in curated code sources and keep code in the mix when up-weighting data in the final training stage.

Key finding

Adding code to pretraining improves non-code tasks versus text-only.

Numbers: Balanced→text: +8.2% NL reasoning; +4.2% world knowledge; +6.6% win-rate; ~12× code performance

MindLLM: 1.3B and 3B bilingual LLMs trained from scratch that match larger open models on several benchmarks

0.50

0.40

0.70

3

You can get near-state-of-the-art performance on some English, Chinese and domain tasks with 1–3B models, cutting training and deployment cost while keeping the ability to adapt to law or finance via targeted fine‑tuning.

Key finding

MindLLM-1.3B outperforms GPT-Neo-1.3B on English MMLU in few-shot evaluation.

Numbers: MMLU 26.6 vs 24.1 (Table 5 / Table 7)

Two open 1.8B LLMs (base + chat) trained with FP8 and staged data; Danube2 tops open leaderboard under 2B

0.80

0.40

0.80

3

You get permissively licensed, high-performing small LLMs (1.8B) ready for commercial use; smaller models cut inference cost and enable community fine-tuning under Apache 2.0.

Key finding

Danube2 is top-ranked among open models below 2B on the Hugging Face Open LLM Leaderboard.

Numbers: Average score 48.72 (Table 6)

Large-scale benchmark: continual pretraining helps GPT models but can harm Llama2‑7B

0.40

0.60

3

Continual pretraining can produce better domain experts and reduce repeated retraining costs for smaller models, but it carries a heavy compute cost and can harm larger models such as Llama2‑7B unless domain corpora are large and relevant.

Key finding

Continual pretraining reliably improves GPT-2 family perplexity and outperforms standalone domain-adaptive pretraining.

Numbers: Measured over 159 domains; CPT median better than DAPT across GPT2 sizes

Match expensive re-training by re-warming/decaying the LR plus replay to update LLMs efficiently

0.70

0.45

0.80

3

You can update large LLMs on fresh data at far lower compute cost than full re-training while keeping model quality similar, cutting operational cost and turnaround time for model updates.

Key finding

Re-warming then re-decaying the learning rate is required to adapt well to new pre-training data.
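A minimal sketch of the two ingredients named in the title, re-warming and re-decaying the learning rate plus replay of old data. It assumes linear warmup followed by cosine decay and a fixed replay fraction per batch; the function names are hypothetical, and any hyperparameter values you pass in are for you to choose rather than settings taken from the paper.

```python
import math
from typing import Tuple


def rewarm_redecay_lr(step: int, total_steps: int, warmup_steps: int,
                      max_lr: float, min_lr: float) -> float:
    """Learning rate for one update phase on new data.

    Starts at min_lr (where the previous run ended), warms up linearly
    to max_lr, then follows a cosine decay back down to min_lr.
    """
    if step < warmup_steps:
        return min_lr + (max_lr - min_lr) * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))


def mix_batch_sizes(batch_size: int, replay_fraction: float) -> Tuple[int, int]:
    """Split a batch into (new-data samples, replayed old-data samples)."""
    if not 0.0 <= replay_fraction <= 1.0:
        raise ValueError("replay_fraction must be in [0, 1]")
    n_replay = round(batch_size * replay_fraction)
    return batch_size - n_replay, n_replay
```

Re-warming lets the model move far enough to absorb the new data, while replay limits forgetting of the old distribution. Both knobs trade adaptation against retention and need tuning for each update.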

Learn new visual classes at inference like ChatGPT — no per-query fine-tuning required

0.60

0.70

3

CAML can learn new visual classes at query time without costly per-query fine-tuning, lowering latency and infrastructure cost for few-shot vision services while keeping strong accuracy on many tasks.

Key finding

CAML matches or exceeds P>M>F (state-of-the-art meta-learner trained on each benchmark) on 8 out of 11 benchmarks.

Numbers: 8/11 benchmarks

Generative AI can synthesize virtual IMU data to augment and pretrain HAR models

0.30

0.60

0.70

3

Synthetic IMU data can cut labeling costs and accelerate development of wearable activity features, but synthetic-to-real gaps require small calibration sets and validation for product safety.

Key finding

A text→motion→IMU pipeline can produce labeled virtual IMU data and boost HAR performance on standard datasets.

Fietje: open, compact Dutch LLM (2.8B) trained on 28B Dutch tokens with full reproducibility

0.60

0.50

0.60

2

Open, compact Dutch LLMs let teams run fast, inexpensive inference and reproduce experiments; modern multilingual small models often beat older larger Dutch models, so try recent small multilingual options before costly full retraining.

Key finding

Fietje was produced by continued pretraining on 28 billion Dutch tokens.

Numbers: 28B Dutch tokens

Juru: a 7B model specialized on 1.9B Brazilian legal tokens that improves legal exam accuracy but harms general knowledge

0.40

0.60

2

You can cheaply improve an LLM for a legal product by continued pretraining on a modest, high-quality legal corpus, but expect trade-offs: general-purpose capabilities can degrade.

Key finding

Specialization improves legal-exam accuracy vs base model.

Numbers: Mean accuracy +4.7 percentage points (44.5% → 49.2%) on 8 legal exams

Use small synthetic QA datasets and a PPL curriculum to boost Chinese and scientific reasoning in Llama‑3 with ~100B CPT tokens

0.60

2

You can adapt a large English-centric LLM to Chinese and science tasks with modest additional pretraining (~100B tokens) and targeted synthetic QA, improving domain value without full retrain.

Key finding

C-Eval (Chinese) improved by 8.81 points after CPT.

Numbers: C‑Eval: 49.43 → 58.24 (+8.81)

MorphPiece: a morpheme-aware tokenizer that improves LM and embedding quality

0.50

0.60

0.40

2

MorphPiece yields better language modeling and embedding quality without changing model architecture, which can improve search, classification, and prediction pipelines but increases token counts and compute.

Key finding

MorphGPT lowers token-level perplexity vs GPT-2 on standard LM benchmarks.

Numbers: PennTreeBank ppl 61.86 -> 38.25 (Morph200)
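Since the summary notes that MorphPiece increases token counts, it helps to see how the normalizing unit affects perplexity. The sketch below computes perplexity from per-token log-probabilities, with an optional normalizer so that models with different tokenizers can be compared per word. This is a generic definition with a hypothetical helper name; how the paper normalizes its reported numbers is not stated here.

```python
import math
from typing import List, Optional


def perplexity(token_logprobs: List[float], normalizer: Optional[int] = None) -> float:
    """exp of the negative mean log-probability.

    By default the mean is taken over tokens. Pass the number of words
    (or another shared unit) as `normalizer` to compare tokenizers fairly.
    """
    n = normalizer if normalizer is not None else len(token_logprobs)
    if n <= 0:
        raise ValueError("need at least one unit to normalize by")
    return math.exp(-sum(token_logprobs) / n)
```

A tokenizer that splits the same text into more tokens can show lower per-token perplexity without modeling the text any better, so per-token figures from different tokenizers are not directly comparable.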

Predict multiple future words and train on word-difference targets to reduce local overfitting in causal language modeling

0.60

0.50

0.40

1

Small, easy-to-add heads and a WDR target can lower perplexity and raise BLEU with little parameter cost; this improves model quality fast without reworking vocabulary or core architecture.

Key finding

N-gram methods reduce perplexity on standard CLM benchmarks.

Numbers: TT baseline PTB PPL 55.0 → TT+WDR ensemble 44.4 (−10.6)

Use server-side multimodal LLMs to bootstrap federated learning on heterogeneous, long-tailed image data

0.60

0.65

1

You can improve federated accuracy on skewed client data without increasing client compute or sending gradients, lowering device cost and privacy exposure while using server compute and public web data.

Key finding

MLLM-LLaVA-FL beats CLIP2FL on CIFAR-LT benchmarks.

Numbers: CIFAR-10-LT IF=100: 75.49% vs 73.37% (+2.12 points); CIFAR-100-LT IF=100: 39.50% vs 37.56% (+1.94 points)
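For readers unfamiliar with the "IF=100" setting, the sketch below builds long-tailed class sizes under a commonly used convention: sizes decay exponentially from the largest class to the smallest, and the imbalance factor is the ratio between the two. The exact construction used in the paper is an assumption here, and `long_tail_counts` is a hypothetical helper.

```python
from typing import List


def long_tail_counts(n_max: int, num_classes: int, imbalance_factor: float) -> List[int]:
    """Per-class sample counts decaying exponentially from n_max to n_max / IF."""
    if num_classes < 2:
        raise ValueError("need at least two classes")
    return [
        int(n_max * (1.0 / imbalance_factor) ** (i / (num_classes - 1)))
        for i in range(num_classes)
    ]
```

With 5,000 images in the largest CIFAR-10 class and IF=100, the rarest class keeps only about 50 images, which is why accuracy on these benchmarks is far below the balanced setting.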