Training Data and Methods Papers — Parsed & Scored for Practitioners

Llama 2: open release of 7B–70B pretrained models and RLHF‑tuned chat models that are competitive in human evaluations

0.70

0.30

0.60

2,595

Llama 2 provides openly available pretrained and RLHF‑tuned chat models that are competitive with closed models on many human-evaluated tasks, lowering the entry cost for companies that need high-quality chat AI while allowing customization and internal safety tuning.

Key finding

Llama 2 pretrained on ~2 trillion tokens; models range 7B–70B parameters.

Numbers: 2.0T tokens; sizes 7B, 13B, 34B, 70B

A 50B-parameter LLM built on a ~700B-token corpus, specialized for financial NLP

0.60

0.45

0.80

299

A mid-size LLM trained with a large curated finance corpus yields big real-world gains on finance tasks while staying useful on general tasks, so firms can get domain accuracy without running huge models.

Key finding

Mixed training (curated finance + public data) yields strong finance performance without losing general abilities

Numbers: Training corpus: 363B financial + 345B public ≈ 709B tokens; trained on 569B tokens

MEDITRON: open-source 7B and 70B medical LLMs trained on a 48B-token curated medical corpus

0.30

0.60

0.50

117

MEDITRON offers a strong, open-source medical LLM that rivals much larger closed models on standard benchmarks, enabling in-house finetuning, auditing, and deployment experiments while avoiding vendor lock-in—though it is not yet production-ready for clinical use.

Key finding

MEDITRON obtains consistent accuracy gains on medical benchmarks over open baselines.

Numbers: Avg accuracy +6% vs best public baseline in class; +3% vs finetuned Llama-2 (reported)

A practical, up-to-date survey of LLMs focused on generating code from natural language

0.70

0.60

0.80

54

Code LLMs can speed development, automate routine coding, and augment junior engineers; open-source instruct-tuned models now match many closed APIs on standard tasks, making in-house deployments feasible while highlighting the need to evaluate on real repo-scale work and safety constraints.

Key finding

Models improved dramatically on small-function benchmarks over recent years.

Numbers: HumanEval pass@1 rose from 3.6% (PaLM 8B) to 95.1% (LDB) as reported in the survey
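For readers unfamiliar with the metric, pass@k is commonly computed with the unbiased estimator 1 - C(n-c, k) / C(n, k), where n samples are drawn per problem and c of them pass the unit tests. The sketch below assumes that estimator; the survey entry above does not say how each reported score was computed, and the function name `pass_at_k` is chosen here for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: total samples generated for the problem
    c: number of those samples that pass the tests
    k: sample budget being evaluated (k <= n)
    """
    if n - c < k:
        # Fewer than k failing samples exist, so any k-subset contains a pass.
        return 1.0
    # Probability that a random k-subset of the n samples contains no pass,
    # subtracted from 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


# With k = 1 the estimator reduces to the plain pass rate c / n.
example_score = pass_at_k(n=10, c=3, k=1)
```

A benchmark-level score is then the mean of this value over all problems.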

Imitating ChatGPT copies style, not capabilities

0.40

0.50

0.60

50

Imitation can cheaply copy a proprietary model's tone and safety but does not replicate its core reasoning or factual knowledge, so relying on imitation to match competitors is risky.

Key finding

Human raters often rate imitation outputs as equal to or better than ChatGPT's.

Numbers: ≈70% of imitation outputs rated equal to or better than ChatGPT's

ChemLLM: a 7B chemistry-tuned LLM with ChemData (7M Q&A) and ChemBench (4.1k MCQs), matching GPT-4 on core chemical tasks

0.60

0.70

40

A domain-tuned 7B model can match or beat much larger closed models on key chemistry tasks, enabling lower-cost deployment of chemistry assistants and search tools for R&D teams.

Key finding

ChemData provides a large chemistry instruction-tuning dataset.

Numbers: 7M instruction Q&A (authors' dataset summary)

Practical, end-to-end guide to fine-tuning LLMs: pipelines, PEFT, RAG, alignment and deployment

0.70

0.35

0.70

39

Fine-tuning and RAG let you customise LLM behavior and accuracy while controlling cost; PEFT and quantisation let you ship tailored models without enterprise-scale GPU fleets.

Key finding

QLoRA compresses model parameters and enables 4-bit fine-tuning while retaining near-16-bit performance.

Numbers: Reduces to ~5.2 bits/parameter (from 96 bits); ~18x memory reduction
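The arithmetic behind the reported reduction can be checked directly. The sketch below takes the two bits-per-parameter figures from the line above and applies them to a 7B-parameter model; the model size and the function name `finetune_memory_gb` are assumptions for illustration, not figures from the guide.

```python
def finetune_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Memory in GB (10^9 bytes) for n_params at a given bits-per-parameter cost."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 7e9  # hypothetical model size, chosen only for illustration

full_gb = finetune_memory_gb(N_PARAMS, 96)    # reported full fine-tuning cost
qlora_gb = finetune_memory_gb(N_PARAMS, 5.2)  # reported QLoRA cost
ratio = full_gb / qlora_gb                    # independent of N_PARAMS

print(f"full: {full_gb:.1f} GB, QLoRA: {qlora_gb:.2f} GB, ratio: {ratio:.1f}x")
```

The ratio is simply 96 / 5.2, which is about 18.5 and matches the "~18x" figure regardless of model size.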

Clinical Camel: an open medical LLM fine-tuned with dialogue synthesis and single‑GPU QLoRA

0.20

0.60

0.40

35

An open, high-performing medical LLM reduces vendor lock-in, enables internal validation, and can be reproduced with modest compute, letting institutions experiment safely before any clinical adoption.

Key finding

Clinical Camel-70B beats GPT-3.5 on several medical QA benchmarks in five-shot tests.

Numbers: USMLE 64.3% vs GPT-3.5 58.5%; PubMedQA 77.9% vs 60.2%

Fine-tune quantized LLMs by updating only quantization scales to save memory and keep fast inference.

0.75

0.50

0.80

28

PEQA lets teams fine-tune and serve much larger LLMs on the same hardware by keeping models in low-bit form and only shipping small task-specific scale vectors, cutting memory and inference cost while preserving most performance.

Key finding

PEQA reduces deployed model size for LLaMA-65B from ~130.6GB to ~33.5GB at 4-bit.

Numbers: LoRA model size 130.57GB vs PEQA 33.45GB (Table 4).
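The idea of tuning only the quantization scales can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not the PEQA authors' implementation: it quantizes one weight row to 4-bit integers, freezes the integers and the zero point, and runs gradient descent on the single scale to fit one target output. All function names and the example data are hypothetical.

```python
def quantize_row(weights, bits=4):
    """Uniform asymmetric quantization of one weight row to `bits`-bit integers."""
    lo, hi = min(weights), max(weights)
    levels = 2 ** bits - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    zero = round(-lo / scale)
    q = [min(levels, max(0, round(w / scale) + zero)) for w in weights]
    return q, scale, zero


def forward(q, scale, zero, x):
    """Output of one linear unit using dequantized weights scale * (q - zero)."""
    return scale * sum((qi - zero) * xi for qi, xi in zip(q, x))


def tune_scale(q, scale, zero, x, target, lr=0.01, steps=50):
    """Gradient descent on the scale only; q and zero are never modified."""
    base = sum((qi - zero) * xi for qi, xi in zip(q, x))  # d(output)/d(scale)
    for _ in range(steps):
        err = scale * base - target
        scale -= lr * 2 * err * base  # gradient of the squared error w.r.t. scale
    return scale


# Toy data (illustrative only).
weights = [0.5, -0.25, 0.1, 0.8]
x = [1.0, 0.5, -1.0, 0.25]
target = 0.6

q, scale, zero = quantize_row(weights)
new_scale = tune_scale(q, scale, zero, x, target)
err_before = abs(forward(q, scale, zero, x) - target)
err_after = abs(forward(q, new_scale, zero, x) - target)
```

Because the integer codes stay fixed, only the small scale values differ between tasks, which is why the deployed model can remain in low-bit form.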

One-stage domain adaptation: turn varied medical corpora into instruction–response pairs and train in a single pass

0.60

0.50

24

One-stage adaptation simplifies pipelines and reduces costly two-stage tuning while delivering strong domain performance—so teams can build competitive medical models faster with less stage-specific hyperparameter work.

Key finding

One-stage training outperforms conventional two-stage adaptation across medical datasets

Numbers: 5.3%–23% relative gains on six datasets (one-stage vs two-stage)

A domain-tuned LLaMA-65B (InvestLM) for finance that boosts financial NLP and matches many commercial LLMs in expert judgment.

0.60

0.40

0.60

24

A small, high-quality instruction set can turn an open foundation model into a capable finance assistant, offering a lower-cost, open alternative to closed commercial finance LLMs while enabling on-premise control and inspection.

Key finding

Instruction-tuning LLaMA-65B with ~1,300 curated finance instructions improves most finance tasks.

Numbers: 8 of 9 tasks: InvestLM > LLaMA-65B (Table 3); FinSent 0.71→0.79

Fine-tune a Chinese 13B LLM with legal syllogism data plus retrieval to build a practical legal assistant and benchmark

0.50

24

Fine-tuning a mid-size Chinese LLM with focused legal instruction data and a small retrieval KB yields measurable gains in legal QA and advice; this reduces manual review and makes legal tools more practical.

Key finding

The authors built a large, law-specific SFT dataset for training.

Numbers: DISC-Law-SFT total size 403K samples

Practical survey: how to keep LLMs up-to-date via continual pretraining, instruction tuning, and alignment

0.60

0.40

0.70

23

Continual learning lets LLMs stay current with facts, tools and user values without full retraining, saving time and money while reducing model downtime.

Key finding

Continual learning for LLMs is multi-stage: continual pretraining, instruction tuning, and alignment.

KnowEdit benchmark and EasyEdit toolkit: a unified study and comparison of methods to change facts inside LLMs

0.50

0.70

0.60

20

Knowledge editing can cheaply update specific facts or behaviors in an LLM without full retraining, saving compute and time; but edits can fail to generalize and may break unrelated behavior, so careful validation is required.

Key finding

Several editing methods can reach near-perfect edit success on fact-insertion and fact-modification datasets.

Numbers: WikiData recent edit success: AdaLoRA=100, FT-M=100 (Table 4)

Instruction tuning unlocks Mixture-of-Experts: similar or better accuracy at ~1/3 the compute

0.70

0.50

0.80

20

Combining instruction tuning with MoE cuts runtime compute and cost: MoE models can match or beat dense baselines while using far fewer per-token FLOPs, reducing inference cost without sacrificing accuracy on many English tasks.

Key finding

Instruction tuning increases MoE gains vs dense models.

Numbers: 7.1% absolute gain on MMLU-Direct (avg) for FLAN‑MOE over dense at similar FLOPs

LawBench: a 20-task Chinese legal benchmark measuring memorization, understanding, and application by 51 LLMs

0.30

0.35

0.40

19

LawBench shows that even top LLMs are unreliable for legal judgments; businesses should treat model outputs as draft assistance, not legal advice, and validate with experts.

Key finding

GPT-4 is the best model on LawBench but far from perfect

Numbers: GPT-4 average zero-shot 52.35 (Table 26)

Mix ChatGPT-distilled text with real doctor dialogs, then use RL from AI feedback to make an open-source Chinese medical chatbot that acts (

0.40

0.60

18

HuatuoGPT offers an open-source Chinese medical assistant that is more interactive and clinically oriented than prior open models; this lowers integration cost for localized medical chat services but still needs clinical oversight before deployment.

Key finding

HuatuoGPT wins most manual single-turn comparisons vs. other open-source Chinese medical models.

Numbers: HuatuoGPT manual win rate vs DoctorGLM 98% (single-turn)

A simple, compute-efficient loop that generates model outputs, filters them by a learned reward, and fine-tunes the model offline to align L

0.60

0.70

18

ReST boosts model alignment with human preferences using offline data reuse, cutting the compute cost of repeated online RL sampling while improving translation quality and human-rated outputs.

Key finding

Each additional Improve step raises the model's average reward on validation.

Numbers: Figure 3: steady increases across IWSLT, WMT, Web Domain (rewards normalized 0–100)
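The generate, filter, fine-tune loop described in this entry can be sketched as follows. This is a minimal outline under stated assumptions: `generate`, `reward`, and `fine_tune` are hypothetical callables supplied by the caller, the threshold schedule is illustrative, and the "Grow" label for the sampling step is this sketch's naming rather than something stated in the entry.

```python
def rest_loop(model, prompts, generate, reward, fine_tune,
              grow_steps=2, thresholds=(0.0, 0.5)):
    """Offline alignment loop: sample once, then filter and fine-tune repeatedly."""
    for _ in range(grow_steps):
        # Grow: sample outputs from the current model and score them once,
        # so the same dataset can be reused offline.
        scored = [(p, y, reward(p, y))
                  for p in prompts
                  for y in generate(model, p)]
        for tau in thresholds:
            # Improve: keep samples whose reward clears the threshold,
            # then fine-tune on the kept set.
            kept = [(p, y) for p, y, r in scored if r >= tau]
            if kept:
                model = fine_tune(model, kept)
    return model


# Toy stand-ins so the loop runs end to end: the "model" is a single number,
# its samples are the number minus one, itself, and plus one, and the reward
# is the sampled value.
def toy_generate(model, prompt):
    return [model - 1.0, model, model + 1.0]


def toy_reward(prompt, output):
    return output


def toy_fine_tune(model, kept):
    return sum(y for _, y in kept) / len(kept)


final_model = rest_loop(0.0, ["a"], toy_generate, toy_reward, toy_fine_tune)
```

In the toy run each Improve step moves the model toward higher-reward samples, mirroring the steady reward increases the entry reports.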

AlpaCare: fine-tuning LLaMA with a 52k machine-generated medical instruction dataset to improve medical and general instruction following

0.60

15

A small, diverse machine-generated medical instruction dataset can improve both medical answer quality and general instruction-following, offering a cost-effective way to build better clinical assistants while keeping development and data costs lower than large human-annotation efforts.

Key finding

AlpaCare gives large absolute gains on free-form medical instruction evaluation compared to prior baselines.

Numbers: up to 38.1% absolute gain (paper claim)

Survey of 126 multimodal LLMs: architectures, training recipes, benchmarks, and next steps

0.70

0.45

0.65

15

You can add vision, audio, or other modalities to existing LLMs cheaply by training small projectors or PEFT adapters, unlocking richer user interactions without retraining huge models.

Key finding

Most MM-LLMs add small adapters while keeping the core LLM frozen.

Numbers: Trainable params typically ≈2% (projectors only); PEFT can be <0.1%

Teach an LLM to read graph structure with two-stage instruction tuning and a tiny alignment projector

0.50

0.60

15

GraphGPT enables LLMs to use graph structure with low-cost tuning, improving cross-dataset predictions and saving compute by using compact graph tokens instead of long text prompts.

Key finding

GraphGPT improves zero-shot transfer accuracy compared to base LLMs and GNNs on evaluated benchmarks.

Numbers: Arxiv-PubMed zero-shot: GraphGPT-7B-v1.5-std Acc=0.7011 vs vicuna-7B-v1.5 Acc=0.6351 (Δ=+0.066)

Survey of financial LLMs: techniques, benchmarks, and practical gaps

0.50

0.40

0.60

14

FinLLMs help automate common finance language tasks but are uneven: use task-finetuned PLMs for classification/NER to cut cost; reserve large LLMs for complex QA or exploratory uses with human checks.

Key finding

For sentiment analysis, mixed-domain PLMs achieved top scores, while instruction-finetuned LLMs matched but cost more.

Numbers: FLANG-ELECTRA F1=92%; FinMA-30B/GPT-4 F1≈87% (5-shot)

A practical review of where LLM bias comes from, how to test it, and common fixes

0.50

0.30

0.60

13

Biased LLM outputs can cause legal risk, reputational harm, and unfair customer outcomes; fixing bias early reduces downstream remediation cost and regulatory exposure.

Key finding

Toxicity can emerge quickly from benign prompts in generative LLMs.

Numbers: toxicity score exceeds 0.5 within fewer than 100 generations

A fast finetuning recipe that makes a large LLM 'forget' Harry Potter while keeping general skills

0.40

0.70

12

You can remove copyrighted or sensitive text from a large LLM with a short, targeted finetune instead of full retraining, cutting compute from hundreds of thousands of GPU-hours to minutes–hours for the targeted edit.

Key finding

The method dramatically reduces model 'familiarity' with Harry Potter as measured by completion-based tests.

Numbers: Familiarity (completion): 0.29 → 0.007 after ~120 finetuning steps