Instruction Tuning Papers — Parsed & Scored for Practitioners

Llama 2: open release of 7B–70B pretrained models and RLHF‑tuned chat models competitive in human evaluations

0.70

0.30

0.60

2,595

Llama 2 provides openly available pretrained and RLHF‑tuned chat models that are competitive with closed models on many human-evaluated tasks, lowering the entry cost for companies that need high-quality chat AI while allowing customization and internal safety tuning.

Key finding

Llama 2 pretrained on ~2 trillion tokens; models range 7B–70B parameters.

Numbers: 2.0T tokens; sizes 7B, 13B, 34B, 70B (the 34B model was trained but not publicly released)

A practical, up-to-date survey of LLMs focused on generating code from natural language

0.70

0.60

0.80

54

Code LLMs can speed development, automate routine coding, and augment junior engineers; open-source instruct-tuned models now match many closed APIs on standard tasks, making in-house deployments feasible while highlighting the need to evaluate on real repo-scale work and safety constraints.

Key finding

Models improved dramatically on small-function benchmarks over recent years.

Numbers: HumanEval pass@1 rose from 3.6% (PaLM 8B) to 95.1% (LDB) as reported in the survey
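For reference, pass@k on benchmarks such as HumanEval is commonly computed with the unbiased estimator introduced alongside HumanEval; whether every result cited by the survey used exactly this estimator is an assumption. A minimal Python sketch, with hypothetical function names:

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that pass the tests, k: budget.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: any k-subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results, k: int) -> float:
    """Mean pass@k over problems; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Illustrative input only: 10 samples, 3 of which pass.
print(pass_at_k(10, 3, 1))
```

With k = 1 the estimator reduces to the fraction of passing samples, which is why pass@1 is often read as plain accuracy.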

Imitating ChatGPT copies style, not capabilities

0.40

0.50

0.60

50

Imitation can cheaply copy a proprietary model's tone and safety but does not replicate its core reasoning or factual knowledge, so relying on imitation to match competitors is risky.

Key finding

Human raters often rate imitation outputs as equal to or better than ChatGPT's.

Numbers: ≈70% of imitation outputs rated equal/better vs ChatGPT

One-stage domain adaptation: turn varied medical corpora into instruction–response pairs and train in a single pass

0.60

0.50

24

One-stage adaptation simplifies pipelines and avoids costly two-stage tuning while delivering strong domain performance, so teams can build competitive medical models faster with less stage-specific hyperparameter work.

Key finding

One-stage training outperforms conventional two-stage adaptation across medical datasets.

Numbers: 5.3%–23% relative gains on six datasets (one-stage vs two-stage)

A domain-tuned LLaMA-65B (InvestLM) for finance that boosts financial NLP and matches many commercial LLMs in expert judgment

0.60

0.40

0.60

24

A small, high-quality instruction set can turn an open foundation model into a capable finance assistant, offering a lower-cost, open alternative to closed commercial finance LLMs while enabling on-premise control and inspection.

Key finding

Instruction-tuning LLaMA-65B with ~1,300 curated finance instructions improves most finance tasks.

Numbers: 8 of 9 tasks: InvestLM > LLaMA-65B (Table 3); FinSent 0.71→0.79
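Entries in this list report some gains as absolute differences and others as relative improvements, and the two are easy to confuse. A small Python sketch of the arithmetic, applied to the FinSent figures above (0.71 to 0.79); the helper names are hypothetical:

```python
def absolute_gain(baseline: float, new: float) -> float:
    """Difference in score, in the metric's own units (points)."""
    return new - baseline


def relative_gain(baseline: float, new: float) -> float:
    """Improvement as a fraction of the baseline score."""
    if baseline == 0:
        raise ValueError("relative gain is undefined for a zero baseline")
    return (new - baseline) / baseline


base, tuned = 0.71, 0.79  # FinSent: LLaMA-65B vs InvestLM, as listed above
print(f"absolute: {absolute_gain(base, tuned):+.2f}")
print(f"relative: {relative_gain(base, tuned):+.1%}")
```

The same 0.08-point change is an improvement of roughly 11% relative to the baseline, so it matters which convention a paper uses.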

Practical survey: how to keep LLMs up-to-date via continual pretraining, instruction tuning, and alignment

0.60

0.40

0.70

23

Continual learning lets LLMs stay current with facts, tools and user values without full retraining, saving time and money while reducing model downtime.

Key finding

Continual learning for LLMs is multi-stage: continual pretraining, instruction tuning, and alignment.

Instruction tuning unlocks Mixture-of-Experts: similar or better accuracy at ~1/3 the compute

0.70

0.50

0.80

20

Combine instruction tuning with MoE to cut inference compute and cost: MoE models can match or beat dense baselines while using far fewer per-token FLOPs, without sacrificing accuracy on many English tasks.

Key finding

Instruction tuning increases MoE gains vs dense models.

Numbers: 7.1% absolute gain on MMLU-Direct (avg) for FLAN‑MOE over dense at similar FLOPs
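The per-token savings of MoE models come from sparse routing: each token is processed by only a few experts rather than by all of them. A minimal Python sketch of top-k routing, assuming softmax router scores renormalized over the selected experts (a common choice, not necessarily the exact FLAN-MoE configuration); `top_k_route` and the example logits are hypothetical:

```python
from math import exp


def top_k_route(logits, k: int):
    """Return {expert_index: weight} for the k highest-scoring experts.

    Weights are softmax probabilities renormalized to sum to 1 over
    the selected experts (an assumption of this sketch).
    """
    if not 1 <= k <= len(logits):
        raise ValueError("k must be between 1 and the number of experts")
    m = max(logits)  # subtract the max for numerical stability
    exps = [exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return {i: probs[i] / norm for i in chosen}


# Four experts, two active per token: only half the expert FLOPs are spent.
route = top_k_route([0.0, 2.0, 1.0, -1.0], k=2)
print(route)
```

Only the experts returned by the router run for that token, so expert compute scales with k rather than with the total number of experts.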

Mix ChatGPT-distilled text with real doctor dialogs, then use RL from AI feedback to make an open-source Chinese medical chatbot that acts (

0.40

0.60

18

HuatuoGPT offers an open-source Chinese medical assistant that is more interactive and clinically oriented than prior open models; this lowers integration cost for localized medical chat services but still needs clinical oversight before deployment.

Key finding

HuatuoGPT wins most manual single-turn comparisons vs. other open-source Chinese medical models.

Numbers: HuatuoGPT manual win rate vs DoctorGLM 98% (single-turn)

AlpaCare: fine-tuning LLaMA with a 52k machine-generated medical instruction dataset to improve medical and general instruction following

0.60

15

A small, diverse machine-generated medical instruction dataset can improve both medical answer quality and general instruction-following, offering a cost-effective way to build better clinical assistants while keeping development and data costs lower than large human-annotation efforts.

Key finding

AlpaCare gives large absolute gains on free-form medical instruction evaluation compared to prior baselines.

Numbers: up to 38.1% absolute gain (paper claim)

Survey of 126 multimodal LLMs: architectures, training recipes, benchmarks, and next steps

0.70

0.45

0.65

15

You can add vision, audio, or other modalities to existing LLMs cheaply by training small projectors or PEFT adapters, unlocking richer user interactions without retraining huge models.

Key finding

Most MM-LLMs add small adapters while keeping the core LLM frozen.

Numbers: Trainable params typically ≈2% (projectors only); PEFT can be <0.1%
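To see how a PEFT method can end up below 0.1% trainable parameters, consider LoRA, which adds two low-rank matrices per adapted weight. The Python sketch below uses an illustrative configuration that is an assumption and not taken from the survey: a 7B-parameter model with 32 layers, hidden size 4096, and rank-8 adapters on two projection matrices per layer.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added by one LoRA adapter (A: rank x d_in, B: d_out x rank)."""
    return rank * (d_in + d_out)


def trainable_fraction(trainable: int, total: int) -> float:
    """Share of the model's parameters that receive gradient updates."""
    return trainable / total


# Hypothetical configuration, for illustration only.
hidden = 4096
layers = 32
adapted_per_layer = 2          # e.g. two attention projections per layer
rank = 8
total_params = 7_000_000_000

lora_total = layers * adapted_per_layer * lora_param_count(hidden, hidden, rank)
share = trainable_fraction(lora_total, total_params)
print(f"{lora_total:,} trainable params, {100 * share:.3f}% of the model")
```

Under these assumptions the adapters total about 4.2M parameters, roughly 0.06% of the model, which is consistent with the "<0.1%" figure quoted above.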

Teach an LLM to read graph structure with two-stage instruction tuning and a tiny alignment projector

0.50

0.60

15

GraphGPT enables LLMs to use graph structure with low-cost tuning, improving cross-dataset predictions and saving compute by using compact graph tokens instead of long text prompts.

Key finding

GraphGPT improves zero-shot transfer accuracy compared to base LLMs and GNNs on evaluated benchmarks.

Numbers: Arxiv-PubMed zero-shot: GraphGPT-7B-v1.5-std Acc=0.7011 vs vicuna-7B-v1.5 Acc=0.6351 (Δ=+0.066)

Survey of financial LLMs: techniques, benchmarks, and practical gaps

0.50

0.40

0.60

14

FinLLMs help automate common finance language tasks but are uneven: use task-finetuned PLMs for classification/NER to cut cost; reserve large LLMs for complex QA or exploratory uses with human checks.

Key finding

For sentiment analysis, mixed-domain PLMs achieved top scores, while instruction-finetuned LLMs matched but cost more.

Numbers: FLANG-ELECTRA F1=92%; FinMA-30B/GPT-4 F1≈87% (5-shot)

MolecularGPT: instruction‑tuned LLM that predicts molecular properties with zero‑ and few‑shot prompts

0.50

0.70

0.60

10

MolecularGPT lets teams try new property predictions with as few as two labeled examples instead of costly dataset labeling, speeding early drug/material candidate screening and reducing the need to retrain task‑specific models.

Key finding

MolecularGPT ranks top on average for few‑shot prediction across evaluated datasets.

Numbers: 2‑shot average rank = 1.1; 8‑shot = 2.1 (Tab.1)
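An "average rank" summarizes performance across datasets whose metrics are not directly comparable: models are ranked within each dataset (1 is best) and the ranks are averaged. A minimal Python sketch with made-up scores; the function name and data are hypothetical, and ties are broken by sort order rather than shared:

```python
def average_ranks(scores):
    """scores: {dataset: {model: score}}, higher is better.

    Returns {model: mean rank across datasets}, where rank 1 is best.
    """
    ranks = {}
    for per_model in scores.values():
        ordered = sorted(per_model, key=per_model.get, reverse=True)
        for rank, model in enumerate(ordered, start=1):
            ranks.setdefault(model, []).append(rank)
    return {model: sum(r) / len(r) for model, r in ranks.items()}


# Made-up scores for three models on two datasets.
scores = {
    "dataset_1": {"A": 0.90, "B": 0.80, "C": 0.70},
    "dataset_2": {"A": 0.60, "B": 0.70, "C": 0.50},
}
print(average_ranks(scores))
```

An average rank close to 1 means the model was best or near-best on almost every dataset, regardless of the size of its margin.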

TÜLU 2: a public suite of finetuned LLaMA-2 and Code-LLaMA models, a new instruction-data mix, and large-scale DPO at 70B

0.70

0.40

0.60

8

TÜLU 2 provides high-quality, open instruction-tuned models and data that approach proprietary baselines for many tasks; DPO improves user-facing outputs and is feasible at 70B without private infra, while CODE TÜLU 2 gives a fast route to strong code models.

Key finding

The V2 data mixture improves average downstream performance over the prior V1 mixture.

Numbers: V2 > V1 by ~8% avg (paper intro)

FinGPT: instruction-tuning benchmark that evaluates six open-source LLMs on core financial NLP tasks

0.60

0.40

0.70

8

You can adapt open-source 7B LLMs to finance tasks cheaply and reproducibly; choose models per task (Llama2 for general use, BLOOM for extraction, chat models for zero-shot).

Key finding

Llama2 had the best overall ranking across tasks.

Numbers: Avg ranking = 2.0 across SA, NER, HC, RE (Table 2)

A practical pipeline and datasets to adapt general LLMs into telecom-specialized models and benchmarks

0.50

0.40

0.60

7

Fine-tuning mid-size LLMs on telecom-specific text and tasks gives big practical gains in document understanding, math modeling and code tasks at much lower cost than training from scratch.

Key finding

Domain adaptation via instruction tuning and alignment improved telecom math equation recovery.

Numbers: Llama3-8B-TI-TA MathBERT avg score 49.45 vs GPT-4 49.38; share of cases scoring ≥90%: 9.52% vs GPT-4 3.77%

Use translation + instruction tuning to make English LLMs much better in six non‑English languages

0.60

7

You can upgrade an English LLM to handle multiple non-English languages without huge data or retraining costs by adding parallel translation tasks and translated instructions; this saves time and compute compared to building language-specific models from scratch.

Key finding

Cross-lingual instruction tuning (x-LLaMA) substantially improves non-English QA accuracy over an English-only instruction-tuned model.

Numbers: Average +27.83% answer accuracy across six languages (XQuAD & MLQA)

Train LLMs on private data with federated learning; OpenFedLLM shows FL beats local training and can beat GPT‑4 in finance

0.60

0.55

0.65

6

Companies with private domain data can collaboratively fine-tune LLMs without sharing raw data and get measurable gains over solo training; finance firms, hospitals, and other organizations with sensitive data can obtain domain-leading models this way.

Key finding

Federated learning consistently improves over single-client local fine-tuning across tasks.

Numbers: reported across multiple tables, e.g., Table 4 MT-Avg: FedAvg 3.346 vs Local 2.844 (open-ended)
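FedAvg, the baseline named in the numbers above, aggregates client updates by averaging parameters weighted by each client's number of training samples. The Python sketch below shows only that aggregation step on plain lists of floats; it is an illustration of standard FedAvg, not OpenFedLLM's code, and the function name and values are hypothetical.

```python
def fedavg(client_weights, client_sizes):
    """Sample-size-weighted average of client parameter vectors.

    client_weights: list of equal-length lists of floats, one per client.
    client_sizes:   number of local training samples per client.
    """
    if len(client_weights) != len(client_sizes) or not client_weights:
        raise ValueError("need one sample count per client")
    dim = len(client_weights[0])
    if any(len(w) != dim for w in client_weights):
        raise ValueError("all clients must share the same parameter shape")
    total = sum(client_sizes)
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(dim)
    ]


# Two clients; the second holds three times as much data.
print(fedavg([[1.0, 2.0], [3.0, 4.0]], [1, 3]))
```

Only parameters (or adapter weights) leave each client in this scheme; the raw training data stays local.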

TABLET: a 20-task benchmark testing whether LLMs can learn tabular prediction from natural-language instructions

0.40

0.60

6

Instructions let you get useful tabular predictions with few or no labels, reducing costly data collection in privacy-sensitive domains.

Key finding

Instructions improve zero-shot LLM performance over prompts without instructions.

Numbers: Flan-T5 zero-shot F1 +20% avg; ChatGPT zero-shot F1 +10% avg (vs LIFT)

Three practical tools for making LLMs more factual in finance: a benchmark, an injection framework, and a retrieval QA system

0.60

5

You can improve finance-specific LLM outputs quickly and cheaply by combining retrieval-based context with compact instruction fine-tuning, giving better factual answers and sourceable outputs without full model re-pretraining.

Key finding

GPT-4 leads on IDEA-FinBench across subjects.

Numbers: CFA-L1 accuracy 84.26%; CPA-SA 62.38%

Iteratively generate and verify domain instructions (MatSci-Instruct) to finetune LLaMA into HoneyBee, a materials-science LLM

0.60

0.50

5

You can cheaply create domain-ready LLMs by synthesizing and verifying instruction data, avoiding costly domain pretraining while getting strong task performance.

Key finding

Automatically verified instruction scores correlate well with human experts.

Numbers: Spearman/Pearson correlations 0.6–0.8 vs humans (Fig.4)
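Agreement between automatic verification scores and expert ratings is typically measured as above, with Pearson correlation on the raw scores and Spearman correlation on their ranks. A self-contained Python sketch follows; the function names and sample data are hypothetical and not taken from the paper.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def tied_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(tied_ranks(x), tied_ranks(y))


# Made-up automatic scores vs. expert scores for five instructions.
auto = [3.0, 4.5, 2.0, 5.0, 4.0]
human = [2.5, 4.0, 2.0, 4.5, 4.5]
print(pearson(auto, human), spearman(auto, human))
```

Spearman only cares about ordering, so it is the more forgiving of the two when the automatic scorer uses a different scale from the human raters.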

Lingshu: a medical multimodal foundation model trained on curated medical+general data with MedEvalKit evaluation

0.55

0.50

4

A carefully curated multimodal medical dataset plus staged tuning produces practical, near-proprietary medical QA and reporting performance while enabling smaller, cheaper models for deployment.

Key finding

Training data scale and mix: 3.75M open-source medical samples + 1.30M synthetic medical samples.

Numbers: 3.75M open + 1.30M synthetic (§2.3)

Build a modular Chinese financial LLM with instruction data and four task-specific LoRA experts

0.60

0.50

0.60

4

You can get domain gains cheaply by training small LoRA adapters and plugins instead of re-training big models; this yields better finance answers, more reliable calculations, and modular deployment.

Key finding

Task-specific LoRA adapters raise average FinNLP performance by a few to several points versus the base model.

Numbers: Average improvement of +2 to +9 points on six FinCUGE tasks (Table 3)

600k-chart instruction data + a human benchmark to improve multimodal chart QA

0.60

0.45

4

Automate chart reading and QA by fine-tuning multimodal LLMs with domain-specific chart instructions; expect better classification and reasoning but not perfect numeric table extraction.

Key finding

Large instruction corpus improves open-source LMMs on chart tasks.

Numbers: MMCA overall free-form 0.26 vs prior open-source best 0.24 (Table 4)