Finetuning Papers — Parsed & Scored for Practitioners

Practical, end-to-end guide to fine-tuning LLMs: pipelines, PEFT, RAG, alignment and deployment

0.70

0.35

0.70

39

Fine-tuning and RAG let you customize LLM behavior and accuracy while controlling cost; PEFT and quantization let you ship tailored models without enterprise-scale GPU fleets.

Key finding

QLoRA compresses model parameters and enables 4-bit fine-tuning while retaining near-16-bit performance.

Numbers: Reduces to ~5.2 bits/parameter (from 96 bits); ~18x memory reduction
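
The reported ratio can be checked with a few lines of arithmetic. The sketch below takes the 96 and ~5.2 bits-per-parameter figures from the entry above; the 7B parameter count is a hypothetical example chosen for illustration, not a number from the paper.

```python
# Figures quoted in the entry above: memory footprint per parameter.
BITS_BEFORE = 96.0   # bits per parameter before (as reported)
BITS_AFTER = 5.2     # approximate bits per parameter after (as reported)


def memory_gb(num_params: float, bits_per_param: float) -> float:
    """Memory in decimal gigabytes for a given per-parameter bit budget."""
    return num_params * bits_per_param / 8 / 1e9


reduction = BITS_BEFORE / BITS_AFTER

# Hypothetical model size, used only to make the ratio concrete.
EXAMPLE_PARAMS = 7e9
before_gb = memory_gb(EXAMPLE_PARAMS, BITS_BEFORE)
after_gb = memory_gb(EXAMPLE_PARAMS, BITS_AFTER)

print(f"reduction: {reduction:.1f}x")
print(f"hypothetical 7B model: {before_gb:.1f} GB vs {after_gb:.2f} GB")
```

The ratio comes out slightly above 18, consistent with the "~18x" figure.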

Clinical Camel: an open medical LLM fine-tuned with dialogue synthesis and single‑GPU QLoRA

0.20

0.60

0.40

35

An open, high-performing medical LLM reduces vendor lock-in, enables internal validation, and can be reproduced with modest compute, letting institutions experiment safely before any clinical adoption.

Key finding

Clinical Camel-70B beats GPT-3.5 on several medical QA benchmarks in five-shot tests.

Numbers: USMLE 64.3% vs GPT-3.5 58.5%; PubMedQA 77.9% vs 60.2%

Fine-tune quantized LLMs by updating only quantization scales to save memory and keep fast inference.

0.75

0.50

0.80

28

PEQA lets teams fine-tune and serve much larger LLMs on the same hardware by keeping models in low-bit form and only shipping small task-specific scale vectors, cutting memory and inference cost while preserving most performance.

Key finding

PEQA reduces deployed model size for LLaMA-65B from ~130.6GB to ~33.5GB at 4-bit.

Numbers: LoRA model size 130.57GB vs PEQA 33.45GB (Table 4).
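
As a rough check on the sizes above, the sketch below compares the reported ratio with the ideal 4x saving from going 16-bit to 4-bit. The 16-bit baseline is an assumption made here for illustration, and the sketch does not attribute the remaining gap to any particular component.

```python
LORA_GB = 130.57   # deployed size with LoRA (Table 4, as quoted above)
PEQA_GB = 33.45    # deployed size with PEQA at 4-bit (as quoted above)

# Ratio actually reported.
ratio = LORA_GB / PEQA_GB

# Assumption: the LoRA baseline stores weights in 16 bits, so pure 4-bit
# storage would be exactly 4x smaller.
ideal_ratio = 16 / 4
ideal_gb = LORA_GB / ideal_ratio

# Whatever remains beyond the ideal 4-bit size.
overhead_gb = PEQA_GB - ideal_gb

print(f"reported ratio: {ratio:.2f}x (ideal {ideal_ratio:.0f}x)")
print(f"size beyond ideal 4-bit storage: {overhead_gb:.2f} GB")
```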

Fine-tune a Chinese 13B LLM with legal syllogism data plus retrieval to build a practical legal assistant and benchmark

0.50

24

Fine-tuning a mid-size Chinese LLM with focused legal instruction data and a small retrieval KB yields measurable gains in legal QA and advice; this reduces manual review and makes legal tools more practical.

Key finding

The authors built a large, law-specific SFT dataset for training.

Numbers: DISC-Law-SFT total size 403K samples

KnowEdit benchmark and EasyEdit toolkit: a unified study and comparison of methods to change facts inside LLMs

0.50

0.70

0.60

20

Knowledge editing can cheaply update specific facts or behaviors in an LLM without full retraining, saving compute and time; but edits can fail to generalize and may break unrelated behavior, so careful validation is required.

Key finding

Several editing methods can reach near-perfect edit success on fact-insertion and fact-modification datasets.

Numbers: WikiData recent edit success: AdaLoRA=100, FT-M=100 (Table 4)

LawBench: a 20-task Chinese legal benchmark measuring memorization, understanding, and application by 51 LLMs

0.30

0.35

0.40

19

LawBench shows that even top LLMs are unreliable for legal judgments; businesses should treat model outputs as draft assistance, not legal advice, and validate with experts.

Key finding

GPT-4 is the best model on LawBench but is far from perfect.

Numbers: GPT-4 average zero-shot 52.35 (Table 26)

A simple, compute-efficient loop that generates model outputs, filters them by a learned reward, and fine-tunes the model offline to align L

0.60

0.70

18

ReST boosts model alignment with human preferences using offline data reuse, cutting the compute cost of repeated online RL sampling while improving translation quality and human-rated outputs.

Key finding

Each additional Improve step raises the model's average reward on validation.

Numbers: Figure 3: steady increases across IWSLT, WMT, Web Domain (rewards normalized 0–100)
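
The generate, filter, fine-tune loop described in this entry can be illustrated with a toy sketch. Everything here is a stand-in: the "policy" is a categorical distribution over ten outputs, `reward` is a hypothetical replacement for the learned reward model, and refitting by normalized counts substitutes for offline fine-tuning. It is not the paper's implementation.

```python
import random


def reward(output: int) -> float:
    """Hypothetical stand-in for a learned reward model, mapping 0..9 to [0, 1]."""
    return output / 9


def expected_reward(policy: list) -> float:
    """Exact average reward of a categorical policy."""
    return sum(p * reward(o) for o, p in enumerate(policy))


def grow(policy: list, n_samples: int, rng: random.Random) -> list:
    """Grow: sample a dataset of outputs from the current policy."""
    return rng.choices(range(len(policy)), weights=policy, k=n_samples)


def improve(dataset: list, threshold: float, n_outputs: int) -> list:
    """Improve: keep samples whose reward clears the threshold, then refit offline.

    Refitting is maximum likelihood on the kept samples (normalized counts),
    a toy substitute for fine-tuning a language model on filtered data.
    """
    kept = [o for o in dataset if reward(o) >= threshold]
    if not kept:
        raise ValueError("filter removed every sample; lower the threshold")
    return [kept.count(o) / len(kept) for o in range(n_outputs)]


rng = random.Random(0)
policy = [0.1] * 10                      # start from a uniform policy
history = [expected_reward(policy)]

# One Grow step, then several Improve steps that reuse the same dataset
# with a rising reward threshold.
dataset = grow(policy, n_samples=2000, rng=rng)
for threshold in (0.5, 0.7, 0.9):
    policy = improve(dataset, threshold, n_outputs=10)
    history.append(expected_reward(policy))

print([round(r, 3) for r in history])
```

In this toy setting the average reward rises after every Improve step because each step trains on a stricter subset of the same sampled data, which mirrors the key finding only qualitatively.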

A fast finetuning recipe that makes a large LLM 'forget' Harry Potter while keeping general skills

0.40

0.70

12

You can remove copyrighted or sensitive text from a large LLM with a short, targeted finetune instead of full retraining, cutting compute from hundreds of thousands of GPU-hours to minutes–hours for the targeted edit.

Key finding

The method dramatically reduces model 'familiarity' with Harry Potter as measured by completion-based tests.

Numbers: Familiarity (completion): 0.29 → 0.007 after ~120 finetuning steps

SPIN: let a supervised-finetuned LLM play against itself to improve without new human labels

0.60

0.70

0.50

11

SPIN can raise model quality using only existing supervised labels, cutting cost for collecting preference labels while needing extra compute to generate synthetic data.

Key finding

SPIN raises average Open LLM Leaderboard score starting from a SFT checkpoint.

Numbers: 58.14 → 63.16 average (Open LLM Leaderboard)

Fine-tune a vision+LLM for medical VQA and report writing using LoRA and one projection layer

0.50

0.60

0.70

11

You can adapt large vision+LLM models to medical VQA and report generation cheaply by tuning small adapter layers; use GPT-4 for scalable semantic QA evaluation instead of brittle lexical metrics.

Key finding

PEFT keeps trainable footprint tiny: only projection + LoRA updated.

Numbers: 56.63M trainable vs 7B full LLM
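
To put the trainable footprint in proportion, the sketch below computes the trainable fraction from the two figures above and shows the standard LoRA parameter count for one weight matrix. The 7B total is the nominal size quoted above, and the 4096 x 4096 layer with rank 8 is a hypothetical example, not a configuration from the paper.

```python
TRAINABLE_PARAMS = 56.63e6   # projection + LoRA parameters (as reported)
FULL_PARAMS = 7e9            # nominal size of the full LLM (as reported)

fraction = TRAINABLE_PARAMS / FULL_PARAMS
print(f"trainable share: {fraction:.2%}")


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added by LoRA to one d_out x d_in weight matrix.

    LoRA learns two low-rank factors, A (rank x d_in) and B (d_out x rank).
    """
    return rank * (d_in + d_out)


# Hypothetical layer shape and rank, for illustration only.
full_layer = 4096 * 4096
adapter = lora_params(4096, 4096, rank=8)
print(f"one layer: {adapter} adapter params vs {full_layer} frozen params")
```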

ChiMed‑GPT: a 13B Chinese medical LLM trained with pretraining, SFT and RLHF for safer, better medical answers

0.60

0.45

0.50

10

ChiMed‑GPT is a practical open-source Chinese medical LLM that gives clearer patient-facing answers, handles longer clinical text (4,096 tokens), and lowers risky biased replies, which makes it useful for telemedicine, triage bots, and medical content generation.

Key finding

Open-ended QA (BLEU-1): ChiMed‑GPT scored higher than GPT-4 on the tested dataset.

Numbers: BLEU-1 33.14 (ChiMed‑GPT) vs 24.29 (GPT-4)

FS-LLM: an open toolbox to run, benchmark and speed up federated fine‑tuning of LLMs

0.60

0.50

0.70

9

FS-LLM lets organizations co‑train LLMs across private data while cutting bandwidth and memory needs using PEFT and resource operators; this reduces cost and preserves IP when the full model must stay closed.

Key finding

LoRA gives the strongest PEFT results across domains in FL.

Numbers: Fed LLaMA-7B: LoRA 13.29% vs P-tuning 9.71% and prompt 9.63% on Fed-CodeAlpaca (Pass@1)

Teach an LLM to 'forget' bad behaviors using only negative examples and cheap finetuning

0.60

0.80

9

If your priority is to stop a model from producing specific harmful or copyrighted outputs quickly and cheaply, unlearning cuts those outputs dramatically with only finetune-level compute and no costly human-written positive examples.

Key finding

Unlearning can reduce harmful output rates to near zero on evaluated harmful prompts.

Numbers: harmful rate 47% → 1% (OPT-1.3B, Table 3)

Fine-tune LLaMA2 with context and video descriptions to improve emotion recognition in conversations

0.60

0.70

9

Fine-tuning an open 7B LLM with emotion and context data gives SOTA emotion detection while staying cheap to train, enabling faster builds of emotion-aware agents and analytics.

Key finding

DialogueLLM achieved state-of-the-art accuracy and F1 on three ERC benchmarks after emotion/context fine-tuning.

Numbers: MELD Acc 71.96%, F1 71.90; IEMOCAP Acc 70.62%, F1 69.93; EmoryNLP Acc 41.88%, F1 40.05

Benchmarking zeroth-order (no-backprop) optimizers to cut LLM fine-tuning memory and explore practical trade-offs

0.60

0.70

7

ZO methods let fine-tuning of multi-billion-parameter LLMs run with much lower peak memory, often on a single high-memory GPU, which reduces cloud costs and enables training in constrained environments. Expect accuracy or compute trade-offs on hard tasks, though.

Key finding

ZO optimizers cut peak memory by a large margin vs standard FO optimizers.

Numbers: ZO-SGD 64 GB vs FO-SGD 148 GB (peak) on OPT-13B/MultiRC
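
The memory saving comes from replacing backpropagation with forward passes only. The sketch below shows a common two-point zeroth-order gradient estimate on a toy quadratic loss; the loss, learning rate, and perturbation size are assumptions chosen for illustration, and this is not the benchmark's code.

```python
import random


def loss(theta: list, target: list) -> float:
    """Toy objective: squared distance to a target vector."""
    return sum((t - g) ** 2 for t, g in zip(theta, target))


def zo_sgd_step(theta, target, lr, eps, rng):
    """One zeroth-order SGD step using two loss evaluations and no gradients.

    A random direction z is sampled, the loss is evaluated at theta + eps*z
    and theta - eps*z, and the finite difference scales the update along z.
    """
    z = [rng.gauss(0.0, 1.0) for _ in theta]
    plus = [t + eps * zi for t, zi in zip(theta, z)]
    minus = [t - eps * zi for t, zi in zip(theta, z)]
    scale = (loss(plus, target) - loss(minus, target)) / (2 * eps)
    return [t - lr * scale * zi for t, zi in zip(theta, z)]


rng = random.Random(0)
target = [1.0, -2.0, 0.5, 3.0, -1.0]
theta = [0.0] * len(target)

initial_loss = loss(theta, target)
for _ in range(500):
    theta = zo_sgd_step(theta, target, lr=0.02, eps=1e-3, rng=rng)
final_loss = loss(theta, target)

print(f"loss: {initial_loss:.2f} -> {final_loss:.2e}")
```

Only parameter values and two scalar losses are held at any time, with no stored activations or gradient buffers. On harder objectives the estimate is noisier, which is one source of the accuracy and compute trade-offs mentioned above.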

Find code security bugs while the developer types using transformer models

0.75

0.45

0.70

7

Catching vulnerabilities while code is being written shortens fix time and cost; the paper shows large reductions in vulnerable completions from code LMs and near-90% reduction in production JS edits when integrated into an editor.

Key finding

Fine-tuned CodeBERT (DeepDevVuln) has the best F1 balance on their GitHub PR test set.

Numbers: Precision 58.87%, Recall 63.00%, F1 60.87% (Table 3)
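
The F1 figure follows from the reported precision and recall, since F1 is their harmonic mean. A minimal check:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Values quoted above (percent), from Table 3.
precision = 58.87
recall = 63.00

f1 = f1_score(precision, recall)
print(f"F1 = {f1:.2f}%")
```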

GLiNER: small bidirectional model that outperforms ChatGPT on zero-shot open-type NER

0.70

0.60

0.80

6

GLiNER gives production-ready open-type NER with 50M–300M models that beat ChatGPT zero-shot, cutting cost and latency while keeping competitive accuracy.

Key finding

GLiNER-L (300M) achieves average F1 60.9 on the OOD NER benchmark, outperforming ChatGPT.

Numbers: Avg F1 60.9 vs ChatGPT 47.5, a gain of 13.4 (Table 1)

Train LLMs on private data with federated learning; OpenFedLLM shows FL beats local training and can beat GPT‑4 in finance

0.60

0.55

0.65

6

Companies with private domain data can jointly fine-tune LLMs privately and get measurable gains over solo training; finance firms, hospitals, and firms with sensitive data can gain domain-leading models without sharing raw data.

Key finding

Federated learning consistently improves over single-client local fine-tuning across tasks.

Numbers: reported across multiple tables, e.g. MT-Avg (open-ended): FedAvg 3.346 vs Local 2.844 (Table 4)

GPT-4 can act as an automatic clinical judge of ophthalmology chatbot answers; fine-tuning helps but can also harm generalization

0.40

0.30

0.45

6

Automated LLM evaluation with GPT-4 can scale clinical-quality review and spot dangerous hallucinations, cutting manual grading costs and speeding iteration for healthcare chatbots; validate automated scores against clinicians before releasing patient-facing features.

Key finding

GPT-4 automated rankings strongly matched clinician rankings on the test set.

Numbers: Spearman ≈0.90; Kendall Tau ≈0.80; Cohen's Kappa ≈0.50
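
For readers who want to run the same kind of agreement check on their own data, here is a minimal sketch of Spearman's rho and Kendall's tau for two rankings without ties. The clinician and GPT-4 rankings below are hypothetical, made up for illustration; they are not the study's data.

```python
from itertools import combinations


def spearman_rho(rank_a: list, rank_b: list) -> float:
    """Spearman's rho for two rankings of the same items, assuming no ties."""
    n = len(rank_a)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))


def kendall_tau(rank_a: list, rank_b: list) -> float:
    """Kendall's tau-a: (concordant - discordant) / number of pairs, no ties."""
    concordant = discordant = 0
    for i, j in combinations(range(len(rank_a)), 2):
        sign = (rank_a[i] - rank_a[j]) * (rank_b[i] - rank_b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(rank_a) * (len(rank_a) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical rankings of five chatbot answers (1 = best).
clinician = [1, 2, 3, 4, 5]
gpt4 = [1, 3, 2, 4, 5]   # one adjacent pair swapped

rho = spearman_rho(clinician, gpt4)
tau = kendall_tau(clinician, gpt4)
print(f"Spearman rho = {rho:.2f}, Kendall tau = {tau:.2f}")
```

A single adjacent swap in five items gives rho = 0.9 and tau = 0.8, which shows why Kendall's tau is typically lower than Spearman's rho for the same level of disagreement. The match with the reported values is a property of the toy input, not evidence about the study.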

Aloe: open 7B–8B medical LLMs using synthetic Chain-of-Thought, model merging and Direct Preference Optimization

0.60

0.50

0.60

6

Aloe shows practical, low‑cost ways to push open medical LLMs: generate CoT examples with a stronger model, merge fine‑tuned variants, and use retrieval‑style prompting to get 2–7 point accuracy gains without larger models or expensive pretraining.

Key finding

Aloe's aligned 8B variant outperforms Llama‑3‑8B‑Instruct across medical benchmarks at this size.

Numbers: Zero‑shot avg: 70.25 vs 68.89 for Llama‑3‑8B (Table 3)

Fast, low-cost debiasing by estimating harmful training samples and 'unlearning' them

0.60

0.70

0.80

6

FMD lets teams reduce model bias quickly and cheaply by changing only a small external counterfactual set and a few classifier parameters, avoiding costly full retraining or large-scale relabeling.

Key finding

On Colored MNIST (bias ratio 0.99) FMD attains nearly the same accuracy as strong baselines while lowering measured counterfactual bias.

Numbers: Acc 80.04% vs 80.41%; Bias 0.2042 vs 0.2302; Time 48s vs 1658s; Samples 5k vs 50k

SWIFT: an open-source, one-stop framework to fine-tune, evaluate, quantize and deploy over 550 LLMs and 200+ MLLMs

0.70

0.40

0.70

5

SWIFT unifies fine-tuning, RL-style alignment, quantization and deployment for text and multimodal models. That reduces engineering overhead, accelerates experiments on agents and lets teams run many models and tuners without building custom glue code.

Key finding

SWIFT already supports a very large model and dataset surface.

Numbers: 550+ LLMs, 200+ MLLMs, 150+ datasets (paper claims)

Combine Transformer + knowledge distillation to shrink models while keeping high GLUE accuracy (reported 98.32% Acc)

0.40

0.30

0.60

5

You can shrink model footprint and inference cost while keeping high accuracy by distilling a Transformer into a smaller model, lowering hardware and energy bills for production NLP services.

Key finding

TKD-NLP reports top GLUE numbers among tested models.

Numbers: Acc 98.32%; F1 97.14% on GLUE

Use a fine-tuned language model plus spatiotemporal patching to predict 2D unsteady fluid flow faster and with lower error than prior ML sur

0.60

0.50

5

FLUID-LLM can cut multi-step prediction error for 2D CFD tasks and adapt from short context histories, helping engineering teams get fast, accurate surrogates without full solver runs.

Key finding

Scaling the LLM reduced long-horizon error on the Cylinder dataset.

Numbers: RMSE at 150 steps: FLUID-OPT125m=0.102 → FLUID-OPT2.7b=0.059 (≈42% reduction)
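
The "≈42% reduction" is a relative change between the two RMSE values, which can be verified directly:

```python
def relative_reduction(before: float, after: float) -> float:
    """Fractional reduction going from `before` to `after`."""
    return (before - after) / before


# RMSE at 150 steps on the Cylinder dataset, as quoted above.
rmse_small = 0.102   # FLUID-OPT125m
rmse_large = 0.059   # FLUID-OPT2.7b

reduction = relative_reduction(rmse_small, rmse_large)
print(f"RMSE reduction: {reduction:.1%}")
```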