Practical survey of how to combine fine-tuned LLMs into one model without retraining

Overview

Decision Snapshot: Needs Validation

Merging is production-ready for settings where models share a pretrained base and tasks are moderately related; advanced methods (sparsification, routing, search) address failures but add complexity and compute.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 14

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 4/5

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 78%

Production readiness: 70%

Novelty: 65%

Authors

Mingyang Song, Mao Zheng

Why It Matters For Business

Model merging lets you cheaply combine specialist LLMs into one deployable model, saving training cost and inference overhead while enabling rapid capability composition.

Who Should Care

Product Manager, ML Engineer, Engineering Lead, CTO, Founder

Summary TLDR

This survey organizes model merging — combining multiple trained models into one — around a four-part FUSE taxonomy: Foundations (why merging works), Unification strategies (how to merge), Scenarios (where merging helps), and Ecosystem (tools and gaps). It reviews weight averaging, task-vector arithmetic, sparsification (e.g., TIES, DARE), manifold-aware interpolation (SLERP), MoE-style expert routing, and automated search. It summarizes empirical strengths, common failure modes (interference, sign conflicts, permutation symmetry), practical toolkits (mergekit, Model Soups, LoRA-based hubs), and open problems in scalability, evaluation, and cross-architecture merging.
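Of the methods listed, SLERP is the easiest to show in a few lines. The sketch below interpolates along the arc between two weight vectors instead of along the straight line between them. It assumes each model's weights have been flattened into a plain list of floats; the function name `slerp` and the fallback to linear interpolation for near-parallel vectors are illustrative choices, not details taken from the survey.

```python
import math

def slerp(w_a, w_b, t, eps=1e-8):
    """Spherical linear interpolation between two flat weight vectors.

    t=0 returns w_a, t=1 returns w_b (hypothetical helper for illustration).
    """
    norm_a = math.sqrt(sum(x * x for x in w_a))
    norm_b = math.sqrt(sum(x * x for x in w_b))
    lerp = [(1 - t) * a + t * b for a, b in zip(w_a, w_b)]
    if norm_a == 0.0 or norm_b == 0.0:
        return lerp  # the angle is undefined for a zero vector

    # Angle between the two weight vectors.
    cos_omega = sum(a * b for a, b in zip(w_a, w_b)) / (norm_a * norm_b)
    cos_omega = max(-1.0, min(1.0, cos_omega))  # guard against rounding
    omega = math.acos(cos_omega)
    sin_omega = math.sin(omega)
    if abs(sin_omega) < eps:
        return lerp  # nearly parallel: the arc degenerates to a line

    coef_a = math.sin((1 - t) * omega) / sin_omega
    coef_b = math.sin(t * omega) / sin_omega
    return [coef_a * a + coef_b * b for a, b in zip(w_a, w_b)]
```

For two orthogonal unit vectors, the midpoint produced this way keeps unit norm, whereas a plain average would shrink it to about 0.71.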

Problem Statement

Fine-tuned LLMs are proliferating, but retraining them jointly or serving them as an ensemble is costly. Practitioners need ways to combine existing specialized models into one unified model cheaply and reliably. The core challenges are weight-space symmetries (permutation), parameter interference (sign and magnitude conflicts), the need for a shared initialization, architectural mismatch, and the lack of standardized evaluation.
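The permutation problem can be seen on a toy network. The sketch below builds a two-layer ReLU network, swaps its two hidden units to get a second network that computes exactly the same function, and then averages the two weight sets. The weights and input are made-up values chosen for illustration only.

```python
def relu(v):
    return [max(0.0, x) for x in v]

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def mlp(W1, W2, x):
    """Two-layer network: W2 @ relu(W1 @ x)."""
    return matvec(W2, relu(matvec(W1, x)))

def average(A, B):
    """Element-wise mean of two equally shaped matrices."""
    return [[(a + b) / 2 for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

# Model A computes relu(x0) - relu(x1).
A1 = [[1.0, 0.0], [0.0, 1.0]]
A2 = [[1.0, -1.0]]

# Model B is model A with its two hidden units swapped:
# rows of the first layer and columns of the second are permuted together.
B1 = [A1[1], A1[0]]
B2 = [[A2[0][1], A2[0][0]]]

x = [2.0, 0.5]
out_a = mlp(A1, A2, x)                              # [1.5]
out_b = mlp(B1, B2, x)                              # [1.5], same function
out_avg = mlp(average(A1, B1), average(A2, B2), x)  # [0.0], function destroyed
```

Both networks agree on every input, yet their naive average outputs zero because the matching units sit at different indices. This is why models trained from different initializations usually need alignment before any linear merge.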

Main Contribution

Proposes the FUSE taxonomy (Foundations, Unification, Scenarios, Ecosystem) to structure model merging research.

Systematically reviews algorithmic families: weight averaging, task-vector arithmetic, sparsification, geometric interpolation, MoE-style routing, and search-based merging.

Key Findings

Merging can preserve most task performance when models share a pretrained initialization.

Numbers: Merged accuracy often within 2–3% of individually fine-tuned models (reported experiments)

Practical Use: If your models were fine-tuned from the same base, try weight or task-vector averaging first; you may avoid retraining with only small accuracy loss.

Evidence Ref: Section 6.1 (Ilharco et al., 2023; text)
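A minimal sketch of the weight averaging recommended in this finding, assuming each checkpoint is a dict that maps parameter names to flat lists of floats and that all checkpoints were fine-tuned from the same base. The function name `average_checkpoints` and the optional `weights` argument are hypothetical; real checkpoints would hold tensors rather than lists.

```python
def average_checkpoints(checkpoints, weights=None):
    """Weighted average of parameter dicts with identical keys and shapes."""
    if weights is None:
        # Uniform average, as in a basic "soup" of checkpoints.
        weights = [1.0 / len(checkpoints)] * len(checkpoints)
    if len(weights) != len(checkpoints):
        raise ValueError("need one weight per checkpoint")

    reference = checkpoints[0]
    for ckpt in checkpoints[1:]:
        if ckpt.keys() != reference.keys():
            raise ValueError("checkpoints must share parameter names")

    merged = {}
    for name, ref_values in reference.items():
        tensors = [ckpt[name] for ckpt in checkpoints]
        if any(len(t) != len(ref_values) for t in tensors):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [
            sum(w * v for w, v in zip(weights, column))
            for column in zip(*tensors)
        ]
    return merged
```

After merging, evaluate the result on a held-out set for every source task before trusting it; the reported 2–3% gap is typical, not guaranteed.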

Sparsification methods reduce interference and enable larger multi-model merges.

Numbers: DARE preserved >90% task performance when merging up to six specialized LLMs

Practical Use: When merging many specialized models, use sparsification (trim or drop-and-rescale) to keep task quality high.

Evidence Ref: Section 6.1 (Yu et al., 2023; text)
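The drop-and-rescale step can be sketched as follows: each entry of a task vector (fine-tuned minus base) is dropped with probability `drop_rate`, and the survivors are scaled by `1 / (1 - drop_rate)` so the expected update is unchanged. The sketch assumes flat lists of floats and combines the sparsified deltas by simple scaled addition; the names `dare_sparsify` and `merge_with_dare`, the `scale` parameter, and the default values are illustrative assumptions.

```python
import random

def dare_sparsify(delta, drop_rate, rng):
    """Randomly zero entries of a task vector and rescale the rest."""
    if not 0.0 <= drop_rate < 1.0:
        raise ValueError("drop_rate must be in [0, 1)")
    keep_scale = 1.0 / (1.0 - drop_rate)
    return [0.0 if rng.random() < drop_rate else d * keep_scale for d in delta]

def merge_with_dare(base, finetuned_models, drop_rate=0.9, scale=1.0, seed=0):
    """Add sparsified task vectors from several models onto a shared base."""
    rng = random.Random(seed)  # fixed seed so the merge is repeatable
    merged = list(base)
    for finetuned in finetuned_models:
        delta = [f - b for f, b in zip(finetuned, base)]
        sparse = dare_sparsify(delta, drop_rate, rng)
        merged = [m + scale * s for m, s in zip(merged, sparse)]
    return merged
```

Because most entries of each delta are zeroed, task vectors from different models overlap less often, which is the mechanism by which sparsification reduces interference.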

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	within 2–3% on evaluated benchmarks	individual fine-tuned models	−2% to −3% typical	multi-task NLP benchmarks (reported experiments)	Section 6.1 (Ilharco et al., 2023; survey synthesis)	Section 6.1
DARE retention	>90% task performance when merging up to six LLMs	individual task models	—	multi-task merging experiments (text)	Section 6.1 (Yu et al., 2023; text)	Section 6.1

What To Try In 7 Days

Run Model Soups: average a few compatible fine-tuned checkpoints and validate on a held-out set.

Extract a task vector (fine-tuned minus base) and add/subtract it to modulate capability.

Apply TIES trimming on two task vectors to test sparsification and compare retention ratios.
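The second and third experiments above can be sketched together. The code extracts task vectors, trims each to its largest-magnitude entries, elects a sign per parameter, and averages only the entries that agree with that sign. It assumes flat lists of floats; the function names and the `density` and `scale` defaults are hypothetical, and entries tied at the trim threshold are all kept.

```python
def task_vector(finetuned, base):
    """Task vector: fine-tuned weights minus base weights."""
    return [f - b for f, b in zip(finetuned, base)]

def apply_task_vector(base, vector, scale=1.0):
    """Add (scale > 0) or subtract (scale < 0) a capability."""
    return [b + scale * v for b, v in zip(base, vector)]

def trim(vector, density):
    """Keep the top `density` fraction of entries by magnitude; zero the rest."""
    k = max(1, round(density * len(vector)))
    threshold = sorted((abs(v) for v in vector), reverse=True)[k - 1]
    return [v if abs(v) >= threshold else 0.0 for v in vector]

def ties_merge(base, finetuned_models, density=0.2, scale=1.0):
    """Trim, elect a sign per parameter, then average the agreeing entries."""
    trimmed = [trim(task_vector(ft, base), density) for ft in finetuned_models]
    merged = []
    for i, b in enumerate(base):
        values = [tv[i] for tv in trimmed]
        # The elected sign is the one carrying more total magnitude.
        sign = 1.0 if sum(values) >= 0 else -1.0
        agreeing = [v for v in values if v * sign > 0]
        update = sum(agreeing) / len(agreeing) if agreeing else 0.0
        merged.append(b + scale * update)
    return merged
```

Comparing the output of `ties_merge` against a plain average of the same task vectors on each source task gives the retention comparison suggested in the third item.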

Optimization Features

Token Efficiency

top-k expert activation to reduce FLOPs
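A rough sketch of top-k expert activation, assuming experts are plain Python callables that map a list of floats to a list of floats and that gate logits are already computed. Only the k selected experts are evaluated, which is where the FLOP saving comes from. The names `top_k_route` and `moe_forward` are hypothetical.

```python
import math

def top_k_route(gate_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    ranked = sorted(range(len(gate_logits)),
                    key=lambda i: gate_logits[i], reverse=True)
    top = ranked[:k]
    # Softmax over the selected logits only, so the k weights sum to 1.
    peak = max(gate_logits[i] for i in top)
    exps = [math.exp(gate_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

def moe_forward(x, experts, gate_logits, k=2):
    """Weighted sum of the outputs of the k selected experts."""
    routes = top_k_route(gate_logits, k)
    outputs = [(w, experts[i](x)) for i, w in routes]  # unselected experts never run
    width = len(outputs[0][1])
    return [sum(w * out[j] for w, out in outputs) for j in range(width)]
```

In a merged MoE-style model, each expert would be one of the source models' feed-forward blocks, and the gate would be trained or calibrated separately.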

Infra Optimization

sparse parameter transfer for bandwidth-limited federated learning

Model Optimization

weight averaging; task-vector arithmetic; MoE; LoRA

System Optimization

hierarchical aggregation for federated merges; proxy evaluation for evolutionary search

Training Optimization

trajectory averaging (SWA, EMA); lookahead optimizer
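The two trajectory-averaging variants differ only in how they weight past checkpoints. The sketch below shows one update step of each on flat lists of floats; the function names and the `decay` default are illustrative assumptions.

```python
def ema_update(avg, params, decay=0.99):
    """Exponential moving average: recent checkpoints count more."""
    return [decay * a + (1.0 - decay) * p for a, p in zip(avg, params)]

def swa_update(avg, params, n_averaged):
    """Equal-weight running mean; `n_averaged` checkpoints are already in `avg`."""
    return [a + (p - a) / (n_averaged + 1) for a, p in zip(avg, params)]

# Usage: fold three checkpoints from one training run into an equal-weight average.
checkpoints = [[1.0], [2.0], [3.0]]
swa = checkpoints[0]
for n, ckpt in enumerate(checkpoints[1:], start=1):
    swa = swa_update(swa, ckpt, n)
```

Both operate on checkpoints from a single training run, so the shared-initialization requirement is satisfied automatically.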

Inference Optimization

sparse top-k routing; LoRA

Reproducibility

Code Available: Yes

Data Available: No

Open Source Status: Partial

License: Unknown

Code URLs

mergekit (Goddard et al., 2024); Model Soups (Wortsman et al., 2022a); TIES-Merging (Yadav et al., 2023); LoRA / LoRAHub

Risks & Boundaries

Limitations

Requires identical architectures; merging across different architectures needs adapter-based alternatives.

Shared pretrained initialization is often necessary for reliable linear combinations.

When Not To Use

Source models trained from different random initializations without alignment tools.

Strongly conflicting task specializations without interference mitigation.

Failure Modes

Negative transfer: the merged model performs worse on a task than the constituent model fine-tuned for that task.

Emergent unsafe behavior or amplified backdoors from constituent models.

Core Entities

Models

LLaMA, LLaMA-2, LLaMA-3, Mistral, Mixtral, Qwen, Qwen2, DeepSeek, Gemma, CLIP, WizardLM, Code Llama

Metrics

task retention ratio (TRR), geometric mean retention (R_geo), accuracy, expected calibration error (ECE)
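This summary does not give formulas for these metrics, so the sketch below rests on stated assumptions: TRR is taken to be the merged model's score divided by the individually fine-tuned model's score on the same task, and R_geo to be the geometric mean of the per-task TRRs. ECE follows the standard equal-width binning definition. All function names are hypothetical.

```python
import math

def task_retention_ratio(merged_score, individual_score):
    """Assumed definition: fraction of the specialist's score the merge keeps."""
    return merged_score / individual_score

def geometric_mean_retention(ratios):
    """Assumed definition: geometric mean of per-task retention ratios."""
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin predictions by confidence; average |accuracy - confidence| by bin size."""
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, hit))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, h in members if h) / len(members)
        ece += len(members) / total * abs(accuracy - avg_conf)
    return ece
```

A geometric mean penalizes a merge that collapses on one task more than an arithmetic mean would, which makes it a reasonable single number for comparing merge recipes.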

Datasets

GSM8KMMLUHumanEvalTruthfulQAHellaSwagXNLIARCWinoGrande

Benchmarks

Open LLM Leaderboard, FusionBench, RewardBench
