Overview
Production Readiness
0.7
Novelty Score
0.65
Cost Impact Score
0.78
Citation Count
0
Why It Matters For Business
Model merging lets you cheaply combine specialist LLMs into one deployable model, saving training cost and inference overhead while enabling rapid capability composition.
Summary TLDR
This survey organizes model merging — combining multiple trained models into one — around a four-part FUSE taxonomy: Foundations (why merging works), Unification strategies (how to merge), Scenarios (where merging helps), and Ecosystem (tools and gaps). It reviews weight averaging, task-vector arithmetic, sparsification (e.g., TIES, DARE), manifold-aware interpolation (SLERP), MoE-style expert routing, and automated search. It summarizes empirical strengths, common failure modes (interference, sign conflicts, permutation symmetry), practical toolkits (mergekit, Model Soups, LoRA-based hubs), and open problems in scalability, evaluation, and cross-architecture merging.
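As an illustration of the manifold-aware interpolation mentioned above, here is a minimal SLERP sketch. It assumes each model's weights have been flattened into a plain list of floats; the function name `slerp` and the `eps` fallback threshold are illustrative choices, not taken from the survey or from any toolkit.

```python
import math

def slerp(w_a, w_b, t, eps=1e-8):
    """Spherically interpolate between two flat weight vectors (hypothetical helper)."""
    norm_a = math.sqrt(sum(x * x for x in w_a))
    norm_b = math.sqrt(sum(x * x for x in w_b))
    dot = sum(x * y for x, y in zip(w_a, w_b))
    # Angle between the two weight vectors, clamped for numerical safety.
    cos_omega = max(-1.0, min(1.0, dot / (norm_a * norm_b + eps)))
    omega = math.acos(cos_omega)
    sin_omega = math.sin(omega)
    if sin_omega < eps:
        # Nearly parallel vectors: fall back to plain linear interpolation.
        return [(1.0 - t) * x + t * y for x, y in zip(w_a, w_b)]
    scale_a = math.sin((1.0 - t) * omega) / sin_omega
    scale_b = math.sin(t * omega) / sin_omega
    return [scale_a * x + scale_b * y for x, y in zip(w_a, w_b)]

midpoint = slerp([1.0, 0.0], [0.0, 1.0], t=0.5)
```

Unlike straight averaging, the midpoint of two orthogonal unit vectors keeps unit norm here, which is the usual motivation for interpolating along the arc rather than the chord.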
Problem Statement
Fine-tuned LLMs are proliferating, but further training and ensembling are both costly. Practitioners need cheap, reliable ways to combine existing specialized models into a single model. The core challenges are weight-space symmetries (permutation), parameter interference (sign and magnitude conflicts), the need for a shared initialization, architectural mismatch, and the lack of standardized evaluation.
Main Contribution
Proposes the FUSE taxonomy (Foundations, Unification, Scenarios, Ecosystem) to structure model merging research.
Systematically reviews algorithmic families: weight averaging, task-vector arithmetic, sparsification, geometric interpolation, MoE-style routing, and search-based merging.
Synthesizes theory (loss landscapes, mode connectivity, permutation symmetry) and maps practical prerequisites for successful merges.
Surveys applications across multi-tasking, alignment/safety, multilingual transfer, federated learning, and deployment tradeoffs.
Identifies ecosystem resources and key open challenges: scalability, cross-architecture merging, automated merge prediction, and standardized benchmarks.
Key Findings
Merging can preserve most task performance when models share a pretrained initialization.
Sparsification methods reduce interference and enable larger multi-model merges.
Automated search recovers near-optimal merges with far fewer evaluations.
Merging can substantially help multilingual and low-resource transfer.
Merging carries safety and dual-use risks; naive merges can introduce or amplify vulnerabilities.
Task dissimilarity predicts merge failure: highly dissimilar tasks can cause large performance drops.
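The sparsification finding above can be made concrete with a DARE-style drop-and-rescale step. This is a rough sketch under stated assumptions: weights are flat lists of floats, and the names `dare_sparsify`, `drop_rate`, and `seed` are illustrative rather than any library's API.

```python
import random

def dare_sparsify(base, finetuned, drop_rate, seed=0):
    """Randomly drop fine-tuning deltas and rescale the survivors (sketch)."""
    if not 0.0 <= drop_rate < 1.0:
        raise ValueError("drop_rate must be in [0, 1)")
    rng = random.Random(seed)
    # Rescaling by 1 / (1 - p) keeps the expected delta unchanged.
    keep_scale = 1.0 / (1.0 - drop_rate)
    merged = []
    for b, f in zip(base, finetuned):
        delta = f - b
        if rng.random() < drop_rate:
            delta = 0.0          # dropped: parameter reverts to the base value
        else:
            delta *= keep_scale  # kept: delta is amplified to compensate
        merged.append(b + delta)
    return merged

sparse = dare_sparsify([0.0, 0.0, 0.0, 0.0], [1.0, 2.0, 3.0, 4.0], drop_rate=0.5)
```

Because each sparsified model touches fewer parameters, several of them can be summed onto the same base with less overlap, which is one plausible reading of why sparsification helps larger merges.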
Results
Accuracy
DARE retention
Evolutionary search efficiency
Multilingual transfer improvement
Reward-model averaging
Who Should Care
What To Try In 7 Days
Run Model Soups: average a few compatible fine-tuned checkpoints and validate on a held-out set.
Extract a task vector (fine-tuned weights minus base weights), then add it to or subtract it from a model's weights to strengthen or suppress that capability.
Apply TIES trimming on two task vectors to test sparsification and compare retention ratios.
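The three experiments above can be prototyped on toy weights before touching real checkpoints. The sketch below assumes checkpoints are dicts mapping parameter names to flat lists of floats; all function names are hypothetical, and `ties_merge` is a simplified version of the trim, elect-sign, and disjoint-mean steps, not the reference TIES-Merging implementation.

```python
def uniform_soup(checkpoints):
    """Average each parameter across checkpoints that share one architecture."""
    n = len(checkpoints)
    return {name: [sum(vals) / n for vals in zip(*(c[name] for c in checkpoints))]
            for name in checkpoints[0]}

def task_vector(finetuned, base):
    """Task vector = fine-tuned weights minus base weights."""
    return {k: [f - b for f, b in zip(finetuned[k], base[k])] for k in base}

def apply_task_vectors(base, vectors, scale=1.0):
    """Add (scale > 0) or subtract (scale < 0) the summed task vectors."""
    out = {}
    for k in base:
        total = [sum(vals) for vals in zip(*(v[k] for v in vectors))]
        out[k] = [b + scale * t for b, t in zip(base[k], total)]
    return out

def ties_trim(vec, keep_fraction):
    """Zero all but the largest-magnitude entries (ties may keep a few extra)."""
    k = max(1, int(round(keep_fraction * len(vec))))
    threshold = sorted((abs(x) for x in vec), reverse=True)[k - 1]
    return [x if abs(x) >= threshold else 0.0 for x in vec]

def ties_merge(vectors, keep_fraction):
    """Trim each flat task vector, elect a sign per entry, average agreeing values."""
    trimmed = [ties_trim(v, keep_fraction) for v in vectors]
    merged = []
    for entries in zip(*trimmed):
        # The sign with the larger total magnitude wins.
        positive = sum(entries) >= 0
        agreeing = [e for e in entries if e != 0.0 and (e > 0) == positive]
        merged.append(sum(agreeing) / len(agreeing) if agreeing else 0.0)
    return merged

base = {"w": [1.0, 2.0]}
tuned = {"w": [1.5, 2.0]}
vec = task_vector(tuned, base)
restored = apply_task_vectors(base, [vec])
```

With real models the same arithmetic would run over framework tensors instead of lists; the list-based form is only meant to make each step easy to inspect.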
Optimization Features
Token Efficiency
- top-k expert activation to reduce FLOPs
Infra Optimization
- sparse parameter transfer for bandwidth-limited federated learning
Model Optimization
- weight averaging
- task-vector arithmetic
- MoE
- LoRA
System Optimization
- hierarchical aggregation for federated merges
- proxy evaluation for evolutionary search
Training Optimization
- trajectory averaging (SWA, EMA)
- lookahead optimizer
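For the trajectory-averaging entry above, a minimal sketch of the two running averages may help. It assumes flat lists of floats for the weights; `ema_update`, `swa_update`, and the default `decay` value are illustrative names and settings, not drawn from the survey.

```python
def ema_update(avg, new, decay=0.99):
    """Exponential moving average: recent checkpoints count more."""
    return [decay * a + (1.0 - decay) * w for a, w in zip(avg, new)]

def swa_update(avg, new, n_averaged):
    """Equal-weight running mean over n_averaged earlier checkpoints plus `new`."""
    return [a + (w - a) / (n_averaged + 1) for a, w in zip(avg, new)]

# Fold three checkpoints from one training run into a single SWA average.
checkpoints = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]]
swa = checkpoints[0]
for count, ckpt in enumerate(checkpoints[1:], start=1):
    swa = swa_update(swa, ckpt, n_averaged=count)
```

Both updates need only the current average and the newest weights, so neither requires storing the whole trajectory.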
Inference Optimization
- sparse top-k routing
- LoRA
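Sparse top-k routing, listed above, can be sketched in a few lines. The example assumes experts are plain Python callables and that router logits are already computed; `top_k_route` and `moe_forward` are hypothetical names, and renormalizing the softmax over only the selected experts is one common convention rather than the only one.

```python
import math

def top_k_route(logits, k):
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)
    # Subtract the peak before exponentiating for numerical stability.
    exps = [math.exp(logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

def moe_forward(x, experts, router_logits, k=2):
    """Evaluate only the selected experts and mix their outputs."""
    return sum(weight * experts[i](x) for i, weight in top_k_route(router_logits, k))

experts = [lambda x: x, lambda x: 2 * x, lambda x: 4 * x, lambda x: 0 * x]
routes = top_k_route([0.0, 2.0, 2.0, -1.0], k=2)
output = moe_forward(1.0, experts, [0.0, 2.0, 2.0, -1.0], k=2)
```

Only k of the experts run per input, which is where the FLOP savings come from; all experts still have to be held in memory.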
Reproducibility
Code Urls
- mergekit (Goddard et al., 2024)
- Model Soups (Wortsman et al., 2022a)
- TIES-Merging (Yadav et al., 2023)
- LoRA / LoRAHub
Code Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Requires identical architectures, or adapter-based alternatives, for cross-model merges.
- Shared pretrained initialization is often necessary for reliable linear combinations.
- Interference (sign conflicts, magnitude imbalance) can cause major task degradation.
- Scalability gaps: alignment and permutation-solving costs grow with model size and number of experts.
- Evaluation gaps: no single standardized benchmark captures interference and emergent properties fully.
When Not To Use
- Source models trained from different random initializations without alignment tools.
- Strongly conflicting task specializations without interference mitigation.
- Safety-critical deployment without comprehensive red-teaming and calibration checks.
- Memory-constrained edge deployment when MoE structural preservation would increase model size.
Failure Modes
- Negative transfer: merged model performs worse than each constituent on its task.
- Emergent unsafe behavior or amplified backdoors from constituent models.
- Router collapse in MoE: all inputs routed to one expert, losing specialization.
- Catastrophic forgetting of less-dominant tasks when magnitude imbalances exist.
- Misleading validation: held-out proxy metrics fail under real distribution shift.
Core Entities
Models
- LLaMA
- LLaMA-2
- LLaMA-3
- Mistral
- Mixtral
- Qwen
- Qwen2
- DeepSeek
- Gemma
- CLIP
- WizardLM
- Code Llama
Metrics
- task retention ratio (TRR)
- geometric mean retention (R_geo)
- Accuracy
- expected calibration error (ECE)
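The metrics above are not defined on this page, so the sketch below rests on assumptions: TRR is taken to be the merged model's score divided by the corresponding specialist's score, R_geo the geometric mean of those ratios across tasks, and ECE the standard equal-width binned estimate. Function names and the sample scores are illustrative only.

```python
import math

def task_retention_ratio(merged_score, specialist_score):
    """Assumed definition: fraction of the specialist's score the merge retains."""
    return merged_score / specialist_score

def geometric_mean_retention(ratios):
    """Geometric mean of per-task retention ratios (all ratios must be > 0)."""
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: sample-weighted gap between mean confidence and accuracy."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for items in bins:
        if not items:
            continue
        mean_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1.0 for _, ok in items if ok) / len(items)
        ece += len(items) / total * abs(mean_conf - accuracy)
    return ece

# Made-up scores purely to show the call pattern.
ratios = [task_retention_ratio(45.0, 50.0), task_retention_ratio(72.0, 80.0)]
r_geo = geometric_mean_retention(ratios)
```

A geometric mean penalizes a collapse on any single task more than an arithmetic mean would, which suits a metric meant to flag interference.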
Datasets
- GSM8K
- MMLU
- HumanEval
- TruthfulQA
- HellaSwag
- XNLI
- ARC
- WinoGrande
Benchmarks
- Open LLM Leaderboard
- FusionBench
- RewardBench

