Practical survey of how to combine fine-tuned LLMs into one model without retraining

March 10, 2026 · 8 min

Overview

Production Readiness

0.7

Novelty Score

0.65

Cost Impact Score

0.78

Citation Count

0

Authors

Mingyang Song, Mao Zheng

Links

Abstract / PDF

Why It Matters For Business

Model merging lets you cheaply combine specialist LLMs into one deployable model, saving training cost and inference overhead while enabling rapid capability composition.

Summary TLDR

This survey organizes model merging — combining multiple trained models into one — around a four-part FUSE taxonomy: Foundations (why merging works), Unification strategies (how to merge), Scenarios (where merging helps), and Ecosystem (tools and gaps). It reviews weight averaging, task-vector arithmetic, sparsification (e.g., TIES, DARE), manifold-aware interpolation (SLERP), MoE-style expert routing, and automated search. It summarizes empirical strengths, common failure modes (interference, sign conflicts, permutation symmetry), practical toolkits (mergekit, Model Soups, LoRA-based hubs), and open problems in scalability, evaluation, and cross-architecture merging.

Problem Statement

Fine-tuned LLMs are proliferating, but retraining them or serving them as ensembles is costly. Practitioners need ways to combine existing specialized models into one unified model cheaply and reliably. The core challenges are weight-space symmetries (permutation), parameter interference (sign and magnitude conflicts), the need for a shared initialization, architectural mismatch, and the lack of standardized evaluation.

Main Contribution

Proposes the FUSE taxonomy (Foundations, Unification, Scenarios, Ecosystem) to structure model merging research.

Systematically reviews algorithmic families: weight averaging, task-vector arithmetic, sparsification, geometric interpolation, MoE-style routing, and search-based merging.

Synthesizes theory (loss landscapes, mode connectivity, permutation symmetry) and maps practical prerequisites for successful merges.

Surveys applications across multi-tasking, alignment/safety, multilingual transfer, federated learning, and deployment tradeoffs.

Identifies ecosystem resources and key open challenges: scalability, cross-architecture merging, automated merge prediction, and standardized benchmarks.

Key Findings

Merging can preserve most task performance when models share a pretrained initialization.

Numbers: Merged accuracy often within 2–3% of individually fine-tuned models (reported experiments)

Sparsification methods reduce interference and enable larger multi-model merges.

Numbers: DARE preserved >90% task performance when merging up to six specialized LLMs
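To make the sparsification idea concrete, here is a minimal NumPy sketch of the drop-and-rescale step that DARE is named after, followed by a plain sum of the sparsified task vectors onto the base weights. It assumes flat weight arrays and a shared base model; the names `dare`, `dare_merge`, `math_model`, and `code_model` are illustrative and not taken from the survey or from any library.

```python
import numpy as np

def dare(delta, drop_prob, rng):
    """Randomly zero a fraction of a task vector and rescale the survivors."""
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    keep = rng.random(delta.shape) >= drop_prob   # True where the entry survives
    return delta * keep / (1.0 - drop_prob)       # rescaling preserves the expected value

def dare_merge(base, finetuned, drop_prob=0.9, seed=0):
    """Add DARE-sparsified task vectors (fine-tuned minus base) to the base weights."""
    rng = np.random.default_rng(seed)
    merged = base.copy()
    for weights in finetuned:
        merged += dare(weights - base, drop_prob, rng)
    return merged

# Hypothetical toy weights standing in for two specialists with a shared base.
base = np.zeros(10_000)
math_model = base + 0.01
code_model = base - 0.02
merged = dare_merge(base, [math_model, code_model], drop_prob=0.9)
```

Because each surviving entry is scaled by 1 / (1 - drop_prob), the merged weights match the dense sum of task vectors in expectation while most individual entries come from at most one model, which is the intuition behind reduced interference.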

Automated search recovers near-optimal merges with far fewer evaluations.

Numbers: CMA-ES search recovered 85–95% of oracle merge at ~10× lower cost vs exhaustive search
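The search idea can be illustrated with a much simpler optimizer than CMA-ES. The sketch below is an elitist evolution strategy over merge coefficients, not CMA-ES itself (it adapts only a scalar step size, with no covariance matrix). The `proxy_score` function and its `target` are hypothetical stand-ins for evaluating a merged model on a validation set.

```python
import numpy as np

def search_merge_weights(evaluate, n_models, generations=30, population=16,
                         sigma=0.3, seed=0):
    """Simple elitist evolution strategy over merge coefficients (not CMA-ES)."""
    rng = np.random.default_rng(seed)
    best = np.full(n_models, 1.0 / n_models)      # start from a uniform average
    best_score = evaluate(best)
    for _ in range(generations):
        # Sample a population of coefficient vectors around the current best.
        candidates = best + sigma * rng.standard_normal((population, n_models))
        scores = [evaluate(c) for c in candidates]
        top = int(np.argmax(scores))
        if scores[top] > best_score:              # elitism: never accept a worse merge
            best, best_score = candidates[top], scores[top]
        sigma *= 0.9                              # shrink the search radius over time
    return best, best_score

# Hypothetical proxy: pretend the ideal coefficients are (0.7, 0.3).
target = np.array([0.7, 0.3])

def proxy_score(w):
    return -float(np.sum((w - target) ** 2))      # higher is better, 0 is perfect

weights, score = search_merge_weights(proxy_score, n_models=2)
```

In practice each call to `evaluate` would build a merged model and score it, so the evaluation budget (`generations * population`) is the cost to control.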

Merging can substantially help multilingual and low-resource transfer.

Numbers: Interpolating English and target-language models improved XNLI by ~15–20%

Merging carries safety and dual-use risks; naive merges can introduce or amplify vulnerabilities.

Numbers: Reward averaging reduced reward-hacking by ~4–7% (RewardBench experiments)

Task dissimilarity predicts merge failure: highly dissimilar tasks can cause large performance drops.

Numbers: Pairs with high dissimilarity saw up to ~25% performance drops on some tasks

Results

Accuracy

Value: within 2–3% on evaluated benchmarks

Baseline: individual fine-tuned models

DARE retention

Value: >90% task performance when merging up to six LLMs

Baseline: individual task models

Evolutionary search efficiency

Value: 85–95% of oracle with ~10× fewer evaluations

Baseline: exhaustive search (oracle)

Multilingual transfer improvement

Value: +15–20% on XNLI for low-resource languages

Baseline: English-only fine-tuned baseline

Reward-model averaging

Value: 4–7% reduction in reward-hacking (RewardBench)

Baseline: single reward model

Who Should Care

What To Try In 7 Days

Run Model Soups: average a few compatible fine-tuned checkpoints and validate on a held-out set.

Extract a task vector (fine-tuned minus base) and add/subtract it to modulate capability.

Apply TIES trimming on two task vectors to test sparsification and compare retention ratios.
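The three experiments above can be prototyped on toy arrays before touching real checkpoints. The following is a minimal NumPy sketch, assuming models are represented as dicts of arrays with identical keys (for the soup and task-vector helpers) or as flat vectors (for the TIES-style merge). All function names are illustrative; this is not the API of mergekit or of the Model Soups and TIES-Merging reference code.

```python
import numpy as np

def uniform_soup(state_dicts):
    """Average parameters key by key (a uniform Model Soup)."""
    keys = state_dicts[0].keys()
    return {k: np.mean([sd[k] for sd in state_dicts], axis=0) for k in keys}

def task_vector(finetuned, base):
    """Task vector = fine-tuned weights minus base weights."""
    return {k: finetuned[k] - base[k] for k in base}

def apply_task_vector(base, vector, scale=1.0):
    """scale > 0 adds the capability, scale < 0 subtracts it."""
    return {k: base[k] + scale * vector[k] for k in base}

def ties_merge(vectors, keep_fraction=0.2):
    """TIES-style merge of flat task vectors: trim, elect sign, disjoint mean."""
    stacked = np.stack(vectors).astype(float)         # (n_tasks, n_params)
    k = max(1, int(round(keep_fraction * stacked.shape[1])))
    trimmed = np.zeros_like(stacked)
    for i, v in enumerate(stacked):
        top = np.argsort(np.abs(v))[-k:]              # indices of the k largest magnitudes
        trimmed[i, top] = v[top]
    elected = np.sign(trimmed.sum(axis=0))            # sign carrying the larger total mass
    agree = (np.sign(trimmed) == elected) & (trimmed != 0)
    counts = np.maximum(agree.sum(axis=0), 1)         # avoid division by zero
    return (trimmed * agree).sum(axis=0) / counts     # mean over agreeing entries only

# Toy task vectors with sign conflicts in the first two coordinates.
a = np.array([0.9, -0.1, 0.0, 0.5])
b = np.array([-0.2, 0.8, 0.05, 0.4])
merged_dense = ties_merge([a, b], keep_fraction=1.0)   # approx. [0.9, 0.8, 0.05, 0.45]
merged_sparse = ties_merge([a, b], keep_fraction=0.5)  # approx. [0.9, 0.8, 0.0, 0.45]
```

Note how the conflicting entries are not averaged toward zero: the elected sign wins and only agreeing values contribute, which is the behavior to compare against a plain average when measuring retention.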

Optimization Features

Token Efficiency

  • top-k expert activation to reduce FLOPs

Infra Optimization

  • sparse parameter transfer for bandwidth-limited federated learning

Model Optimization

  • weight averaging
  • task-vector arithmetic
  • MoE
  • LoRA

System Optimization

  • hierarchical aggregation for federated merges
  • proxy evaluation for evolutionary search

Training Optimization

  • trajectory averaging (SWA, EMA)
  • lookahead optimizer
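As a rough illustration of trajectory averaging, the sketch below shows the two update rules usually meant by SWA (an equal-weight running mean of checkpoints) and EMA (an exponentially decayed mean). The checkpoint values are toy numbers and the helper names are illustrative.

```python
import numpy as np

def ema_update(avg, weights, decay=0.99):
    """Exponential moving average: recent checkpoints count more."""
    return decay * avg + (1.0 - decay) * weights

def swa_update(avg, weights, n_averaged):
    """Equal-weight running mean; n_averaged is how many checkpoints avg already covers."""
    return avg + (weights - avg) / (n_averaged + 1)

# Toy "checkpoints" along a training trajectory.
checkpoints = [np.array([1.0]), np.array([2.0]), np.array([3.0]), np.array([4.0])]

swa = checkpoints[0]
ema = checkpoints[0]
for n, w in enumerate(checkpoints[1:], start=1):
    swa = swa_update(swa, w, n)
    ema = ema_update(ema, w, decay=0.5)
```

Both rules need only the running average and the newest checkpoint, so neither requires keeping the full trajectory in memory.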

Inference Optimization

  • sparse top-k routing
  • LoRA
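A minimal sketch of sparse top-k routing for a single token follows, assuming a linear router and experts given as plain callables. The router matrix, the four scalar "experts", and the input are hypothetical toy values chosen only to show the control flow.

```python
import numpy as np

def top_k_route(x, router_w, experts, k=2):
    """Send one token to its k highest-scoring experts and mix their outputs."""
    logits = router_w @ x                          # one score per expert
    top = np.argsort(logits)[-k:]                  # indices of the k best experts
    gates = np.exp(logits[top] - logits[top].max())
    gates /= gates.sum()                           # softmax over the selected experts only
    output = sum(g * experts[i](x) for g, i in zip(gates, top))
    return output, top

# Toy setup: four "experts" that just scale their input by different factors.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
router_w = np.eye(4)
x = np.array([0.1, 0.9, 0.2, 0.8])

y, chosen = top_k_route(x, router_w, experts, k=2)
```

Only the `k` selected experts are evaluated, so compute per token scales with `k` rather than with the total number of experts, while all experts still have to be held in memory.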

Reproducibility

Code URLs

  • mergekit (Goddard et al., 2024)
  • Model Soups (Wortsman et al., 2022a)
  • TIES-Merging (Yadav et al., 2023)
  • LoRA / LoRAHub

Code Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Requires architectural identity or adapter-based alternatives for cross-model merges.
  • Shared pretrained initialization is often necessary for reliable linear combinations.
  • Interference (sign conflicts, magnitude imbalance) can cause major task degradation.
  • Scalability gaps: alignment and permutation-solving costs grow with model size and number of experts.
  • Evaluation gaps: no single standardized benchmark captures interference and emergent properties fully.

When Not To Use

  • Source models trained from different random initializations without alignment tools.
  • Strongly conflicting task specializations without interference mitigation.
  • Safety-critical deployment without comprehensive red-teaming and calibration checks.
  • Memory-constrained edge deployment when MoE structural preservation would increase model size.

Failure Modes

  • Negative transfer: merged model performs worse than each constituent on its task.
  • Emergent unsafe behavior or amplified backdoors from constituent models.
  • Router collapse in MoE: all inputs routed to one expert, losing specialization.
  • Catastrophic forgetting of less-dominant tasks when magnitude imbalances exist.
  • Misleading validation: held-out proxy metrics fail under real distribution shift.

Core Entities

Models

  • LLaMA
  • LLaMA-2
  • LLaMA-3
  • Mistral
  • Mixtral
  • Qwen
  • Qwen2
  • DeepSeek
  • Gemma
  • CLIP
  • WizardLM
  • Code Llama

Metrics

  • task retention ratio (TRR)
  • geometric mean retention (R_geo)
  • Accuracy
  • expected calibration error (ECE)
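The metrics above can be computed in a few lines. The sketch below assumes that the task retention ratio is the merged model's score divided by the individually fine-tuned model's score on the same task, and that R_geo is the geometric mean of those ratios; the survey's exact definitions may differ. The ECE helper uses the common equal-width binning formulation. All scores in the example are made-up placeholders, not reported results.

```python
import numpy as np

def task_retention_ratio(merged_score, individual_score):
    """Assumed definition: merged score relative to the specialist's score."""
    return merged_score / individual_score

def geometric_mean_retention(ratios):
    """Geometric mean of per-task retention ratios (all must be positive)."""
    ratios = np.asarray(ratios, dtype=float)
    return float(np.exp(np.mean(np.log(ratios))))

def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin predictions by confidence, then average |accuracy - confidence| by bin weight."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap             # in_bin.mean() is the bin's sample share
    return float(ece)

# Placeholder scores for illustration only.
ratios = [task_retention_ratio(0.72, 0.80), task_retention_ratio(0.45, 0.50)]
r_geo = geometric_mean_retention(ratios)
ece = expected_calibration_error([0.95, 0.95, 0.95, 0.95], [1, 1, 1, 0])
```

A geometric mean penalizes a single badly degraded task more than an arithmetic mean does, which is useful when one collapsed task should not be hidden by several well-retained ones.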

Datasets

  • GSM8K
  • MMLU
  • HumanEval
  • TruthfulQA
  • HellaSwag
  • XNLI
  • ARC
  • WinoGrande

Benchmarks

  • Open LLM Leaderboard
  • FusionBench
  • RewardBench