Practical survey of how to combine fine-tuned LLMs into one model without retraining

March 10, 2026 · 8 min

Overview

Decision Snapshot: Needs Validation

Merging is production-ready for settings where models share a pretrained base and tasks are moderately related; advanced methods (sparsification, routing, search) address failures but add complexity and compute.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 14

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 4/5

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 78%

Production readiness: 70%

Novelty: 65%

Authors

Mingyang Song, Mao Zheng

Why It Matters For Business

Model merging lets you cheaply combine specialist LLMs into one deployable model, saving training cost and inference overhead while enabling rapid capability composition.

Who Should Care

Summary TLDR

This survey organizes model merging — combining multiple trained models into one — around a four-part FUSE taxonomy: Foundations (why merging works), Unification strategies (how to merge), Scenarios (where merging helps), and Ecosystem (tools and gaps). It reviews weight averaging, task-vector arithmetic, sparsification (e.g., TIES, DARE), manifold-aware interpolation (SLERP), MoE-style expert routing, and automated search. It summarizes empirical strengths, common failure modes (interference, sign conflicts, permutation symmetry), practical toolkits (mergekit, Model Soups, LoRA-based hubs), and open problems in scalability, evaluation, and cross-architecture merging.
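Of the methods listed above, SLERP has a compact closed form, so a short sketch may help. The code below is a minimal illustration, assuming each model's weights are flattened into a plain Python list of floats; the function name `slerp` and the `eps` fallback threshold are choices made for this example, not something the survey prescribes.

```python
import math

def slerp(a, b, t, eps=1e-8):
    """Spherical linear interpolation between two flat weight vectors.

    t=0 returns a, t=1 returns b. Falls back to plain linear interpolation
    when the vectors are (anti)parallel or one of them is all zeros.
    """
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    lerp = [(1 - t) * x + t * y for x, y in zip(a, b)]
    if norm_a == 0.0 or norm_b == 0.0:
        return lerp
    cos_omega = sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)
    cos_omega = max(-1.0, min(1.0, cos_omega))  # guard against rounding
    omega = math.acos(cos_omega)                # angle between a and b
    sin_omega = math.sin(omega)
    if sin_omega < eps:
        return lerp
    weight_a = math.sin((1 - t) * omega) / sin_omega
    weight_b = math.sin(t * omega) / sin_omega
    return [weight_a * x + weight_b * y for x, y in zip(a, b)]

# Halfway between two orthogonal unit vectors stays on the unit circle,
# whereas linear interpolation would shrink the norm.
midpoint = slerp([1.0, 0.0], [0.0, 1.0], 0.5)
```

Unlike linear interpolation, the midpoint here keeps unit norm, which is the property that motivates interpolating along the arc rather than the chord.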

Problem Statement

Fine-tuned LLMs are proliferating, but retraining them or serving them as ensembles is costly. Practitioners need ways to combine existing specialized models into one unified model cheaply and reliably. The core challenges are weight-space symmetries (permutation), parameter interference (sign and magnitude conflicts), the need for a shared initialization, architectural mismatch, and the lack of standardized evaluation.
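The permutation symmetry named above can be made concrete with a toy network. The sketch below is illustrative only: the two-layer ReLU network, its weights, and the helper names (`forward`, `permute_hidden`, `average`) are assumptions made for this example. It shows that reordering hidden units leaves the function unchanged, while naively averaging the two equivalent weight sets does not.

```python
# Toy network: y = W2 . relu(W1 . x), with weights stored as nested lists.

def relu(v):
    return [max(0.0, a) for a in v]

def matvec(m, v):
    return [sum(w * a for w, a in zip(row, v)) for row in m]

def forward(w1, w2, x):
    return matvec(w2, relu(matvec(w1, x)))

def permute_hidden(w1, w2, perm):
    """Reorder hidden units: rows of w1 and the matching columns of w2."""
    new_w1 = [w1[i] for i in perm]
    new_w2 = [[row[i] for i in perm] for row in w2]
    return new_w1, new_w2

def average(a, b):
    """Element-wise mean of two equally shaped weight matrices."""
    return [[(p + q) / 2 for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]

w1 = [[1.0, 0.0], [0.0, 1.0]]  # 2 hidden units, 2 inputs
w2 = [[1.0, -1.0]]             # 1 output
x = [2.0, 0.5]

p1, p2 = permute_hidden(w1, w2, [1, 0])
original_out = forward(w1, w2, x)
permuted_out = forward(p1, p2, x)

# Averaging two functionally identical networks in weight space
averaged_out = forward(average(w1, p1), average(w2, p2), x)
```

With these toy weights the original and permuted networks both output 1.5, while the averaged network outputs 0.0. This is why models trained from different initializations usually need alignment before any linear merge.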

Main Contribution

Proposes the FUSE taxonomy (Foundations, Unification, Scenarios, Ecosystem) to structure model merging research.

Systematically reviews algorithmic families: weight averaging, task-vector arithmetic, sparsification, geometric interpolation, MoE-style routing, and search-based merging.

Key Findings

Merging can preserve most task performance when models share a pretrained initialization.

Numbers: Merged accuracy often within 2% to 3% of individually fine-tuned models (reported experiments)

Practical Use: If your models were fine-tuned from the same base, try weight or task-vector averaging first; you may avoid retraining with only a small accuracy loss.

Evidence Ref: Section 6.1 (Ilharco et al., 2023; text)
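As a starting point for the weight-averaging advice above, here is a minimal sketch of a uniform average over checkpoints. It assumes each checkpoint is a plain dict mapping parameter names to flat lists of floats (real checkpoints hold tensors), and the function name `average_weights` and the parameter names in the example are hypothetical.

```python
def average_weights(state_dicts):
    """Uniformly average checkpoints that share parameter names and shapes."""
    if not state_dicts:
        raise ValueError("need at least one checkpoint")
    names = set(state_dicts[0])
    for sd in state_dicts[1:]:
        if set(sd) != names:
            raise ValueError("checkpoints must share parameter names")
    n = len(state_dicts)
    return {
        name: [sum(vals) / n for vals in zip(*(sd[name] for sd in state_dicts))]
        for name in state_dicts[0]
    }

# Two hypothetical checkpoints fine-tuned from the same base model
ckpt_a = {"layer.weight": [0.2, -0.4, 1.0], "layer.bias": [0.0]}
ckpt_b = {"layer.weight": [0.4, -0.2, 0.0], "layer.bias": [0.2]}
merged = average_weights([ckpt_a, ckpt_b])
```

The name check is deliberate: averaging only makes sense when every checkpoint has the same architecture, and, per the finding above, the same pretrained initialization. Validate the merged weights on a held-out set before relying on them.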

Sparsification methods reduce interference and enable larger multi-model merges.

Numbers: DARE preserved >90% task performance when merging up to six specialized LLMs

Practical Use: When merging many specialized models, use sparsification (trim or drop-and-rescale) to keep task quality high.

Evidence Ref: Section 6.1 (Yu et al., 2023; text)
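The drop-and-rescale idea can be sketched in a few lines. The code below is a simplified illustration, assuming flat float lists for weights; the helper names `drop_and_rescale` and `merge_with_dare`, the default drop rate, and the choice to simply sum the sparsified deltas onto the base are assumptions for this example rather than the exact recipe from Yu et al.

```python
import random

def drop_and_rescale(delta, drop_rate, rng):
    """Zero each delta with probability drop_rate and rescale survivors.

    Rescaling by 1 / (1 - drop_rate) keeps the expected value of every
    entry equal to its original value.
    """
    if not 0.0 <= drop_rate < 1.0:
        raise ValueError("drop_rate must be in [0, 1)")
    scale = 1.0 / (1.0 - drop_rate)
    return [0.0 if rng.random() < drop_rate else d * scale for d in delta]

def merge_with_dare(base, finetuned_models, drop_rate=0.9, seed=0):
    """Add sparsified deltas from several fine-tuned models to the base."""
    rng = random.Random(seed)
    merged = list(base)
    for finetuned in finetuned_models:
        delta = [f - b for f, b in zip(finetuned, base)]
        for i, d in enumerate(drop_and_rescale(delta, drop_rate, rng)):
            merged[i] += d
    return merged

base = [0.0] * 8
math_model = [0.1, 0.0, -0.2, 0.0, 0.3, 0.0, 0.0, 0.1]   # hypothetical
code_model = [0.0, 0.2, 0.0, -0.1, 0.0, 0.0, 0.4, 0.0]   # hypothetical
merged = merge_with_dare(base, [math_model, code_model], drop_rate=0.5)
```

Because most deltas are zeroed, two models are less likely to write conflicting values to the same parameter, which is the interference reduction the finding describes.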

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | within 2% to 3% on evaluated benchmarks | individual fine-tuned models | −2% to −3% typical | multi-task NLP benchmarks (reported experiments) | Section 6.1 (Ilharco et al., 2023; survey synthesis) | Section 6.1
DARE retention | >90% task performance when merging up to six LLMs | individual task models | (not given) | multi-task merging experiments (text) | Section 6.1 (Yu et al., 2023; text) | Section 6.1

What To Try In 7 Days

Run Model Soups: average a few compatible fine-tuned checkpoints and validate on a held-out set.

Extract a task vector (fine-tuned minus base) and add/subtract it to modulate capability.

Apply TIES trimming on two task vectors to test sparsification and compare retention ratios.
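The second and third experiments above can be prototyped on toy vectors before touching real checkpoints. This is a minimal sketch, assuming flat float lists for weights; the helper names (`task_vector`, `apply_task_vector`, `trim`, `ties_merge`) and the `keep_fraction` default are illustrative choices. It follows the three TIES steps (trim, elect sign, disjoint mean) in simplified form.

```python
def task_vector(finetuned, base):
    """Task vector = fine-tuned weights minus base weights."""
    return [f - b for f, b in zip(finetuned, base)]

def apply_task_vector(base, tv, scale=1.0):
    """scale > 0 adds the capability; scale < 0 subtracts it."""
    return [b + scale * t for b, t in zip(base, tv)]

def trim(tv, keep_fraction):
    """Keep the largest-magnitude entries and zero the rest.

    Entries tied with the threshold are all kept.
    """
    k = max(1, round(len(tv) * keep_fraction))
    threshold = sorted((abs(t) for t in tv), reverse=True)[k - 1]
    return [t if abs(t) >= threshold else 0.0 for t in tv]

def ties_merge(task_vectors, keep_fraction=0.2):
    """Trim each vector, elect a sign per parameter, average agreeing values."""
    trimmed = [trim(tv, keep_fraction) for tv in task_vectors]
    merged = []
    for values in zip(*trimmed):
        # The sign of the sum is the sign carrying more total magnitude.
        sign = 1.0 if sum(values) >= 0 else -1.0
        agreeing = [v for v in values if v * sign > 0]
        merged.append(sum(agreeing) / len(agreeing) if agreeing else 0.0)
    return merged

tv_a = [1.0, -0.1, 0.6, 0.05]   # hypothetical task vector A
tv_b = [-0.4, 0.8, 0.2, 0.0]    # hypothetical task vector B
merged_tv = ties_merge([tv_a, tv_b], keep_fraction=0.5)
merged_model = apply_task_vector([0.0, 0.0, 0.0, 0.0], merged_tv)
```

In the first position the two vectors disagree in sign (1.0 versus -0.4); the positive side carries more magnitude, so only 1.0 survives instead of the two values partially cancelling as they would under plain averaging.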

Optimization Features

Token Efficiency: top-k expert activation to reduce FLOPs
Infra Optimization: sparse parameter transfer for bandwidth-limited federated learning
Model Optimization: weight averaging, task-vector arithmetic, MoE, LoRA
System Optimization: hierarchical aggregation for federated merges, proxy evaluation for evolutionary search
Training Optimization: trajectory averaging (SWA, EMA), lookahead optimizer
Inference Optimization: sparse top-k routing, LoRA
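To show why top-k expert activation saves compute, here is a minimal routing sketch. It assumes experts are plain Python callables and that gate scores are already computed; the names `top_k_route` and `moe_forward`, and the choice to renormalize with a softmax over only the selected experts, are assumptions for this example.

```python
import math

def top_k_route(gate_logits, k):
    """Return {expert_index: weight} for the k highest-scoring experts."""
    if not 1 <= k <= len(gate_logits):
        raise ValueError("k must be between 1 and the number of experts")
    top = sorted(range(len(gate_logits)),
                 key=lambda i: gate_logits[i], reverse=True)[:k]
    peak = max(gate_logits[i] for i in top)  # subtract max for stability
    exps = {i: math.exp(gate_logits[i] - peak) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}

def moe_forward(x, experts, gate_logits, k=2):
    """Run only the selected experts and mix their outputs by gate weight."""
    out = None
    for i, w in top_k_route(gate_logits, k).items():
        y = experts[i](x)  # experts outside the top k are never called
        out = [w * v for v in y] if out is None else [o + w * v for o, v in zip(out, y)]
    return out

# Four hypothetical experts, each scaling the input by a different factor
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
output = moe_forward([1.0, 1.0], experts, gate_logits=[0.1, 2.0, 1.5, -3.0], k=2)
```

Only two of the four experts run per input here, so the per-token cost scales with k rather than with the total number of merged specialists.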

Reproducibility

Code Available: Yes
Data Available: No
Open Source Status: Partial
License: Unknown

Code URLs

mergekit (Goddard et al., 2024), Model Soups (Wortsman et al., 2022a), TIES-Merging (Yadav et al., 2023), LoRA / LoRAHub

Risks & Boundaries

Limitations

Requires architectural identity or adapter-based alternatives for cross-model merges.

Shared pretrained initialization is often necessary for reliable linear combinations.

When Not To Use

Source models trained from different random initializations without alignment tools.

Strongly conflicting task specializations without interference mitigation.

Failure Modes

Negative transfer: merged model performs worse than each constituent on its task.

Emergent unsafe behavior or amplified backdoors from constituent models.

Core Entities

Models

LLaMA, LLaMA-2, LLaMA-3, Mistral, Mixtral, Qwen, Qwen2, DeepSeek, Gemma, CLIP, WizardLM, Code Llama

Metrics

task retention ratio (TRR), geometric mean retention (R_geo), Accuracy, expected calibration error (ECE)
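This page names the retention metrics without defining them. The sketch below assumes the natural reading: TRR for a task is the merged model's score divided by the individually fine-tuned model's score, and R_geo is the geometric mean of the per-task TRRs. Both definitions, the function names, and the example scores are assumptions; check the survey for the exact formulas before reporting numbers.

```python
import math

def task_retention_ratio(merged_score, individual_score):
    """Assumed definition: fraction of the specialist's score that is kept."""
    if individual_score <= 0:
        raise ValueError("individual_score must be positive")
    return merged_score / individual_score

def geometric_mean_retention(ratios):
    """Assumed definition: geometric mean of per-task retention ratios."""
    if not ratios or any(r <= 0 for r in ratios):
        raise ValueError("ratios must be non-empty and positive")
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

# Hypothetical scores: (merged model, individually fine-tuned model)
scores = {"math": (0.72, 0.80), "code": (0.50, 0.50), "chat": (0.81, 1.00)}
ratios = [task_retention_ratio(m, s) for m, s in scores.values()]
r_geo = geometric_mean_retention(ratios)
```

A geometric mean penalizes a collapse on any single task more than an arithmetic mean would, which makes it a reasonable summary when one badly degraded task should not be hidden by strong results elsewhere.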

Datasets

GSM8K, MMLU, HumanEval, TruthfulQA, HellaSwag, XNLI, ARC, WinoGrande

Benchmarks

Open LLM Leaderboard, FusionBench (FusionBench suite), RewardBench