Train task-focused supervised fine-tuning and preference alignment in parallel, then sparsify and merge adapters to avoid alignment tax.

June 25, 2024 · 7 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.6

Citation Count

4

Authors

Shiva Kumar Pentyala, Zhichao Wang, Bin Bi, Kiran Ramnath, Xiang-Bo Mao, Regunathan Radhakrishnan, Sitaram Asur, Na Cheng

Why It Matters For Business

PAFT can preserve both task accuracy and alignment without retraining large models end-to-end; companies can run SFT and alignment in parallel, sparsify adapters, and merge them to ship stronger, aligned models faster.

Summary TLDR

PAFT trains supervised fine-tuning (SFT) and preference alignment (DPO/ORPO) in parallel on the same pre-trained model, makes the SFT adapter sparse via an L1 penalty, and then merges the two adapters into a single model. Sparsifying SFT adapters (over 90% sparsity reported) reduces parameter interference during merging and yields stronger merged models. On public benchmarks PAFT-ed models top the HuggingFace Open LLM Leaderboard for the tested size classes and improve AlpacaEval performance versus many baselines.

Problem Statement

Applying SFT and then preference alignment in sequence often incurs an 'alignment tax': the aligned model loses or degrades capabilities learned during SFT. The paper asks whether training SFT and alignment in parallel, plus sparsifying the SFT adapter, reduces that tax and yields a stronger merged model.

Main Contribution

Introduce PAFT: learn SFT and preference-alignment adapters in parallel on the same base model and fuse them by weight merging.

Show that SFT adapters are dense while alignment adapters are naturally sparse; add an L1 penalty during SFT to increase sparsity and reduce interference at merge time.

Empirically show PAFT plus sparse SFT adapters and appropriate merging (e.g., TIES) improves public leaderboard and AlpacaEval results.
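One way to write the L1-penalized SFT objective described above is sketched below. The notation is an assumption made for illustration: \theta_{\text{sft}} stands for the trainable SFT adapter weights, \mathcal{L}_{\text{SFT}} for the usual supervised loss, and \lambda for the penalty strength. Whether the penalty is applied to the individual LoRA factors or to their product is not stated in this summary.

```latex
\mathcal{L}_{\text{SFT-sparse}}(\theta_{\text{sft}})
  = \mathcal{L}_{\text{SFT}}(\theta_{\text{sft}})
  + \lambda \,\lVert \theta_{\text{sft}} \rVert_1,
\qquad
\lVert \theta_{\text{sft}} \rVert_1 = \sum_i \lvert \theta_{\text{sft},i} \rvert
```

A larger \lambda drives more adapter weights toward zero, at the risk of suppressing small but useful weights.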

Key Findings

Parallel training (PAFT) plus L1-sparsified SFT improves merged-model scores versus sequential or standalone training on the 6-task Open LLM suite.

Numbers: PAFT (SFT_sparse + DPO) avg=0.65243 vs DPO-alone 0.6333 (Mistral-7B)

Inducing sparsity in the SFT adapter greatly reduces merging interference and can yield large gains for some merge methods.

Numbers: TIES: PAFT 0.65243 vs Parallel SFT+DPO 0.58928 (Δ≈+0.0631)

L1 regularization can push SFT adapter sparsity to very high levels.

Numbers: SFT_sparse sparsity reported >90% (weight threshold 1e-5)

PAFT-ed larger models placed at or near the top of public leaderboards and AlpacaEval comparisons.

Numbers: PAFT Ein-70B avg=0.8129 on Open LLM Leaderboard; PAFT 70B AlpacaEval win-rate=26.5%
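The sparsity figure in the findings is a fraction of near-zero weights at a threshold of 1e-5. A minimal sketch of that measurement follows. The function name, the strict `<` comparison, and the toy adapter values are assumptions for illustration, not taken from the paper.

```python
def adapter_sparsity(weights, threshold=1e-5):
    """Return the fraction of entries whose magnitude is below `threshold`.

    `weights` is a list of rows (a 2-D adapter delta as nested lists).
    Using a strict `<` comparison is an assumption of this sketch.
    """
    flat = [w for row in weights for w in row]
    if not flat:
        raise ValueError("adapter has no weights")
    near_zero = sum(1 for w in flat if abs(w) < threshold)
    return near_zero / len(flat)


# Hypothetical 2x5 adapter delta; the values are made up for the example.
toy_delta = [
    [0.0, 3e-6, -2e-7, 0.0, 0.12],
    [1e-6, 0.0, -4e-6, 0.0, 9e-6],
]
print(adapter_sparsity(toy_delta))  # 9 of 10 entries fall below 1e-5
```

On a real adapter, the same count would run over every trainable adapter tensor rather than a single small matrix.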

Results

Avg score on 6-task Open LLM suite (Mistral-7B, TIES merge)

Value: 0.65243

Baseline: DPO-alone 0.6333

TIES merge gap: sparse vs non-sparse (Mistral-7B)

Value: PAFT 0.65243 vs Parallel SFT+DPO 0.58928

Baseline: Parallel SFT+DPO 0.58928

SFT adapter sparsity (with L1)

Value: >90% of weights near zero

Baseline: SFT without L1 (much lower sparsity)

Open LLM Leaderboard (70B) average

Value: PAFT (Ein-70B) avg=0.8129

Baseline: Mixtral-8x22B 0.7915

AlpacaEval pairwise WinRate (GPT-4 judge)

Value: PAFT 70B WinRate=26.5%

Baseline: GPT-4 (03/14) WinRate=22.1%
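Several of the results above depend on the TIES merge. It works in three steps: trim each task vector to its largest-magnitude entries, elect a sign per parameter, and average only the entries that agree with the elected sign. The sketch below shows those steps on flat Python lists. The function names, the `keep_fraction` default, and the toy vectors are assumptions for illustration, and this is not the authors' code.

```python
def sign(x):
    """Return -1, 0, or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)


def trim(vec, keep_fraction):
    """Keep the largest-magnitude entries of `vec` and zero out the rest.

    Entries tied with the cutoff magnitude are all kept.
    """
    k = max(1, round(len(vec) * keep_fraction))
    cutoff = sorted((abs(v) for v in vec), reverse=True)[k - 1]
    return [v if v != 0.0 and abs(v) >= cutoff else 0.0 for v in vec]


def ties_merge(task_vectors, keep_fraction=0.2, scale=1.0):
    """Merge task vectors (adapter deltas) with a TIES-style procedure."""
    trimmed = [trim(v, keep_fraction) for v in task_vectors]
    merged = []
    for column in zip(*trimmed):
        elected = sign(sum(column))  # sign carrying the larger total mass
        agreeing = [v for v in column if v != 0.0 and sign(v) == elected]
        merged.append(scale * sum(agreeing) / len(agreeing) if agreeing else 0.0)
    return merged


# Toy task vectors; values are made up for the example.
sft_delta = [0.5, -0.25, 0.0, 0.0]
dpo_delta = [0.25, 0.75, 0.0, -0.125]
print(ties_merge([sft_delta, dpo_delta], keep_fraction=0.5))
# → [0.375, 0.75, 0.0, 0.0]
```

In the second position the two adapters disagree in sign, so only the entry matching the elected sign survives. A sparse SFT adapter leaves fewer positions where such conflicts can occur, which is consistent with the interference argument in the findings.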

Who Should Care

What To Try In 7 Days

Train SFT and DPO adapters in parallel on your base model using LoRA.

Add a small L1 penalty (λ≈1e-4 or 1e-3) to the SFT loss to induce sparsity.

Experiment with simple merge methods (TIES, Task Arithmetic, or linear) and evaluate the merged model on your core metrics.
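Of the merge methods listed, Task Arithmetic is the simplest to prototype: add each adapter's weight delta to the base weights, scaled by a per-adapter coefficient. The sketch below uses flat lists. The function name, coefficients, and toy values are assumptions for illustration.

```python
def task_arithmetic_merge(base, deltas, coefficients):
    """Return base + sum_i coefficients[i] * deltas[i], element-wise."""
    if len(deltas) != len(coefficients):
        raise ValueError("need one coefficient per delta")
    merged = list(base)
    for delta, coef in zip(deltas, coefficients):
        if len(delta) != len(base):
            raise ValueError("delta and base must have the same length")
        for i, d in enumerate(delta):
            merged[i] += coef * d
    return merged


# Toy flattened weights; values are made up for the example.
base_weights = [1.0, -2.0, 0.5]
sft_delta = [0.5, 0.0, 0.0]
dpo_delta = [0.0, 0.25, -0.5]
print(task_arithmetic_merge(base_weights, [sft_delta, dpo_delta], [1.0, 1.0]))
# → [1.5, -1.75, 0.0]
```

A linear merge can be tried with the same helper by choosing coefficients that sum to 1, such as `[0.5, 0.5]`.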

Optimization Features

Infra Optimization

  • LoRA

Model Optimization

  • Merge sparse adapters into base weights
  • Use TIES/Task Arithmetic/SLERP merges

System Optimization

  • Avoid retraining full model by merging adapters

Training Optimization

  • SFT
  • LoRA

Inference Optimization

  • Merged single model for inference (no extra runtime adapters)
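Merging an adapter into the base weights is what removes the runtime adapter. Under the standard LoRA parameterization, the update to a weight matrix is (alpha / r) · B · A, so folding it in is a single matrix addition. The sketch below uses nested lists to stay dependency-free. The function names and toy matrices are assumptions for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    inner = len(b)
    return [
        [sum(row[k] * b[k][j] for k in range(inner)) for j in range(len(b[0]))]
        for row in a
    ]


def merge_lora(base, lora_b, lora_a, alpha, rank):
    """Fold a LoRA update into the base matrix: W + (alpha / rank) * B @ A."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(base, delta)
    ]


# Toy rank-1 adapter on a 2x2 weight matrix; values are made up.
base = [[1.0, 0.0], [0.0, 1.0]]
lora_b = [[1.0], [2.0]]   # shape (2, rank)
lora_a = [[0.5, -0.5]]    # shape (rank, 2)
print(merge_lora(base, lora_b, lora_a, alpha=2.0, rank=1))
# → [[2.0, -1.0], [2.0, -1.0]]
```

After this step the model can be served as ordinary dense weights, with no adapter applied at inference time.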

Reproducibility

Data Urls

  • UltraChat (Zephyr/UltraChat dataset referenced)
  • UltraFeedback (Zephyr/UltraFeedback dataset referenced)

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • No causal explanation for why DPO adapters are naturally sparse and SFT adapters are dense.
  • Scalability and operational workflow for iterative merges in production is underexplored.
  • SFT used only dialogue-type data (UltraChat), reducing format diversity and generality.
  • UltraFeedback uses GPT-4 labels, which may contain annotation errors.

When Not To Use

  • You cannot merge adapters reliably due to incompatible architectures or runtime constraints.
  • Your SFT data is not similar to dialogue or is highly out-of-domain relative to alignment data.
  • You lack capacity to evaluate merged models across the required suite of tasks.

Failure Modes

  • Merged model still suffers from parameter interference if SFT sparsity is insufficient.
  • Retraining the merged model can induce catastrophic forgetting of earlier traits.
  • High sparsity may remove useful tiny-weight signals if λ is mis-tuned.

Core Entities

Models

  • Mistral-7B
  • Llama-3-8B
  • Neurotic-7B
  • MoMo70B
  • Ein-70B
  • PAFT-ed 7B
  • PAFT-ed 70B

Metrics

  • Average over ARC/HellaSwag/MMLU/TruthfulQA/Winogrande/GSM8K
  • AlpacaEval pairwise win-rate vs GPT-4

Datasets

  • UltraChat
  • UltraFeedback

Benchmarks

  • HuggingFace Open LLM Leaderboard (6-task suite)
  • AlpacaEval