SWIFT: an open-source, one-stop framework to fine-tune, evaluate, quantize and deploy over 550 LLMs and 200+ MLLMs

August 10, 2024 · 9 min

Overview

Production Readiness

0.7

Novelty Score

0.4

Cost Impact Score

0.7

Citation Count

5

Authors

Yuze Zhao, Jintao Huang, Jinghan Hu, Xingjun Wang, Yunlin Mao, Daoze Zhang, Hong Zhang, Zeyinzi Jiang, Zhikai Wu, Baole Ai, Ang Wang, Wenmeng Zhou, Yingda Chen

Links

Abstract / PDF

Why It Matters For Business

SWIFT unifies fine-tuning, RL-style alignment, quantization and deployment for text and multimodal models. That reduces engineering overhead, accelerates experiments on agents and lets teams run many models and tuners without building custom glue code.

Summary TLDR

SWIFT is an open-source, end-to-end toolkit from ModelScope that unifies lightweight fine-tuning, reinforced fine-tuning (RLHF/GRPO), quantization, evaluation and deployment for text and multimodal foundation models. It supports hundreds of LLMs and MLLMs, many PEFT-style tuners, QLoRA-style quantized training, multi-backend inference (vLLM, LMDeploy, PyTorch), and agent-specific datasets and formats (ToolBench). Benchmarks in the paper show large memory and parameter savings from lightweight tuners (e.g., LoRA vs. full-parameter training) and consistent agent-task gains after fine-tuning (higher Act.EM and a lower hallucination rate on ToolBench).

Problem Statement

The tooling for fine-tuning and running large text and multimodal foundation models is fragmented: many model types, tuners, quantizers, evaluation tools and deployment backends exist, and they are hard to combine. Developers need a single, practical pipeline that covers lightweight training, RL-style fine-tuning, quantization, evaluation and deployment for both text and multimodal models.

Main Contribution

An open-source training and deployment framework (SWIFT) integrating PEFT tuners, quantization (QLoRA-style), RLHF/GRPO and multi-backend inference; supports 550+ LLMs and 200+ MLLMs.

Systematic support for multimodal fine-tuning and RLHF, including dataset templates, tuner compatibility, and model patchers to reduce integration friction.

A collection of implemented tuners, new optimizer integrations, export/merge utilities (LoRA merge, GPTQ/AWQ/BNB quantize), and a Web UI that builds and runs standard commands.

Benchmarks and ablations: (a) tuner memory/speed/loss profiles on qwen-7b-chat; (b) agent training results on ToolBench showing Act.EM, Plan.EM and hallucination metric improvements.
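The LoRA merge utility listed among the contributions folds the low-rank update back into the base weight. The following is a minimal sketch of the standard LoRA merge formula, W' = W + (alpha / r) * B * A, written with plain Python lists. The function names `matmul` and `merge_lora` are hypothetical and are not SWIFT's API.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    inner = len(b)
    cols = len(b[0])
    return [
        [sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for row in a
    ]


def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Return weight + (alpha / rank) * (lora_b @ lora_a).

    weight: d_out x d_in, lora_b: d_out x rank, lora_a: rank x d_in.
    """
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy example: a 2x2 base weight with a rank-1 adapter.
base = [[1.0, 0.0], [0.0, 1.0]]
lora_b = [[1.0], [2.0]]    # d_out x rank
lora_a = [[0.5, 0.0]]      # rank x d_in
merged = merge_lora(base, lora_a, lora_b, alpha=2.0, rank=1)
print(merged)
```

After merging, the adapter matrices are no longer needed at inference time, which is why a merge step usually precedes quantized export.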

Key Findings

SWIFT already covers a broad range of models and datasets.

Numbers: 550+ LLMs, 200+ MLLMs, ~150+ datasets (as claimed in the paper)

LoRA-style tuners cut trainable parameter size dramatically vs full training.

Numbers: LoRA trainable params 17.89M vs Full 7721.32M (0.23% vs 100%)

Memory usage can fall substantially using lightweight tuners.

Numbers: Full 73.53 GiB -> LISA 31.11 GiB or Q-GaLore 41.53 GiB (examples)

Some tuners trade speed for lower eval loss; LISA is fastest here.

Numbers: Speed LISA 2.66 samples/s vs Full 1.43 samples/s; Eval loss LISA 1.06 vs LoRA+ 0.98 (lower is better)
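The memory and throughput figures above can be restated as relative savings. This short check uses only the numbers quoted in this summary; the helper name `memory_saving_pct` is illustrative.

```python
# Figures quoted above for the qwen-7b-chat tuner benchmark.
FULL_MEMORY_GIB = 73.53
LISA_MEMORY_GIB = 31.11
QGALORE_MEMORY_GIB = 41.53
FULL_SPEED = 1.43  # samples/s
LISA_SPEED = 2.66  # samples/s


def memory_saving_pct(baseline_gib: float, tuned_gib: float) -> float:
    """Percentage of baseline memory saved by the lighter method."""
    return 100.0 * (baseline_gib - tuned_gib) / baseline_gib


lisa_saving = memory_saving_pct(FULL_MEMORY_GIB, LISA_MEMORY_GIB)
qgalore_saving = memory_saving_pct(FULL_MEMORY_GIB, QGALORE_MEMORY_GIB)
lisa_speedup = LISA_SPEED / FULL_SPEED

print(f"LISA saves {lisa_saving:.1f}% memory")        # 57.7%
print(f"Q-GaLore saves {qgalore_saving:.1f}% memory")  # 43.5%
print(f"LISA throughput is {lisa_speedup:.2f}x Full")  # 1.86x
```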

Agent fine-tuning on ToolBench improves action accuracy and reduces hallucination.

Numbers: Qwen2-7b-instruct Act.EM Original 54.74 -> Full 60.01 (+5.27 abs, +9.6% rel); hallucination rate 4.16% -> 2.58%
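The absolute and relative changes quoted above follow from the before and after scores. The sketch below reproduces them and applies the same arithmetic to the hallucination rate; the function name `abs_and_rel_change` is illustrative.

```python
def abs_and_rel_change(before: float, after: float) -> tuple[float, float]:
    """Return (absolute change, relative change in percent)."""
    absolute = after - before
    relative_pct = 100.0 * absolute / before
    return absolute, relative_pct


# Qwen2-7b-instruct on ToolBench, figures from the summary above.
act_abs, act_rel = abs_and_rel_change(54.74, 60.01)
hallu_abs, hallu_rel = abs_and_rel_change(4.16, 2.58)

print(f"Act.EM: {act_abs:+.2f} points ({act_rel:+.1f}% relative)")
print(f"Hallucination rate: {hallu_abs:+.2f} points ({hallu_rel:+.1f}% relative)")
```

A drop from 4.16% to 2.58% is a small change in absolute points but a reduction of roughly 38% relative to the starting rate, which is worth stating explicitly when comparing runs.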

Loss-scale weighting for agent tokens improved multiple metrics in ablation.

Numbers: LLaMA3 LoRA Act.EM in-domain 55.71 -> 58.15 with loss-scale (+2.44 abs)
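One way to picture loss-scale weighting is as a weighted mean of per-token negative log-likelihoods, where tokens that belong to the agent's action receive a larger weight. The sketch below is an illustration under assumptions: the weight values, the normalization by the weight sum, and the function name `weighted_token_loss` are all hypothetical and are not taken from SWIFT.

```python
import math


def weighted_token_loss(token_log_probs, loss_scales):
    """Weighted mean negative log-likelihood over response tokens.

    token_log_probs: log-probability the model assigned to each target token.
    loss_scales: per-token weights (assumed here: >1 emphasizes a token).
    """
    if len(token_log_probs) != len(loss_scales):
        raise ValueError("log-probs and loss scales must have equal length")
    total_weight = sum(loss_scales)
    if total_weight == 0:
        return 0.0
    weighted_nll = sum(-lp * w for lp, w in zip(token_log_probs, loss_scales))
    return weighted_nll / total_weight


# Three target tokens; the middle one stands in for an action-name token
# that the model predicts with low confidence.
log_probs = [math.log(0.9), math.log(0.5), math.log(0.8)]
uniform = weighted_token_loss(log_probs, [1.0, 1.0, 1.0])
emphasized = weighted_token_loss(log_probs, [1.0, 2.0, 1.0])
print(f"uniform={uniform:.4f} emphasized={emphasized:.4f}")
```

Raising the weight on the poorly predicted action token increases its share of the loss, so the gradient pushes harder on exactly the tokens that decide which tool gets called.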

Results

Trainable params

Value: LoRA 17.89M; Full 7721.32M

Baseline: Full 7721.32M

Memory (GiB)

Value: Full 73.53 GiB; LISA 31.11 GiB; Q-GaLore 41.53 GiB

Baseline: Full 73.53 GiB

Throughput (samples/s)

Value: LISA 2.66; Full 1.43

Baseline: Full 1.43

Agent Act.EM (Qwen2-7b)

Value: Original 54.74 -> Full 60.01

Baseline: Original 54.74

Agent Act.EM (LLaMA3)

Value: Original 57.67 -> LoRA 58.91 -> Full 60.14

Baseline: Original 57.67

Hallucination rate (Qwen2-7b)

Value: Original 4.16% -> LoRA 0.9% -> Full 2.58%

Baseline: Original 4.16%

Who Should Care

What To Try In 7 Days

Install SWIFT and run a quick 'swift sample' on a supported model to inspect sampling and templates.

Run a QLoRA fine-tune (LoRA on a quantized base model) on a small domain dataset to test memory use and the merge/quantize flow.

Evaluate a LoRA-tuned checkpoint on ToolBench or a small agent dataset to measure changes in Act.EM and hallucination rate.
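To make the evaluation step concrete, the sketch below scores predicted tool calls against references. The metric definitions are assumptions made for illustration: Act.EM is taken as the share of steps where the predicted action name exactly matches the reference, and the hallucination rate as the share of predicted actions that are not in the list of offered tools. The paper's exact definitions may differ, and both function names are hypothetical.

```python
def act_exact_match(predicted, reference):
    """Share of steps where the predicted action name equals the reference."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have equal length")
    if not reference:
        return 0.0
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / len(reference)


def hallucination_rate(predicted, available_tools):
    """Share of predicted actions that name a tool that was never offered."""
    if not predicted:
        return 0.0
    invented = sum(p not in available_tools for p in predicted)
    return invented / len(predicted)


tools = {"search", "weather", "translate", "calc"}
reference = ["search", "weather", "translate", "calc"]
predicted = ["search", "weather", "translate_v2", "search"]

print(f"Act.EM = {act_exact_match(predicted, reference):.2f}")
print(f"Hallucination rate = {hallucination_rate(predicted, tools):.2f}")
```

Running the same script on outputs from the base model and from the tuned checkpoint gives a before and after pair comparable to the ones reported in Results.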

Agent Features

Memory

  • LoRA
  • gradient-sharding (DeepSpeed/FSDP)

Planning

  • GRPO
  • DPO
  • ORPO
  • KTO
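Of the methods listed, GRPO is commonly described as scoring each sampled completion relative to the other completions drawn for the same prompt, rather than against a learned value function. Below is a minimal sketch of that group-relative normalization. The use of the population standard deviation, the epsilon value, and the function name are assumptions made here and are not taken from SWIFT.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group of sampled completions.

    Returns (r_i - mean) / (std + eps) for each reward r_i.
    """
    if not rewards:
        return []
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std: an assumption here
    return [(r - mean) / (std + eps) for r in rewards]


# Four completions for one prompt, rewarded 1.0 if correct and 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print([round(a, 3) for a in advantages])
```

Completions that beat the group average receive a positive advantage and the rest a negative one. When every completion in a group earns the same reward, all advantages are zero and that group contributes no learning signal.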

Tool Use

  • ToolBench format
  • ReAct format
  • function calling / tools field
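For orientation, ReAct-style outputs are often written as a `Thought:` line followed by `Action:` and `Action Input:` lines. The sketch below parses one such step into a tool name and a JSON argument dictionary. The exact field labels, the JSON argument convention, and the function name `parse_react_step` are assumptions for illustration and are not SWIFT's template definitions.

```python
import json
import re

# Matches "Action: <name>" followed by "Action Input: {...json...}".
_ACTION_RE = re.compile(
    r"Action:\s*(?P<name>.+?)\s*\nAction Input:\s*(?P<args>\{.*\})",
    re.DOTALL,
)


def parse_react_step(text):
    """Return (tool_name, arguments) or None if the step is malformed."""
    match = _ACTION_RE.search(text)
    if match is None:
        return None
    try:
        arguments = json.loads(match.group("args"))
    except json.JSONDecodeError:
        return None
    return match.group("name"), arguments


step = (
    "Thought: I need the forecast.\n"
    "Action: get_weather\n"
    'Action Input: {"city": "Hangzhou"}'
)
print(parse_react_step(step))
```

Returning `None` for malformed steps gives a downstream evaluator an explicit signal to count the step as a failed action instead of raising.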

Frameworks

  • TRL
  • vLLM
  • FastAPI

Is Agentic

true

Architectures

  • tool-augmented LLM
  • multimodal LLM
  • decoder-only transformer

Collaboration

  • multi-round rollouts (actor/collector placement)
  • replay buffer and colocate actor modes

Optimization Features

Token Efficiency

  • loss-scale weighting for agent tokens

Infra Optimization

  • Megatron checkpoint conversion and parallel pretraining support
  • support for single-node multi-GPU and multi-node multi-GPU

Model Optimization

  • LoRA
  • LLaMA-Pro block expansion
  • Mamba/SSM support

System Optimization

  • DeepSpeed ZeRO / FSDP integration
  • offloading inactive tuners to CPU/meta devices

Training Optimization

  • LoRA
  • GaLore / Q-GaLore gradient low-rank projection
  • LISA layerwise sampling
  • sequence parallelism and gradient checkpointing
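Among the training optimizations listed, LISA's layerwise sampling can be pictured as periodically choosing a small random subset of transformer layers to update while the remaining layers stay frozen. The sketch below only builds the sampling schedule and does not touch a model. The function names, the resampling period, and the layer counts are hypothetical, and details of the published method (such as which non-sampled modules always remain trainable) are omitted.

```python
import random


def sample_active_layers(num_layers, num_active, rng):
    """Pick which layer indices are trainable for the next period."""
    return sorted(rng.sample(range(num_layers), num_active))


def lisa_schedule(num_layers, num_active, total_steps, resample_every, seed=0):
    """Return, for every training step, the list of unfrozen layer indices."""
    rng = random.Random(seed)
    active = []
    schedule = []
    for step in range(total_steps):
        # Draw a fresh subset at the start of each period.
        if step % resample_every == 0:
            active = sample_active_layers(num_layers, num_active, rng)
        schedule.append(active)
    return schedule


schedule = lisa_schedule(
    num_layers=32, num_active=2, total_steps=6, resample_every=3
)
for step, layers in enumerate(schedule):
    print(step, layers)
```

Because only a few layers hold gradients and optimizer state at any time, peak memory falls well below that of full-parameter training, which is consistent with the LISA memory figure reported in Key Findings.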

Inference Optimization

  • vLLM backend support
  • LMDeploy integration
  • LoRA

Reproducibility

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Incomplete Megatron large-scale parallel coverage; pretraining support not full for all major LLMs.
  • Training enhancements for RAG systems are not yet supported (listed as planned future work).
  • Quantization and tuner compatibility may still require model-specific patching or fixes.
  • Multimodal dataset coverage and in-depth multimodal agent research are limited and listed as future work.

When Not To Use

  • If you need production-grade Megatron pretraining workflows across mainstream billion-parameter models today.
  • If your pipeline requires built-in RAG training integrations right now.
  • If you rely exclusively on closed-source models with no checkpoint export for PEFT/LoRA workflows.

Failure Modes

  • Model-specific dtype/patching issues during load that require template/patcher adjustments.
  • Quantization methods may not generalize across all model families, causing accuracy drops.
  • Mixed tuners or combined optimizers could interact poorly without careful hyper-parameter tuning.
  • Loss-scale weighting can over-emphasize tokens if misconfigured, harming generalization.

Core Entities

Models

  • Qwen-7B
  • Qwen2-7B-instruct
  • qwen-7b-chat
  • Qwen2.5-VL
  • LLaMA3-8b-instruct
  • LLaMA series
  • Mamba
  • Megatron
  • LLaVA
  • Gemma
  • InternVL

Metrics

  • Train loss
  • Eval loss
  • Memory (GiB)
  • Samples/s
  • Trainable params (M)
  • Plan.EM
  • Act.EM
  • Hallu Rate
  • Avg.F1
  • ROUGE
  • BLEU

Datasets

  • MSAgent
  • MSAgent-Pro
  • ToolBench
  • AgentFlan
  • alpaca-en
  • firefly-train-1.1M
  • open-r1/verifiable-coding-problems-python-10k

Benchmarks

  • ToolBench
  • Lightweight tuner benchmark (qwen-7b-chat)
  • EvalScope/OpenCompass evaluation sets

Context Entities

Models

  • GPT-4
  • GPT family
  • LLaMA-Pro

Metrics

  • Pass@K
  • margin
  • logps

Datasets

  • CEval
  • gsm8k
  • MMLU
  • COCO_VAL

Benchmarks

  • Pass@K (mentioned)
  • ToolBench leaderboard