AdaLink: non-intrusive input adapters that match full fine-tuning on many multimodal tasks

October 18, 2023 · 7 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.8

Citation Count

5

Authors

Yaqing Wang, Jialin Wu, Tanmaya Dabral, Jiageng Zhang, Geoff Brown, Chun-Ta Lu, Frederick Liu, Yi Liang, Bo Pang, Michael Bendersky, Radu Soricut

Links

Abstract / PDF

Why It Matters For Business

AdaLink cuts adaptation cost and serving complexity by tuning tiny adapters instead of full models, letting teams deploy many task-specific behaviors without copying huge models.

Summary TLDR

This paper introduces AdaLink, a lightweight, input-focused adapter placed between embeddings and transformer blocks. AdaLink tunes only a tiny fraction of parameters (e.g., ~1.05M tuned vs 32B) and keeps the core model frozen. On large multimodal and NLU checkpoints (PaLI-X, T5/FLAN), AdaLink often matches or closely approaches full fine-tuning and beats prompt tuning. Its compute cost scales linearly with embedding size, and it supports per-task and per-modality adapters for safer serving.
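To make the "tiny fraction" concrete, here is a rough back-of-the-envelope sketch. It assumes the adapter is a down-projection plus an up-projection with no biases. The embedding width (4096) and rank (128) are illustrative values not stated in this summary, chosen only because they land near the quoted ~1.05M. `adapter_params` is a hypothetical helper.

```python
def adapter_params(embed_dim: int, rank: int, with_bias: bool = False) -> int:
    """Weights in a down-projection (d x r) plus an up-projection (r x d)."""
    weights = 2 * embed_dim * rank
    biases = (rank + embed_dim) if with_bias else 0
    return weights + biases


# Assumed, illustrative sizes: the summary does not state d or r.
EMBED_DIM = 4096
RANK = 128
BASE_PARAMS = 32_000_000_000  # the "32B" figure quoted above

tuned = adapter_params(EMBED_DIM, RANK)
fraction = tuned / BASE_PARAMS

print(f"adapter params: {tuned:,}")  # adapter params: 1,048,576
print(f"share of base model: {fraction:.6%}")

# At a fixed rank, doubling the embedding width doubles the adapter size.
assert adapter_params(2 * EMBED_DIM, RANK) == 2 * tuned
```

Under these assumptions the adapter is a few thousandths of a percent of the base model, and its size grows linearly in the embedding width rather than quadratically.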

Problem Statement

Fine-tuning huge LLMs and VLMs per task is expensive and hard to serve. Intrusive PEFT methods alter model internals and complicate deployment. Non-intrusive methods like prompt tuning are easy to serve but often underperform or are unstable. The paper asks: can a non-intrusive, input-centric adapter come close to the quality of full fine-tuning while keeping serving simple?

Main Contribution

AdaLink: a non-intrusive adapter placed after embeddings and before transformer blocks (two-layer low-rank MLP).

Showed AdaLink matches or nearly matches full-model fine-tuning on multimodal captioning and VQA when using large or instruction-tuned bases.

Showed AdaLink outperforms prompt tuning within non-intrusive methods and is stable across adapter ranks.

Demonstrated modality-specific adapters reduce interference and keep serving/config complexity low.
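The summary describes the adapter only as a two-layer low-rank MLP between the embeddings and the transformer blocks. The following is a minimal sketch under assumptions not stated here: a residual connection, a ReLU between the two projections, and a zero-initialized up-projection so the untrained adapter acts as the identity. `InputAdapter`, `matmul`, and all sizes are hypothetical and written in plain Python rather than any specific framework.

```python
import random

Matrix = list[list[float]]  # each row is one token embedding


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product, adequate for toy sizes."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


class InputAdapter:
    """Hypothetical low-rank input adapter: x + relu(x @ W_down) @ W_up."""

    def __init__(self, embed_dim: int, rank: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        # Down-projection (embed_dim x rank), small random init (assumption).
        self.w_down = [[rng.gauss(0.0, 0.02) for _ in range(rank)]
                       for _ in range(embed_dim)]
        # Up-projection (rank x embed_dim), zero init (assumption) so the
        # untrained adapter leaves the embeddings unchanged.
        self.w_up = [[0.0] * embed_dim for _ in range(rank)]

    def __call__(self, embeddings: Matrix) -> Matrix:
        projected = matmul(embeddings, self.w_down)
        hidden = [[max(0.0, h) for h in row] for row in projected]
        delta = matmul(hidden, self.w_up)
        return [[e + d for e, d in zip(e_row, d_row)]
                for e_row, d_row in zip(embeddings, delta)]


# One adapter per modality; the frozen model only ever sees `sequence`.
image_adapter = InputAdapter(embed_dim=4, rank=2, seed=1)
text_adapter = InputAdapter(embed_dim=4, rank=2, seed=2)

image_tokens = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]]  # e.g. from a frozen ViT
text_tokens = [[1.0, 0.0, -1.0, 0.5]]  # e.g. from the token embedding table

sequence = image_adapter(image_tokens) + text_adapter(text_tokens)
assert sequence == image_tokens + text_tokens  # identity before any training
```

Because each modality has its own weights, updating the image adapter cannot change how text embeddings are transformed, which is one way to read the interference claim above.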

Key Findings

AdaLink reaches near full fine-tuning on COCO captioning with instruction-tuned base.

Numbers: CIDEr, AdaLink 146.3 vs FT 147.0 (δ -0.65)

AdaLink outperforms prompt tuning on image captioning.

Numbers: avg ~+2 CIDEr vs prompt tuning (Table 1)

Instruction-tuned base models shrink the gap to full fine-tuning.

Numbers: VQA avg δ to FT, -0.05 (MMIT) vs -2.58 (raw)

AdaLink gives competitive NLU performance while tuning very few parameters.

Numbers: GLUE avg ≈90.6 vs FT ≈90.7; adapter params 0.008M–2M

Results

CIDEr (COCO captioning, instruction-tuned base)

Value: 146.3

Baseline: FT 147.0

CIDEr

Value: 146.2

Baseline: FT 147.4

Avg δ to FT (VQA, MMIT base)

Value: -0.05

Baseline: FT

Avg δ to FT (VQA, raw base)

Value: -2.58

Baseline: FT

GLUE avg

Value: ≈90.6

Baseline: FT ≈90.7

Who Should Care

What To Try In 7 Days

Attach modality-specific AdaLink modules to your base model and run a small validation on a representative multimodal task.

If available, start from an instruction-tuned checkpoint (FLAN/MMIT) to shrink the performance gap quickly.

Replace existing prompt-tuning experiments with AdaLink and compare CIDEr/accuracy on a small holdout.

Optimization Features

Token Efficiency

  • Does not add soft prompt tokens

Infra Optimization

  • Smaller tunable-weights footprint than intrusive PEFT

Model Optimization

  • Parameter-efficient fine-tuning
  • Low-rank adapters

System Optimization

  • Configurable serving as intermediate unit or embedding transform
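One possible reading of the two serving modes, sketched under assumptions: if the adapter acts on each token embedding independently, then for discrete text tokens it can either run as a module after the embedding lookup or be applied once to the whole embedding table offline. The `adapt` function, its residual ReLU form, and all weights below are hypothetical.

```python
def adapt(vec, w_down, w_up):
    """Residual low-rank transform of one embedding vector (assumed form)."""
    hidden = [max(0.0, sum(v * w for v, w in zip(vec, col))) for col in zip(*w_down)]
    delta = [sum(h * w for h, w in zip(hidden, col)) for col in zip(*w_up)]
    return [v + d for v, d in zip(vec, delta)]


# Toy frozen embedding table (vocabulary of 3, width 2) and adapter weights.
embedding_table = [[1.0, 2.0], [0.5, -1.0], [-2.0, 0.0]]
w_down = [[1.0], [1.0]]  # 2 x 1
w_up = [[0.5, -0.5]]     # 1 x 2
token_ids = [2, 0, 0, 1]

# Option 1: keep the adapter as an intermediate unit after the lookup.
as_module = [adapt(embedding_table[t], w_down, w_up) for t in token_ids]

# Option 2: transform the table once offline, then serve a plain lookup.
adapted_table = [adapt(row, w_down, w_up) for row in embedding_table]
as_table = [adapted_table[t] for t in token_ids]

assert as_module == as_table
```

The trade-off in this sketch is storage versus runtime work: option 2 removes the adapter from the request path but stores a full table per task, while option 1 stores only the small adapter. Continuous inputs such as image embeddings have no table to precompute, so they would need option 1.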

Training Optimization

  • Instruction-tuning reduces adapter size needs
  • Per-modality adapters to isolate interference

Inference Optimization

  • No change to transformer internals (simpler serving)
  • Computational cost linear in embedding dim

Reproducibility

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Less expressive than intrusive PEFT or full fine-tuning; small residual gaps remain on some raw checkpoints.
  • Performance improves markedly with instruction-tuned bases; raw bases can show larger gaps (e.g., up to ~2.6 points on some VQA metrics).
  • Adds adapter parameters and storage per task/modality (trade-off vs copying whole model).
  • Does not remove the need to freeze or tune upstream encoders (the paper froze the ViT in its experiments).

When Not To Use

  • When the absolute best metric is required and you can modify internals (prefer full fine-tuning or intrusive PEFT).
  • When the base model is small or not instruction-tuned and you cannot afford an adapter search.
  • When the task requires changing internal attention or encoder weights (e.g., low-level vision encoder fixes).

Failure Modes

  • Underperformance on non-instruction-tuned 'raw' bases (observed avg gaps ≈2.3–2.6 points).
  • Insufficient adapter capacity harms linguistically hard tasks (a gap is noted on CoLA).
  • Misrouting or incorrect adapter selection at serving time could produce wrong behavior if adapters are not managed.
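The misrouting risk can be reduced by making adapter selection explicit and strict. The sketch below is one hedged way to do that, not something described in the paper: a registry keyed by task and modality that refuses duplicate registrations and raises on unknown keys instead of silently falling back. `AdapterKey`, `AdapterRegistry`, and the checkpoint names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AdapterKey:
    """Identifies one adapter; frozen so it is hashable and usable as a dict key."""
    task: str
    modality: str


class AdapterRegistry:
    """Maps (task, modality) to an adapter checkpoint, with no silent fallback."""

    def __init__(self) -> None:
        self._adapters: dict[AdapterKey, str] = {}

    def register(self, task: str, modality: str, checkpoint: str) -> None:
        key = AdapterKey(task, modality)
        if key in self._adapters:
            raise ValueError(f"adapter already registered for {key}")
        self._adapters[key] = checkpoint

    def resolve(self, task: str, modality: str) -> str:
        key = AdapterKey(task, modality)
        try:
            return self._adapters[key]
        except KeyError:
            # Fail loudly: serving the wrong adapter is worse than rejecting.
            raise LookupError(f"no adapter registered for {key}") from None


registry = AdapterRegistry()
registry.register("captioning", "image", "captioning_image.ckpt")  # hypothetical names
registry.register("captioning", "text", "captioning_text.ckpt")
registry.register("vqa", "image", "vqa_image.ckpt")

print(registry.resolve("vqa", "image"))  # vqa_image.ckpt

try:
    registry.resolve("vqa", "text")  # never registered
except LookupError as err:
    print(f"rejected: {err}")
```

Raising on a missing key turns a quiet quality regression into a visible request failure, which is usually easier to detect and roll back.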

Core Entities

Models

  • PaLI-X
  • T5 (11B)
  • FLAN (11B)
  • ViT
  • AdaLink
  • LoRA
  • Prompt Tuning
  • Adapters

Metrics

  • CIDEr
  • Accuracy
  • ANLS
  • Pearson
  • Matthews correlation

Datasets

  • COCO
  • TextCaps
  • OK-VQA
  • DocVQA
  • TextVQA
  • ST-VQA
  • GLUE

Benchmarks

  • COCO Captions
  • TextCaps
  • OK-VQA
  • DocVQA
  • TextVQA
  • ST-VQA
  • GLUE

Context Entities

Models

  • LoRA
  • Adapters
  • Prompt Tuning
  • MMIT (multimodal instruction-tuned variant)

Datasets

  • Self-Instruct (used to create MMIT tasks)