Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.8
Citation Count
5
Why It Matters For Business
AdaLink cuts adaptation cost and serving complexity by tuning tiny adapters instead of full models, letting teams deploy many task-specific behaviors without copying huge models.
Summary TLDR
This paper introduces AdaLink, a lightweight, input-focused adapter placed between the embeddings and the transformer blocks. AdaLink tunes only a tiny fraction of the parameters (e.g., about 1.05M tuned parameters against a 32B-parameter base) and keeps the core model frozen. On large multimodal and NLU checkpoints (PaLI-X, T5/FLAN), AdaLink often matches or closely approaches full fine-tuning and beats prompt tuning. Its cost scales linearly with embedding size, and it supports separate per-task and per-modality adapters, which keeps serving simple.
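The parameter and cost claims in this summary can be checked with simple arithmetic. The sketch below assumes the adapter is a pair of bias-free projections of shape d x r and r x d; the exact shapes used in the paper are not given here. The values d = 4096 and r = 128 are hypothetical, picked only because they land near the ~1.05M figure quoted above.

```python
def adapter_params(embed_dim: int, rank: int) -> int:
    """Weights in a bias-free down (d x r) plus up (r x d) projection pair."""
    return 2 * embed_dim * rank

# Hypothetical sizes, not taken from the paper.
d, r = 4096, 128
tuned = adapter_params(d, r)      # 2 * 4096 * 128 = 1_048_576
base = 32_000_000_000             # the 32B base quoted in the summary
fraction = tuned / base

print(f"{tuned:,} tuned parameters, {fraction:.6%} of the base")

# At a fixed rank, doubling the embedding size doubles the adapter.
assert adapter_params(2 * d, r) == 2 * tuned
```

Under the same assumption, the per-token multiply-add count is also 2 * d * r, which is where the linear dependence on embedding size comes from.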
Problem Statement
Fine-tuning huge LLMs and VLMs per task is expensive and hard to serve. Intrusive PEFT methods alter internals and complicate deployment. Non-intrusive methods like prompt tuning are easy to serve but often underperform or are unstable. The paper asks: can a non-intrusive, input-centric adapter reach near full-finetune quality while keeping serving simple?
Main Contribution
AdaLink: a non-intrusive adapter placed after embeddings and before transformer blocks (two-layer low-rank MLP).
Showed AdaLink matches or nearly matches full-model fine-tuning on multimodal captioning and VQA when using large or instruction-tuned bases.
Showed AdaLink outperforms prompt tuning within non-intrusive methods and is stable across adapter ranks.
Demonstrated modality-specific adapters reduce interference and keep serving/config complexity low.
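The adapter described in the first item can be sketched in a few lines of plain Python. This is an illustration, not the authors' implementation: the residual connection, the zero-initialized up-projection, the absence of a nonlinearity, and the class name `AdaLinkSketch` are all assumptions made for the example.

```python
import random

def matvec(weight, vec):
    """Multiply a (rows x cols) nested-list matrix by a length-cols vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]

class AdaLinkSketch:
    """Low-rank transform applied to token embeddings before the frozen blocks."""

    def __init__(self, embed_dim, rank, seed=0):
        rng = random.Random(seed)
        # Down-projection (d -> r): small random init (assumption).
        self.down = [[rng.gauss(0.0, 0.02) for _ in range(embed_dim)]
                     for _ in range(rank)]
        # Up-projection (r -> d): zero init, so the adapter starts as the
        # identity and the frozen model's behavior is unchanged (assumption).
        self.up = [[0.0] * rank for _ in range(embed_dim)]

    def __call__(self, embeddings):
        adapted = []
        for token in embeddings:
            hidden = matvec(self.down, token)   # d -> r
            delta = matvec(self.up, hidden)     # r -> d
            # Residual connection (assumption): input plus low-rank update.
            adapted.append([x + dx for x, dx in zip(token, delta)])
        return adapted

embeddings = [[0.5, -1.0, 2.0, 0.0], [1.0, 1.0, 1.0, 1.0]]  # 2 tokens, d = 4
adapter = AdaLinkSketch(embed_dim=4, rank=2)
adapted = adapter(embeddings)  # would be fed to the frozen transformer blocks
```

In a real training stack this would be a framework module in which only `down` and `up` receive gradients; the embedding table and every transformer block stay frozen.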
Key Findings
AdaLink comes close to full fine-tuning on COCO captioning when starting from an instruction-tuned base.
AdaLink outperforms prompt tuning on image captioning.
Instruction-tuned base models shrink the gap to full fine-tuning.
AdaLink gives competitive NLU performance while tuning very few parameters.
Results
- CIDEr
- Avg δ to FT (VQA)
- GLUE avg
Who Should Care
What To Try In 7 Days
Attach modality-specific AdaLink modules to your base model and run a small validation on a representative multimodal task.
If available, start from an instruction-tuned checkpoint (FLAN/MMIT) to shrink the performance gap quickly.
Replace existing prompt-tuning experiments with AdaLink and compare CIDEr/accuracy on a small holdout.
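The comparison in the last step can be reduced to an average gap against a reference run. The sketch below assumes that "Avg δ to FT" means the mean per-task difference (method minus full fine-tuning). The helper name `avg_delta`, the task names, and every score are placeholders for illustration, not results from the paper.

```python
from statistics import mean

def avg_delta(method_scores, reference_scores):
    """Mean per-task difference (method minus reference) over shared tasks."""
    shared = sorted(set(method_scores) & set(reference_scores))
    if not shared:
        raise ValueError("no tasks in common")
    return mean(method_scores[t] - reference_scores[t] for t in shared)

# Placeholder holdout scores, for illustration only.
reference = {"task_a": 80.0, "task_b": 60.0}
adalink = {"task_a": 79.0, "task_b": 59.5}
prompt_tuning = {"task_a": 76.0, "task_b": 57.0}

print("adalink vs reference:", avg_delta(adalink, reference))
print("prompt tuning vs reference:", avg_delta(prompt_tuning, reference))
```

A negative value means the method trails the reference on average. Keeping the holdout and the reference run fixed across methods is what makes the two numbers comparable.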
Optimization Features
Token Efficiency
- Does not add soft prompt tokens
Infra Optimization
- Smaller tunable-weights footprint than intrusive PEFT
Model Optimization
- Parameter-efficient fine-tuning
- Low-rank adapters
System Optimization
- Can be served either as an intermediate unit or as an embedding transform
Training Optimization
- Starting from an instruction-tuned base reduces the adapter capacity needed
- Per-modality adapters to isolate interference
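Per-modality isolation can be pictured as routing each token's embedding through the adapter that matches its modality tag, so image and text updates never share weights. This is a minimal sketch under that assumption; the function name, the tag strings, and the stand-in adapters (simple arithmetic, not real low-rank modules) are hypothetical.

```python
def apply_per_modality(embeddings, modalities, adapters):
    """Route each token embedding through the adapter for its modality tag."""
    if len(embeddings) != len(modalities):
        raise ValueError("one modality tag is needed per token")
    return [adapters[tag](token) for token, tag in zip(embeddings, modalities)]

# Stand-in adapters (hypothetical): trivial transforms so the routing is visible.
adapters = {
    "image": lambda token: [2.0 * x for x in token],
    "text": lambda token: [x + 1.0 for x in token],
}

tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
tags = ["image", "image", "text"]
mixed = apply_per_modality(tokens, tags, adapters)
# mixed == [[2.0, 4.0], [6.0, 8.0], [6.0, 7.0]]
```

An unknown tag raises `KeyError` here instead of falling back to some default adapter, which keeps a mislabeled token from being transformed silently.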
Inference Optimization
- No change to transformer internals (simpler serving)
- Computational cost linear in embedding dim
Reproducibility
Open Source Status
- unknown
Risks & Boundaries
Limitations
- Less expressive than intrusive PEFT or full fine-tuning; small residual gaps remain on some raw checkpoints.
- Performance improves markedly with instruction-tuned bases; raw bases can show larger gaps (e.g., up to ~2.6 points on some VQA metrics).
- Adds adapter parameters and storage per task/modality (trade-off vs copying whole model).
- Does not settle how to handle upstream encoders, which must still be frozen or tuned separately (the paper froze the ViT in its experiments).
When Not To Use
- When the absolute best possible metric is required and you can modify internals (prefer full fine-tuning or intrusive PEFT).
- When the base model is small or not instruction-tuned and you cannot afford an adapter search.
- When the task requires changing internal attention or encoder weights (e.g., low-level vision encoder fixes).
Failure Modes
- Underperformance on non-instruction-tuned 'raw' bases (observed avg gaps ≈2.3–2.6 points).
- Insufficient adapter capacity hurts linguistically hard tasks (the paper notes a gap on CoLA).
- Routing a request to the wrong adapter at serving time produces wrong behavior unless adapter selection is managed explicitly.
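One way to guard against the misrouting failure mode is to make adapter selection explicit and have it fail loudly. The registry below is a hypothetical sketch, not part of the paper: the class name, the (task, modality) key scheme, and the string stand-ins for adapter objects are all assumptions.

```python
class AdapterRegistry:
    """Maps (task, modality) keys to adapters and refuses silent fallbacks."""

    def __init__(self):
        self._adapters = {}

    def register(self, task, modality, adapter):
        key = (task, modality)
        if key in self._adapters:
            raise ValueError(f"adapter already registered for {key}")
        self._adapters[key] = adapter

    def resolve(self, task, modality):
        try:
            return self._adapters[(task, modality)]
        except KeyError:
            raise LookupError(
                f"no adapter for task={task!r}, modality={modality!r}"
            ) from None

registry = AdapterRegistry()
# Strings stand in for real adapter objects in this sketch.
registry.register("captioning", "image", "captioning-image-adapter")
registry.register("captioning", "text", "captioning-text-adapter")

selected = registry.resolve("captioning", "image")
```

Rejecting duplicate registrations and unknown keys trades some convenience for safety: a request either gets exactly the adapter it named or is rejected before it reaches the frozen model.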
Core Entities
Models
- PaLI-X
- T5 (11B)
- FLAN (11B)
- ViT
- AdaLink
- LoRA
- Prompt Tuning
- Adapters
Metrics
- CIDEr
- Accuracy
- ANLS
- Pearson
- Matthews_corr
Datasets
- COCO
- TextCaps
- OK-VQA
- DocVQA
- TextVQA
- ST-VQA
- GLUE
Benchmarks
- COCO Captions
- TextCaps
- OK-VQA
- DocVQA
- TextVQA
- ST-VQA
- GLUE
Context Entities
Models
- LoRA
- Adapters
- Prompt Tuning
- MMIT (multimodal instruction-tuned variant)
Datasets
- Self-Instruct (used to create MMIT tasks)

