Skip 25–30% of expensive FFN blocks to speed decoding while keeping knowledge accuracy

April 5, 2024 · 7 min

Overview

Decision Snapshot: Ready For Pilot

The method is technically simple and plugs into decoding: monitor per-token cosine similarity across each FFN block, run the first tokens without skipping as a warm-up, then greedily skip FFNs once similarity is high. Evidence shows a small quality loss at ~25% skip on the tested tasks; calibration is needed per model and workload.

Citations: 0

Evidence Strength: 0.78

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 3/4

Findings with evidence refs: 4/4

Results with explicit delta: 3/3

Reproducibility

Status: Partial assets available

Open source: Unknown

At A Glance

Cost impact: 75%

Production readiness: 70%

Novelty: 65%

Authors

Ajay Jaiswal, Bodun Hu, Lu Yin, Yeonju Ro, Shiwei Liu, Tianlong Chen, Aditya Akella

Why It Matters For Business

FFN-SkipLLM can cut a sizeable fraction of per-token compute (targeting the heaviest FFN blocks) while preserving accuracy on knowledge and conversational tasks, letting teams lower inference cost and latency without rewriting KV-cache logic.

Summary TLDR

FFN-SkipLLM is a simple, input-adaptive decoding trick: measure cosine similarity before vs after each feed‑forward (FFN) block and skip FFNs once similarity saturates. Skipping only FFNs avoids Key-Value (KV) cache problems that hurt layer-level skipping. On LLaMa variants and knowledge tasks (Factoid-QA, MT-Bench, CNN/DailyMail summarization) the method can skip about 25–30% of FFN blocks with only small task-specific drops (often <1 percentage point) and fewer hallucinations than prior layer-skipping methods. The method needs a short warm-up (first tokens) and calibration to avoid 'cold' regions; performance degrades at ≥35% skipping.

Problem Statement

Layer-level early exit or layer dropping speeds up token-by-token decoding but breaks KV caches and often causes hallucination or repeated tokens. FFN blocks contain around two thirds of each layer's parameters and may be redundant even when attention is needed. The paper asks: can we skip FFN blocks adaptively to cut compute while avoiding KV-cache damage and preserving factual generation?

Main Contribution

Introduce FFN-SkipLLM: an input-adaptive policy that skips FFN blocks using a cosine-similarity test between a token's representation entering and exiting an FFN block.

Show that skipping only FFNs circumvents the KV cache issues tied to whole-layer skipping.
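
A minimal sketch of the policy described above, assuming "greedy" means that once the similarity test passes, every remaining FFN for that token is skipped. The layer callables, the `threshold` value and the toy numbers are hypothetical stand-ins, not the paper's code, and cold regions and warm-up are left out for brevity. Attention runs in every layer, so each layer's KV cache still receives an entry for the token.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def forward_token(hidden, layers, threshold, kv_cache):
    """Run one token through all layers; stop calling FFNs once similarity is high.

    layers: list of (attention_fn, ffn_fn) pairs (hypothetical stand-ins).
    ffn_fn is assumed to return the post-FFN hidden state, residual included.
    """
    skipping = False
    ffn_calls = 0
    for idx, (attention_fn, ffn_fn) in enumerate(layers):
        # Attention is never skipped, so the KV cache stays complete.
        hidden = attention_fn(hidden, kv_cache[idx])
        if skipping:
            continue
        out = ffn_fn(hidden)
        ffn_calls += 1
        if cosine_similarity(hidden, out) >= threshold:
            skipping = True  # greedy: skip every later FFN for this token
        hidden = out
    return hidden, ffn_calls

def toy_attention(hidden, cache):
    cache.append(list(hidden))  # stands in for storing this token's key/value
    return hidden

def make_toy_ffn(delta):
    # Toy FFN: nudges the first coordinate by delta.
    return lambda hidden: [hidden[0] + delta, hidden[1]]

layers = [(toy_attention, make_toy_ffn(d)) for d in (2.0, 0.01, 0.01, 0.01)]
kv_cache = [[] for _ in layers]
final_hidden, ffn_calls = forward_token([1.0, 1.0], layers, 0.99, kv_cache)
print(ffn_calls, [len(entries) for entries in kv_cache])
```

In this toy run the second FFN barely changes the vector, so the last two FFNs are never called, while all four cache slots are still filled.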

Key Findings

FFN blocks hold about two thirds of a transformer's layer parameters.

Numbers: ~66% of layer params (LLaMa-7B table)

Practical Use: Skipping FFN blocks targets the heaviest compute portion and yields meaningful compute reduction without touching attention/KV logic.

Evidence Ref: Table 1
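
The two-thirds figure can be sanity-checked with a little arithmetic. The sketch below assumes the commonly published LLaMA-7B dimensions (hidden size 4096, FFN intermediate size 11008, a gated FFN with three projections) and ignores the small normalization weights. The final estimate additionally assumes compute scales with weight count, which is a back-of-envelope simplification rather than a measured speedup.

```python
# Assumed LLaMA-7B dimensions (not stated in this summary).
HIDDEN_SIZE = 4096
INTERMEDIATE_SIZE = 11008

# Attention: query, key, value and output projections, each hidden x hidden.
attention_params = 4 * HIDDEN_SIZE * HIDDEN_SIZE
# Gated FFN: gate, up and down projections, each hidden x intermediate.
ffn_params = 3 * HIDDEN_SIZE * INTERMEDIATE_SIZE

ffn_share = ffn_params / (attention_params + ffn_params)
print(f"FFN share of layer weights: {ffn_share:.1%}")  # prints 66.8%

# Rough estimate only: assumes per-layer compute is proportional to weights.
skip_ratio = 0.25
layer_compute_saved = skip_ratio * ffn_share
print(f"Estimated layer weight compute saved: {layer_compute_saved:.1%}")  # prints 16.7%
```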

Token representations before vs after FFN blocks show high and monotonically increasing cosine similarity in middle layers.

Practical Use: High similarity flags redundant FFN work; use a similarity threshold to decide when to skip FFNs per input token.

Evidence Ref: Figure 2
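
For reference, the similarity test can be written as follows. The symbols are illustrative notation rather than the paper's: x_ℓ is the token representation entering the FFN block of layer ℓ, y_ℓ is the representation leaving it, and τ is a tunable threshold.

```latex
s_\ell = \frac{x_\ell^\top y_\ell}{\lVert x_\ell \rVert_2 \, \lVert y_\ell \rVert_2},
\qquad \text{skip subsequent FFN blocks once } s_\ell \ge \tau
```

A value of s_ℓ close to 1 means the FFN block barely changed the direction of the representation, which is the signal the finding above treats as redundancy.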

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | 78.09% | Full Model 79.02% | −0.93pp | FreebaseQA, ~25% FFN skip | Table 3 (Ours at ~25% skip) | Table 3
Multi-turn conversation (GPT-4 judge score) | 7.55 | Full Model 7.61 | −0.06 | MT-Bench style, ~20% skip | Table 2 (Ours at ~20% skip) | Table 2

What To Try In 7 Days

Measure cosine similarity before/after FFN blocks on your model to identify redundant middle layers.

Add a warm-up over the first few tokens (5–10% of max length) before enabling FFN skipping.

Implement greedy FFN skipping with a similarity threshold chosen to skip ~25% of FFN blocks, then compare QA and summarization quality against the full model.
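
One way to turn these steps into a calibration script is sketched below. It assumes you have already logged, on the full model, a per-layer similarity value for each token position (`sim_profiles[t][l]`). The function names, the candidate thresholds and the toy profile are all hypothetical. Because skipping changes downstream hidden states, the estimated ratio is only a starting point to confirm with real runs.

```python
def warmup_token_count(max_length, fraction=0.05):
    """Tokens decoded without skipping; the 5-10% range comes from the text."""
    return max(1, round(max_length * fraction))

def estimate_skip_ratio(sim_profiles, threshold, warmup_tokens):
    """Fraction of FFN calls a greedy policy would skip on logged similarities."""
    total = 0
    skipped = 0
    for position, sims in enumerate(sim_profiles):
        total += len(sims)
        if position < warmup_tokens:
            continue  # warm-up tokens always run every FFN
        for layer, sim in enumerate(sims):
            if sim >= threshold:
                # The FFN at `layer` still runs; all later ones are skipped.
                skipped += len(sims) - layer - 1
                break
    return skipped / total if total else 0.0

def pick_threshold(sim_profiles, target, warmup_tokens, candidates):
    """Candidate threshold whose estimated skip ratio is closest to the target."""
    return min(
        candidates,
        key=lambda t: abs(estimate_skip_ratio(sim_profiles, t, warmup_tokens) - target),
    )

# Made-up profile: 10 token positions, 8 layers, similarity rising with depth.
profile = [0.80, 0.85, 0.90, 0.94, 0.97, 0.98, 0.99, 0.995]
sim_profiles = [list(profile) for _ in range(10)]

warmup = warmup_token_count(max_length=10, fraction=0.1)
chosen = pick_threshold(sim_profiles, 0.25, warmup, candidates=(0.97, 0.98, 0.99))
print(warmup, chosen, estimate_skip_ratio(sim_profiles, chosen, warmup))
```

After choosing a threshold this way, the comparison against the full model on QA and summarization remains the actual acceptance test.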

Optimization Features

Token Efficiency
Saves compute per generated token by skipping FFNs once the similarity threshold is reached
Model Optimization
FFN block skipping (parameter-level compute reduction)
System Optimization
Avoids KV cache recomputation and hidden-state copying needed by layer-skip
Inference Optimization
Input-adaptive per-token skipping of FFN blocks
Warm-up on the first tokens to stabilize the KV cache

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Unknown
License: Unknown

Risks & Boundaries

Limitations

Performance drops significantly at high skip ratios (≥35%) according to the paper.

Requires a small calibration set to identify cold regions and a warm-up token count.
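
The summary does not spell out how cold regions are located, so the following is only one plausible heuristic, not the paper's procedure. It assumes cold regions sit at the start and end of the layer stack and marks leading and trailing layers whose mean calibration similarity falls below a cutoff. The function name, the cutoff and the sample values are hypothetical.

```python
def find_skippable_span(mean_sims, cutoff):
    """Return a half-open (start, end) layer range where skipping is allowed.

    mean_sims: mean before/after-FFN cosine similarity per layer, averaged
    over a calibration set. Layers outside the range are treated as cold.
    """
    start = 0
    while start < len(mean_sims) and mean_sims[start] < cutoff:
        start += 1  # leading layers below the cutoff are cold
    end = len(mean_sims)
    while end > start and mean_sims[end - 1] < cutoff:
        end -= 1  # trailing layers below the cutoff are cold
    return start, end

# Made-up calibration averages for an 8-layer toy model.
mean_sims = [0.62, 0.81, 0.93, 0.96, 0.97, 0.98, 0.90, 0.70]
span = find_skippable_span(mean_sims, cutoff=0.92)
print(span)  # prints (2, 6)
```

With this toy profile, only layers 2 through 5 would be eligible for skipping; the first two and last two layers always run their FFNs.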

When Not To Use

When you need aggressive skipping (>35%) without fine-tuning.

When absolute factual precision is required and any drop is unacceptable.

Failure Modes

Skipping inside identified 'cold' regions causes hallucination and token collapse.

Greedy skipping with an inappropriate similarity threshold can erase needed FFN transformations and degrade outputs.

Core Entities

Models

LLaMa-2 7B, LLaMa-2 13B, LLaMa-chat-13B

Metrics

Accuracy, MT-Bench GPT-4 score (0-10), GPT-4 summarization ranking

Datasets

FreebaseQA, CNN/DailyMail, Wikitext, C4

Benchmarks

Factoid-QA, MT-Bench (multi-turn conversation), In-context summarization (variable-length)