Skip 25–30% of expensive FFN blocks to speed decoding while keeping knowledge accuracy

Overview

Decision Snapshot: Ready For Pilot

The method is technically simple and plugs into the decoding loop: run the first few tokens in full as a warm-up, then monitor the per-token cosine similarity across each FFN block and greedily skip FFNs once it saturates. The evidence shows only a small quality loss at ~25% skip on the tested tasks; the threshold must be calibrated per model and workload.

Citations: 0

Evidence Strength: 0.78

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 3/4

Findings with evidence refs: 4/4

Results with explicit delta: 3/3

Reproducibility

Status: Partial assets available

Open source: Unknown

At A Glance

Cost impact: 75%

Production readiness: 70%

Novelty: 65%

Authors

Ajay Jaiswal, Bodun Hu, Lu Yin, Yeonju Ro, Shiwei Liu, Tianlong Chen, Aditya Akella

Links

Abstract / PDF

Why It Matters For Business

FFN-SkipLLM can cut a sizeable fraction of per-token compute (targeting the heaviest FFN blocks) while preserving accuracy on knowledge and conversational tasks, letting teams lower inference cost and latency without rewriting KV-cache logic.

Who Should Care

ML Engineer, Data Scientist, Product Manager, CTO

Summary TLDR

FFN-SkipLLM is a simple, input-adaptive decoding trick: measure cosine similarity before vs after each feed‑forward (FFN) block and skip FFNs once similarity saturates. Skipping only FFNs avoids Key-Value (KV) cache problems that hurt layer-level skipping. On LLaMa variants and knowledge tasks (Factoid-QA, MT-Bench, CNN/DailyMail summarization) the method can skip about 25–30% of FFN blocks with only small task-specific drops (often <1 percentage point) and fewer hallucinations than prior layer-skipping methods. The method needs a short warm-up (first tokens) and calibration to avoid 'cold' regions; performance degrades at ≥35% skipping.

Problem Statement

Layer-level early exit or dropping speeds token-by-token decoding but breaks KV caches and often causes hallucination or repeated tokens. FFN blocks contain around two thirds of layer parameters and may be redundant even when attention is needed. The paper asks: can we skip FFN blocks adaptively to cut compute while avoiding KV-cache damage and preserving factual generation?

Main Contribution

Introduce FFN-SkipLLM: an input-adaptive policy that skips FFN blocks using a cosine-similarity test between tokens entering and exiting an FFN.

Show skipping only FFNs circumvents KV cache issues tied to whole-layer skipping.

Key Findings

FFN blocks hold about two thirds of a transformer's layer parameters.

Numbers: ~66% of layer params (LLaMa-7B table)

Practical Use: Skipping FFN blocks targets the heaviest compute portion and yields meaningful compute reduction without touching attention/KV logic.

Evidence Ref: Table 1
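The two-thirds figure can be sanity-checked with back-of-the-envelope arithmetic. The sketch below assumes the publicly documented LLaMA-7B dimensions (hidden size 4096, FFN intermediate size 11008, a gated FFN with three projections) and ignores the small normalization weights; it is an illustration, not a reproduction of the paper's Table 1.

```python
# Rough per-layer parameter count for a LLaMA-7B-style decoder layer.
# Assumed dimensions: hidden size 4096, FFN intermediate size 11008.
# Normalization weights are ignored because they are negligible.
HIDDEN = 4096
INTERMEDIATE = 11008

# Attention: query, key, value, and output projections, each HIDDEN x HIDDEN.
attention_params = 4 * HIDDEN * HIDDEN

# Gated FFN: gate, up, and down projections, each HIDDEN x INTERMEDIATE.
ffn_params = 3 * HIDDEN * INTERMEDIATE

ffn_share = ffn_params / (attention_params + ffn_params)
print(f"attention: {attention_params:,}  ffn: {ffn_params:,}  ffn share: {ffn_share:.1%}")
```

Under these assumptions the FFN share comes out at roughly two thirds of the layer, in line with the ~66% quoted above.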

Token representations before vs after FFN blocks show high and monotonically increasing cosine similarity in middle layers.

Practical Use: High similarity flags redundant FFN work; use a similarity threshold to decide when to skip FFNs per input token.

Evidence Ref: Figure 2
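A minimal sketch of the measurement behind this finding, in plain Python. The hidden states are made-up toy vectors, the helper names are hypothetical, and treating "after" as the state following the FFN's residual update is an assumption; in practice the vectors would come from hooks on a real model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # undefined for zero vectors; treat as "not similar"
    return dot / (norm_u * norm_v)

def ffn_similarity_profile(ffn_inputs, ffn_outputs):
    """Per-layer similarity between the state entering and leaving each FFN block."""
    return [cosine_similarity(x, y) for x, y in zip(ffn_inputs, ffn_outputs)]

# Hypothetical hidden states for one token across three layers.
inputs = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [2.0, 1.0, 1.0]]
outputs = [[0.0, 1.0, 0.0], [1.0, 2.0, 0.0], [2.0, 1.0, 1.1]]
profile = ffn_similarity_profile(inputs, outputs)
print([round(s, 3) for s in profile])
```

In this toy data the similarity rises from layer to layer, which is the pattern that would mark later FFN blocks as candidates for skipping.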

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	78.09%	Full Model 79.02%	−0.93pp	FreebaseQA, ~25% FFN skip	Table 3 (Ours at ~25% skip)	Table 3
Multi-turn conversation (GPT-4 judge score)	7.55	Full Model 7.61	−0.06	MT-Bench style, ~20% skip	Table 2 (Ours at ~20% skip)	Table 2

What To Try In 7 Days

Measure cosine similarity before/after FFN blocks on your model to identify redundant middle layers.

Add a warm-up over the first few tokens (5–10% of max length) during which every FFN runs, before enabling FFN skipping.

Implement greedy FFN skipping with a similarity threshold tuned to reach ~25% FFN skip, and compare QA/summarization quality against the full model.
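The three steps above can be sketched as a single per-token loop. This is a toy illustration under stated assumptions, not the authors' implementation: `decode_token`, `make_ffn`, the 0.99 threshold, and the `cold_layers` set are all hypothetical, and the layers are stand-in functions on short vectors rather than real transformer blocks.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def decode_token(hidden, layers, token_index, warmup_tokens, threshold, cold_layers):
    """Run one token through all layers; return (hidden, number of FFNs skipped).

    `layers` is a list of (attention_fn, ffn_fn) pairs mapping a vector to a vector.
    """
    in_warmup = token_index < warmup_tokens
    skipping = False
    skipped = 0
    for index, (attention_fn, ffn_fn) in enumerate(layers):
        hidden = attention_fn(hidden)  # attention always runs, so the KV cache stays complete
        if skipping and not in_warmup and index not in cold_layers:
            skipped += 1  # greedy skip: the hidden state passes through unchanged
            continue
        before = hidden
        hidden = ffn_fn(hidden)
        if not in_warmup and cosine_similarity(before, hidden) >= threshold:
            skipping = True  # similarity has saturated; skip later non-cold FFNs
    return hidden, skipped

def make_ffn(step):
    # Toy FFN: nudges the second coordinate by `step`.
    return lambda h: [h[0], h[1] + step]

identity = lambda h: list(h)
steps = [1.0, 0.5, 0.05, 0.05, 0.05, 0.05]  # updates shrink with depth
layers = [(identity, make_ffn(s)) for s in steps]

max_new_tokens = 40
warmup_tokens = max_new_tokens * 5 // 100  # 5% of the generation budget
_, skipped_warm = decode_token([1.0, 0.0], layers, 0, warmup_tokens, 0.99, {5})
_, skipped_late = decode_token([1.0, 0.0], layers, 10, warmup_tokens, 0.99, {5})
print(skipped_warm, skipped_late)
```

With these toy values the warm-up token runs every FFN, while the later token skips two of its six FFNs and still executes the one in the protected layer. On a real model the threshold would be swept until the measured skip ratio lands near the ~25% target.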

Optimization Features

Token Efficiency

Saves compute per generated token by skipping FFNs once the similarity threshold is reached

Model Optimization

FFN block skipping (parameter-level compute reduction)

System Optimization

Avoids the KV cache recomputation and hidden-state copying that layer-skipping requires

Inference Optimization

Input-adaptive per-token skipping of FFN blocks

Warm-up first tokens to stabilize KV cache

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Unknown

License: Unknown

Risks & Boundaries

Limitations

Performance drops significantly at high skip ratios (≥35%) according to the paper.

Requires a small calibration set to identify cold regions and a warm-up token count.
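The summary does not say how cold regions are identified, so the following is only one possible heuristic, not the paper's procedure: average the FFN input/output similarity per layer over a small calibration set and protect any layer whose mean falls below a floor. The function name, the 0.85 floor, and the measurements are all hypothetical.

```python
from statistics import mean

def find_cold_layers(similarities_per_prompt, floor):
    """Return indices of layers whose mean FFN input/output similarity is below `floor`.

    `similarities_per_prompt[p][l]` is the similarity measured at layer l for prompt p.
    """
    num_layers = len(similarities_per_prompt[0])
    layer_means = [
        mean(row[layer] for row in similarities_per_prompt)
        for layer in range(num_layers)
    ]
    return {layer for layer, value in enumerate(layer_means) if value < floor}

# Hypothetical calibration measurements: 3 prompts x 5 layers.
calibration = [
    [0.60, 0.90, 0.97, 0.99, 0.80],
    [0.55, 0.92, 0.98, 0.99, 0.78],
    [0.65, 0.91, 0.96, 0.98, 0.82],
]
cold = find_cold_layers(calibration, floor=0.85)
print(sorted(cold))
```

Here the first and last layers are flagged, so a skipping policy would always run their FFNs. The floor itself is a tuning knob that would need validation against task quality.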

When Not To Use

When you need aggressive skipping (>35%) without fine-tuning.

When absolute factual precision is required and any drop is unacceptable.

Failure Modes

Skipping inside identified 'cold' regions causes hallucination and token collapse.

Greedy skipping with an inappropriate similarity threshold can erase needed FFN transformations and degrade outputs.

Core Entities

Models

LLaMa-2 7B, LLaMa-2 13B, LLaMa-chat-13B

Metrics

Accuracy, MT-Bench GPT-4 score (0-10), GPT-4 summarization ranking

Datasets

FreebaseQA, CNN/DailyMail, Wikitext, C4

Benchmarks

Factoid-QA, MT-Bench (multi-turn conversation), In-context summarization (variable-length)
