Overview
Production Readiness
0.7
Novelty Score
0.65
Cost Impact Score
0.75
Citation Count
0
Why It Matters For Business
FFN-SkipLLM can skip roughly 25–30% of feed-forward (FFN) blocks, the heaviest part of each layer, while preserving accuracy on knowledge and conversational tasks. Teams can lower inference cost and latency without rewriting KV-cache logic.
Summary TLDR
FFN-SkipLLM is a simple, input-adaptive decoding trick: measure the cosine similarity between the token representation entering and leaving each feed‑forward (FFN) block, and skip FFNs once that similarity saturates. Skipping only FFNs avoids the Key-Value (KV) cache problems that hurt layer-level skipping. On LLaMa variants and knowledge tasks (Factoid-QA, MT-Bench, CNN/DailyMail summarization) the method can skip about 25–30% of FFN blocks with only small task-specific drops (often <1 percentage point) and fewer hallucinations than prior layer-skipping methods. The method needs a short warm-up (the first few tokens are decoded before skipping is enabled) and calibration to avoid 'cold' regions; performance degrades at ≥35% skipping.
Problem Statement
Layer-level early exit or layer dropping speeds up token-by-token decoding but breaks KV caches and often causes hallucination or repeated tokens. FFN blocks contain around two-thirds of each layer's parameters and may be redundant even when attention is needed. The paper asks: can we skip FFN blocks adaptively to cut compute while avoiding KV-cache damage and preserving factual generation?
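The two-thirds figure can be checked with simple arithmetic. The sketch below counts only the large weight matrices in one decoder layer (biases and normalization weights are ignored). The function name and the example dimensions are illustrative assumptions: a classic layout with two FFN matrices and an FFN width of four times the model width, and a gated layout with three FFN matrices using the widths commonly published for LLaMA-2 7B (4096 and 11008).

```python
def layer_param_split(d_model: int, d_ff: int, ffn_matrices: int) -> dict:
    """Count weight-matrix parameters in one decoder layer (biases and norms ignored)."""
    attn = 4 * d_model * d_model          # query, key, value, and output projections
    ffn = ffn_matrices * d_model * d_ff   # up/down (2) or gate/up/down (3)
    return {"attention": attn, "ffn": ffn, "ffn_share": ffn / (attn + ffn)}


# Classic layout: two FFN matrices, d_ff = 4 * d_model.
classic = layer_param_split(d_model=4096, d_ff=4 * 4096, ffn_matrices=2)

# Gated layout with three FFN matrices (assumed LLaMA-2 7B-style widths).
llama_like = layer_param_split(d_model=4096, d_ff=11008, ffn_matrices=3)

print(f"classic FFN share:    {classic['ffn_share']:.3f}")     # 0.667
print(f"llama-like FFN share: {llama_like['ffn_share']:.3f}")  # 0.668
```

Both layouts land close to two-thirds, which is why skipping FFNs alone can remove a large part of per-layer compute.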
Main Contribution
Introduce FFN-SkipLLM: an input-adaptive policy that skips FFN blocks using a cosine-similarity test between the token representations entering and exiting an FFN.
Show that skipping only FFNs circumvents the KV cache issues tied to whole-layer skipping.
Provide a warm-up (first tokens) and cold-region calibration to protect layers that strongly change representations.
Empirically evaluate on Factoid-QA, multi-turn conversation (MT-Bench style), and variable-length summarization; show ~25–30% FFN skip with marginal quality loss and reduced hallucination versus layer-skip baselines.
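The policy described in the list above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names, the toy FFN, the threshold value, and the set of cold layers are all hypothetical. In particular, the rule "once one unprotected FFN passes the similarity test, skip every later unprotected FFN for this token" is one plausible reading of "skip once similarity saturates". Attention is omitted here because it always runs.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def run_token_through_ffns(ffn_blocks, hidden, token_index,
                           warmup_tokens, cold_layers, threshold):
    """Apply the FFN stack to one token; return (hidden, ids of skipped FFNs)."""
    skipped = []
    saturated = False
    for layer_id, ffn in enumerate(ffn_blocks):
        # Warm-up tokens and cold layers always execute their FFN.
        protected = token_index < warmup_tokens or layer_id in cold_layers
        if saturated and not protected:
            skipped.append(layer_id)  # hidden state passes through unchanged
            continue
        updated = ffn(hidden)
        if not protected and cosine_similarity(hidden, updated) >= threshold:
            saturated = True
        hidden = updated
    return hidden, skipped


def make_toy_ffn(scale):
    """Toy residual FFN: adds a scaled, cyclically shifted copy of the input."""
    def ffn(h):
        return [x + scale * y for x, y in zip(h, h[1:] + h[:1])]
    return ffn


# Large updates in early and late layers, small updates in the middle.
blocks = [make_toy_ffn(s) for s in (0.9, 0.6, 0.3, 0.05, 0.04, 0.03, 0.5, 0.4)]
cold = {0, 1, 6, 7}
start = [1.0, 0.0, 0.0, 0.0]

_, skipped_warm = run_token_through_ffns(blocks, start, 0, 4, cold, 0.99)
_, skipped_late = run_token_through_ffns(blocks, start, 10, 4, cold, 0.99)
print(skipped_warm)  # [] because token 0 is inside the warm-up window
print(skipped_late)  # [4, 5]: layer 3 saturates, layers 6 and 7 are cold
```

In this toy run, layer 3 is the first unprotected FFN whose output stays nearly parallel to its input, so the two following middle-layer FFNs are skipped while the cold layers at the end still execute.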
Key Findings
FFN blocks hold about two thirds of a transformer's layer parameters.
Token representations before vs after FFN blocks show high and monotonically increasing cosine similarity in middle layers.
FFN-SkipLLM can skip ~25–30% of FFN blocks with small quality loss on knowledge tasks.
Compared to layer-skipping methods, FFN-SkipLLM preserves factual accuracy and avoids token collapse at similar skip rates.
Results
Accuracy
Multi-turn conversation (GPT-4 judge score)
In-context summarization (GPT-4 ranking)
Who Should Care
What To Try In 7 Days
Measure cosine similarity before/after FFN blocks on your model to identify redundant middle layers.
Add a warm-up over the first few tokens (5–10% of max length) before enabling FFN skipping.
Implement greedy FFN skipping with a similarity threshold chosen to skip about 25% of FFN blocks, then compare QA and summarization quality against the full model.
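The steps above can be sketched as two small helpers: one that turns the 5–10% warm-up guideline into a token count, and one that picks a similarity threshold from calibration measurements so that about a quarter of FFN evaluations would be skipped. Both helper names and the sample similarity values are hypothetical. The threshold helper also assumes each FFN is judged independently, which is a simplification of a greedy policy; the realized skip ratio should be re-measured on real traffic.

```python
def warmup_length(max_new_tokens: int, percent: int = 5) -> int:
    """Tokens decoded with every FFN enabled (rounded up, at least 1)."""
    return max(1, (max_new_tokens * percent + 99) // 100)


def threshold_for_skip_ratio(similarities, target_ratio):
    """Threshold at which roughly `target_ratio` of samples are >= the threshold."""
    if not similarities:
        raise ValueError("need at least one similarity sample")
    if not 0.0 < target_ratio < 1.0:
        raise ValueError("target_ratio must be strictly between 0 and 1")
    ranked = sorted(similarities, reverse=True)
    k = int(target_ratio * len(ranked))
    if k == 0:
        return float("inf")  # too few samples to skip anything at this ratio
    return ranked[k - 1]


def realized_skip_ratio(similarities, threshold):
    """Fraction of samples that would be skipped at the given threshold."""
    return sum(s >= threshold for s in similarities) / len(similarities)


# Hypothetical before/after-FFN similarities gathered outside warm-up and cold layers.
samples = [0.90, 0.93, 0.95, 0.97, 0.985, 0.99, 0.995, 0.998]

threshold = threshold_for_skip_ratio(samples, target_ratio=0.25)
print(warmup_length(200, percent=5))            # 10
print(threshold)                                # 0.995
print(realized_skip_ratio(samples, threshold))  # 0.25
```

Ties in the similarity samples can push the realized ratio above the target, so it is worth checking `realized_skip_ratio` after choosing the threshold.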
Optimization Features
Token Efficiency
- Saves compute per generated token by skipping FFNs once the similarity threshold is reached
Model Optimization
- FFN block skipping (parameter-level compute reduction)
System Optimization
- Avoids KV cache recomputation and hidden-state copying needed by layer-skip
Inference Optimization
- Input-adaptive per-token skipping of FFN blocks
- Warm-up over the first tokens to stabilize the KV cache
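A toy comparison may help show why FFN-only skipping leaves the KV cache intact while whole-layer skipping does not. Everything in this sketch is a hypothetical stand-in: hidden states are single floats, each "layer" is a dictionary of small functions, and the per-token skip sets are chosen arbitrarily. The point is only the cache bookkeeping, not the numerics.

```python
def make_toy_layer(weight):
    """Scalar stand-in for one decoder layer (illustration only)."""
    return {
        "project_kv": lambda h: (weight * h, h),  # (key, value) pair
        "attend": lambda h, cache: h + sum(v for _, v in cache) / len(cache),
        "ffn": lambda h: h + weight * h,
    }


def step_ffn_skip(layers, caches, hidden, skip_ffn):
    """Decode one token; attention always runs, the FFN is optional."""
    for i, layer in enumerate(layers):
        caches[i].append(layer["project_kv"](hidden))  # K/V written at every layer
        hidden = layer["attend"](hidden, caches[i])
        if i not in skip_ffn:
            hidden = layer["ffn"](hidden)
    return hidden


def step_layer_skip(layers, caches, hidden, skip_layers):
    """Decode one token; a skipped layer writes no K/V entry at all."""
    for i, layer in enumerate(layers):
        if i in skip_layers:
            continue
        caches[i].append(layer["project_kv"](hidden))
        hidden = layer["ffn"](layer["attend"](hidden, caches[i]))
    return hidden


layers = [make_toy_layer(w) for w in (0.5, 0.2, 0.1, 0.3)]
skip_plan = [set(), {1, 2}, {2}]  # one skip set per decoded token

ffn_caches = [[] for _ in layers]
layer_caches = [[] for _ in layers]
for skips in skip_plan:
    step_ffn_skip(layers, ffn_caches, 1.0, skips)
    step_layer_skip(layers, layer_caches, 1.0, skips)

print([len(c) for c in ffn_caches])    # [3, 3, 3, 3]
print([len(c) for c in layer_caches])  # [3, 2, 1, 3]
```

With FFN-only skipping, every layer holds one cache entry per token. With layer skipping, later tokens attending at layers 1 and 2 would find missing entries, which is the gap that layer-skip methods have to patch by recomputation or hidden-state copying.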
Reproducibility
Data Available
Open Source Status
- unknown
Risks & Boundaries
Limitations
- Performance drops significantly at high skip ratios (≥35%) according to the paper.
- Requires a small calibration set to identify cold regions and a warm-up token count.
- Evaluations are on LLaMa family and common datasets; results may vary for other architectures or domain data.
When Not To Use
- When you need aggressive skipping (>35%) without fine-tuning.
- When absolute factual precision is required and any drop is unacceptable.
- When your model architecture or layer layout differs substantially from tested LLaMa variants.
Failure Modes
- Skipping inside identified 'cold' regions causes hallucination and token collapse.
- Greedy skipping with an inappropriate similarity threshold can erase needed FFN transformations and degrade outputs.
- Random or non-adaptive FFN dropping produces large performance losses (the paper reports drops of ≥50% in extreme cases).
Core Entities
Models
- LLaMa-2 7B
- LLaMa-2 13B
- LLaMa-chat-13B
Metrics
- Accuracy
- MT-Bench GPT-4 score (0-10)
- GPT-4 summarization ranking
Datasets
- FreebaseQA
- CNN/DailyMail
- Wikitext
- C4
Benchmarks
- Factoid-QA
- MT-Bench (multi-turn conversation)
- In-context summarization (variable-length)

