Overview
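The method is technically simple and plugs into decoding: it monitors per-token cosine similarity across each FFN block, runs the full model for the first few tokens as a warm-up, then greedily skips FFN blocks. The reported evidence shows a small quality loss at ~25% skip on the tested tasks; the similarity threshold and warm-up length need calibration per model and workload.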
Citations: 0
Evidence Strength: 0.78
Confidence: 0.80
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 3/4
Findings with evidence refs: 4/4
Results with explicit delta: 3/3
Reproducibility
Status: Partial assets available
Open source: Unknown
At A Glance
Cost impact: 75%
Production readiness: 70%
Novelty: 65%
Why It Matters For Business
FFN-SkipLLM can cut a sizeable fraction of per-token compute (targeting the heaviest FFN blocks) while preserving accuracy on knowledge and conversational tasks, letting teams lower inference cost and latency without rewriting KV-cache logic.
Summary TLDR
FFN-SkipLLM is a simple, input-adaptive decoding trick: measure the cosine similarity of each token's representation before vs after each feed-forward (FFN) block, and skip FFNs once that similarity saturates. Skipping only FFNs avoids the Key-Value (KV) cache problems that hurt layer-level skipping. On LLaMA variants and knowledge tasks (Factoid-QA, MT-Bench, CNN/DailyMail summarization) the method can skip about 25–30% of FFN blocks with only small task-specific drops (often <1 percentage point) and fewer hallucinations than prior layer-skipping methods. The method needs a short warm-up (the first few tokens run without skipping) and calibration to avoid skipping in 'cold' regions; performance degrades at ≥35% skipping.
Problem Statement
Layer-level early exit or layer dropping speeds up token-by-token decoding but breaks KV caches and often causes hallucination or repeated tokens. FFN blocks contain around two thirds of each layer's parameters and may be redundant even when attention is needed. The paper asks: can we skip FFN blocks adaptively to cut compute while avoiding KV-cache damage and preserving factual generation?
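The "two thirds" figure can be checked with simple arithmetic. The sketch below counts only weight matrices (biases, norms, and embeddings are ignored), and the dimensions are illustrative assumptions rather than values taken from the paper; `ffn_param_share` is a hypothetical helper name.

```python
def ffn_param_share(d_model: int, d_ff: int, gated: bool = False) -> float:
    """Fraction of one layer's weight-matrix parameters that sit in the FFN.

    Assumption: biases, layer norms, and embeddings are ignored.
    """
    attn = 4 * d_model * d_model      # Q, K, V, and output projections
    ffn_mats = 3 if gated else 2      # a gated FFN carries a third matrix
    ffn = ffn_mats * d_model * d_ff
    return ffn / (attn + ffn)

# Classic FFN with a 4x hidden expansion (illustrative dimensions).
classic = ffn_param_share(d_model=4096, d_ff=4 * 4096)
# Gated FFN with a narrower hidden size (illustrative dimensions).
gated = ffn_param_share(d_model=4096, d_ff=11008, gated=True)

print(f"classic 4x FFN share: {classic:.3f}")
print(f"gated FFN share:      {gated:.3f}")
```

Under these assumptions both layouts put roughly two thirds of the layer's weights in the FFN, which is why the FFN is the natural target for skipping.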
Main Contribution
Introduces FFN-SkipLLM: an input-adaptive policy that skips FFN blocks using a cosine-similarity test between a token's representation entering and exiting an FFN.
Shows that skipping only FFNs circumvents the KV-cache issues tied to whole-layer skipping.
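The two contributions can be illustrated with a toy decoder, offered as a minimal sketch rather than the paper's implementation. `toy_attn`, `make_ffn`, and `decode_token` are hypothetical stand-ins, the threshold is arbitrary, and the greedy rule (skip every later FFN once similarity crosses the threshold) is one reading of the summary above.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def toy_attn(x, cache):
    """Stand-in attention: records this token in the layer's cache, then mixes the cache."""
    cache.append(list(x))
    return [0.1 * sum(row[j] for row in cache) / len(cache) for j in range(len(x))]

def make_ffn(delta):
    """Stand-in FFN: returns a rotated copy of the input scaled by delta."""
    return lambda x: [delta * v for v in x[1:] + x[:1]]

def decode_token(x, ffns, kv_cache, threshold):
    """Run one token through all layers; return (hidden state, number of FFNs skipped)."""
    skipping, skipped = False, 0
    for layer, ffn in enumerate(ffns):
        # Attention always runs, so every layer's cache gets an entry for this token.
        x = [a + b for a, b in zip(x, toy_attn(x, kv_cache[layer]))]
        if skipping:
            skipped += 1          # FFN skipped: hidden state passes through unchanged
            continue
        out = [a + b for a, b in zip(x, ffn(x))]
        if cosine(x, out) >= threshold:
            skipping = True       # greedy: skip every later FFN for this token
        x = out
    return x, skipped

ffns = [make_ffn(d) for d in (1.0, 0.5, 0.05, 0.05, 0.05, 0.05)]
kv_cache = [[] for _ in ffns]
skipped_per_token = []
for token in ([1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]):
    _, skipped = decode_token(token, ffns, kv_cache, threshold=0.99)
    skipped_per_token.append(skipped)

print(skipped_per_token)
print([len(c) for c in kv_cache])
```

Because attention is never skipped, each layer's cache ends up with one entry per decoded token even when later FFNs are bypassed. Whole-layer skipping would leave gaps in those caches.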
Key Findings
FFN blocks hold about two thirds of a transformer's layer parameters.
Token representations before vs after FFN blocks show high and monotonically increasing cosine similarity in middle layers.
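For reference, the similarity in the second finding can be written as below. The symbols are assumed names: $x^{(\ell)}_{\text{in}}$ and $x^{(\ell)}_{\text{out}}$ denote one token's hidden state entering and leaving the FFN of layer $\ell$. Whether the residual connection is counted in $x^{(\ell)}_{\text{out}}$ is not specified in this summary.

```latex
\[
\operatorname{sim}^{(\ell)}
  = \frac{x^{(\ell)}_{\text{in}} \cdot x^{(\ell)}_{\text{out}}}
         {\lVert x^{(\ell)}_{\text{in}} \rVert \, \lVert x^{(\ell)}_{\text{out}} \rVert}
\]
```

A value close to 1 means the FFN barely changed the direction of the token's representation, which is the signal used to treat that FFN as redundant.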
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | 78.09% | Full Model 79.02% | −0.93pp | FreebaseQA, ~25% FFN skip | Table 3 (Ours at ~25% skip) | Table 3 |
| Multi-turn conversation (GPT-4 judge score) | 7.55 | Full Model 7.61 | −0.06 | MT-Bench style, ~20% skip | Table 2 (Ours at ~20% skip) | Table 2 |
What To Try In 7 Days
Measure cosine similarity before/after FFN blocks on your model to identify redundant middle layers.
Add a warm-up over the first few tokens (5–10% of max length), during which no FFN is skipped, before enabling FFN skipping.
Implement greedy FFN skipping with a similarity threshold chosen to reach ~25% FFN skip, and compare QA/summarization quality against the full model.
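The steps above could be prototyped offline before touching the model. The sketch below assumes per-token, per-layer similarities have already been recorded on a small calibration set; the traces shown are made-up toy numbers, not measurements. `first_cold`, `last_cold`, and all function names are hypothetical, and protecting the first and last layers is an assumption about where 'cold' regions lie.

```python
def warmup_tokens(max_new_tokens, percent=5):
    """Number of initial tokens decoded with no skipping (integer ceiling of percent%)."""
    return max(1, (percent * max_new_tokens + 99) // 100)

def skip_fraction(sim_traces, threshold, first_cold=2, last_cold=1):
    """Share of FFN calls a greedy policy would skip on recorded similarity traces.

    sim_traces: one list per token, holding one similarity value per layer.
    Layers inside the assumed cold regions always run their FFN.
    """
    total = skipped = 0
    for sims in sim_traces:
        n = len(sims)
        total += n
        skipping = False
        for layer, s in enumerate(sims):
            if layer < first_cold or layer >= n - last_cold:
                continue              # cold region: never skipped, never triggers
            if skipping:
                skipped += 1
            elif s >= threshold:
                skipping = True       # later eligible FFNs are skipped
    return skipped / total

def calibrate_threshold(sim_traces, target=0.25, candidates=(0.95, 0.965, 0.975)):
    """Pick the candidate threshold whose skip fraction is closest to the target."""
    return min(candidates, key=lambda t: abs(skip_fraction(sim_traces, t) - target))

# Toy traces for two tokens over 8 layers (illustrative values only).
traces = [
    [0.60, 0.80, 0.93, 0.96, 0.97, 0.98, 0.985, 0.70],
    [0.55, 0.75, 0.90, 0.94, 0.968, 0.98, 0.99, 0.65],
]
chosen = calibrate_threshold(traces, target=0.25)
print(chosen, skip_fraction(traces, chosen), warmup_tokens(200))
```

In practice the candidate grid, the cold-region boundaries, and the warm-up percentage would each be tuned per model and workload, and the chosen threshold should be validated against full-model quality rather than trusted from the skip ratio alone.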
Risks & Boundaries
Limitations
Performance drops significantly at high skip ratios (≥35%) according to the paper.
Requires a small calibration set to identify cold regions and a warm-up token count.
When Not To Use
When you need aggressive skipping (>35%) without fine-tuning.
When absolute factual precision is required and any drop is unacceptable.
Failure Modes
Skipping inside identified 'cold' regions causes hallucination and token collapse.
Greedy skipping with an inappropriate similarity threshold can erase needed FFN transformations and degrade outputs.

