Overview
The method is simple, training-free, and tested on several open LLMs and tasks; it works well at 50% FF sparsity but needs prompt-length tuning and batch-size checks before production.
Citations: 0
Evidence Strength: 0.80
Confidence: 0.85
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 4/4
Reproducibility
Status: Partial assets available
Open source: Yes
At A Glance
Cost impact: 70%
Production readiness: 70%
Novelty: 60%
Why It Matters For Business
GRIFFIN cuts active FF work by half with no retraining, offering real latency and memory wins for deploy-time generation while preserving most task quality on evaluated models and datasets.
Who Should Care
Summary TLDR
GRIFFIN is a training-free method that uses the prompt to choose which feedforward (FF) neurons to run for each input sequence. It exploits a phenomenon called "flocking" (tokens within a sequence share similar relative neuron activations) to keep about half of the FF neurons inactive during generation. On many models and tasks, GRIFFIN preserves task quality at 50% FF sparsity while speeding up generation (e.g., 1.29× on Gemma 7B and 1.25× on Llama 2 13B) and reducing active parameters (Llama 2 13B: 13B→8.8B). Code is public.
Problem Statement
Large transformer models waste a lot of compute in feedforward (FF) layers because many intermediate neurons contribute little per token. Existing fixes (structured pruning, MoEs) either require training, fail with non-ReLU activations, or are hard to deploy. The paper asks: can we adaptively skip FF neurons per sequence, without training, across many LLMs and activation types?
Main Contribution
Identify "flocking": within a sequence, relative FF neuron activations are highly consistent across tokens.
GRIFFIN: a training-free, prompt-based top-k selector that chooses FF neurons per sequence and reuses them during generation.
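The prompt-based top-k selection described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the scoring statistic (normalize each prompt token's activations, then take a per-neuron norm across tokens), the ungated FF block, and the helper names `select_ff_neurons`, `prune_ff`, and `gelu` are all hypothetical choices made for this example.

```python
import numpy as np

def gelu(x: np.ndarray) -> np.ndarray:
    """Tanh approximation of GELU, used here as a stand-in non-ReLU activation."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def select_ff_neurons(prompt_acts: np.ndarray, keep_frac: float = 0.5) -> np.ndarray:
    """Return sorted indices of FF neurons to keep for one sequence.

    prompt_acts has shape (num_prompt_tokens, d_ff).
    """
    # Assumed statistic: scale each token's activations to unit norm so the
    # score reflects relative, not absolute, activation magnitude.
    norms = np.linalg.norm(prompt_acts, axis=1, keepdims=True)
    relative = prompt_acts / np.maximum(norms, 1e-12)
    # Aggregate over prompt tokens to get one score per neuron.
    scores = np.linalg.norm(relative, axis=0)
    k = max(1, int(round(keep_frac * prompt_acts.shape[1])))
    # argpartition finds the k largest scores without a full sort.
    return np.sort(np.argpartition(scores, -k)[-k:])

def prune_ff(w_in: np.ndarray, w_out: np.ndarray, keep: np.ndarray):
    """Slice FF weights: w_in is (d_model, d_ff), w_out is (d_ff, d_model)."""
    return w_in[:, keep], w_out[keep, :]

# Toy example: select neurons from the prompt, then reuse them for a new token.
rng = np.random.default_rng(0)
d_model, d_ff = 4, 8
w_in = rng.standard_normal((d_model, d_ff))
w_out = rng.standard_normal((d_ff, d_model))
prompt = rng.standard_normal((6, d_model))

keep = select_ff_neurons(gelu(prompt @ w_in), keep_frac=0.5)
w_in_p, w_out_p = prune_ff(w_in, w_out, keep)

new_token = rng.standard_normal((1, d_model))
pruned_out = gelu(new_token @ w_in_p) @ w_out_p
print(keep, pruned_out.shape)
```

Slicing the weight matrices once per sequence, rather than masking activations at every step, is what would turn the sparsity into smaller matrix multiplications during generation.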
Key Findings
GRIFFIN keeps performance near the full model at 50% FF sparsity on classification tasks.
GRIFFIN preserves much of generation quality at 50% FF sparsity on summarization and QA.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Active parameters | Llama 2 13B: 13B -> 8.8B; Gemma 7B: 8.5B -> 5.4B | full model | ≈32% (Llama 2 13B), ≈36% (Gemma 7B) reduction | generation phase | Section 4.2 and 5.2 | Sections 4.2, 5.2 |
| Latency (long generation) | Gemma 7B 1.29×, Llama 2 13B 1.25× speed-up | full model | up to 29% faster | synthetic long generation on NVIDIA L40 | Table 3, Section 5.2 | Table 3 |
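To make the deltas in the table easier to check, the arithmetic can be reproduced with a short script. The function names are hypothetical, and the latency conversion assumes that a speed-up of s× means baseline latency divided by GRIFFIN latency; the summary does not state which definition the paper uses.

```python
def reduction_pct(full: float, active: float) -> float:
    """Percentage reduction from `full` to `active` (same units)."""
    return 100.0 * (full - active) / full

def latency_drop_pct(speedup: float) -> float:
    """Latency reduction implied by a speed-up, assuming
    speedup = baseline_latency / new_latency."""
    return 100.0 * (1.0 - 1.0 / speedup)

# Active-parameter figures from the table, in billions of parameters.
for name, full, active in [("Llama 2 13B", 13.0, 8.8), ("Gemma 7B", 8.5, 5.4)]:
    print(f"{name}: {reduction_pct(full, active):.1f}% fewer active parameters")

# Reported speed-ups on long generation.
for name, s in [("Gemma 7B", 1.29), ("Llama 2 13B", 1.25)]:
    print(f"{name}: {s}x speed-up, {latency_drop_pct(s):.1f}% lower latency")
```

Under that assumption, a 1.29× speed-up corresponds to roughly 22% lower latency per generation, which is a smaller number than the 29% figure might suggest at first glance.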
What To Try In 7 Days
Run GRIFFIN at 50% FF sparsity on one production LLM and measure latency, memory, and task metrics.
Compare prompt lengths: increase prompt size to reduce long-generation quality loss.
Test batched vs. single-sample throughput, and confirm whether the pruned model fits on a single device so that offloading can be avoided.
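For the latency measurements suggested above, a small timing helper along these lines may be enough to start. It is a generic sketch: `median_latency_s` and `speedup` are hypothetical names, and the callables you pass in (for example, a wrapper around your own dense and GRIFFIN generation calls) are assumed to block until the work is finished, which on a GPU usually means synchronizing the device inside the callable.

```python
import statistics
import time
from typing import Callable

def median_latency_s(fn: Callable[[], object], warmup: int = 2, repeats: int = 5) -> float:
    """Median wall-clock time of fn() in seconds, after a few warm-up calls."""
    for _ in range(warmup):
        fn()  # warm caches and lazy initialization before timing
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)

def speedup(baseline_s: float, candidate_s: float) -> float:
    """Speed-up of the candidate over the baseline (values above 1 mean faster)."""
    return baseline_s / candidate_s

# Stand-in workloads; replace with real dense and pruned generation calls.
dense_s = median_latency_s(lambda: sum(range(200_000)))
pruned_s = median_latency_s(lambda: sum(range(100_000)))
print(f"speed-up: {speedup(dense_s, pruned_s):.2f}x")
```

Repeating the measurement at each batch size and prompt length of interest gives a direct check on the batch-size and prompt-length caveats listed under Risks & Boundaries.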
Optimization Features
Token Efficiency
Infra Optimization
Model Optimization
System Optimization
Inference Optimization
Risks & Boundaries
Limitations
Performance degrades for very long generations when the prompt is short; longer prompts help.
Benefits shrink as batch size grows; best suited to batch size 1 or small batches.
When Not To Use
Workloads with extremely long uncontrolled generation and short prompts where neuron patterns drift.
High-throughput large-batch serving where batch-level aggregation reduces adaptivity gains.
Failure Modes
The prompt is not representative, so the selected neurons misalign with later generation.
Sampling-based neuron selection (instead of top-k) substantially reduces quality.
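The contrast between top-k and sampling-based selection can be illustrated with a toy example. This sketch assumes that the sampling variant draws neurons without replacement with probability proportional to their scores; the summary does not specify the exact scheme, and the function names and score values are made up for illustration.

```python
import numpy as np

def topk_select(scores: np.ndarray, k: int) -> np.ndarray:
    """Deterministically keep the k highest-scoring neurons."""
    return np.sort(np.argpartition(scores, -k)[-k:])

def sampled_select(scores: np.ndarray, k: int, rng: np.random.Generator) -> np.ndarray:
    """Assumed sampling variant: draw k distinct neurons with
    probability proportional to score."""
    p = scores / scores.sum()
    return np.sort(rng.choice(scores.size, size=k, replace=False, p=p))

# Hypothetical per-neuron scores for a six-neuron FF layer.
scores = np.array([0.9, 0.05, 0.8, 0.1, 0.7, 0.02])
rng = np.random.default_rng(0)

print("top-k:  ", topk_select(scores, 3))
print("sampled:", sampled_select(scores, 3, rng))
```

Top-k always keeps the three strongest neurons, while the sampled variant can swap one of them for a weak neuron, which is one plausible reading of why sampling hurts quality.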

