GRIFFIN: training-free sequence-level neuron selection that cuts FF work by 50% and speeds up generation

Overview

Decision Snapshot: Ready For Pilot

The method is simple, training-free, and tested on several open LLMs and tasks; it works well at 50% FF sparsity but needs prompt-length tuning and batch-size checks before production.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: Partial assets available

Open source: Yes

At A Glance

Cost impact: 70%

Production readiness: 70%

Novelty: 60%

Authors

Harry Dong, Beidi Chen, Yuejie Chi

Links

Abstract / PDF / Code

Why It Matters For Business

GRIFFIN cuts active FF work by half with no retraining, offering real latency and memory wins for deploy-time generation while preserving most task quality on evaluated models and datasets.

Who Should Care

ML Engineer, Engineering Lead, CTO, Product Manager, Founder

Summary TLDR

GRIFFIN is a training-free method that uses the prompt to pick which feed-forward (FF) neurons to run for each input sequence. It exploits a phenomenon called "flocking" (tokens in a sequence share similar relative neuron activations) to keep about half of the FF neurons inactive during generation. On many models and tasks GRIFFIN preserves task quality at 50% FF sparsity while speeding up generation (e.g., 1.29× on Gemma 7B, 1.25× on Llama 2 13B) and reducing active parameters (Llama 2 13B: 13B→8.8B). Code is public.

Problem Statement

Large transformer models spend much of their compute in feed-forward (FF) layers, yet many intermediate neurons contribute little to any given token. Existing fixes (structured pruning, MoEs) either require training, fail with non-ReLU activations, or are hard to deploy. The paper asks: can we adaptively skip FF neurons per sequence, without training, across many LLMs and activation types?

Main Contribution

Identify "flocking": within a sequence, relative FF neuron activations are highly consistent across tokens.

GRIFFIN: a training-free, prompt-based top-k selector that chooses FF neurons per sequence and reuses them during generation.
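The selection step can be sketched as follows. This is a minimal, dependency-free illustration rather than the authors' code: the scoring rule (scale each token's activation vector to unit length, then take each neuron's L2 norm over the prompt tokens) and the names `neuron_scores` and `select_neurons` are assumptions made for this example. The released repository remains the reference.

```python
import math

def neuron_scores(activations):
    """Score each FF neuron from prompt activations.

    activations: list of per-token lists, shape [num_tokens][num_neurons].
    Each token row is scaled to unit L2 norm so that only *relative*
    activations matter, then each neuron is scored by its L2 norm
    across tokens. (Assumed scoring rule for this sketch.)
    """
    num_neurons = len(activations[0])
    sq_sums = [0.0] * num_neurons
    for row in activations:
        norm = math.sqrt(sum(v * v for v in row)) or 1.0  # guard all-zero rows
        for j, v in enumerate(row):
            sq_sums[j] += (v / norm) ** 2
    return [math.sqrt(s) for s in sq_sums]

def select_neurons(activations, keep_fraction=0.5):
    """Return sorted indices of the top-k neurons to keep for this sequence."""
    scores = neuron_scores(activations)
    k = max(1, int(len(scores) * keep_fraction))
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])

# Toy prompt: 3 tokens, 4 neurons. Neurons 0 and 2 dominate in every token,
# which is the "flocking" pattern the selector relies on.
prompt_acts = [
    [4.0, 0.1, 3.0, 0.2],
    [2.0, 0.1, 1.5, 0.0],
    [8.0, 0.3, 6.0, 0.1],
]
kept = select_neurons(prompt_acts, keep_fraction=0.5)
print(kept)  # [0, 2]
```

The indices are computed once from the prompt and then reused for every generated token, which is what makes the selection sequence-level rather than token-level.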

Key Findings

GRIFFIN keeps performance near the full model at 50% FF sparsity on classification tasks.

Numbers: HellaSwag Llama 2 7B: 57.16 -> 57.11 accuracy (full -> GRIFFIN)

Practical Use: You can disable half of FF neurons during generation without hurting many classification outputs; try 50% pruning first.

Evidence Ref: Table 1

GRIFFIN preserves much of generation quality at 50% FF sparsity on summarization and QA.

Numbers: XSum Rouge-1: Llama 2 7B 27.15 -> 24.75; Gemma 7B 26.86 -> 25.86

Practical Use: Expect modest drops in summarization scores, often an acceptable trade-off for latency. Validate on your target task.

Evidence Ref: Table 2

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Active parameters	Llama 2 13B: 13B -> 8.8B; Gemma 7B: 8.5B -> 5.4B	full model	≈32% (Llama 2 13B) and ≈36% (Gemma 7B) reduction	generation phase	Sections 4.2 and 5.2	Sections 4.2, 5.2
Latency (long generation)	Gemma 7B 1.29×, Llama 2 13B 1.25× speed-up	full model	up to 1.29× faster	synthetic long generation on NVIDIA L40	Table 3, Section 5.2	Table 3
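As a worked check of the figures in the table above, the short script below converts the reported parameter counts and speed-up factors into percentage reductions. The helper names are made up for this example; the inputs are the numbers reported in the table. Note that a 1.29× speed-up corresponds to roughly 22% less wall-clock time, not 29%.

```python
def fraction_reduced(full, pruned):
    """Fraction of parameters that are no longer active."""
    return 1.0 - pruned / full

def latency_cut(speedup):
    """Fraction of wall-clock time saved for a given speed-up factor."""
    return 1.0 - 1.0 / speedup

# Parameter counts are in billions, as reported in the table.
print(f"Llama 2 13B active params: -{fraction_reduced(13.0, 8.8):.1%}")  # -32.3%
print(f"Gemma 7B active params:    -{fraction_reduced(8.5, 5.4):.1%}")   # -36.5%
print(f"Gemma 7B latency:          -{latency_cut(1.29):.1%}")            # -22.5%
print(f"Llama 2 13B latency:       -{latency_cut(1.25):.1%}")            # -20.0%
```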

What To Try In 7 Days

Run GRIFFIN at 50% FF sparsity on one production LLM and measure latency, memory, and task metrics.

Compare prompt lengths: increase prompt size to reduce long-generation quality loss.

Test batched vs. single-sample throughput and confirm whether pruned model fits a single device to avoid offload.
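For the latency part of this checklist, a small timing harness along the following lines may be enough to start with. It is a sketch under stated assumptions: `full_generate` and `pruned_generate` are hypothetical callables wrapping your own model, and the `fake_*` functions are stand-ins so the example runs without one.

```python
import statistics
import time

def median_latency(generate, prompt, repeats=5, warmup=1):
    """Median wall-clock seconds for generate(prompt) over several runs."""
    for _ in range(warmup):  # warm-up runs are not timed
        generate(prompt)
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        generate(prompt)
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)

def compare(full_generate, pruned_generate, prompt, repeats=5):
    """Time both variants on the same prompt and report the speed-up."""
    full_s = median_latency(full_generate, prompt, repeats)
    pruned_s = median_latency(pruned_generate, prompt, repeats)
    return {"full_s": full_s, "pruned_s": pruned_s, "speedup": full_s / pruned_s}

# Stand-ins so the sketch runs on its own; replace with real model calls.
def fake_full(prompt):
    time.sleep(0.03)
    return prompt

def fake_pruned(prompt):
    time.sleep(0.01)
    return prompt

result = compare(fake_full, fake_pruned, "hello")
print(result)
```

When timing on a GPU, synchronize the device before reading the clock, otherwise asynchronous kernel launches make runs look faster than they are. Repeat the comparison at each batch size and prompt length you plan to serve.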

Optimization Features

Token Efficiency

Lower compute per generated token due to fewer active neurons

Infra Optimization

Reduces memory footprint of FF layers during generation

Model Optimization

Sequence-level structured pruning (top-k neurons from prompt)

Adaptive per-sequence expert neuron selection (no training)

System Optimization

Enables fitting pruned model on single device to avoid offload

Best for single-sample, latency-sensitive inference

Inference Optimization

Reduces active FF dimensions during generation

Works with non-ReLU activations (SwiGLU, GEGLU, ReGLU)
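To make "reducing active FF dimensions" concrete, here is a minimal sketch of a gated FF block for a single token that runs only a chosen subset of neurons. It assumes a SwiGLU-style gate (SiLU) and a per-neuron weight layout picked for readability; the function and argument names are hypothetical and do not come from the GRIFFIN code base.

```python
import math

def silu(x):
    return x / (1.0 + math.exp(-x))

def gated_ff(x, w_gate, w_up, w_down, keep=None):
    """SwiGLU-style FF block for one token.

    x:       input vector, length d_model
    w_gate:  [d_ff][d_model] gate projection, one row per neuron
    w_up:    [d_ff][d_model] up projection, one row per neuron
    w_down:  [d_ff][d_model] row j is neuron j's contribution to the output
    keep:    neuron indices to run; None runs all d_ff neurons
    """
    neurons = range(len(w_gate)) if keep is None else keep
    out = [0.0] * len(w_down[0])
    for j in neurons:
        gate = silu(sum(w * v for w, v in zip(w_gate[j], x)))
        up = sum(w * v for w, v in zip(w_up[j], x))
        h = gate * up                      # intermediate activation of neuron j
        for i, w in enumerate(w_down[j]):  # neurons not in `keep` cost nothing
            out[i] += h * w
    return out

x = [1.0, -2.0]
w_gate = [[0.5, 0.1], [0.2, -0.3], [0.0, 0.4]]
w_up = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_down = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

full = gated_ff(x, w_gate, w_up, w_down)
pruned = gated_ff(x, w_gate, w_up, w_down, keep=[0, 1])  # neuron 2 skipped
```

Because each neuron contributes an independent term to the output, skipping a neuron removes exactly its term, and nothing in the computation depends on the activation being ReLU. In a real implementation the same effect is obtained by slicing the weight matrices once per sequence instead of looping.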

Reproducibility

Code Available: Yes

Data Available: No

Open Source Status: Yes

License: Unknown

Code URLs

https://github.com/hdong920/GRIFFIN

Risks & Boundaries

Limitations

Performance degrades for very long generations when prompt is short; longer prompts help.

Benefits shrink as batch size grows; best suited to batch size 1 or small batches.

When Not To Use

Workloads with extremely long uncontrolled generation and short prompts where neuron patterns drift.

High-throughput large-batch serving where batch-level aggregation reduces adaptivity gains.
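One way to gauge the batching concern before committing is to check how much of each sequence's preferred neuron set survives when the whole batch must share a single set. The sketch below is a diagnostic only: aggregating by summing per-sequence scores is an assumption made for illustration, not necessarily how GRIFFIN combines a batch, and all names are hypothetical.

```python
def top_k(scores, k):
    """Set of indices of the k highest-scoring neurons."""
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return set(ranked[:k])

def shared_set_overlap(per_sequence_scores, k):
    """Fraction of each sequence's own top-k neurons that survive in a
    single set shared by the whole batch (here: top-k of summed scores)."""
    num_neurons = len(per_sequence_scores[0])
    summed = [sum(s[j] for s in per_sequence_scores) for j in range(num_neurons)]
    shared = top_k(summed, k)
    return [len(top_k(s, k) & shared) / k for s in per_sequence_scores]

batch = [
    [0.9, 0.8, 0.1, 0.0],  # sequence A prefers neurons 0 and 1
    [0.9, 0.7, 0.2, 0.1],  # sequence B prefers neurons 0 and 1
    [0.0, 0.1, 0.9, 0.8],  # sequence C prefers neurons 2 and 3
]
print(shared_set_overlap(batch, k=2))  # [1.0, 1.0, 0.0]
```

Low overlap for some sequences, as for sequence C here, suggests the shared set no longer reflects their prompts, which is the situation in which per-sequence adaptivity is lost.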

Failure Modes

Prompt is not representative and selected neurons misalign with later generation.

Sampling-based neuron selection (instead of top-k) substantially reduces quality.

Core Entities

Models

Llama 2, Gemma, Mistral, OPT, ReluLlama

Metrics

Rouge-1/2/L, F1, Exact Match, Accuracy, Latency (s), Active parameter count

Datasets

WikiText, XSum, CNN/DailyMail, CoQA, QASPER, HellaSwag, PIQA, COPA, ARC-e, ARC-c, BoolQ

Benchmarks

XSum, CNN/DailyMail, CoQA, QASPER, HellaSwag, PIQA, COPA, ARC, BoolQ
