LOOKAT: 64× KV-cache compression via lookup-table attention, no retraining

Overview

Production Readiness

0.65

Novelty Score

0.6

Cost Impact Score

0.8

Citation Count

0

Authors

Aryan Karmore

Links

Abstract / PDF

Why It Matters For Business

LOOKAT can cut key-cache memory and the DRAM bandwidth spent on attention scoring by 32–64× on edge devices without retraining, enabling longer contexts or lower-cost hardware for real-time inference.

Summary TLDR

LOOKAT replaces standard attention scoring with a lookup-table method built from product quantization and asymmetric distance computation (ADC). By compressing keys into small codebook indices and precomputing query×codebook dot-products, LOOKAT avoids dequantizing keys and cuts KV-cache bandwidth drastically. On GPT-2 experiments it reaches 64× compression with ~0.957 cosine output fidelity and Spearman rank correlation >0.95, without training or architecture changes. Main limits: values stay FP16 and lookup kernels need hardware support.
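The lookup-table scoring described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the function names `build_lookup_tables` and `adc_scores`, the random codebooks, and the 64-dimensional head are all assumptions made for the example. It shows that summing table entries selected by byte codes gives exactly the dot product against the reconstructed (quantized) keys, without ever materializing those keys.

```python
import numpy as np

def build_lookup_tables(query, codebooks):
    """Precompute query-subvector x centroid dot products.

    query:     (d,) float array
    codebooks: (m, 256, d // m) float array, one codebook per subspace
    returns:   (m, 256) table with table[j, c] = q_j . codebooks[j, c]
    """
    m, _, sub_dim = codebooks.shape
    q_sub = query.reshape(m, sub_dim)
    return np.einsum("jd,jcd->jc", q_sub, codebooks)

def adc_scores(tables, codes):
    """Score every key from its byte codes alone.

    tables: (m, 256) lookup tables for one query
    codes:  (n_keys, m) uint8 codebook indices
    returns (n_keys,) approximate q . k scores (before 1/sqrt(d) scaling and softmax)
    """
    m = tables.shape[0]
    # Gather one table entry per subspace, then sum across subspaces.
    return tables[np.arange(m), codes].sum(axis=1)

# Hypothetical sizes and random data, purely for illustration.
rng = np.random.default_rng(0)
d, m, n_keys = 64, 4, 10
codebooks = rng.standard_normal((m, 256, d // m)).astype(np.float32)
codes = rng.integers(0, 256, size=(n_keys, m)).astype(np.uint8)
query = rng.standard_normal(d).astype(np.float32)

tables = build_lookup_tables(query, codebooks)
scores = adc_scores(tables, codes)

# Reference path: rebuild each key from its centroids and take a plain dot product.
reconstructed = codebooks[np.arange(m), codes].reshape(n_keys, d)
reference = reconstructed @ query
print(np.allclose(scores, reference, atol=1e-4))
```

The only approximation in this scheme is the quantization of the keys themselves; the table lookup adds no further error relative to scoring against the quantized keys.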

Problem Statement

KV-cache memory grows with sequence length and dominates edge inference memory. Standard INT4/INT8 quantization lowers storage but still needs dequantization, so bandwidth remains the bottleneck. The paper asks: can attention scoring be computed directly on compressed keys to remove the dequantization bandwidth cost?

Main Contribution

Show attention scoring is equivalent to inner-product retrieval and can use product quantization + ADC.

Introduce LOOKAT: compute attention scores from codebook indices via precomputed lookup tables, avoiding key dequantization.

Demonstrate up to 64× KV-cache compression with >95% output fidelity on GPT-2 without retraining or architectural changes.

Provide a theoretical bound linking rank-correlation degradation to PQ parameters, plus empirical scaling results up to 1024 tokens.
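The first two contributions rest on a standard product-quantization identity, written out here for reference. The notation is an assumption of this note rather than the paper's: the key k is split into m sub-vectors, C_j is the codebook for subspace j, i_j is the stored byte index, and T_j is the per-query lookup table. The paper's rank-correlation bound is not reproduced here.

```latex
k \approx \hat{k} = \bigl[\, C_1[i_1];\; C_2[i_2];\; \dots;\; C_m[i_m] \,\bigr],
\qquad
q^\top k \;\approx\; q^\top \hat{k}
  \;=\; \sum_{j=1}^{m} q^{(j)\top} C_j[i_j]
  \;=\; \sum_{j=1}^{m} T_j[i_j],
\qquad
T_j[c] = q^{(j)\top} C_j[c].
```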

Key Findings

LOOKAT achieves 64× KV-cache compression while keeping model output close to FP16.

Numbers: 64× compression → cosine sim 0.957

LOOKAT preserves attention ranking structure.

Numbers: Spearman ρ ≈ 0.95–0.96 across configs

LOOKAT removes the dequantization bandwidth bottleneck by using ADC lookup tables.

Numbers: Per-key DRAM load: 4 bytes vs. 128 bytes (32× less bandwidth for m=4)

Quality degrades with longer contexts but remains usable up to 1024 tokens.

Numbers: Cosine sim drops from 0.999 (L=64) to 0.903 (L=1024)
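The per-key byte counts and compression ratios quoted in these findings follow from simple arithmetic, sketched below. The 64-dimensional head is an assumption chosen because it matches the 128-byte FP16 key figure; the helper names are hypothetical, and one byte per codebook index (256 centroids) is assumed throughout.

```python
def key_bytes_fp16(head_dim):
    """Bytes needed to load one FP16 key vector."""
    return head_dim * 2  # 2 bytes per FP16 element

def key_bytes_pq(m, bits_per_index=8):
    """Bytes needed to load one PQ-coded key with m subspaces."""
    return m * bits_per_index // 8

def compression_ratio(head_dim, m):
    """FP16 key size divided by PQ-coded key size."""
    return key_bytes_fp16(head_dim) / key_bytes_pq(m)

HEAD_DIM = 64  # assumption: consistent with the 128-byte FP16 key above

for m in (2, 4, 8, 16):
    print(f"m={m}: {key_bytes_pq(m)} bytes per key, "
          f"{compression_ratio(HEAD_DIM, m):.0f}x smaller than FP16")
```

Under these assumptions m=4 gives the 32× bandwidth figure and m=2 gives 64×, which is consistent with the 64× headline number being paired with the LOOKAT-2 cosine similarity of 0.957.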

Results

LOOKAT-2 cosine similarity (output fidelity)

Value: 0.957

Baseline: 1.000 (FP16)

LOOKAT-4 cosine similarity (output fidelity)

Value: 0.950

Baseline: 1.000 (FP16)

INT4 cosine similarity (scalar quant baseline)

Value: 0.987

Baseline: 1.000 (FP16)

Spearman rank correlation (LOOKAT range)

Value: ≈0.957–0.960

Baseline: 1.000 (FP16)

Long-context cosine similarity (LOOKAT-4)

Value: 0.903 at L=1024

Baseline: 0.999 at L=64

Who Should Care

CTO, ML Engineer, Engineering Lead, Product Manager

What To Try In 7 Days

Profile your model's KV-cache bandwidth; confirm attention scoring is the bottleneck.

Implement a PQ+ADC prototype for keys (m=4) and measure per-query DRAM load reduction.

Run functional checks on a few representative prompts to compare top-k attention overlap with FP16.
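For the functional check in the last step, two simple rank-based metrics are usually enough to start with: top-k overlap and Spearman rank correlation between FP16 scores and approximate scores. The sketch below is one possible way to compute them with NumPy; the function names are hypothetical, and the Spearman helper does not average tied ranks, which is acceptable for continuous attention scores but is a simplification.

```python
import numpy as np

def topk_overlap(ref_scores, approx_scores, k):
    """Fraction of the reference top-k keys that the approximation also ranks in its top-k."""
    ref_top = set(np.argsort(ref_scores)[-k:].tolist())
    approx_top = set(np.argsort(approx_scores)[-k:].tolist())
    return len(ref_top & approx_top) / k

def spearman_rho(a, b):
    """Spearman rank correlation: Pearson correlation of the rank vectors (ties not averaged)."""
    rank_a = np.argsort(np.argsort(a)).astype(np.float64)
    rank_b = np.argsort(np.argsort(b)).astype(np.float64)
    return float(np.corrcoef(rank_a, rank_b)[0, 1])

# Toy scores for four keys: the approximation perturbs values but keeps the ordering.
ref = np.array([0.1, 0.9, 0.5, 0.3])
approx = np.array([0.2, 0.8, 0.4, 0.35])
print(topk_overlap(ref, approx, k=2), spearman_rho(ref, approx))
```

In a real check, `ref` would be the FP16 query-key scores for one head and `approx` the lookup-table scores for the same query, averaged over several prompts and heads.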

Optimization Features

Token Efficiency

reduces per-token key storage to 1–16 bytes depending on m (values stay FP16)

Infra Optimization

suitable for low-bandwidth edge DRAM; reduces DRAM traffic for attention

Model Optimization

product quantization of key vectors
asymmetric distance computation for scoring

System Optimization

shifts bottleneck from memory bandwidth to small compute and memory for tables
requires optimized lookup kernels on target hardware

Training Optimization

no retraining required (post-processing codebooks)
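Building codebooks as a post-processing step typically means running k-means on a calibration set of key vectors, one codebook per subspace, and then encoding each key as its nearest-centroid indices. The sketch below uses plain Lloyd iterations in NumPy on random data. It is an assumption about how such codebooks could be fitted, not the paper's procedure: the function names, iteration count, and calibration data are all hypothetical.

```python
import numpy as np

def nearest_centroid(sub, centroids):
    """Index of the closest centroid for each row of sub."""
    # ||x - c||^2 = ||x||^2 - 2 x.c + ||c||^2; the ||x||^2 term does not change the argmin.
    dists = -2.0 * sub @ centroids.T + (centroids ** 2).sum(axis=1)
    return dists.argmin(axis=1)

def fit_codebooks(keys, m, n_centroids=256, n_iters=5, seed=0):
    """Fit one k-means codebook per subspace with plain Lloyd iterations."""
    rng = np.random.default_rng(seed)
    n, d = keys.shape
    sub_dim = d // m
    codebooks = np.empty((m, n_centroids, sub_dim), dtype=np.float32)
    for j in range(m):
        sub = keys[:, j * sub_dim:(j + 1) * sub_dim]
        # Initialise centroids from randomly chosen calibration vectors.
        centroids = sub[rng.choice(n, n_centroids, replace=False)].copy()
        for _ in range(n_iters):
            assign = nearest_centroid(sub, centroids)
            for c in range(n_centroids):
                members = sub[assign == c]
                if len(members):  # empty clusters keep their previous centroid
                    centroids[c] = members.mean(axis=0)
        codebooks[j] = centroids
    return codebooks

def encode_keys(keys, codebooks):
    """Replace each key by m byte indices, one per subspace."""
    m, _, sub_dim = codebooks.shape
    codes = np.empty((keys.shape[0], m), dtype=np.uint8)
    for j in range(m):
        sub = keys[:, j * sub_dim:(j + 1) * sub_dim]
        codes[:, j] = nearest_centroid(sub, codebooks[j])
    return codes

# Hypothetical calibration set: 2000 random 64-dimensional "keys".
rng = np.random.default_rng(1)
m = 4
keys = rng.standard_normal((2000, 64)).astype(np.float32)
codebooks = fit_codebooks(keys, m)
codes = encode_keys(keys, codebooks)

# Reconstruction error should be well below the raw variance of the keys.
reconstructed = codebooks[np.arange(m), codes].reshape(keys.shape)
mse = float(((keys - reconstructed) ** 2).mean())
print(codes.shape, codes.dtype, mse)
```

Random Gaussian keys are a worst case for quantization; real key vectors tend to be more structured, but how well codebooks transfer from calibration data to runtime data is exactly the risk listed under Limitations.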

Inference Optimization

precompute lookup tables per query (m × 256 dot-products)
replace per-key FP16 loads with byte indices + lookups
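A rough cost model makes the trade-off in these two steps concrete: the tables are built once per query, after which each key costs m lookups instead of a full dot product. The helper below is a back-of-the-envelope sketch; `lut_cost` is a hypothetical name, and FP16 table entries (2 bytes each) and a 64-dimensional head are assumptions.

```python
def lut_cost(head_dim, m, n_keys, n_centroids=256, table_bytes_per_entry=2):
    """Back-of-the-envelope operation and memory counts for one query and one head."""
    sub_dim = head_dim // m
    return {
        # One table row of n_centroids entries per subspace.
        "table_entries": m * n_centroids,
        "table_bytes": m * n_centroids * table_bytes_per_entry,
        # Each table entry is a sub_dim-long dot product.
        "table_build_macs": m * n_centroids * sub_dim,
        # Each key needs one lookup per subspace.
        "scoring_lookups": n_keys * m,
        # Dense FP16 scoring needs a head_dim-long dot product per key.
        "dense_macs": n_keys * head_dim,
    }

cost = lut_cost(head_dim=64, m=4, n_keys=1024)
for name, value in cost.items():
    print(f"{name}: {value}")
```

Under these assumptions the table for one query and head is about 2 KB, and building it costs as many multiply-accumulates as densely scoring 256 keys. The compute saving therefore only appears for contexts longer than that, while the per-key bandwidth saving applies at any length.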

Reproducibility

Open Source Status

unknown

Risks & Boundaries

Limitations

Only compresses keys; values remain FP16 so total memory savings are partial.
Codebook quality depends on calibration data and domain; results reported on three small sample types.
Needs hardware-optimized lookup kernels on NPUs/DSPs to realize latency gains.
Experiments use GPT-2 first layer and small sample counts; large-model behavior remains untested.

When Not To Use

When you need exact attention magnitudes rather than relative ordering.
If your hardware cannot support fast table lookups or optimized byte-index kernels.
When value compression is required but not yet implemented.

Failure Modes

Long contexts degrade fidelity (cosine similarity falls from 0.999 at L=64 to 0.903 at L=1024, roughly a 10% drop).
High KL divergence in some samples can alter attention mass distribution.
Poor codebook fit (mismatch between calibration and runtime data) reduces accuracy.
