LOOKAT: 64× KV-cache compression via lookup-table attention, no retraining

January 15, 2026 · 6 min

Overview

Production Readiness

0.65

Novelty Score

0.6

Cost Impact Score

0.8

Citation Count

0

Authors

Aryan Karmore

Why It Matters For Business

LOOKAT can cut key-cache memory and the DRAM traffic of attention scoring on edge devices by 32× to 64× in the reported GPT-2 experiments, without retraining, enabling longer contexts or lower-cost hardware for real-time inference.

Summary TLDR

LOOKAT replaces standard attention scoring with a lookup-table method built from product quantization (PQ) and asymmetric distance computation (ADC). By compressing keys into small codebook indices and precomputing query×codebook dot-products, LOOKAT avoids dequantizing keys and drastically cuts the bandwidth needed to read them. In GPT-2 experiments it reaches 64× compression with an output cosine similarity of about 0.957 and a Spearman rank correlation above 0.95, without training or architecture changes. Main limits: values stay FP16, and realizing the latency gains requires hardware-optimized lookup kernels.
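The lookup-table scoring described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names `build_lookup_table` and `adc_scores`, the array layouts, and the random codebooks are hypothetical, and 256 centroids per subspace is assumed so that each index fits in one byte.

```python
import numpy as np

def build_lookup_table(query, codebooks):
    """Precompute query-by-centroid dot-products.

    query:     (d,) full-precision query vector
    codebooks: (m, k, d_sub) centroids, with d == m * d_sub (assumed layout)
    returns:   (m, k) table where table[j, c] = <query subvector j, centroid c>
    """
    m, k, d_sub = codebooks.shape
    q_sub = query.reshape(m, d_sub)
    return np.einsum("jd,jcd->jc", q_sub, codebooks)

def adc_scores(table, codes):
    """Score every cached key from its byte indices only.

    codes:   (L, m) uint8 indices, one per subspace per key
    returns: (L,) approximate attention logits (unscaled)
    """
    m = table.shape[0]
    # Gather table[j, codes[i, j]] for each key i, then sum over subspaces.
    return table[np.arange(m), codes].sum(axis=1)

# Toy check with random data (illustrative values, not from the paper).
rng = np.random.default_rng(0)
d, m, L = 64, 4, 10
codebooks = rng.standard_normal((m, 256, d // m))
codes = rng.integers(0, 256, size=(L, m)).astype(np.uint8)
query = rng.standard_normal(d)

scores = adc_scores(build_lookup_table(query, codebooks), codes)

# Reference: rebuild the quantized keys and take ordinary dot-products.
keys_hat = codebooks[np.arange(m), codes].reshape(L, d)
assert np.allclose(scores, keys_hat @ query)
```

The final assertion shows the point of ADC: the table lookups give the same scores as dot-products against the reconstructed keys, but the keys themselves are never rebuilt.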

Problem Statement

KV-cache memory grows with sequence length and dominates edge inference memory. Standard INT4/INT8 quantization lowers storage but still needs dequantization, so bandwidth remains the bottleneck. The paper asks: can attention scoring be computed directly on compressed keys to remove the dequantization bandwidth cost?

Main Contribution

Show attention scoring is equivalent to inner-product retrieval and can use product quantization + ADC.

Introduce LOOKAT: compute attention scores from codebook indices via precomputed lookup tables, avoiding key dequantization.

Demonstrate up to 64× KV-cache compression with >95% output fidelity on GPT-2 without retraining or architectural changes.

Provide a theoretical bound linking rank-correlation degradation to PQ parameters, plus empirical scaling results up to 1024 tokens.

Key Findings

LOOKAT achieves 64× KV-cache compression while keeping model output close to FP16.

Numbers: 64× compression → cosine sim 0.957

LOOKAT preserves attention ranking structure.

Numbers: Spearman ρ ≈ 0.95–0.96 across configs

LOOKAT removes the dequantization bandwidth bottleneck by using ADC lookup tables.

Numbers: Per-key DRAM load: 4 bytes vs. 128 bytes (32× less bandwidth for m=4)

Quality degrades with longer contexts but remains usable up to 1024 tokens.

Numbers: Cosine sim drops from 0.999 (L=64) to 0.903 (L=1024)
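The per-key byte counts in the findings above follow from simple arithmetic, sketched here. The head dimension of 64 is an assumption inferred from the 128-byte FP16 figure (64 dimensions × 2 bytes), and one byte per subspace index assumes 256-entry codebooks. The function names are hypothetical.

```python
def fp16_key_bytes(head_dim):
    """Bytes to store one key vector in FP16 (2 bytes per dimension)."""
    return 2 * head_dim

def pq_key_bytes(m):
    """Bytes to store one PQ-coded key: one uint8 index per subspace (assumed)."""
    return m

def key_compression_ratio(head_dim, m):
    """How many times smaller a PQ-coded key is than its FP16 original."""
    return fp16_key_bytes(head_dim) / pq_key_bytes(m)

print(key_compression_ratio(64, 2))  # 64.0
print(key_compression_ratio(64, 4))  # 32.0
```

Under these assumptions m=2 gives the headline 64× figure and m=4 gives the 4-byte, 32× case quoted above.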

Results

LOOKAT-2 cosine similarity (output fidelity)

Value: 0.957

Baseline: 1.000 (FP16)

LOOKAT-4 cosine similarity (output fidelity)

Value: 0.950

Baseline: 1.000 (FP16)

INT4 cosine similarity (scalar quant baseline)

Value: 0.987

Baseline: 1.000 (FP16)

Spearman rank correlation (LOOKAT range)

Value: ≈0.957–0.960

Baseline: 1.000 (FP16)

Long-context cosine similarity (LOOKAT-4)

Value: 0.903 at L=1024

Baseline: 0.999 at L=64
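To reproduce this kind of comparison on your own model, the two fidelity metrics reported above can be computed as in the sketch below. It assumes you already have the FP16 and approximate attention outputs (for cosine similarity) and the per-key scores (for Spearman) as arrays. The function names are hypothetical, and the rank computation ignores ties, which is a simplification.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two flattened arrays."""
    a = np.asarray(a, dtype=float).ravel()
    b = np.asarray(b, dtype=float).ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def spearman_rho(x, y):
    """Spearman rank correlation, assuming no tied values.

    Ranks come from a double argsort; the Pearson correlation
    of the two rank vectors is then Spearman's rho.
    """
    rank_x = np.argsort(np.argsort(x)).astype(float)
    rank_y = np.argsort(np.argsort(y)).astype(float)
    return float(np.corrcoef(rank_x, rank_y)[0, 1])
```

A typical use is `cosine_similarity(out_fp16, out_approx)` on the attention outputs and `spearman_rho(scores_fp16, scores_approx)` on the pre-softmax scores of a single query.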

Who Should Care

What To Try In 7 Days

Profile your model's KV-cache bandwidth; confirm attention scoring is the bottleneck.

Implement a PQ+ADC prototype for keys (m=4) and measure per-query DRAM load reduction.

Run functional checks on a few representative prompts to compare top-k attention overlap with FP16.
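The top-k overlap check in the last step could look like the sketch below. It assumes you have per-key attention scores for one query from both the FP16 path and the compressed path; the helper names are hypothetical.

```python
def topk_indices(scores, k):
    """Indices of the k largest scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(order[:k])

def topk_overlap(scores_ref, scores_approx, k):
    """Fraction of the reference top-k keys that the approximation also ranks in its top-k."""
    shared = topk_indices(scores_ref, k) & topk_indices(scores_approx, k)
    return len(shared) / k

ref = [0.1, 0.9, 0.5, 0.3]
approx = [0.9, 0.1, 0.5, 0.3]
print(topk_overlap(ref, approx, k=2))  # 0.5
```

An overlap near 1.0 across prompts suggests the compressed scores select the same keys as FP16, which is the property the Spearman results summarize.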

Optimization Features

Token Efficiency

  • reduces per-token key storage to 1–16 bytes depending on m (one byte per subspace index; values stay FP16)

Infra Optimization

  • suitable for low-bandwidth edge DRAM; reduces DRAM traffic for attention

Model Optimization

  • product quantization of key vectors
  • asymmetric distance computation for scoring

System Optimization

  • shifts the bottleneck from memory bandwidth to a small amount of compute plus storage for the lookup tables
  • requires optimized lookup kernels on target hardware

Training Optimization

  • no retraining required (codebooks are fitted post hoc on calibration data)
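Fitting codebooks after training might look like the sketch below. It assumes plain k-means per subspace, the standard choice for product quantization; the summary does not say which clustering procedure the paper uses. The function names, the iteration count, and the `(n, d)` layout of the calibration keys are all hypothetical.

```python
import numpy as np

def fit_codebooks(keys, m, k=256, iters=10, seed=0):
    """Fit one k-means codebook per subspace on calibration keys.

    keys:    (n, d) key vectors collected from a frozen model, n >= k
    returns: (m, k, d // m) centroids
    """
    n, d = keys.shape
    assert d % m == 0, "head dimension must split evenly into m subspaces"
    d_sub = d // m
    rng = np.random.default_rng(seed)
    subs = keys.reshape(n, m, d_sub)
    codebooks = np.empty((m, k, d_sub), dtype=float)
    for j in range(m):
        x = subs[:, j, :]
        # Initialise centroids from k distinct calibration points.
        centroids = x[rng.choice(n, size=k, replace=False)].astype(float)
        for _ in range(iters):
            dist2 = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
            assign = dist2.argmin(axis=1)
            for c in range(k):
                members = x[assign == c]
                if len(members):  # leave empty clusters where they are
                    centroids[c] = members.mean(axis=0)
        codebooks[j] = centroids
    return codebooks

def encode_keys(keys, codebooks):
    """Replace each key by m uint8 indices of its nearest centroids (k <= 256)."""
    m, k, d_sub = codebooks.shape
    subs = keys.reshape(len(keys), m, d_sub)
    codes = np.empty((len(keys), m), dtype=np.uint8)
    for j in range(m):
        dist2 = ((subs[:, j, None, :] - codebooks[j][None]) ** 2).sum(axis=-1)
        codes[:, j] = dist2.argmin(axis=1)
    return codes
```

Because the model weights are never touched, the only training-like step is this clustering pass, which is why codebook quality depends on how well the calibration keys match runtime data.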

Inference Optimization

  • precompute lookup tables per query (m × 256 dot-products)
  • replace per-key FP16 loads with byte indices + lookups
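A rough cost model for one query on one attention head makes the trade-off concrete. This is a back-of-the-envelope sketch, not a measurement: it assumes 256 centroids per subspace, one-byte indices, and a head dimension of 64 (inferred from the 128-byte FP16 key), and the function names are hypothetical.

```python
def direct_scoring_cost(seq_len, head_dim):
    """FP16 scoring: one full dot-product and one full key load per cached token."""
    return {
        "macs": seq_len * head_dim,
        "key_bytes": seq_len * 2 * head_dim,
    }

def lookup_scoring_cost(seq_len, head_dim, m, k=256):
    """Lookup scoring: build m tables once, then m lookups per cached token."""
    return {
        # m tables x k entries x (head_dim / m) dims each = k * head_dim MACs
        "table_macs": k * head_dim,
        "lookups": seq_len * m,
        "key_bytes": seq_len * m,
        "table_entries": m * k,
    }

print(direct_scoring_cost(1024, 64))
# {'macs': 65536, 'key_bytes': 131072}
print(lookup_scoring_cost(1024, 64, m=4))
# {'table_macs': 16384, 'lookups': 4096, 'key_bytes': 4096, 'table_entries': 1024}
```

The table build is a fixed cost per query, so under these assumptions it costs fewer multiply-accumulates than direct scoring only once the context exceeds 256 tokens, while the key traffic is 32× lower at any length.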

Reproducibility

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Only compresses keys; values remain FP16 so total memory savings are partial.
  • Codebook quality depends on calibration data and domain; results reported on three small sample types.
  • Needs hardware-optimized lookup kernels on NPUs/DSPs to realize latency gains.
  • Experiments use GPT-2 first layer and small sample counts; large-model behavior remains untested.

When Not To Use

  • When you need exact attention magnitudes rather than relative ordering.
  • If your hardware cannot support fast table lookups or optimized byte-index kernels.
  • When you also need value compression, which LOOKAT does not yet provide.

Failure Modes

  • Long contexts degrade fidelity (cosine similarity falls from 0.999 at L=64 to 0.903 at L=1024, roughly a 10% drop).
  • High KL divergence in some samples can alter attention mass distribution.
  • Poor codebook fit (mismatch between calibration and runtime data) reduces accuracy.
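To watch for the KL-divergence failure mode on your own prompts, you can compare the softmax of the FP16 scores with the softmax of the approximate scores, as in the sketch below. The function names are hypothetical, and the direction KL(reference ‖ approximation) is an assumption; the summary does not state which direction the paper reports.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of attention logits."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_kl(scores_ref, scores_approx):
    """KL(p_ref || p_approx) between the two attention distributions, in nats."""
    p = softmax(scores_ref)
    q = softmax(scores_approx)
    # Skip terms with p == 0 and floor q to avoid division by zero on underflow.
    return sum(
        pi * math.log(pi / max(qi, 1e-300))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )
```

A value near zero means the attention mass is distributed as in FP16; large values on particular samples flag prompts where ranking is preserved but the weights themselves have shifted.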

Core Entities

Models

  • GPT-2

Metrics

  • Cosine Similarity
  • KL Divergence
  • Spearman Rank Correlation
  • Accuracy

Datasets

  • natural language prose (samples)
  • Python source code (samples)
  • technical mixed text (samples)